
New Spamassassin filters, content-based filtering costs

You can do better than this, Norm. Think. And do a modicum of research. OTOH, you're providing a good opportunity to explain why content/context-based spam filtering is so powerful and useful.

Among the rationales behind my article is the fact -- not opinion, not possibility, not future likelihood, but FACT -- that broad-based, multiple-factor, weighted analysis can achieve spam reduction rates of:

  • Spam detection (true positive): 98%
  • Ham pass (true negative): 99.8%
  • Spam pass (false negative): 2%
  • Ham filter (false positive): 0.2%

...or better. In typical use. Not highly rigged "benchmarketing" tests, but real user experience.

"Spam" is of course spam mail, "ham" is legitimate, desired mail.
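To make the four figures concrete, here is a minimal sketch of how such rates could be computed from a hand-checked sample of mail. The counts are hypothetical, chosen only so that they reproduce the percentages quoted above; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FilterCounts:
    """Raw message counts from a hand-checked mailbox (hypothetical numbers)."""
    spam_caught: int   # spam correctly filtered (true positive)
    spam_passed: int   # spam that reached the inbox (false negative)
    ham_passed: int    # legitimate mail delivered (true negative)
    ham_caught: int    # legitimate mail wrongly filtered (false positive)


def rates(c: FilterCounts) -> dict:
    """Return each rate as a percentage of its own class (spam or ham)."""
    spam_total = c.spam_caught + c.spam_passed
    ham_total = c.ham_passed + c.ham_caught
    return {
        "spam detection": 100.0 * c.spam_caught / spam_total,
        "spam pass": 100.0 * c.spam_passed / spam_total,
        "ham pass": 100.0 * c.ham_passed / ham_total,
        "ham filter": 100.0 * c.ham_caught / ham_total,
    }


# Hypothetical sample: 1,000 spam and 1,000 ham messages, checked by hand.
sample = FilterCounts(spam_caught=980, spam_passed=20, ham_passed=998, ham_caught=2)
for name, pct in rates(sample).items():
    print(f"{name}: {pct:.1f}%")
```

Note that the spam rates are taken over spam only and the ham rates over ham only, which is why each pair sums to 100%.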

A full list of the rules SpamAssassin uses, and the scores associated with them, is [link|http://www.spamassassin.org/tests.html|here]. Briefly, these include:

  • Header analysis: who the mail is from and to, which servers it originated from or was relayed through, and the subject.
  • Text analysis: weighted keyword and phrase analysis indicating spam or ham likelihood.
  • IP blacklists: weighted assessments of spamminess based on IP points of origin.
  • Known spam checks: use of Vipul's Razor and other lists which track known spam payloads and reject them.
  • Bayesian analysis: a weighted assessment of whether or not a message is spam, based on the user's own identification of spam and ham messages.
  • Whitelisting / blacklisting: adding specific addresses to whitelists or blacklists, either automatically or at the user's request, to ensure that mail from those addresses is passed or filtered.

Because of this, a direct, highly specific, individual assessment of each message can be made, rather than relying on very coarse methods such as rejecting mail from large blocks of IP addresses for no reason specific to an individual email other than that spam has been received from neighboring addresses. In the real-estate business, this practice is known as redlining, and is illegal.
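The weighted, multiple-factor assessment described above can be sketched roughly as follows. This is not SpamAssassin's code: the rule names, weights, addresses, and the threshold value are all hypothetical, picked only to show how several weak signals add up to a per-message verdict and how a whitelist entry can outweigh them.

```python
import re
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
Rule = Tuple[str, float, Callable[[Message], bool]]

# Hypothetical rules: (name, weight, test). Negative weights argue for ham.
RULES: List[Rule] = [
    ("SUBJECT_ALL_CAPS", 1.5, lambda m: m["subject"].isupper()),
    ("BODY_MENTIONS_VIAGRA", 2.5,
     lambda m: re.search(r"viagra", m["body"], re.I) is not None),
    ("RELAY_ON_IP_BLACKLIST", 2.0, lambda m: m["relay_ip"] in {"192.0.2.10"}),
    ("SENDER_WHITELISTED", -6.0, lambda m: m["from"] in {"friend@example.org"}),
]

THRESHOLD = 5.0  # illustrative cut-off: at or above this, call it spam


def score(message: Message) -> Tuple[float, List[str]]:
    """Sum the weights of every rule that fires; also return which ones fired."""
    fired = [(name, weight) for name, weight, test in RULES if test(message)]
    return sum(weight for _, weight in fired), [name for name, _ in fired]


def is_spam(message: Message) -> bool:
    return score(message)[0] >= THRESHOLD


junk = {"from": "x@example.com", "subject": "BUY NOW",
        "body": "Cheap Viagra here", "relay_ip": "192.0.2.10"}
print(score(junk), is_spam(junk))
```

No single rule is enough to condemn the message; only the combination crosses the threshold. The same message from a whitelisted sender would score 0.0 and be delivered.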

It also means that systems such as [link|http://kmself.home.netcom.com/Rants/challenge-response.html|challenge-response], which "solve" the spam problem by shoving it off on others, cannot be justified. I'm very much trying to get Earthlink to reconsider its plans to offer a C-R antispam service, as this will likely cause Earthlink as a whole to be considered a spam source by many remote systems.

Not only is this possible, but the processing overhead is relatively low. In a maximum-efficiency configuration (running SpamAssassin in client-server mode, with no network checks, which take a significant amount of time and don't markedly improve accuracy), it's possible to handle 10-20 messages/second on reasonable x86 hardware. Say, 256 MB, < 1GHz, and standard 7200 RPM ATA disks.

Even in the case of a monster ISP such as AOL, which claims to have blocked more than 2 billion (thousand million, for EU types) spam mails in a single day, or 80% of incoming mail, this load could be handled at low per-user cost. If the bulk of the 2.5 billion daily mails arrives in a 10-hour period, that is about 69,000 messages per second; with spam filtering taking 0.1 second per message, the load could be handled with a spam-filtering cluster of roughly 7,000 commodity x86 boxes. Estimating deployment costs generously, at $1,000 per box, the total cost of the spam filtering would be about $7 million. That sounds like a lot until it is divided among AOL's 35 million customers, giving a one-time price of about $0.20 per customer, amortized over the life of the equipment. Note that deployment costs would likely be less, net throughput probably higher, and other functionality could be assigned to the equipment. This is just a rough estimate; someone with a better take on capabilities and costs could dial this in tighter. At this cost, and given the negative costs associated with falsely filtering out legitimate mail, the proposal should be a strongly positive one for AOL customers.
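For anyone who wants to dial the estimate in tighter, the arithmetic can be re-run with different inputs. This is only a back-of-envelope sketch: the function name and parameter names are made up for illustration, and it assumes one message is filtered at a time per box.

```python
import math


def filtering_cluster_estimate(daily_messages, peak_window_hours,
                               seconds_per_message, cost_per_box, customers):
    """Return (boxes needed, total hardware cost, cost per customer)."""
    # Messages per second if the whole day's mail arrives inside the peak window.
    peak_rate = daily_messages / (peak_window_hours * 3600)
    # Each box is busy for seconds_per_message per message, one at a time.
    boxes = math.ceil(peak_rate * seconds_per_message)
    total_cost = boxes * cost_per_box
    return boxes, total_cost, total_cost / customers


boxes, total, per_customer = filtering_cluster_estimate(
    daily_messages=2_500_000_000,   # 2 billion spam is 80% of incoming mail
    peak_window_hours=10,
    seconds_per_message=0.1,
    cost_per_box=1000,
    customers=35_000_000,
)
print(f"{boxes:,} boxes, ${total:,} total, ${per_customer:.2f} per customer")
```

With the inputs stated in the post this works out to 6,945 boxes, $6,945,000 in hardware, and about $0.20 per customer. Halving the per-message time (the 20 messages/second end of the range) halves all three figures.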

Further, with such an online, realtime assessment, AOL would be better placed to combat spam in real time: by identifying high-volume spam sources and retaliating by blocking or delaying mail from them for the period of the attack; by identifying persistent sources of spam and denying or strongly degrading service to them; or by taking legal or other action against the operators of the systems or networks from which the spam originates.
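One way such realtime identification of high-volume sources might look is a sliding-window count of filtered messages per source address. This is a sketch under assumptions, not a description of anything AOL runs: the class name, window length, and threshold are all hypothetical.

```python
from collections import defaultdict, deque


class SpamSourceTracker:
    """Count spam verdicts per source IP over a sliding time window."""

    def __init__(self, window_seconds=600.0, threshold=100):
        self.window_seconds = window_seconds  # how far back to look
        self.threshold = threshold            # spam count that flags a source
        self._hits = defaultdict(deque)       # source IP -> recent timestamps

    def _expire(self, hits, now):
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()

    def record_spam(self, source_ip, timestamp):
        """Call this each time the filter marks a message from source_ip as spam."""
        hits = self._hits[source_ip]
        hits.append(timestamp)
        self._expire(hits, timestamp)

    def is_high_volume(self, source_ip, now):
        """True while the source has sent at least `threshold` spam in the window."""
        hits = self._hits[source_ip]
        self._expire(hits, now)
        return len(hits) >= self.threshold


tracker = SpamSourceTracker(window_seconds=60.0, threshold=3)
for t in (0.0, 1.0, 2.0):
    tracker.record_spam("192.0.2.1", t)
print(tracker.is_high_volume("192.0.2.1", now=2.0))
```

Because old timestamps expire, a source is only delayed or blocked for the period of the attack; persistent sources would show up as addresses that keep re-crossing the threshold.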

--
Karsten M. Self [link|mailto:kmself@ix.netcom.com|kmself@ix.netcom.com]
[link|http://kmself.home.netcom.com/|http://kmself.home.netcom.com/]
What part of "gestalt" don't you understand?
[link|http://twiki.iwethey.org/twiki/bin/view/Main/|TWikIWETHEY] -- an experiment in collective intelligence. Stupidity. Whatever.

   Keep software free.     Oppose the CBDTPA.     Kill S.2048 dead.
[link|http://www.eff.org/alerts/20020322_eff_cbdtpa_alert.html|http://www.eff.org/alerts/20020322_eff_cbdtpa_alert.html]
Edited by kmself Sept. 13, 2003, 11:59:13 PM EDT
New I tried SAProxy
It seems to do a good job, but it doesn't block 98% of the Spam; it looks more like 50% or 60%. I still get messages with the same product links that SAProxy identified as Spam, coming in as regular messages not identified as Spam. If it kept a database of URL links used by Spammers, I'm sure it would have caught them. The same Viagra, grow your manhood, weight loss, webcams, green card, etc. email that was blocked still comes in if it meets certain criteria, even though it has the same products as the blocked email.

Still it is better than not having a filter. I thank you for explaining it more.



"Lady I only speak two languages, English and Bad English!" - Corbin Dallas "The Fifth Element"

New RTFM, TTFBF, TAFDP, DBADF

Read the fucking manual. Research your problem. Join the mailing list. Find out what you're doing wrong, or if the app works as advertised.

Train the fucking Bayesian features -- they're adaptive, which means you have to give them data to adapt to.

If SAProxy doesn't work, then try [link|http://au.spamassassin.org/where.html|a fucking different product]. There are lots incorporating SpamAssassin and other Bayesian sorters.

Don't be a dumb fuck. Try. Fix. Solve. Get off your ass. Don't whine that it didn't work, you've played that card way too often here, bud.

--
Karsten M. Self [link|mailto:kmself@ix.netcom.com|kmself@ix.netcom.com]
[link|http://kmself.home.netcom.com/|http://kmself.home.netcom.com/]
What part of "gestalt" don't you understand?
[link|http://twiki.iwethey.org/twiki/bin/view/Main/|TWikIWETHEY] -- an experiment in collective intelligence. Stupidity. Whatever.

   Keep software free.     Oppose the CBDTPA.     Kill S.2048 dead.
[link|http://www.eff.org/alerts/20020322_eff_cbdtpa_alert.html|http://www.eff.org/alerts/20020322_eff_cbdtpa_alert.html]
New IRTFM
I read the manual before installing, I used options to better filter the mail. These include:

#1 Blocking Non-English messages (I get a lot of Russian, Chinese, Korean, etc. Spam)

#2 Using Non-Local Network tests. Takes longer but worth it.

#3 Automatically learn from past Spam. (Maybe I should wait until it gets better and learns from its mistakes?)

#4 I am looking at setting up some rules. The Bayesian feature learning requires thousands of examples of Spam and Ham in order to be effective. I usually just trash the Spam. Also I am confused: do these examples need to be on my mail server or in my email client? Or do I just export them to a text file that sa-learn can read? The manual doesn't seem to state which. It says only that the mail has to come from the mail server, not where it has to reside.



"Lady I only speak two languages, English and Bad English!" - Corbin Dallas "The Fifth Element"

New Spam, spam, spam, spam
Norm wrote:


The Bayesian feature learning requires thousands of examples of Spam and Ham in order to be effective.

Want some? My ~/inboxes/junkmail has 9303 such messages. Er, 9317. Er, 9330. I'd be glad to gzip it and send it to you. Just tell me where.

Rick Moen
rick@linuxmafia.com


If you lived here, you'd be $HOME already.
New Re: Spam, spam, spam, spam
Well I would, except the user's manual specifically states that the Spam and Ham should come from my mail server, or else it might not work. It has to be Spam that the SAProxy did not catch.

Thanks for the offer.

P.S. Yes I did join the mailing list for SA. I am just posting my results and experience with the software. I can't get those numbers out of just a simple install; apparently SA has to be tweaked and fed examples before those numbers can be reached.



"Lady I only speak two languages, English and Bad English!" - Corbin Dallas "The Fifth Element"

New Bayesian still useful w/out 1000s to start with
From my experiences posted here:

[link|/forums/render/content/show?contentid=74588|Post #74588] - 97% effective after 1 week

[link|/forums/render/content/show?contentid=75996|Post #75996] - 100% effective after 2 weeks

an occasional SPAM will still slip thru so it's really 99.999...9%


Darrell Spice, Jr.                      [link|http://www.spiceware.org/cgi-bin/spa.pl?album=./Artistic%20Overpass|Artistic Overpass]
[link|http://www.spiceware.org/|SpiceWare] - We don't do Windows, it's too much of a chore
New Thanks
I'll read through and see what rules/words I need to add to make the filter more effective.

I suppose that "Viagra", "Grow your Manhood", "Greencard", "WebCam", "Herbal", and "Weight Loss" will block a majority of them.

Thanks again.



"Lady I only speak two languages, English and Bad English!" - Corbin Dallas "The Fifth Element"

New don't do that
The whole point of Bayesian filters is that they learn on their own.

The only thing you have to do is correct it when it's wrong.
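A toy illustration of that learn-by-correction loop is below. It is a plain naive Bayes token classifier with add-one smoothing, which is simpler than what SpamAssassin or SpamBayes actually implement; the class name and training messages are made up.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes spam classifier; learns only from labelled examples."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}  # messages containing token
        self.messages = {"spam": 0, "ham": 0}                # messages seen per label

    @staticmethod
    def tokenize(text):
        # Each distinct word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, label):
        """Feed one message the user has identified as 'spam' or 'ham'."""
        self.messages[label] += 1
        self.tokens[label].update(self.tokenize(text))

    def spam_probability(self, text):
        # Start from the smoothed prior odds, then multiply in per-token odds.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokenize(text):
            p_spam = (self.tokens["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


bayes = TinyBayes()
bayes.learn("cheap viagra now", "spam")
bayes.learn("viagra weight loss", "spam")
bayes.learn("meeting notes for monday", "ham")
bayes.learn("lunch on monday?", "ham")

before = bayes.spam_probability("free webcams")   # unseen words: undecided
bayes.learn("free webcams", "spam")               # the user corrects the miss
after = bayes.spam_probability("free webcams")
print(before, after)
```

No keyword list is written by hand anywhere: the word weights come entirely from the messages the user has sorted, and each correction shifts them.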
Darrell Spice, Jr.                      [link|http://www.spiceware.org/cgi-bin/spa.pl?album=./Artistic%20Overpass|Artistic Overpass]
[link|http://www.spiceware.org/|SpiceWare] - We don't do Windows, it's too much of a chore
New By using sa-learn?
Do I save the email as text, or can sa-learn read my email client?



"Lady I only speak two languages, English and Bad English!" - Corbin Dallas "The Fifth Element"

New Using MBX formats
The mail has to be in a MBX format, the manual didn't mention this. If it did, I wouldn't be so confused. Thanks for the suggestions and help.
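For anyone else stuck at the same point, here is a rough sketch of collecting saved messages into a single mailbox file. It makes several assumptions: that "MBX" here means the ordinary Unix mbox layout (the kind sa-learn's --mbox option reads), that the messages have already been saved as raw RFC 822 text files, and that they carry a .eml extension. The function name and paths are hypothetical.

```python
import mailbox
from pathlib import Path


def eml_files_to_mbox(eml_dir, mbox_path):
    """Append every *.eml file in eml_dir to an mbox file; return how many were added."""
    box = mailbox.mbox(str(mbox_path))  # the file is created if it does not exist
    added = 0
    try:
        for eml in sorted(Path(eml_dir).glob("*.eml")):
            box.add(eml.read_bytes())   # raw message bytes, headers included
            added += 1
        box.flush()                     # write pending messages to disk
    finally:
        box.close()
    return added
```

Called as, say, eml_files_to_mbox("saved-spam", "spam.mbox"), this produces one file that can then be handed to the learner. Full headers need to survive the export, since header analysis is part of what gets learned.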



"Lady I only speak two languages, English and Bad English!" - Corbin Dallas "The Fifth Element"

New Care and feeding

I don't know how your system works. It's your system. Read the manual, ask the right questions, or get something that works.

In my case, I toss the odd spam that rolls through into a folder "spam-learn". Four times a day, I've got a cronjob that rolls through that folder with "sa-learn --spam --dir", updating my Bayesian rules.

If I start getting any significant number of false positives, I'll create an analogous "ham-learn" folder and cronjob.
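A small wrapper for that kind of scheduled training run might look like the sketch below. It only assembles and runs the "sa-learn --spam --dir" invocation quoted above (and the matching --ham form); the function names and folder paths are placeholders, and it assumes sa-learn is on the PATH of whatever account cron runs it under.

```python
import subprocess


def sa_learn_command(kind, folder):
    """Build the sa-learn command line for a folder of 'spam' or 'ham' messages."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", "--dir", str(folder)]


def train(kind, folder):
    """Run sa-learn on the folder and return its exit status."""
    return subprocess.run(sa_learn_command(kind, folder), check=False).returncode


# What a scheduled job would call (placeholder paths):
#   train("spam", "/home/me/Mail/spam-learn")
#   train("ham", "/home/me/Mail/ham-learn")
print(sa_learn_command("spam", "/home/me/Mail/spam-learn"))
```

Scheduling is left to cron; running it four times a day matches the routine described here.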

The result is: I have good filtering which gets better as the odd bit of spam spills through and is added to my rulebase. Yes, I seeded this with an initial thousand or so messages of spam and ham. But if you don't have this sort of base, you can kick things off without it.

The point is this: you, yes, you Norm, not the rest of us, are going to have to figure out where the fuck you feed your false positives and false negatives within your system. Most likely there's a mailbox or address you send this mail to. If your system doesn't provide for this, make sure you aren't fucking up and misunderstanding its functionality, then find something that works. And yes, this means you don't just delete spam -- you feed it to your system for training.

Of course, I'm making painfully clear that self-hosted SA on a box you control with real services, rather than that Microsoft crap you insist on using, is a system that works. If you can handle it.

Of course, if you can't, that's your problem.

--
Karsten M. Self [link|mailto:kmself@ix.netcom.com|kmself@ix.netcom.com]
[link|http://kmself.home.netcom.com/|http://kmself.home.netcom.com/]
What part of "gestalt" don't you understand?
[link|http://twiki.iwethey.org/twiki/bin/view/Main/|TWikIWETHEY] -- an experiment in collective intelligence. Stupidity. Whatever.

   Keep software free.     Oppose the CBDTPA.     Kill S.2048 dead.
[link|http://www.eff.org/alerts/20020322_eff_cbdtpa_alert.html|http://www.eff.org/alerts/20020322_eff_cbdtpa_alert.html]
New I found out how to do it
Yes I am using Outlook 2000, haven't switched to something else yet. I have all my contacts and calendar there and it integrates with my Timex Datalink watch. I have Norton AntiVirus 2003 scanning the email, and I'm soon to upgrade to NAV 2004 and Norton Personal Firewall 2004 when they come out Oct 1st.

From the SA mailing list, someone was kind enough to send me this URL:
[link|http://spambayes.sourceforge.net//windows.html|http://spambayes.sourceforge.net//windows.html]

It will help feed Spam and Ham from Outlook 2000.

I have security turned on in Outlook 2000 to prevent ActiveX controls, VBScript, and other things, plus NAV 2003 has a script blocker. I never open attachments unless I am expecting one and know in advance and scanned it first before opening it.

I currently am saving the Spam that slips through. It is starting to get fewer and fewer now.



"Lady I only speak two languages, English and Bad English!" - Corbin Dallas "The Fifth Element"

New Yeah, right
Norm wrote:

Yes I am using Outlook 2000....

Welcome to the Security Forum. Guess what your first step is?

Rick Moen
rick@linuxmafia.com


If you lived here, you'd be $HOME already.
New ICLRPD (new thread)
Created as new thread #118039 titled [link|/forums/render/content/show?contentid=118039|ICLRPD]
===

Implicitly condoning stupidity since 2001.
New {chortlissimo!}___Sharper than a serpent's tooth.. be thine
Golden Epitaph with DU shielding.. for verily I say unto thee - thine alabaster cities gleam for the countless entrails glistening in the sun.. of those who failed to LookOut.

Two Ears, a Tail and the plucked-out pineal glands of 12 entrepreneurs!
New First step is to get rid of Outlook 2000 and use something
else. A little harder to break the Outlook habit than I thought it would be.

I almost switched to Mozilla Mail, but I need something that can synch with my Timex Datalink watch for contacts and calendar info. The massive PST conversion to Mozilla Mail almost broke the converter. I think I have like 26Megs of old mail. So I keep putting it off. Things keep on happening that take up my time from doing some things. But isn't that just the way life is?



"Lady I only speak two languages, English and Bad English!" - Corbin Dallas "The Fifth Element"

New Datalink watch and Linux
[link|http://datalink.fries.net/|http://datalink.fries.net/]

Didn't read much, but there is stuff out there integrating the Datalink watch with Linux.

What's your next excuse?
-----
Steve
New Re: Datalink watch and Linux
What's your next excuse?
Linux, presumably.


Peter
[link|http://www.debian.org|Shill For Hire]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Blog]
New Not Linux
I returned the 160Gig drive because of partition problems with it. I am waiting to buy new parts so I can turn my old system into a Linux system and the new one into Windows XP Pro. I have had nothing but problems trying to get a workable system to run Linux, old systems that seem to have hardware issues or something that stops Linux from loading/installing. I'm back to my older smaller hard drive that only can support Windows 98SE.

I had such major problems that I am about ready to throw all my systems into a recycle center and give up on computers. I am not going to do that even if I feel that way sometimes. I've spent a lot of time trying to get Linux workable, I've spent a lot of time testing and working with the hardware issues. I've given up until I can afford new hardware unless one of you geniuses knows how to make Linux install on bad hardware or hardware that isn't Linux friendly?



"Lady I only speak two languages, English and Bad English!" - Corbin Dallas "The Fifth Element"

New Want some cheese with that?


Peter
[link|http://www.debian.org|Shill For Hire]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Blog]
New Help with Linux problems concerning particular hardware
Norm wrote:

I've given up until I can afford new hardware unless one of you geniuses knows how to make Linux install on bad hardware or hardware that isn't Linux friendly?

1. The question is unfortunately not answerable as posed, for lack of specificity. Help is available (in the Linux Forum, preferably, not the Security Forum), but you have to [link|http://catb.org/~esr/faqs/smart-questions.html|specify] what components you're talking about, and the symptoms you encountered when you tried particular things, detailed in an orderly fashion.

2. If the term "isn't Linux friendly" means MS-Windows-dependent, e.g., modems, printers, USB ADSL bridges, and IDE RAID cards with deliberately omitted circuitry that must then be emulated via proprietary "engine" software hidden inside Windows-only drivers, the answer is that you can't do it, and should cease acquiring such equipment. On the other hand, if it means otherwise acceptable hardware for which Linux kernel, XFree86, or SANE back-end support seems rare, the answer is to either get a very recent Linux distribution release, or be prepared to do driver upgrades after initial installation. On the gripping hand, if it means hardware that requires that you furnish Linux with detailed information about it at installation time (e.g., the very oldest and cheapest ISA ethernet and SCSI cards that lacked option ROMs), then you need to simply go to that additional effort as part of the price you pay for using antique, underdesigned hardware.

3. Don't try to install Linux or any other OS on bad (defective) hardware. Cut it up into small pieces and throw it away.

Rick Moen
rick@linuxmafia.com


If you lived here, you'd be $HOME already.
     Working towards a better Spam filter - (orion) - (23)
         Re: Working towards a better Spam filter - (pwhysall)
         Spamassassin filters, content-based filtering costs - (kmself) - (21)
             I tried SAProxy - (orion) - (20)
                 RTFM, TTFBF, TAFDP, DBADF - (kmself) - (19)
                     IRTFM - (orion) - (18)
                         Spam, spam, spam, spam - (rickmoen) - (1)
                             Re: Spam, spam, spam, spam - (orion)
                         Bayesian still useful w/out 1000s to start with - (SpiceWare) - (4)
                             Thanks - (orion) - (3)
                                 don't do that - (SpiceWare) - (2)
                                     By using sa-learn? - (orion) - (1)
                                         Using MBX formats - (orion)
                         Care and feeding - (kmself) - (10)
                             I found out how to do it - (orion) - (9)
                                 Yeah, right - (rickmoen) - (3)
                                     ICLRPD (new thread) - (drewk)
                                     {chortlissimo!}___Sharper than a serpent's tooth.. be thine - (Ashton)
                                     First step is to get rid of Outlook 2000 and use something - (orion)
                                 Datalink watch and Linux - (Steve Lowe) - (4)
                                     Re: Datalink watch and Linux - (pwhysall) - (3)
                                         Not Linux - (orion) - (2)
                                             Want some cheese with that? -NT - (pwhysall)
                                             Help with Linux problems concerning particular hardware - (rickmoen)
