[Python-Dev] Getting started with GBayes testing

Tim Peters spambayes@python.org
Thu, 05 Sep 2002 13:57:17 -0400

[Followups directed to spambayes@python.org http://mail.python.org/mailman-21/listinfo/spambayes ]

[Brad Clements]

> ... My feeling is that the presentation of "the message" is independent of the message itself, so if I get a message in Text, HTML, or RTF, only the actual content is important, not the markup method.

Everything's A Clue. Everything that gets ignored partly blinds the classifier, so the question isn't whether there's a difference, it's how much of a difference it makes.

> Though I suppose using lots of red and large fonts might be an indicator of spam, the text of the message should still suffice.

Indeed, Graham reported that the hex color code for bright red was one of the strongest spam indicators in his database.
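To make the "everything's a clue" point concrete, here is a minimal sketch of a tokenizer that leaves markup in the token stream, so a hex color code survives as a token of its own. The name `tokenize_raw` and the sample string are hypothetical, and the character set only loosely follows Graham's published description; this is not the tokenizer in timtest.py.

```python
import re

# Assumption: token characters loosely follow Graham's description
# (alphanumerics, dashes, apostrophes, dollar signs); anything else separates.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize_raw(text):
    """Split raw message text into lowercase tokens, markup included."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


# Hypothetical spam-like fragment: the color code becomes an ordinary token.
sample = '<font color="#FF0000" size="+3">Act now!</font>'
tokens = tokenize_raw(sample)
print(tokens)
```

Because the tags are never stripped, the classifier gets to decide how much weight a token such as the color code deserves.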

> Tim's comments in timtest.py hint that stripping tags isn't a catastrophe for f-n's, but he's not planning on doing that for use on technical lists.

When HTML-only email is a 99.99% spam indicator on a tech list, it would be crazy to ignore that clue. But note that the comments also say I'd be delighted to remove HTML tags even there if some other way of slashing the f-n rate is proven to work (and most people who have tried it say that mining more header lines does do it -- but then I haven't seen anything from them about how they do when they ignore the header lines). I was happy to ignore header lines in order to get some kind of handle on how well one could do on "pure content", and it turned out that works remarkably well.

> # So if a message is multipart/alternative with both text/plain
> # and text/html branches, we ignore the latter, else newbies would never
> # get a message through.  If a message is just HTML, it has virtually no
> # chance of getting through

> Tells me (spammer hat on) that I can send a message with a non-spammish text-only part and a spam HTML part, since most "non-techie" email client users automatically display the HTML version when available, whereas Tim's implementation will ignore it.

Sure. It certainly isn't a problem on my test data (as witnessed by the measured error rates). If the nature of the world changes, the code has to adapt along with it. But 90% of the spam I receive (and I get a lot) is still trivial to recognize from a mere glance at the subject line, and I don't buy that spammers are a class of ubergeek with formidable skill. Response rates are a percentage game, and more so than anti-spammers I expect spammers are keen to go for high-percentage wins at the expense of esoterica.
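The policy in the quoted comment, and the loophole Brad describes, can be sketched with the standard `email` package. This is an illustration under stated assumptions rather than the code in timtest.py: `parts_to_score` and the sample message are hypothetical names and data.

```python
import email


def parts_to_score(msg):
    """Yield (content_type, text) for each text part the classifier would see."""
    if msg.is_multipart():
        children = msg.get_payload()
        # Policy from the quoted comment: inside multipart/alternative, a
        # text/plain branch makes us ignore its text/html sibling.
        skip_html = (
            msg.get_content_type() == "multipart/alternative"
            and any(c.get_content_type() == "text/plain" for c in children)
        )
        for child in children:
            if skip_html and child.get_content_type() == "text/html":
                continue
            yield from parts_to_score(child)
    elif msg.get_content_maintype() == "text":
        raw = msg.get_payload(decode=True) or b""
        charset = msg.get_content_charset() or "latin-1"
        try:
            text = raw.decode(charset, errors="replace")
        except LookupError:  # bogus charset label, fall back to a lossless decode
            text = raw.decode("latin-1")
        yield msg.get_content_type(), text


# Hypothetical message of the kind Brad describes: bland text, spammy HTML.
RAW = """\
From: someone@example.com
Subject: hello
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XX"

--XX
Content-Type: text/plain; charset="us-ascii"

Meeting notes attached.
--XX
Content-Type: text/html; charset="us-ascii"

<html><body><font color="#ff0000">BUY NOW</font></body></html>
--XX--
"""

seen = list(parts_to_score(email.message_from_string(RAW)))
print(seen)
```

Under this policy only the text/plain branch of the sample reaches the classifier, which is exactly the gap Brad points out. An HTML-only message still comes through with its tags intact, so that clue is kept.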

> Most "average users" never even see the text-only part of multipart messages. In Tim's application, that's okay since he's going to use the text-only part anyway. But for my purposes, I need to consider both portions. So it's simpler for me to strip html and combine that text with the text-only part and then "test" the combined parts.

Not unreasonable, but testing remains the only way to decide. It's rare you can out-think a fraction of a percent!
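For comparison, Brad's alternative (strip the HTML, then score it together with the text-only part) might look roughly like the sketch below. The helpers `strip_html` and `combined_text` are hypothetical, and the standard library's `html.parser` stands in for whatever tag stripper would really be used.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect character data, skipping the contents of script/style elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_html(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def combined_text(plain, html):
    """Join the text-only part and the de-tagged HTML part for scoring."""
    return (plain + "\n" + strip_html(html)).strip()


# Hypothetical two-part message: bland text part, spammy HTML part.
result = combined_text(
    "Meeting notes attached.",
    '<html><body><font color="#ff0000">BUY NOW</font></body></html>',
)
print(result)
```

The spammy wording from the HTML branch now reaches the classifier, but the color code and every other tag-level clue are gone. Which of the two effects matters more is the kind of question only a test run on real data can settle.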
