Guide to Fighting Spam: The Latest Tools
By DANIEL CALLOWAY,
TheWorldJournal.com
By all accounts, spam is on the rise. Brightmail, an anti-spam vendor,
estimates that 42 percent of all Internet e-mail is now spam. That's an 8
percent rise since 2001.
The rise in spam, however, has been met with an almost equal rise in
anti-spam vendors who have come up with many solutions to rid your
mailboxes of unwanted e-mail. At least 25 such vendors are currently
offering products they promise will get rid of most, but not all, of your
spam.
One such product is SpamAssassin Pro, which Network Associates has
acquired. I have used SpamAssassin Pro with Outlook XP and it works very
well. Network Associates is rolling out SpamKiller, an enterprise-level
software package based on SpamAssassin's heuristics engine. This product
was first released as an individual consumer product but has now been
reengineered for the corporate workplace.
E-mail has become an all-important tool of communication today not only
for the home user, but more so for the business world. As a result,
corporations can no longer tolerate the proliferation of unwanted,
unsolicited mail from vendors, porn sites, and others who will stop at
next to nothing to make certain you see their e-mail.
One of the biggest problems faced by corporations and consumers today,
however, is how to block unsolicited e-mail without blocking the mail you
wish to see. Multiple techniques have been developed, because spam is
generated by human beings who want their e-mail to get through to you:
they either want you to buy their products or they want you to visit
their websites. Like virus and Internet worm writers, spammers adapt
their messages to beat the system.
Some common methods of detecting spam today are:
Keyword Searches
Keyword searches are static filters that scan subject lines and message
body text looking for words that you have identified. If the software or
anti-spam plug-in detects these words or combinations of these words as
you have specified, the e-mail is identified as spam and it is either
blocked, deleted, or moved to a junk mail folder or other folder you
specify.
While keyword searches give the user very granular control over incoming
e-mail, there is a high risk of "false positives," which prevent the user
from seeing mail that they want to see. For instance, if you specify the
word "breast" in your keyword list, an e-mail containing information
about breast cancer will be blocked, even though it is a legitimate
e-mail in which you may have a real interest.
In addition, keyword searches are easily defeated by spammers, who may
intentionally misspell certain words so that they are not detected by
your software. For instance, spammers might misspell the word "porn" as
"p0rn," using a zero instead of the letter "o," in order to get around
the software's detection process. Spammers can also embed HTML (Hypertext
Markup Language) tags that are invisible to the reader but visible to
the software, thus fooling the filter.
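
The behavior described above can be sketched in a few lines of Python.
This is only an illustration, not how any particular product works; the
function name and the keyword list are assumptions made up for the
example. It shows both weaknesses: a legitimate message is caught, and a
misspelled word slips through.

```python
import re

def keyword_hits(subject, body, keywords):
    """Return the blocked keywords found in the subject or body (case-insensitive)."""
    text = f"{subject}\n{body}".lower()
    # Split the text into whole words so that only exact word matches count.
    words = set(re.findall(r"[a-z0-9']+", text))
    return sorted(k for k in keywords if k.lower() in words)

keywords = ["breast", "porn"]  # hypothetical user-defined keyword list

# A legitimate health message trips the filter (a false positive).
print(keyword_hits("Health newsletter", "New breast cancer screening guidelines", keywords))

# A deliberately misspelled word is not on the list, so nothing is flagged.
print(keyword_hits("Hot p0rn", "Click here", keywords))
```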
Black Lists
A black list is a list of unwanted e-mail addresses or headers. It is
created by the software vendor and then expanded upon by the user of the
anti-spam software, which blocks all e-mail that matches an entry.
Many black lists are available on the Internet. One such provider is
Mail Abuse Prevention Systems (MAPS), located at www.mailabuse.com. This
particular site has a Realtime Blackhole List (RBL) that is a database of
URLs or IP addresses of mail servers known to be friendly, or at least
neutral, to spammers.
Another well-known black list is SpamCop, located at www.spamcop.net.
Yet another is Open Relay Database, located at www.ordb.org. Many
anti-spam products maintain their own black lists and include optional
subscriptions to third-party black list services. One major drawback to
black lists, however, is that if you block an entire domain, you may be
blocking as much as 90 percent of wanted mail while blocking only 10
percent of spam.
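
As a rough sketch of the idea, a black list check can be as simple as
the Python below. The function name, the addresses, and the domains are
all hypothetical. Note how blocking a whole domain also blocks a
legitimate sender who happens to use it.

```python
def is_blacklisted(sender, blocked_addresses, blocked_domains):
    """Return True if the sender's address or domain is on a black list."""
    sender = sender.strip().lower()
    # Everything after the last "@" is treated as the sender's domain.
    domain = sender.rpartition("@")[2]
    return sender in blocked_addresses or domain in blocked_domains

blocked_addresses = {"offers@bulkmail.example"}   # hypothetical entries
blocked_domains = {"freemail.example"}            # an entire domain is blocked

print(is_blacklisted("offers@bulkmail.example", blocked_addresses, blocked_domains))
# A legitimate friend on the blocked domain is rejected along with the spammers.
print(is_blacklisted("friend@freemail.example", blocked_addresses, blocked_domains))
print(is_blacklisted("colleague@work.example", blocked_addresses, blocked_domains))
```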
White Lists
A white list is a collection of trusted e-mail addresses and domains.
White lists will definitely allow mail coming from a trusted site to come
through, but do nothing to block spam. Therefore, this list must be used
in conjunction with black lists to achieve the proper balance so that you
minimize "false positives" while receiving all the wanted e-mail you wish.
The use of white lists is beneficial because it increases the speed with
which your e-mail comes to you, since any e-mail from a sender on the
list bypasses all the other filters that typically look for spam.
On the downside, however, white lists require constant maintenance to be
very effective. If not properly maintained, you run a high risk of losing
e-mail from legitimate sources.
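
The bypass behavior can be sketched as follows. This is an assumption
about how a white list might be wired in front of other filters, not a
description of any specific product; the message layout, the filter, and
the addresses are invented for the example.

```python
def classify(message, whitelist, spam_filters):
    """Deliver mail from trusted senders at once; run the filters on everything else."""
    sender = message["from"].strip().lower()
    domain = sender.rpartition("@")[2]
    if sender in whitelist or domain in whitelist:
        return "deliver"              # trusted sender: all other filters are skipped
    for looks_like_spam in spam_filters:
        if looks_like_spam(message):
            return "block"
    return "deliver"

whitelist = {"boss@work.example", "family.example"}            # hypothetical entries
spam_filters = [lambda m: "free money" in m["body"].lower()]   # hypothetical filter

# The same suspicious text is delivered from a trusted sender and blocked otherwise.
print(classify({"from": "boss@work.example", "body": "Free money for the team lunch"},
               whitelist, spam_filters))
print(classify({"from": "stranger@bulk.example", "body": "Free money, click now"},
               whitelist, spam_filters))
```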
Hashes/Signatures
This is a very popular anti-spam technique. A computer program derives the
checksum or cryptographic hash of a known spam message, in effect creating
a signature or fingerprint of that message. Because spammers send tens of
thousands of messages that are identical, the message can be easily
identified from the fingerprint and blocked effectively.
The strength of this type of anti-spam technique is its certainty: if a
future message matches the fingerprint of a known spam message, it will
be blocked.
However, spammers have discovered workarounds to fingerprints. They
insert random strings of letters or HTML code into the subject line or
body text of the message. These are invisible to the recipient of the
mail but are read by the software when it calculates the checksum
mentioned above. Placing these random characters in messages lets what is
essentially an identical e-mail circumvent the checksum algorithm,
allowing an e-mail that should be blocked to get through.
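
Here is a minimal Python sketch of fingerprinting, assuming SHA-256 as
the hash; real products may use other checksums, and the sample messages
are made up. It shows why an exact copy is always caught and why a few
random characters are enough to escape.

```python
import hashlib

def fingerprint(message_text):
    """Return a SHA-256 hex digest that serves as the message's signature."""
    return hashlib.sha256(message_text.encode("utf-8")).hexdigest()

# Hypothetical database holding the fingerprints of known spam messages.
known_spam = {fingerprint("Buy cheap watches now!")}

def is_known_spam(message_text):
    """Return True only if the message is byte-for-byte identical to known spam."""
    return fingerprint(message_text) in known_spam

print(is_known_spam("Buy cheap watches now!"))          # an exact copy matches
print(is_known_spam("Buy cheap watches now! xqzv81"))   # random padding changes the hash
```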
Heuristics
Heuristic analysis is another method which involves running an e-mail
message through a variety of tests. These tests include searching for
characteristics that are typically inherent in spam. Each characteristic
is assigned a spam probability, and the message is given a cumulative
probability score based on the overall test results. If a certain
probability threshold is reached, the e-mail is determined to be spam and
is blocked. If not, the e-mail goes through.
By weighing a variety of characteristics, heuristic analysis increases the
confidence that a message with a high spam score is actually junk mail.
However, heuristics can produce "false positives." This has caused
anti-spam vendors to look at developing new tests to reduce the number of
these "false positives," which are usually a result of spammers trying to
work around the tests themselves. Another downside to heuristic
techniques is that running every test on each e-mail can be
time-consuming.
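
A toy version of heuristic scoring might look like the Python below. The
three tests, their point values, and the threshold are invented for
illustration, and whole-number points stand in for the probabilities
described above; commercial engines use many more tests with carefully
tuned scores.

```python
import re

# Hypothetical tests: (description, check function, points added when it fires).
TESTS = [
    ("subject is all capitals", lambda m: m["subject"].isupper(), 3),
    ("three or more exclamation marks", lambda m: m["body"].count("!") >= 3, 2),
    ("mentions free money", lambda m: re.search(r"free money", m["body"], re.I) is not None, 4),
]
THRESHOLD = 5  # assumed cutoff: messages scoring this high are treated as spam

def spam_score(message):
    """Add up the points of every test the message triggers."""
    return sum(points for _, check, points in TESTS if check(message))

def is_spam(message):
    return spam_score(message) >= THRESHOLD

junk = {"subject": "YOU WON", "body": "Free money!!! Act now"}
real = {"subject": "Lunch tomorrow?", "body": "Are you free at noon?"}
print(spam_score(junk), is_spam(junk))
print(spam_score(real), is_spam(real))
```

Because no single test decides the outcome, one accidental match (an
excited friend with three exclamation marks, say) stays below the
threshold.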
Reverse DNS Lookups
This particular method runs DNS queries on the IP addresses of the
servers sending incoming e-mail to determine whether the host names they
claim match the host names actually registered for those IP addresses.
Because many spammers use misconfigured hosts to disguise the source of
the spam, a query that doesn't return a matching host and IP address is
a good indication that the message is junk.
On the downside, however, many legitimate e-mail servers are incorrectly
configured, or have intentionally not registered a name with DNS, so a
reverse query that doesn't return a matching host name isn't
incontrovertible proof that a message is spam. In addition, running DNS
queries on a large number of e-mails, such as you would find in large
corporations, is very taxing on the network resources of that
corporation.
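
One way to sketch this check in Python is shown below, using the
standard socket module. The function names are my own, and this version
handles only IPv4 addresses. The lookups are passed in as parameters so
that the logic can be tried out without touching the network.

```python
import socket

def system_reverse(ip):
    """Ask DNS for the host name registered for an IP address, or None."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:   # covers socket.herror and socket.gaierror
        return None

def system_forward(hostname):
    """Ask DNS for the IPv4 addresses registered for a host name."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []

def reverse_dns_matches(ip, reverse=system_reverse, forward=system_forward):
    """True if the IP maps to a host name that maps back to the same IP."""
    hostname = reverse(ip)
    if hostname is None:
        return False   # no name registered: suspicious, but not proof of spam
    return ip in forward(hostname)

# Offline demonstration with made-up DNS data instead of live queries.
fake_ptr = {"192.0.2.10": "mail.example.org"}
fake_a = {"mail.example.org": ["192.0.2.10"]}
print(reverse_dns_matches("192.0.2.10", fake_ptr.get, lambda h: fake_a.get(h, [])))
print(reverse_dns_matches("192.0.2.99", fake_ptr.get, lambda h: fake_a.get(h, [])))
```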
Header Analysis
This technique scans for e-mail headers that deviate from the
specifications outlined in the RFCs. Many spammers spoof headers to make
it harder for investigators to track down the source of the spam, which
makes malformed or spoofed headers a strong indicator of unwanted mail.
Header analysis has an advantage over other techniques in that header
information is much shorter than the full body text, so there is less to
scan.
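
As a small illustration, the Python sketch below uses the standard email
package to look for two basic problems. RFC 2822 requires only the From
and Date header fields, so a message missing either one is malformed.
The function name and sample messages are made up, and a real analyzer
would check far more than this.

```python
from email import message_from_string
from email.utils import parseaddr

REQUIRED_HEADERS = ("From", "Date")  # the only fields RFC 2822 makes mandatory

def header_problems(raw_message):
    """Return a list of header defects found in a raw e-mail message."""
    msg = message_from_string(raw_message)
    problems = [f"missing {name} header" for name in REQUIRED_HEADERS if msg[name] is None]
    # A From header that contains no address at all is another warning sign.
    if msg["From"] is not None and "@" not in parseaddr(msg["From"])[1]:
        problems.append("From header has no valid address")
    return problems

good = "From: Alice <alice@example.com>\nDate: Mon, 6 Jan 2003 10:00:00 -0500\nSubject: Hi\n\nHello"
bad = "Subject: Great deal\n\nBuy now"
print(header_problems(good))
print(header_problems(bad))
```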
Bayesian Filtering
This is a form of text classification that can be applied to spam
detection. This type of filtering learns the more you use it. By examining
the language used in a set of spam messages and the language used in
normal messages, this process filters out spam by making the comparison.
As new messages arrive, the filter rounds up the words or phrases that
have the highest probabilities in either direction--spam or not spam. Then
the filter calculates a new probability that the message is spam or not
spam using the individual scores of the collected words.
The creator of a Bayesian filter called CRM114, Bill Yerazunis, claims
that over 99.9 percent of all e-mail is accurately detected as either spam
or not spam. The website for more information on this filter is
http://crm114.sourceforge.net.
The major drawback of the Bayesian filter is that it is more
computationally intensive than other methods for detecting spam.
Bayesian filtering is currently receiving a lot of attention in the
open-source community due to its high accuracy rates and extremely low
"false positive" rates. CRM114 and ifile (www.nongnu.org/ifile) were the
first tools to apply Bayesian filtering to spam detection. Spambayes
(http://spambayes.sourceforge.net) is developing new techniques to
improve Bayesian filtering. All three are open-source.
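
To make the idea concrete, here is a simplified Python sketch of a
Bayesian filter. It is not the algorithm used by CRM114, ifile, or
Spambayes; the class name, the smoothing, the equal weighting of spam
and normal mail, and the tiny training set are all assumptions chosen to
keep the example short.

```python
import re
from collections import Counter

def tokens(text):
    """Split a message into a set of lowercase words."""
    return set(re.findall(r"[a-z0-9$']+", text.lower()))

class BayesFilter:
    def __init__(self):
        self.spam_counts, self.ham_counts = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, text, is_spam):
        """Record which words appear in a message the user has labeled."""
        if is_spam:
            self.spam_counts.update(tokens(text))
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens(text))
            self.n_ham += 1

    def word_prob(self, word):
        """Estimate P(spam | word); add-one smoothing keeps it strictly between 0 and 1."""
        in_spam = (self.spam_counts[word] + 1) / (self.n_spam + 2)
        in_ham = (self.ham_counts[word] + 1) / (self.n_ham + 2)
        return in_spam / (in_spam + in_ham)

    def spam_probability(self, text, top_n=15):
        """Combine the scores of the words that lean most strongly either way."""
        probs = sorted((self.word_prob(w) for w in tokens(text)),
                       key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
        spam_like = ham_like = 1.0
        for p in probs:
            spam_like *= p
            ham_like *= 1 - p
        return spam_like / (spam_like + ham_like)

f = BayesFilter()
f.train("cheap pills buy now", is_spam=True)
f.train("buy cheap watches now", is_spam=True)
f.train("meeting agenda for monday", is_spam=False)
f.train("lunch on monday with the team", is_spam=False)
print(f.spam_probability("buy cheap pills"))
print(f.spam_probability("monday team meeting"))
```

Every message the user labels shifts the word counts, which is how the
filter learns the more you use it.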
When deciding which methods to look for and employ to rid your mailboxes
of spam, keep in mind that granular control should be your first and
foremost consideration. Filtering can increase the likelihood that spam
will be blocked. Choosing a solution that gives you multiple options for
dealing with suspected e-mails is probably your best bet.
I am currently trying out a subscription service, free for 30 days, to
see if I like it or not. The subscription service is called SpamArrest
(www.spamarrest.com). The cost of the service is $19.95 for 6 months or
$34.95 for 1 year. This service can be used with any POP3 mail client,
including Outlook Express, Outlook and others. I happen to be using it
with Eudora Pro 5.2.1. In a future report, I'll let you know how it
tested against some of the other methods I investigated earlier.
Related web sites:
Bayesian filter
CRM114
Brightmail
Mail Abuse Prevention Systems (MAPS)
Open Relay Database
SpamArrest
SpamAssassin
Spambayes
SpamCop