| @Fernand0 | Our next speaker is José Nazario. |
|---|---|
| @Fernand0 | He holds a Ph.D. in Biochemistry and is a very active researcher on |
| @Fernand0 | Information Security topics at Arbor Networks. |
| @Fernand0 | He gave us an interesting talk about "Strategies for the detection of |
| @Fernand0 | internet worms" and today the topic is Spam. |
| @Fernand0 | Today the talk is "Spam Analysis". |
| @Fernand0 | Notes for this talk are at: |
| @Fernand0 | http://monkey.org/~jose/wiki/wiki.php?page=SpamAnalysis |
| @Fernand0 | |
| @Fernand0 | Thank you to all for coming here and thank you to Dr. Nazario for |
| @Fernand0 | preparing this conference about this hot topic. |
| @Fernand0 | Jose ... |
| @jose_n | thanks fernand0, and thanks mjesus, sarnold, emporer, ismak, and everyone involved in this conference. it's an honor to present among this group. |
| @jose_n | today's talk will be on some bulk analysis of spam i have been doing for the past year or so. |
| @jose_n | this project originally started as a way to attempt to develop better filter methods. |
| @jose_n | originally i was thinking about it as an approach like we have in firewalling. |
| @jose_n | i was thinking about a stateful mail filter, analogous to a stateful firewall. |
| @jose_n | i'll explain this scenario in a bit. |
| @jose_n | but what this analysis has wound up being is a way to investigate some "truths" about spam that we often hear about. |
| @jose_n | hurr |
| @jose_n | where did i leave off? my client got disconnected ... |
| @MJesus | nestplitz |
| @jose_n | si. what was my last comment? |
| @Fernand0 | net problems |
| @Fernand0 | :/ |
| @MJesus | last sentence: |
| @MJesus | [19:20] <jose_n> but what this analysis has wound up being is a way to investigate some "truths" about spam that we often hear about. |
| @jose_n | thank you |
| @jose_n | ok ... |
| @jose_n | this includes the idea of what top level domains are the sources of spam and how effective you can be using static filter methods. |
| @jose_n | as Fernand0 said, my notes and figures are here in my wiki: http://monkey.org/~jose/wiki/wiki.php?page=SpamAnalysis |
| @jose_n | so, let's start (again) at the top |
| @jose_n | so, the source for much of this work is my own personal collection of inboxes. |
| @jose_n | my email address is widely published on the internet because i post to a lot of mailing lists. |
| @jose_n | it's also on my web page. |
| @jose_n | so i get a lot of spam, which i have harvested and saved. |
| @jose_n | i then scale up this analysis by looking at the spam archive on the internet. |
| @jose_n | so far my numbers have been very accurate when compared to this large, public archive. |
| @jose_n | i did most of this classification by hand, but i have started to use ifile to help me classify the spam |
| @jose_n | i verify all of these results by hand, though. |
| @sarnold | *sigh* |
| @jose_n | the first thing i looked at was the path announced by the spam headers. |
| @jose_n | remember that my primary goal was to attempt to build an smtp firewall which would be stateful and look at the path of the mail. |
| @jose_n | ok, very sorry again ... |
| @jose_n | email headers will contain the list of mail servers which bounced the mail around until it gets to you. |
| @jose_n | while spammers may alter those, we have to trust them because this mail firewall would trust them. |
| @jose_n | so i analyzed them using a small script i wrote in Tcl and decided to look at the results using a graph. |
| @jose_n | in this first graph i used the program "neato" (from the graphviz package) to plot the paths the mail took to get to me. |
| @jose_n | http://www.monkey.org/~jose/images/monkey-spam-sm.jpg |
| @jose_n | i have larger postscript versions of this map available. |
| @jose_n | in this graph my mail servers are the blue pieces. |
| @jose_n | what i found was a bit surprising to me. |
| @jose_n | what i expected to see was a few hubs of activity, which would be big open relays. |
| @jose_n | instead, what you see is that there are few servers in common as the mail comes towards me, suggesting that there isn't any big open relay abuse being shared by spammers who are hitting me. |
| @jose_n | also, you see a few long tendrils, which are a lot of hops as the mail comes towards me ... |
| @jose_n | from riel: riel> jose_n: well, spammers often use open proxies or Jeems to abuse |
| @jose_n | ISP mail servers from there |
| @jose_n | they may, but it's not detectable by my methods here. |
| @jose_n | remember that i was trying to develop an smtp filter, like a packet filter, which would make decisions based on this advertised routing in the headers. |
| @jose_n | (as an aside, i recently got a mail from a spammer which contained their list of open proxies they use to send their spam) |
| @jose_n | from ifvoid> also, proxies are probably continuously appearing and disappearing, which would explain the large number of different hosts |
| @jose_n | again, they may, but that conclusion is not supported by this data and not testable by any designs i was working with. |
| @jose_n | so what i'm finding in this analysis is that these common anecdotes about spam are not supported by my data sets. |
| @jose_n | this doesn't mean they're not true, it just means that they're not supported by the data i gathered. |
| @jose_n | from riel> jose_n: one question though, did your spam analysis result in any useful rules on how to filter spam based on the Received: path ? |
| @jose_n | the answer is nothing apparent. the closest rule i could come up with is that the number of hops in my spam was an average of approximately 1 longer than my normal inbox mail. |
| @jose_n | so if you can get a good hop count for your normal mail, you can probably add this extra hop count as a minor weight in a spam determination. |
| @jose_n | i even made some pretty artwork out of this: http://monkey.org/~jose/crimelabs-spam.jpg for example. |
| @jose_n | this is a hyperbolic plot on a 3d surface, not a GPS positioning of spammers. but it makes the graph easier to read. |
| @jose_n | you can get this viewer at http://graphics.stanford.edu/~munzner/h3/ |
| @jose_n | from ifvoid> jose_n: this is your regular mail, or the spam? |
| @jose_n | it's my spam ... i didn't put up any graphs of my regular mail. |
| @jose_n | also i found out i needed a better graphing program than neato :) i keep finding how to break it with big data sets. |
| @jose_n | ok, so the next thing i looked at was the number of spam mails i have been getting per day over the past year or so. |
| @jose_n | there have been big reports about how we're all getting swamped with spam at some huge rate which doubles every year. |
| @jose_n | i looked last fall and again this spring and while the numbers are up, they're not doubled. |
| @jose_n | you can compare these two graphs: http://www.monkey.org/~jose/graphing/spam/spam_by_day.jpg and http://monkey.org/~jose/graphing/spam/spams-by-day2.png |
| @jose_n | so, so far i haven't found any big support for the "everyone is going to drown in spam" claim, but nothing big against it, either. i need more access to bigger logs for this one, though ... |
| @jose_n | now, another common thing for people to say is "all of my spam comes from the asia-pacific region", so i decided to look at the top level domains and second level domains of the spam i have. |
| @jose_n | seeing this perceived trend, people will start to filter all mail from .kr, .jp, or .cn. |
| @jose_n | so, i decided to look at the advertised/listed "from" address and the top level domains (ie .net, .jp, .cn) and the second level domains (ie yahoo.fr, korea.com, etc) and see what i get. |
| @jose_n | the TLDs are very surprising: |
| @jose_n | http://www.monkey.org/~jose/figs/spam-tld.png |
| @jose_n | what we see here is that .com is the biggest source of spam and .kr is wayyyy down the list. |
| @jose_n | i get similar numbers when i analyze the spam archive, where i looked at something like 75000 spam mails |
| @jose_n | from sarnold> jose_n: are those From: or From_ or Received:? |
| @jose_n | these are from the ^From_ (from ... space) addresses. |
| @jose_n | the second level domains (2LD) graph looks like this: http://www.monkey.org/~jose/figs/spam-2ld.png |
| @jose_n | where the biggest abusers are yahoo.com and hotmail.com, not korea.com, not chinanet, etc ... |
| @sarnold | (is there anyone able to translate to spanish for #redes?) |
| @jose_n | again, this is backed up by 75000 mails from spam archive. |
| @jose_n | i cannot find support for this notion that asia-pacific is a hotbed for spam. |
| @jose_n | riel> jose_n: aren't the from addresses forged most of the time ? |
| @jose_n | sure! but your filter can't make an intelligent determination of who really sent it, so you base it on this. |
| @jose_n | but bear in mind anyone can get a yahoo.com, hotmail.com mail address without issue |
| @jose_n | and send a bunch of mail. |
| @jose_n | also, because of what i do, i have friends in .kr and .jp, so i can't block them blindly :) |
| @jose_n | from ifvoid> jose_n: can't you use Received headers to find out where a spam originated? |
| @jose_n | maybe, but again, those may be forged. |
| @jose_n | so you can't trust them, as offtopic and riel note |
| @jose_n | but, if you're building blocking software, you have to work with what you get to keep up with mail loads. |
| @jose_n | you can't spend time digging around. |
| @jose_n | gustavo asks a great question: what would i recommend for large mail servers to do to block spam? |
| @jose_n | it's important to remember that this is the perspective of an inbox |
| @jose_n | you don't get more than a few spams from any one address at a time. |
| @jose_n | however, if you run a large mail server, you'll see a lot of mails from the same sources before they disconnect and move on. |
| @jose_n | there it's valuable to subscribe to trustworthy known-spammer lists, but i don't trust the RBL, ORBS, or SPEWS. |
| @jose_n | i have personally (and professionally) seen them make too many mistakes. |
| @jose_n | for end users i think that ifile and bmf, bayesian filters, are best. |
| @jose_n | Vegas asks "so on what programs like Spam Assassin base itself to judge if a mail is a spam or not ?" |
| @jose_n | those are hand crafted rulesets which make scoring based judgements per message. |
| @jose_n | riel correctly points out that you can't examine every message at the ISP level. the load is too great. |
| @jose_n | i was trying to state that you need to use trustworthy lists to block based on the sender id/ip and block their connection as fast as possible. |
| @jose_n | that scales for an isp, because with each bad ip you block you stop 1000s of mail messages. |
| @jose_n | sarnold correctly points out that SA2.5+ has a bayesian component now. |
| @jose_n | but still has its regex rules. |
| @jose_n | but for the average user, trying to keep up with these lists of known spam addresses, known subject lines, etc. is a never-ending game. |
| @jose_n | more importantly you can't react fast enough to make a difference when it's filtering on a small mail stream. |
| @jose_n | from riel> jose_n: and a question for you, which blocklists would you consider trustworthy, and why ? |
| @jose_n | i don't know any more, since i got out of the mail server operation about a year ago. at the time i didn't see any lists of active and rogue spammers that i could trust. known mass mailer marketers are registered with a few companies and you can trust those to some effect. but they don't stop the random spammers. |
| @jose_n | so, it's been about an hour (filled with good questions and comments as well as some netsplits), and i'll point you at some more material in my notes here: http://monkey.org/~jose/wiki/wiki.php?page=SpamAnalysis |
| @jose_n | there's a bit more there than i had time to talk about, and this page is still being added to. |
| @jose_n | at this time i'll let the translators finish up and i'll take any questions! |
| @jose_n | (in #qc) |
| @jose_n | thank you for your time, i appreciate it :) and thanks to the conference organizers. |
| @jose_n | let's give them a warm round of applause. |
| @riel | jose_n: thank you ! |
| @riel | clap clap clap clap clap |
| garoeda | clap clap clap clap clap clap |
| @sarnold | clap clap clap clap clap :) |
| hans | thank you jose_n |
| hans | clap clap clap clap clap |
| iaiox | thanks jose_n |
| iaiox | ;) |
| @jose_n | from sarnold> jose_n: do you have any opinions on razor and razor2? |
| Vegas | thank you jose_n |
| @jose_n | no opinion :) i'm really happy with ifile, so i haven't looked at much else right now. |
| @jose_n | from jose_n: what do you think more important ? receiving less spam or making sure less spam is sent ? |
| offtopic | jose_n: thank you, clap - very interesting idea. |
| @jose_n | i say stop it at the source, lets stop spammers from sending spams. everything else will flow from there. |
| Vegas | as a conclusion what program do u think is the more efficient to block spams ? |
| @jose_n | but i will say this: i no longer trust anecdotes about spam, i need to see hard numbers behind it :) |
| @jose_n | Vegas: i'm all about bayesian filtering, although crm114 looks pretty cool, too. |
| @jose_n | i'm no longer using static filters. |
| Vegas | k |
| Vegas | thanks |
| @jose_n | the trick is to give it a good body of knowledge to start with. |
| @sarnold | thanks jose_n :) I look forward to your next presentation at umeet (heh heh heh :) -- everyone, please note our next presentation is starting in roughly 40 minutes; offtopic will be presenting on client-side security |
| @jose_n | w00t! go offtopic! |
| @jose_n | thanks sarnold :) |
| @riel | bayesian filters have one big advantage for us ... everybody has a differently trained bayesian filter, so the spammers have no way to tune their message to bypass our filters |
| @sarnold | thanks also to juanca, who has been translating in #redes :) |
| @jose_n | ahh ! great job, and thanks juanca! |
| StartX | http://www.paulgraham.com/better.html - is a good article on bayesian filtering if anybody wanted to read more about it |
| hensema | shameless plug: the website of the dutch anti-spam foundation spamvrij.nl (spamfree): http://www.spamvrij.nl (unfortunately only in dutch) |
| * hensema | is secretary of the foundation BTW |
| @riel | for people wanting to know about the various dns-based blocklists, http://openrbl.org/ has pointers to a lot of them |
| hans | jose_n: will you attend this channel next friday when riel will talk? I'm curious what you think about it |
| @riel | and a bit of statistics on them |
| @jose_n | hans: i will try :) |
| hans | jose_n: would be nice, thnx so far |
| @riel | jose_n: your analysis showed some surprises .. |
| @riel | I was amazed to see that so much spam came from a few sources |
| @jose_n | riel: yep, me too :) all i can say is that i want to see people measure spam, i think we need to do that and stop basing our techniques on conjecture :) |
| @jose_n | especially if you want to get serious about stopping spam. |
| @riel | especially since I have received spam from over 30000 different IP addresses over the last 3 weeks |
| hans | jose_n: you are right, it might be a good idea to throw together a huge pile of spamboxes and analyze them together |
| @jose_n | hans: spam archive :) use it. |
| ifvoid | jose_n: spam archive? |
| hans | I am |
| @jose_n | http://www.spamarchive.org/ |
| hans | well, I archive spam in mailboxes |
| @jose_n | yes, but you a) can't get enough to make really important conclusions on and b) it will be biased by the visibility of those addresses or domains. |
| @jose_n | you need a large, external source |
| hans | jose_n: well, riel and I are working on it |
| hans | more details will follow |
| @jose_n | good. |
| @jose_n | lemme know what you find :) |
| hans | jose_n: we will, friday evening :-) |
| @jose_n | excellent |
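
The measurements described in the talk (counting hops in the advertised Received: path, and tallying top level and second level domains from the mbox "From " lines) can be sketched roughly as below. This is only an illustration in Python, not the Tcl script jose_n actually used, which is not shown in the log. The function names, the `weight` parameter, and the sample message are all hypothetical, and the second level domain is naively taken to be the last two labels of the domain, which is wrong for names like example.co.uk.

```python
from collections import Counter
from email import message_from_string


def hop_count(raw_message):
    """Number of Received: headers, i.e. the length of the advertised path."""
    msg = message_from_string(raw_message)
    return len(msg.get_all("Received", []))


def sender_domains(from_line):
    """Split an mbox 'From ' separator line into (TLD, 2LD), or None."""
    fields = from_line.split()
    if len(fields) < 2 or fields[0] != "From" or "@" not in fields[1]:
        return None
    labels = fields[1].rsplit("@", 1)[1].lower().strip(".").split(".")
    if len(labels) < 2:
        return None
    # Naive assumption: the 2LD is simply the last two labels.
    return labels[-1], ".".join(labels[-2:])


def tally(from_lines):
    """Count how often each TLD and 2LD appears in a list of 'From ' lines."""
    tlds, slds = Counter(), Counter()
    for line in from_lines:
        domains = sender_domains(line)
        if domains:
            tlds[domains[0]] += 1
            slds[domains[1]] += 1
    return tlds, slds


def extra_hop_weight(raw_message, normal_mean_hops, weight=0.5):
    """Hypothetical minor score: hops beyond the average for normal mail."""
    return max(0.0, hop_count(raw_message) - normal_mean_hops) * weight


# Invented sample message; all hosts and addresses are placeholders.
SAMPLE = (
    "Received: from relay2.example.net by mx.example.org; Wed, 1 Jan 2003 00:00:03 +0000\n"
    "Received: from relay1.example.net by relay2.example.net; Wed, 1 Jan 2003 00:00:02 +0000\n"
    "Received: from sender.example.com by relay1.example.net; Wed, 1 Jan 2003 00:00:01 +0000\n"
    "From: someone@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
```

As the discussion notes, both the Received: path and the sender address can be forged, so numbers produced this way describe what the headers claim, not where the mail really came from. The hop weight is meant only as a small input to a larger spam score, in line with the "minor weight" suggested in the talk.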