@Fernand0 Our next speaker is José Nazario.
@Fernand0 He holds a Ph.D. in Biochemistry and is a very active researcher on
@Fernand0 Information Security topics at Arbor Networks.
@Fernand0 He gave us an interesting talk about "Strategies for the detection of
@Fernand0 internet worms" and today the topic is Spam.
@Fernand0 Today the talk is "Spam Analysis".
@Fernand0 Notes for this talk are at:
@Fernand0 http://monkey.org/~jose/wiki/wiki.php?page=SpamAnalysis
@Fernand0  
@Fernand0 Thank you to all for coming here and thank you to Dr. Nazario for
@Fernand0 preparing this conference about this hot topic.
@Fernand0 Jose ...
@jose_n thanks fernand0, and thanks mjesus, sarnold, emporer, ismak, and everyone involved in this conference. it's an honor to present among this group.
@jose_n today's talk will be on some bulk analysis of spam i have been doing for the past year or so.
@jose_n this project originally started as a way to attempt to develop better filter methods.
@jose_n originally i was thinking about it as an approach like we have in firewalling.
@jose_n i was thinking about a stateful mail filter, analogous to a stateful firewall.
@jose_n i'll explain this scenario in a bit.
@jose_n but what this analysis has wound up being is a way to investigate some "truths" about spam that we often hear about.
@jose_n hurr
@jose_n where did i leave off? my client got disconnected ...
@MJesus nestplitz
@jose_n si. what was my last comment?
@Fernand0 net problems
@Fernand0 :/
@MJesus last sentence:
@MJesus [19:20] <jose_n> but what this analysis has wound up being is a way to investigate some "truths" about spam that we often hear about.
@jose_n thank you
@jose_n ok ...
@jose_n this includes the idea of what top level domains are the sources of spam and how effective you can be using static filter methods.
@jose_n as Fernand0 said, my notes and figures are here in my wiki: http://monkey.org/~jose/wiki/wiki.php?page=SpamAnalysis
@jose_n so, let's start (again) at the top
@jose_n so, the source for much of this work is my own personal collection of inboxes.
@jose_n my email address is widely published on the internet because i post to a lot of mailing lists.
@jose_n it's also on my web page.
@jose_n so i get a lot of spam, which i have harvested and saved.
@jose_n i then scale up this analysis by looking at the spam archive on the internet.
@jose_n so far my numbers have been very accurate when compared to this large, public archive.
@jose_n i did most of this classification by hand, but i have started to use ifile to help me classify the spam
@jose_n i verify all of these results by hand, though.
@sarnold *sigh*
@jose_n the first thing i looked at was the path announced by the spam headers.
@jose_n remember that my primary goal was to attempt to build an smtp firewall which would be stateful and look at the path of the mail.
@jose_n ok, very sorry again ...
@jose_n email headers will contain the list of mail servers which bounced the mail around until it gets to you.
@jose_n while spammers may alter those, we have to trust them because this mail firewall would trust them.
@jose_n so i analyzed them using a small script i wrote in Tcl and decided to look at the results using a graph.
@jose_n in this first graph i used the program "neato" (from the graphviz package) to plot the paths the mail took to get to me.
@jose_n http://www.monkey.org/~jose/images/monkey-spam-sm.jpg
@jose_n i have larger postscript versions of this map available.
@jose_n in this graph my mail servers are the blue pieces.
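The path extraction described above can be sketched roughly as follows. The original script was written in Tcl and is not shown in the talk, so this Python version is only an illustration: the helper names `received_path` and `to_dot`, the two regular expressions, and the sample message are all assumptions. It reads the advertised `Received:` headers, pairs each "from" host with its "by" host, and writes an undirected graph in DOT form that neato can lay out.

```python
import re
from email import message_from_string

# Hypothetical patterns: real Received: headers vary a lot between mail servers.
FROM_RE = re.compile(r"^\s*from\s+([A-Za-z0-9.-]+)", re.IGNORECASE)
BY_RE = re.compile(r"\bby\s+([A-Za-z0-9.-]+)", re.IGNORECASE)


def received_path(raw_message):
    """Return the advertised relay hops as (from_host, by_host), oldest first."""
    msg = message_from_string(raw_message)
    hops = []
    # Each server prepends its Received: header, so the newest hop comes first.
    for header in msg.get_all("Received", []):
        text = " ".join(header.split())  # undo header folding
        m_from = FROM_RE.search(text)
        m_by = BY_RE.search(text)
        if m_from and m_by:
            hops.append((m_from.group(1).lower(), m_by.group(1).lower()))
    hops.reverse()
    return hops


def to_dot(raw_messages):
    """Merge the hops of many messages into one undirected DOT graph."""
    edges = set()
    for raw in raw_messages:
        edges.update(received_path(raw))
    lines = ["graph spam_paths {"]
    for src, dst in sorted(edges):
        lines.append('  "%s" -- "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)


# Made-up sample message using documentation-only host names and addresses.
SAMPLE = (
    "Received: from relay.example.net (relay.example.net [192.0.2.7])\n"
    "\tby mx.example.org with ESMTP; Wed, 1 Jan 2003 00:00:02 +0000\n"
    "Received: from sender.example.com (unknown [198.51.100.9])\n"
    "\tby relay.example.net with SMTP; Wed, 1 Jan 2003 00:00:01 +0000\n"
    "From: someone@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(to_dot([SAMPLE]))
```

Saving the output to a file such as `paths.dot` and running `neato -Tps paths.dot -o paths.ps` should produce a PostScript map. Shared open relays would show up as nodes with many edges.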
@jose_n what i found was a bit surprising to me.
@jose_n what i expected to see was a few hubs of activity, which would be big open relays.
@jose_n instead, what you see is that there are few servers in common as the mail comes towards me, suggesting that there isn't any big open relay abuse being shared by spammers who are hitting me.
@jose_n also, you see a few long tendrils, which are a lot of hops as the mail comes towards me ...
@jose_n from riel> jose_n: well, spammers often use open proxies or Jeems to abuse
@jose_n ISP mail servers from there
@jose_n they may, but it's not detectable by my methods here.
@jose_n remember that i was trying to develop an smtp filter, like a packet filter, which would make decisions based on this advertised routing in the headers.
@jose_n (as an aside, i recently got a mail from a spammer which contained their list of open proxies they use to send their spam)
@jose_n from ifvoid> also, proxies are probably continuously appearing and disappearing, which would explain the large number of different hosts
@jose_n again, they may, but that conclusion is not supported by this data and not testable by any designs i was working with.
@jose_n so what i'm finding in this analysis is that these common anecdotes about spam are not supported by my data sets.
@jose_n this doesn't mean they're not true, it just means that they're not supported by the data i gathered.
@jose_n from riel> jose_n: one question though, did your spam analysis result in any useful rules on how to filter spam based on the Received: path ?
@jose_n the answer is nothing apparent. the closest rule i could come up with is that the number of hops in my spam was an average of approximately 1 longer than my normal inbox mail.
@jose_n so if you can get a good hop count for your normal mail, you can probably add this extra hop count as a minor weight in a spam determination.
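As a rough illustration of that hop-count idea, the sketch below counts `Received:` headers, takes the mean over a set of normal mail, and adds a small score when a message is at least one hop longer than that mean. The function names, the `0.5` weight, and the generated test messages are assumptions made for the example, not values from the talk.

```python
from email import message_from_string


def hop_count(raw_message):
    """Number of Received: headers, i.e. advertised hops."""
    return len(message_from_string(raw_message).get_all("Received", []))


def mean_hops(raw_messages):
    """Average hop count over a collection of messages (0.0 if empty)."""
    counts = [hop_count(m) for m in raw_messages]
    return sum(counts) / len(counts) if counts else 0.0


def hop_weight(raw_message, ham_mean, weight=0.5):
    """Minor score contribution; the 0.5 default is an arbitrary assumption."""
    extra = hop_count(raw_message) - ham_mean
    return weight if extra >= 1.0 else 0.0


def fake_message(n_hops):
    """Build a made-up message with n_hops Received: headers (test data only)."""
    headers = "".join(
        "Received: from a%d.example by b%d.example\n" % (i, i)
        for i in range(n_hops)
    )
    return headers + "Subject: x\n\nbody\n"


if __name__ == "__main__":
    ham_mean = mean_hops([fake_message(2), fake_message(4)])
    print(ham_mean, hop_weight(fake_message(4), ham_mean))
```

The weight is deliberately small: a one-hop average difference is a weak signal, so it should only nudge a decision made mostly on other evidence.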
@jose_n i even made some pretty artwork out of this: http://monkey.org/~jose/crimelabs-spam.jpg for example.
@jose_n this is a hyperbolic plot on a 3d surface, not a GPS positioning of spammers. but it makes the graph easier to read.
@jose_n you can get this viewer at http://graphics.stanford.edu/~munzner/h3/
@jose_n from ifvoid> jose_n: this is your regular mail, or the spam?
@jose_n it's my spam ... i didn't put up any graphs of my regular mail.
@jose_n also i found out i needed a better graphing program than neato :) i keep finding how to break it with big data sets.
@jose_n ok, so the next thing i looked at was the number of spam mails i have been getting per day over the past year or so.
@jose_n there have been big reports about how we're all getting swamped with spam at some huge rate which doubles every year.
@jose_n i looked last fall and again this spring and while the numbers are up, they're not doubled.
@jose_n you can compare these two graphs: http://www.monkey.org/~jose/graphing/spam/spam_by_day.jpg and http://monkey.org/~jose/graphing/spam/spams-by-day2.png
@jose_n so, so far i haven't found any big support for the "everyone is going to drown in spam", but nothing big against it, either. i need more access to bigger logs for this one, though ...
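A per-day count like the one in those graphs could be produced along these lines, assuming the spam is stored in mbox format and that the timestamp on each `From ` separator line is good enough as a delivery date. The function name `spam_per_day` and the sample lines are made up for the example.

```python
from collections import Counter
from datetime import datetime


def spam_per_day(mbox_lines):
    """Count messages per calendar day using the date on each 'From ' line."""
    per_day = Counter()
    for line in mbox_lines:
        if not line.startswith("From "):
            continue
        fields = line.split()
        if len(fields) < 7:
            continue  # separator line without a full timestamp
        try:
            stamp = datetime.strptime(
                " ".join(fields[2:7]), "%a %b %d %H:%M:%S %Y"
            )
        except ValueError:
            continue  # unexpected date layout; skip rather than guess
        per_day[stamp.date().isoformat()] += 1
    return per_day


# Made-up separator lines in the usual mbox layout.
SAMPLE_LINES = [
    "From a@example.com Mon Jan  6 10:00:00 2003\n",
    "Subject: one\n",
    "From b@example.net Mon Jan  6 23:59:59 2003\n",
    "From c@example.org Tue Jan  7 08:30:00 2003\n",
]

if __name__ == "__main__":
    for day, count in sorted(spam_per_day(SAMPLE_LINES).items()):
        print(day, count)
```

Comparing the daily averages of two periods a year apart would then show directly whether the volume has doubled.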
@jose_n now, another common thing for people to say is "all of my spam comes from the asia-pacific region", so i decided to look at the top level domains and second level domains of the spam i have.
@jose_n seeing this perceived trend, people will start to filter all mail from .kr, .jp, or .cn.
@jose_n so, i decided to look at the advertised/listed "from" address and the top level domains (ie .net, .jp, .cn) and the second level domains (ie yahoo.fr, korea.com, etc) and see what i get.
@jose_n the TLDs are very surprising:
@jose_n http://www.monkey.org/~jose/figs/spam-tld.png
@jose_n what we see here is that .com is the biggest source of spam and .kr is wayyyy down the list.
@jose_n i get similar numbers when i analyze the spam archive, where i looked at something like 75000 spam mails
@jose_n from sarnold> jose_n: are those From: or From_ or Received:?
@jose_n these are from the ^From_ (from ... space) addresses.
@jose_n the second level domains (2LD) graph looks like this: http://www.monkey.org/~jose/figs/spam-2ld.png
@jose_n where the biggest abusers are yahoo.com and hotmail.com, not korea.com, not chinanet, etc ...
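The TLD and 2LD tallies can be sketched as below, assuming an mbox file where each message starts with a `From ` separator line carrying the envelope sender. The names `envelope_sender_domains` and `tally` and the sample addresses are hypothetical. Taking the last two labels as the second level domain is a simplification that mislabels names under suffixes such as `.co.kr`.

```python
from collections import Counter


def envelope_sender_domains(mbox_lines):
    """Yield the sender domain from each mbox 'From ' separator line."""
    for line in mbox_lines:
        if not line.startswith("From "):
            continue
        fields = line.split()
        if len(fields) < 2 or "@" not in fields[1]:
            continue
        yield fields[1].rsplit("@", 1)[1].lower().rstrip(".")


def tally(mbox_lines):
    """Return (tld_counts, second_level_counts) as Counters."""
    tlds, slds = Counter(), Counter()
    for domain in envelope_sender_domains(mbox_lines):
        labels = domain.split(".")
        tlds["." + labels[-1]] += 1
        # Naive 2LD: last two labels only (an assumption, see the note above).
        slds[".".join(labels[-2:])] += 1
    return tlds, slds


# Made-up senders; the addresses are forged-looking on purpose.
SAMPLE_MBOX = [
    "From a@mail.yahoo.com Mon Jan  6 10:00:00 2003\n",
    "Subject: hello\n",
    "\n",
    "body\n",
    "From b@hotmail.com Mon Jan  6 10:05:00 2003\n",
    "From c@yahoo.com Mon Jan  6 10:10:00 2003\n",
    "From d@example.net Mon Jan  6 10:15:00 2003\n",
]

if __name__ == "__main__":
    tld_counts, sld_counts = tally(SAMPLE_MBOX)
    print(tld_counts.most_common())
    print(sld_counts.most_common())
```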
@sarnold (is there anyone able to translate to spanish for #redes?)
@jose_n again, this is backed up by 75000 mails from spam archive.
@jose_n i cannot find support for this notion that asia-pacific is a hotbed for spam.
@jose_n riel> jose_n: aren't the from addresses forged most of the time ?
@jose_n sure! but your filter can't make an intelligent determination of who really sent it, so you base it on this.
@jose_n but bear in mind anyone can get a yahoo.com, hotmail.com mail address without issue
@jose_n and send a bunch of mail.
@jose_n also, because of what i do, i have friends in .kr and .jp, so i can't block them blindly :)
@jose_n from ifvoid> jose_n: can't you use received-headers to find out where a spam originated?
@jose_n maybe, but again, those may be forged.
@jose_n so you can't trust them, as offtopic and riel note
@jose_n but, if you're building blocking software, you have to work with what you get to keep up with mail loads.
@jose_n you can't spend time digging around.
@jose_n gustavo asks a great question: what would i recommend for large mail servers to do to block spam?
@jose_n it's important to remember that this is the perspective of an inbox
@jose_n you don't get more than a few spams from any one address at a time.
@jose_n however, if you run a large mail server, you'll see a lot of mails from the same sources before they disconnect and move on.
@jose_n there it's valuable to subscribe to trustworthy known-spammer lists, but i don't trust the RBL, ORBS, or SPEWS.
@jose_n i have personally (and professionally) seen them make too many mistakes.
@jose_n for end users i think that ifile and bmf, bayesian filters, are best.
@jose_n Vegas asks "so on what programs like Spam Assassin base itself to judge if a mail is a spam or not ?"
@jose_n those are hand crafted rulesets which make scoring based judgements per message.
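A scoring ruleset of that general kind might look like the sketch below. The three rules, their point values, and the threshold are invented for illustration and are not SpamAssassin's actual rules or scores.

```python
import re

# Hypothetical hand-written rules: (name, pattern, points).
RULES = [
    ("SUBJECT_ALL_CAPS",
     re.compile(r"^Subject: [^a-z\n]*[A-Z][^a-z\n]*$", re.MULTILINE), 1.5),
    ("MENTIONS_FREE", re.compile(r"\bfree\b", re.IGNORECASE), 1.0),
    ("CLICK_HERE", re.compile(r"click here", re.IGNORECASE), 2.0),
]
THRESHOLD = 3.0  # arbitrary cut-off chosen for the example


def score(raw_message):
    """Return (total_points, names_of_rules_that_matched)."""
    hits = [(name, pts) for name, rx, pts in RULES if rx.search(raw_message)]
    return sum(pts for _, pts in hits), [name for name, _ in hits]


def is_spam(raw_message):
    """Per-message judgement: spam if the summed points reach the threshold."""
    return score(raw_message)[0] >= THRESHOLD


SPAMMY = "Subject: BUY NOW\n\nClick here for a FREE offer\n"
NORMAL = "Subject: lunch tomorrow\n\nsee you at noon\n"

if __name__ == "__main__":
    print(score(SPAMMY), is_spam(SPAMMY))
    print(score(NORMAL), is_spam(NORMAL))
```

Every rule and weight here has to be written and retuned by hand, which is the maintenance burden a trained filter avoids.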
@jose_n riel correctly points out that you can't examine every message at the ISP level. the load is too great.
@jose_n i was trying to state that you need to use trustworthy lists to block based on the sender id/ip and block their connection as fast as possible.
@jose_n that scales for an isp, because with each bad ip you block you stop 1000s of mail messages.
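Blocking at connection time can be as cheap as one lookup per connecting peer, before any message content is read. A minimal sketch, assuming the list is a set of addresses and CIDR ranges taken from some source you trust (the `Blocklist` class and the entries are made up for the example):

```python
import ipaddress


class Blocklist:
    """Tiny in-memory blocklist of single addresses and CIDR networks."""

    def __init__(self, entries):
        # ip_network() turns a bare address into a single-host network.
        self.networks = [ipaddress.ip_network(entry) for entry in entries]

    def blocks(self, peer_ip):
        """True if the connecting peer falls inside any listed network."""
        addr = ipaddress.ip_address(peer_ip)
        return any(addr in net for net in self.networks)


# Documentation-only address ranges stand in for real list entries.
EXAMPLE_LIST = Blocklist(["192.0.2.10", "198.51.100.0/24"])

if __name__ == "__main__":
    for peer in ("198.51.100.77", "203.0.113.5"):
        print(peer, "reject" if EXAMPLE_LIST.blocks(peer) else "accept")
```

A linear scan is fine for a sketch. A list with many thousands of entries would want a faster lookup structure.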
@jose_n sarnold correctly points out that SA2.5+ has a bayesian component now.
@jose_n but still has its regex rules.
@jose_n but for the average user, trying to keep up with these lists of known spam addresses, known subject lines, etc. is a never ending game.
@jose_n more importantly you can't react fast enough to make a difference when it's filtering on a small mail stream.
@jose_n from riel> jose_n: and a question for you, which blocklists would you consider trustworthy, and why ?
@jose_n i don't know any more, since i got out of the mail server operation about a year ago. at the time i didn't see any lists of active and rogue spammers that i could trust. known mass mailer marketers are registered with a few companies and you can trust those to some effect. but they don't stop the random spammers.
@jose_n so, it's been about an hour (filled with good questions and comments as well as some netsplits), and i'll point you at some more material in my notes here: http://monkey.org/~jose/wiki/wiki.php?page=SpamAnalysis
@jose_n there's a bit more there than i had time to talk about, and this page is still being added to.
@jose_n at this time i'll let the translators finish up and i'll take any questions!
@jose_n (in #qc)
@jose_n thank you for your time, i appreciate it :) and thanks to the conference organizers.
@jose_n let's give them a warm round of applause.
@riel jose_n: thank you !
@riel clap clap clap clap clap
garoeda clap clap clap clap clap clap
@sarnold clap clap clap clap clap :)
hans thank you jose_n
hans clap clap clap clap clap
iaiox thanks jose_n
iaiox ;)
@jose_n from sarnold> jose_n: do you have any opinions on razor and razor2?
Vegas thank you jose_n
@jose_n no opinion :) i'm really happy with ifile, so i haven't looked at much else right now.
@jose_n from jose_n: what do you think more important ?   receiving less spam or making sure less spam is sent ?
offtopic jose_n: thank you, clap - very interesting idea.
@jose_n i say stop it at the source, let's stop spammers from sending spams. everything else will flow from there.
Vegas as a conclusion what program do u think is the more efficient to block spams ?
@jose_n but i will say this: i no longer trust anecdotes about spam, i need to see hard numbers behind it :)
@jose_n Vegas: i'm all about bayesian filtering, although crm114 looks pretty cool, too.
@jose_n i'm no longer using static filters.
Vegas k
Vegas thanks
@jose_n the trick is to give it a good body of knowledge to start with.
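To make the "body of knowledge" point concrete, here is a very small naive Bayes sketch. It is a simplified, generic version written for illustration and is not the algorithm used by ifile, bmf, or crm114. The class name, tokenizer, smoothing constants, and training sentences are all assumptions.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9$'-]+")


def tokens(text):
    """Distinct lowercase tokens in a message (presence, not frequency)."""
    return set(TOKEN_RE.findall(text.lower()))


class TinyBayes:
    """Simplified naive Bayes: only tokens present in the message are scored."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, text):
        self.docs[label] += 1
        self.counts[label].update(tokens(text))

    def spam_log_odds(self, text):
        """Positive means more spam-like, negative more ham-like."""
        # Prior from the training set sizes, with add-one smoothing.
        odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            odds += math.log(p_spam / p_ham)
        return odds


def trained_example():
    """Train on a tiny made-up corpus; a real filter needs far more mail."""
    nb = TinyBayes()
    nb.train("spam", "cheap pills online")
    nb.train("spam", "cheap mortgage offer")
    nb.train("ham", "meeting notes attached")
    nb.train("ham", "lunch at noon tomorrow")
    return nb


if __name__ == "__main__":
    nb = trained_example()
    print(nb.spam_log_odds("cheap pills"))
    print(nb.spam_log_odds("meeting tomorrow"))
```

With only four training messages the scores mean little. The quality of the result depends almost entirely on how much correctly sorted mail the filter has seen.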
@sarnold thanks jose_n :) I look forward to your next presentation at umeet (heh heh heh :) -- everyone, please note our next presentation is starting in roughly 40 minutes; offtopic will be presenting on client-side security
@jose_n w00t! go offtopic!
@jose_n thanks sarnold :)
@riel bayesian filters have one big advantage for us ... everybody has a differently trained bayesian filter, so the spammers have no way to tune their message to bypass our filters
@sarnold thanks also to juanca, who has been translating in #redes :)
@jose_n ahh ! great job, and thanks juanca!
StartX http://www.paulgraham.com/better.html - is a good article on bayesian filtering if anybody wanted to read more about it
hensema shameless plug: the website of the dutch anti-spam foundation spamvrij.nl (spamfree): http://www.spamvrij.nl (unfortunately only in dutch)
* hensema is secretary of the foundation BTW
@riel for people wanting to know about the various dns-based blocklists, http://openrbl.org/ has pointers to a lot of them
hans jose_n: will you attend this channel next friday when riel will talk? I'm curious what you think about it
@riel and a bit of statistics on them
@jose_n hans: i will try :)
hans jose_n: would be nice, thnx so far
@riel jose_n: your analysis showed some surprises ..
@riel I was amazed to see that so much spam came from a few sources
@jose_n riel: yep, me too :) all i can say is that i want to see people measure spam, i think we need to do that and stop basing our techniques on conjecture :)
@jose_n especially if you want to get serious about stopping spam.
@riel especially since I have received spam from over 30000 different IP addresses over the last 3 weeks
hans jose_n: you are right, it might be a good idea to throw together a huge pile of spamboxes and analyze them together
@jose_n hans: spam archive :) use it.
ifvoid jose_n: spam archive?
hans I am
@jose_n http://www.spamarchive.org/
hans well, I archive spam in mailboxes
@jose_n yes, but you a) can't get enough to make really important conclusions on and b) it will be biased by the visibility of those addresses or domains.
@jose_n you need a large, external source
hans jose_n: well, riel and I are working on it
hans more details will follow
@jose_n good.
@jose_n lemme know what you find :)
hans jose_n: we will, friday evening :-)
@jose_n excellent
