Rich Kulawiec | 1 Oct 2010 12:14

Re: Ideas for anti-spam

On Thu, Sep 30, 2010 at 10:02:35AM -0500, mathew wrote:
> No, having end-users vote democratically on what constitutes spam and
> then imposing that decision on everyone is a complete non-starter.

Agreed: user input should *never* be used unless it passes by the
eyeballs of someone senior, experienced, and cynical.

That's harsh, but so is reality.  Actually, reality's worse:

	http://www.mediapost.com/publications/?fa=Articles.showArticle&art_aid=130320#

which reads in part:

	The complaint details how Mizhen and his affiliates allegedly
	manipulated the statistics that Microsoft's anti-spam system
	relies on by creating millions of new email accounts and then
	moving up to 200,000 of their own messages a day from "junk"
	files into inboxes.

	[...]

	An associate of Mizhen allegedly contacted Microsoft and
	said that the messages weren't spam -- as evidenced by the
	statistics showing that people moved the messages into their
	inboxes. Microsoft was taken in by the associate's representations
	and unblocked the spam messages, according to its complaint.

This tells us several things.

First, spammers have demonstrated that they understand how to game
[...]

Darxus | 21 Oct 2010 03:45

Collecting IP reputation data from many people

I'd like your thoughts on collecting reputation data (% spam vs. non-spam
originating at every IP) from everyone willing to submit it.

My goal would be to be able to tell anyone how much spam and non-spam any
IP address has sent recently, so that they could use that information to
aid spam filtering.

I realize it wouldn't be easy to make this useful, given the interest of
spammers to corrupt the data.  The example of Mizhen vs. Microsoft posted
here a few months ago was excellent.  Also, the availability of captcha
solving for $2 per 1000.  

I realize that the vast majority of "people" submitting data could be
spammer controlled zombies, repeating back to the system that they've
received spam and non-spam from IPs others have reported, and once in a
while many of them could claim an IP is sending lots of non-spam, and no
spam, in the hopes that this system would identify that IP as a
non-spammer, allowing spam from that IP to get through filters.

I hope I'll be able to identify the quality of the data coming from each
person well enough that this would not be a problem.  Maybe a large number
of zombies could cause some spams to get through for one day, and then
be identified as malicious.  I guess that's my biggest question though.
How many times can the spammers use "a large number of zombies" to get
spam through for a day before they run out?  Obviously I'd take advantage
of things like Spamhaus's XBL (list of zombies).  It's also an interesting
problem of lacking known truth - how do you know which people are
reporting good data when everything you have to compare it to might be
maliciously bad data from spammers?
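
A minimal sketch of the idea above, assuming each submitter sends per-IP
spam and non-spam counts and that the collector has somehow assigned each
submitter a trust weight between 0 and 1.  How to derive those weights is
exactly the open question; the names ip_reputation and reporter_weight are
hypothetical and the numbers are made up for illustration.

```python
from collections import defaultdict

def ip_reputation(reports, reporter_weight, default_weight=0.0):
    """Return {ip: weighted spam fraction} from (reporter, ip, spam, ham) tuples."""
    spam = defaultdict(float)
    total = defaultdict(float)
    for reporter, ip, spam_count, ham_count in reports:
        # Unknown reporters get default_weight (0.0 means "ignored").
        w = reporter_weight.get(reporter, default_weight)
        spam[ip] += w * spam_count
        total[ip] += w * (spam_count + ham_count)
    # IPs with no trusted evidence are omitted rather than scored.
    return {ip: spam[ip] / total[ip] for ip in total if total[ip] > 0}

# Illustrative data: one honest reporter, two zombies vouching for the IP.
reports = [
    ("alice", "192.0.2.1", 90, 10),
    ("zombie1", "192.0.2.1", 0, 1000),
    ("zombie2", "192.0.2.1", 0, 1000),
]
weights = {"alice": 1.0, "zombie1": 0.01, "zombie2": 0.01}
scores = ip_reputation(reports, weights)
```

With equal weights the two zombies would pull the spam fraction for this IP
down to roughly 4%; with the weights shown it stays at 75%.  The sketch says
nothing about how to find the zombies in the first place.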


der Mouse | 21 Oct 2010 07:05

Re: Collecting IP reputation data from many people

> I'd like your thoughts on collecting reputation data (% spam vs.
> non-spam originating at every IP) from everyone willing to submit it.
> [...]

You know, I really hate to be a parade-rainer.  I also don't like to
tell anyone to not try experiments.  But I see this as a basically
hopeless task in view of the botnet problem: as soon as it becomes
widespread enough to be noticed, the signal in its input data will be
swamped by gaming attempts.

You are clearly aware of that basic problem, though, so you may be able
to come up with something.  Indeed, I hope you do; anything capable of
resisting such gaming while still doing any kind of crowdsourcing is
interesting for that alone.

I would warn you against inferring from successful preliminary tests
that it would be equally successful if widespread; there are zillions
of techniques that work fine provided few enough people do them that
spammers don't notice or don't think they're worth bothering to
circumvent.  (I depend on a few myself.)

So, sure, go ahead and try it.  Even if it's a failure in the form you
now envision, something useful may end up coming out of it.  Research
is like that.

> So do you think it's worth my effort?  

Dunno.  I don't think I'd put effort into it myself, but that doesn't
mean much in view of how large my to-do list of things I'd enjoy much
more is.  You sound fired up, though, and I suspect something valuable
[...]

Chris Lewis | 21 Oct 2010 16:53

Re: Collecting IP reputation data from many people

On 10/21/2010 1:05 AM, der Mouse wrote:

> I would warn you against inferring from successful preliminary tests
> that it would be equally successful if widespread; there are zillions
> of techniques that work fine provided few enough people do them that
> spammers don't notice or don't think they're worth bothering to
> circumvent.  (I depend on a few myself.)

I think it's worth pointing out that _everybody_ does to one extent or 
another.  Anti-spam is like security in general.  No protection is 
absolute.  You're always relying on setting the bar high enough that 
getting past it isn't worth the effort.  Part of that equation is trying 
to avoid being too much like someone else who _is_ worth bothering with.

IOW: don't make yourself too similar to Hotmail, Gmail or AOL, 
because the incentives to break them are far higher than "little ol' 
me", and if they get broke, you're toast.

I'm with der Mouse.  I'm not sure you're going to be able to build a 
prototype with broad enough reach to derive useful conclusions, and if 
you do manage that and it starts to become more broadly used, you'll 
become a target with entirely new massive problems to solve.  I know 
there are things that can help, but I think you'd get swamped so badly 
that even, say, 99% suppression of malicious (and just dumb) input won't 
be enough to yield a useful signal.
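
To put rough numbers on the "swamped" scenario, here is a small arithmetic
sketch.  The report volumes are hypothetical, and it assumes the filter
never discards an honest report by mistake, which is optimistic.

```python
def honest_fraction(honest, malicious, suppression):
    """Fraction of surviving reports that are honest after filtering.

    suppression is the share of malicious reports removed (0.99 = 99%).
    Assumes honest reports are never removed by mistake.
    """
    surviving_bad = malicious * (1.0 - suppression)
    return honest / (honest + surviving_bad)

# Hypothetical volumes: 10,000 honest reports against 10 million forged ones.
fraction = honest_fraction(10_000, 10_000_000, 0.99)
print(f"{fraction:.1%} of surviving reports are honest")
```

Under those assumptions, removing 99% of the forged reports still leaves
about 100,000 of them, so only about one surviving report in eleven is
honest.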

Also like der Mouse, I think it might be a useful, instructive exercise. 
But be prepared for the exercise to prove to you that it won't work, and 
why.  And please publish the result, at least here.

Rich Kulawiec | 22 Oct 2010 18:53

Re: Collecting IP reputation data from many people


Others have largely covered what I'd say already, so I'll add this:
you may find it useful to read about Credence:

	http://www.cs.cornell.edu/People/egs/credence/

and to consider how a system like that deployed in an IP reputation
project might (or might not) be able to resist accidental and deliberate
manipulation.

---rsk

Dave CROCKER | 22 Oct 2010 19:02

Re: Collecting IP reputation data from many people


On 10/21/2010 7:53 AM, Chris Lewis wrote:
> I think it's worth pointing out that _everybody_ does to one extent or another.

I used to think that, too.  But lots of other folk told me it was not true, so I 
stopped believing it...

d/
-- 

   Dave Crocker
   Brandenburg InternetWorking
   bbiw.net

Darxus | 22 Oct 2010 21:19

Re: Collecting IP reputation data from many people

On 10/22, Rich Kulawiec wrote:
> Others have largely covered what I'd say already, so I'll add this:
> you may find it useful to read about Credence:
> 
> 	http://www.cs.cornell.edu/People/egs/credence/

That's interesting, but hard to apply.  I can't really give everyone a
custom list of reputation data only using data from other people with whom
theirs highly correlates.  Well, maybe I could try.  Probably insane
though.

Yes, finding groups with highly correlating reports will be important, but
I don't expect it to be obvious which groups are the spammers and which
groups are legit.  Or which groups are just incompetent.
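
A minimal sketch of finding highly correlating reporters, assuming each
reporter submits a simple "spam" or "ham" verdict per IP.  This is not
Credence's actual algorithm, only plain pairwise agreement; the function
names and the sample data are hypothetical.

```python
from itertools import combinations

def agreement(a, b):
    """Fraction of commonly reported IPs on which two reporters agree, or None."""
    shared = a.keys() & b.keys()
    if not shared:
        return None  # no overlap, so nothing to compare
    return sum(a[ip] == b[ip] for ip in shared) / len(shared)

def pairwise_agreement(verdicts):
    """Map each reporter pair to its agreement score, skipping pairs with no overlap."""
    scores = {}
    for r1, r2 in combinations(sorted(verdicts), 2):
        score = agreement(verdicts[r1], verdicts[r2])
        if score is not None:
            scores[(r1, r2)] = score
    return scores

# Illustrative verdicts only.
verdicts = {
    "alice": {"192.0.2.1": "spam", "192.0.2.2": "ham"},
    "bob": {"192.0.2.1": "spam", "192.0.2.2": "ham"},
    "zombie": {"192.0.2.1": "ham", "192.0.2.2": "ham"},
}
scores = pairwise_agreement(verdicts)
```

Here alice and bob agree on everything and each agrees with zombie on half
of the shared IPs.  As noted above, the scores show which reporters cluster
together, not which cluster is the legitimate one.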

Something I'm surprised didn't come up on this list, that was mentioned on
the spamassassin users list:
http://www.mimedefang.org/reputation
An IETF draft of a Reputation Reporting Protocol.

I'm not a huge fan.  It uses UDP which, as it mentions, allows forging of
the sending IP.  Other good stuff in it though.

The spamassassin users list also mentioned Google's paper from 2006 on
their internal reputation system:  http://www.ceas.cc/2006/19.pdf
I'm curious how much the usefulness of authenticated domain based
reputations has changed in the years since.

-- 
"I'd rather be happy than right any day."

Rich Kulawiec | 23 Oct 2010 00:14

Re: Collecting IP reputation data from many people

On Fri, Oct 22, 2010 at 03:19:47PM -0400, Darxus <at> ChaosReigns.com wrote:
(about Credence)
> That's interesting, but hard to apply.  I can't really give everyone a
> custom list of reputation data only using data from other people with whom
> theirs highly correlates.  Well, maybe I could try.  Probably insane
> though.

Agreed, it *is* hard to apply.  Maybe even insane.  One of the daunting
parts of a task like this is the scope of the problem: the universes of
(IPv4 addresses), (IPv6 addresses), (observations), and (reports)
are all pretty big; and (reporters) might need to be fairly large
in order to be effective.  That's a lot of data of questionable
provenance to juggle, and while some of what's been learned via
Credence may help, I don't think it's a full solution.  I just thought
it might be useful to consider their approach and see if any parts
of it lend themselves to this.

---Rsk

Chris Lewis | 23 Oct 2010 00:29

Re: Collecting IP reputation data from many people

On 10/22/2010 1:02 PM, Dave CROCKER wrote:
>
>
> On 10/21/2010 7:53 AM, Chris Lewis wrote:
>> I think it's worth pointing out that _everybody_ does to one extent or another.
>
>
> I used to think that, too.  But lots of other folk told me it was not true, so I
> stopped believing it...

That everybody's spam filters don't have techniques that are subject to 
attack and subversion if the stakes are high enough?

I don't believe that for a second.  I don't think you should either.

That will remain as long as email has the current paradigms.

This is why diversity in filter solutions continues to be so important. 
I don't want there to be a single methodology that everybody uses, 
because once that's broken, and the stakes are high enough that it will 
be (Hotmail being a case in point), everybody's screwed.

Daryl C. W. O'Shea | 23 Oct 2010 03:57

Re: Collecting IP reputation data from many people

On 22/10/2010 3:19 PM, Darxus <at> ChaosReigns.com wrote:
> Something I'm surprised didn't come up on this list, that was mentioned on
> the spamassassin users list:
> http://www.mimedefang.org/reputation
> An IETF draft of a Reputation Reporting Protocol.

I actually wrote a plugin for SA to query one such service, April 
Lorenzen's outboundindex.net.  I'm not sure how she ever made out with 
it, or if she's still at it.  Your ideas, hers, and others' are quite 
similar.  I myself only ever got around to a small-scale test of my own 
and registering a handful of domain names.

http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/dos/SIQ.pm

> I'm not a huge fan.  It uses UDP which, as it mentions, allows forging of
> the sending IP.  Other good stuff in it though.

You could avoid bulk amounts of forgery by passing sequence numbers back 
and forth or what not.  I'm thinking we might have been passing a secret 
token back and forth at one point.  UDP for speed can't easily be beat, 
though.  When you're dealing with millions of messages, and thus 
queries, from each client's mail server each day you need to handle 
massive scaling.
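
One way the sequence-number and secret-token ideas could be combined,
sketched without any networking.  This is an assumption for illustration,
not what the draft or the SIQ plugin does: each datagram carries a 64-bit
sequence number and an HMAC-SHA256 tag computed with a per-client shared
secret.  pack_report and unpack_report are hypothetical names.

```python
import hashlib
import hmac
import struct

TAG_LEN = 32  # SHA-256 digest size in bytes

def pack_report(secret, seq, payload):
    """Prefix payload with a 64-bit sequence number and append an HMAC tag."""
    body = struct.pack("!Q", seq) + payload
    return body + hmac.new(secret, body, hashlib.sha256).digest()

def unpack_report(secret, last_seq, datagram):
    """Return (seq, payload) if the tag is valid and seq is new, else None."""
    if len(datagram) < 8 + TAG_LEN:
        return None
    body, tag = datagram[:-TAG_LEN], datagram[-TAG_LEN:]
    expected = hmac.new(secret, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # forged or corrupted
    (seq,) = struct.unpack("!Q", body[:8])
    if seq <= last_seq:
        return None  # replayed or stale
    return seq, body[8:]

# Example: the receiver last saw sequence number 4 from this client.
datagram = pack_report(b"shared-secret", 5, b"192.0.2.1 spam")
result = unpack_report(b"shared-secret", 4, datagram)
```

A forged source IP alone is then not enough to inject a report, since the
sender also needs the client's secret.  The cost is per-client state (the
secret and the last sequence number) on the collecting side.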

Daryl
