CS Lee | 1 Jan 13:10 2009
Picon

Managing argus data

Hi hoffman,

Do you compress them when archiving the data, that will save a lot of disk space. On the other hand, it really depends on "what do you want for your site" and your strategy, You can easily reduce the data by stripping the dsr(using rastrip), or if you think that you don't need application bytes(don't use argus with -U), and if you don't need mac address reporting(don't use -m), those can greatly reduce the size of argus data set. Like I said, it really depends on your flow modeling and strategy.

Cheers ;]

--
Best Regards,

CS Lee<geek00L[at]gmail.com>

http://geek00l.blogspot.com
http://defcraft.net

Martijn van Oosterhout | 6 Jan 16:09 2009
Picon

Problem with argus under load not reopening output file.

Hoi,

I'm having difficulty with argus on a high-load machine. The symptom
is that argus archiving stops working. Specifically, normally argus
reopens its output file argus.out once a second. However, in my case
it's stops doing that for some reason. So when the archiving script
moves the file away, a new file is not created again, although argus
keep writing to the file-to-be-archived.

At the same time, argus starts to consume more memory, until at some
time it runs out of memory to allocate and crashes. A watchdog
restarts argus again and everything works again.

I've confirmed with strace that argus normally reopens the file and
after a while stops. After it stops reopening the file, everything
else seems normal, the data being written is correct. To me it feels
like something causes a flag to be set within argus causing it to stop
reopening the output file, though I can't quite see in the source why.

This is argus 3.0.0, though the same behaviour was seen with 3.0.0.rc.36.

Has anyone seen this behavior before?

Thanks in advance,
--

-- 
Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/

Carter Bullard | 6 Jan 20:23 2009

Re: Problem with argus under load not reopening output file.

Hey Martijn,
Well that's not right at all.  There is a timestamp that argus uses to  
realize
when it should perform the fstat(), and I suspect that either that  
timestamp
is getting corrupted, or argus's concept of current time is not right.

I'll take a look at this tonight, so that I can type more  
intelligently about
the possible causes, and we can try to figure out a plan.

Sorry for the inconvenience!!

Carter

On Jan 6, 2009, at 10:09 AM, Martijn van Oosterhout wrote:

> Hoi,
>
> I'm having difficulty with argus on a high-load machine. The symptom
> is that argus archiving stops working. Specifically, normally argus
> reopens its output file argus.out once a second. However, in my case
> it's stops doing that for some reason. So when the archiving script
> moves the file away, a new file is not created again, although argus
> keep writing to the file-to-be-archived.
>
> At the same time, argus starts to consume more memory, until at some
> time it runs out of memory to allocate and crashes. A watchdog
> restarts argus again and everything works again.
>
> I've confirmed with strace that argus normally reopens the file and
> after a while stops. After it stops reopening the file, everything
> else seems normal, the data being written is correct. To me it feels
> like something causes a flag to be set within argus causing it to stop
> reopening the output file, though I can't quite see in the source why.
>
> This is argus 3.0.0, though the same behaviour was seen with  
> 3.0.0.rc.36.
>
> Has anyone seen this behavior before?
>
> Thanks in advance,
> -- 
> Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/
>

Martijn van Oosterhout | 6 Jan 22:58 2009
Picon

Re: Problem with argus under load not reopening output file.

Thanks. The symptom I see is that "strace -e file" on the process
stops showing anything. It's still write()ing but nothing else.

I was looking at the debug levels but it seems that any useful debug
level is going to produce far too much output. However, I note you can
increase the debug level on the fly with SIGUSR1. I'll try that next
time I see it's gotten into that state...

Any tips as to what to look for would be helpful.

Have a nice day,

On Tue, Jan 6, 2009 at 8:23 PM, Carter Bullard <carter <at> qosient.com> wrote:
> Hey Martijn,
> Well that's not right at all.  There is a timestamp that argus uses to
> realize
> when it should perform the fstat(), and I suspect that either that timestamp
> is getting corrupted, or argus's concept of current time is not right.
>
> I'll take a look at this tonight, so that I can type more intelligently
> about
> the possible causes, and we can try to figure out a plan.
>
> Sorry for the inconvenience!!
>
> Carter
>
> On Jan 6, 2009, at 10:09 AM, Martijn van Oosterhout wrote:
>
>> Hoi,
>>
>> I'm having difficulty with argus on a high-load machine. The symptom
>> is that argus archiving stops working. Specifically, normally argus
>> reopens its output file argus.out once a second. However, in my case
>> it's stops doing that for some reason. So when the archiving script
>> moves the file away, a new file is not created again, although argus
>> keep writing to the file-to-be-archived.
>>
>> At the same time, argus starts to consume more memory, until at some
>> time it runs out of memory to allocate and crashes. A watchdog
>> restarts argus again and everything works again.
>>
>> I've confirmed with strace that argus normally reopens the file and
>> after a while stops. After it stops reopening the file, everything
>> else seems normal, the data being written is correct. To me it feels
>> like something causes a flag to be set within argus causing it to stop
>> reopening the output file, though I can't quite see in the source why.
>>
>> This is argus 3.0.0, though the same behaviour was seen with 3.0.0.rc.36.
>>
>> Has anyone seen this behavior before?
>>
>> Thanks in advance,
>> --
>> Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/
>>
>
>

--

-- 
Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/

Martijn van Oosterhout | 8 Jan 11:41 2009
Picon

Re: Problem with argus under load not reopening output file.

I have some more info after managing to attach gdb to argus while it's
in this state (see below). And have at least an explanation for why
it's not reopening the file: the test

asock->lastwrite.tv_sec < output->ArgusModel->ArgusGlobalTime.tv_sec

Never becomes true because ArgusClientData->sock->lastwrite is in the
future! About 42 minutes in the snapshot below.

Which as far as I can tell is actually impossible. So it may not be a
problem with argus directly, but something to do with timekeeping on
the machine?

Have a nice day,

(gdb) p *ArgusOutputTask
$16 = {status = 0, ArgusSrc = 0x4e6ef008, ArgusModel = 0x15bdf1b0,
ArgusWfileList = 0x0, ArgusOutputList = 0x15bdfc80,
  ArgusClients = 0x15bdf6c0, ArgusInitMar = 0x0, ArgusTotalRecords =
36439879, ArgusLastRecords = 36410503, ArgusWriteStdOut = 0,
  ArgusOutputSequence = 36439879, ArgusPortNum = 0, ArgusLfd = {-1,
-1, -1, -1, -1}, ArgusListens = 0, nflag = 0,
  ArgusBindAddr = 0x0, ArgusGlobalTime = {tv_sec = 1231410233, tv_usec
= 322693}, ArgusStartTime = {tv_sec = 1231330382,
    tv_usec = 632431}, ArgusReportTime = {tv_sec = 1231410242, tv_usec
= 0}, ArgusLastMarUpdateTime = {tv_sec = 1231410182,
    tv_usec = 122701}, ArgusMarReportInterval = {tv_sec = 60, tv_usec = 0}}

(gdb) p *((struct ArgusClientData
*)(ArgusOutputTask->ArgusClients->start))->sock
$17 = {ArgusSocketList = 0x15bdfc28, fd = 5, status = 16, cnt = 0,
expectedSize = 0, errornum = 0, ArgusLastRecord = 0,
  ArgusReadState = 0, lastwrite = {tv_sec = 1231412781, tv_usec =
651709}, rec = 0x0, length = 0, writen = 0, sock = {
    sa_family = 0, sa_data = '\0' <repeats 13 times>}, filename =
0x15bdfd80 "/var/log/argus/bridge0/argus.out", obj = 0x0,
  ptr = 0x0,
  buf = "\023 \000"...}

(gdb) p *ArgusOutputTask->ArgusModel
$18 = {state = 0, ArgusSrc = 0x4e6ef008, ArgusStatusQueue =
0x15bdf530, ArgusTimeOutQueues = 0x15bdf578, ArgusTimeOutQueue = {0x0,
    0x0, 0x0, 0x0, 0x0, 0x15be03a8, 0x0, 0x0, 0x0, 0x0, 0x15be03f0,
0x0 <repeats 19 times>, 0x15be0360, 0x0 <repeats 29 times>},
  ArgusOutputList = 0x15bdf478, ArgusHashTable = 0x15bdf4a8,
ArgusThisFlow = 0x15bdf5c0, hstruct = 0x15bdf4c0,
  ArgusTransactionNum = 27638973, ArgusThisInterface = 0,
ArgusThisEncaps = 2, ArgusThisNetworkFlowType = 2048,
  ArgusThisLLC = 0x15b3fbcc, ArgusThisAppFlowType = 0,
ArgusThisMplsLabelIndex = 0, ArgusThisMplsLabel = 0,
  ArgusThisPacket8021QEncaps = 0, ArgusFlowType = 32 ' ', ArgusFlowKey
= 1 '\001', ArgusOptionIndicator = 0, ArgusInProtocol = 1,
  ArgusThisDir = 0, ArgusThisStats = 0x370c742c, ArgusThisEpHdr =
0x4e6927c2, ArgusThisIpHdr = 0x4e6927d0, ArgusThisIpv6Frag = 0x0,
  ArgusThisNetworkHdr = 0x0,
  ArgusThisUpHdr = 0x4e6927f8,
  ArgusThisSnapEnd = 0x4e692820 "\001", ArgusControlMonitor = 0,
ArgusSnapLength = 40, ArgusGenerateTime = 0,
  ArgusGeneratePacketSize = 0, ArgusThisLength = 1380, ArgusThisBytes
= 1434, ArgusTotalPacket = 1139871256, ArgusTotalFrags = 0,
  ArgusTotalIPPkts = 0, ArgusLastIPPkts = 0, ArgusTotalNonIPPkts = 0,
ArgusLastNonIPPkts = 0, ArgusTotalNewFlows = 32432394,
  ArgusLastNewFlows = 32416044, ArgusTotalClosedFlows = 26850027,
ArgusLastClosedFlows = 0, ArgusTotalIPFlows = 32390005,
  ArgusLastIPFlows = 0, ArgusTotalNonIPFlows = 42389,
ArgusLastNonIPFlows = 0, ArgusTotalCacheHits = 1107438862,
  ArgusTotalRecords = 0, ArgusTotalSends = 36439881,
ArgusTotalBadSends = 2686099, ArgusLastRecords = 0,
  ArgusTotalUpdates = 1137764966, ArgusLastUpdates = 0,
ArgusGlobalTime = {tv_sec = 1231410233, tv_usec = 328823},
  ArgusStartTime = {tv_sec = 0, tv_usec = 0}, ArgusNowTime = {tv_sec =
0, tv_usec = 0}, ArgusUpdateInterval = {tv_sec = 0,
    tv_usec = 200000}, ArgusUpdateTimer = {tv_sec = 1231410233,
tv_usec = 522602}, ArgusLastPacketTimer = {tv_sec = 0,
    tv_usec = 0}, ArgusAdjustedTimer = {tv_sec = 0, tv_usec = 0},
ArgusMajorVersion = 3, ArgusMinorVersion = 0, ArgusSnapLen = 96,
  ArgusUserDataLen = 0, ArgusAflag = 0, ArgusTCPflag = 1, Argusmflag =
1, ArgusIPTimeout = 30, ArgusTCPTimeout = 30,
  ArgusICMPTimeout = 5, ArgusIGMPTimeout = 30, ArgusFRAGTimeout = 5,
ArgusIBTimeout = 0, ArgusReportAllTime = 1,
  ArgusResponseStatus = 0, ArgusFarReportInterval = {tv_sec = 5,
tv_usec = 0}, ArgusQueueInterval = {tv_sec = 0, tv_usec = 50000},
  ArgusListenInterval = {tv_sec = 0, tv_usec = 250000}, ArgusID = 18,
ArgusIDType = 32, ArgusSeqNum = 36439882, ArgusLocalNet = 0,
  ArgusNetMask = 0, ArgusLink = 0}
(gdb) cont

--

-- 
Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/

Carter Bullard | 10 Jan 17:01 2009

Re: Problem with argus under load not reopening output file.

Hey Martijn,
Sorry for the delayed response.  I've gone over the code a bit, and I  
don't
see how this jump can occur.   But that is the nature of bugs,  
sometimes.

There are a few things that I need to proceed.  What platform, OS, 64- 
bit?
What are we connecting to.  1Gbps?  10Gbps?   And, what is the processor
load for argus?  And the interrupt rate, any since as to how many  
packets
per second?

When this occurs, are we getting any packets at all? (your strace  
should have
packet reading, since we do a select() )

My suspicion is that if all is as it should be, but we all of a sudden  
get a
leap in our  global time, argus maybe so loaded that it is getting  
behind.
But, that is a very preliminary guess.

Carter

On Jan 8, 2009, at 5:41 AM, Martijn van Oosterhout wrote:

> I have some more info after managing to attach gdb to argus while it's
> in this state (see below). And have at least an explanation for why
> it's not reopening the file: the test
>
> asock->lastwrite.tv_sec < output->ArgusModel->ArgusGlobalTime.tv_sec
>
> Never becomes true because ArgusClientData->sock->lastwrite is in the
> future! About 42 minutes in the snapshot below.
>
> Which as far as I can tell is actually impossible. So it may not be a
> problem with argus directly, but something to do with timekeeping on
> the machine?
>
> Have a nice day,
>
> (gdb) p *ArgusOutputTask
> $16 = {status = 0, ArgusSrc = 0x4e6ef008, ArgusModel = 0x15bdf1b0,
> ArgusWfileList = 0x0, ArgusOutputList = 0x15bdfc80,
>  ArgusClients = 0x15bdf6c0, ArgusInitMar = 0x0, ArgusTotalRecords =
> 36439879, ArgusLastRecords = 36410503, ArgusWriteStdOut = 0,
>  ArgusOutputSequence = 36439879, ArgusPortNum = 0, ArgusLfd = {-1,
> -1, -1, -1, -1}, ArgusListens = 0, nflag = 0,
>  ArgusBindAddr = 0x0, ArgusGlobalTime = {tv_sec = 1231410233, tv_usec
> = 322693}, ArgusStartTime = {tv_sec = 1231330382,
>    tv_usec = 632431}, ArgusReportTime = {tv_sec = 1231410242, tv_usec
> = 0}, ArgusLastMarUpdateTime = {tv_sec = 1231410182,
>    tv_usec = 122701}, ArgusMarReportInterval = {tv_sec = 60, tv_usec  
> = 0}}
>
> (gdb) p *((struct ArgusClientData
> *)(ArgusOutputTask->ArgusClients->start))->sock
> $17 = {ArgusSocketList = 0x15bdfc28, fd = 5, status = 16, cnt = 0,
> expectedSize = 0, errornum = 0, ArgusLastRecord = 0,
>  ArgusReadState = 0, lastwrite = {tv_sec = 1231412781, tv_usec =
> 651709}, rec = 0x0, length = 0, writen = 0, sock = {
>    sa_family = 0, sa_data = '\0' <repeats 13 times>}, filename =
> 0x15bdfd80 "/var/log/argus/bridge0/argus.out", obj = 0x0,
>  ptr = 0x0,
>  buf = "\023 \000"...}
>
> (gdb) p *ArgusOutputTask->ArgusModel
> $18 = {state = 0, ArgusSrc = 0x4e6ef008, ArgusStatusQueue =
> 0x15bdf530, ArgusTimeOutQueues = 0x15bdf578, ArgusTimeOutQueue = {0x0,
>    0x0, 0x0, 0x0, 0x0, 0x15be03a8, 0x0, 0x0, 0x0, 0x0, 0x15be03f0,
> 0x0 <repeats 19 times>, 0x15be0360, 0x0 <repeats 29 times>},
>  ArgusOutputList = 0x15bdf478, ArgusHashTable = 0x15bdf4a8,
> ArgusThisFlow = 0x15bdf5c0, hstruct = 0x15bdf4c0,
>  ArgusTransactionNum = 27638973, ArgusThisInterface = 0,
> ArgusThisEncaps = 2, ArgusThisNetworkFlowType = 2048,
>  ArgusThisLLC = 0x15b3fbcc, ArgusThisAppFlowType = 0,
> ArgusThisMplsLabelIndex = 0, ArgusThisMplsLabel = 0,
>  ArgusThisPacket8021QEncaps = 0, ArgusFlowType = 32 ' ', ArgusFlowKey
> = 1 '\001', ArgusOptionIndicator = 0, ArgusInProtocol = 1,
>  ArgusThisDir = 0, ArgusThisStats = 0x370c742c, ArgusThisEpHdr =
> 0x4e6927c2, ArgusThisIpHdr = 0x4e6927d0, ArgusThisIpv6Frag = 0x0,
>  ArgusThisNetworkHdr = 0x0,
>  ArgusThisUpHdr = 0x4e6927f8,
>  ArgusThisSnapEnd = 0x4e692820 "\001", ArgusControlMonitor = 0,
> ArgusSnapLength = 40, ArgusGenerateTime = 0,
>  ArgusGeneratePacketSize = 0, ArgusThisLength = 1380, ArgusThisBytes
> = 1434, ArgusTotalPacket = 1139871256, ArgusTotalFrags = 0,
>  ArgusTotalIPPkts = 0, ArgusLastIPPkts = 0, ArgusTotalNonIPPkts = 0,
> ArgusLastNonIPPkts = 0, ArgusTotalNewFlows = 32432394,
>  ArgusLastNewFlows = 32416044, ArgusTotalClosedFlows = 26850027,
> ArgusLastClosedFlows = 0, ArgusTotalIPFlows = 32390005,
>  ArgusLastIPFlows = 0, ArgusTotalNonIPFlows = 42389,
> ArgusLastNonIPFlows = 0, ArgusTotalCacheHits = 1107438862,
>  ArgusTotalRecords = 0, ArgusTotalSends = 36439881,
> ArgusTotalBadSends = 2686099, ArgusLastRecords = 0,
>  ArgusTotalUpdates = 1137764966, ArgusLastUpdates = 0,
> ArgusGlobalTime = {tv_sec = 1231410233, tv_usec = 328823},
>  ArgusStartTime = {tv_sec = 0, tv_usec = 0}, ArgusNowTime = {tv_sec =
> 0, tv_usec = 0}, ArgusUpdateInterval = {tv_sec = 0,
>    tv_usec = 200000}, ArgusUpdateTimer = {tv_sec = 1231410233,
> tv_usec = 522602}, ArgusLastPacketTimer = {tv_sec = 0,
>    tv_usec = 0}, ArgusAdjustedTimer = {tv_sec = 0, tv_usec = 0},
> ArgusMajorVersion = 3, ArgusMinorVersion = 0, ArgusSnapLen = 96,
>  ArgusUserDataLen = 0, ArgusAflag = 0, ArgusTCPflag = 1, Argusmflag =
> 1, ArgusIPTimeout = 30, ArgusTCPTimeout = 30,
>  ArgusICMPTimeout = 5, ArgusIGMPTimeout = 30, ArgusFRAGTimeout = 5,
> ArgusIBTimeout = 0, ArgusReportAllTime = 1,
>  ArgusResponseStatus = 0, ArgusFarReportInterval = {tv_sec = 5,
> tv_usec = 0}, ArgusQueueInterval = {tv_sec = 0, tv_usec = 50000},
>  ArgusListenInterval = {tv_sec = 0, tv_usec = 250000}, ArgusID = 18,
> ArgusIDType = 32, ArgusSeqNum = 36439882, ArgusLocalNet = 0,
>  ArgusNetMask = 0, ArgusLink = 0}
> (gdb) cont
>
> -- 
> Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/
>

Carter Bullard
CEO/President
QoSient, LLC
150 E 57th Street Suite 12D
New York, New York  10022

+1 212 588-9133 Phone
+1 212 588-9134 Fax

Carter Bullard | 11 Jan 17:42 2009

Re: Problem with argus under load not reopening output file.

Hey Martijn,
If you don't mind, I'd like to keep the thread in the argus mailing  
list.

Sounds like an ntp or a rdate() issue, where the machine's time is
getting modified to a bogus value.  This is of course an old way
of attacking a machine ;o)

Argus can attempt to catch jumps in time by testing the timestamp
in the packets against the systems global time, or a delta of the
two.  But, how to respond to the jump is a problem, because
we don't know which system (packet capture/host os) is correct.

Carter

On Jan 11, 2009, at 8:17 AM, Martijn van Oosterhout wrote:

> On Sat, Jan 10, 2009 at 5:01 PM, Carter Bullard <carter <at> qosient.com>  
> wrote:
>> Hey Martijn,
>> Sorry for the delayed response.  I've gone over the code a bit, and  
>> I don't
>> see how this jump can occur.   But that is the nature of bugs,  
>> sometimes.
>>
>> There are a few things that I need to proceed.  What platform, OS,  
>> 64-bit?
>> What are we connecting to.  1Gbps?  10Gbps?   And, what is the  
>> processor
>> load for argus?  And the interrupt rate, any since as to how many  
>> packets
>> per second?
>
> The load is about 30%, IIRC. The line is 200-250Mbs, so high but not
> exceptional. In the meantime there has arisen another theory: namely
> that under certain circumstances the Linux kernel version we're
> running (2.6.20) can for a very short time return a time 1 hour 13
> minutes in the future. If argus happens to pick that one as
> GlobalTime, it will cause the symptoms seen.
>
> As these things go, as soon as you try to verify this is the problem,
> the problem stops happening... I hope monday I'll have evidence one
> way or the other.
>
>> When this occurs, are we getting any packets at all? (your strace  
>> should
>> have
>> packet reading, since we do a select() )
>
> It's using the linux PCAP ringbuffer, so there are no kernel calls for
> fetching the packets, only writes.
>
>> My suspicion is that if all is as it should be, but we all of a  
>> sudden get a
>> leap in our  global time, argus maybe so loaded that it is getting  
>> behind.
>> But, that is a very preliminary guess.
>
> That sounds like the most plausible idea so far. It's possible to work
> around this is argus, the question is should we. In ArgusWrite* if you
> change the test with lastwritten to be not-equal (!=) rather than
> less-than (<) the issue shouldn't arise either.
>
> I hope to have more info soon.
>
> Have a nice day,
> -- 
> Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/
>

Carter Bullard
CEO/President
QoSient, LLC
150 E 57th Street Suite 12D
New York, New York  10022

+1 212 588-9133 Phone
+1 212 588-9134 Fax

Martijn van Oosterhout | 11 Jan 18:56 2009
Picon

Re: Problem with argus under load not reopening output file.

Sorry, hit the wrong button. No, its actually a kernel bug:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d8bb6f4c1670c8324e4135c61ef07486f7f17379

Causes the time as read by gettimeday() to jump an hour in the future
and back for a packet or two.

At least, I hope it's this, otherwise I don't know.

Have a nice day,

On Sun, Jan 11, 2009 at 5:42 PM, Carter Bullard <carter <at> qosient.com> wrote:
> Hey Martijn,
> If you don't mind, I'd like to keep the thread in the argus mailing list.
>
> Sounds like an ntp or a rdate() issue, where the machine's time is
> getting modified to a bogus value.  This is of course an old way
> of attacking a machine ;o)
>
> Argus can attempt to catch jumps in time by testing the timestamp
> in the packets against the systems global time, or a delta of the
> two.  But, how to respond to the jump is a problem, because
> we don't know which system (packet capture/host os) is correct.
>
> Carter
>
> On Jan 11, 2009, at 8:17 AM, Martijn van Oosterhout wrote:
>
>> On Sat, Jan 10, 2009 at 5:01 PM, Carter Bullard <carter <at> qosient.com>
>> wrote:
>>>
>>> Hey Martijn,
>>> Sorry for the delayed response.  I've gone over the code a bit, and I
>>> don't
>>> see how this jump can occur.   But that is the nature of bugs, sometimes.
>>>
>>> There are a few things that I need to proceed.  What platform, OS,
>>> 64-bit?
>>> What are we connecting to.  1Gbps?  10Gbps?   And, what is the processor
>>> load for argus?  And the interrupt rate, any since as to how many packets
>>> per second?
>>
>> The load is about 30%, IIRC. The line is 200-250Mbs, so high but not
>> exceptional. In the meantime there has arisen another theory: namely
>> that under certain circumstances the Linux kernel version we're
>> running (2.6.20) can for a very short time return a time 1 hour 13
>> minutes in the future. If argus happens to pick that one as
>> GlobalTime, it will cause the symptoms seen.
>>
>> As these things go, as soon as you try to verify this is the problem,
>> the problem stops happening... I hope monday I'll have evidence one
>> way or the other.
>>
>>> When this occurs, are we getting any packets at all? (your strace should
>>> have
>>> packet reading, since we do a select() )
>>
>> It's using the linux PCAP ringbuffer, so there are no kernel calls for
>> fetching the packets, only writes.
>>
>>> My suspicion is that if all is as it should be, but we all of a sudden
>>> get a
>>> leap in our  global time, argus maybe so loaded that it is getting
>>> behind.
>>> But, that is a very preliminary guess.
>>
>> That sounds like the most plausible idea so far. It's possible to work
>> around this is argus, the question is should we. In ArgusWrite* if you
>> change the test with lastwritten to be not-equal (!=) rather than
>> less-than (<) the issue shouldn't arise either.
>>
>> I hope to have more info soon.
>>
>> Have a nice day,
>> --
>> Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/
>>
>
> Carter Bullard
> CEO/President
> QoSient, LLC
> 150 E 57th Street Suite 12D
> New York, New York  10022
>
> +1 212 588-9133 Phone
> +1 212 588-9134 Fax
>
>
>
>

--

-- 
Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/

Carter Bullard | 11 Jan 19:17 2009

Re: Problem with argus under load not reopening output file.

Hmmm,
Well, if the mailing list can decide how to respond to the bug,  then  
I'll put a fix in this week.
Should we for packet streams that are "PCAP_OPEN_LIVE" interfaces:
    1) Adjust the time for packets that are way out of scope (>  5-10  
seconds away from real time)
        to current time.

    2) drop packets that are way out of scope

In both cases, we should not adjust our notion of global time?

Carter

On Jan 11, 2009, at 12:56 PM, Martijn van Oosterhout wrote:

> Sorry, hit the wrong button. No, its actually a kernel bug:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d8bb6f4c1670c8324e4135c61ef07486f7f17379
>
> Causes the time as read by gettimeday() to jump an hour in the future
> and back for a packet or two.
>
> At least, I hope it's this, otherwise I don't know.
>
> Have a nice day,
>
> On Sun, Jan 11, 2009 at 5:42 PM, Carter Bullard <carter <at> qosient.com>  
> wrote:
>> Hey Martijn,
>> If you don't mind, I'd like to keep the thread in the argus mailing  
>> list.
>>
>> Sounds like an ntp or a rdate() issue, where the machine's time is
>> getting modified to a bogus value.  This is of course an old way
>> of attacking a machine ;o)
>>
>> Argus can attempt to catch jumps in time by testing the timestamp
>> in the packets against the systems global time, or a delta of the
>> two.  But, how to respond to the jump is a problem, because
>> we don't know which system (packet capture/host os) is correct.
>>
>> Carter
>>
>> On Jan 11, 2009, at 8:17 AM, Martijn van Oosterhout wrote:
>>
>>> On Sat, Jan 10, 2009 at 5:01 PM, Carter Bullard <carter <at> qosient.com>
>>> wrote:
>>>>
>>>> Hey Martijn,
>>>> Sorry for the delayed response.  I've gone over the code a bit,  
>>>> and I
>>>> don't
>>>> see how this jump can occur.   But that is the nature of bugs,  
>>>> sometimes.
>>>>
>>>> There are a few things that I need to proceed.  What platform, OS,
>>>> 64-bit?
>>>> What are we connecting to.  1Gbps?  10Gbps?   And, what is the  
>>>> processor
>>>> load for argus?  And the interrupt rate, any since as to how many  
>>>> packets
>>>> per second?
>>>
>>> The load is about 30%, IIRC. The line is 200-250Mbs, so high but not
>>> exceptional. In the meantime there has arisen another theory: namely
>>> that under certain circumstances the Linux kernel version we're
>>> running (2.6.20) can for a very short time return a time 1 hour 13
>>> minutes in the future. If argus happens to pick that one as
>>> GlobalTime, it will cause the symptoms seen.
>>>
>>> As these things go, as soon as you try to verify this is the  
>>> problem,
>>> the problem stops happening... I hope monday I'll have evidence one
>>> way or the other.
>>>
>>>> When this occurs, are we getting any packets at all? (your strace  
>>>> should
>>>> have
>>>> packet reading, since we do a select() )
>>>
>>> It's using the linux PCAP ringbuffer, so there are no kernel calls  
>>> for
>>> fetching the packets, only writes.
>>>
>>>> My suspicion is that if all is as it should be, but we all of a  
>>>> sudden
>>>> get a
>>>> leap in our  global time, argus maybe so loaded that it is getting
>>>> behind.
>>>> But, that is a very preliminary guess.
>>>
>>> That sounds like the most plausible idea so far. It's possible to  
>>> work
>>> around this is argus, the question is should we. In ArgusWrite* if  
>>> you
>>> change the test with lastwritten to be not-equal (!=) rather than
>>> less-than (<) the issue shouldn't arise either.
>>>
>>> I hope to have more info soon.
>>>
>>> Have a nice day,
>>> --
>>> Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/
>>>
>>
>> Carter Bullard
>> CEO/President
>> QoSient, LLC
>> 150 E 57th Street Suite 12D
>> New York, New York  10022
>>
>> +1 212 588-9133 Phone
>> +1 212 588-9134 Fax
>>
>>
>>
>>
>
>
>
> -- 
> Martijn van Oosterhout <kleptog <at> gmail.com> http://svana.org/kleptog/
>

Carter Bullard
CEO/President
QoSient, LLC
150 E 57th Street Suite 12D
New York, New York  10022

+1 212 588-9133 Phone
+1 212 588-9134 Fax

Peter Van Epp | 11 Jan 22:49 2009
Picon
Picon

Re: Problem with argus under load not reopening output file.

On Sun, Jan 11, 2009 at 01:17:49PM -0500, Carter Bullard wrote:
> Hmmm,
> Well, if the mailing list can decide how to respond to the bug,  then  
> I'll put a fix in this week.
> Should we for packet streams that are "PCAP_OPEN_LIVE" interfaces:
>    1) Adjust the time for packets that are way out of scope (>  5-10  
> seconds away from real time)
>        to current time.
>
>    2) drop packets that are way out of scope
>
> In both cases, we should not adjust our notion of global time?
>
> Carter
>

	I'd be in favor of 1) (so as to not lose any packets) and not adjusting
global time at least immediately. Thats an ugly bug :-), it looks like it will
affect everything except a DAG with internal time keeping turned on as the
kernel's sense of time is being messed with, so the packet could be ahead by
an hour (if it hits during interrupt when the packet is getting tine stamped)
or argus could get hit if it happens when argus asks for time of day. I have
a horrible suspicion we need to do a reasonablness check on time of day calls
if they are ahead an hour suddenly do it again and see if it drops back and if
so ignore it, but thats likely to add a lot of overhead (and only be needed on 
multicore systems and then only sometimes). 
	I can't say that we have seen this so far, but argus prod is still
2.0.6 running on FreeBSD the 3.0 system on Linux (with Phil Wood's pcap code
rather than PF ring this time) hasn't been up much lately. Hopefully that will
change soon as our new hire is learning by building a new argus archiving box
that will run 3.0 hopefully in production soon. Although that said we have run
both PF-ring on multiprocessor systems and the new pcap code on a 200 meg 
link for at least weeks at a time without problem.
	Debugging will be interesting (as in may you live in interesting times
:-)). Setting argus packet capture on (preferably writing to a ram disk for
speed) would be one obvious choice. If you see a time jump forward and then 
back in sequential packets in the pcap then you likely have a kernel level 
time problem.
	We may need to do add a debug level 1 or 2 check of timeofday calls
and log sudden jumps forward and back again to verify it is system time and 
not something in argus thats broken.

Peter Van Epp / Operations and Technical Support 
Simon Fraser University, Burnaby, B.C. Canada


Gmane