Carter Bullard | 1 Nov 15:18

Re: new clients rc.62 on the server - description of rastream()

Hey Terry,
So I'm on this.  I should have a fix today.
Carter

On Oct 31, 2007, at 2:47 PM, Terry Burton wrote:

> On Oct 31, 2007 2:25 PM, Terry Burton <tez <at> terryburton.co.uk> wrote:
>> On 10/31/07, Carter Bullard <carter <at> qosient.com> wrote:
>>> Sorry for the inconvenience!!  I haven't seen this, but I'll try
>>> to reproduce.  A few questions.
>>> Platform (intel/ppc/sparc), available memory, estimated record
>>> load and how many probes?
> <...snip...>
>> The platform is i686 GNU/Linux using pretty standard Debian Etch.
>> ~40,000 records/min aggregated from three probes (2 netflow + 1 SPAN)
>> using radium. Host has 2GB RAM + 1.5GB swap. Attached is last night's
>> memory plot for the host. Guess what time I switched on rastream ;-)
>>
>> You may want to wait for the result of my testing before putting too
>> much effort into this yourself.
>
> Unfortunately I didn't find the time today to have a deep look into
> the problem, however I did get a chance to run the rastream process
> through valgrind for about 15 mins (output below).
>
> Tomorrow is another day... :-)
>
>
> Warm regards,

Carter Bullard | 1 Nov 18:39

Re: new clients rc.62 on the server - description of rastream()

Hey Terry,
OK, so looking at your graph, the valgrind output, and all the
information so far, the system is not hurting for memory.  I'm working
on the potential leak and may have found some things to clean up, but
I don't think that it's the cause of your issues.  It may be that we
are running too many concurrent processes, and the first complaint by
fork() (EAGAIN) just maps to an error message that says there is not
enough memory.  I'm going to change the script scheduling and patch up
the memory issues, and we'll try again.
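
To illustrate the fork() point: on Linux, fork() fails with EAGAIN when a
process or resource limit is hit and with ENOMEM when memory is genuinely
short, so folding both into one "not enough memory" message hides the real
cause. The Python sketch below is not rastream's code; `describe_fork_failure`
and `spawn` are hypothetical names used only for illustration.

```python
import errno
import os
from typing import Sequence


def describe_fork_failure(err: OSError) -> str:
    """Turn a failed fork() into a message that names the real cause."""
    if err.errno == errno.EAGAIN:
        # Usually a per-user process limit or a transient resource
        # shortage, not a lack of RAM.
        return "fork failed (EAGAIN): too many processes or resource limit hit"
    if err.errno == errno.ENOMEM:
        return "fork failed (ENOMEM): not enough memory"
    return f"fork failed: {os.strerror(err.errno)}"


def spawn(argv: Sequence[str]) -> int:
    """Fork and exec argv, returning the child's pid (hypothetical helper)."""
    try:
        pid = os.fork()
    except OSError as err:
        raise RuntimeError(describe_fork_failure(err)) from err
    if pid == 0:
        try:
            os.execvp(argv[0], list(argv))
        finally:
            os._exit(127)  # only reached if exec itself failed
    return pid
```

With a distinction like this, a burst of concurrently spawned scripts would
show up as a process-limit problem rather than as an apparent memory shortage.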

Carter


Terry Burton | 2 Nov 02:57

Re: new clients rc.62 on the server - description of rastream()

On Nov 1, 2007 5:39 PM, Carter Bullard <carter <at> qosient.com> wrote:
> Hey Terry,
> OK, so looking at your graph, the valgrind output, and all the
> information so far, the system is not hurting for memory.  I'm working
> on the potential leak and may have found some things to clean up, but
> I don't think that it's the cause of your issues.  It may be that we
> are running too many concurrent processes, and the first complaint by
> fork() (EAGAIN) just maps to an error message that says there is not
> enough memory.  I'm going to change the script scheduling and patch up
> the memory issues, and we'll try again.

Hi Carter,

After performing some more basic tests I have found some information
that may help to find the leak. I'm not sure whether this correlates
with your current thinking on the problem or not...

I run the following collectors:

/opt/argus/sbin/argus -X -d -A -i eth2 -P 561
/opt/argus/sbin/radium -X -d -C -S 1006 -P 564
/opt/argus/sbin/radium -X -d -C -S 1007 -P 565

I have another process that aggregates these:

/opt/argus/sbin/radium -X -d -S localhost:561 -S localhost:564 \
    -S localhost:565 -P 569

Carter Bullard | 2 Nov 03:32

Re: new clients rc.62 on the server - description of rastream()

Hey Terry,
We're very close to releasing argus-3.0, and it's going to be
difficult to say the code is good if we have a known memory
leak in a key component, so it's important to me to get this fixed.

So the question is, "is there a .threads file in your root directory?"
If so, try removing it and doing the "./configure;make clean;make"
again, to see if that makes a difference.

There is some clutter that valgrind() will report on that is
not critical, such as the port names hash table memory, or
a few strdup'd strings that are left behind.  I'm not worried
about these.  But real memory leaks that keep you from
running these programs for a year at a time are very
important to fix, so thanks for helping me out on this.

Carter


Carter Bullard | 2 Nov 04:24

Re: new clients rc.62 on the server - description of rastream()

Hey Terry,
Ok, one thing that I've discovered in my tests with fprobe()
as a netflow record source is that the hold time for rastream()
may need to be very large, possibly on the order of 2-5 minutes
rather than 10s.  This is because of netflow's very poor cache
management strategies.

If a record comes in that is outside the range of the "-B secs"
option, rastream() will toss it.  To test, compile the clients with
debug support ("touch .devel .debug; ./configure; make clean; make"),
run rastream() with -D2, and see if it complains about the range of
the input records.  I did find a leak where some of these out-of-range
records were dropped without being deallocated, so that may have been
our problem.
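
A rough model of the hold window may help show why a 10s hold tosses late
netflow records while a hold of a few minutes keeps them. This Python sketch
assumes the window is measured back from the newest timestamp seen so far,
which may not match rastream()'s actual logic; `HoldWindow` and `offer` are
hypothetical names. In a C implementation, the drop branch is the place where
the record would also have to be freed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HoldWindow:
    """Keep records whose timestamp is within hold_secs of the newest one."""
    hold_secs: float
    newest: float = 0.0
    held: List[float] = field(default_factory=list)
    dropped: int = 0

    def offer(self, timestamp: float) -> bool:
        """Return True if the record is held, False if it is out of range."""
        self.newest = max(self.newest, timestamp)
        if timestamp < self.newest - self.hold_secs:
            # Out of range: count the drop. A C version must release the
            # record's memory here, or every late record leaks.
            self.dropped += 1
            return False
        self.held.append(timestamp)
        return True


short = HoldWindow(hold_secs=10)
short.offer(1000.0)   # current record
short.offer(880.0)    # arrives two minutes late: dropped

longer = HoldWindow(hold_secs=300)
longer.offer(1000.0)
longer.offer(880.0)   # within the five-minute hold: kept
```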

I could have rastream() adjust its range timer to accommodate
records that come in way out of range, but I'm not comfortable
with these types of dynamic behaviors.  After some time you find
that rastream() has stopped outputting records and is getting
huge, because the hold time has increased to some ridiculous
value, like 1.5 years (not good).

I'll have new code up in the morning, and we'll see if that
doesn't help.

Carter


Terry Burton | 2 Nov 11:47

Re: new clients rc.62 on the server - description of rastream()

On Nov 2, 2007 2:32 AM, Carter Bullard <carter <at> qosient.com> wrote:
<...snip...>
> So the question is, "is there a .threads file in your root directory".
> If so, try removing it and doing the "./configure;make clean;make"
> again, to see if that makes a difference.

Hi Carter,

For completeness...

I have removed the .threads file from the root of the build
directory, made a clean build and invoked rastream as follows:

/opt/argus/bin/rastream -X -S localhost:569 -M time 5m -B 10s \
    -f /bin/true -w /srv/argus/archive/%Y-%m-%d/\$srcid-%H:%M:%S.arg

No observable difference: it is still leaking, and valgrind gives
essentially the same results in both cases, with and without
CFLAGS="-g -O -fno-inline".

I will now investigate using the suggestions from your more recent email.

Thanks,

Tez

Carter Bullard | 2 Nov 15:33

Re: new clients rc.62 on the server - description of rastream()

Hey Terry,
Well, the tool of choice for this situation is rasplit(), and
to have an independent cron job process the files an
hour or so after the fact.  rasplit() is rastream() but without
the buffering or task spawning.
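
As a sketch of the "independent cron job" half of this arrangement, the
Python below picks out archive files that have not been written to for at
least an hour, so a scheduled job can process them after the fact. It is
only an illustration: `files_ready` is a hypothetical helper, the `.arg`
suffix is borrowed from the -w pattern used earlier in this thread, and
using file modification time as the "finished" signal is an assumption.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def files_ready(archive: Path, min_age_secs: float = 3600.0,
                now: Optional[float] = None) -> List[Path]:
    """Return .arg files under archive untouched for min_age_secs or more."""
    now = time.time() if now is None else now
    ready = []
    for path in sorted(archive.rglob("*.arg")):
        # A file that is still being written will have a recent mtime.
        if path.is_file() and now - path.stat().st_mtime >= min_age_secs:
            ready.append(path)
    return ready
```

A cron entry could run a script built around this once an hour and hand each
returned file to whatever post-processing is needed.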

rastream() is really trying to bring this stuff into a near
realtime time frame, and netflow data isn't really appropriate
for this as it is not well scheduled.

I have found the bug in rastream(), and will have a new clients
release, rc.63, up in about an hour or so.

Give rasplit() a try, you shouldn't have any memory leak
problems with it, as it doesn't have any memory
requirements.

Carter

On Nov 2, 2007, at 8:48 AM, Terry Burton wrote:

> On Nov 2, 2007 3:24 AM, Carter Bullard <carter <at> qosient.com> wrote:
>> Ok, one thing that I've discovered in my tests with fprobe()
>> as a netflow record source, is that the hold time for rastream
>> may need to be very large.  Possibly in the order of 2-5 minutes,
>> rather than 10s.   This is because of netflow's very poor cache
>> management strategies.
>
> Hi Carter,
>

Carter Bullard | 2 Nov 16:42

argus-clients rc.63 - possible final code release

Gentle people,
A new clients distribution is available for testing.  This release
fixes a memory leak in rastream(), and faults in both radium()
and ratop().  I now think that all of the core routines in
argus-clients are finalized and ready to go, so this may be the
last code change prior to the argus-3.0 release.

    ftp://qosient.com/dev/argus-3.0/argus-clients-3.0.0.rc.63.tar.gz

Please give this set of code a run over, and if you find anything
that is in need of attention, please send email to the list, or
directly to me!!!

Hope all is most excellent, and thanks for all the help!!!

Carter

Terry Burton | 2 Nov 19:03

Re: new clients rc.62 on the server - description of rastream()

On Nov 2, 2007 2:33 PM, Carter Bullard <carter <at> qosient.com> wrote:
> Well, the tool of choice for this situation is rasplit(), and
> to have an independent cron job process the files an
> hour or so after the fact.  rasplit() is rastream() but without
> the buffering or task spawning.
<...splice...>
> Give rasplit() a try, you shouldn't have any memory leak
> problems with it, as it doesn't have any memory
> requirements.

Hi Carter,

Yes, I am now successfully using rasplit + cron, which works exactly
as I need. rasplit does not support running as a daemon via the "-d"
command line argument - would it be trivial to add this before the
final release? It certainly helps when you use the right tool for the
job :-)

> rastream() is really trying to bring this stuff into a near
> realtime time frame, and netflow data isn't really appropriate
> for this as it is not well scheduled.
>
> I have found the bug in rastream(), and will have a new clients
> release, rc.63, up in about an hour or so.

I've tested the latest rastream and it does appear to be much better,
however it may still be leaking slowly. I will leave this running
overnight to see whether it is just the initial ramp up of memory or
whether the leak is genuine.


Carter Bullard | 2 Nov 21:00

Re: new clients rc.62 on the server - description of rastream()

Yes I can add daemon support for rasplit().  That is a great idea.
It also can have "reliable connection" support, where it will retry
the connection to the data source if it gets dropped, just like
radium.   I'll document that as well.

Have a great weekend, and thanks for all the help!!!!

Carter


