Gergely Nagy | 1 Jan 2011 15:30

Re: MongoDB destination driver

A little update on the state of the driver: last night I reached a
state where I consider it good enough for my own purposes (I'm already
using it in production). Today I did some benchmarking (completely
unscientific, mind you) to see if and where I can improve the driver.

A standard setup, logging to a file, resulted in 24k messages/sec; we'll
use that for comparison. Logging the same data to a capped (at 1000
messages) mongodb collection netted 18k messages/sec, while logging to
an uncapped and unindexed mongodb collection came in at around 13k
messages/sec.

All tests were run on the same computer, using the same loggen
command line; the only change was the destination in the syslog-ng
config. Each test ran for 10 minutes.
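
To put those figures side by side, here is a small Python sketch. Only
the three rates and the 10-minute run length come from the numbers
above; the names are made up for illustration, and it assumes each rate
held steady for the whole run, which the benchmark did not verify.

```python
# Rates (messages/sec) as reported above; everything else is illustrative.
RATES = {
    "file": 24_000,
    "mongodb (capped)": 18_000,
    "mongodb (uncapped, unindexed)": 13_000,
}
RUN_SECONDS = 10 * 60  # each test ran for 10 minutes


def summarize(rates, baseline, seconds):
    """Return throughput relative to the baseline and the implied message total."""
    base = rates[baseline]
    return {
        name: {
            "relative": rate / base,
            # Assumes the rate was constant for the whole run.
            "total_messages": rate * seconds,
        }
        for name, rate in rates.items()
    }


summary = summarize(RATES, "file", RUN_SECONDS)
for name, row in summary.items():
    print(f"{name}: {row['relative']:.0%} of file, ~{row['total_messages']:,} messages")
```

In other words, the capped collection kept three quarters of the file
throughput, and the uncapped one a little over half.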

The numbers could probably be upped with suitable configuration and a
more appropriate test environment, but I'm not really into that stuff,
the current performance fits my needs perfectly well.

I haven't tested an SQL destination, but my gut feeling is, mongodb's
a lot faster already.

And there are obviously a lot of cases I haven't tested: query speed
while writes are flowing in, how indexing affects it all, and so on,
since those scenarios are either not part of my use case, or I don't
feel knowledgeable enough to draw the proper conclusions. I'll let
someone else do proper benchmarking, I'll stick to coding :)

Now, the next thing I explored was whether I can speed things up easily:
to that end, I had a look at callgrind's output, and concluded that
[...]

Martin Holste | 1 Jan 2011 21:18

Re: MongoDB destination driver

> We should also point out that grabbing these kinds of locks and making
> these kinds of manipulations should be done as part of careful planning
> since it can render the table inaccessible for long-ish periods through
> normal means such as queries and could require some potentially time
> intensive index rebuilding since indexing is turned off during some of
> these manipulations. (Not sure what percentage of this applies to
> MongoDB since it's a bit unique).
>

For instance, using "LOAD DATA CONCURRENT INFILE" will allow reads to
occur while doing the bulk imports in MySQL.  The manual says there is
a slight performance hit, but it is unnoticeable in my experience.  I
haven't tested to see what actual locking occurs during mongoimport.

> Perhaps it would be good if we could work together (several of us have
> been experimenting with optimum buffering, database and index setups,
> etc.) to figure out what the best practices are in terms of initial
> storage, indexing, retention, archiving, etc.
>

Absolutely.  The biggest challenge I've come across is how to properly
do archiving.  I've been using the ARCHIVE storage engine in MySQL
because the compact row format actually compresses blocks of rows, not
columnar data, giving you a 10:1 (or more) compression ratio on log
data while still maintaining all of the meta data.  The main drawback
is that the archive storage engine is poorly documented: specifically,
if MySQL crashes while an archive table is open, it will mark that
table as crashed and rebuild the entire table on startup.  It will
usually have to do this for all archive tables under normal operation,
which means that time to recover is on the order of many hours on even
[...]

Martin Holste | 1 Jan 2011 21:24

Re: MongoDB destination driver

Super cool!  At those rates, I think few will gain much from bulk
inserts, so I'd put that low on the feature priority list,
especially given the opportunity for bugs that the added complexity brings.
My main feature to add (aside from the two you mentioned already on
the roadmap) would be a way to use the keys from a patterndb database
so that the db and collection in Mongo stay the same, but the key
names change with every patterndb rule.  That's really the big payoff
with Mongo--you don't have to define a rigid schema, so you don't have
to know the column names ahead of time.  That's a big deal considering
that the patterndb can change on the fly.  Being confined to
predefined templates in the config limits the potential.  Bazsi, any
idea how to do this?

On Sat, Jan 1, 2011 at 2:18 PM, Martin Holste <mcholste <at> gmail.com> wrote:
>> We should also point out that grabbing these kinds of locks and making
>> these kinds of manipulations should be done as part of careful planning
>> since it can render the table inaccessible for long-ish periods through
>> normal means such as queries and could require some potentially time
>> intensive index rebuilding since indexing is turned off during some of
>> these manipulations. (Not sure what percentage of this applies to
>> MongoDB since it's a bit unique).
>>
>
> For instance, using "LOAD DATA CONCURRENT INFILE" will allow reads to
> occur while doing the bulk imports in MySQL.  The manual says there is
> a slight performance hit, but it is unnoticeable in my experience.  I
> haven't tested to see what actual locking occurs during mongoimport.
>
>> Perhaps it would be good if we could work together (several of us have
>> been experimenting with optimum buffering, database and index setups,
[...]

Matthew Hall | 1 Jan 2011 22:55

Re: MongoDB destination driver

On Sat, Jan 01, 2011 at 02:24:10PM -0600, Martin Holste wrote:
> My main feature to add (aside from the two you mentioned already on 
> the roadmap) would be a way to use the keys from a patterndb database 
> so that the db and collection in Mongo stay the same, but the key 
> names change with every patterndb rule. That's really the big payoff 
> with Mongo-- you don't have to define a rigid schema, so you don't 
> have to know the column names ahead of time. That's a big deal 
> considering that the patterndb can change on the fly. Being confined 
> to predefined templates in the config limits the potential.

This is why I asked in my earlier mail if it's possible to set up the 
mongo driver to log all vars in a message or a subset of vars in a 
message. I was hoping it'd be possible for the schema to change somewhat 
dynamically based on what's present in the messages.

Matthew.
______________________________________________________________________________
Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng
Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng
FAQ: http://www.campin.net/syslog-ng/faq.html

Gergely Nagy | 2 Jan 2011 00:51

Re: MongoDB destination driver

> This is why I asked in my earlier mail if it's possible to set up the
> mongo driver to log all vars in a message or a subset of vars in a
> message. I was hoping it'd be possible for the schema to change somewhat
> dynamically based on what's present in the messages.

You can set it up to log a set of vars, and it will only actually
insert the non-empty values.

Say, if you have something like this:

destination d_mongo {
  mongodb(
    keys("host", "program", "pid", "message")
    values("$HOST", "$PROGRAM", "$PID", "$MSGONLY")
  );
};

If a message does not contain a PID, then that will not be added to
the document, only the rest.

Thus, if you list the largest set of vars you might need, that'll do
just what you want: only those that actually have a value get added.

To the best of my knowledge it is not possible to log all available
variables (that would be bad too, since there are overlapping macros),
but you can set up a selected maximum set, and the driver will Do The
Right Thing, and only store those parts of it, that are set.
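
The behaviour described above can be pictured with a short Python
sketch. This is a model, not the driver's code: build_document is a
hypothetical helper, and the values are assumed to be already-expanded
macro strings, with an unset macro expanding to an empty string.

```python
def build_document(keys, values):
    """Pair each key with its expanded value, leaving out empty values."""
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    return {key: value for key, value in zip(keys, values) if value != ""}


keys = ["host", "program", "pid", "message"]

# A message that carries a PID keeps all four fields.
with_pid = build_document(keys, ["localhost", "sshd", "12674", "Accepted publickey"])

# A message without a PID (empty expansion) simply has no "pid" field.
without_pid = build_document(keys, ["localhost", "cron", "", "job started"])

print(sorted(with_pid))
print(sorted(without_pid))
```

Because documents in one collection need not share the same fields,
the two results can sit next to each other without any schema change.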

Gergely Nagy | 2 Jan 2011 14:26

Re: MongoDB destination driver

Another little update: I ported the mongodb destination driver from
using the mongodb C driver to the C++ driver, for a few reasons:

* The C driver had to be bundled with the source, which I dislike with
a passion.
* The C++ driver is much more mature, and a lot better tested as well.

At the moment, there's a small bridge between the C and the C++ code,
neatly separated into two little files. Functionality remained the
same, stability hopefully improved, and there's less code to maintain
within the syslog-ng driver.

It's available on the algernon/dest/mongodb-cpp branch in my
repository - I haven't merged it onto the main algernon/dest/mongodb
branch just yet, there's a few little things I want to iron out first.
Not to mention that I'm not really sure about introducing a (partly)
C++ module to syslog-ng (even if it's optional, and not compiled by
default).

Balint Kovacs | 3 Jan 2011 08:52

Re: MongoDB destination driver

On 01/02/2011 12:51 AM, Gergely Nagy wrote:
>> This is why I asked in my earlier mail if it's possible to set up the
>> mongo driver to log all vars in a message or a subset of vars in a
>> message. I was hoping it'd be possible for the schema to change somewhat
>> dynamically based on what's present in the messages.
> You can set it up to log a set of vars, and it will only actually
> insert the non-empty values.
>
> Say, if you have something like this:
>
> destination d_mongo {
>    mongodb(
>      keys("host", "program", "pid", "message")
>      values("$HOST", "$PROGRAM", "$PID", "$MSGONLY")
>    );
> };
>
> If a message does not contain a PID, then that will not be added to
> the document, only the rest.
>
> Thus, if you set a maximum of vars, that'll do just what you need, and
> only add those that do have a value.
>
> To the best of my knowledge it is not possible to log all available
> variables (that would be bad too, since there are overlapping macros),
> but you can set up a selected maximum set, and the driver will Do The
> Right Thing, and only store those parts of it, that are set.
Hi,

first of all, thanks for the great work.
[...]

Gergely Nagy | 3 Jan 2011 10:38

Re: MongoDB destination driver

> I agree with Matthew, that it would be really important to make this 
> driver "dynamic", as it would be a great tool combined with patterndb 
> for reporting without the need to pre-define fields and a dozen of 
> destination statements.

Aha! Apologies for being confused before: I have to admit, I never used
patterndb before, and totally forgot about it.

> It is actually not that hard to achieve (again, syslog-ng is a breeze), 
> pdbtool does quite the same when emitting all variables, the 
> nv_table_foreach() function is there to iterate over all of the 
> name-value pairs.
> 
> However the NVTable struct stores the builtin and dynamic values 
> separately and with a small copy-paste coding in nvtable.c you can grab 
> only the dynamic values.
> 
> Please find a patch attached that introduces the flags() option for the 
> mongodb driver and the auto_nvpairs flag, that inserts all dynamic 
> name-value pairs into the DB as well. I'm sure that there's a better way 
> to implement some parts of it, so please somebody review and clean up if 
> possible :)

The patch looks good on first read, but I'll have a closer look tonight,
and run a quick benchmark as well, if all goes well.

Thanks!


Gergely Nagy | 3 Jan 2011 22:02

Re: MongoDB destination driver

> The patch looks good on first read, but I'll have a closer look tonight,
> and run a quick benchmark as well, if all goes well.

The patch looked fine on the second read too, and I integrated it, with
a few changes:

Instead of using a flag, I introduced a patterndb_key("foo") setting,
which, if set, will put the patterndb results under the specified
key, as a sub-document. If not specified, it will do nothing extra.

In my opinion, this solution is clearer, and results in a better
structured log entry.
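
A rough Python model of that behaviour, for illustration only: it is
not the driver's code, build_entry and its arguments are hypothetical
names, and leaving out the sub-document when there are no patterndb
results is an assumption of the sketch.

```python
def build_entry(fields, dynamic_pairs, patterndb_key=None):
    """Combine the regular fields with patterndb results.

    fields:        the usual key/value pairs of the log entry
    dynamic_pairs: name-value pairs produced by patterndb
    patterndb_key: when given, the pairs are stored under this key as a
                   sub-document; when None, nothing extra is added
    """
    entry = dict(fields)
    # Assumption of this sketch: no sub-document when there is nothing to store.
    if patterndb_key is not None and dynamic_pairs:
        entry[patterndb_key] = dict(dynamic_pairs)
    return entry


fields = {"host": "localhost", "program": "sshd", "pid": "12674"}
dynamic = {"usracct.type": "login", "secevt.verdict": "ACCEPT"}

plain = build_entry(fields, dynamic)
grouped = build_entry(fields, dynamic, patterndb_key="patterndb")
print(plain)
print(grouped)
```

Keeping the patterndb results in their own sub-document means they
cannot clash with the regular top-level fields.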

Usage is like this:

destination d_mongo {
  mongodb(
    patterndb_key("patterndb")
  );
};

The resulting log entry in mongodb looks something like this:

> db.logs.find()
{ "_id" : ObjectId("4d2235525edd07af78f648f9"), "date" : "2011-01-03 21:45:06", "facility" : "auth", 
  "level" : "info", "host" : "localhost", "program" : "sshd", "pid" : "12674", 
  "message" : "Accepted publickey for algernon from ::1 port 59690 ssh2", 
  "patterndb" : { 
      ".classifier.class" : "system", 
      ".classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c", 
[...]

Martin Holste | 3 Jan 2011 22:14

Re: MongoDB destination driver

Great idea to have a dedicated, user-configurable sub-key.  One
suggestion: I think that key names cannot contain dots in Mongo.  They
don't really make sense because this:

"patterndb" : {
     ".classifier.class" : "system",
     ".classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c",
     "usracct.authmethod" : "publickey for algernon from ::1 port 59690 ssh2",
     "usracct.username" : "algernon from ::1 port 59690 ssh2",
     "usracct.device" : "::1 port 59690 ssh2",
     "usracct.service" : "ssh2",
     "usracct.type" : "login",
     "usracct.sessionid" : "12674",
     "usracct.application" : "sshd",
     "secevt.verdict" : "ACCEPT"
 }

should really look like this:

"patterndb" : {
     "classifier": {
        "class" : "system",
        "rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c"
      },
     "usracct": {
       "authmethod" : "publickey for algernon from ::1 port 59690 ssh2",
       "username" : "algernon from ::1 port 59690 ssh2",
       "device" : "::1 port 59690 ssh2",
       "service" : "ssh2",
       "type" : "login",
[...]
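
The flat-to-nested conversion suggested above can be sketched in
Python. This is one possible approach and not anything taken from the
driver: nest_dotted_keys is a hypothetical helper, dropping the empty
segment before a leading dot (as in ".classifier.class") follows the
example above, and raising an error on clashing names is a choice made
for the sketch.

```python
def nest_dotted_keys(flat):
    """Turn {"a.b": 1} into {"a": {"b": 1}}.

    Empty path segments, such as the one before a leading dot, are dropped.
    """
    nested = {}
    for dotted, value in flat.items():
        parts = [part for part in dotted.split(".") if part]
        if not parts:
            continue  # the key consisted only of dots; nothing to store
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise ValueError(f"{dotted!r} clashes with a non-nested key")
        if isinstance(node.get(parts[-1]), dict):
            raise ValueError(f"{dotted!r} clashes with a nested key")
        node[parts[-1]] = value
    return nested


example = {
    ".classifier.class": "system",
    ".classifier.rule_id": "4dd5a329-da83-4876-a431-ddcb59c2858c",
    "usracct.type": "login",
    "secevt.verdict": "ACCEPT",
}
print(nest_dotted_keys(example))
```

With the names nested like this, none of the stored keys contains a
dot any more.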

