libnmsg CRC32C implementation
Robert Edmonds <edmonds <at> fsi.io>
2013-12-01 00:40:30 GMT
The NMSG protocol incorporates a per-payload CRC32C checksum for error
detection. The CRC32C calculation function is currently one of the most
expensive functions in libnmsg profiles (roughly comparable to the protobuf
decoder), so I've added a faster CRC32C implementation that will be
released in the nmsg 0.8.x series.
On x86-64 CPUs with SSE4.2, there is hardware support for calculating
the CRC32C checksum, which is about an order of magnitude faster than the
best software implementation. On x86-64, libnmsg will now do runtime
CPU feature detection at library startup and use these hardware
instructions if they are available. CPUs with the required SSE4.2
instructions include modern CPUs from Intel (including the Haswell,
Sandy Bridge, and Nehalem microarchitectures, but not the Core or Atom
microarchitectures) and AMD (the Bulldozer microarchitecture).
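
To make the detect-then-dispatch idea concrete, here is a minimal sketch in
C. It is not the libnmsg code: the names cpu_has_sse42, crc32c_hw, crc32c_sw,
crc32c_impl and crc32c_select_impl are made up for illustration, and it
assumes GCC or Clang on x86-64 (for <cpuid.h>, <nmmintrin.h> and the target
attribute). On any other platform the sketch simply keeps the portable
bit-at-a-time fallback.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#include <nmmintrin.h>
#define HAVE_X86_64_CRC32C 1

/* CPUID leaf 1 reports SSE4.2 support in bit 20 of ECX. */
static int cpu_has_sse42(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 20) & 1;
}

/* Hypothetical hardware path: 8 bytes per CRC32 instruction, then a
 * byte-wise tail. The target attribute enables SSE4.2 for this one
 * function only, so the rest of the file needs no -msse4.2 flag. */
static __attribute__((target("sse4.2")))
uint32_t crc32c_hw(const uint8_t *buf, size_t len)
{
    uint64_t crc = 0xFFFFFFFFu;
    while (len >= 8) {
        uint64_t word;
        memcpy(&word, buf, sizeof(word)); /* avoids unaligned access UB */
        crc = _mm_crc32_u64(crc, word);
        buf += 8;
        len -= 8;
    }
    while (len--)
        crc = _mm_crc32_u8((uint32_t)crc, *buf++);
    return (uint32_t)crc ^ 0xFFFFFFFFu;
}
#endif

/* Portable fallback: bit-at-a-time CRC32C using the reflected
 * Castagnoli polynomial. Slow, but short and easy to check. */
static uint32_t crc32c_sw(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *buf++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    }
    return crc ^ 0xFFFFFFFFu;
}

typedef uint32_t (*crc32c_fn)(const uint8_t *, size_t);

/* Starts out pointing at the software path; upgraded once at startup. */
static crc32c_fn crc32c_impl = crc32c_sw;

static void crc32c_select_impl(void)
{
#ifdef HAVE_X86_64_CRC32C
    if (cpu_has_sse42())
        crc32c_impl = crc32c_hw;
#endif
}
```

Calling crc32c_select_impl() once (for example from a library init routine)
and then going through crc32c_impl keeps the feature check out of the
per-payload hot path. Both paths should return the standard CRC32C check
value 0xE3069283 for the nine ASCII bytes "123456789".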
On all other architectures, and on x86-64 CPUs without SSE4.2, libnmsg
falls back to the efficient "slicing-by-8" software implementation of
CRC32C. This implementation is about 20-30% faster than the
previous implementation used in libnmsg.
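
For readers who have not seen slicing-by-8 before, the following is a rough,
self-contained sketch of the technique in C. It is an illustration rather
than the code that ships in libnmsg; the function and table names
(crc32c_init_tables, crc32c_slicing_by_8, crc32c_table) are hypothetical.
The idea is to precompute eight 256-entry tables so that eight input bytes
can be folded into the CRC per loop iteration instead of one.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY_REFLECTED 0x82F63B78u

static uint32_t crc32c_table[8][256];

/* Build the tables once before first use. Table 0 is the classic
 * byte-at-a-time table; table t advances a byte t positions further. */
static void crc32c_init_tables(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t crc = i;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY_REFLECTED : 0u);
        crc32c_table[0][i] = crc;
    }
    for (uint32_t i = 0; i < 256; i++) {
        for (int t = 1; t < 8; t++) {
            uint32_t prev = crc32c_table[t - 1][i];
            crc32c_table[t][i] = (prev >> 8) ^ crc32c_table[0][prev & 0xFFu];
        }
    }
}

/* Load four bytes as a little-endian word, independent of host order. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint32_t crc32c_slicing_by_8(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    /* Main loop: consume 8 bytes per iteration with 8 table lookups. */
    while (len >= 8) {
        uint32_t lo = load_le32(buf) ^ crc;
        uint32_t hi = load_le32(buf + 4);
        crc = crc32c_table[7][lo & 0xFFu] ^
              crc32c_table[6][(lo >> 8) & 0xFFu] ^
              crc32c_table[5][(lo >> 16) & 0xFFu] ^
              crc32c_table[4][lo >> 24] ^
              crc32c_table[3][hi & 0xFFu] ^
              crc32c_table[2][(hi >> 8) & 0xFFu] ^
              crc32c_table[1][(hi >> 16) & 0xFFu] ^
              crc32c_table[0][hi >> 24];
        buf += 8;
        len -= 8;
    }

    /* Tail: fewer than 8 bytes left, fall back to one byte at a time. */
    while (len--)
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *buf++) & 0xFFu];

    return crc ^ 0xFFFFFFFFu;
}
```

The sketch trades 8 KB of lookup tables for fewer dependent operations per
input byte. crc32c_init_tables() must run once before the first call to
crc32c_slicing_by_8(); a production version would more likely use a
statically initialized table.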
I found Evan Jones' CRC32C blog post and source code on this topic
to be highly informative.
In microbenchmarks (decoding and re-encoding files containing NMSG
payloads with an average size of roughly 100-200 bytes) on my desktop,
whose CPU supports the SSE4.2 instructions, the new implementation cut
about 10-20% off the total nmsgtool runtime.