Olly Betts | 3 Jan 03:25 2005

Re: dividing B-trees

I've been thinking about this more.

I realised a while ago that where we know all possible keys in a block
have a common prefix, we don't need to store all of each key.  For
example, if the relevant dividing keys in the level above are "mutilate"
and "muzzle", all keys in the block must start with "mu", so we can
avoid storing these 2 characters repeatedly.  In fact we don't need to
explicitly store that we've removed them, or even explicitly store how
many we've not store - this can easily be seen from the dividing keys.

With this scheme, when we split a block, we'll quite often get a longer
common prefix on one or both subblocks.  In that case, we need to scan
through the keys and trim off the extra byte or bytes.  But this means
if we split a block which would end up over 100% full, we can easily
end up with two blocks both < 50% full.  This probably gives us more
latitude for picking a good split point, and indeed the choice of split
point may be able to assist the prefix removal.

The delete case is also interesting - when we remove an empty block,
we remove a dividing key in the level above, and this may result in
the adjacent block having a shorter common prefix (not in keys that
are currently in it, but in keys which could be).  Which could result
in it becoming over 100% full, and needing to be split!

Sequential mode seems trickier though.  If we're inserting at the end of
the table (a very common case), we don't have a dividing key to bound
above, so unless the preceding diving key starts with '\xff', we must
store keys whole.  When the block fills, we insert a dividing key, at
which point we'll often be able to remove a prefix for all keys in the
block at which point it won't be full.  With 8K blocks (the default),
(Continue reading)