2 Mar 03:33
Design note: Btree index block life cycle
Daniel Phillips <phillips <at> phunq.net>
2009-03-02 02:33:25 GMT
Hi Matt,

This note examines specifics of the Tux3 btree index block life cycle, including caching, redirection, flushing and reconstruction during log replay. This addresses some points that you raised during our earlier discussion, for which my response at that time could fairly be described as hand waving. One such point was potential complexity compared to working with a more "uniform" btree design like the one you have adopted for Hammer.

Hirofumi is CCed because he has bravely volunteered to complete the implementation of this part of the Tux3 atomic commit mechanism for userspace and kernel, thus providing me with the necessary motivation to state the details precisely.

I think that we can see at this point that the complexity required to implement the Tux3 atomic commit strategy is well bounded, especially as it is now mostly implemented and can be seen not to amount to a great deal of code. Mind you, some of it has not yet seen the light of day. For example, I still have not addressed all the details of allocation bitmap replay, which will require an additional, logical replay pass not described here. Oh well, that will be another note, and hopefully all of this will be up and running fairly soon.

By the way, congratulations on your apparent success with Hammer. I hope to foment a movement to port it to Linux, as it would seem to address a niche that is not well covered in Penguin land and is unlikely to be in the foreseeable future, by my project or any other I know of, which is to say: cluster replication and continuous fine-grained delta logging for high-capacity servers.

Natural metadata position

For file index blocks, the "natural" position is near the file's inode table block, and near the beginning of the file data blocks.
Roughly speaking, we would like to place index blocks on the same track as both the associated inode table block and the first few file data blocks, so that the disk drive will pick these up together into its track cache when it first seeks to that track. Similarly, we want each inode table index block to lie near the beginning of the group of inode table leaf blocks it references. This is the ideal