11 Aug 07:52
Proposed changes to omindex
Michael Trinkala <mdt <at> trinkala.com>
2006-08-11 05:52:59 GMT
Currently Available Items
=========================

1) Have the Q prefix contain the 16 byte MD5 of the full file name, used for document lookup during indexing.

2) Add the document's last modified time to the value table (ID 0). This would allow incremental indexing based on the timestamp and also sorting by date in omega (SORT=0).
   a. Currently I store the timestamp as a 10 byte string (left zero padded UNIX time string), e.g. 0969492426
   b. However, for maximum space savings it could be stored as a 4 byte string in big endian format, with a get/set utility function to handle the conversion if necessary.

3) Add the document's MD5 to the value table as a 16 byte string (binary representation of the digest) (ID 1). This could be used as a secondary check for incremental indexing (i.e. if the file was touched but not changed, don't replace it) and also to collapse duplicates (COLLAPSE=1). The md5 source code is from the GNU textutils-2.1 package.

4) For files that require command line utility processing (e.g. pdftotext) I have added a --copylocal option. This allows the file to be digested while it is being copied to the local drive; the command line utility then processes the local file, which saves multiple reads across the network. If we want to expand this, it could be used to build a local cache/backup/repository. For my use I was thinking of putting the files under source control (svn), but that is another discussion thread.

5) I would also recommend storing the full filename in the document data, e.g. file=/mnt/vol1/www/sample.html. I have a purge utility that uses this information to clean out documents that are no longer found on the file system.

FYI: I am currently migrating to a MySQL [...]
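To make items 2 and 4 concrete, here is a rough sketch in Python (not the actual omindex code, which is C++). The function names are made up for illustration and are not part of omindex or Xapian. It shows the two timestamp encodings from 2a/2b and the digest-while-copying idea behind --copylocal, assuming plain UNIX timestamps that fit in 32 unsigned bits.

```python
import hashlib
import struct


def timestamp_padded(mtime: int) -> bytes:
    """Option 2a: 10-byte, left-zero-padded decimal UNIX time."""
    return b"%010d" % mtime


def timestamp_packed(mtime: int) -> bytes:
    """Option 2b: 4-byte big-endian unsigned UNIX time."""
    return struct.pack(">I", mtime)


def timestamp_unpacked(value: bytes) -> int:
    """Inverse of timestamp_packed (the 'get' half of the get/set helper)."""
    return struct.unpack(">I", value)[0]


def copy_and_digest(src_path: str, dst_path: str, chunk_size: int = 65536) -> bytes:
    """Copy src to dst and return the 16-byte binary MD5 of the content.

    The source is read only once: each chunk is fed to the digest and
    written to the local copy in the same pass.
    """
    md5 = hashlib.md5()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            dst.write(chunk)
    return md5.digest()
```

Both encodings are fixed width, so a plain byte-wise string comparison orders them chronologically, which is what sorting on the value needs. The packed form saves 6 bytes per document but is no longer human readable.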
Most filters would accept a patch to work from stdin if they don't
already, and it wouldn't be too difficult to do. That would benefit
everyone, if we run into some common ones.
I've no idea whether it actually will help, in practice. I suspect
that in most cases, it's not actually going to win you much because
the file buffering will do the right thing already.
> > One idea I've talked to someone about is separating omindex into
> > something that drives scriptindex, which in theory would allow you to
> > use the file spider in omindex with whatever indexing strategy you
> > wanted.
>
> Perhaps that was me, or possibly we've both discussed it with Richard
> separately?
I've no idea