Michael Trinkala | 11 Aug 07:52

Proposed changes to omindex

Proposed changes to omindex

Currently Available Items
=========================

1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during
indexing.

2) Add the document’s last modified time to the value table (ID 0).  This would allow incremental
indexing based on the timestamp and also sorting by date in omega (SORT=0).
a. Currently I store the timestamp as a 10 byte string (left zero padded UNIX time string), e.g.
0969492426
b. However, for maximum space savings it could be stored as a 4 byte string in big endian format
with a get/set utility function to handle the conversion if necessary.

3) Add the document’s MD5 to the value table as a 16 byte string (binary representation of the
digest) (ID 1).  This could be used as a secondary check for incremental indexing (i.e. if the
file was touched but not changed don’t replace it) and also to collapse duplicates (COLLAPSE=1). 
The md5 source code is from the GNU textutils-2.1 package.

4) For files that require command line utility processing (e.g. pdftotext) I have added a
--copylocal option.  This allows the file to be digested while being copied to the local drive, and
then the command line utility processes the local file, saving multiple reads across the network. 
If we want to expand this it could be used to build a local cache/backup/repository.  For my use I
was thinking of putting the files under source control (svn) but that is another discussion
thread.

5) I would also recommend storing the full filename in the document data, e.g.
file=/mnt/vol1/www/sample.html.  I have a purge utility that cleans out documents that are no
longer found on the file system using this information.  FYI: I am currently migrating to a MySQL

Reini Urban | 11 Aug 08:45

Re: Proposed changes to omindex

Michael Trinkala schrieb:
> Proposed changes to omindex
> 
> Currently Available Items
> =========================
> 
> 1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during
> indexing.
> 
> 2) Add the document’s last modified time to the value table (ID 0).  This would allow incremental
> indexing based on the timestamp and also sorting by date in omega (SORT=0).
> a. Currently I store the timestamp as a 10 byte string (left zero padded UNIX time string), e.g.
> 0969492426
> b. However, for maximum space savings it could be stored as a 4 byte string in big endian format
> with a get/set utility function to handle the conversion if necessary.
> 
> 3) Add the document’s MD5 to the value table as a 16 byte string (binary representation of the
> digest) (ID 1).  This could be used as a secondary check for incremental indexing (i.e. if the
> file was touched but not changed don’t replace it) and also to collapse duplicates (COLLAPSE=1). 
> The md5 source code is from the GNU textutils-2.1 package.
> 
> 4) For files that require command line utility processing (e.g. pdftotext) I have added a
> --copylocal option.  This allows the file to be digested while being copied to the local drive, and
> then the command line utility processes the local file, saving multiple reads across the network. 
> If we want to expand this it could be used to build a local cache/backup/repository.  For my use I
> was thinking of putting the files under source control (svn) but that is another discussion
> thread.

I already have a cache_dir option in my omega.conf and successfully use 
it in omindex for recursive local zip/rar/msg/pst "virtual directories", 

James Aylett | 19 Aug 20:22

Re: Proposed changes to omindex

On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:

> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.

I don't think this is generally useful, for reasons previously given:
omega/omindex are really targeted to indexing and searching web
sites, where the URI is the identifier. A filename used to provide a
representation of that resource isn't at all interesting to omega, and
is only partly interesting to omindex (ie: there are other ways of
doing it). omindex is pretty limited in any case, and if you're doing
anything funky you'll be using scriptindex or your own indexer. Within
that, how you generate Q-terms and manage your documents is of course
entirely up to you.

> 4) For files that require command line utility processing
> (e.g. pdftotext) I have added a --copylocal option.  This allows the
> file to be digested while being copied to the local drive and then
> the command line utility processes the local file saving multiple
> reads across the network. If we want to expand this it could be used
> to build a local cache/backup/repository.  For my use I was thinking
> of putting the files under source control (svn) but that is another
> discussion thread.

This is neat. I agree that for anything more complex it's not actually
going to solve all the requirements, but for remote files it can
work. (Although any decent network fs has built-in caching, and in any
case you could rely on the OS buffers - if you open() first, then dup
the filedes, then use fdopen() to turn it into a FILE* - twice -
there's very little reason you'll have to hit the network twice, even

Reini Urban | 20 Aug 20:33

omindex patch

Attached is my rather largish omindex.cc patch with ChangeLog.

It needs autoreconf to update configure and the Makefiles.
Note that unrar is not patent infected, only rar, the compressor.
I've put some AC_PATH_PROG checks into configure for all helpers.

The patch is not yet complete.

2006-08-18 15:13:32 Reini Urban <reinhard.urban <at> avl.com>

	omega-0.9.6b:
	* omindex.cc: last_mod as value. Add HAVE_UNRAR,
	HAVE_MSGCONVERT, HAVE_READPST, HAVE_CATDOC checks.
	Add options --verbose, --silent
	* configure.ac: Add HAVE_CATDOC
	
2006-08-17 18:06:26 Reini Urban <reinhard.urban <at> avl.com>

	omega-0.9.6a:
	* omindex.cc: Added last_mod check, cache_dir, libtextcat,
	cached virtual directories (zip,msg,pst,...).
	New options: -c/--nocleanup, -i/--ignore-time.
	Add MS-Office mimetypes (word, excel, powerpoint, outlook)
	* configure.ac: Add HAVE_TEXTCAT, HAVE_UNRAR, HAVE_MSGCONVERT,
	HAVE_READPST, HAVE_CATDOC
	* commonhelp.cc: Update stemmer help with HAVE_TEXTCAT (lang
	autodetection)
	* configfile.cc: New cache_dir
	* Makefile.am: Prepare for omindex_test. Link omindex against
	configfile.

Olly Betts | 24 Aug 23:46

Re: weight scheme with document values

On Tue, Jul 25, 2006 at 11:54:15AM -0400, Eric Langevin wrote:
> Is it the good way to do this or there is an easier way ??

It sounds like you're really wanting the not-yet-really-implemented
MatchBiasFunctor.  This would allow an arbitrary extra weight to be
given for each document, so using BoolWeight() with a suitable
MatchBiasFunctor would allow you to rank purely by distance.

Unfortunately the current MatchBiasFunctor is a proof-of-concept
hard-wired to a particular use.

Rusty Conover was working on an ExternalPostList which would provide
most of the infrastructure for this, but I don't think I've seen
anything since his initial patch.  Gmane's threaded view seems to
be broken right now, but you can see the thread via the search:

http://search.gmane.org/?query=externalpostlist&group=&sort=revdate

It's something I'd like to see, but I'm not likely to find time to
work on it myself for a while.

Cheers,
    Olly

richard | 24 Aug 23:51

Re: weight scheme with document values

On Thu, Aug 24, 2006 at 10:46:39PM +0100, Olly Betts wrote:
> It's something I'd like to see, but I'm not likely to find time to
> work on it myself for a while.

No promises, but I think I'm likely to need this functionality shortly, so
I'm hoping to get round to looking at this soon myself.

-- 
Richard

Olly Betts | 26 Aug 23:56

Re: Proposed changes to omindex

On Thu, Aug 10, 2006 at 10:52:59PM -0700, Michael Trinkala wrote:
> Proposed changes to omindex

One suggestion before I go into details - even if some of these patches
may not be things we'd want to include in the mainstream releases right
now, they may still be of interest to some other users.  So I'd
encourage you to offer them for download, or just post them here if they
aren't too big.  The same goes for other people with patches they're
happy to share.

> Currently Available Items
> =========================
>
> 1) Have the Q prefix contain the 16 byte MD5 of the full file name
> used for document lookup during indexing.

There are two issues here really.

The first is if the unique id should be based on the file path or the
URL.  Currently omindex uses the URL, but the file path could be used
instead.  The main difference I can see is that it would allow the URL
mappings to be changed without a reindex (providing the omega CGI
applied the mappings at search time) but I'm not sure how useful that
really is - I can't remember the last time I reconfigured the url to
file mappings on any webserver I maintain.

On the flip side, currently you can move the physical locations of
files around and change the URL mappings in the web server so the URLs
remain the same, and omindex won't have to reindex a thing.  That
actually seems a more likely scenario to me (though again I can't

Olly Betts | 27 Aug 00:13

Re: Proposed changes to omindex

On Sat, Aug 19, 2006 at 07:22:10PM +0100, James Aylett wrote:
> (Although any decent network fs has built-in caching, and in any
> case you could rely on the OS buffers - if you open() first, then dup
> the filedes, then use fdopen() to turn it into a FILE* - twice -
> there's very little reason you'll have to hit the network twice, even
> on a lame net fs.

Some of the format conversion filters want a filename for the input, so
you can't open the file once and dup the file descriptor (pdftotext for
example).  Those that can read from stdin (e.g. antiword) could be
handled this way if it actually helps.
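
For the stdin-capable case, the open()/dup()/fdopen() idea quoted above
might look roughly like the sketch below. It is only an illustration, not
code from omindex: slurp() and read_twice() are made-up names, and it
assumes a POSIX system. One detail worth noting is that descriptors
created by dup() share a single file offset, so the second pass has to
seek back to the start before reading.

```cpp
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Read everything remaining on a stdio stream into a string.
static std::string slurp(FILE* f) {
    std::string out;
    char buf[4096];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) out.append(buf, n);
    return out;
}

// Hypothetical helper: open `path` once, then read it in two passes
// through two FILE* streams built on the original and the dup()ed
// descriptor. Returns false on any error.
bool read_twice(const char* path, std::string& first, std::string& second) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    int fd2 = dup(fd);
    if (fd2 < 0) { close(fd); return false; }

    // First pass (for example, to compute a digest).
    FILE* a = fdopen(fd, "rb");
    if (!a) { close(fd); close(fd2); return false; }
    first = slurp(a);
    std::fclose(a);  // closes fd; fd2 stays open

    // dup()ed descriptors share one file offset, so rewind before pass two.
    if (lseek(fd2, 0, SEEK_SET) < 0) { close(fd2); return false; }

    // Second pass (for example, to feed a filter that reads a stream).
    FILE* b = fdopen(fd2, "rb");
    if (!b) { close(fd2); return false; }
    second = slurp(b);
    std::fclose(b);  // closes fd2
    return true;
}
```

Whether the second pass is actually served from the OS buffers rather
than the network depends on the filesystem, and as noted this does
nothing for filters that insist on a filename.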

> One idea I've talked to someone about is separating omindex into
> something that drives scriptindex, which in theory would allow you to
> use the file spider in omindex with whatever indexing strategy you
> wanted.

Perhaps that was me, or possibly we've both discussed it with Richard
separately?

Anyway, it's an interesting idea, though it might add measurable
overhead.  A step towards it is that I've recently added a "load"
command to scriptindex which allows you to write an index script which
takes a filename to read and index the contents of.

> I'd certainly favour having a way of running the query parser that
> didn't need R-terms, [...]

There already is: QueryParser::set_stemming_strategy() can be called
with STEM_NONE or STEM_ALL (the default is STEM_SOME).

Michael Trinkala | 27 Aug 10:24

Re: Proposed changes to omindex

The tar file can be found here: https://www.trinkala.com/xapian/sort_collapse.tgz

Change summary for omega
------------------------
- Added the document’s last modified time to the value table (ID 0). It is stored as a 4 byte
string in big endian format
- Added the document’s MD5 to the value table (ID 1) as a 16 byte string and C term prefix to
allow collapsed documents to be easily expanded/searched
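
As an illustration of the timestamp encoding described above, here is a
minimal sketch of the two string forms discussed in this thread: the 4 byte
big endian form and the earlier 10 byte zero padded form. The function
names are hypothetical (they are not the ones in utils.cc), and the sketch
assumes the time fits in an unsigned 32-bit value.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Encode a UNIX time as a 4-byte big-endian string. The most significant
// byte comes first, so comparing two encoded strings byte by byte gives
// the same order as comparing the original numbers.
std::string time_to_be4(std::uint32_t t) {
    std::string s(4, '\0');
    s[0] = static_cast<char>((t >> 24) & 0xff);
    s[1] = static_cast<char>((t >> 16) & 0xff);
    s[2] = static_cast<char>((t >> 8) & 0xff);
    s[3] = static_cast<char>(t & 0xff);
    return s;
}

// Decode a 4-byte big-endian string back to a UNIX time.
// Returns 0 if the input is not exactly 4 bytes long.
std::uint32_t be4_to_time(const std::string& s) {
    if (s.size() != 4) return 0;
    std::uint32_t t = 0;
    for (unsigned char c : s) t = (t << 8) | c;
    return t;
}

// The alternative form: a 10-character, left zero padded decimal string.
std::string time_to_padded10(std::uint32_t t) {
    char buf[11];
    std::snprintf(buf, sizeof buf, "%010lu", static_cast<unsigned long>(t));
    return buf;
}
```

Both forms sort correctly as plain strings, which is what makes either
usable as a sort key; the big endian form just uses 4 bytes instead of 10.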

Tar file contents
-----------------
diff.txt - SVN diff of all the changes against revision 7156

Added the following files from the GNU textutils-2.1 package:
md5.c
md5.h
unlocked-io.h

utils.h
utils.cc
   - added an enum for the value id constants
   - added md5 and big endian date conversion functions

omindex.cc
   - add the two new value items and the new C prefix term during indexing

docs/omegascript.txt
query.cc
   - added $md5 and $valuedate commands/documentation
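
The two stored values are what the incremental indexing check proposed at
the start of this thread would consult. A rough sketch of that decision is
below; it is not taken from the patch, and the type and function names
(StoredInfo, Action, decide) are invented for illustration. The digest is
passed in as a callable so that the file is only read when the timestamp
differs.

```cpp
#include <cstdint>
#include <string>

// Hypothetical names, for illustration only.
enum class Action { Skip, Replace };

struct StoredInfo {
    std::uint32_t last_mod;  // value ID 0: last modified time
    std::string md5;         // value ID 1: 16-byte binary digest
};

// Decide what to do with a file that is already in the index.
// digest_of_file is only called when the timestamp has changed.
template <typename DigestFn>
Action decide(const StoredInfo& stored, std::uint32_t mtime_on_disk,
              DigestFn digest_of_file) {
    // Primary check: unchanged timestamp means nothing to do.
    if (mtime_on_disk == stored.last_mod) return Action::Skip;
    // Secondary check: touched but content identical, so don't replace.
    if (digest_of_file() == stored.md5) return Action::Skip;
    return Action::Replace;
}
```

A real indexer would probably also want to refresh the stored timestamp in
the touched-but-unchanged case, so the digest isn't recomputed on every
run; that is left out here.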


James Aylett | 27 Aug 16:27

Re: Proposed changes to omindex

On Sat, Aug 26, 2006 at 11:13:37PM +0100, Olly Betts wrote:

> Some of the format conversion filters want a filename for the input, so
> you can't open the file once and dup the file descriptor (pdftotext for
> example).  Those that can read from stdin (e.g. antiword) could be
> handled this way if it actually helps.

Well, strictly speaking we can LD_PRELOAD filters that can't act as
stream filters to death, although that only works on modern Unices. We
shouldn't really rely on that, though :-)

Most filters would accept a patch to work from stdin if they don't
already, and it wouldn't be too difficult to do. That would benefit
everyone, if we run into some common ones.

I've no idea whether it actually will help, in practice. I suspect
that in most cases, it's not actually going to win you much because
the file buffering will do the right thing already.

> > One idea I've talked to someone about is separating omindex into
> > something that drives scriptindex, which in theory would allow you to
> > use the file spider in omindex with whatever indexing strategy you
> > wanted.
> 
> Perhaps that was me, or possibly we've both discussed it with Richard
> separately?

I've no idea :-)

> Anyway, it's an interesting idea, though it might add measurable

