Bill Hutten | 1 Apr 02:36

Newbie question: How to extract 'date modified' from path when indexing?

Hi all:

I've successfully set up Xapian/Omega as the search engine on a client 
website. So far, so good. :)  However, the client would like to be able 
to search by date. This is not a problem in itself - the START and END CGI 
parameters work fine - except that omindex is using (of course) the 
datestamp of the HTML files as the "Date Modified".  This datestamp is 
not accurate, as the files have been moved between servers, backed up 
and restored, and so on over time.

The files are stored in a consistent structure, for instance file 
"foo.html" might be in "archives/2006/07/foo.html".  In this example, I 
would like to be able to extract the 2006/07 value from the path during 
indexing and use that as the date that Xapian/Omega uses to search on.

Can anyone give me a few pointers as to how I would accomplish this? 
Right now my indexing is simply done with omindex - I assume this will 
not be sufficient.

Thanks for any help you can offer.

- bill
--

-- 
Bill Hutten bill <at> hutten.org
Deron Meranda | 1 Apr 05:55

Re: Newbie question: How to extract 'date modified' from path when indexing?

On Tue, Mar 31, 2009 at 8:36 PM, Bill Hutten <bill <at> hutten.org> wrote:
> I've successfully set up Xapian/Omega as the search engine on a client
> website. ...
>
> The files are stored in a consistent structure, for instance file
> "foo.html" might be in "archives/2006/07/foo.html"  In this example, I
> would like to be able to extract the 2006/07 value from the path during
> indexing and use that as the date that Xapian/Omega uses to search on.

Do you have access to the webserver files at all?  Because the best
solution is simply to change the timestamp of the underlying files.  That
would benefit not only your Xapian indexing, but also all the other HTTP
goodness; such as working with whatever other types of spiders or
indexers may be crawling the site, HTTP proxies and caches, etc.

If it's Unix/Linux, changing the file timestamps would be quite easy.
You want to look at the "time" command.  Or I could provide you
a little script to do that.

As a second choice, if say this is an Apache webserver and you
can add some configuration (either the main config file or the
per-directory .htaccess files); then you can force Apache to
lie about the file's date.  This is easiest though if you only have a
few directories (which if it's one directory per month is doable).
Again, since the webserver would be sending out the correct
date, it also benefits other spiders, indexers, HTTP caches, etc.

As a last resort, you're going to have to modify the indexer itself
to overrule what it learns from the HTTP date, and instead extract
a date out of the URL pattern.

Olly Betts | 1 Apr 08:27

Re: binary db compatibility between versions

On Tue, Mar 31, 2009 at 04:09:25PM +0100, Richard Boulton wrote:
> On Tue, Mar 31, 2009 at 03:50:01PM +0100, Ben Campbell wrote:
> > Is there a particular policy about if/when the on-disk format for xapian 
> > indexes changes between versions?
> > 
> > (in specific, I'm looking at generating an index offline on a machine 
> > using xapian 1.0.5, then upload the files for use on a machine with 
> > xapian 1.0.7)
> 
> Hmm - I thought we'd documented this somewhere, but I can't find it right
> now.
> 
> The general idea is that on-disk database formats for a stable backend
> (such as "flint" during the 1.0.x release series) shouldn't change between
> releases.  However, we weren't able to maintain that during the 1.0 release
> series due to needing to fix various bugs.

0.9 flint and 1.0 flint aren't compatible due to a bug fix.

But the changes in 1.0.2 were to add support for spelling and synonyms
(and while we were at it, we made the value and position tables lazily
created).  And 1.0.3 added support for user metadata.  So it was new
features, not bugs, which drove these format changes.

As Richard noted, 1.0.3 can read databases created by 1.0.0-1.0.2 (but
databases created or updated by 1.0.3 can't be read by earlier
versions).  Similarly for 1.0.2 and 1.0.0-1.0.1.

In hindsight doing this was probably a mistake.  We'd assumed that this
would work well for users, as they'd upgrade and not try to downgrade

Olly Betts | 1 Apr 09:52

Re: Newbie question: How to extract 'date modified' from path when indexing?

On Tue, Mar 31, 2009 at 11:55:47PM -0400, Deron Meranda wrote:
> On Tue, Mar 31, 2009 at 8:36 PM, Bill Hutten <bill <at> hutten.org> wrote:
> > The files are stored in a consistent structure, for instance file
> > "foo.html" might be in "archives/2006/07/foo.html"  In this example, I
> > would like to be able to extract the 2006/07 value from the path during
> > indexing and use that as the date that Xapian/Omega uses to search on.
> 
> Do you have access to the webserver files at all?  Because the best
> solution is simply to change the timestamp of the underlying files.  That
> would benefit not only your Xapian indexing, but also all the other HTTP
> goodness; such as working with whatever other types of spiders or
> indexers may be crawling the site, HTTP proxies and caches, etc.

Yes, this is a very sensible approach.

> If it's Unix/Linux, changing the file timestamps would be quite easy.
> You want to look at the "time" command.  Or I could provide you
> a little script to do that.

Actually, "time" times how long a command takes - see "touch" for
changing file timestamps.
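
For an archive laid out like the one described, the same effect as running
"touch" by hand on each file can be sketched in Python.  This is only an
illustration under stated assumptions: the helper names date_from_path and
retouch_tree and the path pattern are hypothetical, the layout is taken to be
exactly ".../YYYY/MM/filename", and every file is stamped with midnight local
time on the first day of its month, since the path carries no day.

```python
import os
import re
from datetime import datetime

# Hypothetical pattern: the last two directories are YYYY and MM,
# as in "archives/2006/07/foo.html".
_DATE_IN_PATH = re.compile(r"(?:^|/)(\d{4})/(\d{2})/[^/]+$")


def date_from_path(path):
    """Return the first day of the month encoded in path, or None."""
    match = _DATE_IN_PATH.search(path.replace(os.sep, "/"))
    if match is None:
        return None
    try:
        return datetime(int(match.group(1)), int(match.group(2)), 1)
    except ValueError:
        # Digits that do not form a valid date, e.g. month 13.
        return None


def retouch_tree(root):
    """Set atime and mtime of each dated file under root; return the count."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            when = date_from_path(full)
            if when is None:
                continue  # Leave files outside the YYYY/MM layout alone.
            stamp = when.timestamp()
            os.utime(full, (stamp, stamp))
            changed += 1
    return changed
```

Calling retouch_tree("archives") and then re-running omindex should let it
pick up the new modification times, though whether already-indexed files get
reindexed depends on how omindex is invoked.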

> As a second choice, if say this is an Apache webserver and you
> can add some configuration (either the main config file or the
> per-directory .htaccess files); then you can force Apache to
> lie about the file's date.  This is easiest though if you only have a
> few directories (which if it's one directory per month is doable).
> Again, since the webserver would be sending out the correct
> date, it also benefits other spiders, indexers, HTTP caches, etc.


Henry | 1 Apr 15:38

Re: binary db compatibility between versions

Quoting "Olly Betts" <olly <at> survex.com>:
> ... Note that cases
> which aren't compatible will throw DatabaseVersionError, so just trying
> it is the simplest way to check.

I ran into this a few days ago (r12236, I think), requiring a re-index  
on my test data (using chert) :)

> ... but I'm interested to hear feedback from users about what sort  
> of compatibility promises they find useful.

How about always bumping the minor version number when the index  
format changes?
That way it's obvious (at a glance) that a significant change has  
occurred, triggering (hopefully) a closer inspection of the changelog  
by the user.

Regards
Henry
Olly Betts | 2 Apr 10:22

Re: binary db compatibility between versions

On Wed, Apr 01, 2009 at 03:38:59PM +0200, Henry wrote:
> Quoting "Olly Betts" <olly <at> survex.com>:
> > ... Note that cases
> > which aren't compatible will throw DatabaseVersionError, so just trying
> > it is the simplest way to check.
> 
> I ran into this a few days ago (r12236, I think), requiring a re-index  
> on my test data (using chert) :)

Yeah, for the development backend (chert currently) the rules are much
looser.  Essentially we'll bump the format version when the format
changes (so you'll get an exception), and we'll try to group changes
which bump the format version.

> > ... but I'm interested to hear feedback from users about what sort  
> > of compatibility promises they find useful.
> 
> How about always bumping the minor version number when the index  
> format changes?

The problem with forcing a minor version bump here is that we already
have the deprecation policy expressed in terms of minor versions, and
documentation, tickets, and the wiki refer to things happening in future
versions.

So we'd need to have a different way to refer to future versions in a
minor-version-independent way, somehow.  Or else we'd need to go through
and update all such references whenever the stable database format
changed.  Deciding to make 1.1 a development release has required doing
pretty much this, and it was rather a pain.

tata 668 | 6 Apr 01:18

TermGenerator question for the single quote character

Hi,

I use the TermGenerator to index the French text "Cela m'excite" 
(without the quotes). When I search for "excite" after indexing, 
I need it to be found, since "excite" is a word on its own.

Currently "excite" is not found but "m'excite" is...

Is there a setting I'm missing so that the single quote character acts as 
a word delimiter?

Thanks for the help!

Julien
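
No setting is named in this thread, so the following is only one possible
workaround, not a confirmed answer: replace apostrophes that sit between two
word characters with a space before handing the text to
TermGenerator.index_text(), so that "m" and "excite" become separate words.
The helper name split_elisions is hypothetical, and the set of apostrophe
characters it handles is an assumption.

```python
import re

# Straight (') and typographic (U+2019) apostrophes, but only when they sit
# between two word characters, as in "m'excite".
_INNER_APOSTROPHE = re.compile(r"(?<=\w)['\u2019](?=\w)")


def split_elisions(text):
    """Replace word-internal apostrophes with a space."""
    return _INNER_APOSTROPHE.sub(" ", text)
```

The same preprocessing would have to be applied to query strings, and it
also splits English contractions such as "don't", so it is only suitable
where that trade-off is acceptable.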
Eric Voisard | 6 Apr 12:45

omindex => Unknown extension

Hi all,

I'm having a recurrent problem with Omega's indexing.
When I run omindex, it sometimes fails to recognize the extension of
some files (.doc, .pdf) and skips them. In the same run, omindex is
otherwise perfectly able to index other files with the same extensions.
The reason is not clear, but it should occur before it selects a content
converter since, for example, if I manually run antiword on a .doc file
that failed, it works...

Running omindex:
Unknown extension: "/srv/xapian/targets/dir/subdir/file name.doc" - skipping

Manual conversion:
host:/srv # antiword "/srv/xapian/targets/dir/subdir/file name.doc"
<..plain text content of the file...>
host:/srv #

Note that the target directory is a CIFS mount of a remote Windows
shared directory. Charset is UTF-8.
I don't think it has to do with the whitespace in the file name, since
other .doc filenames containing whitespace work.

Any idea?...

Thanks in advance, Eric
ATIS Uher S.A. 
CH 2046 Fontaines

Cedric Jeanneret | 6 Apr 13:42

Re: omindex => Unknown extension

Hello!
I'm having the same problem here. I solved it by adding some RAM to my server.
Maybe the external calls can't be made properly, and omindex crashes when
launching programs such as antiword, pdftotext and so on...

Hope this helps...

Regards,

C.

On Mon, 06 Apr 2009 12:45:40 +0200
"Eric Voisard" <eric.voisard <at> atisuher.ch> wrote:

> Hi all,
> 
> I'm having a recurrent problem with Omega's indexing.
> When I run omindex, it sometimes misses to recognize the extension of
> some files (.doc, .pdf) and skips them. In the same run, omindex is
> otherwise perfectly able to index other files with same extensions. The
> reason is not clear but it should occur before it selects a content
> converter since for example, if I manually run antiword on a .doc file
> that failed, it works...
> 
> Running omindex:
> Unknown extension: "/srv/xapian/targets/dir/subdir/file name.doc" - skipping
> 
> Manual conversion:
> host:/srv # antiword "/srv/xapian/targets/dir/subdir/file name.doc"
> <..plain text content of the file...>

Olly Betts | 6 Apr 14:29

Re: omindex => Unknown extension

On Mon, Apr 06, 2009 at 12:45:40PM +0200, Eric Voisard wrote:
> Running omindex:
> Unknown extension: "/srv/xapian/targets/dir/subdir/file name.doc" - skipping

That message means that "doc" isn't in the mimemap.

Since it is there by default, unless you removed it with "-Mdoc:" on the
command line, this means that a previous attempt to run antiword failed
with exit status 127, which is what happens if antiword isn't found.

So either you don't have the PATH set correctly when running omindex,
or antiword can exit with status 127, or ... something else odd.

Try looking in the output for:

    Filter for "application/msword" not installed - ignoring extension "doc"
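
Both explanations can be checked from the same environment that runs
omindex (the same user, cron job or init script).  The sketch below is only
illustrative: the function names are hypothetical, and it assumes the filter
is launched through a POSIX shell, where a command that cannot be found
yields exit status 127.

```python
import shutil
import subprocess


def filter_on_path(command):
    """Return True if command resolves to an executable on the current PATH."""
    return shutil.which(command) is not None


def shell_exit_status(command_line):
    """Run command_line through the shell and return its exit status."""
    result = subprocess.run(
        command_line,
        shell=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode
```

If filter_on_path("antiword") is False there, the PATH is the likely
culprit.  If it is True, running shell_exit_status() on the antiword command
for one of the skipped files shows whether antiword itself ever returns 127.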

Cheers,
    Olly
