James Aylett | 1 Aug 12:15 2008

Re: [Xapian-discuss] Dealing with image PDF's

On Thu, Jul 31, 2008 at 12:54:07PM +0100, Richard Boulton wrote:

> > I solved the preconfigured binary package problem with packaging
> > dependencies.
> 
> But that requires all the potential helper packages to be installed 
> whenever omega is.  That would be fairly annoying for a user who, say, 
> just wanted to index some HTML pages.  It also doesn't help if a helper 
> program isn't available as a package (which could certainly be the case 
> for a new helper program, but I don't know if we're using any such 
> helpers currently).

It also doesn't help when we allow simple helpers to be added on the
command line or similar, since these may be ad hoc scripts rather than
packaged programs. Using the same failure mode as we have now would be
fine for those, though.

> > I think a cache would be overkill.
> 
> A simple (in-memory) cache of helper program paths is hardly 
> heavyweight.  But, in any case, I'm not convinced that such a cache is 
> needed - I don't expect the time taken to look through $PATH for files 
> to be a big part of the indexing time.

The OS has a cache for this.

> I'm definitely opposed to hardcoding the location of files, incidentally 
> - there are all sorts of reasons that a user might want to use an 
> alternative helper file, and allowing them to simply place such a file 
> somewhere early on PATH is a good way to do this.

Olly Betts | 2 Aug 01:59 2008

Re: [Xapian-discuss] Dealing with image PDF's

On Fri, Aug 01, 2008 at 11:15:20AM +0100, James Aylett wrote:
> On Thu, Jul 31, 2008 at 12:54:07PM +0100, Richard Boulton wrote:
> > I'm definitely opposed to hardcoding the location of files, incidentally 
> > - there are all sorts of reasons that a user might want to use an 
> > alternative helper file, and allowing them to simply place such a file 
> > somewhere early on PATH is a good way to do this.
> 
> We just want the expected execvp() behaviour, don't we?

Yes, I think so (execvp() is documented as doing it like the shell
does).
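
A minimal sketch of that lookup, in Python rather than omega's C++ purely for illustration. `shutil.which()` walks the directories of a PATH-style string in order, much as execvp() does, so a replacement helper placed in an earlier directory is the one found. The names `find_helper`, `make_fake_helper` and `demo-helper` are made up for the example, and a POSIX system is assumed.

```python
import os
import shutil
import stat
import tempfile


def find_helper(name, search_path=None):
    """Return the first executable called `name` on the search path, or None.

    `search_path` is a string in PATH syntax; None means use $PATH.
    """
    return shutil.which(name, path=search_path)


def make_fake_helper(directory, name):
    """Create a tiny executable script so the lookup has something to find."""
    full = os.path.join(directory, name)
    with open(full, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(full, os.stat(full).st_mode | stat.S_IXUSR)
    return full


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as system_dir, \
         tempfile.TemporaryDirectory() as user_dir:
        make_fake_helper(system_dir, "demo-helper")
        override = make_fake_helper(user_dir, "demo-helper")
        # The user's directory is listed first, so their copy should win.
        path = os.pathsep.join([user_dir, system_dir])
        print(find_helper("demo-helper", path) == override)
```

A helper that is not installed simply comes back as None, which leaves the caller free to keep whatever failure mode it already has.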

> We could also use something similar to mailcap + mime.types, on
> systems that support them.

The standard mailcap file entries are slanted too much towards human
viewability rather than providing text in a suitable form for indexing
without caring much about formatting.  And for images and video we
want the meta-data rather than the content.  But the format might be
a sane choice.

Recoll uses a filter system which seems to be taken from Estraier.  It
uses a shell script which does the work for each format, but it has to
output HTML which often seems to require a run through sed to escape
'<', '>', and '&', and then the indexer has to parse the HTML, which all
seems a bit unnecessary.  But it might be nice to support such filter
scripts as an option.
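
To make the round trip concrete, here is a small Python sketch of the escaping step such a filter needs before it can emit HTML, and of the indexer undoing it again. It only illustrates the general idea; it is not code from Recoll or Estraier, and `escape_for_html` is a name invented for the example.

```python
import html


def escape_for_html(text):
    """Escape '<', '>' and '&' so plain text can be embedded in HTML."""
    # quote=False leaves quotation marks alone, so only the three
    # characters mentioned above are rewritten.
    return html.escape(text, quote=False)


if __name__ == "__main__":
    original = "if a < b && b > c"
    escaped = escape_for_html(original)
    print(escaped)
    # The indexer then has to reverse the escaping to get the text back.
    print(html.unescape(escaped) == original)
```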

Cheers,
    Olly

James Aylett | 2 Aug 18:08 2008

Re: [Xapian-discuss] Dealing with image PDF's

On Sat, Aug 02, 2008 at 12:59:29AM +0100, Olly Betts wrote:

> > We could also use something similar to mailcap + mime.types, on
> > systems that support them.
> 
> The standard mailcap file entries are slanted too much towards human
> viewability rather than providing text in a suitable form for indexing
> without caring much about formatting.  And for images and video we
> want the meta-data rather than the content.  But the format might be
> a sane choice.

Yes, I meant more a text file configuration system in that style.
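
As a rough illustration of what a text file configuration in that style could look like, here is a Python sketch that parses "type; command" lines into a lookup table. The file format, the `parse_filter_config` name and the sample entries are all assumptions made for the example; nothing here describes an existing omega configuration file.

```python
def parse_filter_config(text):
    """Parse 'mime/type; command' lines into a dict (hypothetical format)."""
    filters = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        mime_type, sep, command = line.partition(";")
        if not sep:
            continue  # no command given, so ignore the malformed line
        filters[mime_type.strip().lower()] = command.strip()
    return filters


# Sample entries for illustration only; %s stands for the file to convert.
EXAMPLE = """\
# type; command
application/pdf; pdftotext %s -
application/postscript; ps2ascii %s
"""

if __name__ == "__main__":
    for mime_type, command in parse_filter_config(EXAMPLE).items():
        print(mime_type, "->", command)
```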

> Recoll uses a filter system which seems to be taken from Estraier.  It
> uses a shell script which does the work for each format, but it has to
> output HTML which often seems to require a run through sed to escape
> '<', '>', and '&', and then the indexer has to parse the HTML, which all
> seems a bit unnecessary.  But it might be nice to support such filter
> scripts as an option.

If there's a well defined format it'd be nice, but it sounds a bit
messy. What we could do is write a script that takes that as input and
spits out whatever it is that we want, so you can chain any
Estraier/Recoll filters with the converter.
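
A converter along those lines might look something like this Python sketch. It assumes the filter's output is ordinary HTML and that plain text is what we want out of it; the real output of those filters may carry more structure than this handles, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring the markup."""

    def __init__(self):
        # convert_charrefs defaults to True, so &lt; arrives here as '<'.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def filter_html_to_text(html_doc):
    """Turn a filter's HTML output back into plain text."""
    parser = TextExtractor()
    parser.feed(html_doc)
    parser.close()
    # Collapse the runs of whitespace left behind by the markup.
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    sample = "<html><body><p>a &lt; b &amp; c</p></body></html>"
    print(filter_html_to_text(sample))
```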

We went a little way down this road years ago with the XML indexer
system, but it was never a pleasant thing to use or work with. We
could probably get away with something along the lines of YAML as an
encapsulation; it's not like we need huge numbers of distinct blobs of
data to work with.

Richard Boulton | 3 Aug 02:10 2008

UUIDs for databases

The replication code (which is in trunk only currently, not in any 
release) uses the concept of a UUID for each database to identify 
whether a database at a particular location is the same database as 
produced by a previous replication (possibly with alterations), or has 
been replaced with a new database.

Currently, the UUIDs generated aren't exactly universal, or unique: 
they're just the timestamp at which the database was first created.

The semantics required for the replication process to function correctly 
are just about satisfied by this.  However, one consequence is that the 
replicated database has a different UUID from the source database, 
because it was created at a different time.  It also admits the 
possibility of replication getting confused if two databases are created 
at the same time, and one is then moved into place to replace the old 
database.
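
The collision can be shown with a few lines of Python. This is only a sketch: it assumes a timestamp with one-second resolution, and it uses a random (version 4) UUID from the standard `uuid` module as one example of a more robust identifier, not as a statement of what the replication code should adopt.

```python
import uuid


def timestamp_id(created_at):
    """Identifier in the style described above: just the creation time."""
    return str(int(created_at))


def random_id():
    """A random (version 4) UUID, independent of the creation time."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    # Two databases created within the same second share an identifier.
    print(timestamp_id(1217722207.1) == timestamp_id(1217722207.9))
    # Two random identifiers are, for practical purposes, never equal.
    print(random_id() == random_id())
```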

I'd like to change the UUID implementation to something more robust, but 
I've been thinking about what the semantics should be.

UUIDs aren't currently externally exposed (other than as part of an 
opaque blob of data specifying a particular database/revision 
combination), so we can probably just focus on providing UUIDs which are 
good enough to allow the replication process to function robustly, 
rather than thinking about any other uses which they may be put to.

My thinking so far is that:

  - A new UUID should be created when a database is created from scratch.


Olly Betts | 3 Aug 08:40 2008

Re: UUIDs for databases

On Sun, Aug 03, 2008 at 01:10:07AM +0100, Richard Boulton wrote:
> UUIDs aren't currently externally exposed (other than as part of an 
> opaque blob of data specifying a particular database/revision 
> combination), so we can probably just focus on providing UUIDs which are 
> good enough to allow the replication process to function robustly, 
> rather than thinking about any other uses which they may be put to.

I think they would be useful to expose though, so only thinking about
what replication needs isn't helpful either.

> My thinking so far is that:
> 
>   - A new UUID should be created when a database is created from scratch.
> 
>   - A new UUID should be created when a database is replaced (eg, with 
> one of the OVERWRITE options of the open method).
> 
>   - If a database is modified in any way (which will always involve a 
> transaction occurring), the UUID should be left unchanged.  This allows 
> a UUID/revision_number combination to identify a particular revision of 
> a database.
> 
>   - If a database is replicated, the replica should have the same UUID 
> as the original (because, to all intents and purposes, it _is_ the 
> original).

OK so far.
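
For reference, the four rules quoted above can be modelled in a few lines of Python. `ModelDatabase` is a toy stand-in invented for this sketch; it is not Xapian's API and says nothing about how the identifier is stored.

```python
import uuid


class ModelDatabase:
    """Toy model of a database's identity under the rules quoted above."""

    def __init__(self, db_uuid=None, revision=0):
        # Created from scratch (or overwritten): a fresh UUID is assigned.
        self.uuid = db_uuid or str(uuid.uuid4())
        self.revision = revision

    def commit(self):
        # A modification is a transaction: the revision moves on,
        # the UUID is left unchanged.
        self.revision += 1

    def replicate(self):
        # A replica carries the same UUID and revision as the original.
        return ModelDatabase(self.uuid, self.revision)

    def identity(self):
        # The UUID/revision pair identifies one revision of one database.
        return (self.uuid, self.revision)


if __name__ == "__main__":
    db = ModelDatabase()
    db.commit()
    replica = db.replicate()
    print(replica.identity() == db.identity())
    print(ModelDatabase().uuid == db.uuid)
```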

>   - If the replica of the database is then modified (eg, by adding a 
> document to it), the UUID of the replica should be changed.  This is 

Richard Boulton | 3 Aug 11:00 2008

Re: UUIDs for databases

Olly Betts wrote:
>>   - The replication protocol currently works by a client sending a 
>> message to the server saying "please give me the necessary updates to 
>> transform a database with a given UUID and a given revision number into 
>> a more recent version of the database".  Therefore, for efficient 
>> updates, the server needs to remember that the old revision number 
>> applied for databases before a particular revision number.
> 
> Do you mean "the old UUID" not "old revision number"?

Yes.

>> (3) is less important, because it would simply result in full-database 
>> copies occurring when changesets could be applied, leading to an 
>> efficiency drop, rather than corrupt databases.  However, it would be 
>> best to fix the changeset files to hold the information required before 
>> a release, even if we don't actually write the code to search through 
>> them if the database UUID doesn't match, so that incompatible changes 
>> aren't required to implement this in future.
> 
> I think this is only relevant if you're changing the UUID?

Yes, that's right.
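
The trade-off described in the quoted text (a full copy is always safe, changesets are merely cheaper) can be sketched as the decision a server makes for each request. This Python model is an assumption-laden illustration: `plan_update`, its arguments and its return values are invented here and do not reflect the actual replication protocol or changeset file format.

```python
def plan_update(server_uuid, server_revision, client_uuid, client_revision):
    """Decide how a model server answers a replication request."""
    if client_uuid != server_uuid:
        # Unknown or replaced database: fall back to a full copy.
        return ("full_copy", server_revision)
    if client_revision >= server_revision:
        # Nothing newer to send.
        return ("up_to_date", server_revision)
    # Same database at an older revision: send only the missing changesets.
    missing = list(range(client_revision + 1, server_revision + 1))
    return ("changesets", missing)


if __name__ == "__main__":
    print(plan_update("db-a", 5, "db-a", 3))
    print(plan_update("db-a", 5, "db-b", 3))
```

In this model a UUID mismatch only costs efficiency, which matches the point that the fallback is a slower full copy rather than a corrupt database.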

-- 
Richard

Alessandro Pasotti | 19 Aug 09:17 2008

Fwd: Strange error with PHP bindings [some more details]


Finally I noticed something suspect:

[2008-08-19 09:11:25] [DEBUG] DAO_Articles::add_xindex() - document added id : 255, title : Gli anelli con sigil...

This is a debug line from my application; the add_xindex function simply adds the document to the Xapian database. The error always happens when I try to add an article with id = 255. This cannot be a coincidence (I also tried changing the order of the documents, and it always stops at 255).




---------- Forwarded message ----------
From: Alessandro Pasotti <apasotti <at> gmail.com>
Date: 2008/8/18
Subject: Strange error with PHP bindings
To: Xapian Discussion <Xapian-discuss <at> lists.xapian.org>


Hello,

I'm using version 1.0.7 and for the first time I'm seeing this error without being able to understand what causes it to show up:

 Fatal error: Uncaught exception 'Exception' with message 'AssertionError: backends/flint/flint_table.cc:2080: c >= 11' in /usr/local/share/php5/xapian.php:1391


Any hint?

The same tests run without errors on 1.0.6 on my testing machine.


--
Alessandro Pasotti
w3: www.itopen.it



_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
Olly Betts | 20 Aug 02:38 2008

Re: Fwd: Strange error with PHP bindings [some more details]

On Tue, Aug 19, 2008 at 09:17:09AM +0200, Alessandro Pasotti wrote:
> Finally I noticed something suspect:
> 
> [2008-08-19 09:11:25] [DEBUG] DAO_Articles::add_xindex() - document added id
> : 255, title : Gli anelli con sigil...
> 
> This is a debug line from my application; the add_xindex function simply
> adds the document to the Xapian database. The error always happens when I
> try to add an article with id = 255. This cannot be a coincidence (I also
> tried changing the order of the documents, and it always stops at 255).

You're still going to need to show us how to reproduce this.  This works
for me:

    $db = new XapianWritableDatabase("tmp.db", Xapian::DB_CREATE_OR_OVERWRITE);
    $db->replace_document(255, new XapianDocument());

Cheers,
    Olly
