Something to think about
Richard Boulton <richard <at> lemurconsulting.com>
2007-10-10 01:09:51 GMT
I'm planning to add multiple-database support for searches to my "Xappy"
Python wrapper (more on this wrapper later, but for now, see
http://code.google.com/p/xappy for details). This is reasonably
straightforward, because Xapian supports searching multiple databases
nicely. The catch is that "Xappy" generates a "fieldname->prefix"
mapping automatically. The prefix which corresponds to a particular
field is therefore hidden from the user and, crucially, may be different
in different databases.
My current plan is to add a "databaseID" term to each document, and
construct a composite query. For example, the search "author:foo"
across databases with ids "db1" and "db2", where the prefix for author
in db1 is "A" and the prefix for author in db2 is "B", would become:

    (Afoo FILTER db1) OR (Bfoo FILTER db2)

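As a rough sketch of how that composite query could be assembled, the snippet below builds the query as a string from per-database prefix maps. The function name `build_composite_query` and the `prefix_maps` layout are hypothetical, invented for illustration, and are not part of Xappy. With the real Xapian bindings the same shape would presumably be built from `xapian.Query` objects combined with `OP_FILTER` and `OP_OR` rather than from strings.

```python
def build_composite_query(field, value, prefix_maps):
    """Build a textual composite query for one field search.

    prefix_maps is a hypothetical structure: {database_id: {fieldname: prefix}}.
    """
    parts = []
    # Sort the database ids so the output order is deterministic.
    for db_id in sorted(prefix_maps):
        prefix = prefix_maps[db_id].get(field)
        if prefix is None:
            # This database does not know the field, so it contributes nothing.
            continue
        # Restrict the prefixed term to documents carrying this database's id term.
        parts.append("(%s%s FILTER %s)" % (prefix, value, db_id))
    return " OR ".join(parts)


prefix_maps = {
    "db1": {"author": "A"},
    "db2": {"author": "B"},
}
print(build_composite_query("author", "foo", prefix_maps))
# prints: (Afoo FILTER db1) OR (Bfoo FILTER db2)
```

Databases that have no mapping for the field are simply skipped here; whether that is the right behaviour (as opposed to raising an error) is a design choice the plan above leaves open.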
This should give the right sort of results, but the statistics for the
terms will be a bit broken, since each prefixed term only occurs in one
of the databases. (Actually, I'm not totally convinced
they'll be broken in a harmful way, because if the term is more frequent
in one collection than another, this could correspond to it being more
significant when it occurs in the collection in which it is less
frequent.) At some point it would be nice to add the ability to have a
mapping from "human-readable field name" to "prefix code" inside Xapian,
so the multi-database code could be aware of this issue and generate the
prefixes correctly for each database. However, that's not urgent, and
not what I'm thinking about right now.
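To make the earlier remark about term statistics concrete, here is a toy calculation. It assumes that statistics are computed over the merged collection and uses a plain log(N/n) inverse document frequency, which is a simplification and not Xapian's actual weighting formula. All the numbers are made up.

```python
import math


def idf(collection_size, term_freq):
    # Simplified inverse document frequency; real weighting schemes differ.
    return math.log(float(collection_size) / term_freq)


# Hypothetical sizes: two databases of 1000 documents each, searched together.
total_docs = 1000 + 1000

# "foo" as an author is rare in db1 (term Afoo) and common in db2 (term Bfoo).
afoo_freq = 10
bfoo_freq = 200

afoo_weight = idf(total_docs, afoo_freq)
bfoo_weight = idf(total_docs, bfoo_freq)

# The term from the database where it is rarer gets the larger weight.
print(afoo_weight > bfoo_weight)
```

Under these assumptions a match in db1 scores higher than a match in db2 for the same search, which is the effect described above: the term counts for more in the collection where it is less frequent.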
It would also be nice to have a "virtual" posting list, which
effectively returned a list of all the document IDs in a particular
database, so I didn't have to explicitly store the "databaseID" terms.
But that's also not what I'm thinking about right now.