Peter Karman | 1 Jul 04:15

OT: index compression

This came across my queue from another list.

http://www2008.org/papers/pdf/p387-zhangA.pdf

Looks at index compression schemes. Would love to hear comments from those on 
this list.

-- 
Peter Karman  .  http://peknet.com/  .  peter <at> peknet.com
Olly Betts | 1 Jul 07:23

Re: Improving indexing speed

On Thu, Jun 26, 2008 at 04:33:25PM -0700, Robert Kaye wrote:
> I am going to take this route -- I can see the disk usage creeping up  
> once it gets past 20% of my index and the rows/second starts degrading  
> past this point. Besides, dividing this task into chunks lets me  
> offload the process to multiple cores in my machines and then glue  
> things together at the end.

If you're I/O limited (which is usually the case), then trying to split
the load over multiple cores by indexing in parallel probably won't
help.  It may make things slower overall, as it will tend to increase
the VM pressure, and also tend to mean disk writes will be split between
more files.

I'd also be a bit wary of the idea of trying to use a ram disk to hold
the index.  Depending how your OS's VM system works, this might mean
you end up trying to hold two copies of the index in RAM - one in the RAM
disk, plus a cached copy in the file cache.  Or perhaps the VM system
knows about RAM disks and is smart enough not to try to cache blocks
from them, but it's something you ought to check.

Cheers,
    Olly
Olly Betts | 1 Jul 07:36

Re: searching on individual fields

On Fri, Jun 20, 2008 at 03:22:32PM +0200, james cauwelier wrote:
>     $parser->set_stemming_strategy (XapianQueryParser::STEM_ALL, null,
> 'X'.strtoupper($field_name));

As James says, this only takes one parameter.

>     $query[] = '('.$parser->parse_query ($form[$field_name], null,
> 'X'.strtoupper($field_name)).')';

And parse_query()'s second parameter is meant to be a bitmask of flags -
`null' isn't a sensible thing to pass there.

> $query = implode (' AND ', $query);

Umm, XapianQueryParser::parse_query() returns a XapianQuery object,
which you can't (usefully) concatenate with a string.

To combine XapianQuery objects, stick them all in the array $query, and
then:

    $query = new XapianQuery(XapianQuery::OP_AND, $query);
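
Putting the three points above together, the loop might look something like
the sketch below.  It assumes the PHP bindings are loaded via xapian.php and
that each field was indexed with the prefix 'X' followed by the uppercased
field name, as in the quoted code.  The contents of $form, the English
stemmer and the particular flag combination are only illustrative
assumptions, so adjust them to suit your own setup.

```php
<?php
// Sketch only: assumes the Xapian PHP bindings are installed and that terms
// for each field were indexed with the prefix 'X' . strtoupper($field_name).
include "xapian.php";

// Hypothetical form input, keyed by field name.
$form = array('title' => 'index compression', 'author' => 'zhang');

$parser = new XapianQueryParser();
$parser->set_stemmer(new XapianStem("english"));
// set_stemming_strategy() takes just the strategy.
$parser->set_stemming_strategy(XapianQueryParser::STEM_ALL);

// The second argument to parse_query() is a bitmask of flags, not null.
$flags = XapianQueryParser::FLAG_PHRASE | XapianQueryParser::FLAG_LOVEHATE;

$subqueries = array();
foreach ($form as $field_name => $text) {
    if ($text === '') {
        continue;   // skip fields the user left empty
    }
    $prefix = 'X' . strtoupper($field_name);
    // The third argument is the default prefix for unprefixed terms.
    $subqueries[] = $parser->parse_query($text, $flags, $prefix);
}

// Combine the XapianQuery objects directly rather than via implode().
$query = new XapianQuery(XapianQuery::OP_AND, $subqueries);
print $query->get_description() . "\n";
```

Printing get_description() is just a quick way to check that each subquery
picked up the prefix you expected before passing $query to an enquire object.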

Cheers,
    Olly
Olly Betts | 1 Jul 07:56

Re: Has anyone gotten Xapian java bindings to work with Eclipse?

On Wed, Jun 18, 2008 at 03:11:24PM -0400, Jim wrote:
> OK so I'm convinced I should be using java-swig even though the top 
> level readme doesn't even mention it.

The "java-swig" bindings are a work in progress.  They are included in
the tarballs for the convenience of those who are happy to use them, but
not yet mentioned in the README to avoid people using them without
fully realising their status, and then being disappointed when they
discover the rough edges, or find the Java API changes underneath them.

> I did a make in the java-swig 
> directory and then listed the contents of the jar file.  To my 
> amazement, I found they were not in a package. 

Yes, this is one of the issues.  SWIG has an option to put them in a
package, but there was some problem with using it which is as yet
unresolved (I don't recall the problem, but if you get an SVN tree
and bootstrap it, then you can restore the commented out "-package
org.xapian" in java-swig/Makefile.am and the problem should be evident).

Patches to improve the java-swig bindings are most welcome.  Otherwise
if they don't work for you currently, you'll probably just have to wait
for me to find time to finish them.

> I don't know how to import them.  I moved the jar to the WEB-INF/lib 
> directory in Eclipse, refreshed the project and the compiler complains 
> about "Stem cannot be resolved to a type" and "Document cannot be 
> resolved to a type", etc.
>
> So I'm guessing that I have to perform some magic to get these classes 

Robert Kaye | 1 Jul 08:07

Re: Improving indexing speed


On Jun 30, 2008, at 10:23 PM, Olly Betts wrote:

> If you're I/O limited (which is usually the case), then trying to split
> the load over multiple cores by indexing in parallel probably won't
> help.  It may make things slower overall, as it will tend to increase
> the VM pressure, and also tend to mean disk writes will be split between
> more files.
>
> I'd also be a bit wary of the idea of trying to use a ram disk to hold
> the index.  Depending how your OS's VM system works, this might mean
> you end up trying to hold two copies of the index in RAM - one in the RAM
> disk, plus a cached copy in the file cache.  Or perhaps the VM system
> knows about RAM disks and is smart enough not to try to cache blocks
> from them, but it's something you ought to check.

I've been watching the performance of Xapian indexing over the past few
days and I would now concur with your assessment.

However, given a sufficiently beefy machine, I think I could use
multiple cores to get this job done in a hurry. Given the memory use/disk
IO trade-off I've observed, I think I could have my indexing machine run
2-4 indexers if I dedicated all 8G of RAM to the task. :)
Performance improves drastically when you give Xapian 500-700MB of RAM
to play with. About 1.5G per process would probably result in a
well-loaded machine -- I'm guessing.


Olly Betts | 1 Jul 08:40

Re: Has anyone gotten Xapian java bindings to work with Eclipse?

Jarrod Roberson wrote:
> you are using the old hand coded bindings, probably not going to get 
> much help on those

That's not the case.  It's true that I've no plans to update them to wrap
all the newer API features (because writing JNI wrappers by hand is a lot
of tedious work), but I'll do my best to fix bugs, and if you want to wrap
particular features and submit a patch, I'll apply it if it looks suitable.

I still don't know about Eclipse though!

Cheers,
    Olly
Olly Betts | 1 Jul 08:57

Re: Improving indexing speed

On Mon, Jun 30, 2008 at 11:07:19PM -0700, Robert Kaye wrote:
> However, given a sufficiently beefy machine, I think I could use
> multiple cores to get this job done in a hurry. Given the memory use/disk
> IO trade-off I've observed, I think I could have my indexing machine run
> 2-4 indexers if I dedicated all 8G of RAM to the task. :)
> Performance improves drastically when you give Xapian 500-700MB of RAM
> to play with. About 1.5G per process would probably result in a
> well-loaded machine -- I'm guessing.
> 
> Would anyone be interested in the results of this? I've got these 8  
> core/8G machines sitting in my office, waiting to go to the colo and I  
> could try it out. (Not that I would do this in a production  
> environment -- I'm not that pressed for time)

Yes, it's always interesting to hear performance reports.

Cheers,
    Olly
Olly Betts | 1 Jul 11:05

Re: xapian perl bindings: aborted on numbers

On Thu, Jun 19, 2008 at 02:00:21PM +0900, Josef Novak wrote:
>   I went ahead and rebuilt all of the indexes from scratch and now the
> problem seems to have disappeared.  I used exactly the same scripts,
> so I'm a bit confused as to what could possibly be different.

The "query from a list of terms" constructor does currently use SvPOK(),
which will reject Perl scalars which aren't represented as a string
internally.  But this should only mean that you can't use a number
which was expressed as a number - strings which happen to consist
entirely of digits should be fine.

And my understanding is that in this example Perl will store 2004 as a
string:

    @terms = qw/ who won the 2004 superbowl /;

But perhaps that check doesn't make sense anyway.  Looking at the
history, it's been there since the code was added in 0.8.0.1, but
Perl will coerce the value to a string if necessary, and in general
it doesn't distinguish between numbers and strings like this.

So I'm going to remove that check anyway, but I'm not sure if it is
actually related to your issue.

Cheers,
    Olly
Olly Betts | 1 Jul 12:04

Re: Ubuntu Hardy Heron build available?

On Tue, Jun 17, 2008 at 01:55:28PM +0100, Paul Dixon wrote:
> Is there a xapian build available for Ubuntu 8.04 "Hardy Heron"?

Yes, the main Ubuntu repo has it (1.0.5, except for omega which is
1.0.4).

> http://www.xapian.org/debian/dists only seems to go to gutsy....

I haven't packaged 1.0.6 at all (lack of time compounded by some
hardware problems).  Once 1.0.7 is out, I'll build it for hardy too.

Cheers,
    Olly
Olly Betts | 1 Jul 12:22

Re: Change document relevance on user feedback

On Fri, Jun 13, 2008 at 04:55:51PM +0100, Benjamin Hille wrote:
> I cannot find anywhere in the API whether it is possible to store the
> relevance back into the database. It looks like RSet does not write back
> into the db when you add a document to it. What I really need is that once
> user A has chosen the document as relevant for a query, the same query by
> user B should have the document chosen by A as relevant.

Note that it may not be true that a document which is relevant to a
query for user A is also relevant to the exact same query for user B.

Disambiguation pages on Wikipedia nicely illustrate one reason for this
- a word or phrase can have multiple meanings, and different users may
be looking for different meanings which are expressed the same way.  For
example, a query for "stock" could mean many things:

http://en.wikipedia.org/wiki/Stock_(disambiguation)

Even if users are actually expressing the same meaning, they may not
like the same pages.  A Java guru might search for a standard class name
to check some obscure detail and want reference material; a Java novice
might search for the same class, but want tutorials.

Cheers,
    Olly
