Olly Betts | 2 Nov 06:04

Re: database corrupt after harddisk full

On Fri, Oct 30, 2009 at 02:41:44PM +0800, ouwind wrote:
> I ran the program, continuously inserting and querying records until the
> hard disk was full, then Xapian threw an exception saying a file write failed.
> I closed the program and deleted some files to free up some disk space, but
> when I open the writable database it throws an exception. The exception
> description is DatabaseError: Couldn't read enough (EOF)

What Xapian version are you using?

Assuming it's a recent version, please show us the output of ls -l on the
database directory.

Cheers,
    Olly
Olly Betts | 2 Nov 06:18

Re: use TermGenerator to get the term

On Thu, Oct 29, 2009 at 02:26:58PM -0500, Ying Liu wrote:
> I have a question about using the TermGenerator on its own from Perl. Someone
> asked this question before, and his code is in C++
> (http://lists.xapian.org/pipermail/xapian-discuss/2008-November/006109.html).
> My code is in Perl. Can I get the term at a given position using just the
> TermGenerator?
>
>    my $analyzer = Search::Xapian::TermGenerator->new;
>    $analyzer->index_text("hello Xapian world");
>    my $curr_position = $analyzer->get_termpos();
>    $analyzer->set_termpos(2);
>    $curr_position = $analyzer->get_termpos();
>    $analyzer->increase_termpos(1);
>    $curr_position = $analyzer->get_termpos();
>
> If I set the document and then use $doc to iterate over the term list, the
> terms are ordered alphabetically. I don't know how to use the positer().
>
> $analyzer->set_document($doc);
> my $termlist_begin = $doc->termlist_begin();
> $termlist_begin++;
> my $term = $termlist_begin->get_document();

Um, that line is wrong as TermIterator doesn't have a get_document()
method.  I think you mean:

    my $term = $termlist_begin->get_termname();

To iterate over the positions for TermIterator $term_itor you'd do:

    my $pos_itor = $term_itor->positionlist_begin();

Olly Betts | 2 Nov 06:20

Re: use TermGenerator to get the term

On Thu, Oct 29, 2009 at 06:37:31PM -0500, Ying Liu wrote:
> Well, the ESet perl package is not available now.

I'm not sure what you mean.  Xapian::ESet is wrapped as Search::Xapian::ESet
for Perl.

Cheers,
    Olly
Ben Campbell | 2 Nov 11:25

Grouping Results (again)

I've got a bunch of indexed documents (newspaper articles).
Each document has 0 or more authors.

I want to show search results grouped by author.

(it's a somewhat similar situation to the one posted a couple of weeks 
ago by Torsten Bronger)

Here are the solutions I can think of so far:

1) pick a single author for each article, put it in a value slot,
then use set_collapse_key() to do the grouping.
cons: doesn't handle articles with more than one credited author very well.

2) slurp the top N results out into the calling code (I'm using PHP in 
this case) and do the grouping there. Need some metric to rank authors - 
either by taking their most relevant document (as set_collapse_key does) 
or maybe even by summing up the relevance scores of all their documents 
- and multiple matching documents probably means an author is more relevant.

cons: doesn't scale up well to large result sets.

3) maintain a separate Xapian database which has a single uberdocument for
each author (made by concatenating all their articles).
I've got nearly 2 million documents, but only about 20000 authors. Maybe
a second database would be quite small...
cons: _another_ database to maintain and contend for RAM

Any other suggestions or advice?
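
For illustration, option 2 might look roughly like the sketch below. It is
written in Python rather than PHP and does not call Xapian at all: the `hits`
list of (title, authors, percent score) tuples is a made-up stand-in for the
top N results pulled out of an MSet, and `group_by_author` is a hypothetical
helper name. It assumes an article should appear under every one of its
credited authors, and it ranks authors by the sum of their documents' scores.

```python
from collections import defaultdict


def group_by_author(hits):
    """Group (title, authors, score) hits by author.

    Returns a list of (author, total_score, titles) tuples, highest
    total score first. An article with several authors appears in
    each of their groups; articles with no author are skipped.
    """
    groups = defaultdict(list)
    for title, authors, score in hits:
        for author in authors:
            groups[author].append((score, title))

    ranked = []
    for author, docs in groups.items():
        # Most relevant article first within each author's group.
        docs.sort(key=lambda d: d[0], reverse=True)
        total = sum(score for score, _ in docs)
        ranked.append((author, total, [title for _, title in docs]))

    # Rank authors by summed relevance; break ties by name.
    ranked.sort(key=lambda g: (-g[1], g[0]))
    return ranked


# Hypothetical top-N results: (title, authors, percent relevance score).
hits = [
    ("Article A", ["Bob Smith"], 90),
    ("Article B", ["Bob Smith", "Fred Bloggs"], 70),
    ("Article C", ["Bob Smith"], 40),
    ("Article D", ["Fred Bloggs"], 60),
    ("Article E", [], 50),
]

for author, total, titles in group_by_author(hits):
    print(author, total, titles)
```

Ranking by each author's single best document instead (closer to what
set_collapse_key() does) would only mean replacing the sum with a max. The
scaling concern still applies, since all N hits have to be fetched first.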


Richard Boulton | 2 Nov 11:58

Re: Grouping Results (again)

2009/11/2 Ben Campbell <ben <at> scumways.com>:
> I've got a bunch of indexed documents (newspaper articles).
> Each document has 0 or more authors.
>
> I want to show search results grouped by author.

What does that mean, where there is more than 1 author?  Would you
want a search result to appear in multiple groups?  Or somehow pick
one of the author groups to appear under?

The separate database approach you mention might well be a good idea -
though if you need to do incremental updates, that might be a killer.

Alternatively, the facet support (on trunk, and improved to allow
multiple facet values for a particular category in the matchspy
branch) might be what you want.  This will let you get a list of all
the authors which are in your result set.

-- 
Richard
Ben Campbell | 2 Nov 12:32

Re: Grouping Results (again)

Richard Boulton wrote:
> 2009/11/2 Ben Campbell <ben <at> scumways.com>:
>> I've got a bunch of indexed documents (newspaper articles).
>> Each document has 0 or more authors.
>>
>> I want to show search results grouped by author.
> 
> What does that mean, where there is more than 1 author?  Would you
> want a search result to appear in multiple groups?  Or somehow pick
> one of the author groups to appear under?

/Ideally/, I'd want an article to appear in multiple groups, eg:

Bob Smith
   Article A
   Article B (with Fred Bloggs)
   Article C

Fred Bloggs
   Article B (with Bob Smith)
   Article D

etc...

> The separate database approach you mention might well be a good idea -
> though if you need to do incremental updates, that might be a killer.

Well, I've got 2000-3000 new articles coming in each day, so yeah, there 
probably would be some incremental updating involved :-(
I need to update 1000-2000 authors per day, I'd guess. Which might not 

Ying Liu | 2 Nov 19:31

Re: use TermGenerator to get the term

Hi Olly,

Thanks for your reply.

The reason I asked the question is that I hadn't understood that this is an 
information retrieval system: I can only start from a word and get its 
positions, not go in the opposite direction, just like a one-way street. 
And that is the reason it is so fast.
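
A toy illustration of that one-way street, in plain Python rather than against
the Xapian API: `postings` is a hypothetical stand-in for the position lists
stored for one document (term to positions), and both function names are made
up for this sketch. Looking up a term's positions is a single dictionary
access, whereas finding the term at a given position means walking every
term's position list.

```python
def build_postings(text):
    """Map each term to the list of positions where it occurs (1-based)."""
    postings = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        postings.setdefault(word, []).append(pos)
    return postings


def terms_by_position(postings):
    """Invert the postings: map each position back to its term.

    This has to visit every term and every position, which is why the
    reverse lookup is the slow direction.
    """
    by_pos = {}
    for term, positions in postings.items():
        for pos in positions:
            by_pos[pos] = term
    return by_pos


postings = build_postings("hello Xapian world hello")

# Fast direction: term -> positions.
print(postings["hello"])

# Slow direction: position -> term, via a full scan of the postings.
by_pos = terms_by_position(postings)
print([by_pos[p] for p in sorted(by_pos)])
```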

For the PositionIterator, positionlist_begin() and positionlist_end() 
will point to the same place only when the list is empty.

I've learned a lot from the archived emails. Thank you very much!

-Ying

Olly Betts wrote:
> On Thu, Oct 29, 2009 at 02:26:58PM -0500, Ying Liu wrote:
>   
>> I have a question about using the TermGenerator on its own from Perl. Someone
>> asked this question before, and his code is in C++
>> (http://lists.xapian.org/pipermail/xapian-discuss/2008-November/006109.html).
>> My code is in Perl. Can I get the term at a given position using just the
>> TermGenerator?
>>
>>    my $analyzer = Search::Xapian::TermGenerator->new;
>>    $analyzer->index_text("hello Xapian world");
>>    my $curr_position = $analyzer->get_termpos();
>>    $analyzer->set_termpos(2);
>>    $curr_position = $analyzer->get_termpos();
>>    $analyzer->increase_termpos(1);

Ryan Bates | 3 Nov 00:36

Clearing Database For Tests

I am using Xapian through the Ruby bindings and am wondering the best
way to use it in a unit test suite where we must start with an empty
database every time (so records generated in one test don't interfere
with another). I am currently deleting the database directory and
regenerating it every time, but I find this to be very slow (nearly
half a second). This results in even a small number of tests taking a
long time to run.

Is there some fast way to clear an existing database of all content
(terms, values, spellings, etc.)? If not, what is the best practice
for resetting a database for each test?

Regards,

Ryan
David P. Novakovic | 3 Nov 00:46

Fwd: Clearing Database For Tests

I forgot to cc the mailing list...

---------- Forwarded message ----------
From: David P. Novakovic <davidnovakovic <at> gmail.com>
Date: Tue, Nov 3, 2009 at 9:45 AM
Subject: Re: [Xapian-discuss] Clearing Database For Tests
To: Ryan Bates <ryan <at> railscasts.com>

I know this doesn't answer your question, but it is definitely worth
mentioning:

I've found that I actually don't need that many tests for interface code to
Xapian.  Once you have tests that prove your base wrapper functions work as
expected, wouldn't you just want to test your code, not Xapian?

Use mocks and dependency injection to avoid actually touching the Xapian code...
I've forgotten the technical term for it, but you want to avoid testing code
that is outside of the scope of what you are trying to test.

On Tue, Nov 3, 2009 at 9:36 AM, Ryan Bates <ryan <at> railscasts.com> wrote:

> I am using Xapian through the Ruby bindings and am wondering the best
> way to use it in a unit test suite where we must start with an empty
> database every time (so records generated in one test don't interfere
> with another). I am currently deleting the database directory and
> regenerating it every time, but I find this to be very slow (nearly
> half a second). This results in even a small number of tests taking a
> long time to run.
>
> Is there some fast way to clear an existing database of all content

Ryan Bates | 3 Nov 00:56

Re: Fwd: Clearing Database For Tests

Hi David,

Thanks for the response. Unfortunately mocking isn't a solution here.
I'm actually building a high-level wrapper around the Xapian bindings
so nearly every test interacts heavily with the Xapian database:
http://github.com/ryanb/xapit

Would love to hear more ideas though.

Regards,

Ryan

On Mon, Nov 2, 2009 at 3:46 PM, David P. Novakovic
<davidnovakovic <at> gmail.com> wrote:
> I forgot to cc the mailing list...
>
> ---------- Forwarded message ----------
> From: David P. Novakovic <davidnovakovic <at> gmail.com>
> Date: Tue, Nov 3, 2009 at 9:45 AM
> Subject: Re: [Xapian-discuss] Clearing Database For Tests
> To: Ryan Bates <ryan <at> railscasts.com>
>
>
> I know this doesn't answer your question, but it is definitely worth
> mentioning:
>
> I've found that I actually don't need that many tests for interface code to
> Xapian.  Once you have tests that prove your base wrapper functions work as
> expected, wouldn't you just want to test your code, not Xapian?

