Olly Betts | 1 Mar 2006 01:06

Re: Working Demo for WWW Search Engine

On Tue, Feb 28, 2006 at 11:55:27PM +0000, Olly Betts wrote:
> On Tue, Feb 28, 2006 at 01:22:21PM -0800, Kevin SoftDev wrote:
> > url   : unique=Q weight=2 field=url
> > title : index    weight=3 field=title
> > body  : index    weight=1 field=body

Oh, and you probably want to add a boolean term as well as checking for
one (that probably should be warned about, though it's possible you could
use "unique" without a corresponding "boolean" in some clever way).  Plus
"weight" is only useful for a probabilistic field.  So the index script
would probably be:

url   : unique=Q boolean=Q field=url
title : weight=3 index     field=title
body  : weight=1 index     field=body

If title and body can be arbitrarily long you might also want to add
a truncate, e.g. to index the whole of body but store at most 300
characters of it in the field:

body  : weight=1 index truncate=300 field=body

Cheers,
    Olly
Olly Betts | 1 Mar 2006 03:34

Re: Working Demo for WWW Search Engine

On Wed, Mar 01, 2006 at 12:06:12AM +0000, Olly Betts wrote:
> > > url   : unique=Q weight=2 field=url

> you probably want to add a boolean term as well as checking for
> one (that probably should be warned about [...])

I've now added a warning for this too.

Cheers,
    Olly
Kevin SoftDev | 1 Mar 2006 08:08

Re: Working Demo for WWW Search Engine

Olly,
 
Thanks for your help. I was able to deploy a simple search engine at http://nitra.net for Czech and Slovak. I still need to figure out how to get multi-term search and paging working. I know the stemmer is for English, but do you think they will notice? :-)  ... It runs pretty fast.
 
Thanks,
Kevin

_______________________________________________
Xapian-discuss mailing list
Xapian-discuss <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss
Olly Betts | 1 Mar 2006 16:27

Re: Working Demo for WWW Search Engine

On Tue, Feb 28, 2006 at 11:08:52PM -0800, Kevin SoftDev wrote:
> Thanks for your help. I was able to deploy simple search engine
> http://nitra.net in Czech and Slovak language. I still need to figure out
> how can I get multiple terms search and paging.

There seems to be a content-type bug - the results are served as
text/plain so I get the HTML source (in Firefox at least):

http://nitra.net/cgi-bin/hladaj.cgi?a=q&q=prague

> I know the stemmer is in English, but do you think they will notice?

It's actually probably better to not stem than to stem in a different
language, unless the languages are very similar morphologically.  I
doubt English and Czech or Slovak are similar enough for it to be
beneficial, and it may be harmful.

Czech and Slovak may be similar enough to each other for a Slovak
stemmer to be beneficial on Czech text and vice versa, but Snowball
doesn't include either at present.

Cheers,
    Olly
Kevin SoftDev | 1 Mar 2006 16:41

Re: Working Demo for WWW Search Engine

Olly,
 
It works, except the city is spelled "Praha"; "Prague" is the English version:
http://nitra.net/cgi-bin/hladaj.cgi?a=q&q=praha
 
I fixed the content-type: I had forgotten to print the header from the Perl script. One bug is still there: it works only with one term, based on the Perl demo script that came with Xapian. As soon as the user types two terms, nothing comes up. I am not sure if this is a bug in the Perl API or in my code.
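
The content-type fix mentioned above comes down to printing a Content-Type header, followed by a blank line, before any HTML. A minimal sketch of that idea, written in Python rather than the Perl used on the site; the function name `cgi_response` and the UTF-8 charset are assumptions made for illustration, not taken from the actual script:

```python
import sys


def cgi_response(html_body, charset="utf-8"):
    """Return a complete CGI response: header, blank line, then the body."""
    header = "Content-Type: text/html; charset=" + charset + "\r\n"
    # A bare CRLF ends the header block; everything after it is the body.
    return header + "\r\n" + html_body


# Example: send a small results page to the web server.
sys.stdout.write(cgi_response("<html><body><p>0 results</p></body></html>\n"))
```

Building the whole response in one place makes it harder to forget the header when a new output path is added to the script.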
 
example with no results:
 
 
---- With one term, the Perl example script makes the call like this:
    my $enq = $db->enquire( 'Praha' );
 
---- With two terms, is it called like this?
    my $enq = $db->enquire( 'Praha Hrad' );
 
I do not get any results back. I am wondering if there is another API I am supposed to use.
 
Thanks. Xapian was very easy to implement, and it is the fastest of all the search engines I have built; those include MySQL 5.0, MS SQL 2005, Lucene and some other custom commercial ones.
 
 
 
#!/usr/bin/perl
#-----------------------------------------#
use Search::Xapian;

# Open the database for searching.
my $db = Search::Xapian::Database->new( '/path/to/database' );

# Build an enquire object for a single-term query.
my $enq = $db->enquire( 'Praha' );

printf "Running query '%s'\n", $enq->get_query()->get_description();

# Fetch at most 20 matches, starting from the first one.
my @matches = $enq->matches(0, 20);

print scalar(@matches) . " results found\n";

foreach my $match ( @matches )
{
    my $doc = $match->get_document();
    printf "ID %d %d%% [ %s ]\n", $match->get_docid(), $match->get_percent(), $doc->get_data();
}
Olly Betts | 1 Mar 2006 17:47

Re: Working Demo for WWW Search Engine

On Wed, Mar 01, 2006 at 07:41:40AM -0800, Kevin SoftDev wrote:
> It works except the city is spelled praha, prague is the english version
> http://nitra.net/cgi-bin/hladaj.cgi?a=q&q=praha

Yeah, I'm aware that's the anglicised spelling - it was just the first
thing that came into my head to search for.  The bug I was pointing out
was the content-type, not the lack of results.

> One bug is still there that it works only with one term based on the
> Perl demo script that came with Xapian. As soon as user type two terms
> nothing come up. I am not sure if this is bug of Perl API or is mine.
> [...]
> --- two terms is called like this?
> my $enq = $db->enquire( 'Praha Hrad' );

No, that produces a one term query with a space in.  Try this:

my $qp = Search::Xapian::QueryParser->new();
# Set any options you want on $qp...
my $enq = $db->enquire($qp->parse_query('Praha Hrad'));

Cheers,
    Olly
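
To illustrate why the unparsed string finds nothing, here is a toy model in Python. It is not Xapian and uses none of its API: the "index" is just a dictionary from single-word terms to document ids, and all function names and sample documents are invented for the example. Looking up 'Praha Hrad' as one term fails because no indexed term contains a space, while splitting the string first and combining the per-term results with OR (assumed here to mirror the parser's default behaviour) does find documents.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for docid, text in docs.items():
        for word in text.lower().split():
            index[word].add(docid)
    return index


def lookup_single_term(index, term):
    """Treat the whole string as ONE term, spaces included."""
    return set(index.get(term.lower(), set()))


def lookup_parsed(index, query):
    """Split the query on whitespace and OR the per-term results together."""
    hits = set()
    for term in query.lower().split():
        hits |= index.get(term, set())
    return hits


# Hypothetical sample documents.
docs = {1: "Praha hrad", 2: "Praha metro", 3: "Bratislava hrad"}
index = build_index(docs)

# No term "praha hrad" was ever indexed, so this lookup comes back empty.
print(lookup_single_term(index, "Praha Hrad"))
# Parsing first turns the string into two terms, each of which matches.
print(lookup_parsed(index, "Praha Hrad"))
```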
Kevin SoftDev | 1 Mar 2006 18:59

Re: Working Demo for WWW Search Engine

Olly,
 
Thanks for the suggestion to parse the query into multiple terms. Previously I implemented this same search engine using a FULLTEXT index in MySQL 5.0, with almost 1 million records (web pages); the size of the table was approaching 3 GB.
 
Running on SUSE 10.0 on a 2.8 GHz Pentium with 2 GB of memory, search with MySQL 5.0 started to slow down: some results took 10-15 seconds to come back, with CPU usage approaching 99% and memory usage at 25%.
 
With Xapian I see CPU usage between 3-4% per search and memory usage of only 0.3%.
 
Check the Xapian performance for yourself. :-)
 
Thanks.
Kevin Duraj
jarrod roberson | 1 Mar 2006 19:58

Need a suggestion on implementation.

I am still working on my filesystem indexer, thanks for all the previous help, Xapian is awesome!

On to my dilemma: I want to index arbitrarily long logical paths, and I have run up against the ~240 character term limit more than once so far.
So I am trying to decide the best way to index path information.

My ideas are as follows:

/usr/jarrod/very/long/path/to/a/file.txt

One idea is to use prefixes like P000:usr, P001:jarrod, P002:very, P003:long ... you get the idea.

The other idea is to use positional information via add_posting( usr, 0 ), add_posting( jarrod, 1 ), add_posting( very, 2 ), add_posting( long, 3 ).

Which way, or which other way, would you guys suggest starting out with?

Olly Betts | 1 Mar 2006 20:54

Re: Need a suggestion on implementation.

On Wed, Mar 01, 2006 at 01:58:50PM -0500, jarrod roberson wrote:
> On to my dilemma, I want to index arbitrarily long logical paths. And I have
> run up on the ~240 character term limit way more than once so far.
> So I am trying to decide the best way to index path information.
> 
> My ideas are as follows:
> 
> /usr/jarrod/very/long/path/to/a/file.txt
> 
> use prefixes like P000:usr, P001:jarrod, P002:very P003:long . . . you get
> the idea

There's no need for each term to correspond to a directory level - you
could make them a fixed number of characters long, which would reduce
the number needed, which should make finding a particular existing entry
more efficient - if you make the length 240 characters then many files
will only need a single term.  Also, this'll work even if you have a
directory name which is 300 characters long...
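
A rough sketch of the fixed-length approach in Python. The `P000:` numbering is borrowed from the question; the function name `path_terms`, the exact limit of 240, and the assumption that the limit is counted in bytes (prefix included) are illustrative guesses rather than anything Xapian documents in this thread.

```python
from typing import List

TERM_LIMIT = 240  # approximate limit mentioned in the thread (an assumption)


def path_terms(path: str, limit: int = TERM_LIMIT) -> List[bytes]:
    """Split a path into numbered terms, each at most `limit` bytes long."""
    raw = path.encode("utf-8")
    # Leave room for the five-byte "Pnnn:" prefix in every term.
    chunk_size = limit - len(b"P000:")
    chunks = [raw[i:i + chunk_size] for i in range(0, len(raw), chunk_size)]
    if len(chunks) > 1000:
        raise ValueError("path needs more than 1000 chunks")
    # The chunk number is encoded into the term itself, not stored as a position.
    return [b"P%03d:" % n + chunk for n, chunk in enumerate(chunks)]


# A typical path fits into a single term.
print(path_terms("/usr/jarrod/very/long/path/to/a/file.txt"))
```

Chunks are cut on byte boundaries, so a multi-byte UTF-8 character can end up split across two terms; that is harmless as long as the terms are only ever compared or rejoined as bytes.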

> the other idea is to use positional information using add_posting( usr, 0 ),
> add_posting( jarrod, 1 ), add_posting( very, 2 ), add_posting( long, 3 )

That'll be less efficient than encoding the position into the term.

You could hash the overlong part of the path like omindex does, but
that carries a small chance that two paths may collide and you'll only
index one, which you may find unacceptable.
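
The hashing idea can be sketched like this in Python. This is not omindex's actual algorithm, only the same general approach: keep the head of the path readable and replace the overlong tail with a digest. The `P` prefix, the limit of 240 bytes, the choice of SHA-1, and the function name are all assumptions made for the example.

```python
import hashlib


def hashed_path_term(path, prefix=b"P", limit=240):
    """Return a term of at most `limit` bytes identifying `path`."""
    raw = prefix + path.encode("utf-8")
    if len(raw) <= limit:
        return raw  # short enough to use unchanged
    digest_len = 40  # length of a SHA-1 hex digest
    keep = limit - digest_len
    # Keep the head as-is and replace the rest with a hash of it.
    tail_digest = hashlib.sha1(raw[keep:]).hexdigest().encode("ascii")
    return raw[:keep] + tail_digest


print(hashed_path_term("/usr/jarrod/very/long/path/to/a/file.txt"))
```

Two different long paths that share the same first 200 bytes and whose tails hash to the same digest would map to the same term, which is the small collision risk described above.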

Or you could use an external database of some sort to track the pathname
-> xapian docid mapping.
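
As one possible shape for such an external mapping, here is a sketch using SQLite from Python's standard library. The table name, column names, and helper functions are invented for illustration; in real use the docid would be whatever Xapian returned when the document was added.

```python
import sqlite3


def open_mapping(db_path=":memory:"):
    """Open (or create) the path -> docid mapping table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS path_docid ("
        "path TEXT PRIMARY KEY, docid INTEGER NOT NULL)"
    )
    return conn


def remember(conn, path, docid):
    """Record (or update) the docid for a path of any length."""
    conn.execute(
        "INSERT OR REPLACE INTO path_docid (path, docid) VALUES (?, ?)",
        (path, docid),
    )
    conn.commit()


def lookup(conn, path):
    """Return the docid stored for `path`, or None if it is unknown."""
    row = conn.execute(
        "SELECT docid FROM path_docid WHERE path = ?", (path,)
    ).fetchone()
    return row[0] if row else None


conn = open_mapping()
remember(conn, "/usr/jarrod/very/long/path/to/a/file.txt", 42)  # 42 is a made-up docid
print(lookup(conn, "/usr/jarrod/very/long/path/to/a/file.txt"))
```

The trade-off is that the mapping and the Xapian database have to be kept in step, for example when documents are deleted or replaced.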

Cheers,
    Olly
Sam Liddicott | 2 Mar 2006 05:05

Re: Working Demo for WWW Search Engine

Kevin SoftDev wrote:
> Running on the Suse 10.0 Pentium 2.8 GHz with 2 GB memory the search
> started to slow down using MySQL 5.0 where some results were coming
> after 10-15 seconds and the CPU usage was approaching 99% and memory
> usage 25%.
 
Using xapian is the right thing, but with only 25% memory used I wonder if you needed to allocate more buffer space or more index space to improve search times.

Sam