Olly Betts | 2 Mar 03:01 2009

Re: Omega Script

On Sun, Mar 01, 2009 at 02:32:37AM +1030, Frank J Bruzzaniti wrote:
> Thomas Viehmann wrote:
> > Frank J Bruzzaniti wrote:
> >   
> >> When I return the names of files from a search result I use $field{url} 
> >> which returns \\server\docs\foo.doc
> >   
> >> What should I do if I only want foo.doc to be displayed?
> >>     
> > If you want to do it with the current set of commands, using $transform
> > might be a good approach. Of course, that might be more difficult if you
> > have to cater for lots of different path conventions.

$transform currently requires trunk, but yes, you could do this with:

$transform{.*[/\\],,$field{url}}
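
As a rough sketch of what that regular expression does, here is the same
substitution performed outside Omega in C++.  It uses std::regex as a
stand-in for Omega's own regex engine (an assumption made purely for
illustration), and strip_directory is a hypothetical helper name, not
part of Omega or Xapian:

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Omega): remove everything up to and
// including the last '/' or '\' in a path, leaving just the file name.
std::string strip_directory(const std::string& url) {
    // Same pattern as in the $transform call above: a greedy ".*" followed
    // by one path separator, so the match ends at the last separator.
    static const std::regex dir_prefix(R"(.*[/\\])");
    return std::regex_replace(url, dir_prefix, "");
}
```

Applied to \\server\docs\foo.doc this yields foo.doc; a string with no
separator in it is returned unchanged.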

> >> And how hard would it be to create new script index commands like 
> >> $filename and $path.  If extending omega script is easy/"a good idea" 
> >> I'd be happy
> >> to write some documentation once I learn how.

You ask about new "script index commands" (scriptindex is used at
indexing time), but then talk about extending omega script, and use
examples starting with a "$" as OmegaScript does (OmegaScript is used at
search time).

So I'm not totally sure which you're suggesting...


Olly Betts | 2 Mar 04:09 2009

Re: Swish3

On Fri, Feb 27, 2009 at 10:14:16PM -0600, Peter Karman wrote:
> Olly Betts wrote on 2/27/09 12:00 AM:
> > On Thu, Feb 26, 2009 at 08:25:43AM -0600, Peter Karman wrote:
> >> Swish3 is similar to Omega, but as I have outlined[3] on the
> >> Swish-e list, it offers some different features, notably robust XML
> >> parsing using libxml2.
> > 
> > Are you implying that Omega's XML parser is not robust?
> > 
> > I don't recall seeing any bug reports about this...
> 
> On the contrary. I did not intend to disparage Omega. I actually based the
> swish_xapian example program on omindex.[0]
> 
> What I was trying to say instead was that Swish3 is oriented toward XML
> explicitly, uses libxml2, and is DOM-aware. You seemed to suggest omindex's
> XmlParser "just strips all the tags"[1].

Ah, I see the confusion.  There's a (perhaps poorly named) HtmlParser
subclass in the omindex source code called XmlParser which does
essentially just strip the tags (since that's what's needed to extract
text from OpenDocument and some other XML formats), but there are other
subclasses for parsing other XML formats.

The rather odd hierarchy here (XML being derived from HTML is backwards)
is really just a historical relic.

> Take this example. omindex must be told to recognize the .xml file extension:
> 
> [karpet@pekmac:~/projects/search_bench]$ omindex --db omega.db test

Olly Betts | 2 Mar 06:27 2009

Re: Tag-based filesystem with xapian, advice?

On Sat, Feb 28, 2009 at 10:21:15PM +0100, Karel Marissens wrote:
> Now my first question is, what is the path of a file? The content of a
> document? A term? A value? I need to be able to use the path when
> searching as I need to be able to limit the file-results to files in a
> certain directory. Thus for example, only files that have a path of
> /photo's/2008/*. Or do I have to work with a relevance-set or something?

I would put the path in the document data for reading when you get
results, and also index all the directories which the file is in as
terms (e.g. P/photo's and P/photo's/2008 for a file in /photo's/2008).
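
A minimal sketch of generating those directory terms in C++, assuming an
absolute, already-normalised path that uses '/' as the separator (no
doubled slashes).  directory_terms is a hypothetical helper name, and the
"P" prefix is simply the one used in the example above:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: return one "P"-prefixed term for each directory
// that contains the file, from the outermost directory to the innermost.
std::vector<std::string> directory_terms(const std::string& path) {
    std::vector<std::string> terms;
    // Start searching at index 1 so that the leading '/' of an absolute
    // path is not treated as the end of a directory name.
    for (std::string::size_type pos = path.find('/', 1);
         pos != std::string::npos;
         pos = path.find('/', pos + 1)) {
        terms.push_back("P" + path.substr(0, pos));
    }
    return terms;
}
```

For /photo's/2008/img.jpg this produces P/photo's and P/photo's/2008,
each of which could then be added to the document as a term alongside
the tags.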

> I tried using the path as a tag itself, but when I do a query for
> "/photo's/2008/*", it is automatically translated to 2 separate terms, I
> think? (a file tagged as 2008 also showed up for example)

I don't think you want to use QueryParser here - just build your Query
objects up by hand.

If you want to allow "free text queries" after +FIND, then you can parse
that part with QueryParser and then filter the result using the
appropriate "P"-prefixed term, e.g. in C++:

    Xapian::Query q = qp.parse_query(query_string);
    q = Xapian::Query(Xapian::Query::OP_FILTER, q, Xapian::Query("P/photo's"));

> My second question is, what is the easiest way to get a list of all the
> tags associated with files in the resultset? I want to have a list of  
> all tags associated with files in /photo's/2008. One method would be  
> to do a search for all files in /photo's/2008, or any subdirectory,  
> loop all the results, and per document, loop the terms associated with  

Olly Betts | 2 Mar 08:27 2009

Re: Emulating SQL JOIN in xapian

On Wed, Feb 25, 2009 at 09:26:49PM +0000, James Aylett wrote:
> On Wed, Feb 25, 2009 at 09:20:07PM +0000, Simon Roe wrote:
> 
> > 1) Simply duplicate the recipient information on every payment record.
> > This is possible as there aren't likely to be more than 20 payments
> > per recipient and I have the disk space.  I can see this speeding
> > everything up too.
> 
> This will work, and I believe is the most obvious way of attacking the
> problem in Xapian, if you must do so.

I wonder if there's a way to achieve this with PostingSource and
MatchSpy.

While I don't think Xapian should aspire to be an RDBMS, if we can
support this sort of thing mostly or entirely using existing features,
that would probably be worthwhile.

Cheers,
    Olly

Felix Antonius Wilhelm Ostmann | 2 Mar 10:31 2009

Resolv the matchcondition

Is it possible to find out which condition was the "hit"?

When I search with 3 keywords, linked with OP_OR, can I find out which one
was the "hit"?

We also need to be able to find similar and fuzzily matched keywords.
What would be the best way to do that?

Thanks!
Felix
James Aylett | 2 Mar 11:07 2009

Re: Resolv the matchcondition

On Mon, Mar 02, 2009 at 10:31:43AM +0100, Felix Antonius Wilhelm Ostmann wrote:

> Is it possible to find out which condition was the "hit"?

Check out Xapian::Enquire::get_matching_terms_begin() and friends.

> We also need to be able to find similar and fuzzily matched keywords.
> What would be the best way to do that?

I'm not entirely sure what you're asking. If you want spelling
correction, check out <http://xapian.org/docs/spelling.html>. If you
want Xapian to suggest terms that might be related to your existing
query, you want to generate an ESet using Xapian::Enquire::get_eset();
this works by taking a list of documents considered relevant (the
RSet), and using this context to generate an ordered list of possible
terms that can be added back in to the query (the ESet).

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org
James Aylett | 2 Mar 11:17 2009

Re: Emulating SQL JOIN in xapian

On Mon, Mar 02, 2009 at 07:27:25AM +0000, Olly Betts wrote:

> > > 1) Simply duplicate the recipient information on every payment record.
> > > This is possible as there aren't likely to be more than 20 payments
> > > per recipient and I have the disk space.  I can see this speeding
> > > everything up too.
> > 
> > This will work, and I believe is the most obvious way of attacking the
> > problem in Xapian, if you must do so.
> 
> I wonder if there's a way to achieve this with PostingSource and
> MatchSpy.

I've never used or even looked particularly hard at PostingSource, but
I'm guessing half of it (feeding the relevant recipient record for
every payment record) could be done providing there's a way of
determining the Document id of the recipient for each payment
Document.

Does that allow that pair of documents to contribute weight to the
same output document? I can't tell from the posting source docs. A
match decider could then get rid of recipient records as needed?

Or I may be misunderstanding what you're suggesting entirely.

J

-- 
  James Aylett


Bruno at Badiliz | 2 Mar 18:36 2009

Xapian, PHP bindings and spellings correction

Hello, 

With PHP, I am trying to get spelling corrections, but after 2 days of trying I
can't make it work!

Here is my "simple" code:

<?php

require_once( '../global.info.php' );
require_once( 'xapian.class.php' );
require_once( 'xapianencode.func.php' );

// Open the database for searching.
try {
    $database = new XapianDatabase( $_PATH.'/_inc/cache/xapian/pa' );

    // Start an enquire session.
    $enquire = new XapianEnquire($database);

    // Combine the rest of the command line arguments with spaces between
    // them, so that simple queries don't have to be quoted at

James Aylett | 2 Mar 18:44 2009

Re: Xapian, PHP bindings and spellings correction

On Mon, Mar 02, 2009 at 06:36:01PM +0100, Bruno at Badiliz wrote:

>                 $query = $qp->parse_query( $query_string ); //,
> XapianQueryParser::FLAG_SPELLING_CORRECTION);

Uncomment this, and then check $qp->get_corrected_query_string().

>                 $i = $database->spellings_begin();
>                 while( ! $i->equals( $database->spellings_end())) {
>                                var_dump( $i );
>                 }

This doesn't do what you expect it to do, and is probably not helpful
to you at all. See
<http://xapian.org/docs/apidoc/html/classXapian_1_1Database.html#2f78aa0fe39a3b68e10ca865dc0f86ef>.

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org
