Re: Using synonyms and order of results

I found a simple solution by using OP_AND_MAYBE and OP_SCALE_WEIGHT!

The new query:
[QUERY: Xapian::Query((0.5 * ((Zstempel:(pos=1) SYNONYM
Zamtszeich:(pos=1) SYNONYM Zgrubenholz:(pos=1) SYNONYM
Zkennzeich:(pos=1) SYNONYM Zpoststempel:(pos=1) SYNONYM
Zpragestempel:(pos=1) SYNONYM Zpunz:(pos=1) SYNONYM Zsiegel:(pos=1))
FILTER QMde) AND_MAYBE Zstempel:(pos=1)))]

It also works with multiple terms.

Again, Xapian is simple and fast!
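
To illustrate why the ordering changes, here is a toy model in Python. It is not Xapian's API: the `score` and `rank` helpers and the fixed weights are made up for illustration, and the FILTER part of the query is left out. The scaled SYNONYM group has to match, and the exact term on the right-hand side of AND_MAYBE only adds weight when it is present.

```python
# Toy model only: real Xapian weights come from the weighting scheme (e.g. BM25),
# not from the fixed numbers assumed here.
SYNONYMS = ["Zstempel", "Zamtszeich", "Zpoststempel", "Zsiegel"]

DOCS = {
    "synonym": ["Zsiegel"],   # matches only through a synonym
    "exact": ["Zstempel"],    # contains the searched term itself
    "other": ["Zother"],      # matches nothing
}


def score(doc_terms, synonyms, exact, factor=0.5, syn_weight=1.0, exact_weight=1.0):
    """Score for: (factor * synonym group) AND_MAYBE exact term."""
    terms = set(doc_terms)
    if not terms & set(synonyms):
        return None  # the left side of AND_MAYBE is required
    total = factor * syn_weight
    if exact in terms:
        total += exact_weight  # the right side only contributes extra weight
    return total


def rank(docs, synonyms, exact):
    """Return the names of matching documents, best score first."""
    scored = [(score(terms, synonyms, exact), name) for name, terms in docs.items()]
    matching = [(s, name) for s, name in scored if s is not None]
    return [name for s, name in sorted(matching, reverse=True)]


print(rank(DOCS, SYNONYMS, "Zstempel"))
```

In this model a document containing the searched term itself always outranks one that matches only through a synonym, which is the effect I was after.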

On 30.01.2012 10:28, Websuche :: Felix Antonius Wilhelm Ostmann wrote:
> We are using FLAG_AUTO_SYNONYMS and it works like a charm (+stemmer),
> but we currently have a problem with the order of the results. We think
> the best result should be one that matches without a synonym.
> 
> We search for stempel (German for chop) and after FLAG_AUTO_SYNONYMS
> (+STEM_SOME as stemming strategy) we get the following query:
> 
> [QUERY: Xapian::Query(((Zstempel:(pos=1) SYNONYM Zamtszeich:(pos=1)
> SYNONYM Zgrubenholz:(pos=1) SYNONYM Zkennzeich:(pos=1) SYNONYM
> Zpoststempel:(pos=1) SYNONYM Zpragestempel:(pos=1) SYNONYM Zpunz:(pos=1)
> SYNONYM Zsiegel:(pos=1)) FILTER QMde))]
> 
> [POS: 0] [PERCENT: 100%] [WEIGHT:10.153175] [ID: 64977] [PID: 8876897]
> ...
> [POS: 38] [PERCENT: 100%] [WEIGHT:8.763471] [ID: 125701] [PID: 9023761]
> ...

Liam | 10 Feb 00:50

Re: Mime2Text library, derived from omindex

On Tue, Nov 22, 2011 at 10:26 PM, Liam <xapian <at> networkimprov.net> wrote:

>
> load_file() in omega/loadfile.cc (part of the pending Mime2Text lib) calls
>
>   posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
>
> once, before closing the fd. In order to minimally impact the filesystem
> cache, I suspect it should call that after each read()?
>
> Also, the read buffer is only 4KB. It might be considerably more efficient
> if sized to the filesystem block size?
>

I believe calling posix_fadvise() after each read() is wise, as 100MB PDFs
are not uncommon and would otherwise pollute the filesystem cache. If you
agree, given the benchmarks below, I'll commit my edits to loadfile.cc and
the test program to my github branch.
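
To make the per-read idea concrete, here is a minimal sketch in Python rather than the C++ of loadfile.cc. The function name, its parameters and the 4096-byte default are assumptions for illustration, not the actual Mime2Text code. It advises the kernel after every read that the bytes just consumed will not be needed again.

```python
import os


def load_file(path, chunk_size=4096, nocache=True):
    """Read a whole file, advising the kernel to drop each chunk once read."""
    # posix_fadvise is only available on some Unix platforms.
    can_advise = nocache and hasattr(os, "posix_fadvise")
    chunks = []
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:
                break
            chunks.append(data)
            if can_advise:
                try:
                    # Hint that the range just read will not be reused.
                    os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
                except OSError:
                    # The hint is advisory, so stop trying if it is rejected.
                    can_advise = False
            offset += len(data)
    finally:
        os.close(fd)
    return b"".join(chunks)
```

The advice is only a hint, so the sketch carries on reading if the call fails.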

Here are benchmarks from a test program that walks a tree calling
   load_file(pathname, output_string, NOCACHE | NOATIME)
The test machine is a Core 2 Duo with a low-end disk, running Linux kernel
2.6.32-32-generic.
Note: the pattern of alternating slower/faster runs repeats over many tries.
Current loadfile.cc, with 4K buffer
  buffers of 8K 16K 32K 64K showed only a 1-2s speedup

$ time ./loadfile-test ~
total bytes read: 627344268

Liam | 13 Feb 09:33

Re: Mime2Text library, derived from omindex

On Sun, Jan 15, 2012 at 8:34 PM, Olly Betts <olly <at> survex.com> wrote:

>
> > The existing code expects a filename, so I feel we should stick with that
> > for the first version of this, although I agree it should take a stream
> > type eventually.
>
> For code only used in omindex, that's fine - we control the caller(s)
> and can just change how things work if needed.
>
> But if we're going to split this off into a library, we're committing to
> support the API we provide for a significant length of time, and to
> providing sane upgrade paths for any changes which later get made.
>

OK, true, but we shouldn't define the stream-oriented API until we know
exactly what a future omindex needs in that area. So I'd suggest starting
with an omindex-internal library with a pathname API. When we switch to
streams, we can move the library to its own package.

>
> > > Why is "command" in Mime2Text::Fields?  It doesn't seem to be a field.
> >
> > It's for informational purposes, what external command produced these
> > results, if any.
>
> It isn't a field though.
>

Shall I make it a second output argument?

Olly Betts | 16 Feb 06:38

GSoC 2012

Google have announced their "Summer of Code" for this year - for
background info see:

http://code.google.com/soc/

We took part last year with great success, and after a brief discussion
with those who mentored last year, we concluded it was worthwhile
applying to take part again.

I'm happy to act as admin again and submit the application.

I've updated the list of project ideas for students on the wiki from
last year, removing those tackled last year, and updating those
where work has been done outside GSoC:

http://trac.xapian.org/wiki/GSoCProjectIdeas

If you're interested in acting as a mentor for one of the ideas there,
or have an idea for a project with a scope suitable for a student to
complete in about 12 weeks, please update that page.  Ideas without a
potential mentor aren't very useful though, so being willing to mentor
your new idea is helpful.

Ideas don't have to be for work on Xapian itself - projects related to
Xapian in other software are within scope.  A wider range of project
ideas will give us a broader appeal to students.

Mentoring organisation applications open on 27th Feb, close on March 9th,
and are reviewed until 15th with accepted orgs announced on 16th, so
getting the ideas list into excellent shape before March 9th is the

Liam | 16 Feb 07:37

Re: GSoC 2012

On Wed, Feb 15, 2012 at 9:38 PM, Olly Betts <olly <at> survex.com> wrote:

> Google have announced their "Summer of Code" for this year - for
> background info see:
>
> http://code.google.com/soc/
>
> We took part last year with great success, and after a brief discussion
> with those who mentored last year, we concluded it was worthwhile
> applying to take part again.
>
> I'm happy to act as admin again and submit the application.
>
> I've updated the list of project ideas for students on the wiki from
> last year, removing those tackled last year, and updating those
> where work has been done outside GSoC:
>
> http://trac.xapian.org/wiki/GSoCProjectIdeas
>

Re: Text-Extraction Libraries, starting a new process isn't expensive (on
the order of 40usec for Linux, I believe), and prevents crashing the main
program. So the benefit of libraries vs apps would be saving any
extractor-specific initialization time, which I'd guess would be pretty
low. If init time is a factor for some extractors, one could rev those
programs (if source available) to accept a sequence of filenames via stdin
or other input stream.

Wouldn't handling archive files (tar, zip) be the more pressing need
in this area?

Justin Finkelstein | 16 Feb 11:29

Re: GSoC 2012

There're two things I'd like to suggest that aren't on the list:

    1. A queueing system, to eliminate or work around the
one-writer-at-a-time issue
    2. A web service front-end, handling queries via GET, CRUD
operations via POST containing XML

The idea being to bring Xapian a bit more in-line with some of the other
search appliances and to make adoption easier.
I'm not sure how these would fit into the Xapian ethos, but it's
something I'd like to see developed.
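
As a rough sketch of what I mean by (1), in Python: many callers submit changes to a queue and a single thread applies them, so only one writer ever touches the database. The `WriteQueue` class and the `apply_change` callback are hypothetical names for illustration. In a real setup the callback would wrap whatever performs the actual database write.

```python
import queue
import threading


class WriteQueue:
    """Funnel changes from many producers through a single writer thread."""

    def __init__(self, apply_change):
        self._apply_change = apply_change  # called only from the writer thread
        self._pending = queue.Queue()
        self._writer = threading.Thread(target=self._run, daemon=True)
        self._writer.start()

    def submit(self, change):
        """Queue a change; safe to call from any thread."""
        self._pending.put(change)

    def _run(self):
        while True:
            change = self._pending.get()
            if change is None:  # sentinel posted by close()
                break
            self._apply_change(change)

    def close(self):
        """Apply everything queued so far, then stop the writer thread."""
        self._pending.put(None)
        self._writer.join()
```

This sketch has no error handling or persistence, so queued changes would be lost if the process died.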

On Thu, 2012-02-16 at 05:38 +0000, Olly Betts wrote:

> Google have announced their "Summer of Code" for this year - for
> background info see:
> 
> http://code.google.com/soc/
> 
> We took part last year with great success, and after a brief discussion
> with those who mentored last year, we concluded it was worthwhile
> applying to take part again.
> 
> I'm happy to act as admin again and submit the application.
> 
> I've updated the list of project ideas for students on the wiki from
> last year, removing those tackled last year, and updating those
> where work has been done outside GSoC:
> 
> http://trac.xapian.org/wiki/GSoCProjectIdeas

Charlie Hull | 16 Feb 11:34

Re: GSoC 2012

On 16/02/2012 10:29, Justin Finkelstein wrote:
> There're two things I'd like to suggest that aren't on the list:
>
>      1. A queueing system, to eliminate or work around the
> one-writer-at-a-time issue

yes, a good plan

>      2. A web service front-end, handling queries via GET, CRUD
> operations via POST containing XML

We did this a while ago although we didn't take it very far:
http://code.google.com/p/flaxcode/source/browse/#svn%2Ftrunk%2Fflax_search_service

and I know Richard has also been working on this kind of thing subsequently.

Cheers

Charlie

>
> The idea being to bring Xapian a bit more in-line with some of the other
> search appliances and to make adoption easier.
> I'm not sure how these would fit into the Xapian ethos, but it's
> something I'd like to see developed.
>
> On Thu, 2012-02-16 at 05:38 +0000, Olly Betts wrote:
>
>> Google have announced their "Summer of Code" for this year - for
>> background info see:

Andrew Betts | 17 Feb 13:54

DatabaseModifiedError on get_data - best practice?

Hi,

I have previously had a problem with getting this error on a get_mset
call, and solved it by subclassing XapianEnquire with a
backoff-and-retry algorithm (as suggested by this list, many thanks!).
However, I now get it intermittently when calling get_data on a
XapianDocument.  The same solution doesn't seem to be quite as easy in
this case, because:

1. The document is not instantiated by my code, it's returned from the
Iterator, so I can't easily subclass it without editing the bindings.

2. The document doesn't have a reference to the database, so I can't
reopen it from that scope.

So, first, is it necessary to reopen the database in these situations,
or could I simply call get_data on the same document object after a
brief delay?  And second, how/where would you suggest I insert the
retry procedure?  Currently I can only see a few options, none of
which seem very good (and the first two don't solve the reopen
problem):

A) Subclass XapianDocument, and in order to make the bindings use the
subclass, also subclass the iterator, matchset and enquire.
B) Hack the bindings to insert the retry into the existing
XapianDocument::get_data method.
C) Add retry at the application level (need to add to several dozen projects!)
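
For reference, option C would look roughly like the sketch below, written in Python for brevity. The exception class is a stand-in for the binding's DatabaseModifiedError, and `with_retry`, `operation` and `reopen` are hypothetical names. Since I don't know whether a reopen is required, the sketch simply takes it as a callback that is run between attempts.

```python
import time


class DatabaseModifiedError(Exception):
    """Stand-in for the exception raised by the Xapian bindings (assumption)."""


def with_retry(operation, reopen, attempts=3, delay=0.05):
    """Call operation(); on DatabaseModifiedError back off, reopen and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except DatabaseModifiedError:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the error
            time.sleep(delay * (2 ** attempt))  # exponential backoff
            reopen()
```

Here `operation` would need to fetch the document again and then call get_data, not reuse the old document object, which is why this gets repetitive across several dozen projects.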

Any ideas much appreciated.


Olly Betts | 19 Feb 21:42

Re: GSoC 2012

On Wed, Feb 15, 2012 at 10:37:16PM -0800, Liam wrote:
> Re: Text-Extraction Libraries, starting a new process isn't expensive (on
> the order of 40usec for Linux, I believe), and prevents crashing the main
> program. So the benefit of libraries vs apps would be saving any
> extractor-specific initialization time, which I'd guess would be pretty
> low. If init time is a factor for some extractors, one could rev those
> programs (if source available) to accept a sequence of filenames via stdin
> or other input stream.

If you look at the prototype patch, you'll see this is pretty much what
it already does.

There's a small helper program which links to libwv2 and takes a
filename on stdin and sends back the text for the title, body, etc
(which is better than we can achieve with an external extractor unless
we run a separate command for the metadata, or can get it to output HTML
which we then have to parse).

The helper program is a separate process, so we don't crash omindex if
the extractor crashes, and the helper is restarted automatically if we
come to reuse it and find it isn't running.
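
The general shape of that arrangement, as a hedged Python sketch and not the prototype patch itself: the `Helper` class and the trivial echo helper below are stand-ins for illustration (the real helper links to libwv2 and returns title, body, etc). The parent sends a filename over stdin, reads one line of reply, and starts a fresh helper if the previous one has exited.

```python
import subprocess
import sys

# Stand-in helper: replies to each filename with the same text in upper case.
HELPER_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)


class Helper:
    """Keep one helper process alive and restart it on demand."""

    def __init__(self):
        self.proc = None

    def _ensure_running(self):
        if self.proc is not None and self.proc.poll() is None:
            return  # still running, reuse it
        if self.proc is not None:
            # The old helper exited or crashed: release its pipes.
            self.proc.stdin.close()
            self.proc.stdout.close()
        self.proc = subprocess.Popen(
            [sys.executable, "-c", HELPER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def extract(self, filename):
        """Send one filename to the helper and return its one-line reply."""
        self._ensure_running()
        self.proc.stdin.write(filename + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        if self.proc is not None and self.proc.poll() is None:
            self.proc.stdin.close()  # EOF makes the helper loop end
            self.proc.wait()
        self.proc = None
```

A crash in the helper then costs one restart, not the whole indexing run. The sketch does not handle a helper dying in the middle of a request.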

> Wouldn't handling archive files (tar, zip) be the more pressing need
> in this area?

I would say "more pressing" is a subjective assessment, but feel free
to add suitable project ideas to the list if you are (or have) someone
willing to mentor them.  Try to write the idea up so that it is easy
to understand for a student who isn't intimately familiar with the
area already, with some "resources" for further reading and a list of

Olly Betts | 19 Feb 22:04

Re: GSoC 2012

On Thu, Feb 16, 2012 at 10:29:29AM +0000, Justin Finkelstein wrote:
> There're two things I'd like to suggest that aren't on the list:
> 
>     1. A queueing system, to eliminate or work around the
> one-writer-at-a-time issue
>     2. A web service front-end, handling queries via GET, CRUD
> operations via POST containing XML

Isn't (2) essentially what Richard's restpose (http://restpose.org/)
aims to do, except it's JSON not XML (which seems to be the modern
trend)?

> The idea being to bring Xapian a bit more in-line with some of the other
> search appliances and to make adoption easier.
> I'm not sure how these would fit into the Xapian ethos, but it's
> something I'd like to see developed.

These seem like projects on top of Xapian to me, and that seems a
sensible separation (like how Solr is a web services layer on top of
Lucene).

I'm happy to include work on projects like that, but starting a new
project is potentially problematic.

If the student is engaged enough to stay involved in the longer term, it
would work OK, but if the student doesn't hang around much after GSoC
you have an orphaned project, which isn't really good for anyone
involved.

Also, most students will probably do better working within some sort of