Simon de la Court | 1 Oct 14:51

Processing the Xapian results

Dear Xapian list members,

My first post ever on this list, so please don't go hard on me ;)

I have recently started working with Xapian. I run a forum for an
open-source game project, and we have trouble searching our topics.
MySQL's fulltext search is very easy to set up but does not give the
right results, and Google did not index older topics. So I read the
Xapian docs, and it looked pretty cool to me: a lot more precise than
fulltext, and it can index everything I want.

But I've got a problem. Indexing is easy (well, not entirely: instead
of indexing single messages I have to index whole topics), but not
everybody is allowed to see every topic, and topics are in different
subfora. I have not yet found a way to get Xapian to show me only the
topics I want: only the topics in one subforum (if requested) and
only the topics my visitor is allowed to see. Xapian just gives me
all the results it finds.

The other option for me is to process these results in PHP, but that  
would take quite some processing power, and would require caching the  
results, because I want these to be spread over several pages.

Has anyone here done a similar implementation of Xapian? I hope you  
can help me out.

Thanks in advance and greetings,

Simon de la Court

Re: Processing the Xapian results

Afaik you pre-process your form-input data using php?

In that case you can construct a list of all forums a user is allowed to 
see. That list can be easily combined with the forums the user wants to 
include in the search.

On the indexing side you can add a boolean term indicating the topic's
forum id.

In the search code you can then add a boolean OR construction for
that list of forums a user may see. (In SQL terms you'd add something
like: AND topic.forumId IN (f1, f2, f3), which you'll have to expand to:
AND (topic.forumId = f1 OR topic.forumId = f2 OR topic.forumId = f3).)
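
A minimal sketch of that filter, written against the Xapian Python
bindings rather than PHP. The XFORUM prefix and both function names are
made up for illustration, and the sketch assumes every topic was indexed
with one boolean term such as XFORUM12 naming its forum.

```python
def forum_filter_terms(allowed_forums, requested_forums=None, prefix="XFORUM"):
    """Return boolean terms for forums that are both allowed and requested.

    The prefix is a hypothetical choice; it only has to match whatever
    term was added to each topic at indexing time.
    """
    allowed = set(allowed_forums)
    if requested_forums:
        wanted = allowed & set(requested_forums)
    else:
        # No subforum requested: search everything the user may see.
        wanted = allowed
    return [f"{prefix}{forum_id}" for forum_id in sorted(wanted)]


def build_filtered_query(user_query, terms):
    """Restrict user_query to documents carrying at least one of the terms."""
    import xapian  # imported here so the helper above runs without the bindings

    if not terms:
        # Nothing visible to this user: an empty Query matches no documents.
        return xapian.Query()
    # f1 OR f2 OR f3 ...
    forum_filter = xapian.Query(xapian.Query.OP_OR, terms)
    # OP_FILTER keeps the ranking of user_query and only limits the matches.
    return xapian.Query(xapian.Query.OP_FILTER, user_query, forum_filter)


print(forum_filter_terms([3, 1, 7], [7, 9]))  # ['XFORUM7']
```

Because the restriction happens inside the match, paging can be left to
Enquire.get_mset(first, maxitems) instead of caching results in PHP.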

Best regards,

Arjen

Simon de la Court wrote:
> Dear Xapian list members,
> 
> My first post ever on this list, so please don't go hard on me ;)
> 
> I have recently started to work with Xapian, I run a forum for an 
> opensource game project, and we have trouble with searching in our 
> topics. MySQLs fulltext does not give the right results, very easy to 
> setup though, and Google did not index older topics. So I read the 
> xapian docs and it looked pretty cool to me, a lot more precise than 
> fulltext and can index everything I want.
> 
[...]

Charlie Hull | 1 Oct 17:05

Re: Xapian 1.0.3 released

Olly Betts wrote:
> I've uploaded Xapian 1.0.3, which as usual you can download from:
> 
> http://www.xapian.org/download.php
> 
I've now made build files for MSVC, available from the usual place:
http://www.lemurconsulting.com/Products/Xapian/Overview.shtml

Cheers

Charlie

Kevin Duraj | 1 Oct 23:01

How to beat Google aka Xapian & Natural Language Processing.

Xapians!

If tomorrow the Xapian search engine achieved the same performance and
search results as Google, we still would not beat Google, because we
would only have created a copy of what the Google search engine
already offers. However, there is a way to beat anyone, and there is a
way to beat Google as well; just do not give up. Some see it as
implementing Ajax, some cool interface, marketing, or some other
nonsense. As I see it, the one way to beat Google is to implement
Natural Language Processing, so that a user can ask a question as a
natural human sentence and receive different results depending on how
that sentence was phrased.

What is interesting is that the simplest thing for a human to do is
the most difficult for a computer: recognizing the meaning of a
sentence. You have an easier time recognizing my misspellings and
malformed sentences than a computer has recognizing the meaning of a
perfect sentence. Natural Language Processing is not new, and a lot of
work has been done that yielded inconsistent results.

What I am trying to point out is that we need to start thinking about
natural language processing when laying out infrastructure for
Xapian. So far we have the following search operators: OP_AND,
OP_AND_MAYBE, OP_AND_NOT, OP_ELITE_SET, OP_FILTER, OP_NEAR, OP_OR,
OP_PHRASE, OP_VALUE_RANGE, and OP_XOR. We could add one more: OP_NLP.

What we can do now is implement OP_NLP to tag nouns, adjectives,
adverbs, punctuation, foreign words, etc. Calculate
[...]

James Aylett | 2 Oct 13:11

Re: How to beat Google aka Xapian & Natural Language Processing.

On Mon, Oct 01, 2007 at 02:01:50PM -0700, Kevin Duraj wrote:

> Search query example: What is Kevin Duraj doing?
> OP_NLP  would analyze sentence as follow:
> [what =  pronoun, question|is =
> werb|kevin=noun|duraj=noun|doing=verb|?=punctuation]

'What' isn't a pronoun, but never mind. You're suggesting a fairly
primitive level of NLP - is there any evidence to suggest this will
give good results? For instance, there's no way you could use that
strategy to deal with referents. Also, how are you planning on coping
with part-of-speech ambiguity? ('dove' is both a noun and a verb.)

Couldn't you do this separately to Xapian by judicious fiddling with
the generated query? Get the raw unstemmed terms (aka 'words' in this
context) and figure out how you want to treat them, and construct a
new query which reflects the weighting you want to apply. (Bear in
mind that BM25 takes into account the within-query-frequency of a
term as well as the within-document-frequency, and the defaults
include this.)
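
A rough sketch of that approach with the Xapian Python bindings. The
weight table, the tag names, and both function names are invented for
illustration; the tagging itself is assumed to happen elsewhere, and
stemming is ignored. The only Xapian feature relied on is the wqf
argument of the Query constructor.

```python
# Hypothetical weights per part of speech; anything unlisted gets 1.
POS_WEIGHTS = {"noun": 3, "verb": 2}


def weighted_terms(tagged_words, weights=POS_WEIGHTS, default=1):
    """Turn (word, tag) pairs into (term, wqf) pairs, dropping punctuation."""
    return [
        (word.lower(), weights.get(tag, default))
        for word, tag in tagged_words
        if tag != "punctuation"
    ]


def build_weighted_query(pairs):
    """OR the terms together, each carrying its own within-query frequency."""
    import xapian  # local import so weighted_terms() runs without the bindings

    subqueries = [xapian.Query(term, wqf) for term, wqf in pairs]
    return xapian.Query(xapian.Query.OP_OR, subqueries)


# The tagging from the quoted example, taken as given.
tagged = [
    ("What", "pronoun"), ("is", "verb"), ("Kevin", "noun"),
    ("Duraj", "noun"), ("doing", "verb"), ("?", "punctuation"),
]
print(weighted_terms(tagged))
# [('what', 1), ('is', 2), ('kevin', 3), ('duraj', 3), ('doing', 2)]
```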

I have no idea how this applies to other languages. (Well, I do, but
only for Latin, Romance languages and to an extent Germanic ones.
That's not all that useful on the web.)

> PS: Can you see the future?

I think it's orange ;-)

J

Ron Kass | 2 Oct 13:40

Re: How to beat Google aka Xapian & Natural Language Processing.


Two companies to keep in mind when considering NLP search are PowerSet
and OpinMind.
Actually, allowing effective NLP based search requires (from what I know)
both a little more complex query parsing (understanding the context and
semantic relations between parts of the document) and a much more complex
indexing (storing relations between terms).

For example:
Searching for: who shot dick chaney?
and 
Searching for: who dick chaney shot?

Both contain the same words.. so a simple parser as you suggested would
result in the same compiled query. It shouldn't.. these are two different
questions.
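
To make the point concrete, here is a small sketch in plain Python (no
Xapian involved; bag_of_words is a made-up helper) showing that a parser
which only looks at which words occur cannot tell the two questions
apart:

```python
from collections import Counter


def bag_of_words(query):
    """Lower-case the query, strip punctuation, and count each word."""
    words = [word.strip("?.,!").lower() for word in query.split()]
    return Counter(word for word in words if word)


q1 = bag_of_words("who shot dick chaney?")
q2 = bag_of_words("who dick chaney shot?")
print(q1 == q2)  # True: word order, and with it the meaning, is lost
```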

Also.. in the documents, if a document contains "John Smith was fired upon
by dick chaney", the relation between the individuals and actions is very
important. How else will you know if it's a document describing Dick Chaney
shooting John Smith or the other way around?

I do agree with you though that NLP is a very important aspect of
intelligent searching and is a way to "beat Google". I think Google is
aware of it too though ;)
-- 
View this message in context: http://www.nabble.com/How-to-beat-Google-aka-Xapian---Natural-Language-Processing.-tf4551151.html#a12990884
Sent from the Xapian - Discuss mailing list archive at Nabble.com.

Kenneth Loafman | 2 Oct 22:34

Omega and Wildcards

I made the following changes to Omega, per the instructions in

http://www.xapian.org/docs/queryparser.html,

and wildcards do not work at all.  I'm hoping it's something simple.

/usr/lib/cgi-bin/omega/omega DB=pdf P=finan\*
(with or without the backslash)

returns no documents, even though there are over 3000 hits for
'finance', about the same for 'financial', and probably more.

...Ken

ken <at> phobos:~/xapian/xapian-omega-1.0.2$ diff -aur query.cc*
--- query.cc    2007-10-02 14:33:35.000000000 -0500
+++ query.cc~   2007-07-04 19:38:41.000000000 -0500
@@ -218,6 +218,7 @@
     qp.set_stemming_strategy(option["stem_all"] == "true" ? Xapian::QueryParser::STEM_ALL : Xapian::QueryParser::STEM_SOME);
     qp.set_stopper(new MyStopper());
     qp.set_default_op(default_op);
+    qp.set_database(db);
     // std::map::insert() won't overwrite an existing entry, so we'll prefer
     // probabilistic prefixes to boolean ones in the unlikely event that both
     // exist for the same term prefix.  We'll also prefer the first specified
@@ -236,17 +237,12 @@
[...]

James Aylett | 2 Oct 23:07

Re: Omega and Wildcards

On Tue, Oct 02, 2007 at 03:34:07PM -0500, Kenneth Loafman wrote:

> I made the following changes to Omega, per the instructions in
> 
> http://www.xapian.org/docs/queryparser.html,
> 
> and wildcards do not work at all.  I'm hoping its something simple.
> 
> /usr/lib/cgi-bin/omega/omega DB=pdf P=finan\*
> (with or without the backslash)
> 
> returns no documents, even though there are over 3000 hits for
> 'finance', about the same for 'financial', and probably more.

You may be being tripped up by your shell escaping there. Try
P='finan*' instead of what you have.

Cheers,
James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org

Richard Boulton | 3 Oct 10:48

Re: Omega and Wildcards

Kenneth Loafman wrote:
> I made the following changes to Omega, per the instructions in
> 
> http://www.xapian.org/docs/queryparser.html,
> 
> and wildcards do not work at all.  I'm hoping its something simple.
> 
> /usr/lib/cgi-bin/omega/omega DB=pdf P=finan\*
> (with or without the backslash)
> 
> returns no documents, even though there are over 3000 hits for
> 'finance', about the same for 'financial', and probably more.

(I'm assuming your diff is reversed, so that "-" lines are those which 
you added.  If that's wrong, I'm very confused.)

It looks like you've changed the flags for the query parser
"parse_query" invocation correctly.  However, you've moved the
"qp.set_database(db)" call to after this invocation.  As a result, the
wildcard expansion code can't look up the terms in the database, so you
end up with no matches.

I expect it will work if you move the set_database() call back to its 
original location.
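
As an illustration of why the order matters, here is a sketch in Python
rather than Omega's C++. expand_wildcard is a toy model and not Xapian's
actual code; parse_with_wildcards (also a made-up name) shows the same
call order using the real Python bindings.

```python
def expand_wildcard(pattern, known_terms):
    """Toy model: a trailing '*' expands to the known terms sharing the prefix."""
    if not pattern.endswith("*"):
        return [pattern]
    prefix = pattern[:-1]
    return sorted(term for term in known_terms if term.startswith(prefix))


def parse_with_wildcards(db, text):
    """Parse text with wildcards enabled, giving the parser its database first."""
    import xapian  # local import: the toy model above needs no bindings

    qp = xapian.QueryParser()
    qp.set_database(db)  # must happen before parse_query()
    return qp.parse_query(text, xapian.QueryParser.FLAG_WILDCARD)


index_terms = {"final", "finance", "financial", "find"}
print(expand_wildcard("finan*", index_terms))  # ['finance', 'financial']
# Without a term list to consult, the wildcard expands to nothing:
print(expand_wildcard("finan*", set()))        # []
```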

-- 
Richard

James Aylett | 3 Oct 11:44

Re: Omega and Wildcards

[Please keep discussions on list.]

On Tue, Oct 02, 2007 at 04:38:58PM -0500, Kenneth Loafman wrote:

> > You may be being tripped up by your shell escaping there. Try
> > P='finan*' instead of what you have.
> 
> That's what the escape char is for (bash).  Quotes do not help either,
> however, 'finan*' finds 295, where 'finan\*' finds 0.  Does not work
> using the web query either.  Strange results.

I didn't know which shell you were using; in general I prefer single
quotes for avoiding shell globbing (but I can't explain why).

Richard has pointed out where your problem is, which I missed.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org
