Olly Betts | 1 Sep 07:04

Re: Order of NOT operand?

On Mon, Aug 31, 2009 at 10:01:58AM -0400, David Sauve wrote:
> I'm having a strange issue with NOT queries in my xapian backend for
> Django-Haystack.  The query string is generated through user input, and as
> such, the order is undetermined.

Hmm, "generated through"?  The query string should really *be* user
input.  It is almost inevitably a mistake to try to modify it before
passing it to Xapian.  If you want to apply other filtering, combine
queries, etc, then do that to the Xapian::Query object(s) produced.

> I wouldn't think that would matter, but
> the following two queries are generating different search results:
> 
> java AND NOT id:1 NOT id:2
> vs.
> NOT id:1 NOT id:2 AND java

What sort of prefix is "id"?

> Logically, I'd think this would be the same, but in practice, it's not.  The
> first format seems to generate random results, but the second, generates the
> correct results.

They aren't quite the same in practice - the first is:

((java NOT id:1) NOT id:2)

And the second (with FLAG_PURE_NOT enabled) is:

((<everything> NOT id:1) NOT id:2) AND java

David Sauve | 1 Sep 14:22

Re: Order of NOT operand?

On Tue, Sep 1, 2009 at 1:04 AM, Olly Betts <olly <at> survex.com> wrote:

> On Mon, Aug 31, 2009 at 10:01:58AM -0400, David Sauve wrote:
> > I'm having a strange issue with NOT queries in my xapian backend for
> > Django-Haystack.  The query string is generated through user input, and as
> > such, the order is undetermined.
>
> Hmm, "generated through"?  The query string should really *be* user
> input.  It is almost inevitably a mistake to try to modify it before
> passing it to Xapian.  If you want to apply other filtering, combine
> queries, etc, then do that to the Xapian::Query object(s) produced.
>

To be more specific, the query string is a combination of user input (what
they typed into the search box), and filters such as field equals, exclude,
etc.  These are all done by Django-Haystack itself in order to make the
backend (in this case Xapian) pluggable.

In practice, it is made up of two parts, a SearchBackend (the Xapian
interface), and a SearchQuery (the bit that "cleans" and assembles the query
string into a format that Xapian can recognise).

What I get, after Django-Haystack is done, in the SearchQuery, is a series of
filters for fields.  From this, I need to "re-assemble" a query string to be
passed to the SearchBackend instance at a later time.

> > I wouldn't think that would matter, but
> > the following two queries are generating different search results:
> >
> > java AND NOT id:1 NOT id:2

Richard Boulton | 1 Sep 15:31

Re: Order of NOT operand?

2009/9/1 David Sauve <dnsauve <at> gmail.com>

>  To be more specific, the query string is a combination of user input (what
> they typed into the search box), and filters such as field equals, exclude,
> etc.  These are all done by Django-Haystack itself in order to make the
> backend (in this case Xapian) pluggable.
>

It's probably about time I checked out a copy of the django-haystack xapian
backend.  This sounds very unpleasant.  As Olly said, it's almost certainly
a mistake to be constructing something to pass to the query parser, rather
than passing it user input directly.  To construct queries without running
into unexpected problems with quoting, operator precedence, etc, is almost
impossible, and is always going to be fragile with respect to changes in the
query parser.  This is because the query parser is not parsing a formal
grammar - it is trying to guess the user's intention to some extent, and is
thus likely to get confused when presented with input which isn't actually
user input, but is machine generated.

Is this an unavoidable result of the way the rest of Django-Haystack works?
Is there no way that haystack can be persuaded to give the backend the raw
input?  If not, it sounds like a bug in Django-Haystack's design, to me...

Looking at
http://github.com/notanumber/xapian-haystack/blob/d593924386cc050e3e97ce129ff71dad50e1139e/xapian_backend.py#L268
however, it looks like the search() method is presented with the user's
query string separately from the list of fields to filter on.  Maybe I'm
misinterpreting.  Also, could the "build_query" function at line 879 in that
file not return a structured representation of the query, rather than a
single string?  (If there's some reason imposed by haystack that forces it

win 32 | 1 Sep 17:58

German Danish Russian

Hello,

I searched the mailing list, but all the language problems seemed to
disappear with UTF-8 support in Xapian 1.0.  Still I can't figure it out.  I'm
under Windows XP, using C# and .NET Framework 3.5.

TermGenerator.IndexText treats some characters as separators, for example
German 'ß' or Danish 'æ', so it splits words containing them into separate
word parts, while other letters are 'simplified': ö / ø -> o, ä -> a.
Indexing text in Russian results in an unreadable term list (retrieved later
by iterating through Document.TermListBegin .. End).

I wrote my own indexer that doesn't split those words but saves them with
the AddPosting method.  Still, when they are read from the database there are
'?' (question marks) in place of 'ß' / 'æ' and 'o' in place of ö / ø.

The other part of the problem is with the QueryParser, which does the same
bad things to query terms.  I searched the Xapian source code and found that
it requires a UTF-8 encoded string as input, am I right?

Query QueryParser::Internal::parse_query(const string &qs, unsigned
flags,  string &default_prefix) {
...
    Utf8Iterator it(qs), end;

So I made a small patch to the bindings' Query.cs / QueryParser.cs files to
allow me to override the ParseQuery method so that I pass a UTF-8 encoded
string to the Xapian DLL.  With a debugger I break at the DLL entry point
_CSharp_QueryParser_ParseQuery__SWIG_1 and make sure that the passed
parameter is a UTF-8 encoded string.  Still it doesn't help!  The query term
iterator returns strange things instead of German/Danish/Russian characters
and the search fails.  Did anyone manage to get search working in non-English
languages?

Adam Sjøgren | 1 Sep 21:49

Re: German Danish Russian

On Tue, 1 Sep 2009 18:58:47 +0300, win wrote:

> Did someone manage the search to work in non-english languages ?

It works for me (Danish):

 * http://kammeratadam.dk/find/?q=Trov%C3%A6rdig

... but I'm using the Perl-bindings under Linux, so this datapoint might
not be that valuable in your context.

(Debian 5.0.2 (lenny), Xapian 1.0.7, Search::Xapian 1.0.7, Perl 5.10.0)

  Best regards,

    Adam

--

-- 
 "I think there are enough frivolous lawsuits in this         Adam Sjøgren
  country without people fighting over pop songs."       asjo <at> koldfront.dk

David Sauve | 2 Sep 00:08

Re: Order of NOT operand?

Scratch that Query output.  I was mixing query string values from a defect I
have open on Github with the actual test database I'm using for debugging.

When I construct a query using real field names and NOT operators I get the
expected query from the query parser, I think.  Here's an example:

self.sb.search('indexed NOT name:david1 NOT name:david2')
# Xapian::Query(((Zindex:(pos=1) AND_NOT ZXNAMEdavid1:(pos=2)) AND_NOT
# (XNAMEdavid2:(pos=3) OR ZXNAMEdavid2:(pos=3))))
self.sb.search('NOT name:david1 NOT name:david2 indexed')
# Xapian::Query(((<alldocuments> AND_NOT ZXNAMEdavid1:(pos=1)) AND_NOT
# (ZXNAMEdavid2:(pos=2) OR indexed:(pos=3) OR Zindex:(pos=3))))
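
Read literally, the two descriptions above are not the same tree: in the second one, the trailing "indexed" ends up OR-ed into the right-hand side of the last AND_NOT instead of being the positive part of the query. A toy set model in Python makes the effect visible. The document ids and postings are invented for illustration, and the stemmed and unstemmed variants of each term are collapsed into one, so this is only a sketch of the boolean structure, not of Xapian's matcher or weighting.

```python
# Toy postings: term -> ids of the documents containing it (invented data).
ALL_DOCS = {1, 2, 3, 4}
POSTINGS = {
    "indexed": {1, 2, 3},
    "name:david1": {1},
    "name:david2": {2},
}


def and_not(left, right):
    """Documents in `left` that are not in `right` (boolean AND_NOT)."""
    return left - right


# 'indexed NOT name:david1 NOT name:david2'
#   ((indexed AND_NOT name:david1) AND_NOT name:david2)
first = and_not(
    and_not(POSTINGS["indexed"], POSTINGS["name:david1"]),
    POSTINGS["name:david2"],
)

# 'NOT name:david1 NOT name:david2 indexed'
#   ((<alldocuments> AND_NOT name:david1) AND_NOT (name:david2 OR indexed))
second = and_not(
    and_not(ALL_DOCS, POSTINGS["name:david1"]),
    POSTINGS["name:david2"] | POSTINGS["indexed"],
)

print(sorted(first))   # [3]
print(sorted(second))  # [4]
```

With this made-up data the first form keeps the one "indexed" document that is neither david1 nor david2, while the second form throws away every "indexed" document.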

That said, I'm going to take Richard's advice, and rather than use
QueryParser to parse a generated query string, I think I'll build the query
myself.
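
One way to keep the filters out of the query string is to hold them as a small tree and only turn that tree into backend queries at the last moment. The sketch below is pure Python and does not call Xapian. The Term and Op classes, the exclude() and describe() helpers and the XID prefix are all hypothetical names chosen for illustration. In a real backend the user's text would still go through QueryParser on its own, and each Op node would presumably map onto a Xapian::Query built with the corresponding operator (for example OP_AND_NOT).

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union


@dataclass(frozen=True)
class Term:
    """A single index term; `prefix` is empty for plain text terms."""
    prefix: str
    value: str


@dataclass(frozen=True)
class Op:
    """An operator node such as AND_NOT or OR over child nodes."""
    name: str
    children: Tuple["Node", ...]


Node = Union[Term, Op]


def exclude(base: Node, prefix: str, values: Iterable[str]) -> Node:
    """Return `base` with documents matching any prefixed value removed."""
    unwanted = tuple(Term(prefix, value) for value in values)
    if not unwanted:
        return base
    rhs = unwanted[0] if len(unwanted) == 1 else Op("OR", unwanted)
    return Op("AND_NOT", (base, rhs))


def describe(node: Node) -> str:
    """Render the tree as text, for debugging only (it is never re-parsed)."""
    if isinstance(node, Term):
        return node.prefix + node.value
    return "(" + f" {node.name} ".join(describe(c) for c in node.children) + ")"


# Stand-in for the parsed user input; "XID" is a hypothetical prefix.
user_query = Term("", "java")
tree = exclude(user_query, "XID", ["1", "2"])
print(describe(tree))  # (java AND_NOT (XID1 OR XID2))
```

Because the exclusions are attached to the tree explicitly, the result no longer depends on where a parser happens to see the NOT clauses in a string.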

On Tue, Sep 1, 2009 at 9:31 AM, Richard Boulton <richard <at> tartarus.org> wrote:

> 2009/9/1 David Sauve <dnsauve <at> gmail.com>
>
>>  To be more specific, the query string is a combination of user input (what
>> they typed into the search box), and filters such as field equals, exclude,
>> etc.  These are all done by Django-Haystack itself in order to make the
>> backend (in this case Xapian) pluggable.
>>
>
> It's probably about time I checked out a copy of the django-haystack xapian
> backend.  This sounds very unpleasant.  As Olly said, it's almost certainly

Olly Betts | 2 Sep 02:23

Re: German Danish Russian

On Tue, Sep 01, 2009 at 06:58:47PM +0300, win 32 wrote:
> Hello, I searched the mailing list but all the language problems seemed to
> disappear with UTF-8 support in Xapian 1.0. Still I can't figure it out, I'm
> under Windows XP, CSharp and .NET framework 3.5

Unicode support for the C# bindings has been tested and is known to
work with Mono, but whether it works with Microsoft's implementation is
an open question:

http://xapian.org/docs/bindings/csharp/

It sounds like it doesn't work out of the box currently.

But it's probably just a matter of working out how the conversions on
string parameters and string return values need to be done, and then it
will all suddenly work.
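
As a rough illustration of the kind of conversion involved, here is a small Python sketch (not C#, and not the bindings' actual marshalling code). It assumes the native side expects UTF-8 bytes, and it uses cp1251 purely as an example of a legacy single-byte code page. Both helper names are made up.

```python
def to_native(text: str) -> bytes:
    """Encode a Unicode string as UTF-8 before handing it to native code."""
    return text.encode("utf-8")


def from_native(raw: bytes) -> str:
    """Decode UTF-8 bytes returned by native code back into a string."""
    return raw.decode("utf-8")


word = "Straße"

# A symmetric UTF-8 conversion round-trips cleanly.
assert from_native(to_native(word)) == word

# A legacy code page that lacks the character loses it on the way in.
lossy = word.encode("cp1251", errors="replace")
print(lossy)  # b'Stra?e'
```

If a conversion like the second one were happening on the way into Xapian, a '?' would be what gets stored, and no decoding on the way out could bring the original character back.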

> I wrote my own indexer that doesn't split those words but saves them with
> AddPosting method, still when they are read from the database there are '?'
> (question signs) in places of 'ß' / 'æ' and o in place of ö / ø.

It's unclear from this if the problem is passing strings to Xapian, or
returning strings from Xapian (or both).

Can you look at the database with the "delve" utility to see what terms
actually get added?

Cheers,
    Olly

ouwind | 2 Sep 09:54

how can i reduce cpu usage

I use flint as my backend database, and I added sleep calls in
FlintTable::add and FlintTable::del to reduce CPU usage.  This works when
calling add_document: CPU usage stays below 10%.  But when I insert 15k
documents, CPU usage gets very high, going above 40% many times within two
minutes.  What is it doing in that time?  Adjusting the index?  And how can I
reduce the CPU usage?  Where can I add sleep calls to reduce the usage during
those two minutes and during commit_transaction?

2009-9-2

ouwind

Olly Betts | 2 Sep 13:10

Re: how can i reduce cpu usage

On Wed, Sep 02, 2009 at 03:54:13PM +0800, ouwind wrote:
> i used flint as my backend database. and i add sleep in function of
> FlintTable::add and FlintTable::del to reduce cpu usage. and it works
> in when call add_document.the cpu usage below 10%. but when i insert
> 15k documents, it will get very high cpu usage, will have many times
> more than 40% in two minutes. what does it do in that time. adjust the
> index? and how can reduce the cpu usage. where can i add sleep to
> reduce the usage in that two minutes and in the time of
> commit_transaction

I'm not sure this is a sensible approach.  There are many places where
CPU time could be spent, and the only way to reliably find them is to
spend a lot of time profiling.  You're bound to end up with a lot of
calls to sleep, and upgrading to a new Xapian release will often be
painful, both because your changes will often conflict with changes
in the new release, and because the places where CPU time is spent
may change.

Why not work with the facilities the OS provides?

If this is Unix, then use "nice -n19" to tell the scheduler to give
least priority to the indexing process.  Then other processes will get
all the CPU time they want, and the indexer will use as much of the
remaining CPU time as it wants.

If you have a recent Linux version, "ionice -c3" can do similar things
for I/O prioritising.
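
If the indexer is launched from another program, the same idea can be applied by prefixing its command line. This is a minimal Python sketch: the helper names and the indexer.py script are hypothetical, and ionice is only added when it is found on the PATH.

```python
import shutil
import subprocess
import sys


def low_priority_command(command):
    """Prefix `command` with nice (and ionice, if it is installed)."""
    prefix = ["nice", "-n19"]
    if shutil.which("ionice"):
        # Idle I/O class: only use the disk when nothing else wants it.
        prefix += ["ionice", "-c3"]
    return prefix + list(command)


def run_low_priority(command):
    """Run `command` at the lowest CPU priority and wait for it to finish."""
    return subprocess.run(low_priority_command(command), check=True)


# Example (indexer.py is a hypothetical script):
# run_low_priority([sys.executable, "indexer.py"])
```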

Cheers,
    Olly

win 32 | 2 Sep 13:16

Re: German Danish Russian

All right, as I understand it, the default Windows .NET marshalling doesn't
work, because it translates C# Unicode strings to the user's default ANSI
charset (Russian in my case), and it is UTF-8, not ANSI, that Xapian wants.
So I wrote a custom marshaller to perform Unicode -> UTF-8 -> Unicode
conversions:

using System;
using System.Runtime.InteropServices;
using System.Text;

class UTF8StringMarshaler : ICustomMarshaler
{
    private static UTF8StringMarshaler marshaler = null;

    // Called by the runtime to obtain the (shared) marshaler instance.
    public static ICustomMarshaler GetInstance(string cookie)
    {
        if (marshaler == null)
            marshaler = new UTF8StringMarshaler();
        return marshaler;
    }

    // Convert a managed string to a NUL-terminated UTF-8 buffer.
    public IntPtr MarshalManagedToNative(Object ManagedObj)
    {
        if (!(ManagedObj is string))
            throw new ArgumentException("The passed object is not a string",
                                        "ManagedObj");
        byte[] unicodeBytes = Encoding.Unicode.GetBytes((string)ManagedObj);
        byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8,
                                            unicodeBytes);
        // One extra byte for the terminating NUL.
        IntPtr result = Marshal.AllocHGlobal(utf8Bytes.Length + 1);
        Marshal.Copy(utf8Bytes, 0, result, utf8Bytes.Length);
        Marshal.WriteByte(result, utf8Bytes.Length, 0);
        return result;
    }

    public void CleanUpNativeData(IntPtr pNativeData)

