dimazest | 25 Jul 21:03

xapwrap to xappy migration


Hello,

I am refactoring the Moin search code to use xappy instead of xapwrap. The
first thing I tried was querying an existing database using xappy. The
indexing had been done by xapwrap.

Here is the code that queries the database:
#!/usr/bin/env python

import os
import re
import sys
import xappy

_whitespace_re = re.compile(r'\s+')  # raw string so \s is a regex escape

def open_index(dbpath):
    return xappy.SearchConnection(dbpath)

def main(request, argv):
    dbpath = os.path.join(request.cfg.cache_dir, 'xapian/index')
    search = ' '.join(argv[1:])
    sconn = open_index(dbpath)
    print "Searching %d documents for \"%s\"" % (
        sconn.get_doccount(),
        search
    )
    q = sconn.query_parse(search, default_op=sconn.OP_AND)
    print q
[...]

Dominic LoBue | 28 Jul 07:05

Unable to figure out ProcessedDocument's add_term method


Hello,

I'm trying to update a ProcessedDocument with new/changed information,
but so far I've been unable to make any of my changes searchable.

To elaborate, I'm indexing my email, then iterating over the index
records and putting the documents through an email threader and
assigning each thread a UUID. Up to this point I'm doing fine and got
it all under control.

Where I keep getting tripped up is when I then attempt to add the
thread UUID as a stored and searchable term to all of the members of
that particular thread.

When I use the add_term method, nothing happens. When I do
processeddoc.data['thread'] = [threadid] and then use the replace
method to update the record, the new field shows up in the visible
records, but isn't searchable.

Here's the code I'm using to thread and then assign the thread UUID to
members:
******************************
class threadIndexer(object):
    def __init__(self):
        xconn = xappy.IndexerConnection('xap.idx')
        xconn.add_field_action('thread', xappy.FieldActions.INDEX_EXACT)
        xconn.add_field_action('thread', xappy.FieldActions.STORE_CONTENT)
[...]

Richard Boulton | 28 Jul 09:18

Re: Unable to figure out ProcessedDocument's add_term method


On Jul 28, 6:05 am, Dominic LoBue <dom.lo...@...> wrote:
>
> Can anybody tell me what I'm doing wrong?
>

self.writer.replace(__doc) is commented out.

Other than that, it looks about right - if that's not the problem,
perhaps you're not forming the searches for the threadid properly -
post an example of how you're doing a search, and I'll take a look at
that.

--
Richard
Richard Boulton | 28 Jul 09:22

Re: xapwrap to xappy migration


On Jul 25, 8:03 pm, dimazest <dimaz...@...> wrote:
> The problem is that it finds some documents, but I cannot get their
> IDs. Any ideas how I can get the IDs and other fields?

It very much depends on how xapwrap stores its data, but I'm afraid I
don't know anything about how it does that.

I suspect that it's going to be quite hard to make a database built
with xapwrap searchable with xappy.  Even if you can get the IDs and
other field prefix mappings set correctly (which would involve hacking
into the internals of xappy to directly set its prefix map - not a
very robust approach), the way in which text is indexed with xapwrap
is unlikely to be identical to xappy, which will lead to poor search
performance at best; often searches simply won't return the right
results.
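
To illustrate the prefix problem, here is a toy model in plain Python. It
is only a sketch: plain sets and dicts stand in for the index, the function
names are made up, and the reader-side prefixes are hypothetical. It is not
how xapian, xapwrap or xappy are implemented internally. It only shows why
a searcher with a different field-to-prefix map cannot see terms that are
actually in the database.

```python
# Toy model only: a set of strings stands in for the term index.

def index_terms(fields, prefix_map):
    """Return the set of prefixed terms a writer would put in the index."""
    terms = set()
    for field, value in fields.items():
        prefix = prefix_map[field]
        for word in value.lower().split():
            terms.add(prefix + word)
    return terms


def matches(terms, field, word, prefix_map):
    """Check whether a searcher using prefix_map finds word in field."""
    prefix = prefix_map.get(field)
    if prefix is None:
        return False  # the searcher does not know this field at all
    return (prefix + word.lower()) in terms


# Hypothetical prefix maps: the writer and the reader disagree.
writer_prefixes = {'title': 'S', 'mimetype': 'T'}
reader_prefixes = {'title': 'XA', 'mimetype': 'XB'}

terms = index_terms({'title': 'Front Page', 'mimetype': 'text/plain'},
                    writer_prefixes)

found_with_writer_map = matches(terms, 'title', 'front', writer_prefixes)
found_with_reader_map = matches(terms, 'title', 'front', reader_prefixes)
```

With the writer's own map the lookup succeeds; with the reader's map the
same word in the same field is not found, even though the term is present.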

Instead, I think you'd be much better off trying to build new indexes
from scratch with xappy.

--
Richard
Dominic LoBue | 28 Jul 10:43

Re: Unable to figure out ProcessedDocument's add_term method


Ah, there we go.

So the problem was that when I used just add_term, that's all it did: add
the term to the db. It didn't store the string and show it in the
results like I expected it to. So when I tried setting the values
equal to each other, the value was stored as a string but wasn't
searchable. And since the UUID is 100% random, what was searchable and
what was showing up in the results were two different things.
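
A toy model of the distinction, in plain Python. ToyDocument and search()
are made-up names for illustration and are not xappy's classes; the only
assumption carried over from the discussion above is that searchable terms
and stored (displayed) data are kept separately.

```python
class ToyDocument(object):
    """Toy stand-in for a processed document (not xappy's class)."""

    def __init__(self):
        self.terms = set()  # what searches match against
        self.data = {}      # what is shown in the results

    def add_term(self, field, value):
        # Only makes the value searchable; nothing is stored for display.
        self.terms.add((field, value))


def search(docs, field, value):
    """Return the documents whose terms contain (field, value)."""
    return [d for d in docs if (field, value) in d.terms]


only_term = ToyDocument()
only_term.add_term('thread', 'uuid-1')   # searchable, but not displayed

only_data = ToyDocument()
only_data.data['thread'] = ['uuid-1']    # displayed, but not searchable

both = ToyDocument()
both.add_term('thread', 'uuid-1')
both.data['thread'] = ['uuid-1']         # searchable and displayed

hits = search([only_term, only_data, both], 'thread', 'uuid-1')
```

Only the first and third documents are found, and only the second and
third show the thread ID in their stored data.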

In short, here's the solution:
*******************************
class threadIndexer(object):
    def __init__(self):
        xconn = xappy.IndexerConnection('xap.idx')
        xconn.add_field_action('thread', xappy.FieldActions.INDEX_EXACT)
        xconn.add_field_action('thread', xappy.FieldActions.STORE_CONTENT)
        xconn.add_field_action('thread', xappy.FieldActions.SORTABLE, type='string')
        self.writer = xconn
        self.writer.set_max_mem_use(max_mem=256*1024*1024)

        import lazythread
        self.thread = lazythread.lazy_thread()

    def test_ids(self):
        for id in self.writer.iterids():
            print id

    def start(self):
[...]

dimazest | 28 Jul 12:57

Re: xapwrap to xappy migration


Hello,

On Jul 28, 9:22 am, Richard Boulton <boulton...@...> wrote:
> On Jul 25, 8:03 pm, dimazest <dimaz...@...> wrote:
>
>
> Instead, I think you'd be much better off trying to build new indexes
> from scratch with xappy.
>

We decided to build the index with xappy from scratch. Now I need to add
terms to fields::

            pdoc = connection.process(doc)
            pdoc.add_term('revision', 'XREV')
            pdoc.add_term('mimetype', 'T')
            pdoc.add_term('title', 'S')
            pdoc.add_term('fulltitle', 'XFT')
            pdoc.add_term('domain', 'XDOMAIN')

But I get
...
  File "/Volumes/RamDisk/moin/src/1.9-xapian-dmilajevs/MoinMoin/search/Xapian.py", line 628, in _index_page_rev
    pdoc.add_term('revision', 'XREV')
  File "/Volumes/RamDisk/moin/src/1.9-xapian-dmilajevs/MoinMoin/support/xappy/datastructures.py", line 116, in add_term
    prefix = self._fieldmappings.get_prefix(field)
  File "/Volumes/RamDisk/moin/src/1.9-xapian-dmilajevs/MoinMoin/
[...]

Richard Boulton | 28 Jul 15:49

Re: xapwrap to xappy migration

2009/7/28 dimazest <dimaz...@...>
> Hello,
>
> On Jul 28, 9:22 am, Richard Boulton <boulton...@...> wrote:
> > On Jul 25, 8:03 pm, dimazest <dimaz...@...> wrote:
> >
> > Instead, I think you'd be much better off trying to build new indexes
> > from scratch with xappy.
>
> We decided to build the index with xappy from scratch. Now I need to add
> terms to fields::
>
>            pdoc = connection.process(doc)
>            pdoc.add_term('revision', 'XREV')
>            pdoc.add_term('mimetype', 'T')
>            pdoc.add_term('title', 'S')
>            pdoc.add_term('fulltitle', 'XFT')
>            pdoc.add_term('domain', 'XDOMAIN')
>
> Am I right that terms are added to the processed documents? Could you
> suggest some documentation describing terms?

You probably don't want to work at the term level at all.  Instead, set up field actions on the database (via an IndexerConnection), create UnprocessedDocuments, and add the UnprocessedDocuments to an IndexerConnection.  The terms will be generated from the text automatically.

See docs/introduction.rst for an introduction to the concepts.
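
As a rough picture of that workflow, here is a toy model in plain Python.
ToyIndexer and the two action constants are invented for this sketch and
have a different interface from xappy's real IndexerConnection,
UnprocessedDocument and FieldActions; the point is only that actions are
configured once per field, and terms then fall out of the raw text without
any add_term calls.

```python
# Toy model of "configure field actions once, then feed in raw documents".
INDEX_FREETEXT = 'index_freetext'   # made-up constant: make text searchable
STORE_CONTENT = 'store_content'     # made-up constant: keep text for display


class ToyIndexer(object):
    def __init__(self):
        self.actions = {}  # field name -> set of actions
        self.docs = []     # processed documents

    def add_field_action(self, field, action):
        self.actions.setdefault(field, set()).add(action)

    def add(self, raw_fields):
        """Turn a list of (field, text) pairs into terms and stored data."""
        terms, data = set(), {}
        for field, text in raw_fields:
            actions = self.actions.get(field, set())
            if INDEX_FREETEXT in actions:
                terms.update((field, w) for w in text.lower().split())
            if STORE_CONTENT in actions:
                data.setdefault(field, []).append(text)
        self.docs.append({'terms': terms, 'data': data})
        return len(self.docs) - 1  # position of the new document


indexer = ToyIndexer()
indexer.add_field_action('title', INDEX_FREETEXT)
indexer.add_field_action('title', STORE_CONTENT)
indexer.add_field_action('mimetype', STORE_CONTENT)

doc_id = indexer.add([('title', 'Front Page'), ('mimetype', 'text/plain')])
processed = indexer.docs[doc_id]
```

The title ends up both searchable and stored, while the mimetype is stored
only, purely as a consequence of the actions registered up front.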

-- 
Richard

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "xappy-discuss" group.
To post to this group, send email to xappy-discuss-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
To unsubscribe from this group, send email to xappy-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org
For more options, visit this group at http://groups.google.com/group/xappy-discuss?hl=en
-~----------~----~----~----~------~----~------~--~---

dimazest | 28 Jul 18:43

Re: xapwrap to xappy migration


Thank you for the reply.

Another question: do I need to care about stemming, or should I just set
the lang parameter for FREE_TEXT actions? Is it possible to set different
languages for different documents?

On Jul 28, 3:49 pm, Richard Boulton <rich...@...> wrote:
> 2009/7/28 dimazest <dimaz...@...>
>
>
>
> > Hello,
>
> > On Jul 28, 9:22 am, Richard Boulton <boulton...@...> wrote:
> > > On Jul 25, 8:03 pm, dimazest <dimaz...@...> wrote:
>
> > > Instead, I think you'd be much better off trying to build new indexes
> > > from scratch with xappy.
>
> > We decided to build index with xappy from scratch. Now I need to add
> > terms to fields::
>
> >            pdoc = connection.process(doc)
> >            pdoc.add_term('revision', 'XREV')
> >            pdoc.add_term('mimetype', 'T')
> >            pdoc.add_term('title', 'S')
> >            pdoc.add_term('fulltitle', 'XFT')
> >            pdoc.add_term('domain', 'XDOMAIN')
>
> > Am I right that terms are added to the processed documents? Could you
> > suggest some documentation describing terms.
>
> You probably don't want to work at the term level at all.  Instead, set up
> field actions on the database (via an IndexerConnection), create
> UnprocessedDocuments, and add the UnprocessedDocuments to an
> IndexerConnection.  The terms will be generated from the text automatically.
>
> See docs/introduction.rst for an introduction to the concepts.
>
> --
> Richard
Richard Boulton | 28 Jul 18:52

Re: xapwrap to xappy migration

2009/7/28 dimazest <dimaz...@...>
>
> Thank you for the reply.
>
> Another question: do I need to care about stemming, or should I just set
> the lang parameter for FREE_TEXT actions? Is it possible to set different
> languages for different documents?

If you set the "language" parameter for free text actions, xappy will take care of stemming for you.

It's not really possible for different documents to have different languages.  You can, however, set different fields to have different languages (so one field could be text_en and be in English, and another could be text_fr and be in French).  However, if you do this, you'll need to decide which language to search in at query construction time, and use the appropriate field (e.g., with query_parse(default_allow="text_fr")).  You can't easily mix French and English queries (for example) because the stemming algorithm used at search time needs to be the same as that applied to the field at index time.
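
A small illustration of why the stemmers have to match, in plain Python.
The two "stemmers" here are deliberately crude suffix strippers written for
this sketch (they are not the algorithms xappy or Xapian use), and the
helper names are made up.

```python
def stem_en(word):
    """Crude English-like stemmer: strips a few suffixes (illustration only)."""
    for suffix in ('ing', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def stem_fr(word):
    """Crude French-like stemmer: strips a few suffixes (illustration only)."""
    for suffix in ('ement', 'es', 'e'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# Each field is tied to one stemmer at index time.
STEMMERS = {'text_en': stem_en, 'text_fr': stem_fr}


def index_field(field, text):
    """Return the stemmed (field, term) pairs for one field."""
    stem = STEMMERS[field]
    return set((field, stem(w)) for w in text.lower().split())


def query_matches(index, field, word, stemmer):
    """Stem the query word with the given stemmer and look it up."""
    return (field, stemmer(word.lower())) in index


index = index_field('text_fr', 'Il court rapidement')

right = query_matches(index, 'text_fr', 'rapidement', stem_fr)  # same stemmer
wrong = query_matches(index, 'text_fr', 'rapidement', stem_en)  # mismatched
```

The query stemmed with the field's own stemmer is reduced to the same term
that was indexed and matches; the one stemmed with the other language's
stemmer is left in a different form and misses.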

-- 
Richard 


