Robert Kaye | 2 Jun 21:19 2008

Re: Ordering search results and defining a custom Weight class in python


On May 30, 2008, at 6:21 PM, Olly Betts wrote:
> Yes, BM25Weight has several parameters which can be adjusted to change
> the emphasis of the weighting.  If your documents are typically quite
> short, then you probably will get better results if you make the
> document length less important.

Awesome -- thanks for the excellent tip. With just a little tweaking  
the search results have improved drastically.

I've asked for some help testing our new search service, and that has
turned up that we're having problems properly tokenizing Chinese text.
Our database can conceivably have text from all languages supported by
Unicode, and we'd need to find a way to properly tokenize Chinese text.
I've seen a few posts from last year talking about a Chinese
tokenization scheme, but I haven't found anything about that in the
official docs.

Is there a preferred way (in python) to handle the tokenization of  
Chinese characters?

Thanks for your help!

--

--ruaok      Somewhere in Texas a village is *still* missing its idiot.

Robert Kaye     --     rob <at> eorbit.net     --    http://mayhem-chaos.net
Iain Emsley | 2 Jun 22:43 2008

Creating or Opening a Database

Just starting out with Xapian and the Python bindings but I've not been able
to get my code to open or create a database file to index the text using the
simple index as a start. I'd be grateful for any advice on what I've done
incorrectly in the _db code:
import xapian
import string

class TextIndex (object):
 _db = None
def get_db ():
    if _db == None:
     _db = xapian.WritableDatabase ('c:\\index', xapian.DB_CREATE_OR_OPEN)
    return _db

def index (db):
#need to add in time taken to index
  try:

    indexer = xapian.TermGenerator()
    stemmer = xapian.Stem("english")
    indexer.set_stemmer(stemmer)
    texts = open('C:\\webroot\\milton\\miltontest.txt')

    para = ''
    try:
        for line in texts:
            line = string.strip(line)
            if line == '':
                if para != '':
                    # We've reached the end of a paragraph, so index it.

Olly Betts | 2 Jun 23:32 2008

Re: Creating or Opening a Database

On Mon, Jun 02, 2008 at 09:43:35PM +0100, Iain Emsley wrote:
> Just starting out with Xapian and the Python bindings but I've not been able
> to get my code to open or create a database file to index the text using the
> simple index as a start. I'd be grateful for any advice on what I've done
> incorrectly in the _db code:

How does it fail?  If there's an error, what is it?

Also, what version of Xapian is this with?

Cheers,
    Olly
Olly Betts | 2 Jun 23:44 2008

Re: Ordering search results and defining a custom Weight class in python

On Mon, Jun 02, 2008 at 12:19:58PM -0700, Robert Kaye wrote:
> I've asked for some help testing our new search service, and that has
> turned up that we're having problems properly tokenizing Chinese text.
> Our database can conceivably have text from all languages supported by
> Unicode, and we'd need to find a way to properly tokenize Chinese text.
> I've seen a few posts from last year talking about a Chinese
> tokenization scheme, but I haven't found anything about that in the
> official docs.

The current state is that Chinese characters are interpreted as word
characters pretty much the same way that A-Z, 0-9, etc are.  So text
which consists of such characters without spaces doesn't really work
very well.

I'd like to add support for a better indexing/searching approach for
Chinese (and other languages which work in a similar way).  Someone
provided some standalone code for tokenising Chinese (which is probably
what you were looking at in the archives), so it's mostly a matter of
integrating this, or using it as a model for implementing something
similar if it isn't a good fit.

> Is there a preferred way (in python) to handle the tokenization of  
> Chinese characters?

I'm not aware of one.

But a simple hack which might help for now is to insert a space between
any two adjacent Chinese characters before indexing or searching.
Particularly for the short "documents" you're looking at, that should
work pretty well.
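Olly's hack can be sketched in Python. Note that the \u4e00-\u9fff range below covers only the main CJK Unified Ideographs block (the extension blocks, kana, hangul, etc. live at other code points), so this is a starting point rather than a complete solution:

```python
import re

# Main CJK Unified Ideographs block only; extension blocks live at
# other code points.
CJK = re.compile(r'([\u4e00-\u9fff])')

def space_cjk(text):
    """Insert a space between adjacent CJK characters so each
    character is indexed (and searched) as a separate term."""
    spaced = CJK.sub(r' \1 ', text)
    # Collapse the double spaces the substitution introduces.
    return re.sub(r'\s+', ' ', spaced).strip()
```

Applying the same transformation to both the indexed text and the query string keeps the two sides consistent.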

James Aylett | 3 Jun 03:07 2008

Re: Creating or Opening a Database

On Mon, Jun 02, 2008 at 09:43:35PM +0100, Iain Emsley wrote:

> class TextIndex (object):
>  _db = None
> def get_db ():
>     if _db == None:
>      _db = xapian.WritableDatabase ('c:\\index', xapian.DB_CREATE_OR_OPEN)
>     return _db

This isn't how to implement a singleton in Python. At a minimum, you
need a global _db statement inside the get_db() function.
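James's point can be sketched as follows. The factory argument is my stand-in for the xapian.WritableDatabase('c:\\index', xapian.DB_CREATE_OR_OPEN) call, so the sketch runs without Xapian installed:

```python
_db = None

def get_db(factory=object):
    # Without this declaration, the "_db = ..." assignment below would
    # bind a new local variable and the module-level _db would stay None.
    global _db
    if _db is None:
        # In the real code the factory would be:
        # xapian.WritableDatabase('c:\\index', xapian.DB_CREATE_OR_OPEN)
        _db = factory()
    return _db
```

With the global declaration in place, get_db() returns the same object on every call.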

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org
Robert Kaye | 3 Jun 05:54 2008

Re: Ordering search results and defining a custom Weight class in python


On Jun 2, 2008, at 2:44 PM, Olly Betts wrote:
>
> But a simple hack which might help for now is to insert a space between
> any two adjacent Chinese characters before indexing or searching.
> Particularly for the short "documents" you're looking at, that should
> work pretty well.

Nice hack! It looks OK to my Western eyes -- I'll ask more
knowledgeable people to test it.

Thanks!

--

--ruaok      Somewhere in Texas a village is *still* missing its idiot.

Robert Kaye     --     rob <at> eorbit.net     --    http://mayhem-chaos.net
Iain Emsley | 3 Jun 09:41 2008

Re: Creating or Opening a Database

When I run it from the command line or use Run Module in IDLE, it appears
to run, but the expected c:\index folder does not appear and the file
being pointed at is not read. I'm using the Windows installer from
http://www.raptorized.com/xapian-python-win32/ which appears to be Xapian
1.0.6 running on Python 2.5.

I had this issue when I first started, when changing the simpleindex.py
script line:
database = xapian.WritableDatabase(sys.argv[1], xapian.DB_CREATE_OR_OPEN)
so that sys.argv[1] pointed at an index on my c: drive.

I'm assuming that the _db database is not being opened, so nothing is
being indexed.

MTIA, Iain

On Mon, Jun 2, 2008 at 10:32 PM, Olly Betts <olly <at> survex.com> wrote:

> On Mon, Jun 02, 2008 at 09:43:35PM +0100, Iain Emsley wrote:
> > Just starting out with Xapian and the Python bindings but I've not been
> able
> > to get my code to open or create a database file to index the text using
> the
> > simple index as a start. I'd be grateful for any advice on what I've done
> > incorrectly in the _db code:
>
> How does it fail?  If there's an error, what is it?
>
> Also, what version of Xapian is this with?
>

Kevin Duraj | 4 Jun 01:23 2008

Re: Xapian Terms vs. Document Partition.

Yeah, it is getting tougher to get 100 million web sites indexed on
one machine. But I've learned a few things to make it faster. A big
problem on the Internet today is that around 20% of web sites are
completely dedicated to spamdexing and another 20% to xxx, and they
link to each other and also to legitimate web sites. My crawlers
used to get stuck in cycles of spamdex or xxx web sites and could not
get out of them. However, it is not hard to write pattern-recognition
software in Perl that detects these web sites and avoids going there.
But the patterns must be evaluated and often rewritten, which is time
consuming. There are also many web sites with different URLs serving the
same content, so I run a SHA1 hash over the first part of each page's
content and use that as a unique key; that made the search faster
and removed duplicate web sites. That seems to be done, and you can
see my index at http://myhealthcare.com is almost completely clean
of these types of web sites.
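The deduplication step described above can be sketched like this; the 4096-byte prefix is my guess, since the post doesn't say how much of the content is hashed:

```python
import hashlib

def content_key(text, prefix_bytes=4096):
    """SHA1 of the leading portion of the page body, used as a
    near-duplicate key."""
    head = text.encode('utf-8')[:prefix_bytes]
    return hashlib.sha1(head).hexdigest()

seen = set()

def is_duplicate(text):
    """True if a page with the same content prefix was already seen."""
    key = content_key(text)
    if key in seen:
        return True
    seen.add(key)
    return False
```

Two pages that differ only after the hashed prefix collapse onto one key, which is exactly the trade-off this scheme accepts for speed.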

Another thing is that my crawlers brought a lot of Asian web sites
into the index, and because they use different characters they make
the postlist of index terms really big. So I am trying several
scenarios that analyze the text and detect whether it is readable,
using the Gunning fog and Flesch-Kincaid algorithms; that takes care
of separating all the good text from bad text, or text that does not
originate in a Cyrillic charset, which also includes the Asian
languages. I ended up with a second search engine that I run on
http://pacificair.com where you can search for Chinese, Japanese,
Korean and some other non-Cyrillic web sites.
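A readability filter along these lines can be sketched with the standard Flesch-Kincaid grade-level formula. The syllable counter here is my crude vowel-group heuristic, not a dictionary lookup, so treat the score as a filtering signal rather than an exact value:

```python
import re

def flesch_kincaid_grade(text):
    """Rough Flesch-Kincaid grade level of a piece of text.

    Uses the published formula 0.39*(words/sentences)
    + 11.8*(syllables/words) - 15.59, with a vowel-group heuristic
    standing in for real syllable counting."""
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(max(1, len(re.findall(r'[aeiouy]+', w.lower())))
                    for w in words)
    return (0.39 * len(words) / sentences
            + 11.8 * syllables / len(words) - 15.59)
```

Pages that yield no Latin-script words at all score 0.0 here, which is one cheap way to route non-Latin text to a separate index.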

There are few things on my mind including that most probably in Xapian
we can rewrite the scriptindex to create index distributed into
separate indexes based on terms across many servers and then search

James Aylett | 4 Jun 11:44 2008

Re: Xapian Terms vs. Document Partition.

On Tue, Jun 03, 2008 at 04:23:31PM -0700, Kevin Duraj wrote:

> Another thing is that my crawlers brought a lot of Asian web sites
> into the index, and because they use different characters they make
> the postlist of index terms really big.

Out of interest, do you have (or could you generate) a stat for how
many of these mark their languages correctly (either xml:lang in
XHTML, or lang in HTML4, or some other method - there's probably a
META one, but the first two are preferred)?

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org
Juan Gargiulo | 4 Jun 09:53 2008

Workaround for mod_python not working

Hi,

I am experiencing the problem described in ticket 185 (
http://trac.xapian.org/ticket/185), and the suggested workaround is not
working for me. 
I am running Apache 2.0.x, Python 2.5, mod_python 3.3.1, xapian-core 1.0.6,
and xapian-bindings 1.0.6, all under Mac OS X 10.5, 64-bit.

I am using PythonInterpreter main_interpreter but I am still experiencing
the hang.

Any help is welcome.

Thanks,

Juan
