Richard Boulton | 1 Jul 11:34 2004

Re: Xapian 0.8.1 released

Olly Betts wrote:
> I've uploaded Xapian 0.8.1:

Debian packages are now available from the Xapian website:

If you're running Debian stable, add the following to your sources.list:

deb http://www.xapian.org/debian stable main
deb-src http://www.xapian.org/debian stable main

If you're running Debian unstable or testing, add the following:

deb http://www.xapian.org/debian unstable main
deb-src http://www.xapian.org/debian unstable main

Note that these packages are still incomplete: while they should work 
well enough for most purposes, they fail lintian tests because manpages 
for some of the binaries are missing.  In addition, only the Python 
bindings are packaged so far, and the omega package doesn't perform any
automatic configuration (ideally, it would be possible to configure it 
at install time to index, for example, the system documentation).
There are also some other lintian failures with the stable packages, 
which don't look serious but need addressing.

Note also that packages are only built for i386.  If you're on another 
architecture, you can build your own by adding the "deb-src" line above,
then:

# su -
# apt-get update
(Continue reading)
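
The usual continuation of that recipe (a sketch, not from the original
message: the source package is assumed here to be called xapian-core,
and the names of the resulting .deb files will vary by version):

# apt-get build-dep xapian-core
# apt-get source -b xapian-core
# dpkg -i *.deb

apt-get build-dep installs the build dependencies, apt-get source -b
fetches the source package and compiles it, and the final dpkg -i
assumes the freshly built .deb files are the only ones in the current
directory.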

Olly Betts | 1 Jul 13:05 2004

Re: Xapian 0.8.1 released

On Thu, Jul 01, 2004 at 10:34:29AM +0100, Richard Boulton wrote:
> [If you're on testing, you'll currently need to install the version of 
> swig in unstable.  If you're on stable, you'll need to locally install 
> swig 1.3.20 or later and tell dpkg to ignore build dependencies, or make 
> up your own package of swig too.]

For stable, you can get swig 1.3.21 from backports.org:

http://www.backports.org/package.php?search=swig

Cheers,
    Olly
Eric B. Ridge | 1 Jul 16:55 2004

Re: Xapian 0.8.1 released

On Jun 30, 2004, at 5:13 PM, Olly Betts wrote:

> I've uploaded Xapian 0.8.1:
>
> http://www.xapian.org/download.php
>
> This release contains a few important fixes to 0.8.0 - if you're using
> 0.8.0, we recommend upgrading.

Thanks for getting this out so quickly.  We'll be upgrading to it next 
weekend.

eric
Olly Betts | 1 Jul 20:37 2004

Re: "dangerous" patch

On Wed, Jun 30, 2004 at 10:44:07PM +0100, Olly Betts wrote:
> Attached is a patch which turns off the B-tree versioning mechanism
> during update.  You'll lose atomic updates, so if the indexer or machine
> crashes, you've probably got a useless database.  But it reduces the
> number of blocks written which can speed up updating quite a bit if
> it is I/O bound - it's good for reindexing from scratch, for example.
> 
> Here's a couple of graphs showing performance on a fairly low spec
> machine.  The first is without, the second with.  X axis is database
> size, Y axis is documents added per second:
> 
> http://www.survex.com/~olly/dangerous.png
> 
> These graphs don't show off the gain especially well, but they're what
> I have to hand.  Notice how the second graph only drops away from 20
> docs/sec at around 800,000 documents, while the first drops below that
> at around 350,000.

I realised just overlaying the two graphs would make the difference more
obvious - a couple of minutes with the gimp gives this:

http://www.survex.com/~olly/dangerous2.png

Green is standard, red is "dangerous".
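
For anyone wanting to reproduce this kind of measurement, here's a
minimal sketch using the Python bindings (modern-style API shown; the
0.8.x bindings spell some of these calls differently, and corpus.txt
is a hypothetical one-document-per-line input):

    import time
    import xapian

    db = xapian.WritableDatabase("testdb", xapian.DB_CREATE_OR_OPEN)

    BATCH = 10000  # report throughput every BATCH documents
    added = 0
    start = time.time()

    for line in open("corpus.txt"):
        doc = xapian.Document()
        doc.set_data(line)
        for term in line.lower().split():
            doc.add_term(term[:64])  # crudely cap term length
        db.add_document(doc)
        added += 1
        if added % BATCH == 0:
            elapsed = time.time() - start
            print("%d docs, %.1f docs/sec" % (added, BATCH / elapsed))
            start = time.time()

    db.commit()

Plotting the printed pairs gives a curve directly comparable to the
docs/sec graphs above.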

Cheers,
    Olly
Lee Johnson | 2 Jul 10:45 2004

Adding a Web Spider

Hi,
I read the "future of Xapian" thread today. One item in that
thread is especially interesting to me: adding a web spider. We
all know that Xapian is not designed exclusively for that
purpose, but a web spider could greatly increase the usage of
Xapian. I'm not a programmer, but writing a web spider is rather
simple compared with writing Xapian itself. In turn, Xapian could
gain lots of users; those users become familiar with Xapian, use
it in other areas, tell others about it, and so on.

I'm saying this because I also need a crawler for Xapian. I have
hand-picked a rather big list of URLs (just the URLs, not the
contents) and need a crawler to crawl all pages beneath those
URLs and put their content into a database, so I can use Xapian
to index and search it. I'm very open to suggestions. I looked at
Nutch, Heritrix and Larbin (the last one probably just fetches
the URLs, not the contents; I asked the developer about this but
have had no answer yet), but with those I cannot use Xapian (if I
use one of them, I will probably use mnogosearch instead).
Another thing about Nutch and Heritrix is that they are written
in Java, which, IMHO, is not a good idea.

Also, for those interested, a good read may be
http://acmqueue.com/modules.php?name=Content&pa=list_pages_issues&issue_id=12
which devoted that month's issue to the topic of search.

(Continue reading)

rm | 2 Jul 12:38 2004

Re: Adding a Web Spider

On Fri, Jul 02, 2004 at 01:45:29AM -0700, Lee Johnson wrote:
> Hi,
> I read the "future of Xapian" thread today. One item in that
> thread is especially interesting to me: adding a web spider. We
> all know that Xapian is not designed exclusively for that
> purpose, but a web spider could greatly increase the usage of
> Xapian. I'm not a programmer, but writing a web spider is rather
> simple compared with writing Xapian itself.

I happen to call myself a programmer and have written some
crawlers myself. No, writing a _good_ crawler is _far_ from
simple: you need a rather error-tolerant HTML/XHTML parser, a
good HTTP lib, smart tracking of ETag headers and content hash
sums, and, increasingly, a rather capable ECMAScript interpreter
(for those stupid JavaScript links...).
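
To make the ETag and content-hash bookkeeping concrete, a recrawl
check might look roughly like this (a Python sketch; persistent
storage of the per-URL state and retry logic are left out):

    import hashlib
    import urllib.error
    import urllib.request

    def fetch_if_changed(url, etag=None, old_hash=None):
        # Returns (content, etag, digest); content is None if unchanged.
        req = urllib.request.Request(url)
        if etag:
            # Ask the server to answer 304 if its ETag still matches.
            req.add_header("If-None-Match", etag)
        try:
            resp = urllib.request.urlopen(req, timeout=30)
        except urllib.error.HTTPError as e:
            if e.code == 304:  # Not Modified
                return None, etag, old_hash
            raise
        body = resp.read()
        digest = hashlib.sha1(body).hexdigest()
        if digest == old_hash:
            # Server ignored the ETag but the content is identical.
            return None, resp.headers.get("ETag"), digest
        return body, resp.headers.get("ETag"), digest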

I agree, it's pretty trivial to hack a _bad_ crawler in
one of the P languages, but there are already quite a few
out there in the wild that you can catch and abuse :-)

> In turn, Xapian could gain lots of users; those users become
> familiar with Xapian, use it in other areas, tell others about
> it, and so on.

??? Is this really how it works? People use Xapian because
they need powerful IR technology. Word of mouth is a bad
advisor in such areas. I doubt that mifluz is used more
often because of its use in htdig -- actually, I'm still
(Continue reading)

James Aylett | 2 Jul 13:17 2004

Re: Adding a Web Spider

On Fri, Jul 02, 2004 at 12:38:33PM +0200, rm <at> fabula.de wrote:

> I happen to call myself a programmer and have written some
> crawlers myself. No, writing a _good_ crawler is _far_ from
> simple (you need a rather error-tolerant HTML/XHTML parser,
> a good http lib, smart tracking of Etag headers and content
> hash sums, and more and more a rather capable ECMA-script
> interpreter (for those stupid javascript links ....).

I'd echo that; I haven't written a crawler for indexing, but I've
written similar systems at work, and they tend to be fairly painful
:-/

However, if we were to come up with some sort of modular design
for a spider/indexer pair, and implement it well, it might indeed
help. But I do wonder how many people actually need something like
that? Surely most potential uses of an IR system will be working with
local data? (Larger institutions need spiders, so I can see the appeal
for consultancy companies, and I'll certainly support and offer
suggestions if anyone is going to write one. Just don't think it's
going to be easy :-)

> Use Perl with the LWP lib to fetch the documents,
> parse them with the Perl libxml2 parser (which has a pretty
> good HTML mode), use libxml2's Reader API to fetch all
> URLs and push them onto a stack of jobs. Use Xapian's 
> Perl bindings to do the actual indexing. Nothing too
> hard. But: if the resources you grab aren't on your servers,
> you might want to honor robots.txt, add delays to the
> job queue, check for dynamic content, etc.
(Continue reading)
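
A rough Python transliteration of that recipe, using only the
standard library plus the Xapian bindings (a sketch: it indexes raw
page text crudely, tags and all, and skips the robots.txt handling
and politeness delays the quote warns about; the seed URL is a
placeholder):

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    import xapian

    class LinkExtractor(HTMLParser):
        # HTMLParser is reasonably tolerant of broken HTML.
        def __init__(self, base):
            super().__init__()
            self.base = base
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base, value))

    db = xapian.WritableDatabase("webdb", xapian.DB_CREATE_OR_OPEN)
    jobs = ["http://www.example.org/"]  # hand-picked seed URLs
    seen = set(jobs)

    while jobs:
        url = jobs.pop()  # the "stack of jobs" from the quote
        try:
            page = urllib.request.urlopen(url, timeout=30).read()
        except Exception:
            continue  # a real crawler would log and retry
        text = page.decode("utf-8", "replace")
        doc = xapian.Document()
        doc.set_data(url)
        for term in text.lower().split():
            doc.add_term(term[:64])  # crudely cap term length
        db.add_document(doc)
        extractor = LinkExtractor(url)
        extractor.feed(text)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                jobs.append(link)

    db.commit()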

Olly Betts | 2 Jul 14:22 2004

Re: Adding a Web Spider

On Fri, Jul 02, 2004 at 12:17:52PM +0100, James Aylett wrote:
> On Fri, Jul 02, 2004 at 12:38:33PM +0200, rm <at> fabula.de wrote:
> 
> > I happen to call myself a programmer and have written some
> > crawlers myself. No, writing a _good_ crawler is _far_ from
> > simple: you need a rather error-tolerant HTML/XHTML parser, a
> > good HTTP lib, smart tracking of ETag headers and content hash
> > sums, and, increasingly, a rather capable ECMAScript interpreter
> > (for those stupid JavaScript links...).
> 
> I'd echo that; I haven't written a crawler for indexing, but I've
> written similar systems at work, and they tend to be fairly painful
> :-/

Several years ago I bumped into someone I was at university with and we
were catching up with what we'd been up to.  I told him I was writing
web crawlers these days.  He responded "Oh how dull, that's just a big
hash table!"

This always makes me think of the Python sketch in which John Cleese
teaches us how to play the flute:

    "You blow there and you move your fingers up and down here."

For the full sketch see: http://orangecow.org/pythonet/sketches/toridof.htm

I've written several web crawlers, and it's hard to do well.  The fact I
had to write several is a clue - in the process of writing one and
seeing how it performs and the problems it runs into, you learn a lot
and are then able to rework to produce something better.
(Continue reading)

James Aylett | 2 Jul 15:05 2004

Re: Adding a Web Spider

On Fri, Jul 02, 2004 at 01:22:41PM +0100, Olly Betts wrote:

> > If you use Python, there's a robots.txt implementation in the
> > library. Although IIRC it's buggy :-(
> 
> All the standard robots.txt implementations I've seen implement the
> spec.  Sadly almost nobody who writes robots.txt files seems to read
> the spec...

ISTR something by Mark Pilgrim saying the Python one didn't behave
itself properly; he wrote his own for the Ultra Liberal Feed
Parser. I note that he's fixing two bugs, one fixed in Python 2.3a2;
there's another which appears to still be open in Python itself which
he patches (bug 690214 - although this doesn't appear to be a valid bug).

It doesn't help that the robots.txt spec is fairly poorly written, and
exists only as an expired I-D. It also lacks some useful features,
which makes certain types of robot control policy tortuous or
impossible to implement :-(

Having said that, it's fairly simple even to deal with weird
robots.txt files. But the effort does all add up ... :-(
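
For reference, the standard-library interface being discussed looks
like this in modern Python (in the 2.x versions current at the time,
the module was called plain robotparser):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.org/robots.txt")
    rp.read()  # fetch and parse the file

    # True if a crawler sending this User-Agent may fetch the URL.
    if rp.can_fetch("MyCrawler/0.1", "http://www.example.org/private/"):
        print("allowed")
    else:
        print("disallowed by robots.txt")

Given the spec ambiguities mentioned above, it's still worth testing
this against the particular robots.txt files you care about.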

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org
(Continue reading)

Francis Irving | 5 Jul 14:25 2004

Expected another key with the same term name

Just started getting this error when adding/changing entries in the
TheyWorkForYou Xapian database.

Exception: Expected another key with the same term name but found a
different one at ./index.pl line 114.

The line of code in my Perl indexing script is:
        $db->add_document($::doc);

What might I be doing that suddenly provokes this?  I have been
stopping some indexing half way through, but surely that should
be safe?  We're still using 0.8.0; might upgrading help?

Francis
-- 
Report card for your MP   http://www.theyworkforyou.com
