Olly Betts | 6 Jun 2003 03:12

omindex --duplicates=duplicate ?

I've just been converting omega's omindex to use the Xapian namespace.

I'm rather puzzled by the duplicate handling options.

You can choose to replace an existing document with the same URL as one
being added (which is sensible - it means you can reindex a document
tree by running omindex with an existing database, though deleted
documents won't be removed from the index).

You can also ignore a duplicate, which I can see might be useful in odd
circumstances - you might build a database by running omindex several
times with different filename->URL mappings, and two could produce the
same URL I suppose.  But this option seems to be redundant, since you
could just run the omindex commands in reverse order while replacing
duplicates, so that the first, preserved entry becomes the last,
replacing entry.  That way you can also update an existing database,
which you can't if you ignore duplicates.

The final option is to create another record with the same URL.  I'm
hard pushed to think of circumstances in which this would be a useful
thing to do...
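
The three behaviours described above can be modelled with a toy in-memory
index. This is only a sketch: `ToyIndex`, its `add` method and the
`duplicates` argument are hypothetical names made up for illustration, not
omindex's actual code. Only the policy names (replace, ignore, duplicate)
come from the thread.

```python
class ToyIndex:
    """In-memory stand-in for a database: a list of (url, text) records."""

    POLICIES = ("replace", "ignore", "duplicate")

    def __init__(self):
        self.records = []

    def add(self, url, text, duplicates="replace"):
        """Add a record, handling an already-indexed URL per `duplicates`."""
        if duplicates not in self.POLICIES:
            raise ValueError("unknown duplicates policy: %r" % (duplicates,))
        already_indexed = any(u == url for u, _ in self.records)
        if already_indexed and duplicates == "ignore":
            return  # keep the existing record, drop the new one
        if already_indexed and duplicates == "replace":
            # discard every existing record for this URL before adding
            self.records = [(u, t) for u, t in self.records if u != url]
        # "duplicate" falls through and simply adds another record
        self.records.append((url, text))


# Two runs in one order, ignoring duplicates ...
forward = ToyIndex()
forward.add("/page", "first", duplicates="ignore")
forward.add("/page", "second", duplicates="ignore")

# ... versus the same two runs in reverse order, replacing duplicates.
reverse = ToyIndex()
reverse.add("/page", "second", duplicates="replace")
reverse.add("/page", "first", duplicates="replace")
# Both indexes end up holding only the "first" version of /page.
```

In this model the two runs give the same result, which is the redundancy
argument made above. It does not model the cost of reindexing pages that
are already present.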

Does anybody using omega use --duplicates with anything apart from the
default setting (replace duplicates)?  If so, how are you making use of
it?

Cheers,
    Olly


James Aylett | 6 Jun 2003 11:10

Re: omindex --duplicates=duplicate ?

On Fri, Jun 06, 2003 at 02:12:12AM +0100, Olly Betts wrote:

> I've just been converting omega's omindex to use the Xapian namespace.
> I'm rather puzzled by the duplicate handling options.
> 
> You can ignore a duplicate, which I can see might be useful in odd
> circumstances.

It's useful in a common situation for very simple use. I have a
website that contains a diary; most pages don't change (if they do, I
can reindex them with replace duplicates). Most of the time I've added
several files (because diary entries appear in batches). The easiest
solution is to reindex, say, the month or year that contains all new
entries, ignoring duplicates - this provides a fast and convenient way
of updating a simple site. Similar situations could occur in a small
company with a product list where new products are added regularly -
you index with replacement on the product list page, but index
ignoring duplicates on the directory containing all individual product
pages.

Yes, you could argue that they could use a db-backed system. No, most
of them won't (it's not even always worth it). This supports the
flat-file system with a minimum of fuss.

> The final option is to create another record with the same URL.  I'm
> hard pushed to think of circumstances in which this would be a useful
> thing to do...

My justification for this is weaker, and omindex doesn't fully support
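what I'd want to do with it right now. If a document evolved over
time, you could well want to search old documents. If you used the URL
to give a base (which, on its own, gives the latest version of the
document, but with a time stamp gave the version current at that
time), and if omindex added a timestamp field, the create duplicate
option would be useful.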

James Aylett | 6 Jun 2003 13:03

Re: omindex --duplicates=duplicate ?

On Fri, Jun 06, 2003 at 11:52:28AM +0100, Olly Betts wrote:

> > > You can ignore a duplicate, which I can see might be useful in odd
> > > circumstances.
> Ah, I see.  It effectively means "check for any new pages and index them".

Indeed.

> > > The final option is to create another record with the same URL.
[snip]
> Hmm, that's rather obscure.

I said it was a weak justification :-)

> I'd like to remove duplicate duplicates at some point if there are no
> objections.  I want to rework omindex to allow it to do proper updating
> (checking for removed documents too).  Ignore duplicates and replace
> duplicates fit into this nicely, but duplicate duplicates doesn't...

I have no problems with that at all. Removed documents was something I
never got round to doing. Duplicate duplicates was pretty much there
because it was easy to do ...
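
The "proper updating" idea quoted above comes down to comparing two sets of
URLs: those already in the index and those currently on disk. A minimal
sketch, assuming both sets are available as plain Python collections
(`plan_update` and its argument names are hypothetical, not part of
omindex):

```python
def plan_update(indexed_urls, current_urls):
    """Split URLs into new, removed and still-present groups."""
    indexed = set(indexed_urls)
    current = set(current_urls)
    return {
        # on disk but not yet indexed: what "ignore duplicates" would add
        "new": sorted(current - indexed),
        # indexed but gone from disk: candidates for deletion
        "removed": sorted(indexed - current),
        # present in both: what "replace duplicates" would reindex
        "existing": sorted(indexed & current),
    }


plan = plan_update(
    indexed_urls=["/2003/05.html", "/2003/06.html", "/old.html"],
    current_urls=["/2003/05.html", "/2003/06.html", "/2003/07.html"],
)
```

Under this view, ignore and replace differ only in what happens to the
"existing" group, while duplicate has no natural place in the plan.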

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james <at> tartarus.org                               uncertaintydivision.org


Olly Betts | 6 Jun 2003 12:52

Re: omindex --duplicates=duplicate ?

On Fri, Jun 06, 2003 at 10:10:37AM +0100, James Aylett wrote:
> On Fri, Jun 06, 2003 at 02:12:12AM +0100, Olly Betts wrote:
> 
> > I've just been converting omega's omindex to use the Xapian namespace.
> > I'm rather puzzled by the duplicate handling options.
> > 
> > You can ignore a duplicate, which I can see might be useful in odd
> > circumstances.
> 
> It's useful in a common situation for very simple use. [...]

Ah, I see.  It effectively means "check for any new pages and index them".

> > The final option is to create another record with the same URL.  I'm
> > hard pushed to think of circumstances in which this would be a useful
> > thing to do...
> 
> My justification for this is weaker, and omindex doesn't fully support
> what I'd want to do with it right now. If a document evolved over
> time, you could well want to search old documents. If you used the URL
> to give a base (which, on its own, gives the latest version of the
> document, but with a time stamp gave the version current at that
> time), and if omindex added a timestamp field, the create duplicate
> option would be useful.

Hmm, that's rather obscure.  You'd also need to be careful to reindex
only the changed documents, as otherwise you'll get duplicate URLs with
exactly the same document attached.
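
To make the versioning idea and the caveat about identical copies concrete,
here is a rough sketch that keeps a timestamped history per URL and skips a
new version whose content is unchanged. Everything in it is an assumption
for illustration: omindex has no timestamp field, the function names and
the dictionary layout are made up, and versions are assumed to be added in
increasing timestamp order.

```python
import hashlib


def add_version(versions, url, text, timestamp):
    """Record a new version of `url` unless its content is unchanged.

    Returns True if a version was added, False if it was skipped.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    history = versions.setdefault(url, [])
    if history and history[-1]["digest"] == digest:
        return False  # same content as the latest version: no duplicate
    history.append({"timestamp": timestamp, "digest": digest, "text": text})
    return True


def version_at(versions, url, timestamp=None):
    """Return the text current at `timestamp`, or the latest if None."""
    history = versions.get(url, [])
    if timestamp is not None:
        history = [v for v in history if v["timestamp"] <= timestamp]
    return history[-1]["text"] if history else None


versions = {}
add_version(versions, "/doc", "draft", timestamp=100)
add_version(versions, "/doc", "draft", timestamp=200)   # skipped, unchanged
add_version(versions, "/doc", "final", timestamp=300)
```

Here `version_at(versions, "/doc")` gives the latest text and
`version_at(versions, "/doc", 150)` gives the text that was current at
time 150.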

If you want to achieve this, you'll need to modify omindex to handle
[...]

D. Werner | 12 Jun 2003 10:50

Xapian and other languages

Hello, I am new to this list and to Xapian.

I have a simple question: I know that
some companies in Germany use Xapian for
their intranets with umlauts. When I use
the Omega interface it strips out these
umlauts.

Is there any chance to get these in? How
can I install the German stemming?

Anyone?

Thanks,
David.

Jennifer Woodhead-Browne | 12 Jun 2003 15:37

Thanks :o)

Hiya,

I'd just like to say Xapian really rocks, thanks so much to all the people who have put their time and effort into something this good.

erm that's it really. THANKS again and keep "xapian" ing as it is truly appreciated. (sorry I know I am a bit off topic but I felt it needed to be said)

take care, stay safe,
Jen

__________________________________
Jen Woodhead-Browne
Senior User Interface Developer
User Interface Development
Tel: +44 (0) 113 367 4565
Mobile: +44 (0) 7866 695872

Orange Multimedia Operations
Marshall Mill
Marshall Street
Leeds
LS11 9YJ

http://www.orange-today.co.uk
http://wap.orange-today.co.uk
http://pda.orange-today.co.uk
http://spv.orange-today.co.uk


Olly Betts | 12 Jun 2003 17:36

Re: Xapian and other languages

On Thu, Jun 12, 2003 at 10:50:45AM +0200, D. Werner wrote:
> I have a simple question: I know that
> some companies in Germany use Xapian for
> their intranets with umlauts. When I use
> the Omega interface it strips out these
> umlauts.

This is a known issue:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=9

> Is there any chance to get these in?

It's fairly high up my list of things to fix, as it causes problems for
almost anybody using languages other than English - it'll probably be
fixed in the next week.

> How can I install the German stemming?

At present, you need to edit 3 files in the omega sources and change
"english" to "german".  These are the three places:

omindex.cc:    Xapian::Stem stemmer("english");
query.cc:static const char * STEM_LANGUAGE = "english";
scriptindex.cc: Xapian::Stem stemmer("english");

If you have perl, you can do this like so:

perl -pi -e 's/"english"/"german"/' omindex.cc query.cc scriptindex.cc

This is a temporary situation - the language needs to be configurable
per database.
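
For anyone without perl, the same substitution can be sketched in Python.
This is an illustrative equivalent of the one-liner above, not something
shipped with omega; the function name is made up. Like the perl command
(which has no /g flag), it replaces only the first quoted "english" on
each line.

```python
import re


def switch_stem_language(source, old="english", new="german"):
    """Replace the first quoted `old` on each line of `source` with `new`."""
    pattern = re.compile('"' + re.escape(old) + '"')
    replacement = '"' + new + '"'
    return "".join(
        # a function replacement avoids any backslash processing in `new`
        pattern.sub(lambda match: replacement, line, count=1)
        for line in source.splitlines(keepends=True)
    )


patched = switch_stem_language(
    '    Xapian::Stem stemmer("english");\n'
    'static const char * STEM_LANGUAGE = "english";\n'
)
```

To patch the real files you would read each of omindex.cc, query.cc and
scriptindex.cc, pass the contents through this function, and write the
result back.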

Cheers,
    Olly

Sam Liddicott | 18 Jun 2003 19:47

OmIndex exclusions

I notice omindex excludes files and directories beginning with . from 
being indexed.

I patched my personal copy to also exclude those beginning with _ so 
that less experienced users (ISP customers) can more easily 
exclude content from indexing (MS Frontpage also uses _ to signify meta 
content and private content).

No doubt omindex could do with a better way of specifying recursive 
content to index.  A file list on stdin sounds most likely, _ prefix is 
a hack, but perhaps not too much out of spirit?
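
As a rough illustration of the prefix rule (not omindex's actual directory
walker, which is written in C++), the sketch below lists the files under a
directory while skipping any file or directory whose name starts with "."
or "_". The function name and the `skip_prefixes` parameter are
hypothetical.

```python
import os


def indexable_files(root, skip_prefixes=(".", "_")):
    """Return paths under `root`, relative to it, that pass the prefix rule."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = sorted(
            d for d in dirnames if not d.startswith(skip_prefixes)
        )
        for name in sorted(filenames):
            if not name.startswith(skip_prefixes):
                full_path = os.path.join(dirpath, name)
                found.append(os.path.relpath(full_path, root))
    return found
```

Making the prefixes a parameter keeps the "_" convention optional. The
output is also a plain list of paths, which is the kind of file list the
stdin idea above would consume.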

Anyway, you may all be interested to know that Bigwig.net now offers 
the Xapian search engine to all customers who sign up for a "bronze" account 
(less than 30 UKP per year) or who dial up on average at least 4 hours a 
week.  I'm not trying to sell you bigwig accounts, but this use of 
Xapian may be interesting.

Sam

Olly Betts | 18 Jun 2003 22:57

Re: Xapian and other languages

On Thu, Jun 12, 2003 at 04:36:31PM +0100, Olly Betts wrote:
> On Thu, Jun 12, 2003 at 10:50:45AM +0200, D. Werner wrote:
> > I have a simple question: I know that
> > some companies in Germany use Xapian for
> > their intranets with umlauts. When I use
> > the Omega interface it strips out these
> > umlauts.
> 
> This is a known issue:
> 
> http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=9
> 
> > Is there any chance to get these in?
> 
> It's fairly high up my list of things to fix, as it causes problems for
> almost anybody using languages other than English - it'll probably be
> fixed in the next week.

This is now done, with the exception of highlighting of matching terms
in the sample (the omegascript $highlight command) - that will simply
fail to highlight some words it should at the moment, which is fairly
harmless.

The code is in CVS.  I've done some rudimentary testing - if you'd like
to exercise it more thoroughly, that would be appreciated.

> > How can I install the German stemming?
> 
> At present, you need to edit 3 files in the omega sources and change
> "english" to "german".  [...]  If you have perl, you can do this like so:
> 
> perl -pi -e 's/"english"/"german"/' omindex.cc query.cc scriptindex.cc
> 
> This is a temporary situation - the language needs to be configurable
> per database.

This is still to be done.

Cheers,
    Olly

Daniel Hanks | 20 Jun 2003 18:48

Phrase search with Perl?

Hi folks,

I've been working with the Search::Xapian Perl modules, and I'm wondering whether phrase searching is supported.

I'm trying to do the following:

my $db = Search::Xapian::Database->new('/some/path');
my $enq = $db->enquire(OP_PHRASE, 'this','is','a','phrase');

I then go through the $enq->matches results in the code. When I run the above, it bails out in the
$db->enquire call with "Aborted". If I call it with only two words in the phrase, it works fine, but
I'd prefer not to be limited to two-word phrases :-).

Can someone provide some sample code to implement a simple phrase search using the Perl modules? Or point me
to some docs?

Thanks for any help,
-- Dan
========================================================================
   Daniel Hanks - Systems/Database Administrator
   About Inc., Web Services Division
========================================================================
