Frank John Bruzzaniti | 2 Feb 2009 14:23

Ticket #282: omindex-assorted-enhancements.patch woes

I would really like to try out the features in the patch above.  But I
can't ever seem to get the resulting omindex.cc to "make".

I tried updating to rev 10801 from SVN and then ran ./bootstrap, but I
get errors compiling everything when I run "make" (I'm using Ubuntu
8.10).

So I thought I'd try to apply the patch to the latest stable version
1.0.10.  The patch created a reject file but I edited in the two lines
it rejected myself.

But when I run "make" I get the following errors:

omindex.o: In function `main':
/home/frankie/Desktop/xapian-omega-1.0.10/omindex.cc:874: undefined reference to `read_config_file()'
/home/frankie/Desktop/xapian-omega-1.0.10/omindex.cc:987: undefined reference to `log_dir'
collect2: ld returned 1 exit status

Could anyone suggest what I've done wrong?

FYI Here's the reject file:

***************
*** 397,406 ****
      } else if (mimetype == "text/rtf") {
  	// The --text option unhelpfully converts all non-ASCII characters to
  	// "?" so we use --html instead, which produces HTML entities.
- 	string cmd = "unrtf --nopict --html 2>/dev/null " +
[...]

James Aylett | 2 Feb 2009 14:34

Re: Ticket #282: omindex-assorted-enhancements.patch woes

On Mon, Feb 02, 2009 at 11:53:45PM +1030, Frank John Bruzzaniti wrote:

> I would really like to try out the features in the patch above.  But I
> can't ever seem to get the resulting omindex.cc to "make".
> 
> I tried updating to rev 10801 from SVN and then ran ./bootstrap, but I
> get errors compiling everything when I run "make" (I'm using Ubuntu
> 8.10).

Without some idea of the errors, we probably can't help on this. Does
it compile without the patch?

> So I thought I'd try to apply the patch to the latest stable version
> 1.0.10.  The patch created a reject file but I edited in the two lines
> it rejected myself.

Okay.

> But when I run "make" I get the following errors:
> 
> omindex.o: In function `main':
> /home/frankie/Desktop/xapian-omega-1.0.10/omindex.cc:874: undefined reference to `read_config_file()'
> /home/frankie/Desktop/xapian-omega-1.0.10/omindex.cc:987: undefined reference to `log_dir'
> collect2: ld returned 1 exit status
> 
> Could anyone suggest what I've done wrong?

Looks like that target will need configfile.o as well, although I
[...]

Frank John Bruzzaniti | 2 Feb 2009 16:05

Using Open Office to convert documents.

I wrote a little Python script (oOC.py) that can be inserted as one of
the "helper" apps; it uses unoconv and OpenOffice to convert documents
to text. For example, I was having trouble converting *.doc files that
had been saved with WordPerfect, because antiword couldn't decode them,
so I replaced antiword with oOC.py in the omindex line that calls it.
Theoretically, oOC.py can convert almost any format supported by
OpenOffice and unoconv.

I've done some initial testing and it seems to work OK. I wouldn't
recommend it in a production environment without lots of testing; I
decided to email it for the sake of curiosity.

Basically it runs a headless copy of OpenOffice, which should stay
running and accept requests from unoconv, and prints the results to
stdout.

#!/usr/bin/python
# Python script to convert documents via OpenOffice for Xapian-Omega
# By Frank J Bruzzaniti
# frank.bruzzaniti <at> gmail.com

import os, sys, time
from subprocess import *

# Get pid of any running soffice processes
getpid = Popen(["ps -ef | grep -v grep | grep "
                "'/usr/lib/openoffice/program/soffice.bin -headless "
                "-accept=socket,host=127.0.0.1,port=2002;urp; -nofirststartwizard' | cut "
                "-f3 -d' '"], shell=True, stdout=PIPE).stdout

# Save pid, might be useful
[...]
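For readers who only want the conversion step, here is a minimal sketch of calling unoconv from Python and capturing the text. It is not the oOC.py script above (which is cut off in this archive). The function names are made up for illustration, and the `--stdout` and `-f txt` options are assumed from unoconv's documentation, so check them against your installed version.

```python
import subprocess


def unoconv_command(path):
    """Build the argv for converting `path` to plain text on stdout.

    Hypothetical helper; the flags are assumed from unoconv's documentation.
    """
    return ["unoconv", "--stdout", "-f", "txt", path]


def convert_to_text(path):
    """Run unoconv and return the extracted text, or None on failure."""
    try:
        result = subprocess.run(
            unoconv_command(path),
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return None  # unoconv is not installed or not executable
    if result.returncode != 0:
        return None  # conversion failed; let the caller skip the file
    return result.stdout.decode("utf-8", errors="replace")
```

Returning None on failure lets an indexer skip a file it cannot convert instead of aborting the whole run.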

James Aylett | 2 Feb 2009 20:25

Re: Using Open Office to convert documents.

On Tue, Feb 03, 2009 at 01:35:14AM +1030, Frank John Bruzzaniti wrote:

> I wrote a little python script (oOC.py) that I could insert as one
> of the "helper" apps that uses unoconv and openoffice to convert
> documents to text.

Cool--any chance you could slap a license on it, or put it up on the
wiki or something? I have a feeling it'd be useful for lots of folk.

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org

Richard Boulton | 2 Feb 2009 21:23

Re: Using Open Office to convert documents.

On Mon, Feb 02, 2009 at 07:25:04PM +0000, James Aylett wrote:
> On Tue, Feb 03, 2009 at 01:35:14AM +1030, Frank John Bruzzaniti wrote:
> 
> > I wrote a little python script (oOC.py) that I could insert as one
> > of the "helper" apps that uses unoconv and openoffice to convert
> > documents to text.
> 
> Cool--any chance you could slap a license on it, or put it up on the
> wiki or something? I have a feeling it'd be useful for lots of folk.

Definitely would be helpful - I'd like to build a collection of such
scripts, and ultimately gather them together under a coherent interface -
having them licensed and on the wiki would help to do this no end.

-- 
Richard

Frank John Bruzzaniti | 2 Feb 2009 22:41

Re: Using Open Office to convert documents.

Okies, done. I've added the OpenOffice filter Python script as
http://trac.xapian.org/ticket/324

On Mon, 2009-02-02 at 19:25 +0000, James Aylett wrote:
> On Tue, Feb 03, 2009 at 01:35:14AM +1030, Frank John Bruzzaniti wrote:
> 
> > I wrote a little python script (oOC.py) that I could insert as one
> > of the "helper" apps that uses unoconv and openoffice to convert
> > documents to text.
> 
> Cool--any chance you could slap a license on it, or put it up on the
> wiki or something? I have a feeling it'd be useful for lots of folk.
> 
> J
> 

Kevin Duraj | 3 Feb 2009 06:30

Re: Bug in set_cutoff - xapian-core 1.0.10

Olly,

We should not make any change to the code if it would cause a
performance loss or significantly higher memory usage. Let's leave
everything as it is, because that way we can still manipulate
multi-million search result sets. The top search results are always
good after cutoff and sorting, and that is what matters most.

Thanks,
Kevin Duraj

On Sun, Jan 18, 2009 at 3:00 AM, Olly Betts <olly <at> survex.com> wrote:
> On Wed, Jan 14, 2009 at 12:03:22PM +0000, Olly Betts wrote:
>> On Tue, Jan 13, 2009 at 12:18:33PM -0800, Kevin Duraj wrote:
>> > We have a bug when calling set_cutoff and
>> > set_sort_by_value_then_relevance functions. Some documents are
>> > displaying at the beginning and at end of result sets but are not
>> > displaying in middle of result set.
>>
>> I think this is likely the same issue as this bug:
>>
>> http://trac.xapian.org/ticket/216
>
> It occurred to me over the weekend that while this bug won't help, a
> percentage cutoff while sorting primarily by value just isn't going to
> work properly as things stand.
>
> The fundamental problem is that we might find a document with a higher
> relevance score thus increasing the lower bound on the weight which
> the percentage cutoff corresponds to, resulting in us dropping lower
[...]
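The point about the moving lower bound can be illustrated with a toy model. This is only a sketch and is not Xapian's matcher: the function names, the document list and the 50% cutoff are all invented for illustration. It assumes documents arrive in value-sort order and that the percentage cutoff is measured against the highest weight seen.

```python
def cutoff_threshold(percent, max_weight):
    """Minimum weight a document needs to pass a percentage cutoff."""
    return max_weight * percent / 100.0


def one_pass(docs, percent):
    """Filter while the maximum weight is still being discovered."""
    kept, best = [], 0.0
    for name, weight in docs:
        best = max(best, weight)
        if weight >= cutoff_threshold(percent, best):
            kept.append(name)
    return kept


def two_pass(docs, percent):
    """Filter once the true maximum weight is known."""
    best = max(weight for _, weight in docs)
    limit = cutoff_threshold(percent, best)
    return [name for name, weight in docs if weight >= limit]


# Hypothetical (name, weight) pairs, listed in value-sort order.
docs = [("a", 2.0), ("b", 3.0), ("c", 10.0), ("d", 4.0)]
print(one_pass(docs, 50))
print(two_pass(docs, 50))
```

In this toy run the single pass keeps "a" and "b" because they were judged against a threshold that later rose when "c" appeared, whereas the two-pass version keeps only "c".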

Frank John Bruzzaniti | 3 Feb 2009 14:43

PowerPoint 2007 filter

Hi,

I'm trying to write the PowerPoint 2007 filter in the same manner that I
did for *.docx and *.xlsx, but I'm getting the following error when I
try to index.

The document is called:  

Indexing "/Frisk in Power Point.pptx" as application/vnd.openxmlformats-officedocument.presentationml.presentation ... caution: filename not matched:  ppt/notesSlides/notesSlide*.xml
caution: filename not matched:  ppt/comments/comment*.xml

The problem is that not all pptx files contain notes and comments.

Do you think just including the slide text is enough? If not, how can I
test whether the files exist? It looks like unzip throws an error if
the file doesn't exist. Can I test this with a couple of ifs? (My C
isn't very good, so I was hoping someone could help me with the coding.)

Here's what I have so far from omindex.cc. It works for the main
slides; you will see the other command, commented out, which also
extracts notes and comments from the PowerPoint file.

// Start: PowerPoint 2007 .pptx
    } else if (startswith(mimetype,
"application/vnd.openxmlformats-officedocument.presentationml."))
    {
    // Inspired by http://mjr.towers.org.uk/comp/sxw2text
    string safefile = shell_protect(file);
[...]

James Aylett | 3 Feb 2009 15:45

Re: PowerPoint 2007 filter

On Wed, Feb 04, 2009 at 12:13:27AM +1030, Frank John Bruzzaniti wrote:

> I'm trying to write the PowerPoint 2007 filter in the same manner that I
> did for *.docx and *.xlsx, but I'm getting the following error when I
> try to index.
> 
> The document is called:  
> 
> Indexing "/Frisk in Power Point.pptx" as application/vnd.openxmlformats-officedocument.presentationml.presentation ... caution: filename not matched:  ppt/notesSlides/notesSlide*.xml
> caution: filename not matched:  ppt/comments/comment*.xml
> 
> The problem is that not all pptx files contain notes and comments.
> 
> Do you think just including the slide text is enough? If not, how can I
> test whether the files exist? It looks like unzip throws an error if
> the file doesn't exist. Can I test this with a couple of ifs? (My C
> isn't very good, so I was hoping someone could help me with the coding.)

One solution would be to do them individually, and check the return
value of unzip. Another would be to use something like:

$ unzip -lqq <pptx> ppt/notesSlides/notesSlide*.xml

which will give you one line per matched file; if there are none, you
can skip it. That's more fragile, though; I'd recommend checking the
return value instead.

You won't be able to use stdout_to_string() for that as it stands, but
[...]
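As a rough illustration of the "check before extracting" idea, the sketch below lists archive members that match a pattern and lets the caller skip the step when there are none. It swaps in Python's zipfile module rather than calling unzip or checking its return value, so it is an analogue of the suggestion above and not the omindex.cc change itself. The helper name and the sample archive are made up for the example.

```python
import fnmatch
import io
import zipfile


def matching_members(pptx, pattern):
    """Return the archive member names that match a shell-style pattern."""
    with zipfile.ZipFile(pptx) as archive:
        return [name for name in archive.namelist()
                if fnmatch.fnmatchcase(name, pattern)]


# Hypothetical in-memory archive that has slides but no notes or comments.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr("ppt/slides/slide1.xml", "<slide/>")

for pattern in ("ppt/slides/slide*.xml",
                "ppt/notesSlides/notesSlide*.xml",
                "ppt/comments/comment*.xml"):
    if matching_members(sample, pattern):
        print("extract", pattern)
    else:
        print("skip", pattern)
```

Testing for members first avoids treating a presentation without notes or comments as an error.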

Markus Wörle | 3 Feb 2009 17:56

problem on closing writable databases

Hi

(I am using xapian 1.0.10, with perl bindings.)

Because Xapian B-trees thin out in the long run, I decided to add
support to my indexer for auto-compacting an index from time to time
using the xapian-compact binary. It does so by

* flushing the open index
* undef'ing the database handle to do an implicit close (there is no way
to do an explicit close, right?)
* running "xapian-compact -n --no-renumber ./index ./index-compact 2>&1 >/dev/null"
* moving ./index -> ./index-old
(* copying some arbitrary statistic files from ./index-old to
./index-compact, but this won't affect anything)
* moving ./index-compact -> ./index
* deleting ./index-old
* reopening ./index by calling the
Search::Xapian::WritableDatabase->new() constructor
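The directory shuffle in the steps above can be sketched as follows. This is a Python analogue of the procedure and not the Perl indexer itself: the function name and the injectable `run` parameter are hypothetical, and flushing, closing and reopening the database are left to the caller. The xapian-compact options are the ones quoted in the list.

```python
import os
import shutil
import subprocess


def compact_and_swap(index, run=subprocess.check_call):
    """Compact `index` into a sibling directory, then swap it into place.

    The caller must have flushed and closed the writable database first,
    and must reopen it afterwards. `run` is a hypothetical hook so the
    external command can be replaced in tests.
    """
    compacted = index + "-compact"
    old = index + "-old"
    run(["xapian-compact", "-n", "--no-renumber", index, compacted])
    os.rename(index, old)        # moving ./index -> ./index-old
    os.rename(compacted, index)  # moving ./index-compact -> ./index
    shutil.rmtree(old)           # deleting ./index-old
```

Note that removing the old directory only frees its disk space once no process still holds its files open.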

Now my problem is that the disk space ./index-old consumed doesn't get
freed. So I used lsof and found that a "cat" process is holding open
file handles on the .DB files.

cat  2072  root  36u  REG  8,1  997842944  5529607 /var/lib/wtf/db/profile-old/record.DB (deleted)
cat  2072  root  38u  REG  8,1  121257984  5529616 /var/lib/wtf/db/profile-old/value.DB (deleted)
cat  2072  root  39u  REG  8,1  717463552  5529610 /var/lib/wtf/db/ 
[...]

