Olly Betts | 5 Jun 00:47
Favicon
Gravatar

Re: [Xapian-commits] 10683: trunk/xapian-core/ trunk/xapian-core/common/ trunk/xapian-core/queryparser/ trunk/xapian-core/tests/

On Wed, Jun 04, 2008 at 01:48:37PM +0100, richard wrote:
> queryparser/queryparser.lemony: Fix various cases where queries
> were constructed pair-wise within a loop, which leads to O(N*N)
> scaling behaviour (because each intermediate query construction
> is O(M) where M is the size of that query, and there are N of
> them).

Um, the real bug is in query construction then!  Pairwise construction
shouldn't depend on the size of the subqueries (it used to, but I
thought that had already been fixed).

That aside, it may be more efficient to build up subqueries in an array
and construct the query object in one (as your patch does), particularly
in the case where we're using new to create the query object.  But we
really should fix this in Query, not just work around the issue in
QueryParser.

Cheers,
    Olly
Richard Boulton | 5 Jun 08:21

Re: [Xapian-commits] 10683: trunk/xapian-core/ trunk/xapian-core/common/ trunk/xapian-core/queryparser/ trunk/xapian-core/tests/

Olly Betts wrote:
> On Wed, Jun 04, 2008 at 01:48:37PM +0100, richard wrote:
>> queryparser/queryparser.lemony: Fix various cases where queries
>> were constructed pair-wise within a loop, which leads to O(N*N)
>> scaling behaviour (because each intermediate query construction
>> is O(M) where M is the size of that query, and there are N of
>> them).
> 
> Um, the real bug is in query construction then!  Pairwise construction
> shouldn't depend on the size of the subqueries (it used to, but I
> thought that had already been fixed).

I haven't looked at the code in any detail, but I think the problem is 
that each pairwise construction causes a full copy of the subquery 
(perhaps due to the simplification part of the construction).  Perhaps 
the end_construction() part needs to be made lazy to avoid calling this 
step repeatedly.

> That aside, it may be more efficient to build up subqueries in an array
> and construct the query object in one (as your patch does), particularly
> in the case where we're using new to create the query object.  But we
> really should fix this in Query, not just work around the issue in
> QueryParser.

The test for checking the behaviour of QueryParser should be applicable 
to Query with a few modifications, so we at least have an easy way to 
check the performance now. :)

--

-- 
Richard
(Continue reading)

Olly Betts | 5 Jun 08:47
Favicon
Gravatar

Re: [Xapian-commits] 10683: trunk/xapian-core/ trunk/xapian-core/common/ trunk/xapian-core/queryparser/ trunk/xapian-core/tests/

On Thu, Jun 05, 2008 at 07:21:27AM +0100, Richard Boulton wrote:
> Olly Betts wrote:
> > Um, the real bug is in query construction then!  Pairwise construction
> > shouldn't depend on the size of the subqueries (it used to, but I
> > thought that had already been fixed).
> 
> I haven't looked at the code in any detail, but I think the problem is 
> that each pairwise construction causes a full copy of the subquery 
> (perhaps due to the simplification part of the construction).  Perhaps 
> the end_construction() part needs to be made lazy to avoid calling this 
> step repeatedly.

It occurred to me that pair-wise construction will be O(n^2) when debug
logging even once this issue is addressed (because
Query::get_description() is called to show the parameters when
constructing and this method iterates the subquery tree).  Not sure
what to do about that...

Cheers,
    Olly
Olly Betts | 5 Jun 08:55
Favicon
Gravatar

Re: [Xapian-commits] 10685: branches/1.0/xapian-maintainer-tools/win32msvc/makedepend/

On Thu, Jun 05, 2008 at 07:43:39AM +0100, richard wrote:
> SVN root:       svn://svn.xapian.org/xapian
> Changes by:     richard
> Revision:       10685
> Date:           2008-06-05 07:43:39 +0100 (Thu, 05 Jun 2008)
> 
> Log message (2 lines):
> include.c: on Win32 build \ directory seperator is no handled
> correctly. By Adam Tegen
> 
> Modified files:
> U   branches/1.0/xapian-maintainer-tools/win32msvc/makedepend/ChangeLog
> U   branches/1.0/xapian-maintainer-tools/win32msvc/makedepend/main.c

Um, this commit message doesn't even slightly match the change committed
(completely different file; change committed is backporting a #pragma
to kill stupid MSVC warnings).

Cheers,
    Olly
Richard Boulton | 5 Jun 09:00

Re: [Xapian-commits] 10685: branches/1.0/xapian-maintainer-tools/win32msvc/makedepend/

Olly Betts wrote:
> On Thu, Jun 05, 2008 at 07:43:39AM +0100, richard wrote:
>> SVN root:       svn://svn.xapian.org/xapian
>> Changes by:     richard
>> Revision:       10685
>> Date:           2008-06-05 07:43:39 +0100 (Thu, 05 Jun 2008)
>>
>> Log message (2 lines):
>> include.c: on Win32 build \ directory seperator is no handled
>> correctly. By Adam Tegen
>>
>> Modified files:
>> U   branches/1.0/xapian-maintainer-tools/win32msvc/makedepend/ChangeLog
>> U   branches/1.0/xapian-maintainer-tools/win32msvc/makedepend/main.c
> 
> Um, this commit message doesn't even slightly match the change committed
> (completely different file; change committed is backporting a #pragma
> to kill stupid MSVC warnings).

Yes, I know - I used svn-ci, but the Changelog was in the wrong order! 
I've fixed the ChangeLog so this shouldn't happen again.

--

-- 
Richard
Michal Fapso | 5 Jun 16:33
Picon
Favicon

indexing and searching of timed events

Hello,

I am working on an indexing/search engine for speech and I would like to
try to use Xapian for that. I have an idea how to do it in Xapian, but I
am not sure, if it is correct since I have just quickly looked at the
Xapian code.

Tokens I need to index:
Each speech audio record, processed by a speech recognizer is converted
to an oriented graph of hypotheses. Each hypothesis contains the
recognized word, start time, end time and confidence score. These
hypotheses are overlapped in time, so there is generally a bunch of
hypotheses in each point of time. 

A simple graph of hypotheses (output of speech recognizer):
http://www.research.ibm.com/journal/sj/404/brown1.gif

So I suppose that the main thing I need to change in Xapian code is the
termpos type (in types.h), which is just an unsigned integer. For speech
indexing I need to change it to a struct containing start time, end time
and score of recognized words.

Then to be able to search for phrases correctly, I have to change the
code in ./matcher/phrasepostlist.cc to take start and end time into
account.

Please, correct me if I am wrong or if I missed something. I am really
new to Xapian, so I will be grateful for any hint on this problem
(tutorial, code snippet, doxygen page, ...).

(Continue reading)

Richard Boulton | 6 Jun 01:12

Re: indexing and searching of timed events

Michal Fapso wrote:
> I am working on an indexing/search engine for speech and I would like to
> try to use Xapian for that. I have an idea how to do it in Xapian, but I
> am not sure, if it is correct since I have just quickly looked at the
> Xapian code.

Are you aiming for the textual queries and spoken documents, or spoken 
queries and spoken documents?  (Spoken queries and textual documents is 
also probably an interesting case, but from the next paragraph it sounds 
like your documents are definitely spoken.)

> Tokens I need to index:
> Each speech audio record, processed by a speech recognizer is converted
> to an oriented graph of hypotheses. Each hypothesis contains the
> recognized word, start time, end time and confidence score. These
> hypotheses are overlapped in time, so there is generally a bunch of
> hypotheses in each point of time. 
> 
> A simple graph of hypotheses (output of speech recognizer):
> http://www.research.ibm.com/journal/sj/404/brown1.gif

Out of interest, what speech recogniser are you using?  If it's an open 
source one, others here might be interested in collaborating.  Sphinx-4 
is the only open source recogniser I know of, but there may be others 
(I've not looked for them).

> So I suppose that the main thing I need to change in Xapian code is the
> termpos type (in types.h), which is just an unsigned integer. For speech
> indexing I need to change it to a struct containing start time, end time
> and score of recognized words.
(Continue reading)

Michal Fapso | 6 Jun 10:38
Picon
Favicon

Re: indexing and searching of timed events

Hi Richard,

thank you for your quick reply. I am aiming for textual queries and
spoken documents. 

The recognizer:
---------------
In our speech processing group, we have our own recognizer
(http://speech.fit.vutbr.cz/en/software/hmm-toolkit-stk-speech-fit). It
is released under GPL. It can produce recognition lattices in HTK format
(http://labrosa.ee.columbia.edu/doc/HTKBook21/node262.html), which is
quite a standard. So the indexer should work with the output lattices
and will not be hard-wired to any particular recognizer. Using it with
any other recognizer would be just a matter of writing another lattice
file parser.

Phrase searching:
-----------------
If I had a lattice stored with the indexed document, I could check the
phrase by traversing the lattice structure which would be quite
inefficient, since the lattices can be really huge. However, for
generating a context of search results, it would be helpful. I can
imagine another way. If I store term's start times in
OmDocumentTerm::term_positions vector and create another vector storing
the end times in the same order, OmDocumentTerm::term_positions_end, it
will be quite efficient to check the phrases. The terms will be inserted
in time-increasing order for the log complexity of search. Do you think,
that it would be easier to use the PostingSource for storing the end
times than adding another vector to OmDocumentTerm?

(Continue reading)


Gmane