John Wang | 1 Apr 2009 04:22

Re: API to get index info

Excellent!

Thanks

-John

On Tue, Mar 31, 2009 at 2:25 PM, Yonik Seeley <yonik <at> lucidimagination.com> wrote:

> On Tue, Mar 31, 2009 at 4:55 PM, John Wang <john.wang <at> gmail.com> wrote:
> > Maybe I am missing something. I don't see any calls that would gimme the
> > number of segments. Are you suggesting:
> > IndexCommit.getFileNames().size()?
>
> IndexReader.getSequentialSubReaders().length
>
> The stats page of Solr now displays the number of segments in the
> current index using this.
>
> -Yonik
> http://www.lucidimagination.com
John Wang | 1 Apr 2009 04:23

Re: IndexWriter.deleteDocuments(Query query)

So do you think it is a good addition/change to the current api now?

-John

On Tue, Mar 31, 2009 at 2:18 PM, Yonik Seeley <yonik <at> lucidimagination.com> wrote:

> On Tue, Mar 31, 2009 at 4:58 PM, John Wang <john.wang <at> gmail.com> wrote:
> > I fail to see the difference of exposing the api to allow for a Query
> > instance to be passed in vs a DocIdSet.
>
> I was commenting specifically on your idea to allow deletion by int[]
> (docids) on the IndexWriter.
>
> DocIdSet is a different issue - it didn't exist when the conversation
> to add deleteByQuery was going on.
>
> -Yonik
> http://www.lucidimagination.com
>
>
> > In this specific case, Query is
> > essentially a factory to produce a DocIdSetIterator (or Scorer). Isn't that
> > what DocIdSet is?
> > Thanks
> >
> > -John
> >
> > On Tue, Mar 31, 2009 at 12:57 PM, Yonik Seeley
> > <yonik <at> lucidimagination.com> wrote:
> >

Bon | 1 Apr 2009 05:32

What is the right query syntax for matching some field's substring?


Hi all,

    I have a question about query syntax.
    I have a Lucene text field whose value looks like
,11,12,15,16,
    I want to find documents where this field contains one of the numbers
I am interested in (11 or 15).
    How can I do that?
    I tried a query like (field_name:,11,) but it does not match
anything.

    Or must I reformat the field value to use a separator other than the
comma ','?
Bon
nitin gopi | 1 Apr 2009 07:17

semantic vectors

Hi all,
        I want to learn about semantic vectors. I want to know how
it indexes documents such that the results produced are
semantically better than a normal search. I also want to know how it
differs from the semantic web, which uses the concepts of ontologies and
metadata. It would be very helpful if somebody could mail me study
material related to it.

Thanking You
Nitin
Michael McCandless | 1 Apr 2009 10:02

Re: IndexWriter.deleteDocuments(Query query)

John,

I think this has the same problem as exposing delete by docID, ie, how
would you produce that docIdSet?

We could consider delete by Filter instead, since that exposes the
necessary getDocIdSet(IndexReader) method.

Or, with near real-time search, we could enhance it to allow deletions
via the obtained reader (the first approach doesn't).

Mike

On Tue, Mar 31, 2009 at 10:23 PM, John Wang <john.wang <at> gmail.com> wrote:
> So do you think it is a good addition/change to the current api now?
>
> -John
>
> On Tue, Mar 31, 2009 at 2:18 PM, Yonik Seeley <yonik <at> lucidimagination.com> wrote:
>
>> On Tue, Mar 31, 2009 at 4:58 PM, John Wang <john.wang <at> gmail.com> wrote:
>> > I fail to see the difference of exposing the api to allow for a Query
>> > instance to be passed in vs a DocIdSet.
>>
>> I was commenting specifically on your idea to allow deletion by int[]
>> (docids) on the IndexWriter.
>>
>> DocIdSet is a different issue - it didn't exist when the conversation
>> to add deleteByQuery was going on.
>>

Toke Eskildsen | 1 Apr 2009 12:04

Re: Unable to improve performance

On Fri, 2009-03-27 at 12:07 +0100, Paul Taylor wrote:

[2Gb index, 7 million documents(?)]

> I ran the test a number of times with 30 threads and a max memory of
> 3500 MB. I was processing 10,000 records in about 43 seconds (233
> queries/second). The index was stored on a solid state drive running on
> a MacBook Pro (2.66 GHz Intel Core 2 Duo, 4 GB DDR). I don't really have a
> view on whether this is a good result or not, but I was keen to try a few
> other things to see if I could improve performance further. All my
> efforts have had minimal effect.

You might want to try reducing the number of threads all the way down to
3 or 4 and queue pending searches instead, but I doubt it will change
much - as far as I know, the SSDs in MacBooks are quite okay with regard
to read-latency and with such a small index the system will probably
cache most of it anyway.
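
The "few threads plus a queue" idea above can be sketched with the standard java.util.concurrent classes. This is only an illustration under assumptions: the class name, the fakeSearches helper and the placeholder "search" (which just returns a number) are made up here, and a real test would call the searcher instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BoundedSearchPoolSketch {
    /** Runs all searches on a fixed number of worker threads; the rest wait in the pool's queue. */
    static List<Integer> runAll(List<Callable<Integer>> searches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Integer> results = new ArrayList<>();
            // invokeAll blocks until every task is done and keeps the submission order.
            for (Future<Integer> f : pool.invokeAll(searches)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("search batch failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Hypothetical stand-in for real queries: each task just returns a fake hit count. */
    static List<Callable<Integer>> fakeSearches(int count) {
        List<Callable<Integer>> searches = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            final int queryId = i;
            searches.add(() -> queryId % 7);
        }
        return searches;
    }

    public static void main(String[] args) {
        // 100 pending searches, but only 4 run at any one time.
        System.out.println(runAll(fakeSearches(100), 4).size());
    }
}
```

Timing the same batch with 4 threads and with 30 threads would show whether the extra threads help at all.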

I can see elsewhere that you have upped the speed to 466 q/sec by
switching to NIOFSDirectory, so my guess is that you're now CPU and
memory speed bound.

You could try the freeware tool VisualVM that profiles running Java
applications. It is extremely easy to use (just run it and select your
application from a list) and it will show you where the CPU-time is
used. Of course, if you're just using simple query analysis with Lucene
supplied Analyzers, there's probably not much you can do about it. On
the other hand, it might show you that you're spending a lot of time
generating queries or similar outside-Lucene-work.


Matthew Hall | 1 Apr 2009 14:56

Re: What is the right query syntax for matching some field's substring?

Which analyzer are you using here?  Depending on your choice, the comma 
separated values may be kept together as a single token in your index, rather 
than tokenized as you expected.

Secondly, you should get Luke and take a look at your index. That 
should give you a much better idea of what is actually stored in it.

Anyhow, closely examine your analyzer choice, and then your query type 
choice, and see if that's where the problem lies.
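
To illustrate why tokenization matters here, a rough sketch in plain Java follows. It is not a Lucene analyzer, and the class and method names are made up for the example. It only shows that once ",11,12,15,16," is split into the separate tokens 11, 12, 15 and 16, looking for "11 or 15" is a plain whole-token match and the commas no longer play a role.

```java
import java.util.ArrayList;
import java.util.List;

final class CommaFieldSketch {
    /** Splits a value such as ",11,12,15,16," into the tokens 11, 12, 15, 16. */
    static List<String> tokenize(String fieldValue) {
        List<String> tokens = new ArrayList<>();
        for (String part : fieldValue.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) { // skip the empty piece before the leading comma
                tokens.add(trimmed);
            }
        }
        return tokens;
    }

    /** True if at least one wanted value is present as a whole token. */
    static boolean matchesAny(String fieldValue, String... wanted) {
        List<String> tokens = tokenize(fieldValue);
        for (String w : wanted) {
            if (tokens.contains(w)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(",11,12,15,16,"));
        System.out.println(matchesAny(",11,12,15,16,", "11", "15"));
        // "1" is not a token by itself, so a whole-token match does not hit 11, 12, ...
        System.out.println(matchesAny(",11,12,15,16,", "1"));
    }
}
```

If the analyzer instead keeps the whole value as one token, only a query for that exact string (or a wildcard query) can match it.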

Matt

Bon wrote:
> Hi all,
>
>     I have a question about query syntax.
>     I have a Lucene text field whose value looks like
> ,11,12,15,16,
>     I want to find documents where this field contains one of the numbers
> I am interested in (11 or 15).
>     How can I do that?
>     I tried a query like (field_name:,11,) but it does not match
> anything.
>
>     Or must I reformat the field value to use a separator other than the
> comma ','?
>
> Bon
>

John Wang | 1 Apr 2009 18:13

Re: IndexWriter.deleteDocuments(Query query)

Hi Michael:

    Let me first share what I am doing w.r.t deleting by docid:

I have a customized index reader that stores a mapping of docid -> uid in
the payload (something Michael Bush and Ning Li suggested a while back). That
mapping is loaded at IndexReader load time and is shared by searchers.

I do realtime updates, so I get a batch of updates with a uid associated with
each batch. So I delete on the uid and add the document. I implemented
this using IndexWriter.deleteDocuments(Term[]).

I realized I already have an IndexReader around with a docId->uid mapping, so I
can just find the docid from that list and simply call
IndexReader.deleteDocument(int). Out of curiosity, I compared the times
for deletes with these two mechanisms, using 1 batch of 10000 deletes. On
my macbook pro, I see a difference/overhead of 3-4 seconds (varying across
runs with how much of the term table is cached etc.). That is something I would
expect because we are essentially doing a "query" per element in the batch,
albeit the posting list length is only 1, but still...
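
As a rough sketch of the lookup described above (plain Java, no Lucene calls): the docid -> uid array is inverted once, and a batch of uids is then resolved to docids without a term lookup per element. The class name and the use of long for uids are assumptions made for the example, and the resulting docids are only meaningful for the reader the mapping was loaded from.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class UidLookupSketch {
    /** Inverts a docid -> uid array (index = docid) into a uid -> docid map. */
    static Map<Long, Integer> invert(long[] uidByDocId) {
        Map<Long, Integer> docIdByUid = new HashMap<>();
        for (int docId = 0; docId < uidByDocId.length; docId++) {
            docIdByUid.put(uidByDocId[docId], docId);
        }
        return docIdByUid;
    }

    /** Resolves a batch of uids to docids; uids that are not in the map are skipped. */
    static int[] resolve(Map<Long, Integer> docIdByUid, long[] uids) {
        int[] docIds = new int[uids.length];
        int count = 0;
        for (long uid : uids) {
            Integer docId = docIdByUid.get(uid);
            if (docId != null) {
                docIds[count++] = docId;
            }
        }
        return Arrays.copyOf(docIds, count);
    }

    public static void main(String[] args) {
        Map<Long, Integer> docIdByUid = invert(new long[] {100L, 42L, 7L});
        // uid 999 is unknown and is dropped from the result.
        System.out.println(Arrays.toString(resolve(docIdByUid, new long[] {7L, 999L, 100L})));
    }
}
```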

Now to me that is significant enough to move away from
IndexWriter.deleteDocuments().

However, to actually implement the delete with IndexWriter on docids, I have
to create a customized Query object that iterates my int[] of docids. Which
is kinda silly, since IndexWriter calls delete on the docid anyway. I don't
want to open an IndexReader for the deletes (cuz that's where the api is at)
and then open another IndexWriter to add documents because:
1) seems like we are moving away from this paradigm with the delete apis on

Yonik Seeley | 1 Apr 2009 19:13

Re: IndexWriter.deleteDocuments(Query query)

On Wed, Apr 1, 2009 at 4:02 AM, Michael McCandless
<lucene <at> mikemccandless.com> wrote:
> I think this has the same problem as exposing delete by docID, ie, how
> would you produce that docIdSet?

Whoops, right.  I was going by memory that there was a
get(IndexReader) type method there... but that's on Filter of course.

-Yonik
http://www.lucidimagination.com
Michael McCandless | 1 Apr 2009 19:17

Re: IndexWriter.deleteDocuments(Query query)

> For me at least, IndexWriter.deleteDocument(int) would be useful.

I completely agree: delete-by-docID in IndexWriter would be a great
feature.  Long ago I became convinced of that.

Where this feature always gets stuck (search the lists -- it's gotten
stuck a lot) is how to implement it.  At any time, a merge can commit,
which invalidates all docIDs stored anywhere.  We need to solve that
before we can delete by docID.
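
A toy model of the problem, in plain Java with no Lucene classes: documents are a list indexed by docID, and a "merge" drops the deleted slots and packs the survivors. This is a deliberate simplification (real merges combine whole segments), and the class and method names are made up for the example, but it shows how a docID held from before the merge ends up pointing at a different document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MergeRenumberSketch {
    /** Simulates a merge: deleted slots are dropped and survivors are packed, so docIDs shift. */
    static List<String> merge(List<String> uidsByDocId, boolean[] deleted) {
        List<String> merged = new ArrayList<>();
        for (int docId = 0; docId < uidsByDocId.size(); docId++) {
            if (!deleted[docId]) {
                merged.add(uidsByDocId.get(docId));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("a", "b", "c", "d");
        int heldDocId = 2; // the caller remembers that docID 2 is document "c"

        // Document "b" (docID 1) was deleted earlier; now a merge commits.
        List<String> after = merge(before, new boolean[] {false, true, false, false});

        System.out.println("before merge, docID 2 = " + before.get(heldDocId));
        // The same number now refers to another document, so deleting by it would hit the wrong one.
        System.out.println("after merge,  docID 2 = " + after.get(heldDocId));
    }
}
```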

I don't see a clean solution.  Do you?

> I have a customized index reader that stores a mapping of docid -> uid in
> the payload (something Michael Bush and Ning Li suggested a while back). That
> mapping is loaded at IndexReader load time and is shared by searchers.

OK

> I do realtime updates, so I get a batch of updates with a uid associated with
> each batch. So I delete on the uid and add the document. I implemented
> this using IndexWriter.deleteDocuments(Term[]).

OK

> I realized I already have an IndexReader around with a docId->uid mapping, so I
> can just find the docid from that list and simply call
> IndexReader.deleteDocument(int). Out of curiosity, I compared the times
> for deletes with these two mechanisms, using 1 batch of 10000 deletes. On
> my macbook pro, I see a difference/overhead of 3-4 seconds (varying across
> runs with how much of the term table is cached etc.). That is something I would

