Leonid Maslov | 1 Sep 2008 09:25

Newbie question: using Lucene to index hierarchical information.

Hi all,

First of all, sorry for my poor English. It's not my native language.

I'm trying to use Lucene to index hierarchical information: I have
structured HTML and PDF/Word documents, and I want to index them so that I
can search in titles, text, paragraphs, or tables only, or in any
combination of these. At the moment I see 3 possible
solutions:

   - Create a set of all possible fields, such as contents, title, heading,
   table, etc., and index the data into each of them accordingly. Possible
   impacts:
      - a large number of fields
      - data duplication (a search in a paragraph must also look inside
      all of its inner elements, so every outer element indexed will
      contain the content of all its inner elements as well)
   - Create a hierarchy of fields, such as "title", "paragraph/title",
   "paragraph/title/subparagraph/table". Possible impacts:
      - the number of fields remains the same
      - a loose set of fields (not consistent)
      - I'm not sure how I would process the required information and
      perform the search.
      - Performance issues?
   - Use one field for content and just add a location prefix to the content.
   For example "contents:paragraph/heading:token1 token2", where
   paragraph/heading: is an additional information prefix. So I
   (possibly?) could reuse PrefixQuery functionality or something similar.
   Impacts:
      - A fixed, small set of index fields
      - Additional processing: all the queries I use will
      have to work as PrefixQuery
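
A minimal sketch of a variant of the third option, in plain Java with no Lucene calls. It assumes the token is placed before the element path ("token1|paragraph/heading"), which is the reverse of the order in the post, so that "token1 anywhere inside paragraph" becomes a single prefix scan over a sorted term list. The class name, the "|" separator and the method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy term dictionary: each term is "token|element/path", kept sorted like an index's term list. */
final class PathTermSketch {
    private final TreeSet<String> terms = new TreeSet<String>();

    /** Records that token occurs inside the element at path, e.g. "paragraph/heading". */
    void add(String token, String path) {
        // Assumption: tokens never contain the '|' separator.
        terms.add(token + "|" + path);
    }

    /** Returns the paths at or below scopePath in which token occurs, using one prefix scan. */
    List<String> find(String token, String scopePath) {
        String prefix = token + "|" + scopePath;
        List<String> hits = new ArrayList<String>();
        // tailSet starts at the first term >= prefix; terms sharing the prefix are contiguous.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            String rest = term.substring(prefix.length());
            // Accept the element itself or a nested element, but not "paragraphs" for "paragraph".
            if (rest.isEmpty() || rest.startsWith("/")) {
                hits.add(term.substring(token.length() + 1));
            }
        }
        return hits;
    }
}
```

With this ordering the number of fields stays at one, and searching an outer element covers its inner elements without duplicating their content. The cost is that the term list grows with every distinct (token, path) pair.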

tom | 1 Sep 2008 09:26

Re: Newbie question: using Lucene to index hierarchical information.

AUTOMATIC REPLY

Tom Roberts is out of the office till 2nd September 2008.

LUX reopens on 1st September 2008
Karsten F. | 1 Sep 2008 10:29

Re: Index types


Hi John,

I am not sure how Solr implements range queries,
but it looks like Solr uses
org.apache.lucene.search.ConstantScoreRangeQuery,
which in turn uses
org.apache.lucene.search.RangeFilter.

So Solr does not rewrite the query into a large Boolean SHOULD query;
instead it reads all terms for the "range" field.
This could be slow.
If the RangeFilter is used many times, a CachingWrapperFilter would speed up
the search.
You could also (try to) read the terms from memory instead of from the hard
disk. Because a timestamp is only a long (64 bits), another appropriate
solution would be to store the timestamps in main memory for a fast
RangeFilter implementation.

Best regards
   Karsten
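
The in-memory idea above could look roughly like the following. This is a plain-Java sketch, not a Lucene Filter implementation: the class name and the assumption that timestamps are already loaded into an array indexed by document id are hypothetical.

```java
import java.util.BitSet;

/** Holds one timestamp per document id in a long[] and answers range filters from memory. */
final class InMemoryTimestampFilter {
    private final long[] timestampByDoc;

    /** timestampByDoc[doc] is assumed to hold the timestamp of document number doc. */
    InMemoryTimestampFilter(long[] timestampByDoc) {
        this.timestampByDoc = timestampByDoc.clone();
    }

    /** Returns one bit per document; a set bit means lower <= timestamp <= upper. */
    BitSet bits(long lower, long upper) {
        BitSet result = new BitSet(timestampByDoc.length);
        for (int doc = 0; doc < timestampByDoc.length; doc++) {
            long ts = timestampByDoc[doc];
            if (ts >= lower && ts <= upper) {
                result.set(doc);
            }
        }
        return result;
    }
}
```

The array costs 8 bytes per document, so ten million documents need about 80 MB of heap. Each range is then a single pass over memory with no term enumeration on disk.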

John Patterson wrote:
> 
> 
> Does Solr's range implementation use the large Boolean SHOULD queries?
> 

-- 
View this message in context: http://www.nabble.com/Index-types-tp19177298p19250290.html

Akshay | 1 Sep 2008 12:06

getTimestamp method in IndexCommit

Hi,

We need a feature for time based cleanup of IndexCommits. Would it be
possible to add a method to IndexCommit class to get the timestamp of an
index commit?

Thanks.
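
To make the request concrete, here is a rough sketch of the time-based cleanup that such a timestamp would allow. It is plain Java: CommitInfo is a hypothetical stand-in for a commit point that exposes its creation time, and it is not the IndexCommit class. In Lucene the selection logic would presumably sit inside an IndexDeletionPolicy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a commit point that knows when it was created. */
final class CommitInfo {
    final String name;
    final long timestampMillis;

    CommitInfo(String name, long timestampMillis) {
        this.name = name;
        this.timestampMillis = timestampMillis;
    }
}

final class AgeBasedCleanup {
    /**
     * Selects commits older than maxAgeMillis for deletion.
     * The list is assumed to be ordered oldest first; the newest commit is always kept.
     */
    static List<CommitInfo> selectForDeletion(List<CommitInfo> commitsOldestFirst,
                                              long nowMillis, long maxAgeMillis) {
        List<CommitInfo> toDelete = new ArrayList<CommitInfo>();
        // Stop before the last element so the index always retains its latest commit.
        for (int i = 0; i < commitsOldestFirst.size() - 1; i++) {
            CommitInfo commit = commitsOldestFirst.get(i);
            if (nowMillis - commit.timestampMillis > maxAgeMillis) {
                toDelete.add(commit);
            }
        }
        return toDelete;
    }
}
```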

-- 
Regards,
Akshay Ukey.
gaz77 | 1 Sep 2008 13:48

Re: Confused with NGRAM results


Hi Otis,

The original message text is:

> Hi,
> 
> I'd appreciate if someone could explain the results I'm getting.
> 
> I've written a simple custom analyzer that applies the NGramTokenFilter to
> the token stream during indexing. It's never applied during searching. The
> purpose of this is to match sub-words.
> 
> Without the ngram filter, if I searched on the word 'postcode' it returns
> 2 documents. If I searched on 'code' it returns 6 documents (with no
> overlap on the postcode results).
> 
> If I apply the ngram filter with min 1 and max 10, searching 'postcode'
> returns the same 2 docs, while searching 'code' returns 9 docs. This
> sort-of feels right.
> 
> The problem comes when I set the min ngram size to 3, and the max to 5.
> Searching 'postcode' returns no results (as expected), but searching
> 'code' only returns 2 docs (the 2 normally returned by a 'postcode'
> search).
> 
> This last result for 'code' just doesn't seem correct - it should be
> returning at least the 6 docs from the original search.
> 
> I'd really appreciate some advice on what is going on with the ngram
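
As a reference point for the min 3 / max 5 case described in the quoted message, the grams one would expect from plain character n-gramming can be enumerated as below. This sketch is plain Java and does not call Lucene's NGramTokenFilter, so it only shows the expected term set, not what the filter actually emits. The class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class NGramSketch {
    /** Returns every substring of word whose length is between minGram and maxGram, inclusive. */
    static Set<String> grams(String word, int minGram, int maxGram) {
        Set<String> out = new LinkedHashSet<String>();
        for (int size = minGram; size <= maxGram; size++) {
            for (int start = 0; start + size <= word.length(); start++) {
                out.add(word.substring(start, start + size));
            }
        }
        return out;
    }
}
```

Under this model, grams("code", 3, 5) contains "code", and so does grams("postcode", 3, 5), while the eight-character term "postcode" is not produced at all. That agrees with the expectation in the message: with min 3 and max 5, a search for 'postcode' finds nothing, and a search for 'code' should match documents containing either word.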

davood | 1 Sep 2008 15:16

Re: MoreLikeThis return no results


Hi,

I enabled term vectors for the required fields using the following piece of
code:
Field titleField = new Field("title", title, Field.Store.NO,
Field.Index.TOKENIZED, TermVector.YES);
and then re-indexed. But it still shows no results.
I checked the stored documents: the term vectors exist and are correct, but
MoreLikeThis returns no results for a given document id.

What am I missing? 

mark harwood wrote:
> 
> MoreLikeThis needs to find the terms in your doc. It tries to do this by
> using TermFreqVectors, which are stored in the index if you choose to add
> them at index time. If you haven't done this, it will fall back to
> reanalysing the content of the document using an analyser (despite what
> the javadocs for the setAnalyzer method say about not needing to set an
> analyzer when MoreLiking an existing document).
> 
> So your options are probably to re-index with term vectors turned on or
> set an appropriate choice of analyzer.
> 
> Cheers,
> Mark
> (only 3 days to go until Tom Roberts is back in the office! ) 
> 


Leonid Maslov | 1 Sep 2008 15:55

Re: Newbie question: using Lucene to index hierarchical information.

Any comments or suggestions? Maybe I should rephrase my original message or
describe it in more detail?
I would really like to get a response, if possible.

Thanks a lot in advance!

On Mon, Sep 1, 2008 at 10:25 AM, Leonid Maslov <leonidms <at> gmail.com> wrote:

> Hi all,
>
> First of all, sorry for my poor English. It's not my native language.
>
> I'm trying to use Lucene to index hierarchical information: I have
> structured HTML and PDF/Word documents, and I want to index them so that I
> can search in titles, text, paragraphs, or tables only, or in any
> combination of these. At the moment I see 3 possible
> solutions:
>
>    - Create a set of all possible fields, such as contents, title,
>    heading, table, etc., and index the data into each of them accordingly.
>    Possible impacts:
>       - a large number of fields
>       - data duplication (a search in a paragraph must also look inside
>       all of its inner elements, so every outer element indexed will
>       contain the content of all its inner elements as well)
>    - Create a hierarchy of fields, such as "title", "paragraph/title",
>    "paragraph/title/subparagraph/table". Possible impacts:
>       - the number of fields remains the same
>       - a loose set of fields (not consistent)
>       - I'm not sure how I would process the required information

Cam Bazz | 1 Sep 2008 16:06

lucene based tagging structure

Hello,

Recently I developed an interest in building a Lucene-based structure for
tagging. As we all know, Lucene's update is not real-time, and one has to
delete a document before re-adding it in updated form.

I have been googling for different approaches to a lucene based tagging
structure, and I stumbled upon

http://blogs.sun.com/basler/entry/lucene_search_engine_web_crawlers

Quote:

Updating Indexes in Lucene for Tagging:
One thing to keep in mind is that Lucene doesn't allow an index to be
updated; the specific index has to be deleted and then re-created.  When
adding a new tag to an item or updating a document index you have to be able
to access all the data that was originally in the index before re-creating it.
This sounds straightforward but there is one caveat.  If you index items
using an approach that doesn't allow retrieval of all the data in the index,
you will have to read the data from a persistent store so the index can be
completely re-created.  You can get into this state when you create an
org.apache.lucene.document.Field for the document's index utilizing the
"UnStore" method or "Text" method with a Reader.  When using these methods,
the data can't be retrieved via the exposed APIs.  This really isn't a big
deal once you factor it into your approach. Our tagging requirement came
after the initial implementation was completed and it caused some problems
that made us have to re-think our index scheme.

Any ideas on the approach or different approaches.
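
The delete-then-re-add cycle described in the quote can be sketched as follows. This is a toy in plain Java, not Lucene code: the class, its maps and its method names are hypothetical, and the maps stand in for whatever holds a retrievable copy of each document (stored fields or a separate persistent store).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy index in which a document can only be replaced as a whole, never patched in place. */
final class TagIndexSketch {
    // Retrievable copies of what was indexed; without them addTag cannot rebuild a document.
    private final Map<String, String> bodies = new HashMap<String, String>();
    private final Map<String, Set<String>> tags = new HashMap<String, Set<String>>();

    void add(String id, String body, Set<String> docTags) {
        bodies.put(id, body);
        tags.put(id, new LinkedHashSet<String>(docTags));
    }

    void delete(String id) {
        bodies.remove(id);
        tags.remove(id);
    }

    /** Adds a tag by reading the old document, deleting it, and re-adding the full new version. */
    void addTag(String id, String tag) {
        String body = bodies.get(id);
        if (body == null) {
            throw new IllegalStateException("nothing stored for " + id + "; reload it from the primary store");
        }
        Set<String> updated = new LinkedHashSet<String>(tags.get(id));
        updated.add(tag);
        delete(id);
        add(id, body, updated);
    }

    Set<String> tagsOf(String id) {
        Set<String> found = tags.get(id);
        return found == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(found);
    }
}
```

The exception branch is the caveat from the quote: if the original content was indexed without a retrievable copy, the tag update has to fetch it from somewhere else before the document can be re-created.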

Marcelo Ochoa | 1 Sep 2008 16:43

Re: MoreLikeThis return no results

Hi Dave:
  The MoreLikeThis object has two parameters which control its functionality:
        mlt.setMinTermFreq(minTermFreq.intValue());
        mlt.setMinDocFreq(minDocFreq.intValue());
  By default MinTermFreq is 2, so if your document has no terms with a
frequency of at least 2, it will return a query with no terms, which
returns 0 hits.
  Try setting it to 1.
  Best regards, Marcelo.
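
The effect of the minimum term frequency can be illustrated without Lucene. The sketch below is plain Java with hypothetical names and a simple whitespace split in place of a real analyzer. It only mimics the frequency cut-off, not the other checks MoreLikeThis applies (such as the minimum document frequency).

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class MinTermFreqSketch {
    /** Returns the terms of text (lower-cased, split on whitespace) occurring at least minTermFreq times. */
    static Set<String> interestingTerms(String text, int minTermFreq) {
        Map<String, Integer> freq = new LinkedHashMap<String, Integer>();
        for (String term : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            Integer old = freq.get(term);
            freq.put(term, old == null ? 1 : old + 1);
        }
        Set<String> kept = new LinkedHashSet<String>();
        for (Map.Entry<String, Integer> entry : freq.entrySet()) {
            // Terms below the threshold are dropped, which can leave nothing to query with.
            if (entry.getValue() >= minTermFreq) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

A short field such as a title rarely repeats a word, so with a threshold of 2 every term is dropped and the resulting query is empty. With a threshold of 1 all terms survive.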
On Mon, Sep 1, 2008 at 10:16 AM, davood <dave.fet <at> gmail.com> wrote:
>
> Hi,
>
> I enabled term vectors for the required fields using the following piece of
> code:
> Field titleField = new Field("title", title, Field.Store.NO,
> Field.Index.TOKENIZED, TermVector.YES);
> and then re-indexed. But it still shows no results.
> I checked the stored documents: the term vectors exist and are correct, but
> MoreLikeThis returns no results for a given document id.
>
> What am I missing?
>
>
> mark harwood wrote:
>>
>> MoreLikeThis needs to find the terms in your doc. It tries to do this by
>> using TermFreqVectors, which are stored in the index if you choose to add
>> them at index time. If you haven't done this, it will fall back to
>> reanalysing the content of the document using an analyser (despite what
>> the javadocs for the setAnalyzer method say about not needing to set an

