1 Sep 2008 09:25
Newbie question: using Lucene to index hierarchical information.
Leonid Maslov <leonidms <at> gmail.com>
2008-09-01 07:25:36 GMT
2008-09-01 07:25:36 GMT
Hi all,
First of all, sorry for my poor English. It's not my native language.
I'm trying to use Lucene to index hierarchical kind of information: I have
structured html and pdf/word documents and I want to index them in ways to
perform search in titles, text, paragraphs or tables only, or any
combinations of items mentioned above. At the moment I see 3 possible
solutions:
- Create the set of all possible fields, like: contents, title, heading,
table etc... And index the data in all them accordingly. Possible impacts:
- a big count of fields
- data duplication (because I need to make search looking in the
paragraphs to look inside all the inner elements, so every outer element
indexed will contain all the inner element content as well)
- Create the hierarchy of the fields, like "title", "paragraph/title",
"paragraph/title/subparagraph/table". Possible impacts:
- count of fields remains the same
- soft set of fields (not consistent)
- I'm not sure about the ways I could process required information and
perform search.
- Performance issues?
- Use one field for content and just add location prefix to content.
For example "contents:*paragraph/heading:*token1 token2". *
paragraph/heading:* here is used as additional information prefix. So, I
(possibly?) could reuse PrefixQuery functionality or smth. Impacts:
- Strong set of index fields (small)
- Additional information processing - all the queries I'll use will
have to work as PrefixQuery
(Continue reading)
RSS Feed