Paul Hill | 2 Apr 18:49 2012

RE: Lucene 4 - POS and Syntactic Tagging

> Mark McGuire wrote:
> I'm working on a project where I need to tag both the part of speech and other syntactic information on tokens

To pick up on this thread from a few weeks back.

I've never done this myself, but I think that storing extra information that is not itself a token
at a particular position in the index is exactly what payloads are for.
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/

The above article even mentions:
"A payload can be used to store weights for specific terms or things like part of speech tags or other
semantic information."

I don't think "searching on attributes" is the right way to put it.  Attributes are features of some
Lucene objects, a way to ask a complex object for something.  Some attributes return information from
the index, but attributes are not stored in indexes; tokens and payloads are.  I'm sure my
understanding is incomplete too: using a token type other than "WORD" seems like a possible way to go,
but I can't see how to get a query to search on a particular type of token.
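
As a rough illustration of the payload idea (plain Java only, no Lucene classes; the class and
method names are hypothetical): a payload is just a byte array stored alongside a token position,
so a POS tag such as "NN" could be encoded to bytes at index time and decoded when the position is
read back. How the bytes get attached to the token (for example in a TokenFilter that sets a
payload attribute) is left out of this sketch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: converts a POS tag to and from payload bytes.
final class PosPayloadSketch {

    /** Encodes a part-of-speech tag as the raw bytes a payload would carry. */
    static byte[] encodeTag(String posTag) {
        return posTag.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes payload bytes back into the part-of-speech tag. */
    static String decodeTag(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = encodeTag("NN");          // bytes to store with the token
        System.out.println(decodeTag(payload));    // tag recovered from the bytes
    }
}
```

Short tags keep the per-position overhead small; a one-byte code per tag would be smaller still.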

-Paul
Andrzej Bialecki | 2 Apr 19:02 2012

Re: delete entries from posting list Lucene 4.0

On 29/03/2012 11:14, Andrzej Bialecki wrote:

> The problem in our implementation is that we use a within-document term
> frequency (the number of occurrences of t in the current document) and
> not a collection-wide term frequency... so, it looks to me that the fix
> would be to first fully traverse the doc enumeration and calculate the
> total number of term occurrences in all documents (e.g. in
> RIDFTermPruningPolicy.initPositionsTerm(..) ), and use this value in the
> formula in place of termPositions.freq().
>

This is the fix that I implemented; it's now committed to branch_3x and
will be included in release 3.6.
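
The idea behind the fix can be sketched in plain Java. This is not the committed code: the class
and method names are made up, and the list of per-document frequencies merely stands in for what
a full pass over the doc enumeration would yield for one term.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the actual pruning-policy code.
final class CollectionTermFrequency {

    /** Sums the within-document frequencies of one term over all documents containing it. */
    static long totalOccurrences(List<Integer> perDocFreqs) {
        long total = 0;
        for (int freq : perDocFreqs) {
            total += freq;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed postings for one term: it occurs 3, 1 and 5 times in three documents.
        List<Integer> perDocFreqs = Arrays.asList(3, 1, 5);
        // The collection-wide total is what replaces the single-document freq in the formula.
        System.out.println(totalOccurrences(perDocFreqs));
    }
}
```

The cost is one extra traversal of the postings per term before pruning starts.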

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Luis Paiva | 2 Apr 19:28 2012

RE: TVD, TVX and TVF files

Thank you for your help. 
I still haven't found a solution yet. I'm copying all my code below. 

BTW, I'm working with Lucene version 3.5.0.

@Mike: Yes, I do close it :) The files that get created are: .fdt, .fdx,
.fnm, .frq, .nrm, .prx, .tii, .tis.

I don't know why the T* files are not created.

@Uwe: I think I'm not getting any compound files. Only those above.

Has anyone else had the same issue?

CODE --------------------------- xx -------------------------------

package lucene;

import java.io.*;
import java.util.ArrayList;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * This terminal application creates an Apache Lucene index in a folder and

Michael McCandless | 2 Apr 21:48 2012

Re: TVD, TVX and TVF files

As far as I can see, you are not indexing term vectors in the code
below?  Your Fields don't have TermVector.*...

Can you boil this down to a small test case showing the missing term
vector files...?

Mike McCandless

http://blog.mikemccandless.com

On Mon, Apr 2, 2012 at 1:28 PM, Luis Paiva <luismpaiva <at> mail.telepac.pt> wrote:
> Thank you for your help.
> I still haven't found a solution yet. I'm copying all my code below.
>
> BTW, I'm working with lucene version 3.5.0
>
>  <at> Mike: Yes i do close it :) I have some files created, that are: .fdt, .fdx,
> .fnm, .frq, .nrm, .prx, .tii, .tis.
>
> Don't know why the files T* are not created.
>
>  <at> Uwe: I think I'm not getting any compound files. Only those above.
>
> Anyone has the same issue?
>
>
>
> CODE --------------------------- xx -------------------------------
>
>

Luis Paiva | 2 Apr 22:01 2012

RE: TVD, TVX and TVF files

Sorry Mike, 

I pasted the old code. I've already included something like this to index
with TermVector: 

        String xpto = fr.toString();
        doc.add(new Field("contents2", xpto, 
                Field.Store.YES,
                Field.Index.ANALYZED,
                Field.TermVector.YES));

My approach to the problem is probably not the right one, so let me explain
better what I want.

My idea is to take some files, such as .txt files, and get the TermVector
for each one. I don't know whether this can be done simply by indexing the
files.
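
For reference, a term vector is essentially a per-document list of terms with their frequencies.
The sketch below shows that structure in plain Java, without Lucene; the class name and the naive
lowercase/non-word-character tokenization are assumptions for illustration only, and a real
analyzer would tokenize differently. In Lucene 3.x, a field indexed with Field.TermVector.YES can
be read back per document with IndexReader.getTermFreqVector(docNumber, field).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of what a term vector holds; not Lucene code.
final class TermVectorSketch {

    /** Builds a sorted term -> frequency map for the text of one document. */
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            vector.merge(token, 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        // One map per file gives one term vector per file.
        System.out.println(termVector("The cat and the hat"));
    }
}
```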

Thanks Mike. :)
Luis Paiva

-----Original Message-----
From: Michael McCandless [mailto:lucene <at> mikemccandless.com]
Sent: Monday, 2 April 2012 20:49
To: java-user <at> lucene.apache.org
Subject: Re: TVD, TVX and TVF files

As far as I can see, you are not indexing term vectors in the code
below?  Your Fields don't have TermVector.*...


Benson Margulies | 2 Apr 23:28 2012

Repeatability of results

We've observed something that, in some ways, is not surprising.

If you take a set of documents that are close in 'score' to some query,

 and shuffle them in different orders

 and then see what results you get in what order from the reference query,

the scores will vary according to the insertion order.

I can't see any way to argue that it's wrong, but we find it
inconvenient when we are testing something and we want to multithread
the test to speed it up, thus making the insertion order
nondeterministic.

It occurred to me that perhaps you all have some similar concerns in
testing lucene itself, and might have some advice about how to get
around it, thus this email.

We currently observe this with 2.9.1 and 3.5.0.
Michael McCandless | 2 Apr 23:33 2012

Re: Repeatability of results

Hmm that's odd.

If the scores were identical I'd expect different sort order, since we
tie-break by internal docID.

But if the scores are different... the insertion order shouldn't
matter.  And, the score should not change as a function of insertion
order...

Do you have a small test case?
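
A small plain-Java sketch of why insertion order alone should not move the idf part of the score.
The idf formula follows the shape used by Lucene's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1));
everything else (class name, whitespace tokenization, sample documents) is made up for
illustration, and it deliberately ignores deletions, norms and anything segment-related.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration: idf depends only on counts, not on document order.
final class IdfOrderCheck {

    /** idf in the style of DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)). */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Counts the documents containing the term, then computes idf from the counts. */
    static float idfFor(String term, List<String> docs) {
        int docFreq = 0;
        for (String doc : docs) {
            if (Arrays.asList(doc.split("\\s+")).contains(term)) {
                docFreq++;
            }
        }
        return idf(docFreq, docs.size());
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>(
                Arrays.asList("red apple", "green apple", "red car", "blue sky"));
        float before = idfFor("apple", docs);
        Collections.shuffle(docs, new Random(42)); // simulate a different insertion order
        float after = idfFor("apple", docs);
        System.out.println(before == after);
    }
}
```

If scores do differ between runs, the cause is more likely something other than these counts,
which a small test case would help pin down.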

Mike McCandless

http://blog.mikemccandless.com

On Mon, Apr 2, 2012 at 5:28 PM, Benson Margulies <bimargulies <at> gmail.com> wrote:
> We've observed something that, in some ways, is not surprising.
>
> If you take a set of documents that are close in 'score' to some query,
>
>  and shuffle them in different orders
>
>  and then see what results you get in what order from the reference query,
>
> the scores will vary according to the insertion order.
>
> I can't see any way to argue that it's wrong, but we find it
> inconvenient when we are testing something and we want to multithread
> the test to speed it up, thus making the insertion order
> nondeterministic.

Benson Margulies | 2 Apr 23:41 2012

Re: Repeatability of results

On Mon, Apr 2, 2012 at 5:33 PM, Michael McCandless
<lucene <at> mikemccandless.com> wrote:
> Hmm that's odd.
>
> If the scores were identical I'd expect different sort order, since we
> tie-break by internal docID.
>
> But if the scores are different... the insertion order shouldn't
> matter.  And, the score should not change as a function of insertion
> order...

Well, I assumed that TF-IDF would wiggle.

>
> Do you have a small test case?

Since this surprises you, I will build a test case.

>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Apr 2, 2012 at 5:28 PM, Benson Margulies <bimargulies <at> gmail.com> wrote:
>> We've observed something that, in some ways, is not surprising.
>>
>> If you take a set of documents that are close in 'score' to some query,
>>
>>  and shuffle them in different orders
>>

kiwi clive | 4 Apr 19:37 2012

Re: createJoinQuery use

Hi Martijn,

Thank you for responding so quickly. I found my reply was stuck in my drafts folder so apologies for not
getting back sooner.  I'm trying to search and join across two disparate document types that have some
common features. Your response confirms what I was looking into.

My understanding about the join query is that it does a 'search' for the from field and then uses those
results as part of the second or 'real' query. The first of these 'queries' I believe is an internal
mechanism that does not return docs, but smaller more efficient objects.

So if I wanted to search across two different types of document in one index with some fields on one doc type
and some on the other, I effectively need to perform 4 queries.

This is kind of where I was coming from.
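
The two-phase understanding described above can be sketched in plain Java. This is only a model
of the idea, not how JoinUtil is implemented: documents are simple maps, and the class, method
and field names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a two-phase join over documents represented as field maps.
final class TwoPhaseJoinSketch {

    /** Phase 1: collect the "from" field values of documents matching a simple field query. */
    static Set<String> collectFromValues(List<Map<String, String>> docs,
                                         String matchField, String matchValue, String fromField) {
        Set<String> values = new HashSet<>();
        for (Map<String, String> doc : docs) {
            if (matchValue.equals(doc.get(matchField)) && doc.containsKey(fromField)) {
                values.add(doc.get(fromField));
            }
        }
        return values;
    }

    /** Phase 2: return the documents whose "to" field holds one of the collected values. */
    static List<Map<String, String>> selectByToField(List<Map<String, String>> docs,
                                                     String toField, Set<String> joinValues) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (joinValues.contains(doc.get(toField))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            d.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
                doc("type", "order", "customerId", "c1"),
                doc("type", "customer", "id", "c1"),
                doc("type", "customer", "id", "c2"));
        Set<String> joinValues = collectFromValues(docs, "type", "order", "customerId");
        System.out.println(selectByToField(docs, "id", joinValues));
    }
}
```

Phase 1 only keeps a set of field values rather than whole documents, which matches the
"smaller more efficient objects" intuition.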

Thanks,
Clive

________________________________
 From: Martijn v Groningen <martijn.v.groningen <at> gmail.com>
To: java-user <at> lucene.apache.org 
Sent: Thursday, March 29, 2012 3:23 PM
Subject: Re: createJoinQuery use

It is only possible to join on one side at the same time.
You mean something like this:
Query fromQuery = JoinUtil.createJoinQuery("from", "to", actualQuery,
indexSearcher);
Query toQuery = JoinUtil.createJoinQuery("to", "from", actualQuery,
indexSearcher);

Paul Hill | 4 Apr 22:06 2012

Hit Highlighting which highlighter to use?

Using the original org.apache.lucene.search.highlight.Highlighter, should I be able to give it a query
like [ My AND Words AND "My Words"^100 ] (the actual phrase in this query is converted to a span query with a
slop of 1)
and expect it to find the fragment many pages into the file that has the span "My Words", and rank it better than
fragments earlier in the document with "My" and "Words" (or lots of "My" and "Words")?

I ask because currently I'm not getting the fragment with the phrase as the best fragment, so I go through
some hacky post-processing to look down the list for a "better" match, but I'm wondering if we have the
HitHighlighter wired up wrong.

At this time, my index does not have term vectors with offsets and positions for all tokenized fields, and the body
"text" field just has positions.

I understand that FastVectorHighlighter is fast, but would it do a better job of finding the phrase or span
in the text if I added positions and offsets to text?

When highlighting small fields like title, path, etc., should I add term vectors with positions and offsets
and use FastVectorHighlighter, or is it just not worth storing that extra information just for highlighting?

-Paul

