Grant Ingersoll (JIRA) | 1 Jul 04:00 2007

[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff


    [
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509347
] 

Grant Ingersoll commented on LUCENE-848:
----------------------------------------

OK, I reran it and it went fine.  Not sure what happened, but maybe just a hiccup on my machine.

I am going to commit.

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-848-build.patch, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt,
LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java,
xerces.jar, xerces.jar, xml-apis.jar
>
>
> Add support for using Wikipedia for benchmarking.


Grant Ingersoll (JIRA) | 1 Jul 04:05 2007

[jira] Updated: (LUCENE-868) Making Term Vectors more accessible


     [
https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-868:
-----------------------------------

    Attachment: LUCENE-868-v1.patch

First pass at a patch providing a callback mechanism for term vectors.  The big benefit is the
ability to control how the TV info is stored, instead of being stuck with parallel arrays.  Two sample
implementations are provided: FieldSortedTermVectorMapper and SortedTermVectorMapper.

One thing I am not completely sure about, and would appreciate a review of, is the interface definition of
TermVectorMapper, in particular the interplay of the setExpectations method with the actual map method
(see TermVectorsReader).  It is not thread-safe (although I am not sure the current approach is either).

The existing access methods for TVs still work just the same, even though the underlying implementation
is different.

I don't know if this is any faster in the Lucene core, but it should speed up those applications that have to
reformat the TV info for their needs.
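The callback pattern described above can be sketched roughly like this. This is an illustrative, self-contained stand-in, not the actual classes from the patch; the real TermVectorMapper signatures (and the sorted mapper implementations) may well differ:

```java
import java.util.TreeMap;

// Minimal stand-in for the callback contract: the reader pushes each
// term to the mapper, and the mapper decides how to store it.
abstract class TermVectorMapperSketch {
    // Called once per field, before any of its terms are mapped.
    abstract void setExpectations(String field, int numTerms);
    // Called once per term in the field's term vector.
    abstract void map(String term, int frequency);
}

// Example mapper: stores terms sorted alphabetically, instead of the
// parallel-array order the reader produces them in.
class SortedMapperSketch extends TermVectorMapperSketch {
    final TreeMap<String, Integer> termFreqs = new TreeMap<String, Integer>();
    String currentField;

    void setExpectations(String field, int numTerms) {
        currentField = field;
    }

    void map(String term, int frequency) {
        termFreqs.put(term, frequency);
    }
}

public class TermVectorMapperDemo {
    public static void main(String[] args) {
        SortedMapperSketch mapper = new SortedMapperSketch();
        // In Lucene, TermVectorsReader would drive these callbacks while
        // decoding the term vector file; here we simulate that.
        mapper.setExpectations("body", 3);
        mapper.map("lucene", 5);
        mapper.map("apache", 2);
        mapper.map("vector", 1);
        System.out.println(mapper.termFreqs); // terms come out sorted
    }
}
```

The point of the pattern is that a different mapper subclass could just as easily store positions/offsets, keep only the top-N terms, or aggregate across fields, without the reader changing.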

> Making Term Vectors more accessible
> -----------------------------------
>
>                 Key: LUCENE-868
>                 URL: https://issues.apache.org/jira/browse/LUCENE-868
>             Project: Lucene - Java
>          Issue Type: New Feature

Grant Ingersoll (JIRA) | 1 Jul 04:23 2007

[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff


    [
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509349
] 

Grant Ingersoll commented on LUCENE-848:
----------------------------------------

Committed.

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-848-build.patch, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt,
LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java,
xerces.jar, xerces.jar, xml-apis.jar
>
>
> Add support for using Wikipedia for benchmarking.

-- 
This message is automatically generated by JIRA.

Doron Cohen | 1 Jul 10:49 2007

Re: search quality - assessment & improvements

Thanks for your comments Chris, and sorry for the delayed
response - you raised some tough questions for me, and I
felt I had to clear my thoughts on this before replying.
(Well, as you'll see below they are not too clear now either,
but I am going to be off-line for the next ~10 days, so
decided not to wait any longer with this...)

Chris Hostetter wrote:

> for the record: SweetSpotSimilarity was designed to let you
> use the average length as the "sweet spot" (both min and max)
> but as you say: you have to know/guess the average length in
> advance (and explicitly set it using
> SweetSpotSimilarity.setLengthNormFactors(yourAvg,yourAvg,
> steepness))

I didn't try this - passing the computed avg doc length to
SweetSpotSimilarity (SSS) - it would be interesting to try.  I wonder
how this would perform compared to the variation of pivoted
(unique) length normalization that I tried.  The difference is
that SSS punishes docs above and below the range, while with
pivoted normalization docs above the pivot are punished and
those below the pivot are boosted.  Pivoted normalization makes
more sense to me than SSS.
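For reference, the pivoted normalization being contrasted with SSS here can be sketched like this (a rough stand-in, not the code from these experiments; the slope and average-length values are made up for illustration):

```java
public class PivotedNormSketch {
    // Singhal-style pivoted length normalization:
    //   norm = 1 / ((1 - slope) + slope * (docLen / avgLen))
    // At docLen == avgLen (the pivot) the norm is exactly 1; longer
    // docs are punished (< 1) and shorter docs are boosted (> 1),
    // unlike SSS, which punishes on both sides of its sweet spot.
    static float pivotedNorm(int docLen, float avgLen, float slope) {
        return 1f / ((1f - slope) + slope * (docLen / avgLen));
    }

    public static void main(String[] args) {
        float avg = 100f, slope = 0.25f;
        System.out.println(pivotedNorm(100, avg, slope)); // at the pivot
        System.out.println(pivotedNorm(400, avg, slope)); // longer doc, punished
        System.out.println(pivotedNorm(25, avg, slope));  // shorter doc, boosted
    }
}
```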

Assuming this proves to be useful in Lucene as a
general improvement (collection independent, largely) -
question is how to compute/store/retrieve this data.
The way I experimented with it was not focused on efficiency
but rather on flexibility at search time, my custom analyzer

Grant Ingersoll | 1 Jul 17:05 2007

Re: [jira] Created: (LUCENE-945) contrib/benchmark tests fail find data dirs

I think legally we are fine, since we aren't actually shipping it.  I
just mean that people may not want to wait however long it takes to
download it.  Of course, I don't know a workaround other than to
have some smaller set.
On Jun 30, 2007, at 10:10 AM, Doron Cohen wrote:

> Grant Ingersoll <gsingers <at> apache.org> wrote on 30/06/2007 05:20:34:
>
>> Does this imply it is going to download the test collection for
>> people when they don't have it when running tests?  I don't know if
>> that is something people are going to want to happen.
>
> Yes it does, and so would auto-build-bots - I am also using
> the Reuters collection for TestQualityRun in LUCENE-836.  Do
> you mean legal-wise?  I should probably add a "DOWNLOAD"
> warning here.  Is there another issue with this?
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe <at> lucene.apache.org
> For additional commands, e-mail: java-dev-help <at> lucene.apache.org
>
Michael McCandless (JIRA) | 2 Jul 02:41 2007

[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff


    [
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509473
] 

Michael McCandless commented on LUCENE-848:
-------------------------------------------

I think somehow the wrong version (1.4.4) of Xerces was committed and
named as lib/xerces-2.9.0.jar.

I'm hitting what I think is the same issue Steven is referring to
above (https://issues.apache.org/jira/browse/LUCENE-848#action_1248988),
this exception after ~350K docs:

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Internal
Error: fPreviousChunk == NULL
	at org.apache.lucene.benchmark.utils.ExtractWikipedia.extract(ExtractWikipedia.java:184)
	at org.apache.lucene.benchmark.utils.ExtractWikipedia.main(ExtractWikipedia.java:199)
Caused by: java.lang.RuntimeException: Internal Error: fPreviousChunk == NULL
	at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1111)
	at org.apache.lucene.benchmark.utils.ExtractWikipedia.extract(ExtractWikipedia.java:181)
	... 1 more

I downloaded 1.4.4 of xerces.jar and "cmp" says it's the same file
that's now checked in as lib/xerces-2.9.0.jar.  When I download xerces
2.9.0 myself and overwrite this one in lib, the extraction finishes
without errors.

> Add supported for Wikipedia English as a corpus in the benchmarker stuff

Grant Ingersoll (JIRA) | 2 Jul 03:00 2007

[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff


    [
https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509474
] 

Grant Ingersoll commented on LUCENE-848:
----------------------------------------

OK, go ahead and commit it.

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-848-build.patch, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt,
LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java,

Michael McCandless | 2 Jul 15:35 2007

[VOTE] Commit LUCENE-843 (IndexWriter performance gains)

Hi,

I'd like to commit LUCENE-843.

The patch has gone through a number of iterations, but the final
version that's there now (take9) is quite a bit cleaner & simpler than
the ones leading up to it, and I believe it is ready.

It provides solid indexing performance gains (between 2X and 8X), but it
is somewhat more complex than the current "single doc per segment"
approach, and it does introduce a change to the index format (only when
autoCommit=false) whereby multiple segments can share a single set of
term vector & stored fields files.

Given that it's such a big change I think (?) it's appropriate to ask
for a vote (only PMC member votes are binding) to make sure we have
consensus that this is net/net a good change for Lucene.

Mike
Michael McCandless (JIRA) | 2 Jul 15:42 2007

[jira] Created: (LUCENE-947) Some improvements to contrib/benchmark

Some improvements to contrib/benchmark
--------------------------------------

                 Key: LUCENE-947
                 URL: https://issues.apache.org/jira/browse/LUCENE-947
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/benchmark
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor

I've made some small improvements to the contrib/benchmark, mostly
merging in the ad-hoc benchmarking code I've been using in LUCENE-843:

  - Fixed thread safety of DirDocMaker's usage of SimpleDateFormat

  - Print the props in sorted order

  - Added new config "autocommit=true|false" to CreateIndexTask

  - Added new config "ram.flush.mb=int" to AddDocTask

  - Added new configs "doc.term.vector.positions=true|false" and
    "doc.term.vector.offsets=true|false" to BasicDocMaker

  - Added WriteLineDocTask.java, so you can make an alg that uses this
    to build a single file containing one document per line.  EG this
    alg converts the reuters-out tree into a single file that has
    ~1000 bytes per body field, saved to
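A sketch of how the new properties listed above might look in an .alg file (the property names come from the list; the values are made-up examples, and placement relative to other alg settings may differ):

```
# illustrative .alg fragment -- values are examples only
autocommit=false
ram.flush.mb=32
doc.term.vector.positions=true
doc.term.vector.offsets=false
```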

Michael McCandless (JIRA) | 2 Jul 15:44 2007

[jira] Updated: (LUCENE-947) Some improvements to contrib/benchmark


     [
https://issues.apache.org/jira/browse/LUCENE-947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-947:
--------------------------------------

    Attachment: LUCENE-947.patch

First cut patch.

> Some improvements to contrib/benchmark
> --------------------------------------
>
>                 Key: LUCENE-947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-947
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-947.patch
>
>
> I've made some small improvements to the contrib/benchmark, mostly
> merging in the ad-hoc benchmarking code I've been using in LUCENE-843:
>   - Fixed thread safety of DirDocMaker's usage of SimpleDateFormat
>   - Print the props in sorted order
>   - Added new config "autocommit=true|false" to CreateIndexTask

