[jira] [Commented] (LUCENE-3930) nuke jars from source tree and use ivy


    [
https://issues.apache.org/jira/browse/LUCENE-3930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243598#comment-13243598
] 

Jan Høydahl commented on LUCENE-3930:
-------------------------------------

We have a 7 MB jar that is included in the binary distro twice. Is there any way to get rid of one copy?
{code}
./contrib/analysis-extras/lib/icu4j-4.8.1.1.jar
./contrib/extraction/lib/icu4j-4.8.1.1.jar
{code}
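
A minimal sketch of how such duplicates could be spotted, assuming a plain list of relative paths is available (for example from a directory listing of the unpacked distro). The class and method names are hypothetical, and the third path is only there for contrast:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DuplicateJarSketch {
    // Groups paths by file name and keeps only the names that occur more than once.
    static Map<String, List<String>> duplicates(List<String> paths) {
        Map<String, List<String>> byName = new TreeMap<>();
        for (String path : paths) {
            String name = path.substring(path.lastIndexOf('/') + 1);
            byName.computeIfAbsent(name, k -> new ArrayList<>()).add(path);
        }
        // Drop every file name that is shipped only once.
        byName.values().removeIf(locations -> locations.size() < 2);
        return byName;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "./contrib/analysis-extras/lib/icu4j-4.8.1.1.jar",
            "./contrib/extraction/lib/icu4j-4.8.1.1.jar",
            "./contrib/extraction/lib/some-other-1.0.jar"); // illustrative path, not from the distro
        System.out.println(duplicates(paths));
    }
}
```

This only compares file names; it does not check whether the two copies are byte-identical.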

Also, from what I can see, the {{solr/contrib/extraction/lib/xml-apis-1.0.b2.jar}} dependency is
redundant: tests pass without it.
See https://issues.apache.org/jira/browse/TIKA-412 and https://issues.apache.org/jira/browse/LUCENE-2961

> nuke jars from source tree and use ivy
> --------------------------------------
>
>                 Key: LUCENE-3930
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3930
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: general/build
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Blocker
>             Fix For: 3.6

[jira] [Updated] (SOLR-3254) Upgrade Solr to Tika 1.1


     [
https://issues.apache.org/jira/browse/SOLR-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-3254:
------------------------------

    Attachment: SOLR-3254.patch

With Ivy it's really easy to do the Tika upgrade, and the patch becomes a plain-text patch that can be applied directly!

This patch also adds some comments to the dependencies section with instructions for upgrading, and
rearranges the deps to match the order listed in http://tika.apache.org/1.1/gettingstarted.html#Using_Tika_as_a_Maven_dependency

It also removes an unused xml-apis dependency.

> Upgrade Solr to Tika 1.1
> ------------------------
>
>                 Key: SOLR-3254
>                 URL: https://issues.apache.org/jira/browse/SOLR-3254
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - LangId, contrib - Solr Cell (Tika extraction)
>            Reporter: Jan Høydahl
>             Fix For: 4.0
>
>         Attachments: SOLR-3254.patch
>
>

[jira] [Commented] (SOLR-3254) Upgrade Solr to Tika 1.1


    [
https://issues.apache.org/jira/browse/SOLR-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243603#comment-13243603
] 

Jan Høydahl commented on SOLR-3254:
-----------------------------------

Here's the major news in v1.1: http://tika.apache.org/1.1/

I have not tried to exclude any parsers at all - such optimization is left for another issue...

> Upgrade Solr to Tika 1.1
> ------------------------
>
>                 Key: SOLR-3254
>                 URL: https://issues.apache.org/jira/browse/SOLR-3254
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - LangId, contrib - Solr Cell (Tika extraction)
>            Reporter: Jan Høydahl
>             Fix For: 4.0
>
>         Attachments: SOLR-3254.patch
>
>
> Tika 1.1 is being released soon. It features some new parsers, the ability to extract text from
> password-protected PDFs and office docs, and several bug fixes. See http://people.apache.org/~mattmann/apache-tika-1.1/rc1/CHANGES-1.1.txt
> We should upgrade as soon as it is released.


[jira] [Assigned] (SOLR-3254) Upgrade Solr to Tika 1.1


     [
https://issues.apache.org/jira/browse/SOLR-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl reassigned SOLR-3254:
---------------------------------

    Assignee: Jan Høydahl

> Upgrade Solr to Tika 1.1
> ------------------------
>
>                 Key: SOLR-3254
>                 URL: https://issues.apache.org/jira/browse/SOLR-3254
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - LangId, contrib - Solr Cell (Tika extraction)
>            Reporter: Jan Høydahl
>            Assignee: Jan Høydahl
>             Fix For: 4.0
>
>         Attachments: SOLR-3254.patch
>
>
> Tika 1.1 is being released soon. It features some new parsers, the ability to extract text from
> password-protected PDFs and office docs, and several bug fixes. See http://people.apache.org/~mattmann/apache-tika-1.1/rc1/CHANGES-1.1.txt
> We should upgrade as soon as it is released.

--
This message is automatically generated by JIRA.

[jira] [Updated] (SOLR-1929) Index encrypted pdf files


     [
https://issues.apache.org/jira/browse/SOLR-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1929:
------------------------------

    Fix Version/s: 4.0
         Assignee: Jan Høydahl

> Index encrypted pdf files
> -------------------------
>
>                 Key: SOLR-1929
>                 URL: https://issues.apache.org/jira/browse/SOLR-1929
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Yiannis Pericleous
>            Assignee: Jan Høydahl
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: SOLR-1929.patch
>
>
> SolrCell is not able to index encrypted PDFs.
> This is easily done by supplying the password in the metadata passed on to Tika.


[jira] [Commented] (LUCENE-3939) ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper


    [
https://issues.apache.org/jira/browse/LUCENE-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243607#comment-13243607
] 

SHIN HWEI TAN commented on LUCENE-3939:
---------------------------------------

Thanks for the quick response.

I don't think that passing null as the Comparator is the problem. For example, if the first invocation of the
method "map" is commented out (as below), then no exception is thrown. In this case, the Comparator
is still null.

   org.apache.lucene.index.SortedTermVectorMapper var3 =
       new org.apache.lucene.index.SortedTermVectorMapper(false, false, (java.util.Comparator) null);
   var3.setExpectations("", 0, false, false);
   var3.map("*:", (-1), (org.apache.lucene.index.TermVectorOffsetInfo[]) null, (int[]) null);
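
As general background (not a diagnosis of LUCENE-3939), the sketch below shows how a null Comparator can lead to a ClassCastException in the JDK's own sorted collections: a TreeSet built with a null comparator falls back to natural ordering, which fails for elements that do not implement Comparable. The Entry class is a hypothetical stand-in and this is not Lucene code. Depending on the JDK version, the exception may surface on the first insertion or only on a later one, so the number of insertions can matter.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class NullComparatorSketch {
    // Hypothetical element type that deliberately does not implement Comparable.
    static final class Entry {
        final String term;
        Entry(String term) { this.term = term; }
    }

    // Returns true if inserting two such entries into a TreeSet created with
    // a null comparator raises ClassCastException.
    static boolean throwsOnAdd() {
        // The cast picks the TreeSet(Comparator) constructor; null means natural ordering.
        TreeSet<Entry> set = new TreeSet<Entry>((Comparator<Entry>) null);
        try {
            set.add(new Entry("a"));
            set.add(new Entry("b"));
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("ClassCastException thrown: " + throwsOnAdd());
    }
}
```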

> ClassCastException thrown in the map(String,int,TermVectorOffsetInfo[],int[]) method in org.apache.lucene.index.SortedTermVectorMapper
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3939
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3939
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 3.0.2, 3.1, 3.4, 3.5
>            Reporter: SHIN HWEI TAN
>   Original Estimate: 0.05h

[jira] [Closed] (SOLR-1924) Solr's updateRequestHandler does not have a fast way of guaranteeing document delivery


     [
https://issues.apache.org/jira/browse/SOLR-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl closed SOLR-1924.
-----------------------------

    Resolution: Duplicate

> Solr's updateRequestHandler does not have a fast way of guaranteeing document delivery
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-1924
>                 URL: https://issues.apache.org/jira/browse/SOLR-1924
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Karl Wright
>
> It is currently not possible, without performing a commit on every document, to use
> updateRequestHandler to guarantee delivery into the index of any document.  The reason is that whenever
> Solr is restarted, some or all documents that have not been committed yet are dropped on the floor, and
> there is no way for a client of updateRequestHandler to know which ones this happened to.
> I believe it is not even possible to write a middleware-style layer that stores documents and performs
> periodic commits on its own, because the update request handler never ACKs individual documents on a
> commit, but merely everything it has seen since the last time Solr bounced.  So you have this potential scenario:
> - middleware layer receives document 1, saves it
> - middleware layer receives document 2, saves it
> Now it's time for the commit, so:
> - middleware layer sends document 1 to updateRequestHandler

[jira] [Updated] (SOLR-1856) In Solr Cell, literals should override Tika-parsed values


     [
https://issues.apache.org/jira/browse/SOLR-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-1856:
------------------------------

    Affects Version/s:     (was: 1.4)
        Fix Version/s: 4.0
             Assignee: Jan Høydahl

> In Solr Cell, literals should override Tika-parsed values
> ---------------------------------------------------------
>
>                 Key: SOLR-1856
>                 URL: https://issues.apache.org/jira/browse/SOLR-1856
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Chris Harris
>            Assignee: Jan Høydahl
>             Fix For: 4.0
>
>         Attachments: SOLR-1856.patch
>
>
> I propose that ExtractingRequestHandler / SolrCell literals should take precedence over Tika-parsed
> metadata in all situations, including where multiValued="true". (Compare SOLR-1633?)
> My personal motivation is that I have several fields (e.g. "title", "date") where my own metadata is much
> superior to what Tika offers, and I want to throw those Tika values away. (I actually wouldn't mind

[jira] [Updated] (SOLR-2649) MM ignored in edismax queries with operators


     [
https://issues.apache.org/jira/browse/SOLR-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated SOLR-2649:
------------------------------

    Affects Version/s:     (was: 3.3)
        Fix Version/s: 4.0

> MM ignored in edismax queries with operators
> --------------------------------------------
>
>                 Key: SOLR-2649
>                 URL: https://issues.apache.org/jira/browse/SOLR-2649
>             Project: Solr
>          Issue Type: Bug
>          Components: query parsers
>            Reporter: Magnus Bergmark
>            Priority: Minor
>             Fix For: 4.0
>
>
> Hypothetical scenario:
>   1. User searches for "stocks oil gold" with MM set to "50%"
>   2. User adds "-stockings" to the query: "stocks oil gold -stockings"
>   3. User gets no hits since MM was ignored and all terms were AND-ed together
> The behavior seems to be intentional, although the reason why is never explained:
>   // For correct lucene queries, turn off mm processing if there
>   // were explicit operators (except for AND).
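
The arithmetic behind the scenario can be sketched as follows. This is a simplified illustration under stated assumptions, not Solr's parser code: the helper name is hypothetical, only positive percentages and plain integers are handled, and percentages are assumed to round down.

```java
public class MinShouldMatchSketch {
    // Hypothetical helper: turns an mm spec such as "50%" or "2" into the
    // number of optional clauses that must match.
    static int minShouldMatch(int optionalClauses, String spec) {
        if (spec.endsWith("%")) {
            int percent = Integer.parseInt(spec.substring(0, spec.length() - 1));
            // Integer division rounds down (assumed behavior).
            return (optionalClauses * percent) / 100;
        }
        return Math.min(optionalClauses, Integer.parseInt(spec));
    }

    public static void main(String[] args) {
        int terms = 3; // "stocks oil gold"
        System.out.println("mm honored: " + minShouldMatch(terms, "50%") + " of " + terms + " terms required");
        // If mm is ignored and the terms are AND-ed, all of them are required.
        System.out.println("mm ignored: " + terms + " of " + terms + " terms required");
    }
}
```

Under these assumptions, honoring mm for the three positive terms would require only one of them to match, whereas AND-ing requires all three, which is consistent with the reported drop to zero hits.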

[jira] [Commented] (SOLR-2366) Facet Range Gaps


    [
https://issues.apache.org/jira/browse/SOLR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243611#comment-13243611
] 

Jan Høydahl commented on SOLR-2366:
-----------------------------------

Note to self: catch up on this again :)

> Facet Range Gaps
> ----------------
>
>                 Key: SOLR-2366
>                 URL: https://issues.apache.org/jira/browse/SOLR-2366
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: SOLR-2366.patch, SOLR-2366.patch
>
>
> There really is no reason why the range gap for date and numeric faceting needs to be evenly spaced.  For
> instance, if and when SOLR-1581 is completed and one were doing spatial distance calculations, one could
> facet by function into 3 different sized buckets: walking distance (0-5KM), driving distance
> (5KM-150KM) and everything else (150KM+), for instance.  We should be able to quantize the results into
> arbitrarily sized buckets.
> (Original syntax proposal removed, see discussion for concrete syntax)
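
The idea of quantizing values into unevenly sized buckets can be sketched like this. It is only an illustration of the concept from the description, not the syntax or implementation discussed in the issue; the class name and the choice of exclusive upper bounds are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RangeGapSketch {
    // Exclusive upper bounds in km for all buckets except the last, open-ended one.
    static final double[] UPPER_BOUNDS = {5.0, 150.0};
    static final String[] LABELS = {"walking", "driving", "everything else"};

    // Finds the first bucket whose upper bound lies above the value.
    static String bucketFor(double km) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (km < UPPER_BOUNDS[i]) {
                return LABELS[i];
            }
        }
        return LABELS[LABELS.length - 1];
    }

    // Counts how many distances fall into each bucket, keeping bucket order.
    static Map<String, Integer> facet(double[] distancesKm) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String label : LABELS) {
            counts.put(label, 0);
        }
        for (double d : distancesKm) {
            String bucket = bucketFor(d);
            counts.put(bucket, counts.get(bucket) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(facet(new double[] {1.0, 4.9, 5.0, 100.0, 150.0, 900.0}));
    }
}
```

Because the bounds are just an array, adding or resizing buckets does not require the gaps to be evenly spaced.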