Lance Norskog (JIRA) | 1 Jan 2011 01:03

[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup


    [
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976370#action_12976370
] 

Lance Norskog commented on LUCENE-2611:
---------------------------------------

Robert, thank you for this grand effort.  After a few times, I can set it up faster now but it's still a pain. 

Now, the last thing you want to hear :) I use Eclipse text search a lot, and it's much faster when a code base is
split between a few projects instead of one large project. Maybe I can help on that one.

> IntelliJ IDEA and Eclipse setup
> -------------------------------
>
>                 Key: LUCENE-2611
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2611
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Build
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch,
LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch,
LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch,
LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch,

Lance Norskog (JIRA) | 1 Jan 2011 01:07

[jira] Commented: (LUCENE-2611) IntelliJ IDEA and Eclipse setup


    [
https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976371#action_12976371
] 

Lance Norskog commented on LUCENE-2611:
---------------------------------------

About making it easier to contribute with patches: what makes this easy is Git or Mercurial, not an IDE. If
you make a local Git repository matching your checkout you can make separate branches for each project. 
Git is way too flexible and confusing, but I can't live without it now.

You can do a full Git checkout from the links on git.apache.org. There is a nice Git extension for Eclipse
that makes it really easy to manage branches.

Cheers, Lance.

> IntelliJ IDEA and Eclipse setup
> -------------------------------
>
>                 Key: LUCENE-2611
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2611
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Build
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
>

Apache Hudson Server | 1 Jan 2011 07:47

Solr-3.x - Build # 214 - Still Failing

Build: https://hudson.apache.org/hudson/job/Solr-3.x/214/

All tests passed

Build Log (for compile errors):
[...truncated 20224 lines...]
John Koumarelas | 1 Jan 2011 13:45

Getting involved with Lucene - Java.

First of all Happy new year to everyone in Lucene!

I would like to know: whom should I ask if I want to get involved with the project (e.g. find/fix bugs)? Should I search for a mentor, or should I just try to fix bugs from the list by myself?

Thank you for your time,

John Koumarelas
Simon Willnauer | 1 Jan 2011 16:31

Re: Getting involved with Lucene - Java.

Hey John,

great to see your interest in helping Lucene move forward. It's an
amazing time to get started, as development is moving very quickly
right now. The easiest way to get started is probably to become
familiar with how we work and how Apache works. The Contribution Guide
(http://wiki.apache.org/lucene-java/HowToContribute) is a very
good place to start.
You should have a JIRA account and a working dev environment to run
tests etc. Don't hesitate to check out the latest trunk or one of
the branches if you want to work on one of those. We currently have 3
branches with active development side by side with the trunk.
Under http://svn.apache.org/repos/asf/lucene/dev/branches/ you will find 5
subfolders, including:

branch_3x/ - a branch that maintains compatibility with the 3.x releases
and is the place for backporting important yet compatible features
to the 3.x versions.
bulkpostings/ - bulk reading support for posting lists, see
https://issues.apache.org/jira/browse/LUCENE-2723
docvalues/ - column stride fields, see
https://issues.apache.org/jira/browse/LUCENE-2186
realtime_search/ - realtime search branch, see
https://issues.apache.org/jira/browse/LUCENE-2324

Under http://svn.apache.org/repos/asf/lucene/dev/trunk you will find the
latest trunk sources. It might be a good idea to check those out and
run "ant clean test" to see whether the build and the tests succeed.

Once you are set up, it's up to you to fix bugs as you find them. You
don't need a mentor: once you open an issue, you should get attention
from one of the committers and/or one of the other contributors, who
will work with you towards a committable patch.

Don't hesitate to ask any further questions.

Simon

2011/1/1 John Koumarelas <koumjohn <at> hotmail.com>:
> First of all Happy new year to everyone in Lucene!
>
> I would like to know: whom should I ask if I want to get involved with the
> project (e.g. find/fix bugs)? Should I search for a mentor, or should I
> just try to fix bugs from the list by myself?
>
> Thank you for your time,
>
> John Koumarelas
>
Apache Hudson Server | 1 Jan 2011 18:05

Lucene-Solr-tests-only-trunk - Build # 3258 - Failure

Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/3258/

1 tests failed.
REGRESSION:  org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple

Error Message:
expected:<3> but was:<2>

Stack Trace:
junit.framework.AssertionFailedError: expected:<3> but was:<2>
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1109)
	at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1047)
	at org.apache.solr.client.solrj.TestLBHttpSolrServer.testSimple(TestLBHttpSolrServer.java:126)

Build Log (for compile errors):
[...truncated 7572 lines...]
Li Li | 1 Jan 2011 18:33

Re: strange problem of PForDelta decoder

   I sent a mail to the MG4J group, and Sebastiano Vigna recommended the paper
"Reducing query latencies in web search using fine-grained parallelism",
World Wide Web, 12(4):441-460, 2009.
   I have read it roughly. It says:

    "it first coalesces all disk reads in a single process, and then
distributes the index data among the parallel threads. So when the index
server receives a query, it loads from storage (disk or cache) the required
posting lists, initializes the query execution, and then spawns lightweight
threads, one per core. Each thread receives an equal-sized subset of
document IDs to scan, together covering the entire index partition. All
threads execute the same code on the same query, but with private data
structures. The only writable shared data structure is a heap of the
top-scoring hits, protected by a lock. At the end of the threads'
execution, the heap contains the highest-scoring hits in the entire index
partition, which is then transmitted to the query integrator as usual.
Since the index contains skip-lists that permit near-random-access to any
document ID, and since hits are generally distributed evenly in the index,
our measurements show that all threads complete their work with little
variability, within 1-2% of each other."
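
To make the scheme in the passage above concrete, here is a minimal Java sketch of it, not code from the paper or from Lucene. It assumes the posting lists have already been loaded and reduced to an in-memory array of per-document scores (0 meaning "no match"); the class name ParallelTopK and its methods are hypothetical. Each thread scans one contiguous, equal-sized range of document IDs, and the only shared writable structure is a top-k heap guarded by a lock.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: one thread per doc-ID range, one lock-protected top-k heap. */
final class ParallelTopK {

    /** A scored hit; higher score ranks higher, ties go to the lower doc ID. */
    static final class Hit implements Comparable<Hit> {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
        @Override public int compareTo(Hit o) {
            int c = Float.compare(score, o.score);
            return c != 0 ? c : Integer.compare(o.doc, doc);
        }
    }

    /** Returns the k best hits, best first. Assumes threads >= 1; scores[doc] <= 0 means no match. */
    static List<Hit> search(float[] scores, int k, int threads) {
        if (k <= 0) return new ArrayList<>();
        PriorityQueue<Hit> heap = new PriorityQueue<>(); // min-heap: weakest kept hit on top
        Object lock = new Object();
        int maxDoc = scores.length;
        int chunk = (maxDoc + threads - 1) / threads;    // equal-sized ranges (last may be shorter)
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int from = Math.min(maxDoc, t * chunk);
            final int to = Math.min(maxDoc, from + chunk);
            Thread w = new Thread(() -> {
                for (int doc = from; doc < to; doc++) {
                    if (scores[doc] <= 0f) continue;
                    Hit h = new Hit(doc, scores[doc]);
                    synchronized (lock) {                // the only shared writable state
                        if (heap.size() < k) {
                            heap.add(h);
                        } else if (heap.peek().compareTo(h) < 0) {
                            heap.poll();
                            heap.add(h);
                        }
                    }
                }
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while searching", e);
            }
        }
        List<Hit> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder());
        return out;
    }

    public static void main(String[] args) {
        float[] scores = {0f, 0.5f, 0f, 2.0f, 1.0f, 0f, 3.0f, 0.25f};
        for (Hit h : search(scores, 3, 4)) System.out.println(h.doc + " " + h.score);
    }
}
```

In a real index each thread would first position its iterators at the start of its range (the skip-list step the passage mentions); the array lookup here merely stands in for that.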

  I have some questions:
  1. "Since the index contains skip-lists that permit near-random-access
to any document ID": can a skip list really give near-random access
(especially when it's on a hard disk)?

  2. For an OR query, it's easy. E.g. we search for "search" and "engine":
we use one main thread to get the postings of the 2 terms, divide their
postings into 2 groups (e.g. even docIds and odd docIds), use 2 threads
to score them, and finally merge the results (or use a locked priority
queue).
     But for an AND query, we usually don't decode all the docIDs,
because we can skip many documents, especially when combining
low-frequency terms with high-frequency terms: only a small
number of the docIDs of the high-frequency term are decoded.

   BTW, I think an OR query is often much slower than an AND query. If we
can parallelize OR queries well, that is also very useful.
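
As a hedged illustration of the AND case, the sketch below combines skipping with the range-based split suggested in the quoted reply further down (divide the doc-ID space into chunks and skipTo the start of each chunk). It is not Lucene code: the class RangeIntersect is hypothetical, posting lists are plain sorted, duplicate-free int arrays, and a binary search stands in for a skip-list skipTo. The posting lists in main are the term1 and term2 lists from the even/odd example quoted below, merged back into single lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical sketch: AND query over doc-ID ranges, with skipping inside each range. */
final class RangeIntersect {

    /** Index of the first entry >= target at or after position from (skipTo stand-in). */
    static int advance(int[] postings, int from, int target) {
        int i = Arrays.binarySearch(postings, from, postings.length, target);
        return i >= 0 ? i : -i - 1;
    }

    /** Doc IDs in [lo, hi) that occur in both sorted, duplicate-free lists. */
    static List<Integer> intersect(int[] a, int[] b, int lo, int hi) {
        List<Integer> out = new ArrayList<>();
        int i = advance(a, 0, lo);   // position both iterators at the start of the chunk
        int j = advance(b, 0, lo);
        while (i < a.length && j < b.length && a[i] < hi && b[j] < hi) {
            if (a[i] == b[j]) {
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i + 1, b[j]);  // skip a forward to b's current doc
            } else {
                j = advance(b, j + 1, a[i]);  // skip b forward to a's current doc
            }
        }
        return out;
    }

    /** Splits [0, maxDoc) into chunks, intersects them in parallel, returns hits in doc-ID order. */
    static List<Integer> parallelIntersect(int[] a, int[] b, int maxDoc, int chunks) {
        int size = (maxDoc + chunks - 1) / chunks;
        return IntStream.range(0, chunks).parallel()
                .mapToObj(c -> intersect(a, b,
                        Math.min(maxDoc, c * size), Math.min(maxDoc, (c + 1) * size)))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int[] term1 = {0, 1, 3, 4, 8, 9, 10, 11};
        int[] term2 = {0, 3, 6, 8, 9, 11, 12, 15};
        System.out.println(parallelIntersect(term1, term2, 16, 2));
    }
}
```

Because every chunk is a contiguous doc-ID range, each worker can still skip within its range, and the index itself does not need to know how many threads will be used.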

On Dec 31, 2010, at 7:25 AM, Li Li wrote:

>    Which one is used in MG4J to support multithreads searching? Are

2010/12/31 Li Li <fancyerii <at> gmail.com>:
> is there anyone familiar with MG4J(http://mg4j.dsi.unimi.it/)
> it says Multithreading. Indices can be queried and scored concurrently.
> maybe we can learn something from it.
>
> 2010/12/31 Li Li <fancyerii <at> gmail.com>:
>> Plus:
>> point 2 means that looking up a term needs many seeks in tis (if it's not cached in tii)
>>
>> 2010/12/31 Li Li <fancyerii <at> gmail.com>:
>>> Searching multiple segments is an alternative solution, but it has some
>>> disadvantages.
>>> 1. idf is not global? (I am not familiar with its implementation.) Maybe
>>> it's easy to solve by sharing a global idf.
>>> 2. Each segment has its own tii and tis files, which may make
>>> search slower (that's why optimizing the
>>> index is necessary).
>>> 3. One term's docList is distributed over many files rather than one.
>>> More than one frq file means the
>>> hard disk must seek to different tracks, which is time-consuming. If there is
>>> only one segment, they are likely
>>> stored in a single track.
>>>
>>>
>>> 2010/12/31 Earwin Burrfoot <earwin <at> gmail.com>:
>>>>>>>until we fix Lucene to run a single search concurrently (which we
>>>>>>>badly need to do).
>>>>> I am interested in this idea (I have posted about it before). Do you have some
>>>>> resources such as papers or tech articles about it?
>>>>> I have tried, but it needs dramatic changes to the index format, and we use
>>>>> Solr distributed search to relieve the response-time problem, so I finally
>>>>> gave it up.
>>>>> Lucene 4's index format is more flexible in that it supports custom codecs,
>>>>> and it's under development now, so I think it's a good time to consider
>>>>> letting it support multithreaded searching for a single query.
>>>>> I have a naive solution: dividing the docList into many groups,
>>>>> e.g. grouping docIds by whether they are even or odd:
>>>>> term1 df1=4  docList = 0  4  8  10
>>>>> term1 df2=4  docList = 1  3  9  11
>>>>>
>>>>> term2 df1=4  docList = 0  6  8  12
>>>>> term2 df2=4  docList = 3  9  11 15
>>>>>   Then we can use 2 threads to search the topN docs on the even group and the
>>>>> odd group and finally merge their results into a single one, just like Solr
>>>>> distributed search.
>>>>> But it's better than Solr distributed search.
>>>>>   First, it's in a single process, and data communication between
>>>>> threads is much faster than over the network.
>>>>>   Second, each thread processes the same number of documents. With Solr
>>>>> distributed search, one shard may process 7 documents and another shard only 1.
>>>>> Even if we can make each shard hold the same number of documents, we cannot
>>>>> make the distribution uniform for each term.
>>>>>    e.g. shard1 has doc1 doc2
>>>>>           shard2 has doc3 doc4
>>>>>    but term1 may only occur in doc1 and doc2
>>>>>    while term2 may only occur in doc3 and doc4
>>>>>    we may modify it
>>>>>           shard1 doc1 doc3
>>>>>           shard2 doc2 doc4
>>>>>    it's good for term1 and term2
>>>>>    but term3 may occur in doc1 and doc3...
>>>>>    So I think this is a fine-grained distribution of the index, while Solr
>>>>> distributed search is coarse-grained.
>>>> This is just crazy :)
>>>>
>>>> The simple way is just to search different segments in parallel.
>>>> BalancedSegmentMergePolicy makes sure you have roughly even-sized
>>>> large segments (and small ones don't count, they're small!).
>>>> If you're bound on squeezing out that extra millisecond (and making
>>>> your life miserable along the way), you can search a single segment
>>>> with multiple threads (by dividing it in even chunks, and then doing
>>>> skipTo to position your iterators to the beginning of each chunk).
>>>>
>>>> First approach is really easy to implement. Second one is harder, but
>>>> still doesn't require you to cook the number of CPU cores available
>>>> into your index!
>>>>
>>>> It's the law of diminishing returns at play here. You're most likely
>>>> to search in parallel over mostly memory-resident index
>>>> (RAMDir/mmap/filesys cache - doesn't matter), as most of IO subsystems
>>>> tend to slow down considerably on parallel sequential reads, so you
>>>> already have pretty decent speed.
>>>> Searching different segments in parallel (with BSMP) makes you several
>>>> times faster.
>>>> Searching in parallel within a segment requires some weird hacks, but
>>>> has maybe a few percent advantage over previous solution.
>>>> Sharding posting lists requires a great deal of weird hacks, makes
>>>> index machine-bound, and boosts speed by another couple of percent.
>>>> Sounds worthless.
>>>>
>>>> --
>>>> Kirill Zakharenko/Кирилл Захаренко (earwin <at> gmail.com)
>>>> Phone: +7 (495) 683-567-4
>>>> ICQ: 104465785
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe <at> lucene.apache.org
>>>> For additional commands, e-mail: dev-help <at> lucene.apache.org
>>>>
>>>>
>>>
>>
>
Uwe Schindler (JIRA) | 1 Jan 2011 21:23

[jira] Updated: (LUCENE-2838) ConstantScoreQuery should directly support wrapping Query and simply strip off scores


     [
https://issues.apache.org/jira/browse/LUCENE-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2838:
----------------------------------

    Attachment: LUCENE-2838.patch

Attached is the correct solution of ConstantScoreQuery with a scorer that supports topScorer and
directly delegates hit collection to its wrapped scorer. This enables use of BooleanScorer2 for MTQs.

The test case verifies that the correct boosts are collected and queries wrapped two times with different
boosts return the correct scores, too.

I will commit soon!

> ConstantScoreQuery should directly support wrapping Query and simply strip off scores
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2838
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2838
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2838-no-topscorer-opt.patch, LUCENE-2838.patch, LUCENE-2838.patch, LUCENE-2838.patch
>
>
> Especially in MultiTermQuery rewrite modes we often simply need to strip off scores from Queries and make
> them constant score. Currently the code to do this looks quite ugly: new ConstantScoreQuery(new QueryWrapperFilter(query))
> As the name says, ConstantScoreQuery should make any other Query constant score, so why does it not take a
> Query as ctor param? This question was also asked quite often by my customers and is simply correct, if you
> think about it.
> Looking closer into the code, it is clear that this would also speed up MTQs:
> - One additional wrapping and its method calls can be removed
> - Maybe we can even deprecate QueryWrapperFilter in 3.1 now (it's now only used in tests and the use-case
> for this class is not really available) and LUCENE-2831 does not need the stupid hack to make Simon's
> assertions pass
> - CSQ now supports out-of-order scoring and topLevel scoring, so a CSQ on top-level now directly feeds the
> Collector. For that a small trick is used: The score(Collector) calls are directly delegated and the
> scores are stripped by wrapping the setScorer() method in Collector
> During that I found a visibility bug in Scorer (LUCENE-2839): The method "boolean score(Collector
> collector, int max, int firstDocID)" should be public, not protected, as it's not solely intended to be
> overridden by subclasses and is called from other classes, too! This causes no compiler errors, as the other
> classes that call it are mainly BooleanScorer(2), which is in the same package, but the visibility is wrong. I
> will open an issue for that and fix it at least in trunk where we have no backwards-requirement.
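
The "small trick" in the description above (delegate hit collection to the wrapped scorer and strip the scores by wrapping setScorer() in the Collector) can be illustrated with a minimal, self-contained Java sketch. This is not the attached patch and does not use Lucene's classes: the Scorer and Collector interfaces below are simplified, hypothetical stand-ins, and ConstantScoreSketch is an invented name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of score stripping via a wrapping collector; not Lucene's API. */
final class ConstantScoreSketch {

    /** Simplified stand-in for a scorer: reports the score of the current document. */
    interface Scorer {
        float score();
    }

    /** Simplified stand-in for a hit collector. */
    interface Collector {
        void setScorer(Scorer scorer);
        void collect(int doc);
    }

    /** Wraps a collector so that any scorer handed to it appears to return a constant score. */
    static Collector constantScore(Collector delegate, float constant) {
        return new Collector() {
            @Override public void setScorer(Scorer ignored) {
                delegate.setScorer(() -> constant);  // the real scores are stripped here
            }
            @Override public void collect(int doc) {
                delegate.collect(doc);               // hit collection is delegated untouched
            }
        };
    }

    /** Toy inner scorer: feeds every doc, with its own score, straight into the collector. */
    static void scoreAll(int[] docs, float[] scores, Collector collector) {
        final int[] pos = {0};
        collector.setScorer(() -> scores[pos[0]]);
        for (pos[0] = 0; pos[0] < docs.length; pos[0]++) {
            collector.collect(docs[pos[0]]);
        }
    }

    /** Runs the toy scorer through the wrapper and records "doc:score" for every hit. */
    static List<String> run(int[] docs, float[] scores, float constant) {
        List<String> seen = new ArrayList<>();
        Collector recording = new Collector() {
            private Scorer scorer;
            @Override public void setScorer(Scorer s) { scorer = s; }
            @Override public void collect(int doc) { seen.add(doc + ":" + scorer.score()); }
        };
        scoreAll(docs, scores, constantScore(recording, constant));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(new int[] {1, 5, 7}, new float[] {0.3f, 2.5f, 9f}, 2.0f));
    }
}
```

The inner scorer drives the loop itself and never needs to know that its scores are discarded, which is why the wrapped query can keep using whatever top-level scoring path it prefers.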

--

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Uwe Schindler (JIRA) | 1 Jan 2011 21:41

[jira] Updated: (LUCENE-2838) ConstantScoreQuery should directly support wrapping Query and simply strip off scores


     [
https://issues.apache.org/jira/browse/LUCENE-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2838:
----------------------------------

    Attachment:     (was: LUCENE-2838.patch)

> ConstantScoreQuery should directly support wrapping Query and simply strip off scores
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2838
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2838
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2838-no-topscorer-opt.patch, LUCENE-2838.patch, LUCENE-2838.patch
>
>
Uwe Schindler (JIRA) | 1 Jan 2011 21:43

[jira] Issue Comment Edited: (LUCENE-2838) ConstantScoreQuery should directly support wrapping Query and simply strip off scores


    [
https://issues.apache.org/jira/browse/LUCENE-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976445#action_12976445
] 

Uwe Schindler edited comment on LUCENE-2838 at 1/1/11 3:42 PM:
---------------------------------------------------------------

Attached is the correct solution of ConstantScoreQuery with a scorer that supports topScorer and
directly delegates hit collection to its wrapped scorer. This enables use of the bucket-using
BooleanScorer instead of BooleanScorer2 for MTQs.

The test case verifies that the correct boosts are collected and queries wrapped two times with different
boosts return the correct scores, too.

      was (Author: thetaphi):
    Attached is the correct solution of ConstantScoreQuery with a scorer that supports topScorer and
directly delegates hit collection to its wrapped scorer. This enables use of BooleanScorer2 for MTQs.

The test case verifies that the correct boosts are collected and queries wrapped two times with different
boosts return the correct scores, too.

I will commit soon!

> ConstantScoreQuery should directly support wrapping Query and simply strip off scores
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2838
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2838
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2838-no-topscorer-opt.patch, LUCENE-2838.patch, LUCENE-2838.patch, LUCENE-2838.patch
>
>
