1 Jul 2010 12:16
Query totals - approximations.
Simon Gaeremynck <gaeremyncks <at> gmail.com>
2010-07-01 10:16:04 GMT
First off, I know the question of whether it is possible to get an accurate count
from query results has been asked many times before.
I know Jackrabbit only loads the next result when it really has to, which is fine
since it gives a great performance boost.
I also know you can "trick/force" Jackrabbit into returning a total by adding a sort to the query, but that's not
really what we want.
So we thought we might take a Google approach where we say
"Displaying first 10 results of approximately 1400000."
Some more info about this:
Now, to do this we thought we could get the hit count from Lucene, get the first 10 nodes,
keep a record of how many Lucene Documents we had to iterate over to get those first 10
and then do a very rudimentary approximation of how many nodes the user would be able to see for this query.
i.e.:
1. Lucene returns a total hit count of 1,523,145.
2. We fetch the first 10 Nodes, which requires iterating over 452 Documents because most of them can't be used:
the user doesn't have READ access.
3. Based on these 2 numbers we approximate that the user can see 33,700 Nodes (1,523,145 * 10 / 452).
4. We round this number down to 33,000 just to indicate that it's unlikely we guessed right.
5. The UI displays a message along the lines of:
"
Displaying page 1 of approximately 3,300
Showing 10 results per page.
"
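To make the arithmetic concrete, here is a minimal sketch of the approximation in plain Java. It is not Jackrabbit code: the class HitEstimator, its method names, and the canRead predicate (standing in for the READ access check) are all hypothetical, and it assumes the readable fraction seen while filling the first page holds for the rest of the hits.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of Jackrabbit or Lucene.
final class HitEstimator {

    /** The readable items of the first page plus the number of hits iterated to find them. */
    static final class Page<T> {
        final List<T> items;
        final int scanned;

        Page(List<T> items, int scanned) {
            this.items = items;
            this.scanned = scanned;
        }
    }

    /** Walks the hits lazily and stops as soon as pageSize readable items are found. */
    static <T> Page<T> firstPage(Iterator<T> hits, Predicate<T> canRead, int pageSize) {
        List<T> items = new ArrayList<>();
        int scanned = 0;
        while (items.size() < pageSize && hits.hasNext()) {
            T hit = hits.next();
            scanned++;                 // counts every hit, readable or not
            if (canRead.test(hit)) {   // assumption: stands in for the READ access check
                items.add(hit);
            }
        }
        return new Page<>(items, scanned);
    }

    /** Scales the total hit count by the readable fraction observed so far. */
    static long estimateVisible(long totalHits, int scanned, int readable) {
        if (scanned == 0) {
            return 0;
        }
        return Math.round((double) totalHits * readable / scanned);
    }

    /** Rounds down to the given number of significant digits, e.g. 33698 -> 33000 for 2. */
    static long roundDown(long value, int significantDigits) {
        long threshold = 1;
        for (int i = 0; i < significantDigits; i++) {
            threshold *= 10;
        }
        long magnitude = 1;
        while (value / magnitude >= threshold) {
            magnitude *= 10;
        }
        return value / magnitude * magnitude;
    }
}
```

With 1,523,145 hits, 452 Documents iterated and 10 readable Nodes, estimateVisible returns 33698 and roundDown(33698, 2) returns 33000, which is 3300 pages of 10 results. The estimate is only as good as the first page is representative: if the results are ordered so that readable Nodes cluster at the front or the back, it can be off by a large factor.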
Now I had a look at how Jackrabbit executes queries and there seem to be 3 ways it gets the QueryHits (in JackrabbitIndexSearcher.evaluate)
- Check if it is a JackrabbitQuery and let the Query implementation deal with it.
- It is not a JackrabbitQuery and there is no sort required -- use LuceneQueryHits