aarsh shah | 15 May 2013 15:27
Picon

Better parsing of BM25 parameters in Omega

Hello guys, as discussed on IRC, I have written some code for better parsing of BM25 parameters in Omega. If no parameters are specified ,it defaults all of them. However, if there some are specified and some are not or if the invalid values are given for any of them,it throws an error.

https://github.com/aarshkshah1992/xapian/commit/ac0a11f5d8ff975fad1e96e63764eab9b04dfcfb

-Regards
-Aarsh
_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
aarsh shah | 11 Apr 2013 12:19
Picon

Added support for TfIdf to Omega


Hello guys,I have added code for tfidf to the weight.cc file in omega/ .

Here is the patch : -

https://github.com/aarshkshah1992/xapian/commit/5ff41a15f574e6780cc61e67e7f3da3d97ff4ec8

It compiles well and I think it'll work well.

Here's the link to the documentation file omegascript.rst where I've added tfidf.

https://github.com/aarshkshah1992/xapian/commit/9434ad15ad8b69691ad45f2d340450b3070f524e

Please let me know what you'll think. Also,I somehow seem to have messed my branch by trying to merge the tfidf branch with the master branch and so am not sending a pull request.I'll sort it out soon.

-Regards
-Aarsh




_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
张旭东 | 5 Apr 2013 15:42
Picon

Question about Project: Posting list encoding improvements

Dear all:

My name is Xudong Zhang, from Peking University. I'm very interested in the Project: Posting list encoding improvements, and I have had some experience on encoding, and so I want to be involved in the project.

About the projectI have a question to consult:

What's the main goal of the improvement of the speed for the encoding and decoding of posting lists? Is it OK to increase the encoding/decoding speed greatly with the slight loss of compression?

Below is about my personal information. I'm a master student from Peking University, Beijing, China. My research area is search engine and web mining, and I've been focusing on compression algorithms of posting lists in search engines since last February. I have proposed a novel decoding algorithm called SIMD-PFORDelta, which exploit SIMD instructions in PFORDelta. It is fast and can achieve good compression ratio. Recently, I also did some experiments to test the state-of-the-art and proposed encoding algorithms of posting lists. We used our own search engine called PARADISE and the datasets are GOV2(25 million docs) and ClueWeb09B(50 million docs). Below is some experimental results of docIDs’ gap sequence on ClueWeb09B, which is similar to Lemire's results[1] (The implementation of SIMD_pfor_delta is somewhat different to that in Lemire's paper):

 

Algorithm Name

Compression Speed(M/s)

Decompression Speed(M/s)

Compression Ratio

simple9

88.2749297

500.8305639

16.72%

simple16

51.29381855

488.0998879

15.55%

var_byte

492.9379145

519.1921354

26.02%

group_var_byte

243.7189985

501.3713526

31.76%

group_var_byte_incomp_unary

124.5200775

501.8188062

28.77%

group_var_byte_comp_unary

123.1249031

461.8294703

28.70%

SIMD_group_var_byte

240.296232

793.0340184

31.76%

SIMD_group_var_byte_incomp_unary

126.3003402

1325.49703

28.77%

SIMD_group_var_byte_comp_unary

126.3347572

1128.968014

28.70%

packed_binary

252.6257354

1151.000784

27.55%

SIMD_packed_binary

254.2433557

1507.316133

27.55%

pfor_delta

26.99031483

1026.717585

19.53%

SIMD_pfor_delta

27.23024567

1580.111396

19.53%

rice

68.74742448

79.16529964

15.77%

k_gamma

172.9461803

242.8787555

15.24%

gamma_opt

61.7750133

59.30571281

14.91%

SIMD__kgamma

174.1829249

262.0594328

15.24%


Thank you very much!

P.S. In D. Lemire's paper: Decoding billions of integers per second through vectorization(CORR'12), SIMD-PORDelta can outperforms VSEncoding w.r.t decoding speed, while remaining nearly the same compression ratio, namely, bits/int.

[1] D. Lemire et al. Decoding billions of integers per second through vectorization, CORR’12, http://arxiv.org/abs/1209.2137

Best wishes,

Xudong Zhang


--
---------------------------

Xudong Zhang
+86-13488695609
Team of Search Engine and Web Mining
School of Electronic Engineering  and Computer Science
Peking University, Beijing, 100871, P.R.China
_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
Marius Tibeica | 2 Apr 2013 15:37
Picon
Gravatar

GSOC - Posting list encoding improvements

> It would be better to have this discussion on the mailing list really.

Done! Tried to get as much information in here as possible.

>
> On Mon, Apr 01, 2013 at 04:22:09PM +0300, Marius Tibeica wrote:
> > I've looked through the code of BitReader and BitWriter.
> >
> > From what I understood, we could keep state and allow repeated calls, but
> > we can't get the list in order without extra O(n) memory (how it is now)
> > unless we make that an O(log n) operation for each item, meaning O(n*log n)
> > total.
>
> I don't think we want to make the decode time complexity worse.  The
> current space complexity might be unavoidable (though actually I think
> we only need O(log(n)) space - see below), but I think the key win to
> be had is to avoid doing a full decode just to get the first few entries
> (in terms of time, the first entry is O(1) to decode, the second
> O(log(n)), but the third isn't an additive O(log(n)) on top of that).
>
> > To do this we must:
> >   - initialize the interpolative decode
> >       - keep the state information from the BitReader(idx, n_bits, acc) for
> > future requests - probably returned as a struct?
>
> Yes, I think you'd want to wrap up the state in a class.
>
> >       - make a full decode (without keeping the termpos array) to update
> > the BitWriter so that other future decodings are properly done (initial
> > O(n) when creating the state).
>
> We're trying to avoid the work of doing the full decode though, so doing
> it first and throwing the results away seems a step backwards.  We are
> unlikely to be able to store this data in a persistent enough way for
> this to be useful (for a phrase search, we potentially decode
> #terms_in_phrase*#docs_matching_all_those_terms position lists to check
> phrase matches, though we try hard to avoid doing that many).
>
> >   - for each requested item of the list, use the stored state. The code
> > will be very similar to decode_interpolative, but only calculate the "half"
> > where the item is each time (hence the log n).
> >
> > Does this sound like a good approach?
>
> My thinking is that the code runs much like it currently does, except
> that the state is in an object (rather than on the stack from recursive
> calls to decode_interpolative and in the passed in pos vector), and
> that it pauses its work when it knows the next position to return.
>
> I think we may actually only need to store O(log(n)) data to keep the
> state - the depth of recursion looks to be O(log(n)), and if you
> adjust the decode_interpolative like so then the pos vector doesn't
> seem to be needed:
>
> void
> BitReader::decode_interpolative(int j, int k,
>                                 Xapian::termpos pos_j,
>                                 Xapian::termpos pos_k)
> {
>     while (j + 1 < k) {
>         const size_t mid = (j + k) / 2;
>         // Decode one out of (pos_k - pos_j + 1) values
>         // (less some at either end because we must be able to fit
>         // all the intervening pos in)
>         const size_t outof = pos_k - pos_j + j - k + 1;
>         Xapian::termpos pos_mid = decode(outof) + (pos_j + mid - j);
>         decode_interpolative(j, mid, pos_j, pos_mid);
>         j = mid;
>     }
> }
>
> I believe if you check mid before the recursive call to
> decode_interpolative and k after it, then you'll see all the elements in
> order (plus out of order repeats before and/or after the useful
> occurrence - it looks like if run to completion, you end up considering
> about n*2 values to find the n elements, which is OK).
>

Guess I understood the request wrong.
Using a binary search tree in order traversal modified version we get
this solution, which is pretty close to what you said:

extra items in the BitReader (or a derived class)
stack<DIState> di_stack; //will get to a maximum of O(logn) space size indeed
DIState di_current;

struct DIState {
  DIState() { set(0, 0, 0, 0); }
  DIState(int p_j, int p_k, Xapian::termpos pos_j, Xapian::termpos
pos_k) { set(j, k, pos_j, pos_k); }
  void set(int j, int k) { this.j = j; this.k = k;, this.pos_j =
pos_j; this.pos_k = pos_k; }
  bool is_nextterm() { return j + 1 < k; };
  bool is_uninitialized() { return j == 0 && k == 0 && pos_j == 0 &&
pos_k == 0; }
  Xapian::termpos pos_j, pos_k;
  int j, k;
};

// after calling this method, decode_interpolative_next will
sequentially return the termpos from j + 1 to k - 1
// on all the decoding methods, including decode_interpolative, we
must also check if a decode_interpolative has been started and not
finished, and signal an error or do all the reads from the buffer.
Xapian::termpos
BitReader::decode_interpolative(int j, int k, Xapian::termpos pos_j,
Xapian::termpos pos_k)
{
    di_current(j, k, pos_j, pos_k);
}

Xapian::termpos
BitReader::decode_interpolative_next()
{
    if (di_current.is_uninitialized()) {
        //signal an error
    }
    while (!di_stack.empty() || di_current.is_nextterm()) {
        if (di_current.is_nextterm()) {
            di_current.push(di_current);
            int mid = (j + k) / 2;
            int pos_mid = decode(pos[k] - pos[j] + j - k + 1) +
(pos[j] + mid - j);
            di_current.set(di_current.j, mid, di_current.pos_j, pos_mid);
        } else {
            Xapian::termpos pos_ret = di_current.pos_k;
            if (di_stack.empty() && !di_current.is_nextterm()) {
                // signal an error. we have called this method too many times.
            }
            di_current = di_stack.top();
            di_stack.pop();
            int mid = (j + k) / 2;
            int pos_mid = pos_ret;
            di_current.set(mid, di_current.k, pos_mid, di_current.pos_k);
            return pos_ret;
        }
    }
}

I may be able to improve this even further by keeping less information
in each DIState (if I can get it otherwise). Will focus on this if the
approach is ok.
Hope the code is self explanatory. Will add more comments if necessary.

> Cheers,
>     Olly

> > > One thing comes to mind - we currently have an interpolative coding
> > > encoder and decoder, which are used for positional information.  It
> > > would probably be good to have these as one option for posting lists
> > > - for something like a boolean filter term, interpolative coding
> > > is very compact, especially for dense cases.
> > >
> > > However our current decoder can only do a full unpack, so you need
> > > to unpack to a temporary vector and then iterate that.  It would
> > > be better if it could keep state and allow repeated calls to iterate
> > > through the list, returning one entry at a time.  This would likely
> > > benefit phrase searches as well as helping when used for postlists.
> > >
> > > The code is in common/bitstream.{cc,h} if you want to take a look.
Greg Banks | 2 Apr 2013 11:01
Gravatar

FastMail's search using Xapian

G'day,

I thought some folks on this list would be interested to know that the new full text email search feature rolled out today by FastMail


is powered by Xapian. All the code for indexing and searching emails is Open Source and in Fastmail's version of the Cyrus IMAP server.


That isn't the most recent code but its pretty close.
We expect that code will be merged into a future release of Cyrus (but not the next release which is imminent).

We evaluated Solr, Sphinx and Xapian. Sphinx was the fastest but had a number of deployment issues due to our unique requirements.
I must say that the Xapian learning curve put us off for a while, but after gaining experience with other search technologies that became less of a problem.
In the end the ability to run Xapian as a library linked with our C application, and thus to control startup times and memory usage more easily than with a standalone server, was the deciding factor.

Anyway, thanks for a great piece of software, and keep up the good work.

Greg.
_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
aarsh shah | 1 Apr 2013 18:30
Picon

Doubt about GSOC proposal

Hello guys.I have begun work on writing my proposal as discussed on IRC and will submit a draft in a couple of days so that I can make it detailed and refine it after getting feedback.

I wanted to know about the number of weeks a proposal should cover and also,is it okay if I set aside a buffer week somewhere in the middle of the summer  for something like cleaning the code,working on the feedback  and getting it merged(merging whatever code Ive completed till then.) ?

-Regards
-Aarsh
_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
aarsh shah | 27 Mar 2013 15:14
Picon

Need help as Pl2 tests not performing as expected

Hello guys. I just ran the updated tests for PL2 and they are not giving the mset order I expect.Now,the thing is, dfr's behavior is a bit hard to predict and so even if I expect a particular order ,it may give another order and still be correct.So,the only way to write correct tests for PL2 is to manually calculate the weight of the documents to decide the expected order.For that,I need to have a look at the statistics  stored in the database for my tests.I ran the tests and after it failed ,I tried "delve db" at the terminal,but it says that it can't open the database.Please can someone help me this.

Also,the max possible  statistic of the Mset is less than max attained and so ,Ill have to have a look at the code again.This may take some time,as PL2 has a very complex formula and it's a bit hard to understand what's happening where.

-Regards
-Aarsh

On Wed, Mar 27, 2013 at 6:23 PM, aarsh shah <aarshkshah1992 <at> gmail.com> wrote:
Hello guys.I just  realized that Ive not set the weighting scheme to PL2 in the tests for PL2 and so a default weighting scheme of BM25 is used. I am extremely sorry for this and am updating the tests by setting the weighting scheme to PL2.

-Regards
-Aarsh

_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
aarsh shah | 27 Mar 2013 13:53
Picon

Major Mistake in pL2 tests in the pull request

Hello guys.I just  realized that Ive not set the weighting scheme to PL2 in the tests for PL2 and so a default weighting scheme of BM25 is used. I am extremely sorry for this and am updating the tests by setting the weighting scheme to PL2.

-Regards
-Aarsh

_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
aarsh shah | 26 Mar 2013 16:59
Picon

Merging of the TfIdf patch

Hello Guys. I have updated the code,tests,documentation,makefile entries and the registry entry of the TfIdf patch as per the feedback.Please do let me know if any additional changes are required before the  patch can be merged,

-Regards
-Aarsh

On Sun, Mar 3, 2013 at 2:50 PM, aarsh shah <aarshkshah1992 <at> gmail.com> wrote:
Hello guys.I have sent a pull request for the code and tests of the Tf-Idf weighting scheme.
Please do let me know if any changes are required.Meanwhile,Ill begin working on implementing normalizations which require additional statistics and on the DFR schemes. 

https://github.com/xapian/xapian/pull/6

On Tue, Feb 26, 2013 at 5:30 PM, <xapian-devel-request <at> lists.xapian.org> wrote:
Send Xapian-devel mailing list submissions to
        xapian-devel <at> lists.xapian.org

To subscribe or unsubscribe via the World Wide Web, visit
        http://lists.xapian.org/mailman/listinfo/xapian-devel
or, via email, send a message with subject or body 'help' to
        xapian-devel-request <at> lists.xapian.org

You can reach the person managing the list at
        xapian-devel-owner <at> lists.xapian.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Xapian-devel digest..."


Today's Topics:

   1. Sent a pull request for the Tf-Idf Weighting scheme (aarsh shah)


----------------------------------------------------------------------

Message: 1
Date: Tue, 26 Feb 2013 02:21:52 +0530
From: aarsh shah <aarshkshah1992 <at> gmail.com>
Subject: [Xapian-devel] Sent a pull request for the Tf-Idf Weighting
        scheme
To: Xapian Development <xapian-devel <at> lists.xapian.org>
Message-ID:
        <CABz8NmSN0E-P54Sp9Rr4VWbSbYw+ryv5mOpwkKaxHBtf1hJdjg <at> mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hello guys :) I have sent a pull request for the Tf-Idf Weighting scheme
incorporating as many normalizations as I could with the help of statistics
currently available from Xapian::Weight . Please let me know what you'll
think about it.

I used the weighting scheme in a simple searcher and it did a fine job with
it. I have no experience with writing tests for features like this.Please
give me some guidance on how to write a test for a weighing scheme .I have
updated some documentation to incorporate the weighting scheme .Please do
let me know if I need to add more documentation for this scheme.

I don't know why but inspite of committing this patch on a separate branch
, it still contains commits of other branches and so the pull request I
have sent  also shows many previous commits.I searched on the net but still
can't understand why this is happening.Please can someone help with that ?

The commits for the TfIdf scheme are dated 25 February in the pull request.
A big thank you to the community for all their help. :)

-Regards
-Aarsh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20130226/83d6ee35/attachment.html>

------------------------------

_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel


End of Xapian-devel Digest, Vol 94, Issue 10
********************************************


_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
aarsh shah | 25 Mar 2013 16:48
Picon

Added feature tests to the PL2 pull request

Hello guys.I have added various tests to the PL2 pull request.They are working fine. Have also added PL2 to the registry and to the java and csharp makefiles.Please do let me know what you'll think.Other than the collection frequency problem discussed on IRC, it is ready.Am now beginning work on adding code and tests for DPH to the same branch.

-Regards
-Aarsh

_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel
sreepriya | 24 Mar 2013 03:43
Picon
Gravatar

GSoC aspirant interested in Projects related to Testing

Hi,
I am a F/OSS enthusiasts interested in applying for GSoC this time. I am currently doing my third year of bachelors in Computer Science and Engineering. I have done some open source contributions till now. I am familiar working with large code bases and version control systems. I found the projects related to testing interesting and would like get started with it and I am comfortable working with C++. Can someone help me with the project ?

--
Sreepriya C
India
priyachalakkal.wordpress.com

_______________________________________________
Xapian-devel mailing list
Xapian-devel <at> lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-devel

Gmane