Olly Betts | 14 Sep 17:55

Re: Tarball size (was Re: [Xapian-commits] 7916: trunk/xapian-core/ trunk/xapian-core/docs/)

On Thu, Mar 08, 2007 at 05:29:55AM +0000, Olly Betts wrote:
> If I take a xapian-core snapshot (new or old) and just un-tar and re-tar
> the *same* files, it roughly halves in size.  It doesn't look like "make
> dist" does anything stupid, and if I reproduce the options it seems to
> be using, I still get similar reductions in size.

Replying to an old message, but I've just found the explanation - by
default automake generates rules to build tar-v7 format archives.  If
you use tar-ustar instead then you get much smaller archives.

We now need tar-ustar anyway, as a few filenames are more than 100
characters long.  The automake documentation says ustar should be almost
universally supported now, and Solaris 9's tar supports it, which
is a good sign.  If anyone is unable to unpack these archives with their native
tar, they can download GNU tar with less bandwidth than they saved
downloading the smaller xapian-core tarball!

Cheers,
    Olly

Document clustering module?

Hi,

I am implementing some document clustering algorithms in xapian-core.
I would like to know whether this kind of module would be considered
for inclusion in the core release. Or is there already a document
clustering module that just has not been open-sourced yet?

Best,
Yung-chung Lin
Olly Betts | 16 Sep 13:53

Re: Document clustering module?

On Sun, Sep 16, 2007 at 07:27:34PM +0800, Yung-chung Lin wrote:
> I am implementing some document clustering algorithms in the xapian
> core. I would like to know if this kind of module will be considered
> to be incorporated into the core release.

Yes - I think it fits with xapian-core's role, so the issues are things
like scalability, maintainability, API consistency, etc.  The "HACKING"
document in xapian-core has some tips for contributors.

> Or is there already some document clustering module that is just not
> open-sourced yet?

Not that I'm aware of.

Cheers,
    Olly

Re: Document clustering module?

Hi,

The attached file is my current public clustering interface. I think
it would be easier to have discussions with a header file present.
My clustering module is intended to cluster the documents in an MSet, and
it can enhance query expansion. Clustering is done entirely in memory.
I am not sure whether clustering all the documents in a database is
necessary, since it involves a huge amount of computation. In-memory
clustering of retrieved documents is easier, and I think it is also
useful.

DSet, in the header file, stands for one cluster of documents and
MultiDSet stands for clusters of documents.

I am using a standalone similarity function,
'calculate_doc_similarity()', which is overridable, so I don't use
Xapian's weighting schemes to calculate weights. (Partly because I
have not read through the Xapian source code yet.) The similarity measure
is based on the vector space model, and API users can simply supply their
own document similarity function. I am not sure if this
is an optimal design. Maybe putting the similarity function into a
class would be even better. It needs discussion.

Now, I am using MultiDSet to store documents. I am wondering if it
would be better if it returned multiple MSets (MultiMset), but the design
would be different and more complicated.

I have read the coding styles in HACKING, so I believe my coding style
would be OK. The issues would be on scalability and maintainability.

Olly Betts | 16 Sep 17:17

Re: Document clustering module?

[There's no need to Cc: me on list replies]

On Sun, Sep 16, 2007 at 08:26:05PM +0800, Yung-chung Lin wrote:
> The attached file is my current public clustering interface. I think
> it would be easier to have discussions with a header file present.

Good idea.

> DSet, in the header file, stands for one cluster of documents and
> MultiDSet stands for clusters of documents.

Returning a vector of vectors by value seems suboptimal.

Simply using a typedef of a vector is problematic too - existing Xapian
classes are either reference counted handles, or have very few members,
so users can expect that copying them is cheap.
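
To illustrate the point about cheap copies, here is a minimal sketch of a
handle-style class. ClusterSet, DocId and all member names are hypothetical,
and std::shared_ptr stands in for Xapian's internal reference-counted
pointer; this is not Xapian's actual implementation.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical document id type for this sketch.
typedef unsigned DocId;

// A handle that shares its data: copying it copies one pointer and bumps a
// reference count rather than duplicating every cluster.
class ClusterSet {
    struct Internal {
        std::vector<std::vector<DocId> > clusters;
    };
    std::shared_ptr<Internal> internal;

  public:
    ClusterSet() : internal(std::make_shared<Internal>()) {}

    // Append a cluster; the change is visible through every copy of the handle.
    void add_cluster(const std::vector<DocId>& docs) {
        internal->clusters.push_back(docs);
    }

    // Number of clusters held.
    std::size_t size() const { return internal->clusters.size(); }

    // Read-only access to one cluster.
    const std::vector<DocId>& operator[](std::size_t i) const {
        return internal->clusters[i];
    }

    // Number of handles currently sharing the data (for demonstration only).
    long handle_count() const { return internal.use_count(); }
};
```

Returning a ClusterSet by value then costs a pointer copy, whereas a plain
typedef of a vector of vectors would copy every element.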

> I am using a standalone similarity function
> 'calculate_doc_similarity()' which is overridable.

Unfortunately, you can't usefully put virtual functions on classes which
use RefCntPtr - if you subclass, you're only subclassing the "pointer"
bit, so Xapian won't be able to call back to the overridden method.
Bug#186 is relevant (I had some further thoughts about how we could
address this but I don't have a full solution yet):

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=186
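
The problem can be sketched roughly as follows. All class names here are
hypothetical and std::shared_ptr stands in for RefCntPtr; the only point is
that a handle kept by value loses any subclass override.

```cpp
#include <memory>

// Hypothetical handle class: the real state lives in Internal.
class Similarity {
    struct Internal { double threshold; };
    std::shared_ptr<Internal> internal;

  public:
    Similarity() : internal(std::make_shared<Internal>()) {
        internal->threshold = 0.5;
    }
    virtual ~Similarity() {}

    // Default behaviour, which a subclass tries to override.
    virtual double calculate() const { return 0.0; }
};

// A user subclass that overrides the virtual method.
class MySimilarity : public Similarity {
  public:
    double calculate() const { return 1.0; }
};

// Hypothetical library class that keeps its own copy of the handle.
class Clusterer {
    Similarity sim;  // stored by value: only the base "pointer" part is kept

  public:
    // Assigning from a MySimilarity slices it down to a plain Similarity.
    void set_similarity(const Similarity& s) { sim = s; }

    // Always reaches Similarity::calculate(), never the override.
    double run() const { return sim.calculate(); }
};
```

Passing a MySimilarity to set_similarity() and then calling run() returns the
base class result, not the overridden one.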

> Maybe putting the similarity function into a class would be even
> better. It needs discussion.

Re: Document clustering module?

> > Maybe putting the similarity function into a class would be even
> > better. It needs discussion.
>
> I think that is probably the answer.

And what is your opinion of using Xapian::Weight to calculate document
similarity?
I have not read through the code yet, but it seems heavyweight for this
use.

>
> > Now, I am using MultiDSet to store documents. I am thinking if it
> > would be better if it returns multiple MSets, MultiMset, but the design
> > will be different and more complicated.
>
> I think I need to mull over how this would all be used.  Reusing MSet
> would be nice if it's a good fit, since adding more API classes tends to
> make it harder to learn the API, so it's good if it can be avoided.  But
> forcing reuse where something isn't a natural fit would be worse.
>

I just gave it some thought, and my simple and non-intrusive idea is to
specify a clustering algorithm when using Xapian::Enquire and to
associate each MSetItem with a cluster id, which would look like:

  Enquire enq;
  ClusterSingleLinkage cluster_algorithm;
  enq.set_clustering_method(cluster_algorithm);
  MSet matches = enq.get_mset(1, 10);
  cout << matches.get_cluster_count() << endl;

Olly Betts | 16 Sep 20:13

Re: Document clustering module?

On Sun, Sep 16, 2007 at 11:52:20PM +0800, Yung-chung Lin wrote:
> And what is your opinion of using Xapian::Weight to calculate document
> similarity?

Xapian::Weight is set up to score a single document by adding scores
from a set of terms (plus an optional contribution which depends only on
the document length), whereas here we want a score from a pair of
documents.  So I think you'd have to convert one of the documents to a
list of all the terms in it, which seems artificial.

And it seems legitimate to allow clustering using document values (e.g.
you might store geographical coordinates in a document value and cluster
by location), which doesn't fit with Xapian::Weight.

So I think a class which provides a similarity measure given two
Xapian::Document objects is probably the answer.
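
A minimal sketch of what such a class might look like, assuming a plain
term-frequency map as a stand-in for Xapian::Document. DocSim, DocSimCosine
and TermVector are hypothetical names, and cosine similarity is just one
possible vector space measure.

```cpp
#include <cmath>
#include <map>
#include <string>

// Stand-in for Xapian::Document: term -> within-document frequency.
typedef std::map<std::string, unsigned> TermVector;

// Hypothetical abstract interface: a similarity measure over two documents.
class DocSim {
  public:
    virtual ~DocSim() {}
    virtual double similarity(const TermVector& a,
                              const TermVector& b) const = 0;
};

// Cosine similarity between the two term-frequency vectors.
class DocSimCosine : public DocSim {
  public:
    double similarity(const TermVector& a, const TermVector& b) const {
        double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
        for (TermVector::const_iterator i = a.begin(); i != a.end(); ++i) {
            norm_a += double(i->second) * i->second;
            // Only terms present in both documents contribute to the dot product.
            TermVector::const_iterator j = b.find(i->first);
            if (j != b.end()) dot += double(i->second) * j->second;
        }
        for (TermVector::const_iterator j = b.begin(); j != b.end(); ++j) {
            norm_b += double(j->second) * j->second;
        }
        if (norm_a == 0.0 || norm_b == 0.0) return 0.0;  // empty document
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }
};
```

A subclass working on document values instead (for example, distance between
stored coordinates) would fit the same interface.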

> > > Now, I am using MultiDSet to store documents. I am thinking if it
> > > would be better if it returns multiple MSets, MultiMset, but the design
> > > will be different and more complicated.
> >
> > I think I need to mull over how this would all be used.  Reusing MSet
> > would be nice if it's a good fit, since adding more API classes tends to
> > make it harder to learn the API, so it's good if it can be avoided.  But
> > forcing reuse where something isn't a natural fit would be worse.
> 
> I just gave it a thought and my simple and non-intrusive idea is to
> specify clustering algorithm when using Xapian::Enquire and to
> associate each MSetItem with a cluster id, which would resemble:
> 

Re: Document clustering module?

> > I just gave it a thought and my simple and non-intrusive idea is to
> > specify clustering algorithm when using Xapian::Enquire and to
> > associate each MSetItem with a cluster id, which would resemble:
> >
> >   Enquire enq;
> >   ClusterSingleLinkage cluster_algorithm;
> >   enq.set_clustering_method(cluster_algorithm);
> >   MSet matches = enq.get_mset(1, 10);
> >   cout << matches.get_cluster_count() << endl;
> >   for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
> >       cout << "Document " << *miter << " is in cluster "
> >               << miter->get_cluster_id() << endl;
> >   }
> >
> > And let API users do what they want to do with the clusters.
>
> Yes, that seems a very nice approach.  It also more naturally allows the
> possibility of using document similarity to eliminate near-duplicates -
> to do that efficiently you want to do it as matches are generated so
> that you can stop when you have enough in the MSet.
>
> It wouldn't allow generating of different clusters of the same results
> (without rerunning the search) but that doesn't seem like it's likely to
> be an annoying limitation.

Calling cluster_algorithm.cluster_mset(matches) manually could
re-cluster the matches, and you could also choose a different clustering
algorithm at that point. What about this?

Best,

Richard Boulton | 17 Sep 11:32

Re: Document clustering module?

Olly Betts wrote:
>>   Enquire enq;
>>   ClusterSingleLinkage cluster_algorithm;
>>   enq.set_clustering_method(cluster_algorithm);
>>   MSet matches = enq.get_mset(1, 10);
>>   cout << matches.get_cluster_count() << endl;
>>   for (MSetIterator miter = matches.begin(); miter != matches.end(); ++miter) {
>>       cout << "Document " << *miter << " is in cluster "
>>               << miter->get_cluster_id() << endl;
>>   }
>>
>> And let API users do what they want to do with the clusters.
> 
> Yes, that seems a very nice approach.  It also more naturally allows the
> possibility of using document similarity to eliminate near-duplicates -
> to do that efficiently you want to do it as matches are generated so
> that you can stop when you have enough in the MSet.
> 
> It wouldn't allow generating of different clusters of the same results
> (without rerunning the search) but that doesn't seem like it's likely to
> be an annoying limitation.

We've also had the idea of extending the collapse mechanism to group by 
a value (instead of just returning the top document in a collapse group, 
as it currently does).  This kind of interface would allow that to be 
represented, too.
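
As a rough illustration of grouping by a value rather than keeping only the
top document, here is a sketch. Match and group_by_key are hypothetical names
and are not part of Xapian's collapse API.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical match: a document id plus the value used as the collapse key.
struct Match {
    unsigned docid;
    std::string collapse_key;
};

// Group matches by collapse key, preserving rank order within each group.
// With ranked input, the first entry of each group is the top document,
// i.e. the only one a plain collapse would keep.
std::map<std::string, std::vector<unsigned> >
group_by_key(const std::vector<Match>& ranked) {
    std::map<std::string, std::vector<unsigned> > groups;
    for (std::vector<Match>::const_iterator m = ranked.begin();
         m != ranked.end(); ++m) {
        groups[m->collapse_key].push_back(m->docid);
    }
    return groups;
}
```

Each key here plays the role of a cluster id, so the same per-item id
interface could expose these groups.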

There would need to be some way to get a list of the cluster ids 
allocated for a given mset, and probably also a way to get further 
information on a cluster - some clustering algorithms allow a name to be 

Richard Boulton | 17 Sep 11:37

Re: Document clustering module?

Richard Boulton wrote:
> There would need to be some way to get a list of the cluster ids 
> allocated for a given mset

I meant to say - it would be useful if there was a way to get this; you 
could always iterate through the whole mset to get this list, of course.
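
A minimal sketch of that fallback, assuming each match carries the proposed
cluster id. ClusteredItem and cluster_ids are hypothetical names, with a plain
vector standing in for the MSet.

```cpp
#include <set>
#include <vector>

// Hypothetical per-match record carrying the proposed cluster id.
struct ClusteredItem {
    unsigned docid;
    unsigned cluster_id;
};

// Walk the whole result list once and collect the distinct cluster ids.
std::set<unsigned> cluster_ids(const std::vector<ClusteredItem>& items) {
    std::set<unsigned> ids;
    for (std::vector<ClusteredItem>::const_iterator i = items.begin();
         i != items.end(); ++i) {
        ids.insert(i->cluster_id);
    }
    return ids;
}
```

This is linear in the size of the result list, which is why a direct accessor
would still be a convenience rather than a necessity.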

-- 
Richard