Radim Rehurek | 3 Feb 10:29 2012

[gensim:904] Re: LDA visualization questions

Hello enyun (cc Skipper and the mailing list), 

Skipper and I had a conversation about TMVE some time ago, but I think it went off the mailing list. Grrrr :)
The conclusion was that TMVE had serious bugs in its implementation of the similarity measures, so the
output was more or less random.

I contacted Allison (the author of TMVE) about this ~one year ago; she said she was working on it, but I
haven't heard from her since. Later, Skipper made some improvements to TMVE; check out his fork at
https://github.com/jseabold/tmve .

Skipper, how did you feed input to TMVE?

Best,
Radim

> hi radim,
> 
> I found gensim yesterday and really think it's a great tool after
> using it.
> 
> TMVE depends on two results: beta (about p(t|z)) and gamma (about
> p(z|d)),
> from: http://code.google.com/p/tmve/wiki/TMVE01
> 
> I tried gensim's LDA module, but its save() function only dumps
> binary-format data.
> >>> lda = models.ldamodel.LdaModel(corpus=mm, id2word=dict2, num_topics=10,
> update_every=1, chunksize=1000, passes=1)
> 
> I'm familiar with the gibbs sampling version of lda but not so much with
> [...]

Skipper Seabold | 3 Feb 15:15 2012

[gensim:905] Re: LDA visualization questions

2012/2/3 Radim Rehurek <RadimRehurek-9Vj9tDbzfuSlVyrhU4qvOw@public.gmane.org>:
> Hello enyun (cc Skipper and the mailing list),
>
> Skipper and I had a conversation about TMVE some time ago, but I think it went off the mailing list. Grrrr :)
> The conclusion was that TMVE had serious bugs in its implementation of the similarity measures, so the output was more or less random.
>
> I contacted Allison (the author of TMVE) about this ~one year ago; she said she was working on it, but I haven't heard from her since. Later, Skipper made some improvements to TMVE; check out his fork at https://github.com/jseabold/tmve .

I highly recommend using this fork instead of Allison's. Her implementation is loop-based and can take several hours to run, whereas mine is vectorized with numpy and takes several minutes. I've also started adding more distance functions to replace the curious choices made in the original code (I think there were mistakes, and IIRC this was acknowledged). There may still be mistakes, so you may want to read the code; I'd be happy to fix any you find if you file an issue on github.
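
For the curious, the speedup comes from computing all pairwise distances at once with numpy broadcasting instead of looping over topic pairs. Here is a minimal sketch of a vectorized Hellinger distance over rows of topic-word probabilities; an illustration of the approach, not the fork's actual code:

import numpy as np

def pairwise_hellinger(topics):
    # topics: (num_topics, num_terms) array; each row is a probability
    # distribution over terms
    roots = np.sqrt(topics)
    # broadcast to all (i, j) pairs of rows at once, no Python loops
    diffs = roots[:, np.newaxis, :] - roots[np.newaxis, :, :]
    # H(p, q) = ||sqrt(p) - sqrt(q)||_2 / sqrt(2)
    return np.sqrt((diffs ** 2).sum(axis=-1)) / np.sqrt(2)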

Aside: I've been playing with Hyde and Jinja2 templates a lot lately and may rewrite the whole project in the near-ish future.

>
> Skipper, how did you feed input to TMVE?
>

Hi Radim,

Here's what my code to prepare input for TMVE looks like; see below.

> Best,
> Radim
>
>
>> hi radim,
>>
>> I found gensim yesterday and really think it's a great tool after
>> using it.
>>
>> TMVE depends on two results: beta (about p(t|z)) and gamma (about
>> p(z|d)),
>> from: http://code.google.com/p/tmve/wiki/TMVE01
>>
>> I tried gensim's LDA module, but its save() function only dumps
>> binary-format data.
>> >>> lda = models.ldamodel.LdaModel(corpus=mm, id2word=dict2, num_topics=10,
>> update_every=1, chunksize=1000, passes=1)
>>
>> I'm familiar with the gibbs sampling version of lda but not so much with
>> the variational-methods-based onlineLDA version. (I'm trying to read that
>> paper.)
>> I think p(t|z) could be gotten from lda.show_topics(), and p(z|d)
>> has to be gotten from lda[mm].
>> Am I right? Are there any other methods?
>> Or do you have any advice on using TMVE with gensim's lda output?
>>

This is what I'm doing. Note that I'm running an older version of gensim that I've made some changes to, so I hope it still runs on current master. After I run the LDA, I pull out a 'final.beta' and 'final.gamma' (what Blei's lda-c names them) using the functions below. (It might be nice to attach these to the model, I dunno; they may already be available in an easier way.)

import itertools

import numpy as np

def get_gamma(lda, corpus):
    """
    Return gamma from a gensim LdaModel instance.

    Parameters
    ----------
    lda : LdaModel
        A fitted model.
    corpus : gensim corpus
        The iterable bag-of-words corpus used to fit the LDA model.

    Returns
    -------
    gamma : ndarray
        The per-document gamma, stacked into one array.
    """
    lda.VAR_MAXITER = 'est'  # where is this getting set to 100 in save?
    chunksize = lda.chunksize
    # walk the corpus in the same chunks the model uses internally
    chunker = itertools.groupby(enumerate(corpus),
                key=lambda (docno, doc): docno / chunksize)
    all_gamma = []
    for chunk_no, (key, group) in enumerate(chunker):
        chunk = np.asarray([np.asarray(doc) for _, doc in group])
        gamma, sstats = lda.inference(chunk)
        all_gamma.append(gamma)
    return np.vstack(all_gamma)

def blei_gamma(fname, gamma):
    """
    Write the gamma file in Blei's lda-c format (final.gamma).

    Parameters
    ----------
    fname : str
        Path of the file to save to.
    gamma : ndarray
        Array returned from get_gamma.
    """
    np.savetxt(fname, gamma, fmt='%5.10f')

def blei_beta(fname, lda):
    """
    Write log topic-word probabilities to a space-delimited file,
    like lda-c's final.beta.

    Parameters
    ----------
    fname : str
        Path of the file to save to.
    lda : LdaModel
        A fitted model.
    """
    # expElogbeta is exp(E[log beta]), so the log recovers E[log beta]
    logbeta = np.log(lda.expElogbeta)
    np.savetxt(fname, logbeta, fmt='%5.10f')

blei_beta('final.beta', lda)
gamma = get_gamma(lda, my_corpus)
blei_gamma('final.gamma', gamma)
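
(If you want proper p(z|d) values rather than lda-c's raw gamma, normalize each row, e.g. gamma / gamma.sum(axis=1)[:, np.newaxis].)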

Feel free to include this code anywhere you want, or to offer suggestions for improvement.

Radim Rehurek | 3 Feb 18:33 2012

Re: [gensim:906] Re: LDA visualization questions

Thanks a lot Skipper, much appreciated!

If people start using TMVE, I can include better support for it in gensim. In fact, that was the original idea
when TMVE came out, but it seemed to have entered a zombie state :(

-rr

> ------------ Original message ------------
> From: Skipper Seabold <jsseabold@...>
> Subject: [gensim:905] Re: LDA visualization questions
> Date: 03.2.2012 15:15:50
> ----------------------------------------
> [...]

Chaminga Dissanayake | 6 Feb 11:54 2012

[gensim:908] How to evaluate short textual responses using LSA?

How do I evaluate short textual responses using gensim?

I will input documents of human-marked answers, and when a user enters a new answer it should display a stable mark using LSA.

Thomas Markus | 6 Feb 12:16 2012

Re: [gensim:908] How to evaluate short textual responses using LSA?

On 02/06/2012 11:54 AM, Chaminga Dissanayake wrote:
> How do I evaluate short textual responses using gensim?
>
> I will input documents of human-marked answers, and when a user enters a new
> answer it should display a stable mark using LSA.
I don't think gensim implements the sort of supervised learning you're
attempting.
A good start is probably the sLDA article from David Blei:

@article{blei2010supervised,
  title={Supervised topic models},
  author={Blei, D.M. and McAuliffe, J.D.},
  journal={arXiv preprint arXiv:1003.0783},
  year={2010}
}

An implementation of sLDA for gensim would be nice though :)

Cheers!
Thomas

PS: in case I misunderstood, please enlighten me.

Radim | 6 Feb 21:58 2012

[gensim:909] Re: How to evaluate short textual responses using LSA?

Hello Chaminga,

On Feb 6, 11:54 am, Chaminga Dissanayake <chaminga...@...>
wrote:
> How do I evaluate short textual responses using gensim?
>
> I will input documents of human-marked answers, and when a user enters a new
> answer it should display a stable mark using LSA.

What is a "stable mark"? I'm not sure I understand what you're asking.

If you want to match similar documents, have a look at the similarity
tutorial at http://radimrehurek.com/gensim/tut3.html , or at Christian's
HTTP similarity server: http://groups.google.com/group/gensim/browse_thread/thread/e5022e7e5793d6e2/820b5f341c6cbecb

Gensim doesn't contain any supervised (classification) algorithms, so
as Thomas says, go for something else (such as scikit-learn).
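
If nearest-neighbour marking is enough for you, here is a minimal
sketch of what such an LSA pipeline could look like (the data and
variable names are made up for illustration):

from gensim import corpora, models, similarities

# hypothetical human-marked training answers and their marks
marked_answers = [
    "photosynthesis converts light energy into chemical energy".split(),
    "plants use sunlight to make their own food".split(),
    "the cell wall protects the plant cell".split(),
]
marks = [10, 8, 3]

dictionary = corpora.Dictionary(marked_answers)
bow = [dictionary.doc2bow(doc) for doc in marked_answers]
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bow])

# score a new answer with the mark of its most similar marked answer
new_answer = "plants turn light into chemical energy".split()
sims = index[lsi[dictionary.doc2bow(new_answer)]]
best = sims.argmax()
print("suggested mark: %s (most similar to answer %d)" % (marks[best], best))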

Best,
Radim

Radim | 6 Feb 22:10 2012

[gensim:910] Re: feeding lsi topics into a search engine.


On Jan 30, 3:48 pm, PaulR <p...@...> wrote:
> So the basic idea is to use some kind of clustering approach. Why
> cluster on the LSI (or LDA) model rather than the tfidf or bow?
> Because the lower dimensionality makes it more feasible? And because
> your topics have some structure that's not immediately apparent from
> the bow or tfidf?

Yes.

>
> In the case of sticking things into lucene do we actually need to
> learn any clusters? I can sort of think in two dimensions (or even
> three). We could divide up the unit sphere into a few coarse-grained
> regions, and then some finer regions within each of those. At indexing
> time we add terms for the regions where our document vector would poke
> through the unit sphere. Things that are cosine-close will be pointing
> in the same direction, so a compound query on those terms (coarse-
> grained first) should find similar things. We could even have
> overlapping regions to deal with the problem of things being just
> either side of a region boundary - or alternatively include scaled
> down terms for adjacent regions in the query?

I didn't really understand that :) but similarity search in high
dimensions is a hard problem (see the "curse of dimensionality").
Efficient approaches are based on data approximation (using lower
precision, so items take up less space and more of them fit in RAM,
allowing a quick initial filter pass), not data clustering (various
*-trees). It's largely about lowering the constant factors of a linear
scan.

Of course, if you find a clever hack that works in Lucene, reasonably
well, most of the time... you're probably better off using that rather
than plowing through the academic literature :)
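
For what it's worth, Paul's sphere-region idea is essentially random
hyperplane LSH (locality-sensitive hashing for cosine similarity). A
minimal sketch of how one could turn a dense LSI vector into an
indexable "region" token; all the names here are made up for
illustration:

import numpy as np

def region_term(vec, hyperplanes):
    # the sign pattern of vec against random hyperplanes picks a "region"
    # of the unit sphere; cosine-close vectors tend to share the pattern
    bits = (hyperplanes.dot(vec) >= 0).astype(int)
    return 'lsh_' + ''.join(str(b) for b in bits)

rng = np.random.RandomState(42)
num_topics, num_bits = 400, 16   # 16 bits ~ 65k coarse regions
hyperplanes = rng.randn(num_bits, num_topics)

doc_vec = rng.randn(num_topics)  # stand-in for a dense LSI document vector
print(region_term(doc_vec, hyperplanes))  # token to index next to the doc

Overlapping or multi-granularity regions, as Paul suggests, would
correspond to indexing several such tokens from independent hyperplane
sets.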

Radim

>
> Maybe this doesn't work so well when you have ~400 dimensions?
>
> Sorry if this is vague - my maths is rather rusty.
>
> On Jan 30, 2:03 pm, Radim <radimrehu...@...> wrote:
>
> > Re. building help structures during indexing time -- do you mean on
> > Lucene end or gensim end? I guess you mean Lucene. In general, there
> > exist many data structures and algorithms for fast nearest-neighbour
> > querying. I archived some that are relevant to high-dimensional spaces
> > in this "wishlist item":https://github.com/piskvorky/gensim/issues/51
> > , but this is not directly relevant to Lucene (~search engine)
> > integration.
>
> > -rr

Shuang | 7 Feb 00:11 2012

Re: [gensim:911] Re: Online models - changing vocabulary?

Hi, Radim,
  First, let me say thank you for gensim; it's a great library. Coming back to Paul's question: I am a little confused. Are you saying the LSI model can't be used in an online fashion? Is it because num_terms in LsiModel has to be constant? Originally I thought that when new documents with new words come in, we just update the dictionary, and since each document is represented by a sparse vector anyway, adding a new word/dimension is OK. If I understand correctly now, the term-document matrix can only grow to have more documents, but not more terms.
  Regarding the hashing trick you mentioned, I am very interested in how it can solve the problem at hand. I assume we have to set num_terms == num_of_hashes; is this problematic when num_of_hashes is very big, which is normal to reduce collisions? If the hashing trick indeed works, I might be willing to implement it in gensim, as some type of dictionary class I assume.

Best,
Shuang

On Wed, Jan 25, 2012 at 2:27 AM, Radim <radimrehurek-9Vj9tDbzfuSlVyrhU4qvOw@public.gmane.org> wrote:
On Jan 24, 2:39 pm, PaulR <p...-sqPYmOVXOov10XsdtD+oqA@public.gmane.org> wrote:
> How does one deal with new dictionary terms when dealing with updates
> to an existing (LSI or LDA) model?
>
> For example lsimodel.add_documents, takes a corpus - but presumably it
> only makes sense for the documents within that corpus to be vectors in
> the same space as the documents already used for the model?

Correct. The LSI model only allows adding new observations at the
moment, not new features.


> I guess a few new terms make little difference to the model as a
> whole, but in the long run it might be that there are some terms that
> occur often in newer documents that were not present initially. Does
> this mean that it makes sense to periodically remake the model from
> scratch using all the document, possibly with a slightly different set
> of terms?

That is one option.

Another is to use the hashing trick: http://metaoptimize.com/qa/questions/6943/what-is-the-hashing-trick
. That will allow you to add new features dynamically, at the cost of
decreased traceability of what a feature actually is (hashing
collisions).

Best,
Radim
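
For concreteness, here is a minimal sketch of the kind of dictionary class I have in mind; hypothetical illustration code, not an existing gensim class:

import zlib

class HashingDictionary(object):
    """Map any token to a fixed range of feature ids by hashing."""
    def __init__(self, num_buckets=2**16):
        self.num_buckets = num_buckets  # == num_terms, fixed forever

    def doc2bow(self, tokens):
        counts = {}
        for token in tokens:
            # deterministic hash; colliding tokens share a feature id
            feature_id = zlib.crc32(token.encode('utf8')) % self.num_buckets
            counts[feature_id] = counts.get(feature_id, 0) + 1
        return sorted(counts.items())

hd = HashingDictionary()
print(hd.doc2bow("new words never grow the vocabulary".split()))

With this, num_terms stays fixed at num_buckets no matter how many unseen words arrive, which is what keeps the model updatable online; the price is that a feature id no longer maps back to a unique word.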

Chaminga Dissanayake | 7 Feb 03:51 2012

[gensim:913] Re: How to evaluate short textual responses using LSA?

Sorry, it should be "suitable" ;)

Chaminga Dissanayake | 7 Feb 05:44 2012

[gensim:913] Re: How to evaluate short textual responses using LSA?

*suitable mark

