Re: [gensim:906] Re: LDA visualization questions
Thanks a lot Skipper, much appreciated!
If people start using TMVE, I can include better support for it in gensim. In fact, that was the original idea
when it came out, except TMVE seemed to have entered a zombie state :(
-rr
> ------------ Original message ------------
> From: Skipper Seabold <jsseabold@...>
> Subject: [gensim:905] Re: LDA visualization questions
> Date: 03.2.2012 15:15:50
> ----------------------------------------
> 2012/2/3 Radim Rehurek <RadimRehurek@...>:
> > Hello enyun (cc Skipper and the mailing list),
> >
> > we had a conversation about TMVE some time ago with Skipper, but I think
> > it went off the mailing list. Grrrr :)
> > The conclusion there was that TMVE had serious bugs in its implementation
> > of the similarity measures, so the output was more or less random.
> >
> > I contacted Allison (the author of TMVE) about this ~one year ago. She
> > said she was working on it, but I haven't heard anything since. Later,
> > Skipper made some improvements to TMVE; check out his fork at
> > https://github.com/jseabold/tmve .
>
> I highly recommend using this fork instead of Allison's. Her implementation
> is loop-based and can take several hours to run, whereas mine is vectorized
> using numpy and can take several minutes. I've started to include there
> some more distance functions versus the curious choices that were made in
> the original code (I think there were mistakes and this was acknowledged
> IIRC). There may still be mistakes so you may want to read the code. I'd be
> happy to fix any that you may find if you file an issue on github.
>
> Aside: I've been playing with Hyde and Jinja2 templates a lot lately and
> may rewrite the whole project in the near-ish future.
>
> >
> > Skipper, how did you feed input to TMVE?
> >
>
> Hi Radim,
>
> Here's what my code to start TMVE looks like; see below.
>
> > Best,
> > Radim
> >
> >
> >> hi radim,
> >>
> >> I found gensim yesterday and, after using it, I really think it's a
> >> great tool.
> >>
> >> TMVE depends on two results : beta ( about p(t|z) ) and gamma ( about
> >> p(z|d)),
> >> from : http://code.google.com/p/tmve/wiki/TMVE01
> >>
> >> I tried gensim's LDA module, but its save() function only dumps data in
> >> binary format.
> >> >>> lda = models.ldamodel.LdaModel(corpus=mm, id2word=dict2,
> >> ...     num_topics=10, update_every=1, chunksize=1000, passes=1)
> >>
> >> I'm familiar with the Gibbs sampling version of LDA, but less so with
> >> the online LDA version based on variational methods. (I'm trying to
> >> read that paper.)
> >> I think p(t|z) could be obtained from lda.show_topics(), and p(z|d)
> >> would have to be obtained from lda[mm].
> >> Am I right? Are there any other methods?
> >> Or do you have any advice on using TMVE with gensim's LDA output?
> >>
>
> This is what I'm doing. Note, I'm running an older version of gensim that
> I've made some changes to, so I hope it still runs on current master. After
> I run the LDA, I pull out a 'final.beta' and a 'final.gamma' (what Blei's
> lda-c names them) by using these functions. (It might be nice to attach
> these to the model. I dunno. They may already be available in an easier way.)
>
> import itertools
>
> import numpy as np
>
>
> def get_gamma(lda, corpus):
>     """
>     Return gamma from a gensim LdaModel instance.
>
>     Parameters
>     ----------
>     lda : LdaModel
>         A fitted model.
>     corpus : gensim Corpus
>         An iterable Bag-of-Words Corpus used to fit the LDA model
>
>     Returns
>     -------
>     gamma : ndarray
>         An ndarray that contains gamma.
>     """
>     lda.VAR_MAXITER = 'est' # where is this getting set to 100 in save?
>     chunksize = lda.chunksize
>     # Group consecutive documents into chunks of `chunksize` documents.
>     # (A tuple-unpacking lambda is a syntax error on Python 3.)
>     chunker = itertools.groupby(enumerate(corpus),
>                                 key=lambda pair: pair[0] // chunksize)
>     all_gamma = []
>     for chunk_no, (key, group) in enumerate(chunker):
>         # inference() iterates over the chunk, so a plain list of
>         # documents is enough (and avoids ragged numpy arrays).
>         chunk = [doc for _, doc in group]
>         (gamma, sstats) = lda.inference(chunk)
>         all_gamma.append(gamma)
>     return np.vstack(all_gamma)
>
> def blei_gamma(fname, gamma):
>     """
>     Writes the gamma file in Blei's format.
>
>     Parameters
>     ----------
>     fname : str
>         Path to file to save to
>     gamma : ndarray
>         Numpy array returned from get_gamma
>     """
>     np.savetxt(fname, gamma, fmt='%5.10f')
>
> def blei_beta(fname, lda):
>     """
>     Write log probabilities to a space-delimited file as lda-c final.beta
>
>     Parameters
>     ----------
>     fname : str
>         Filename
>     lda : LdaModel
>         LdaModel instance
>     """
>     # Taking the log of expElogbeta gives back E[log beta].
>     log_beta = np.log(lda.expElogbeta)
>     np.savetxt(fname, log_beta, fmt='%5.10f')
>
> blei_beta('final.beta', my_fitted_lda)
> gamma = get_gamma(my_fitted_lda, my_corpus)
> blei_gamma('final.gamma', gamma)
>
> Feel free to include this code anywhere you want or to offer suggestions
> for improvement
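
For anyone puzzling over the chunking step in get_gamma above: here is a small standalone sketch of how itertools.groupby over enumerate(corpus), keyed on the document number integer-divided by chunksize, splits a corpus into consecutive chunks. The helper name iter_chunks and the toy corpus are made up for illustration and are not part of gensim.

```python
import itertools


def iter_chunks(corpus, chunksize):
    """Yield lists of at most `chunksize` consecutive documents.

    Hypothetical helper for illustration only, not a gensim function.
    """
    numbered = enumerate(corpus)
    # Documents 0..chunksize-1 get key 0, the next `chunksize` get key 1, etc.
    grouped = itertools.groupby(numbered, key=lambda pair: pair[0] // chunksize)
    for _, group in grouped:
        # Consume the group before moving on; groupby invalidates it afterwards.
        yield [doc for _, doc in group]


# Toy bag-of-words corpus: each document is a list of (word_id, count) pairs.
toy_corpus = [[(0, 1)], [(1, 2)], [(0, 1), (2, 1)], [(2, 3)], [(1, 1)]]
chunks = list(iter_chunks(toy_corpus, chunksize=2))
print([len(chunk) for chunk in chunks])  # [2, 2, 1]
```

The last chunk is simply shorter when the corpus size is not a multiple of chunksize, which is why get_gamma stacks the per-chunk results with np.vstack rather than assuming equal sizes.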
>
>
>
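
On the original p(z|d) question: assuming 'final.gamma' holds one row of variational Dirichlet parameters per document (as in lda-c), the expected topic proportions for a document are that row divided by its sum, since the mean of Dirichlet(gamma) is gamma / sum(gamma). A minimal dependency-free sketch follows; read_matrix and topic_proportions are hypothetical helper names and the numbers are made up.

```python
def read_matrix(path):
    """Read a whitespace-delimited matrix of floats, one row per line."""
    with open(path) as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]


def topic_proportions(gamma_rows):
    """Normalize each gamma row to sum to 1 (the mean of Dirichlet(gamma))."""
    proportions = []
    for row in gamma_rows:
        total = sum(row)
        proportions.append([value / total for value in row])
    return proportions


# Made-up gamma values for two documents and three topics.
example_gamma = [[0.5, 1.5, 2.0], [3.0, 0.5, 0.5]]
theta = topic_proportions(example_gamma)
print(theta[0])  # [0.125, 0.375, 0.5]
```

With a real file this would be topic_proportions(read_matrix('final.gamma')); with numpy the same normalization is gamma / gamma.sum(axis=1, keepdims=True).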