John Riley | 10 May 2013 04:28

[gensim:1894] Gensim now available on PiCloud as public environment

Hello,


I saw that Gensim has a section on distributed computing [1]. I just wanted to let you all know that PiCloud [2] supports gensim. While any user could always create a custom environment, and install the latest version themselves [2], we've decided to address the issue directly by introducing a publicly shared environment with gensim.

In short, specifying a job's environment as "/picloud/gensim" will use a public, up-to-date version of gensim. You can clone the environment to "freeze" the version, so that it doesn't change under you. Hope this helps!


While we aren't integrated into gensim to the point where "distributed=True" will use PiCloud, we're interested in hearing whether this is a desirable & in-demand feature.

Best Regards,
John

--
John Riley
PiCloud, Inc.

--
You received this message because you are subscribed to the Google Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit https://groups.google.com/groups/opt_out.
 
 
M. | 9 May 2013 04:51

[gensim:1893] Merge Step of Distributed LDA

I have been going through the distributed LDA code. It seems the E-step can be done in parallel, while the M-step has to gather the results of all the intermediate E-steps and then run on one computer. Is there any way to better parallelize the M-step? Something like the merge step in mergesort: after all the E-steps are done, the worker nodes pair off and M-step, then merge, and so on, bubbling up until a completed M-step is returned to the dispatcher?
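
The merge idea above can be sketched as a pairwise tree reduction. This is only an illustration, not gensim's implementation: it assumes each worker's E-step result is a vector of additive sufficient statistics, so that merging two results is element-wise addition. The names `merge` and `tree_reduce` are hypothetical.

```python
def merge(a, b):
    """Combine two workers' sufficient statistics by element-wise addition."""
    return [x + y for x, y in zip(a, b)]


def tree_reduce(stats):
    """Merge per-worker statistics pairwise, level by level, like mergesort's merge phase.

    Returns the merged statistics and the number of merge rounds used.
    """
    if not stats:
        raise ValueError("need at least one worker result")
    level = list(stats)
    rounds = 0
    while len(level) > 1:
        # Workers pair off; each pair could be merged on a different node.
        merged = [merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An unpaired worker carries its result up to the next level.
            merged.append(level[-1])
        level = merged
        rounds += 1
    return level[0], rounds


# Five hypothetical workers, each holding statistics for two parameters.
worker_stats = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
total, rounds = tree_reduce(worker_stats)
print(total, rounds)
```

With W workers the merging takes roughly log2(W) rounds instead of W - 1 sequential additions on the dispatcher. In this sketch the M-step itself would still run once, on the fully merged statistics.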

 
 
叶璐 | 8 May 2013 09:00

[gensim:1892] install problem

When I try to install scipy

I went to
/home/vi/build/scipy

sudo python setup.py install


error: library dfftpack has Fortran sources but no Fortran compiler found

 
 
Jason | 29 Apr 2013 21:58

[gensim:1872] extremely slow lda?

I've run LDA on a few different corpora with great results, but when
I run it on a large dataset, it is too slow. I think I might be doing
something wrong, though, as the wiki page says:
"Creating this LDA model of Wikipedia takes about 6 hours and 20
minutes on my laptop"
My corpus has 1.5 million documents, with a mean of 304.95
words per document and a standard deviation of 1314.35.
Processing this whole dataset, from creating the dictionary to running
the model, took 10 days on an AWS m1.large box (I think
equivalent to ~2 Intel cores).
My questions: am I doing something inherently wrong? I have not
tried the distributed version yet; are my times expected? Are my
modeling options set incorrectly? Is there anything else I can do to
speed this up? My main reason for wanting to speed this up is that I am
still testing and tweaking settings, and 10 days between
tests will be too painful.

I've included the relevant code.

import json
import re

import gensim
from gensim import corpora

def prepare_doc(text):
    # lowercase, then strip URLs and @mentions before applying the gensim filters
    text = text.lower()
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'@\w+', ' ', text, flags=re.MULTILINE)
    text = gensim.parsing.preprocessing.strip_tags(text)
    text = gensim.parsing.preprocessing.strip_punctuation(text)
    text = gensim.parsing.preprocessing.remove_stopwords(text)
    text = gensim.parsing.preprocessing.strip_short(text)
    text = gensim.parsing.preprocessing.strip_numeric(text)
    text = gensim.parsing.preprocessing.stem_text(text)
    return text

#corpus loader
class MyCorpus(object):
    def __init__(self, docs_file, dictionary):
        self.docs_file = docs_file
        self.dictionary = dictionary

    def __iter__(self):
        for line in open(self.docs_file):
            # assume there's one document per line, tokens separated by whitespace
            yield self.dictionary.doc2bow(json.loads(line)['doc'].split())

dictionary = corpora.Dictionary()
for line in open(docs_file):
    dictionary.add_documents([json.loads(line)['doc'].split()])
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

corpus = MyCorpus(docs_file, dictionary)
model = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary,
                                        num_topics=num_topics, chunksize=100)


Radim Řehůřek | 29 Apr 2013 15:55

[gensim:1870] ML & NLP consulting

Dear all,

I will be available again for consulting from next week on. I am
looking for interesting projects, in the natural language processing
and machine learning fields.

I have experience with system design & implementation in various
software areas (ad targeting, search engines, library systems,
plagiarism detection...). See my LinkedIn profile for more details.

At the same time, there are more tasks than I can handle, so if you
are a skilled developer interested in ML, feel free to contact me to
help share the load :)

Happy coding,
Radim


Jason | 22 Apr 2013 01:14

[gensim:1861] numbers as words not processing

I've used gensim successfully a few times, but in my most recent test
I wanted to use documents that are made of numbers only, for
example:
doc1: 1081 12 13
doc2: 12 5
With my code I am able to create a dictionary file; here is
some output from save_as_text:
15835   100009414       5
45862   100011801       9
12038   100013108       11
18947   10001562        5
30806   100017230       5
76358   100020413       45
50924   1000237428      6
73904   1000262514      333

I am able to create an LDA model and have it show_topics with
numbers, but when I go to classify documents for their topics, I
always get nothing back when I run
dictionary.doc2bow(doc), where doc is just a string like "100009414".

Is there something I need to change or fix to have this work for
numbers?
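
One likely cause, sketched below: `doc2bow` expects a list of tokens, not a raw string. The sketch uses a hypothetical stand-in, `doc2bow_sketch`, with a plain dict in place of the gensim dictionary, so it only mimics the lookup. Recent gensim versions raise a TypeError when given a single string rather than returning an empty result.

```python
from collections import Counter

# Hypothetical token -> id mapping, using two entries from the dictionary dump above.
token2id = {"100009414": 15835, "100011801": 45862}


def doc2bow_sketch(tokens):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# A raw string is iterated character by character, so no "token" matches.
as_string = doc2bow_sketch("100009414")

# Splitting first yields whole tokens that can be looked up.
as_tokens = doc2bow_sketch("100009414 100011801 100009414".split())

print(as_string)
print(as_tokens)
```

If this is the cause, the fix is to pass `doc.split()` (or whatever tokenization was used when building the dictionary) instead of `doc`.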


Peter Parente | 19 Apr 2013 17:46

[gensim:1857] Distributed, online LSI and persistence

Hi,


I have a question that is somewhat a follow-on to (https://groups.google.com/forum/?fromgroups=#!searchin/gensim/distributed$20online$20lsi/gensim/MTb--vhZ3a4/0TQiB-6U-C8J).

Say I build an LSI model initially in distributed mode, persist it to disk, and shut down my LSI cluster of workers + dispatcher. If I come back later, reload my model, and want to add some documents to it:

1. Will it work? Or once I shut down the cluster of workers, has the dispatcher lost some state that is necessary for incremental additions?
2. Do I have to continue using the LsiModel in distributed mode or can I switch the model I loaded from disk to do everything locally?

Thanks,
Pete



 
 
Jason | 11 Apr 2013 22:07

[gensim:1850] building fast dictionaries?

Is there a faster way to build dictionaries than with my current code?
I'm trying to find the right combination of options, so I am
constantly testing with different options, including
filter_extremes. I am working with 16k documents, and each document
has a few thousand words.

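    import json
    import os
    from gensim import corpora

    if os.path.isfile(dictionary_file):
        # load() is a class method that returns a new Dictionary
        dictionary = corpora.Dictionary.load(dictionary_file)
    else:
        dictionary = corpora.Dictionary()
        for line in open(docs_file):
            # add_documents() expects a list of documents, each a list of tokens
            dictionary.add_documents([json.loads(line)['doc'].split()])
        # save the unfiltered dictionary so other filter settings can be tried later
        dictionary.save(dictionary_file)
    dictionary.filter_extremes(no_below=5)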


Radim Řehůřek | 9 Apr 2013 19:38

[gensim:1849] Re: Online LDA not really online?

Hello Quentin (cross-posting to the mailing list),

> Hi Radim,
>
> You say that "LDA is not truly online, as the impact of later updates on the model gradually diminishes".
> But Hoffman's paper [1] defines the decay \rho_t as
>
>    \rho_t = (\tau_0 + t)^{- \kappa}
>
> As kappa is in (0.5, 1], I think that actually the impact of later updates is getting bigger and bigger
> compared to, say, the first ones.
>
> In other words, I feel like the model is indeed forgetting the old values of the model and is truly online.
>
> What do you think?

With increasing `t`, the impact `rho_t` gets smaller and smaller. Run
a few iterations of the formula to see for yourself :)
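
A quick way to run those iterations, as a sketch: the values tau_0 = 1.0 and kappa = 0.5 below are assumptions chosen for illustration, not settings taken from this thread.

```python
def rho(t, tau0=1.0, kappa=0.5):
    """Step size rho_t = (tau_0 + t) ** (-kappa), from the formula quoted above."""
    return (tau0 + t) ** (-kappa)


steps = [0, 1, 10, 100, 1000]
weights = [rho(t) for t in steps]

# Print the weight given to the update at each step t.
for t, w in zip(steps, weights):
    print(f"t={t:5d}  rho_t={w:.4f}")
```

Each later update is blended in with a smaller weight rho_t, which is the sense in which its impact diminishes.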

Best,
Radim

>
> Quentin
>
> [1] Hoffman, Blei, Bach, Online Learning for Latent Dirichlet Allocation, 2010


Matthew Mitsui | 7 Apr 2013 19:54

[gensim:1847] Online HDP per-document topic distribution?

Hi boards,


I am fairly new to topic modeling and am attempting to extract per-document topic distributions from training and test data and use the distributions as features in a classifier that assigns labels to documents.  How would I extract a document's topic distribution for HDP/online HDP?  I read on the board topic "difference between lda.expElogbeta and lda.show_topics" that the parameter I should pay attention to is gamma.  Is gamma the variable that I should extract for each document, and if so, how?  Also, is there any normalization or additional processing that I need to do on what I need to extract before using it as a feature?
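
On the normalization question, a minimal sketch under a stated assumption: if gamma is the variational Dirichlet parameter for one document (as in LDA-style inference), dividing it by its sum gives the expected topic proportions, which are non-negative and sum to one. Whether and how a given HDP implementation exposes gamma is not shown here; the gamma values and the name `topic_proportions` are made up for illustration.

```python
def topic_proportions(gamma):
    """Normalize a per-document gamma vector so that it sums to 1."""
    total = sum(gamma)
    if total <= 0:
        raise ValueError("gamma must have a positive sum")
    return [g / total for g in gamma]


# Hypothetical gamma values for one document in a 4-topic model.
gamma = [0.5, 12.5, 0.5, 36.5]
features = topic_proportions(gamma)
print(features)
```

The resulting vector has one entry per topic and can be passed to a classifier as a fixed-length feature vector.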

Thanks in advance!

 
 
Radim Řehůřek | 6 Apr 2013 18:45

[gensim:1844] Re: about Gensim

Hello Niraj,

On Fri, Apr 5, 2013 at 6:02 PM, Niraj Shrestha <nshresthan@...> wrote:
> Dear Sir
>
> I installed Gensim on Ubuntu and tried the example (similarity) given on
> the website. It works fine in interactive Python, but when I save
> all the code in a .py file and run it, there is a bug in building the set of
> stopwords and also in making all_tokens etc.
>
> texts = [[word for word in document.lower().split() if word not in stoplist]
>          for document in documents]
>
> This command works fine in interactive Python, but when run from a .py
> file it creates a set of characters rather than a set of words.

There is no "interactive" vs. ".py" Python -- it's still the same
Python. If you see a difference, you likely made a copy & paste error.

In particular, if you see `texts` as a list of single characters
instead of tokens, it looks like your `documents` is a single string,
instead of an array of strings as in the tutorial.
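
To illustrate the difference, here is a small sketch using the tutorial's list comprehension; the sample sentences and the stoplist are stand-ins chosen for the example, and `tokenize` is a hypothetical wrapper name.

```python
stoplist = set("for a of the and to in".split())


def tokenize(documents):
    """The tutorial's comprehension: one token list per item in `documents`."""
    return [[word for word in document.lower().split() if word not in stoplist]
            for document in documents]


# A list of strings: each item is a whole document, so we get lists of words.
as_list = tokenize(["Human machine interface", "A survey of user opinion"])

# A single string: each item is one character, so the "tokens" are characters.
as_string = tokenize("Human machine interface")

print(as_list)
print(as_string[:5])
```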

HTH,
Radim

> Then I changed the above code to
>
> texts = [[word for word in documents.lower().split() if word not in stoplist]]
>
> It creates a set of words, but now there is an error on
>
> all_tokens = sum(texts, [])
>
> It says:
>
> Traceback (most recent call last):
>   File "similarity.py", line 26, in <module>
>     all_tokens = sum(texts, [])
> TypeError: can only concatenate list (not "str") to list
>
> I also got the same kind of error when I loaded the stopword.txt file and tried
> to convert it into a set, and I was not able to do that. So I have hard-coded
> the stop words.
>
> Looking forward to your kind suggestion.
>
> Thanks in advance
>
> Shrestha
>
> PS: I tried to join the mailing list, but it says there is no gensim group.


