[gensim:1872] extremely slow lda?
I've run LDA on a few different corpora with great results, but when
I run it on a large dataset, it is too slow. I think I might be doing
something wrong, though, as I was reading the wiki page and it said:
"Creating this LDA model of Wikipedia takes about 6 hours and 20
minutes on my laptop"

For my corpus, I have 1.5 million documents, with a mean of 304.95
words per document and a standard deviation of 1314.35.

Processing this whole dataset, from creating the dictionary to running
the model, took 10 days on an AWS m1.large instance (I think
equivalent to ~2 Intel cores).
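
For a rough sense of scale, the figures above can be turned into a throughput estimate. This is only a back-of-the-envelope sketch: it assumes the full 10 days counts as processing time (it also includes building the dictionary), and the variable names are made up for illustration.

```python
# Back-of-the-envelope throughput from the figures quoted above.
# Assumption: all 10 days are treated as processing time, even though
# that span also covers building the dictionary.
num_docs = 1500000
mean_words_per_doc = 304.95
elapsed_days = 10

total_words = num_docs * mean_words_per_doc
elapsed_seconds = elapsed_days * 24 * 60 * 60

docs_per_second = num_docs / float(elapsed_seconds)
words_per_second = total_words / float(elapsed_seconds)

print("total words: about %.0f million" % (total_words / 1e6))
print("documents per second: %.2f" % docs_per_second)
print("words per second: %.0f" % words_per_second)
```

That works out to roughly 457 million words in total, processed at a bit under 2 documents per second.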

My questions are: am I doing something inherently wrong? I have not
tried the distributed version yet; are my times expected? Are my
modeling options set incorrectly? Is there anything else I can do to
speed this up? My main reason for speeding this up is that I am
still testing and tweaking settings, and 10 days between tests
would be too painful.
I've included the relevant code.

import json
import re

import gensim
from gensim import corpora

def prepare_doc(text):
    # Normalize one raw document before tokenizing.
    text = text.lower()
    # Drop URLs (the pattern removes everything from the URL to the end of the line).
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    # Drop @mentions.
    text = re.sub(r'@\w+', ' ', text, flags=re.MULTILINE)
    # Strip markup, punctuation, stopwords, short tokens and digits, then stem.
    text = gensim.parsing.preprocessing.strip_tags(text)
    text = gensim.parsing.preprocessing.strip_punctuation(text)
    text = gensim.parsing.preprocessing.remove_stopwords(text)
    text = gensim.parsing.preprocessing.strip_short(text)
    text = gensim.parsing.preprocessing.strip_numeric(text)
    text = gensim.parsing.preprocessing.stem_text(text)
    return text

# Corpus loader
class MyCorpus(object):
    def __init__(self, docs_file, dictionary):
        self.docs_file = docs_file
        self.dictionary = dictionary

    def __iter__(self):
        # One JSON object per line; its 'doc' field holds the tokens,
        # separated by whitespace.
        with open(self.docs_file) as f:
            for line in f:
                yield self.dictionary.doc2bow(json.loads(line)['doc'].split())

dictionary = corpora.Dictionary()
with open(docs_file) as f:
    for line in f:
        dictionary.add_documents([json.loads(line)['doc'].split()])
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

corpus = MyCorpus(docs_file, dictionary)
model = gensim.models.ldamodel.LdaModel(corpus, id2word=dictionary,
                                        num_topics=num_topics, chunksize=100)
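
Since the goal is quicker turnaround while tweaking settings, one option is to run test rounds on only the first part of the corpus. The following is a minimal sketch of that idea, not something from gensim itself: LimitedCorpus and max_docs are hypothetical names, and it assumes the wrapped corpus can be iterated more than once, as MyCorpus above can.

```python
import itertools


class LimitedCorpus(object):
    """Hypothetical wrapper that yields only the first max_docs documents."""

    def __init__(self, corpus, max_docs):
        self.corpus = corpus
        self.max_docs = max_docs

    def __iter__(self):
        # Restart from the beginning of the wrapped corpus on every pass,
        # so the wrapper can be iterated repeatedly like the original.
        return itertools.islice(iter(self.corpus), self.max_docs)


# Stand-in data for illustration; in practice this would be the MyCorpus
# instance, e.g. LimitedCorpus(corpus, 10000).
toy_corpus = [[(0, 1)], [(1, 2)], [(2, 1)], [(3, 4)]]
small_corpus = LimitedCorpus(toy_corpus, 2)
print(list(small_corpus))
```

The first documents in a file may not be representative of the whole dataset, so results from such a subset are only a rough guide.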