Using the BBC data as before, we train an LDA model with five topics via sklearn.decomposition.LatentDirichletAllocation (see the sklearn documentation for details on the parameters, and the notebook lda_with_sklearn for implementation details):
lda = LatentDirichletAllocation(n_components=5,
                                n_jobs=-1,
                                max_iter=500,
                                learning_method='batch',
                                evaluate_every=5,
                                verbose=1,
                                random_state=42)
lda.fit(train_dtm)
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=5, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=500,
                          mean_change_tol=0.001, n_components=5, n_jobs=-1,
                          n_topics=None, perp_tol=0.1, random_state=42,
                          topic_word_prior=None, total_samples=1000000.0,
                          verbose=1)
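Once fitted, the topic-word distributions are available in lda.components_. A minimal sketch of listing the highest-weight terms per topic, assuming the fitted vectorizer that produced train_dtm is available as vectorizer (not shown in this excerpt):

import numpy as np

topic_word = lda.components_   # shape: (5 topics, vocabulary size)
# use vectorizer.get_feature_names() on older sklearn versions
words = np.asarray(vectorizer.get_feature_names_out())
for i, dist in enumerate(topic_word):
    top = words[np.argsort(dist)[::-1][:10]]   # ten highest-weight terms
    print(f'Topic {i}: {", ".join(top)}')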
With evaluate_every=5, the model evaluates the in-sample perplexity every five iterations during training and stops early once the improvement falls below perp_tol. We can persist and load the result as usual with sklearn objects:
import joblib

joblib.dump(lda, model_path / 'lda.pkl')    # persist the fitted model
lda = joblib.load(model_path / 'lda.pkl')   # reload for later use
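Since the perplexity tracked during training is measured in-sample, it is worth also checking it on held-out documents, and the reloaded model can assign topic distributions to new documents via transform. A minimal sketch, assuming a held-out document-term matrix test_dtm built with the same fitted vectorizer:

# test_dtm is an assumed held-out document-term matrix produced by
# the same vectorizer that created train_dtm.
print(f'Held-out perplexity: {lda.perplexity(test_dtm):,.1f}')

# Each row is a document's distribution over the five topics.
doc_topics = lda.transform(test_dtm)
print(doc_topics[:5].round(3))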