Initial attempts by topic models to improve on the vector space model (developed in the mid-1970s) applied linear algebra to reduce the dimensionality of the document-term matrix. This approach is similar to principal component analysis, which we discussed in Chapter 12, Unsupervised Learning. While effective, these models are difficult to evaluate in the absence of a benchmark model.
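The linear-algebra approach can be sketched with a truncated singular value decomposition (SVD) of a document-term matrix. The toy matrix, vocabulary, and rank below are invented for illustration; this is a minimal sketch, not a production pipeline:

```python
import numpy as np

# Toy document-term matrix: 4 documents x 5 terms (raw counts).
# Hypothetical terms: ["market", "stock", "goal", "team", "price"]
dtm = np.array([
    [3, 2, 0, 0, 1],   # finance-themed document
    [2, 3, 0, 0, 2],   # finance-themed document
    [0, 0, 3, 2, 0],   # sports-themed document
    [0, 1, 2, 3, 0],   # sports-themed document
], dtype=float)

# Truncated SVD: keep the k largest singular values to obtain a
# low-rank "semantic" representation of documents and terms.
k = 2
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
doc_embeddings = U[:, :k] * s[:k]      # documents in the k-dim latent space
term_embeddings = Vt[:k, :].T * s[:k]  # terms in the same latent space

print(doc_embeddings.shape)  # (4, 2)
```

In the reduced space, documents that share vocabulary end up close together even when they do not share every term, which is the "semantic" relationship the dimensionality reduction is meant to capture.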
In response, probabilistic models emerged that assume an explicit document generation process and provide algorithms to reverse engineer this process and recover the underlying topics.
This table highlights key milestones in the model evolution that we will address in more detail in the following sections:
| Model | Year | Description |
| --- | --- | --- |
| Latent Semantic Indexing (LSI) | 1988 | Reduces the word space dimensionality to capture semantic document-term relationships by applying singular value decomposition to the document-term matrix |
| Probabilistic Latent Semantic Analysis (pLSA) | 1999 | Reverse-engineers a generative process that assumes topics generate words and documents are a mix of topics |
| Latent Dirichlet Allocation (LDA) | 2003 | Adds a generative process for documents: a three-level hierarchical Bayesian model |