One of the best ways to gain a deep understanding of a topic is to repeat the experiments of researchers and then modify them in some way. That’s how the best professors and mentors “teach” their students: by encouraging them to duplicate the results of other researchers they’re interested in. You can’t help but tweak an approach if you spend enough time trying to get it to work for you.
Vector space models and semantic search
Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines (https://arxiv.org/pdf/1706.00957.pdf)—Jan Rygl et al. were able to use a conventional inverted index to implement efficient semantic search for all of Wikipedia.
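The core trick in that paper, quantizing dense vectors into discrete tokens that a conventional inverted index can store, can be sketched in a few lines of Python. This is a toy illustration with made-up document vectors and a crude thresholding scheme, not the paper's actual encoding:

```python
from collections import defaultdict

# Hypothetical mini-corpus of dense document vectors (e.g. topic weights).
doc_vectors = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.2, 0.8, 0.1],
    "doc3": [0.7, 0.3, 0.2],
}

def to_tokens(vector, threshold=0.5):
    """Quantize a dense vector into discrete tokens ("dim0", "dim1", ...)
    so a conventional fulltext engine can index it like words."""
    return {f"dim{i}" for i, weight in enumerate(vector) if weight >= threshold}

# Build the inverted index: token -> set of document ids.
index = defaultdict(set)
for doc_id, vec in doc_vectors.items():
    for token in to_tokens(vec):
        index[token].add(doc_id)

def semantic_search(query_vector):
    """Retrieve documents that share at least one quantized dimension."""
    hits = set()
    for token in to_tokens(query_vector):
        hits |= index[token]
    return sorted(hits)

print(semantic_search([0.8, 0.0, 0.0]))  # documents strong in dimension 0
```

A production system would rank the hits by cosine similarity to the query vector; the inverted index only narrows the candidate set cheaply.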
Learning Low-Dimensional Metrics (https://papers.nips.cc/paper/7002-learning-low-dimensional-metrics.pdf)—Lalit Jain et al. were able to incorporate human judgement into pairwise distance metrics, which can be used for better decision-making and unsupervised clustering of word
vectors and topic vectors. For example, recruiters can use this to steer a content-based recommendation engine that matches
resumes with job descriptions.
RAND-WALK: A latent variable model approach to word embeddings (https://arxiv.org/pdf/1502.03520.pdf) by Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski—Explains the latest (2016) understanding of the
“vector-oriented reasoning” of Word2vec and other word vector space models, particularly analogy questions
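The “vector-oriented reasoning” Arora et al. analyze is the familiar analogy arithmetic. Here is a minimal sketch with made-up toy vectors (real Word2vec vectors have hundreds of dimensions and are trained on billions of tokens):

```python
import numpy as np

# Toy word vectors, invented for illustration only.
vectors = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "prince": np.array([0.9, 0.8, 0.5]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Solve "a is to b as c is to ?" with vector arithmetic: b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))  # → queen
```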
Efficient Estimation of Word Representations in Vector Space (https://arxiv.org/pdf/1301.3781.pdf) by Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean at Google, Sep 2013—First publication of the Word2vec model,
including an implementation in C++ and pretrained models using a Google News corpus
From Distributional to Semantic Similarity (https://www.era.lib.ed.ac.uk/bitstream/handle/1842/563/IP030023.pdf), 2003 Ph.D. thesis by James Richard Curran—Lots of classic information retrieval (full-text search) research, including TF-IDF normalization and PageRank techniques for web search
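TF-IDF, one of the classic weighting schemes Curran surveys, fits in a few lines of standard-library Python. This is a bare-bones sketch on a toy corpus; real implementations add smoothing and sublinear term-frequency scaling:

```python
import math

# Tiny hypothetical corpus, one tokenized document per list.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf_idf(term, doc, corpus):
    """Term frequency times inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = math.log(len(corpus) / df)           # rare terms get a boost
    return tf * idf

# "the" appears in 2 of 3 documents, so its weight is discounted;
# "cat" appears in only 1, so it scores higher despite a lower raw count.
print(round(tf_idf("cat", docs[0], docs), 3))
print(round(tf_idf("the", docs[0], docs), 3))
```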
Building a Quantitative Trading Strategy to Beat the S&P 500 (https://www.youtube.com/watch?v=ll6Tq-wTXXw)—At PyCon 2016, Karen Rubin explained how she discovered that female CEOs are predictive of rising stock prices, though not
as strongly as she initially thought.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation (https://arxiv.org/pdf/1406.1078.pdf) by Kyunghyun Cho et al., 2014—Paper that first introduced gated recurrent units (GRUs), a simplification of LSTM units that makes recurrent networks more efficient for many NLP tasks
LSTMs and RNNs
We had a lot of difficulty understanding the terminology and architecture of LSTMs. This is a gathering of the most-cited references, so you can let the authors “vote” on the right way to talk about LSTMs. The state of the Wikipedia page (and Talk page discussion) on LSTMs is a pretty good indication of the lack of consensus about what “LSTM” means:
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (https://arxiv.org/pdf/1406.1078.pdf) by Cho et al.—Explains how the contents of the memory cells in an LSTM layer can be used as an embedding that encodes a variable-length sequence and then decodes it to a new sequence of potentially different length, translating or transcoding one sequence into another.
Reinforcement Learning with Long Short-Term Memory (https://papers.nips.cc/paper/1953-reinforcement-learning-with-long-short-term-memory.pdf) by Bram Bakker—Application of LSTMs to planning and anticipation, with demonstrations of a network that can solve the T-maze navigation task and an advanced pole-balancing (inverted pendulum) problem.
Supervised Sequence Labelling with Recurrent Neural Networks (https://mediatum.ub.tum.de/doc/673554/file.pdf)—Thesis by Alex Graves with advisor B. Brugge; a detailed explanation of the mathematics for the exact gradient for LSTMs
as first proposed by Hochreiter and Schmidhuber in 1997. But Graves fails to define terms like CEC or LSTM block/cell rigorously.
Theano LSTM documentation (http://deeplearning.net/tutorial/lstm.html) by Pierre Luc Carrier and Kyunghyun Cho—Diagram and discussion explaining the LSTM implementation in Theano and Keras.
Learning to Forget: Continual Prediction with LSTM (http://mng.bz/4v5V) by Felix A. Gers, Jurgen Schmidhuber, and Fred Cummins—Uses nonstandard notation for layer inputs (yin) and outputs (yout) and internal hidden state (h). All math and diagrams are “vectorized.”
Long Short-Term Memory (http://www.bioinf.jku.at/publications/older/2604.pdf) by Sepp Hochreiter and Jurgen Schmidhuber, 1997—Original paper on LSTMs with outdated terminology and inefficient implementation,
but detailed mathematical derivation.
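For readers who want the gist of the derivation without the 1997 notation, here is a minimal numpy sketch of one forward step of an LSTM cell in modern notation. Note that this includes the forget gate, which Gers et al. added after the original paper; the weight shapes and data are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward step of an LSTM cell (modern notation, with forget gate).
    The four gates' weights are stacked row-wise in W, U, and b."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b      # pre-activations for all four gates
    i = sigmoid(z[0 * n:1 * n])     # input gate
    f = sigmoid(z[1 * n:2 * n])     # forget gate
    o = sigmoid(z[2 * n:3 * n])     # output gate
    g = np.tanh(z[3 * n:4 * n])     # candidate cell update
    c = f * c_prev + i * g          # new cell state (the "CEC" memory)
    h = o * np.tanh(c)              # new hidden state / layer output
    return h, c

# Tiny example: 3-dim input, 2-dim hidden state, random weights.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
h, c = np.zeros(2), np.zeros(2)
W = rng.normal(size=(8, 3))         # 4 gates * hidden size 2
U = rng.normal(size=(8, 2))
b = np.zeros(8)
h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)             # (2,) (2,)
```

The cell state `c` carries information across time steps through purely additive updates, which is what lets gradients survive long sequences.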