GitHub - craigboman/gutenberg: Librarian working with project gutenberg data, for NLP and machine learning purposes (https://github.com/craigboman/gutenberg).
Longitudinal Detection of Dementia Through Lexical and Syntactic Changes in Writing (ftp://ftp.cs.toronto.edu/dist/gh/Le-MSc-2010.pdf)—Master's thesis by Xuan Le on psychological diagnosis with NLP.
Time Series Matching: a Multi-filter Approach by Zhihua Wang (https://www.cs.nyu.edu/web/Research/Theses/wang_zhihua.pdf)—Songs, audio clips, and other time series can be discretized and searched with dynamic programming algorithms analogous
to Levenshtein distance.
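
To make the Levenshtein analogy concrete, here is a minimal sketch in Python. It is not taken from the thesis: the equal-width binning, the bin count, and the function names `discretize` and `edit_distance` are assumptions chosen for illustration. Each series is mapped to a sequence of integer symbols, and the two symbol sequences are then compared with the standard edit-distance dynamic program.

```python
def discretize(series, n_bins=4):
    """Map each value to an integer bin index in [0, n_bins - 1].

    Assumption: equal-width bins scaled to the series' own min and max,
    so two series with the same shape but different amplitude match.
    """
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0  # avoid dividing by zero on flat series
    return [min(int((x - lo) / width), n_bins - 1) for x in series]


def edit_distance(a, b):
    """Levenshtein distance between two sequences, one DP row at a time."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if symbols match
            ))
        prev = curr
    return prev[-1]


# Hypothetical data: a short query and a rescaled, slightly altered candidate.
query = [0.0, 1.0, 2.0, 3.0]
candidate = [0.0, 10.0, 20.0, 30.0]
distance = edit_distance(discretize(query), discretize(candidate))
```

Because each series is binned relative to its own range, the rescaled candidate produces the same symbol sequence as the query. A real search system would likely need a more careful discretization and an index to avoid comparing against every stored series.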
NELL, Never Ending Language Learning (http://rtw.ml.cmu.edu/rtw/publications)—CMU’s constantly evolving knowledge base that learns by scraping natural language text.
The artificial-adversary (https://github.com/airbnb/artificial-adversary) package by Jack Dai, an intern at Airbnb—Obfuscates natural language text (turning phrases like ‘you are great’ into ‘ur gr8’). You could train a machine learning model to detect obfuscated English or L33T (https://sites.google.com/site/inhainternetlanguage/different-internet-languages/l33t), or to translate English into it. You could also train a stemmer (an autoencoder with the obfuscator generating character features) to decipher obfuscated words so your NLP pipeline can handle obfuscated text without retraining. Thank you, Aleck.
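
The idea of using an obfuscator to generate training data can be sketched as follows. This sketch does not call the artificial-adversary package: `RULES`, `obfuscate`, `make_pairs`, `build_decoder`, and `normalize` are hypothetical names, and the word-for-word substitution table is a toy stand-in for a real obfuscator. The lookup-table decoder stands in for the trained stemmer or autoencoder that the entry describes.

```python
# Hypothetical word-level substitutions; a real obfuscator is far richer
# and can also merge words (for example 'you are' into 'ur').
RULES = {"you": "u", "are": "r", "great": "gr8", "to": "2", "for": "4"}


def obfuscate(text):
    """Replace each known word with its obfuscated form."""
    return " ".join(RULES.get(w, w) for w in text.lower().split())


def make_pairs(sentences):
    """Build (obfuscated, clean) training pairs from clean sentences."""
    return [(obfuscate(s), " ".join(s.lower().split())) for s in sentences]


def build_decoder(pairs):
    """Learn a token-level lookup from the pairs.

    Stand-in for a trained model: it only memorizes substitutions it has
    seen, and relies on the toy obfuscator keeping tokens aligned one to one.
    """
    table = {}
    for noisy, clean in pairs:
        for n, c in zip(noisy.split(), clean.split()):
            table[n] = c
    return table


def normalize(text, table):
    """Map obfuscated tokens back to clean ones before the NLP pipeline."""
    return " ".join(table.get(w, w) for w in text.lower().split())


pairs = make_pairs(["You are great", "Thanks for coming to dinner"])
decoder = build_decoder(pairs)
restored = normalize("u r gr8 4 dinner", decoder)
```

Placing a normalizer like this in front of an existing pipeline is what lets the downstream models stay unchanged. A learned character-level model would be needed to generalize to obfuscations that never appeared in the generated pairs.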