Datasets

Natural language data is everywhere you look. Language is the superpower of the human race, and your pipeline should take advantage of it:

Google’s Dataset Search (http://toolbox.google.com/datasetsearch)—A search engine similar to Google Scholar (http://scholar.google.com), but for data.
Stanford Datasets (https://nlp.stanford.edu/data/)—Pretrained word2vec and GloVe models, multilingual language models and datasets, multilingual dictionaries, lexica, and corpora.
Pretrained word vector models (https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-model)—The README for a word vector web API provides links to several word vector models, including the 300D Wikipedia GloVe model.
A list of datasets/corpora for NLP tasks, in reverse chronological order (https://github.com/karthikncode/nlp-datasets) by Karthik Narasimhan.
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) (https://github.com/niderhoff/nlp-datasets).
Datasets and tools for basic natural language processing (https://github.com/googlei18n/language-resources)—Google’s international tools for i18n.
nlpia (https://github.com/totalgood/nlpia)—Python package with data loaders (nlpia.loaders) and preprocessors for all the NLP data you’ll ever need... until you finish this book ;).
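
Several of the resources above distribute pretrained word vectors as plain text, with one token per line followed by its space-separated vector components. The sketch below shows one way such a file could be read into a dictionary. It is a minimal illustration, not the loader used by nlpia or any other package listed here: the function name load_text_vectors and the three-dimensional sample data are made up for the example, and it assumes the file has no header line (word2vec text exports start with a "count dimensions" line that would need to be skipped first).

```python
import io
from typing import Dict, List, TextIO


def load_text_vectors(handle: TextIO) -> Dict[str, List[float]]:
    """Parse GloVe-style text vectors: 'token v1 v2 ... vN' on each line.

    Hypothetical helper for illustration; assumes no header line.
    """
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, components = parts[0], parts[1:]
        vectors[token] = [float(x) for x in components]
    return vectors


# Tiny in-memory sample standing in for a real downloaded file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_text_vectors(sample)
print(len(vectors), len(vectors["cat"]))
```

For a real download, you would pass an open file handle (for example, open(path, encoding="utf-8")) in place of the in-memory sample. A plain dictionary of lists is fine for inspection, but a full 300D model is large enough that a NumPy array or a dedicated library is likely a better fit.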
