!python -m spacy download en_core_web_lg command 132, 209, 390
!python -m spacy download en_core_web_md command 132, 209, 390
!python -m spacy download en_core_web_sm command 132, 209, 390
with help of lexicon 235 – 237
with sentiment lexicon 251 – 259
collecting sentiment scores from lexicon 252 – 255
detecting review polarity 255 – 259
evaluating which tree is better using node impurity 178 – 184
selection of best split in 184 – 185
linguistic feature engineering for 194 – 228
feature engineering for authorship attribution 200 – 226
machine-learning pipeline 196 – 200
practical use of authorship attribution and user profiling 226 – 227
machine-learning pipeline 157 – 175
setting up benchmark 169 – 175
testing generalization behavior 163 – 168
benchmark machine learning model 157
Boolean search algorithm 83 – 86
chunk.root.head.text function 141
CISI (Centre for Inventions and Scientific Information) dataset 75
implementing spam filter 46 – 49
implementing spam filter 62 – 65
evaluating performance of 196 – 197
implementing spam filter 53 – 61
evaluation of topic clustering algorithm 338 – 341
cross_val_predict functionality 293
cross_val_score functionality 293
CSV (comma-separated values) 404
loading and preprocessing 240 – 242
implementing spam filter 46 – 49
NER (named-entity recognition) 403 – 406
morphological processing 90 – 95
supervised ML (machine learning) 308 – 312
evaluating which tree is better using node impurity 178 – 184
selection of best split in 184 – 185
dependence on context 277 – 295
evaluating with cross-validation 292 – 295
extracting features from text 284 – 289
Scikit-learn machine-learning pipeline 289 – 292
displaCy visualization tool 142, 416
document similarity retrieval 104 – 105
inverse document frequency 100 – 103
implementing spam filter 62 – 65
implementing spam filter 50 – 52
feature engineering for authorship attribution 200 – 226
counts of stopwords and proportion of stopwords as features 207 – 211
distribution of word suffixes as features 219 – 222
distributions of parts of speech as features 212 – 218
unique words as features 223 – 226
word and sentence length statistics as features 201 – 207
features 25, 33, 152, 195
generalization behavior, testing 163 – 168
get_feature_names() function 287
get_feature_names_out() function 287
Gibbs Sampling for the Uninitiated (Resnik and Hardisty) 355
IDE (integrated development environment), Python 9
idf (inverse document frequency) 100 – 103
information extraction 114 – 150
building information extraction algorithm 144 – 148
part-of-speech tagging 124 – 137
dependency parsing with spaCy 139 – 144
with NER (named-entity recognition) 410 – 415
information search 5, 71 – 113
overview 5 – 16
morphological processing 90 – 95
document similarity retrieval 104 – 105
evaluation of results 106 – 111
Boolean search algorithm 83 – 86
data and data structures 75 – 83
with inverse document frequency 100 – 103
integrated development environment (IDE) 9
inverse document frequency (idf) 100 – 103
KMeans clustering algorithm 337
language data, on Decision Trees 185 – 191
LDA (latent Dirichlet allocation) 307, 348 – 360
estimating parameters for 352 – 356
length of sentiment-bearing features 295 – 297
aggregating sentiment scores with 235 – 237
collecting sentiment scores from 252 – 255
detecting review polarity 255 – 259
linguistic feature engineering 194 – 228
feature engineering for authorship attribution 200 – 226
counts of stopwords and proportion of stopwords as features 207 – 211
distribution of word suffixes as features 219 – 222
distributions of parts of speech as features 212 – 218
unique words as features 223 – 226
word and sentence length statistics as features 201 – 207
machine-learning pipeline 196 – 200
evaluating performance of classifier 196 – 197
further evaluation measures 197 – 200
practical use of authorship attribution and user profiling 226 – 227
lower bound on algorithm’s performance 173
machine learning. See ML
mean reciprocal rank (MRR) 109
metrics functionality 204, 320
addressing dependence on context with 277 – 295
evaluating with cross-validation 292 – 295
extracting features from text 284 – 289
Scikit-learn machine-learning pipeline 289 – 292
machine-learning pipeline 157 – 175
linguistic feature engineering 196 – 200
evaluating performance of classifier 196 – 197
further evaluation measures 197 – 200
machine-learning pipeline, Scikit-learn 289 – 292
topic classification as supervised task 307 – 325
evaluation of results 320 – 325
topic discovery as unsupervised task 325 – 341
evaluation of topic clustering algorithm 338 – 341
unsupervised ML (machine learning) approaches 325 – 329
morphological processing 90 – 95
MRR (mean reciprocal rank) 109
multiclass classification 33, 307 – 325
Natural Language Toolkit (NLTK) 49
NER (named-entity recognition) 384 – 421
as sequence labeling task 392 – 403
sequential solution for NER 397 – 403
named entity (NE) types 388 – 390
practical applications of 403 – 418
data loading and exploration 403 – 406
information extraction 410 – 415
named entities visualization 416 – 418
named entity types exploration with spaCy 406 – 410
neural-based language modeling 24
neural machine translation (NMT) 28
n-grams 22, 24, 280
NLP (natural language processing) 1 – 30
deploying spam filter in practice 65 – 66
implementing spam filter 46 – 65
advanced information search 16 – 18
conversational agents and intelligent virtual assistants 18 – 20
spell- and grammar checking 28 – 29
text prediction and language generation 20 – 25
nltk.download() command 50, 84, 89, 159, 269
NLTK (Natural Language Toolkit) 49
NMT (neural machine translation) 28
implementing spam filter 50 – 52
operator’s itemgetter functionality 104
ORG (organization) type 385, 410
pandas read_csv functionality 404
part-of-speech taggers 18, 128
part-of-speech tagging 124 – 137
setting up benchmark 169 – 175
testing generalization behavior 163 – 168
linguistic feature engineering 196 – 200
evaluating performance of classifier 196 – 197
further evaluation measures 197 – 200
data loading and preprocessing 240 – 242
plot_confusion_matrix functionality 323
pobj (prepositional object) 142
prepositional object (pobj) 142
probability estimation 21, 136 – 137
morphological processing 90 – 95
Scikit-learn machine-learning pipeline 289 – 292
search algorithm 73, 103 – 111
document similarity retrieval 104 – 105
evaluation of results 106 – 111
sentences, length statistics 201 – 207
addressing dependence on context with machine learning 277 – 295
evaluating with cross-validation 292 – 295
extracting features from text 284 – 289
Scikit-learn machine-learning pipeline 289 – 292
aggregating sentiment scores with sentiment lexicon 251 – 259
collecting sentiment scores from lexicon 252 – 255
detecting review polarity 255 – 259
negation handling for 298 – 301
data loading and preprocessing 240 – 242
aggregating sentiment score with help of lexicon 235 – 237
learning to detect sentiment in data-driven way 237 – 238
varying length of sentiment-bearing features 295 – 297
sentiment lexicon-based approach 235
sequential solution for NER 397 – 403
simple heuristic algorithm 252
simple_preprocess functionality 365
singular value decomposition (SVD) 332
SMT (statistical machine translation) 28
dependency parsing with 139 – 144
named entity types exploration with 406 – 410
part-of-speech tagging with 128 – 137
deploying spam filter in practice 65 – 66
implementing spam filter 46 – 65
defining data and classes 46 – 49
extracting and normalizing features 50 – 52
splitting text into words 49 – 50
extracting and normalizing features 42 – 43
splitting text into words 37 – 42
Speech and Language Processing (Jurafsky and Martin) 2, 392
implementing spam filter 49 – 50
statistical machine translation (SMT) 28
count of and proportion of as features 207 – 211
StratifiedShuffleSplit function 167
stratified shuffling split 166
supervised ML (machine learning) 34, 153, 307 – 325
evaluation of results 320 – 325
SVD (singular value decomposition) 332
dependency parsing with spaCy 139 – 144
text, extracting features from 284 – 289
TF-IDF (term frequency–inverse document frequency) 314
tokenizer tool 9, 41, 84, 129
topic classification as supervised ML task 307 – 325
evaluation of results 320 – 325
topic discovery as unsupervised ML task 325 – 341
evaluation of topic clustering algorithm 338 – 341
unsupervised approaches 325 – 329
implementation of topic modeling algorithm 360 – 378
with LDA (latent Dirichlet allocation) 349 – 360
estimating parameters for 352 – 356
implementing spam filter 53 – 61
unsupervised ML (machine learning) 325 – 341
evaluation of topic clustering algorithm 338 – 341
upper bound on algorithm’s performance 173
visualization, named entities 416 – 418
distribution of suffixes as features 219 – 222
length statistics as features 201 – 207
unique words as features 223 – 226
with inverse document frequency 100 – 103