index

Symbols

!python -m spacy download en_core_web_lg command 132, 209, 390

!python -m spacy download en_core_web_md command 132, 209, 390

!python -m spacy download en_core_web_sm command 132, 209, 390

accuracy 45, 196

additive smoothing 318

add-one smoothing 318

adjectives lexicons 253

adposition 130

ADP POS tag 144

ADP tag 130

aggregating sentiment scores

with help of lexicon 235 – 237

with sentiment lexicon 251 – 259

collecting sentiment scores from lexicon 252 – 255

detecting review polarity 255 – 259

algorithms

topic clustering 338 – 341

topic modeling 360 – 378

applying LDA model 371 – 374

exploring results 375 – 378

loading data 361 – 363

preprocessing data 363 – 371

alpha parameter 317

argmax 401

arrays 7

author profiling 151 – 193

authorship attribution

overview 154

practical use of 226 – 227

Decision Trees 175

classifier basics 175 – 177

evaluating which tree is better using node impurity 178 – 184

on language data 185 – 191

selection of best split in 184 – 185

linguistic feature engineering for 194 – 228

feature engineering for authorship attribution 200 – 226

machine-learning pipeline 196 – 200

practical use of authorship attribution and user profiling 226 – 227

machine-learning pipeline 157 – 175

original data 157 – 163

setting up benchmark 169 – 175

testing generalization behavior 163 – 168

user profiling 155 – 157

bag-of-words models 397

base form 91

baseline 46

benchmark machine learning model 157

bigram modeling 22

Binarizer tool 289

binary classification 33, 159

binary method 258

BIO scheme 393 – 394

Blei, David 381

Boolean search algorithm 83 – 86

chunk.root.dep function 141

chunk.root.head.text function 141

chunk.root.text function 141

chunk.text function 141

CISI (Centre for Inventions and Scientific Information) dataset 75

classes

defining 37

implementing spam filter 46 – 49

classification 32

classifiers

Decision Trees 175 – 177

evaluating

implementing spam filter 62 – 65

overview 45 – 46

evaluating performance of 196 – 197

Naïve Bayes 207, 312 – 320

training

implementing spam filter 53 – 61

overview 43 – 44

class label 33

clustering

evaluation of topic clustering algorithm 338 – 341

for topic discovery 330 – 337

codecs.open function 47

completeness 339

concordance function 64

conditional probability 54

confusion matrices 196

content words 18

convergence 329

conversational agents 18, 20

cosine similarity 13, 104

Counter functionality 216

CountVectorizer tool 285

cross-validation 292 – 295

cross_val_predict functionality 293

cross_val_score functionality 293

CSV (comma-separated values) 404

.csv format 404

data

data structures 75 – 83

defining 37

for sentiment analysis

analyzing 243 – 251

loading and preprocessing 240 – 242

implementing spam filter 46 – 49

NER (named-entity recognition) 403 – 406

processing 87 – 95

morphological processing 90 – 95

removing stopwords 87 – 90

supervised ML (machine learning) 308 – 312

topic modeling algorithm

applying LDA model 371 – 374

exploring results 375 – 378

loading data 361 – 363

preprocessing data 363 – 371

DATE type 410

decision rule 176

Decision Trees 175

classifier basics 175 – 177

evaluating which tree is better using node impurity 178 – 184

on language data 185 – 191

selection of best split in 184 – 185

deep learning 4

def keyword 15

dependence on context 277 – 295

evaluating with cross-validation 292 – 295

extracting features from text 284 – 289

preparing data 278 – 284

Scikit-learn machine-learning pipeline 289 – 292

dependency parsing 139 – 144

dependents 140

df.shape function 405

displaCy visualization tool 142, 416

dobj (direct object) 142

dobj relation 145

documents

document similarity retrieval 104 – 105

inverse document frequency 100 – 103

dot product 14

downstream tasks 386

Enron dataset 47

Euclidean distance 11

Euclidean space 11

evaluating classifiers

implementing spam filter 62 – 65

overview 45 – 46

evidence 58

extracting features

implementing spam filter 50 – 52

overview 42 – 43

F1 measure 199

F1 score 199

feature engineering for authorship attribution 200 – 226

counts of stopwords and proportion of stopwords as features 207 – 211

distribution of word suffixes as features 219 – 222

distributions of parts of speech as features 212 – 218

unique words as features 223 – 226

word and sentence length statistics as features 201 – 207

features 25, 33, 152, 195

feature selection 187

feature sparsity 187

feature vector 25, 201

fit method 204, 288

fit-predict routine 289

fit_transform method 286

format functionality 133

formatted string literals 49

frequent words lexicons 253

functions 25, 43

function words 18, 88

generalization behavior, testing 163 – 168

generative models 357

gensim functionality 368

get_feature_names() function 287

get_feature_names_out() function 287

Gibbs Sampling for the Uninitiated (Resnik and Hardisty) 355

Gini impurity 182

gold standard labels 75

GPE (geopolitical entity) 385

GPE type 410

grammar checking 28 – 29

ground truth labels 75

Gutenberg Project data 159

IDE (integrated development environment), Python 9

idf (inverse document frequency) 100 – 103

if statement 34

information extraction 114 – 150

building information extraction algorithm 144 – 148

part-of-speech tagging 124 – 137

with spaCy 128 – 137

word types 124 – 128

syntactic parsing 137 – 144

dependency parsing with spaCy 139 – 144

sentence structure 137 – 139

task 120 – 124

use cases 116 – 120

with NER (named-entity recognition) 410 – 415

information retrieval 5

information search 5, 71 – 113

advanced 16 – 18

overview 5 – 16

processing data 87 – 95

morphological processing 90 – 95

removing stopwords 87 – 90

search algorithm 103 – 111

deploying 111

document similarity retrieval 104 – 105

evaluation of results 106 – 111

tasks 72 – 86

Boolean search algorithm 83 – 86

data and data structures 75 – 83

weighing words 96 – 103

with inverse document frequency 100 – 103

with term frequency 97 – 100

input functionality 66

installation instructions 422

integrated development environment (IDE) 9

inverse document frequency (idf) 100 – 103

Jurafsky, Dan 2, 392

keywords 96

k-fold cross-validation 293

K-means clustering 337

KMeans clustering algorithm 337

language data, on Decision Trees 185 – 191

language generation 19 – 25

language modeling 24

Laplace smoothing 318

latent factors 336

LDA (latent Dirichlet allocation) 307, 348 – 360

as generative model 356 – 360

estimating parameters for 352 – 356

lemmas base forms 130

lemmatization 130

lemmatizer tool 18, 130

length-normalized vectors 13

length of sentiment-bearing features 295 – 297

lexicons, sentiment 251 – 259

aggregating sentiment scores with 235 – 237

collecting sentiment scores from 252 – 255

detecting review polarity 255 – 259

linguistic feature engineering 194 – 228

feature engineering for authorship attribution 200 – 226

counts of stopwords and proportion of stopwords as features 207 – 211

distribution of word suffixes as features 219 – 222

distributions of parts of speech as features 212 – 218

unique words as features 223 – 226

word and sentence length statistics as features 201 – 207

machine-learning pipeline 196 – 200

evaluating performance of classifier 196 – 197

further evaluation measures 197 – 200

practical use of authorship attribution and user profiling 226 – 227

list comprehensions 48

LOC (location) 385

lower bound on algorithm’s performance 173

machine learning. See ML

machine translation 26 – 28

majority class baseline 46

Markov models 396

Martin, James H. 2, 392

math functionality 12

mean precision 107

mean precision @k 107

mean reciprocal rank (MRR) 109

metrics functionality 204, 320

ML (machine learning)

addressing dependence on context with 277 – 295

evaluating with cross-validation 292 – 295

extracting features from text 284 – 289

preparing data 278 – 284

Scikit-learn machine-learning pipeline 289 – 292

author profiling 151 – 193

authorship attribution 154

Decision Trees 175

machine-learning pipeline 157 – 175

user profiling 155 – 157

linguistic feature engineering 196 – 200

evaluating performance of classifier 196 – 197

further evaluation measures 197 – 200

machine-learning pipeline, Scikit-learn 289 – 292

topic classification as supervised task 307 – 325

data 308 – 312

evaluation of results 320 – 325

with Naïve Bayes 312 – 320

topic discovery as unsupervised task 325 – 341

clustering 330 – 337

evaluation of topic clustering algorithm 338 – 341

unsupervised ML (machine-learning) approaches 325 – 329

morphological forms 90

morphological processing 90 – 95

morphology 90

MRR (mean reciprocal rank) 109

multiclass classification 33, 307 – 325

Naïve Bayes 207, 312 – 320

Natural Language Processing Toolkit (NLTK) 49

negation 298 – 301

NEG marker 298

NER (named-entity recognition) 384 – 421

20 Newsgroups dataset 308

as sequence labeling task 392 – 403

BIO scheme 393 – 394

sequential solution for NER 397 – 403

sequential tasks 395 – 397

BIOES scheme 393

challenges in 390 – 392

IO scheme 393

named entity (NE) types 388 – 390

practical applications of 403 – 418

data loading and exploration 403 – 406

information extraction 410 – 415

named entities visualization 416 – 418

named entity types exploration with spaCy 406 – 410

neural-based language modeling 24

neural machine translation (NMT) 28

n-grams 22, 24, 280

NLP (natural language processing) 1 – 30

history of 2 – 4

spam filtering 31 – 70

deploying spam filter in practice 65 – 66

implementing spam filter 46 – 65

overview 31 – 35

tasks 36 – 46

tasks 5 – 29

advanced information search 16 – 18

conversational agents and intelligent virtual assistants 18 – 20

information search 5 – 16

machine translation 26 – 28

spam filtering 25

spell- and grammar checking 28 – 29

text prediction and language generation 20 – 25

nlp pipeline 130

nltk.download() command 50, 84, 89, 159, 269

NLTK (Natural Language Processing Toolkit) 49

NMT (neural machine translation) 28

node impurity 178 – 184

normalizing features

implementing spam filter 50 – 52

overview 42 – 43

noun phrases 140

NP (noun phrase) 145

nsubj relation 145

operator’s itemgetter functionality 104

operator functionality 220

ORDINAL type 410

ORG (organization) type 385, 410

os functionality 240

os module 47

pandas 404

pandas read_csv functionality 404

parsers 18, 140

part-of-speech taggers 18, 128

part-of-speech tagging 124 – 137

with spaCy 128 – 137

word types 124 – 128

parts of speech 212 – 218

PART tag 134

PERSON type 410

Pipeline functionality 289

pipelines

author profiling 157 – 175

original data 157 – 163

setting up benchmark 169 – 175

testing generalization behavior 163 – 168

linguistic feature engineering 196 – 200

evaluating performance of classifier 196 – 197

further evaluation measures 197 – 200

sentiment analysis 239 – 251

analyzing data 243 – 251

data loading and preprocessing 240 – 242

plot_confusion_matrix functionality 323

pobj (prepositional object) 142

polarity, sentiment 255 – 259

POS taggers 128

precision 106, 198

predict method 204, 289

prepositional object (pobj) 142

prior probability 58

probabilistic classifier 53

probability estimation 21, 136 – 137

processing data 87 – 95

morphological processing 90 – 95

removing stopwords 87 – 90

proper nouns 130

PROPN tag 130, 134

PUNCT (punctuation marks) 134

pyLDAvis 377

Pythagorean theorem 11

question answering 116

random functionality 330

random_state parameter 204

Scikit-learn machine-learning pipeline 289 – 292

search algorithm 73, 103 – 111

deploying 111

document similarity retrieval 104 – 105

evaluation of results 106 – 111

sentences, length statistics 201 – 207

sentiment analysis 229 – 303

addressing dependence on context with machine learning 277 – 295

evaluating with cross-validation 292 – 295

extracting features from text 284 – 289

preparing data 278 – 284

Scikit-learn machine-learning pipeline 289 – 292

aggregating sentiment scores with sentiment lexicon 251 – 259

collecting sentiment scores from lexicon 252 – 255

detecting review polarity 255 – 259

negation handling for 298 – 301

setting up pipeline 239 – 251

analyzing data 243 – 251

data loading and preprocessing 240 – 242

task 234 – 238

aggregating sentiment score with help of lexicon 235 – 237

learning to detect sentiment in data-driven way 237 – 238

use cases 231 – 234

varying length of sentiment-bearing features 295 – 297

with SentiWordNet 266 – 276

sentiment lexicon-based approach 235

sentiment lexicons 264

SentiWordNet 266 – 276

sequence labeling 392 – 403

BIO scheme 393 – 394

sequential solution for NER 397 – 403

sequential tasks 395 – 397

show_topic functionality 375

simple heuristic algorithm 252

simple_preprocess functionality 365

singular value decomposition (SVD) 332

sklearn’s function 167

smoothing 318

SMT (statistical machine translation) 28

SnowballStemmer algorithm 364

sorting algorithm 73

spaCy

dependency parsing with 139 – 144

named entity types exploration with 406 – 410

part-of-speech tagging with 128 – 137

spaCy’s functionality 209

spacy.load command 131

spam 37

spam class 43

spam filtering 25, 31 – 70

deploying spam filter in practice 65 – 66

implementing spam filter 46 – 65

defining data and classes 46 – 49

evaluating classifier 62 – 65

extracting and normalizing features 50 – 52

splitting text into words 49 – 50

training classifier 53 – 61

overview 31 – 35

tasks 36 – 46

defining data and classes 37

evaluating classifier 45 – 46

extracting and normalizing features 42 – 43

splitting text into words 37 – 42

training classifier 43 – 44

spam filters 25

Speech and Language Processing (Jurafsky and Martin) 2, 392

spell-checking 28 – 29

splitting text

implementing spam filter 49 – 50

overview 37 – 42

statistical machine translation (SMT) 28

stemmer tools 18, 93

stemming 92

stopping criteria 329

stopwords 18, 88, 151

count of and proportion of as features 207 – 211

removing 87 – 90

stratified data split 166

StratifiedShuffleSplit function 167

stratified shuffling split 166

string module 89

suffixes 219 – 222

supervised ML (machine learning) 34, 153, 307 – 325

data 308 – 312

evaluation of results 320 – 325

with Naïve Bayes 312 – 320

SVD (singular value decomposition) 332

synset 267

syntactic parsing 137 – 144

dependency parsing with spaCy 139 – 144

sentence structure 137 – 139

term frequency 97 – 100

terminal (lower) leaves 177

test set 44, 160, 231

text, extracting features from 284 – 289

text classification 25

text prediction 20 – 25

TF-IDF (term frequency–inverse document frequency) 314

TfidfVectorizer 314

tf (term frequency) 97

token.i attribute 131

tokenization 42

tokenizer tool 9, 41, 84, 129

token.lemma attribute 131

token.lower attribute 131

token object 131

token.pos attribute 131

token.text attribute 131

topic analysis 304 – 345

topic classification as supervised ML task 307 – 325

data 308 – 312

evaluation of results 320 – 325

with Naïve Bayes 312 – 320

topic discovery as unsupervised ML task 325 – 341

clustering 330 – 337

evaluation of topic clustering algorithm 338 – 341

unsupervised approaches 325 – 329

topic modeling 346 – 383

implementation of topic modeling algorithm 360 – 378

applying LDA model 371 – 374

exploring results 375 – 378

loading data 361 – 363

preprocessing data 363 – 371

with LDA (latent Dirichlet allocation) 349 – 360

as generative model 356 – 360

estimating parameters for 352 – 356

training classifiers

implementing spam filter 53 – 61

overview 43 – 44

training set 44, 231

transform function 333

transform method 316

trigram modeling 22

true positives 106

unique() function 406

unsupervised ML (machine learning) 325 – 341

approaches to 325 – 329

clustering 330 – 337

evaluation of topic clustering algorithm 338 – 341

upper bound on algorithm’s performance 173

user profiling

overview 155 – 157

practical use of 226 – 227

validation set 231

vector array 9

vectors 7

virtual assistants 18 – 20

visualization, named entities 416 – 418

V-measure 340

WordNet 267

words 103

distribution of suffixes as features 219 – 222

length statistics as features 201 – 207

types 124 – 128

unique words as features 223 – 226

weighing 96

with inverse document frequency 100 – 103

with term frequency 97 – 100

word unigram 21

Zipf’s law 189

zip function 133
