Parsed document content is iterable, and each element has numerous attributes produced by the processing pipeline. The following sample illustrates how to access these attributes:
- .text: Original word text
- .lemma_: Word root
- .pos_: Basic POS tag
- .tag_: Detailed POS tag
- .dep_: Syntactic relationship or dependency between tokens
- .shape_: The word's shape with respect to capitalization, punctuation, and digits
- .is_alpha: Whether the token consists of alphabetic characters
- .is_stop: Whether the token is on a list of common words for the given language
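Note that the lexical attributes (.text, .shape_, .is_alpha, .is_stop) are available even without a trained model, while .lemma_, .pos_, .tag_, and .dep_ require a trained pipeline such as en_core_web_sm. A minimal sketch using a blank English pipeline to inspect the lexical attributes of a single token:

```python
import spacy

# A blank pipeline provides tokenization and lexical attributes only;
# tagging, parsing, and lemmatization need a trained pipeline.
nlp = spacy.blank('en')
token = nlp('Apple')[0]
print(token.text, token.shape_, token.is_alpha, token.is_stop)
```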
We iterate over each token and assign its attributes to a pd.DataFrame:
import pandas as pd

# doc is the parsed sample sentence produced by the spaCy pipeline
pd.DataFrame([[t.text, t.lemma_, t.pos_, t.tag_, t.dep_, t.shape_, t.is_alpha, t.is_stop] for t in doc],
             columns=['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'is_alpha', 'is_stop'])
This produces the following output:
| text    | lemma   | pos   | tag | dep      | shape | is_alpha | is_stop |
|---------|---------|-------|-----|----------|-------|----------|---------|
| Apple   | apple   | PROPN | NNP | nsubj    | Xxxxx | True     | False   |
| is      | be      | VERB  | VBZ | aux      | xx    | True     | True    |
| looking | look    | VERB  | VBG | ROOT     | xxxx  | True     | False   |
| at      | at      | ADP   | IN  | prep     | xx    | True     | True    |
| buying  | buy     | VERB  | VBG | pcomp    | xxxx  | True     | False   |
| U.K.    | u.k.    | PROPN | NNP | compound | X.X.  | False    | False   |
| startup | startup | NOUN  | NN  | dobj     | xxxx  | True     | False   |
| for     | for     | ADP   | IN  | prep     | xxx   | True     | True    |
| $       | $       | SYM   | $   | quantmod | $     | False    | False   |
| 1       | 1       | NUM   | CD  | compound | d     | False    | False   |
| billion | billion | NUM   | CD  | pobj     | xxxx  | True     | False   |
We can visualize the syntactic dependencies in a browser or notebook as follows:
from spacy import displacy

displacy.render(doc, style='dep', options=options, jupyter=True)
The result is a dependency tree:
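Outside a notebook, passing jupyter=False makes displacy.render() return the SVG markup as a string, which can be saved to a file. A sketch, using a manually annotated Doc so the example runs without a trained parser (the words, heads, and dependency labels below are supplied by hand for illustration):

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

nlp = spacy.blank('en')
# heads are absolute token indices; the ROOT points to itself
doc = Doc(nlp.vocab, words=['Apple', 'is', 'looking'],
          heads=[2, 2, 2], deps=['nsubj', 'aux', 'ROOT'])

svg = displacy.render(doc, style='dep', jupyter=False)
with open('dep_tree.svg', 'w', encoding='utf-8') as f:
    f.write(svg)
```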
We can get additional insight into the meaning of these attributes using spacy.explain(), as here:
spacy.explain("VBZ")
'verb, 3rd person singular present'
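spacy.explain() works for coarse POS tags and dependency labels as well as for fine-grained tags, and it does not require a loaded model:

```python
import spacy

print(spacy.explain('PROPN'))  # coarse part-of-speech tag
print(spacy.explain('nsubj'))  # dependency label
```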