spaCy includes trained language models for English, German, Spanish, Portuguese, French, Italian, and Dutch, as well as a multi-language model for named entity recognition (NER). Cross-language usage is straightforward because the API does not change from one language to the next.
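The language-agnostic API can be demonstrated without downloading any trained models: a minimal sketch, assuming only that spaCy itself is installed, uses blank pipelines (tokenization only, no tagging or NER) for several language codes, including the multi-language code `xx`. Trained pipelines would instead be loaded with `spacy.load`, as shown below.

```python
import spacy

# The same calls work for every supported language code, including
# the multi-language code 'xx'. spacy.blank creates a pipeline with
# a tokenizer only, so no model download is required for this sketch.
texts = {
    'en': 'spaCy handles many languages.',
    'es': 'spaCy maneja muchos idiomas.',
    'xx': 'spaCy also ships a multi-language pipeline.',
}
for lang, sample in texts.items():
    nlp = spacy.blank(lang)      # identical call signature for each language
    doc = nlp(sample)            # identical processing call
    print(lang, [token.text for token in doc])
```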
We will illustrate the Spanish language model using a parallel corpus of TED Talk subtitles (see the GitHub repo for data source references). For this purpose, we instantiate both language models:
import spacy

model = {}
for language in ['en', 'es']:
    model[language] = spacy.load(language)
We then read small corresponding text samples for each language:
from pathlib import Path

text = {}
path = Path('../data/TED')
for language in ['en', 'es']:
    file_name = path / 'TED2013_sample.{}'.format(language)
    text[language] = file_name.read_text()
Sentence boundary detection uses the same logic but finds a different breakdown:
parsed, sentences = {}, {}
for language in ['en', 'es']:
    parsed[language] = model[language](text[language])
    sentences[language] = list(parsed[language].sents)
    print('Sentences:', language, len(sentences[language]))
Sentences: en 19
Sentences: es 22
POS tagging also works in the same way:
import pandas as pd

pos = {}
for language in ['en', 'es']:
    pos[language] = pd.DataFrame([[t.text, t.pos_, spacy.explain(t.pos_)]
                                  for t in sentences[language][0]],
                                 columns=['Token', 'POS Tag', 'Meaning'])
pd.concat([pos['en'], pos['es']], axis=1).head()
The result shows the side-by-side token annotations for the English (left) and Spanish (right) documents:
| Token | POS Tag | Meaning | Token | POS Tag | Meaning |
| --- | --- | --- | --- | --- | --- |
| There | ADV | adverb | Existe | VERB | verb |
| 's | VERB | verb | una | DET | determiner |
| a | DET | determiner | estrecha | ADJ | adjective |
| tight | ADJ | adjective | y | CONJ | conjunction |
| and | CCONJ | coordinating conjunction | sorprendente | ADJ | adjective |
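The descriptions in the Meaning columns come from spaCy's built-in glossary, which `spacy.explain` consults without loading any model. A minimal sketch shows the lookups for the tag codes that appear above; note that the glossary covers both the older `CONJ` code (produced here by the Spanish model) and the Universal Dependencies code `CCONJ`:

```python
import spacy

# spacy.explain maps a tag code to a human-readable description
# via spaCy's built-in glossary; no trained model is required.
for tag in ['ADV', 'DET', 'ADJ', 'CONJ', 'CCONJ']:
    print('{:6} -> {}'.format(tag, spacy.explain(tag)))
```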
The next section illustrates how to use parsed and annotated tokens to build a document-term matrix that can be used for text classification.