How to do it...

In order to cluster the documents, we first need to extract the features from our articles. Note that the following text is abbreviated for space considerations—refer to the GitHub repository for the full code:

import pyspark.ml.feature as feat
import pyspark.ml.clustering as clust
from pyspark.ml import Pipeline

articles = spark.createDataFrame([
('''
The Andromeda Galaxy, named after the mythological
Princess Andromeda, also known as Messier 31, M31,
or NGC 224, is a spiral galaxy approximately 780
kiloparsecs (2.5 million light-years) from Earth,
and the nearest major galaxy to the Milky Way.
Its name stems from the area of the sky in which it
appears, the constellation of Andromeda. The 2006
observations by the Spitzer Space Telescope revealed
that the Andromeda Galaxy contains approximately one
trillion stars, more than twice the number of the
Milky Way’s estimated 200-400 billion stars. The
Andromeda Galaxy, spanning approximately 220,000 light
years, is the largest galaxy in our Local Group,
which is also home to the Triangulum Galaxy and
other minor galaxies. The Andromeda Galaxy's mass is
estimated to be around 1.76 times that of the Milky
Way Galaxy (~0.8-1.5×10¹² solar masses vs the Milky
Way's 8.5×10¹¹ solar masses).
''','Galaxy', 'Andromeda')
(...)
, ('''
Washington, officially the State of Washington, is a state in the Pacific
Northwest region of the United States. Named after George Washington,
the first president of the United States, the state was made out of the
western part of the Washington Territory, which was ceded by Britain in
1846 in accordance with the Oregon Treaty in the settlement of the
Oregon boundary dispute. It was admitted to the Union as the 42nd state
in 1889. Olympia is the state capital. Washington is sometimes referred
to as Washington State, to distinguish it from Washington, D.C., the
capital of the United States, which is often shortened to Washington.
''','Geography', 'Washington State')
], ['articles', 'Topic', 'Object'])

# Split the articles on whitespace and punctuation
splitter = feat.RegexTokenizer(
    inputCol='articles'
    , outputCol='articles_split'
    , pattern='\s+|[,."]'
)

# Remove English stop words from the tokenized articles
sw_remover = feat.StopWordsRemover(
    inputCol=splitter.getOutputCol()
    , outputCol='no_stopWords'
)

# Count term occurrences to build a term-frequency vector per article
count_vec = feat.CountVectorizer(
    inputCol=sw_remover.getOutputCol()
    , outputCol='vector'
)

# Cluster the articles into three topics using online variational LDA
lda_clusters = clust.LDA(
    k=3
    , optimizer='online'
    , featuresCol=count_vec.getOutputCol()
)

topic_pipeline = Pipeline(
    stages=[
        splitter
        , sw_remover
        , count_vec
        , lda_clusters
    ]
)
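
With the pipeline defined, we can fit it to the articles and inspect the topics the LDA model discovers. The following is a minimal sketch of that step; the variable names topic_model, lda_model, and vocabulary are illustrative and not part of the abbreviated listing above:

# Fit the full pipeline to the articles DataFrame
topic_model = topic_pipeline.fit(articles)

# The fitted CountVectorizer (third stage) holds the vocabulary,
# and the fitted LDA model is the last stage of the pipeline
vocabulary = topic_model.stages[2].vocabulary
lda_model = topic_model.stages[-1]

# Print the top five terms for each of the three topics
for row in lda_model.describeTopics(maxTermsPerTopic=5).collect():
    print(row['topic'], [vocabulary[i] for i in row['termIndices']])

# Append the per-article topic distribution and show it for each Object
topic_model \
    .transform(articles) \
    .select('Object', 'topicDistribution') \
    .show(truncate=False)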