How to do it...

In order to cluster the documents, we first need to extract the features from our articles. Note that the following text is abbreviated for space considerations—refer to the GitHub repository for the full code:

import pyspark.ml.feature as feat
import pyspark.ml.clustering as clust
from pyspark.ml import Pipeline

articles = spark.createDataFrame([
('''
The Andromeda Galaxy, named after the mythological
Princess Andromeda, also known as Messier 31, M31,
or NGC 224, is a spiral galaxy approximately 780
kiloparsecs (2.5 million light-years) from Earth,
and the nearest major galaxy to the Milky Way.
Its name stems from the area of the sky in which it
appears, the constellation of Andromeda. The 2006
observations by the Spitzer Space Telescope revealed
that the Andromeda Galaxy contains approximately one
trillion stars, more than twice the number of the
Milky Way’s estimated 200-400 billion stars. The
Andromeda Galaxy, spanning approximately 220,000 light
years, is the largest galaxy in our Local Group,
which is also home to the Triangulum Galaxy and
other minor galaxies. The Andromeda Galaxy's mass is
estimated to be around 1.76 times that of the Milky
Way Galaxy (~0.8-1.5×10¹² solar masses vs the Milky
Way's 8.5×10¹¹ solar masses).
''','Galaxy', 'Andromeda')
(...)
, ('''
Washington, officially the State of Washington, is a state in the Pacific
Northwest region of the United States. Named after George Washington,
the first president of the United States, the state was made out of the
western part of the Washington Territory, which was ceded by Britain in
1846 in accordance with the Oregon Treaty in the settlement of the
Oregon boundary dispute. It was admitted to the Union as the 42nd state
in 1889. Olympia is the state capital. Washington is sometimes referred
to as Washington State, to distinguish it from Washington, D.C., the
capital of the United States, which is often shortened to Washington.
''','Geography', 'Washington State')
], ['articles', 'Topic', 'Object'])

# Split the articles on whitespace and punctuation
splitter = feat.RegexTokenizer(
    inputCol='articles'
    , outputCol='articles_split'
    , pattern='\s+|[,."]'
)

# Remove English stop words from the tokenized articles
sw_remover = feat.StopWordsRemover(
    inputCol=splitter.getOutputCol()
    , outputCol='no_stopWords'
)

# Count term occurrences to build a term-frequency vector per article
count_vec = feat.CountVectorizer(
    inputCol=sw_remover.getOutputCol()
    , outputCol='vector'
)

# Cluster the articles into three topics using online variational LDA
lda_clusters = clust.LDA(
    k=3
    , optimizer='online'
    , featuresCol=count_vec.getOutputCol()
)

topic_pipeline = Pipeline(
    stages=[
        splitter
        , sw_remover
        , count_vec
        , lda_clusters
    ]
)
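
With the pipeline defined, we can fit it to the articles and inspect the topics the LDA model discovers. The following is a minimal sketch of that step; the variable names topic_model, lda_model, and vocabulary are illustrative and not part of the abbreviated listing above:

# Fit the full pipeline to the articles DataFrame
topic_model = topic_pipeline.fit(articles)

# The fitted CountVectorizer (third stage) holds the vocabulary,
# and the fitted LDA model is the last stage of the pipeline
vocabulary = topic_model.stages[2].vocabulary
lda_model = topic_model.stages[-1]

# Print the top five terms for each of the three topics
for row in lda_model.describeTopics(maxTermsPerTopic=5).collect():
    print(row['topic'], [vocabulary[i] for i in row['termIndices']])

# Append the per-article topic distribution and show it for each Object
topic_model \
    .transform(articles) \
    .select('Object', 'topicDistribution') \
    .show(truncate=False)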