How to do it...

A general process for extracting data from text and turning it into something a machine learning model can use starts with the free-flow text itself. The first step is to split each sentence into tokens, most often on whitespace. Next, all the stop words are removed. Finally, counting the distinct words in the text, or applying the hashing trick, takes us into the realm of numerical representations of free-flow text.
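Before reaching for Spark, the process can be sketched in plain Python; this is an illustrative sketch (the stop-word list here is a tiny, made-up subset, not Spark's):

```python
import re
from collections import Counter

# Illustrative subset of stop words; real lists are much longer.
STOP_WORDS = {'a', 'an', 'and', 'the', 'is', 'of', 'for', 'to', 'it'}

def featurize(sentence):
    # Split on whitespace or basic punctuation, drop stop words,
    # then count the remaining tokens.
    tokens = re.split(r'\s+|[,."]', sentence.lower())
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return Counter(tokens)

counts = featurize(
    'Apache Spark is a fast and general-purpose '
    'cluster computing system.'
)
```

The resulting `Counter` maps each surviving token to its frequency, which is already a (sparse) numerical representation of the sentence.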

Here's how to achieve this with Spark's ML module:

from pyspark.ml import Pipeline
import pyspark.ml.feature as feat

# Assumes a SparkSession is already available as `spark`,
# as in the PySpark shell.
some_text = spark.createDataFrame([
['''
Apache Spark achieves high performance for both batch
and streaming data, using a state-of-the-art DAG scheduler,
a query optimizer, and a physical execution engine.
''']
, ['''
Apache Spark is a fast and general-purpose cluster computing
system. It provides high-level APIs in Java, Scala, Python
and R, and an optimized engine that supports general execution
graphs. It also supports a rich set of higher-level tools including
Spark SQL for SQL and structured data processing, MLlib for machine
learning, GraphX for graph processing, and Spark Streaming.
''']
, ['''
Machine learning is a field of computer science that often uses
statistical techniques to give computers the ability to "learn"
(i.e., progressively improve performance on a specific task)
with data, without being explicitly programmed.
''']
], ['text'])

splitter = feat.RegexTokenizer(
inputCol='text'
, outputCol='text_split'
    , pattern='\\s+|[,."]'  # split on whitespace or basic punctuation
)

sw_remover = feat.StopWordsRemover(
inputCol=splitter.getOutputCol()
, outputCol='no_stopWords'
)

hasher = feat.HashingTF(
inputCol=sw_remover.getOutputCol()
, outputCol='hashed'
    , numFeatures=20  # deliberately small for readability; real applications use far more
)

idf = feat.IDF(
inputCol=hasher.getOutputCol()
, outputCol='features'
)

pipeline = Pipeline(stages=[splitter, sw_remover, hasher, idf])

pipelineModel = pipeline.fit(some_text)