There's more...

Almost every estimator (in other words, an ML model) in the ML module expects a single input column that contains all the features a data scientist wants the model to use. The .VectorAssembler(...) transformer, as the name suggests, collates multiple features into a single column.

Consider the following example:

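import pyspark.ml.feature as feat  # alias assumed from earlier in the chapter

# Collate every column of the forest DataFrame into one vector column
vectorAssembler = (
    feat.VectorAssembler(
        inputCols=forest.columns,
        outputCol='feat'
    )
)

# Reduce the assembled vector to its top five principal components
pca = (
    feat.PCA(
        k=5
        , inputCol=vectorAssembler.getOutputCol()
        , outputCol='pca_feat'
    )
)

# Fit PCA on the assembled data, transform it, and inspect the first row
(
    pca
    .fit(vectorAssembler.transform(forest))
    .transform(vectorAssembler.transform(forest))
    .select('feat','pca_feat')
    .take(1)
)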

First, we use the .VectorAssembler(...) transformer to collate all the columns from our forest DataFrame.

Note that .VectorAssembler(...), unlike most other transformers, has the inputCols parameter, not inputCol, because it accepts a list of columns rather than a single column.
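To make the difference between a list of input columns and a single output column concrete, here is a minimal plain-Python sketch of the idea. It is not Spark code and not how Spark implements the transformer; the assemble function and the column names are hypothetical, chosen only for illustration.

```python
# Conceptual stand-in for VectorAssembler; plain Python, not Spark code.
def assemble(rows, input_cols, output_col):
    """Return copies of rows with input_cols merged into one list column."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # keep the original columns alongside the new one
        new_row[output_col] = [float(row[c]) for c in input_cols]
        assembled.append(new_row)
    return assembled


# Hypothetical column names and values, used only for this illustration.
rows = [
    {"Elevation": 2596, "Aspect": 51, "Slope": 3},
    {"Elevation": 2590, "Aspect": 56, "Slope": 2},
]

result = assemble(
    rows,
    input_cols=["Elevation", "Aspect", "Slope"],  # many columns in
    output_col="feat",                            # one column out
)
print(result[0]["feat"])  # [2596.0, 51.0, 3.0]
```

Each row keeps its original columns and gains one extra column that holds all the selected values in order, which is the shape the downstream estimators expect.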

We then use the feat column (which now holds all the features in a single SparseVector) as the input to PCA(...) to extract the top five principal components.
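For readers who want to see what extracting the top principal components means, the following is a rough sketch in plain Python. It assumes a simple approach (power iteration with deflation on the sample covariance matrix) that is easy to follow on toy data; it is not the algorithm Spark uses, and all function names here are hypothetical.

```python
import math


def mean_center(data):
    """Subtract each column's mean so the data is centered on the origin."""
    n = len(data)
    dims = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n = len(centered)
    dims = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def top_component(cov, iterations=200):
    """Power iteration: approximate the dominant eigenvector and eigenvalue."""
    dims = len(cov)
    vec = [1.0 / math.sqrt(dims)] * dims  # arbitrary unit-length start
    for _ in range(iterations):
        nxt = mat_vec(cov, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(v * w for v, w in zip(vec, mat_vec(cov, vec)))
    return vec, eigenvalue


def pca(data, k):
    """Return the top k component directions and the projected rows."""
    centered = mean_center(data)
    cov = covariance(centered)
    components = []
    for _ in range(k):
        vec, val = top_component(cov)
        components.append(vec)
        # Deflation: remove the direction just found before the next pass.
        cov = [
            [cov[i][j] - val * vec[i] * vec[j] for j in range(len(vec))]
            for i in range(len(vec))
        ]
    projected = [
        [sum(x * c for x, c in zip(row, comp)) for comp in components]
        for row in centered
    ]
    return components, projected


# Toy data lying exactly on the line y = 2x, so one component captures it all.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
components, projected = pca(data, k=1)
print(components[0])
print([round(p[0], 3) for p in projected])
```

In this toy case, the single component points along the line the data lies on, and each row is reduced from two values to one. With k=5 the idea is the same: each row's feature vector is replaced by five coordinates along the directions of greatest variance.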

Notice that we use the .getOutputCol() method to get the name of the assembler's output column instead of typing it in by hand. Why we do this should become more apparent when we introduce pipelines.
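The benefit of asking a stage for its output column name can be sketched without Spark. The classes below are hypothetical stand-ins written in plain Python, not part of any Spark API; they only show that when a downstream stage reads the upstream name through a getter, renaming the upstream column needs no change downstream.

```python
class AssembleStage:
    """Hypothetical stand-in for a transformer that writes one output column."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def get_output_col(self):
        return self.output_col

    def transform(self, row):
        out = dict(row)
        out[self.output_col] = [row[c] for c in self.input_cols]
        return out


class SumStage:
    """Hypothetical downstream stage; it sums the vector it is pointed at."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, row):
        out = dict(row)
        out[self.output_col] = sum(row[self.input_col])
        return out


def build_stages(assembled_name):
    assembler = AssembleStage(["a", "b"], assembled_name)
    # The downstream stage asks the upstream stage for its column name.
    summer = SumStage(assembler.get_output_col(), "total")
    return [assembler, summer]


def run(stages, row):
    """Apply each stage in order, feeding the result to the next one."""
    for stage in stages:
        row = stage.transform(row)
    return row


first = run(build_stages("feat"), {"a": 1, "b": 2})
second = run(build_stages("features"), {"a": 1, "b": 2})
print(first["total"], second["total"])  # 3 3
```

Only the name passed to the first stage changes between the two runs; the second stage picks up the new name automatically.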

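The preceding code returns a list with a single Row that holds two fields: feat, the assembled vector of all the features, and pca_feat, a vector of the five principal-component values.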
