Before going deeper, we first need to see how to create a Spark session. It can be done as follows:
spark = SparkSession \
    .builder \
    .appName("PCAExample") \
    .getOrCreate()
Now, below this code block, place your own code; for example:
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)
The preceding code demonstrates how to compute the principal components of a set of feature vectors and use them to project the vectors into a lower-dimensional space. Note that it uses the DataFrame-based pyspark.ml API rather than the RDD-based RowMatrix. For a clearer picture, refer to the following self-contained script, which shows how to use the PCA algorithm on PySpark:
import sys

try:
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Cannot import Spark Modules", e)
    sys.exit(1)
spark = SparkSession \
    .builder \
    .appName("PCAExample") \
    .getOrCreate()

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)
spark.stop()
The output is as follows:
Figure 6: PCA result after successful execution of the Python script
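To make the projection concrete, the following is a minimal NumPy sketch of the computation that PCA performs on the same three 5-dimensional vectors: the principal components are the top eigenvectors of the covariance matrix, and the projection multiplies the data by those components. This is an illustrative sketch, not Spark's implementation; Spark may differ in centering and sign conventions, so the numbers it prints will not match Figure 6 exactly.

```python
import numpy as np

# The same three 5-dimensional vectors as in the Spark example,
# with the sparse vector written out densely.
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

# Principal components are the eigenvectors of the covariance matrix
# with the largest eigenvalues.
cov = np.cov(X, rowvar=False)            # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # re-sort descending
components = eigvecs[:, order[:3]]       # keep the top k=3 components

# Project the original vectors into the 3-dimensional subspace.
projected = X @ components
print(projected.shape)  # (3, 3): three vectors, three components each
```

With only three samples, the covariance matrix has rank at most two, so the third eigenvalue is essentially zero; Spark's k=3 result likewise carries almost all of the variance in its first two output dimensions.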