Before going deeper, we first need to see how to create a Spark session. It can be done as follows:
spark = SparkSession \
    .builder \
    .appName("PCAExample") \
    .getOrCreate()
Now, below this code block, place your own code; for example:
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)
The preceding code demonstrates how to compute the principal components of a set of feature vectors and use them to project the vectors into a lower-dimensional space. Note that it uses the DataFrame-based pyspark.ml API rather than the RDD-based RowMatrix. For a clearer picture, refer to the following self-contained script, which shows how to use the PCA algorithm on PySpark:
import sys

try:
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Cannot import Spark Modules", e)
    sys.exit(1)
spark = SparkSession \
    .builder \
    .appName("PCAExample") \
    .getOrCreate()

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)
spark.stop()
The output is as follows:
Figure 6: PCA result after successful execution of the Python script
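To make the projection concrete, the following is a minimal NumPy sketch of the computation that PCA performs on the same three 5-dimensional vectors: the principal components are the top eigenvectors of the covariance matrix, and the projection multiplies the data by those components. This is an illustrative sketch, not Spark's implementation; Spark may differ in centering and sign conventions, so the numbers it prints will not match Figure 6 exactly.

```python
import numpy as np

# The same three 5-dimensional vectors as in the Spark example,
# with the sparse vector written out densely.
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

# Principal components are the eigenvectors of the covariance matrix
# with the largest eigenvalues.
cov = np.cov(X, rowvar=False)            # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # re-sort descending
components = eigvecs[:, order[:3]]       # keep the top k=3 components

# Project the original vectors into the 3-dimensional subspace.
projected = X @ components
print(projected.shape)  # (3, 3): three vectors, three components each
```

With only three samples, the covariance matrix has rank at most two, so the third eigenvalue is essentially zero; Spark's k=3 result likewise carries almost all of the variance in its first two output dimensions.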