Let's see how to read data in LIBSVM format using the read API: pass the format name (libsvm) to format() and the file path to load(), as follows:
# Creating DataFrame from libsvm dataset
myDF = spark.read.format("libsvm").load("C:/Exp/mnist.bz2")
The preceding MNIST dataset can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2. The load() call returns a DataFrame, whose contents you can inspect by calling the show() method as follows:
myDF.show()
The output is as follows:
Figure 7: A snapshot of the handwritten dataset in LIBSVM format
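To make the file layout concrete, here is a minimal plain-Python sketch of how one line of a LIBSVM text file maps to a label and a sparse set of features. Each line has the form `label index:value index:value ...`, with one-based indices in the file; Spark converts them to zero-based indices when loading. The function name parse_libsvm_line and the sample line are hypothetical, made up for illustration, and this is not how Spark implements its reader.

```python
from typing import Dict, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one LIBSVM-formatted line into (label, {zero_based_index: value}).

    Hypothetical helper for illustration only; Spark's own reader is separate.
    """
    tokens = line.split()
    label = float(tokens[0])  # the first token is the class label
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        # Indices in the file are one-based; shift them to zero-based
        features[int(index) - 1] = float(value)
    return label, features


# Illustrative line (not taken from the real MNIST file)
sample = "5 153:3 154:18 155:18"
label, features = parse_libsvm_line(sample)
print(label, features)
```

Only the non-zero entries are listed in the file, which is why the features column of the DataFrame holds sparse vectors.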
You can also specify other options, such as numFeatures, which sets the number of features (that is, the size of each feature vector) in the resulting DataFrame, as follows:
# Parentheses let the chained calls span several lines in Python
myDF = (spark.read.format("libsvm")
    .option("numFeatures", "780")
    .load("data/Letterdata_libsvm.data"))
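When numFeatures is not given, Spark determines the feature count from the data itself, at the cost of one additional pass over the file. The following plain-Python sketch shows the idea behind that extra pass: scan every line and keep the largest one-based index seen. The function name infer_num_features and the sample lines are hypothetical, and the code only approximates the idea rather than reproducing Spark's implementation.

```python
from typing import Iterable


def infer_num_features(lines: Iterable[str]) -> int:
    """Return the largest one-based feature index found in LIBSVM-formatted lines.

    Hypothetical helper for illustration; it mimics the idea of the extra
    pass that is needed when the feature count is not supplied up front.
    """
    max_index = 0
    for line in lines:
        # Skip the label (first token) and inspect each index:value pair
        for token in line.split()[1:]:
            index = int(token.split(":")[0])
            max_index = max(max_index, index)
    return max_index


# Illustrative lines (made up, not from a real dataset)
rows = ["1 1:0.5 3:2.0", "0 2:1.0 7:4.0"]
print(infer_num_features(rows))
```

Supplying the count explicitly avoids this scan, which is one reason to set the option on large files.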
Now if you want to create an RDD from the same dataset, you can use the MLUtils API from pyspark.mllib.util as follows:
# Creating RDD from the libsvm data file
from pyspark.mllib.util import MLUtils
myRDD = MLUtils.loadLibSVMFile(spark.sparkContext, "data/Letterdata_libsvm.data")
Now you can save the RDD to your preferred location as follows (the path names an output directory, which must not already exist):
myRDD.saveAsTextFile("data/myRDD")