In this recipe, we will create an RDD by reading a local file in PySpark. To create RDDs in Apache Spark, you first need to install Spark, as described in the previous chapter. You can use the PySpark shell and/or a Jupyter notebook to run these code samples. Note that while this recipe is specific to reading local files, a similar syntax applies to Hadoop HDFS, AWS S3, Azure WASB, Google Cloud Storage, and Databricks DBFS:
| Storage type | Example |
|---|---|
| Local files | `sc.textFile('/local folder/filename.csv')` |
| Hadoop HDFS | `sc.textFile('hdfs://folder/filename.csv')` |
| AWS S3 (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html) | `sc.textFile('s3://bucket/folder/filename.csv')` |
| Azure WASB (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage) | `sc.textFile('wasb://container@account.blob.core.windows.net/folder/filename.csv')` |
| Google Cloud Storage (https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters) | `sc.textFile('gs://bucket/folder/filename.csv')` |
| Databricks DBFS (https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) | `sc.textFile('dbfs:/folder/filename.csv')` |