Spark SQL can read data from external storage systems such as files, Hive tables, and JDBC databases through the DataFrameReader interface.
The general form of the API call is spark.read.inputtype, where inputtype is one of the supported source formats:
- Parquet
- CSV
- Hive Table
- JDBC
- ORC
- Text
- JSON
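Each of these formats has a corresponding shortcut method on the DataFrameReader; JDBC sources are typically configured through format("jdbc") with connection options. A minimal sketch (the file paths, table name, and connection details below are hypothetical placeholders):

```scala
// Hypothetical paths and names; each call returns a DataFrame.
val parquetDF = spark.read.parquet("data.parquet")
val jsonDF    = spark.read.json("data.json")
val orcDF     = spark.read.orc("data.orc")
val textDF    = spark.read.text("data.txt")
val hiveDF    = spark.read.table("my_hive_table")   // Hive table, referenced by name

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")  // hypothetical connection URL
  .option("dbtable", "schema.tablename")            // hypothetical table
  .option("user", "username")
  .option("password", "password")
  .load()
```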
Let's look at a couple of simple examples of reading CSV files into DataFrames:
scala> val statesPopulationDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")
statesPopulationDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]
scala> val statesTaxRatesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesTaxRates.csv")
statesTaxRatesDF: org.apache.spark.sql.DataFrame = [State: string, TaxRate: double]
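Once a DataFrame has been loaded, you can confirm that schema inference picked the expected column types and preview the data. A quick check on the DataFrames above might look like this:

```scala
// Print the inferred schema (column names and types).
statesPopulationDF.printSchema()

// Display the first few rows without truncating column values.
statesPopulationDF.show(5, truncate = false)
```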