The Hadoop Distributed File System (HDFS) is the storage component of the Hadoop framework for Big Data. HDFS is a distributed filesystem that spreads data across multiple machines, and it is inspired by the Google File System that Google developed for its search engine. HDFS requires a Java Runtime Environment (JRE), and it uses a NameNode server to keep track of file metadata, such as which blocks make up each file and where those blocks are stored. The system also replicates data blocks, so losing a few nodes does not lead to data loss. The typical use case for HDFS is processing large read-only files. Apache Spark, also covered in this chapter, can use HDFS too.
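To make the storage model concrete, here is a small sketch in plain Python (no Hadoop required). It assumes the usual HDFS defaults of a 128 MB block size and a replication factor of 3, and computes how many blocks a file is split into and how many bytes the cluster stores in total after replication:

```python
import math

def hdfs_footprint(file_size, block_size=128 * 1024 ** 2, replication=3):
    """Return (number of blocks, total bytes stored across the cluster).

    HDFS splits a file into fixed-size blocks and stores `replication`
    copies of each block on different DataNodes; the last block may be
    partial, so the stored total is simply file_size * replication.
    """
    blocks = math.ceil(file_size / block_size)
    return blocks, file_size * replication

# A 1 GB file with default settings:
blocks, stored = hdfs_footprint(1024 ** 3)
print(blocks, stored)  # 8 blocks, 3221225472 bytes (3 GB) stored in total
```

With the single-node setup in this recipe we will lower the replication factor to 1, so no extra copies are stored.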
Install Hadoop and a JRE. As these are not Python frameworks, you will have to check what the appropriate procedure is for your operating system. I used Hadoop 2.7.1 with Java 1.7.0_60 for this recipe. This can be a complicated process, but there are many resources online that can help you troubleshoot for your specific system.
We can configure HDFS with several XML files found in your Hadoop install (typically under etc/hadoop). Some of the steps in this section serve only as examples; implement them as appropriate for your operating system, environment, and personal preferences:
Edit the core-site.xml file so that it has the following content (comments omitted):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
</configuration>
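As a quick sanity check, you can parse the file with Python's standard library and confirm which NameNode URI it points at. This is just a sketch: the string below mirrors the configuration above (with the XML prolog omitted, since ElementTree rejects an encoding declaration in a str), and in practice you would point ElementTree at the real file path instead:

```python
import xml.etree.ElementTree as ET

CORE_SITE = """\
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
    </property>
</configuration>
"""

def read_property(xml_text, key):
    """Return the <value> of the <property> whose <name> matches key."""
    root = ET.fromstring(xml_text)
    for prop in root.iter('property'):
        if prop.findtext('name') == key:
            return prop.findtext('value')
    return None

print(read_property(CORE_SITE, 'fs.default.name'))  # hdfs://localhost:8020
```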
Edit the hdfs-site.xml file so that it has the following content (comments omitted), setting the replication of each file to just 1, to run HDFS locally:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
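If you script your setup, you can also generate these files rather than editing them by hand. A minimal sketch with the standard library (the property shown is the one from this recipe; pretty-printing is omitted, as Hadoop does not require it):

```python
import xml.etree.ElementTree as ET

def make_config(props):
    """Build a Hadoop-style <configuration> element from a dict of properties."""
    root = ET.Element('configuration')
    for name, value in props.items():
        prop = ET.SubElement(root, 'property')
        ET.SubElement(prop, 'name').text = name
        ET.SubElement(prop, 'value').text = value
    return root

root = make_config({'dfs.replication': '1'})
# Serializing with an explicit encoding also emits the XML declaration:
xml_bytes = ET.tostring(root, encoding='UTF-8')
print(xml_bytes.decode('UTF-8'))
```

You could then write xml_bytes to hdfs-site.xml in your Hadoop configuration directory.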
Set up SSH to localhost without a passphrase, so that the Hadoop scripts can log in and manage the daemons:

$ ssh-keygen -t dsa -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Note that recent OpenSSH releases have dropped DSA key support; on such systems, substitute -t rsa (and the matching id_rsa file names).
Format the filesystem through the NameNode (note that the flag takes a plain ASCII hyphen):

$ bin/hdfs namenode -format
If HDFS is already running, stop it first:

$ sbin/stop-dfs.sh

Then start HDFS from your Hadoop install directory:

$ sbin/start-dfs.sh
Create a directory in HDFS for the data:

$ hadoop fs -mkdir direct_marketing
As we will use the direct_marketing.csv file in the Spark recipe, you need to copy it into HDFS, as follows:

$ hadoop fs -copyFromLocal <path to file>/direct_marketing.csv direct_marketing