Using HDFS

Hadoop Distributed File System (HDFS) is the storage component of the Hadoop framework for Big Data. HDFS is a distributed filesystem that spreads data across multiple machines and is inspired by the Google File System, which Google developed for its search engine. HDFS requires a Java Runtime Environment (JRE), and it uses a NameNode server to keep track of file metadata. The system also replicates data across nodes, so losing a few nodes doesn't lead to data loss. The typical use case for HDFS is processing large read-only files. Apache Spark, which is also covered in this chapter, can use HDFS too.
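
Once the NameNode is running (see the How to do it… section), you can also interact with HDFS from Python through the NameNode's WebHDFS REST interface. The following is a minimal sketch, assuming the third-party hdfs package (pip install hdfs) and the default Hadoop 2.x WebHDFS port of 50070; neither is set up as part of this recipe:

    # A minimal sketch of talking to HDFS from Python over WebHDFS.
    # Assumptions: the third-party hdfs package is installed
    # (pip install hdfs) and the NameNode serves WebHDFS on the
    # default Hadoop 2.x port, 50070.
    from hdfs import InsecureClient

    # The client contacts the NameNode, which tracks file metadata;
    # the file blocks themselves are stored on the DataNodes.
    client = InsecureClient('http://localhost:50070')

    # List the contents of the HDFS root directory.
    print(client.list('/'))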

Getting ready

Install Hadoop and a JRE. As these are not Python frameworks, you will have to check the appropriate installation procedure for your operating system. I used Hadoop 2.7.1 with Java 1.7.0_60 for this recipe. Installation can be a complicated process, but there are many online resources that can help you troubleshoot issues on your specific system.

How to do it…

We can configure HDFS through several XML files found in the Hadoop installation directory. Some of the steps in this section serve only as examples, and you should adapt them as appropriate for your operating system, environment, and personal preferences:

  1. Edit the core-site.xml file so that it has the following content (comments omitted):
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:8020</value>
        </property>
    </configuration>
  2. Edit the hdfs-site.xml file so that it has the following content (comments omitted), which sets the replication factor of each file to just 1, since we are running HDFS locally on a single node:
    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
  3. If necessary, enable Remote Login on your system so that you can SSH into localhost, and generate SSH keys (Windows users can use PuTTY):
    $ ssh-keygen -t dsa -f ~/.ssh/id_dsa
    $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    
  4. Format the filesystem from the root of the Hadoop directory:
    $ bin/hdfs namenode -format
    
  5. Start the NameNode server as follows (to stop it later, run $ sbin/stop-dfs.sh):
    $ sbin/start-dfs.sh
    
  6. Create a directory in HDFS with the following command:
    $ hadoop fs -mkdir direct_marketing
    
  7. Optionally, if you want to use the direct_marketing.csv file in the Spark recipe, you need to copy it into HDFS, as follows (a short Python verification sketch follows this list):
    $ hadoop fs -copyFromLocal <path to file>/direct_marketing.csv direct_marketing
    
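To confirm that the file landed in HDFS, you can run $ hadoop fs -ls direct_marketing, or check from Python over WebHDFS. The following is a minimal verification sketch, not part of the original recipe: it assumes the third-party hdfs package and the default WebHDFS port of 50070, and <your username> is a placeholder for the user you ran the preceding commands as (the relative paths in steps 6 and 7 resolve to /user/<your username>):

    # Verification sketch: check that direct_marketing.csv is in HDFS.
    # Assumptions: the hdfs package is installed (pip install hdfs),
    # WebHDFS listens on the default Hadoop 2.x port 50070, and
    # <your username> is a placeholder for your HDFS user.
    from hdfs import InsecureClient

    client = InsecureClient('http://localhost:50070')

    # The relative paths used in steps 6 and 7 resolve to your HDFS
    # home directory, /user/<your username>.
    path = '/user/<your username>/direct_marketing/direct_marketing.csv'

    # Print the file's metadata (size, replication factor, and so on).
    print(client.status(path))

    # Read the file and show the first few hundred bytes.
    with client.read(path) as reader:
        data = reader.read()
    print(data[:300])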