Running Hadoop HDFS

A distributed processing framework wouldn't be complete without distributed storage, and one of them is HDFS (Hadoop Distributed File System). Even if Spark is run in local mode, it can still use a distributed file system as its storage back end. Just as Spark breaks computations into subtasks, HDFS breaks a file into blocks and stores them across a set of machines. For high availability, HDFS stores multiple copies of each block; the number of copies is called the replication level and is three by default (refer to Figure 03-5).

The NameNode manages the HDFS storage by keeping track of block locations and other metadata, such as the owner, file permissions, and block size, which are file-specific. The Secondary NameNode is a slight misnomer: its function is to merge the metadata modifications, the edits, into the fsimage, a file that serves as the metadata database. The merge is required because it is more practical to write modifications to a separate edits file than to apply each modification to the on-disk fsimage directly (in addition to applying the corresponding changes in memory). The Secondary NameNode cannot serve as a second copy of the NameNode. The Balancer is run to move blocks between nodes so as to maintain approximately equal disk usage across the servers; the initial block assignment is supposed to be random, provided enough space is available and the client is not run within the cluster. Finally, the client communicates with the NameNode to get the metadata and block locations, but after that it reads or writes the data directly to the node where a copy of the block resides. The client is the only component that can be run outside the HDFS cluster, but it needs network connectivity with all the nodes in the cluster.

If any of the nodes dies or disconnects from the network, the NameNode notices the change, as it constantly maintains contact with the nodes via heartbeats. If the node does not reconnect to the NameNode within 10 minutes (by default), the NameNode will start replicating the blocks that were lost on that node in order to restore the required replication level. A separate block scanner thread in each DataNode scans the blocks for possible bit rot (each block maintains a checksum) and reports corrupted blocks to the NameNode so that they can be re-replicated from a healthy copy; corrupted and orphaned blocks are eventually deleted:

Figure 03-5. The HDFS architecture. Each block is stored in three separate locations (the replication level).
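
Once HDFS is up and running (the following steps show how to start a single-node instance), the NameNode's view of the DataNodes and the block distribution can be inspected from the command line. The following is a minimal, hedged sketch, assuming the commands are run from the Hadoop distribution directory set up in the steps below:

    $ # Report live/dead DataNodes, their capacity, and the time of the last heartbeat
    $ bin/hdfs dfsadmin -report
    $ # Move blocks around to even out disk usage (a no-op on a single-node cluster)
    $ bin/hdfs balancer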

  1. To start HDFS on your machine (with a replication level of 1), download a Hadoop distribution, for example, from http://hadoop.apache.org:
    $ wget ftp://apache.cs.utah.edu/apache.org/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
    --2016-05-12 00:10:55--  ftp://apache.cs.utah.edu/apache.org/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
               => 'hadoop-2.6.4.tar.gz.1'
    Resolving apache.cs.utah.edu... 155.98.64.87
    Connecting to apache.cs.utah.edu|155.98.64.87|:21... connected.
    Logging in as anonymous ... Logged in!
    ==> SYST ... done.    ==> PWD ... done.
    ==> TYPE I ... done.  ==> CWD (1) /apache.org/hadoop/common/hadoop-2.6.4 ... done.
    ==> SIZE hadoop-2.6.4.tar.gz ... 196015975
    ==> PASV ... done.    ==> RETR hadoop-2.6.4.tar.gz ... done.
    ...
    $ wget ftp://apache.cs.utah.edu/apache.org/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz.mds
    --2016-05-12 00:13:58--  ftp://apache.cs.utah.edu/apache.org/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz.mds
               => 'hadoop-2.6.4.tar.gz.mds'
    Resolving apache.cs.utah.edu... 155.98.64.87
    Connecting to apache.cs.utah.edu|155.98.64.87|:21... connected.
    Logging in as anonymous ... Logged in!
    ==> SYST ... done.    ==> PWD ... done.
    ==> TYPE I ... done.  ==> CWD (1) /apache.org/hadoop/common/hadoop-2.6.4 ... done.
    ==> SIZE hadoop-2.6.4.tar.gz.mds ... 958
    ==> PASV ... done.    ==> RETR hadoop-2.6.4.tar.gz.mds ... done.
    ...
    $ shasum -a 512 hadoop-2.6.4.tar.gz
    493cc1a3e8ed0f7edee506d99bfabbe2aa71a4776e4bff5b852c6279b4c828a0505d4ee5b63a0de0dcfecf70b4bb0ef801c767a068eaeac938b8c58d8f21beec  hadoop-2.6.4.tar.gz
    $ cat !$.mds
    hadoop-2.6.4.tar.gz:    MD5 = 37 01 9F 13 D7 DC D8 19  72 7B E1 58 44 0B 94 42
    hadoop-2.6.4.tar.gz:   SHA1 = 1E02 FAAC 94F3 35DF A826  73AC BA3E 7498 751A 3174
    hadoop-2.6.4.tar.gz: RMD160 = 2AA5 63AF 7E40 5DCD 9D6C  D00E EBB0 750B D401 2B1F
    hadoop-2.6.4.tar.gz: SHA224 = F4FDFF12 5C8E754B DAF5BCFC 6735FCD2 C6064D58
                                  36CB9D80 2C12FC4D
    hadoop-2.6.4.tar.gz: SHA256 = C58F08D2 E0B13035 F86F8B0B 8B65765A B9F47913
                                  81F74D02 C48F8D9C EF5E7D8E
    hadoop-2.6.4.tar.gz: SHA384 = 87539A46 B696C98E 5C7E352E 997B0AF8 0602D239
                                  5591BF07 F3926E78 2D2EF790 BCBB6B3C EAF5B3CF
                                  ADA7B6D1 35D4B952
    hadoop-2.6.4.tar.gz: SHA512 = 493CC1A3 E8ED0F7E DEE506D9 9BFABBE2 AA71A477
                                  6E4BFF5B 852C6279 B4C828A0 505D4EE5 B63A0DE0
                                  DCFECF70 B4BB0EF8 01C767A0 68EAEAC9 38B8C58D
                                  8F21BEEC
    
    $ tar xf hadoop-2.6.4.tar.gz
    $ cd hadoop-2.6.4
    
  2. To get the minimal HDFS configuration, modify the core-site.xml and hdfs-site.xml files, as follows:
    $ cat << EOF > etc/hadoop/core-site.xml
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:8020</value>
        </property>
    </configuration>
    EOF
    $ cat << EOF > etc/hadoop/hdfs-site.xml
    <configuration>
       <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
    EOF
    

    This will put the Hadoop HDFS metadata and data directories under the /tmp/hadoop-$USER directories. To make this more permanent, we can add the dfs.namenode.name.dir, dfs.namenode.edits.dir, and dfs.datanode.data.dir parameters, but we will leave these out for now. For a more customized distribution, one can download a Cloudera version from http://archive.cloudera.com/cdh.
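
    For reference, the following is a hedged sketch of what such a permanent configuration might look like in hdfs-site.xml; the /data/hdfs paths are hypothetical placeholders and should point to persistent local disks (dfs.namenode.edits.dir defaults to the value of dfs.namenode.name.dir):
    $ cat << EOF > etc/hadoop/hdfs-site.xml
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/data/hdfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/data/hdfs/data</value>
        </property>
    </configuration>
    EOF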

  3. First, we need to format the NameNode, which writes out empty metadata:
    $ bin/hdfs namenode -format
    16/05/12 00:55:40 INFO namenode.NameNode: STARTUP_MSG: 
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = alexanders-macbook-pro.local/192.168.1.68
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 2.6.4
    STARTUP_MSG:   classpath =
    ...
    
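    After a successful format, the NameNode metadata directory (by default under /tmp/hadoop-$USER/dfs/name) holds the initial fsimage; once the NameNode is running, edits files accumulate next to it until the Secondary NameNode merges them. A quick, hedged check (exact file names may vary between Hadoop versions):
    $ # The current/ subdirectory should contain a VERSION file and an initial fsimage_* file
    $ ls /tmp/hadoop-$USER/dfs/name/current
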
  4. Then start the namenode, secondarynamenode, and datanode Java processes (I usually open three different command-line windows to see the logs, but in a production environment, these are usually daemonized):
    $ bin/hdfs namenode &
    ...
    $ bin/hdfs secondarynamenode &
    ...
    $ bin/hdfs datanode &
    ...
    
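    Alternatively, the same daemons can be started in the background with the helper scripts shipped in sbin/ (they write their logs under the logs/ directory); a hedged sketch:
    $ sbin/hadoop-daemon.sh start namenode
    $ sbin/hadoop-daemon.sh start secondarynamenode
    $ sbin/hadoop-daemon.sh start datanode
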
  5. We are now ready to create the first HDFS file:
    $ date | bin/hdfs dfs -put - date.txt
    ...
    $ bin/hdfs dfs -ls
    Found 1 items
    -rw-r--r-- 1 akozlov supergroup 29 2016-05-12 01:02 date.txt
    $ bin/hdfs dfs -text date.txt
    Thu May 12 01:02:36 PDT 2016
    
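    The second column of the -ls output is the per-file replication factor, which is 1 here since we set dfs.replication to 1. It can be changed per file with -setrep; a hedged example (on a single-DataNode cluster, requesting more copies simply leaves the file under-replicated until more nodes join):
    $ bin/hdfs dfs -setrep 2 date.txt
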
  6. Of course, in this particular case, the actual file is stored only on one node, which is the same node we run the datanode on (localhost). In my case, it is the following:
    $ cat /tmp/hadoop-akozlov/dfs/data/current/BP-1133284427-192.168.1.68-1463039756191/current/finalized/subdir0/subdir0/blk_1073741827
    Thu May 12 01:02:36 PDT 2016
    
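    The mapping between a file and its blocks can also be obtained without digging through the local file system; a hedged sketch using fsck (with the default /user/$USER home directory):
    $ # Show the blocks, their DataNode locations, and the replication status for date.txt
    $ bin/hdfs fsck /user/$USER/date.txt -files -blocks -locations
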
  7. The NameNode UI can be found at http://localhost:50070 and displays a host of information, including the HDFS usage and the list of DataNodes, the slaves of the HDFS master node, as follows:
    Figure 03-6. A snapshot of the HDFS NameNode UI.

The preceding figure shows the HDFS NameNode HTTP UI in a single-node deployment (usually at http://<namenode-address>:50070). The Utilities | Browse the file system tab allows you to browse and download files from HDFS. Nodes can be added by starting DataNodes on other machines and pointing them at the NameNode with the fs.defaultFS=hdfs://<namenode-address>:8020 parameter, as sketched below. The Secondary NameNode HTTP UI is usually at http://<secondarynamenode-address>:50090.
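
For example, on the new machine, the same Hadoop distribution would be unpacked and its core-site.xml pointed at the existing NameNode before starting the DataNode daemon. The following is a hedged sketch, with <namenode-address> standing in for the actual NameNode host name:

    $ cat << EOF > etc/hadoop/core-site.xml
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://<namenode-address>:8020</value>
        </property>
    </configuration>
    EOF
    $ sbin/hadoop-daemon.sh start datanode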

By default, Scala/Spark will use the local file system. However, if the core-site.xml file is on the classpath or placed in the $SPARK_HOME/conf directory, Spark will use HDFS as the default.
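
For example, with HDFS as the default file system, the file created earlier can be read directly from the Spark shell. The following is a hedged sketch, run from the $SPARK_HOME directory and using the user name from the earlier listings (replace it with your own):

    $ bin/spark-shell
    ...
    scala> sc.textFile("hdfs://localhost:8020/user/akozlov/date.txt").collect.foreach(println)
    Thu May 12 01:02:36 PDT 2016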
