Hadoop Notes

The below content is copied from various blogs and websites in order to get the internals of BIG DATA precisely.



             The NameNode stores the metadata of the HDFS. The state of HDFS is stored in a file called fsimage and is the base of the metadata. During the runtime modifications are just written to a log file called edits. On the next start-up of the NameNode the state is read from fsimage, the changes from edits are applied to that and the new state is written back to fsimage. After this edits is cleared and contains is now ready for new log entries.




The NameNode stores modifications to the file system as a log appended to a native file system file, edits. When a NameNode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file. It then writes new HDFS state to the fsimage and starts normal operation with an empty edits file. Since NameNode merges fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster. Another side effect of a larger edits file is that next restart of NameNode takes longer.

The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode.

The start of the checkpoint process on the secondary NameNode is controlled by two configuration parameters.

  • dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum delay between two consecutive checkpoints, and
  • dfs.namenode.checkpoint.txns, set to 1 million by default, defines the number of uncheckpointed transactions on the NameNode which will force an urgent checkpoint, even if the checkpoint period has not been reached.

The secondary NameNode stores the latest checkpoint in a directory which is structured the same way as the primary NameNode’s directory. So that the check pointed image is always ready to be read by the primary NameNode if necessary.

Usage: hdfs secondarynamenode [-checkpoint [force]] | [-format] | [-geteditsize]

-checkpoint [force] Checkpoints the SecondaryNameNode if EditLog size >= fs.checkpoint.size. If force is used, checkpoint irrespective of EditLog size.
-format Format the local storage during startup.
-geteditsize Prints the number of uncheckpointed transactions on the NameNode.

Checkpoint Node:

A Checkpoint Node was introduced to solve the drawbacks of the NameNode. The changes are just written to edits and not merged to fsimage during the runtime. If the NameNode runs for a while edits gets huge and the next startup will take even longer because more changes have to be applied to the state to determine the last state of the metadata.

The Checkpoint Node fetches periodically fsimage and edits from the NameNode and merges them. The resulting state is called checkpoint. After this is uploads the result to the NameNode.

There was also a similiar type of node called “Secondary Node” but it doesn’t have the “upload to NameNode” feature. So the NameNode need to fetch the state from the Secondary NameNode. It also was confussing because the name suggests that the Secondary NameNode takes the request if the NameNode fails which isn’t the case.

Backup Node:

The Backup Node provides the same functionality as the Checkpoint Node, but is synchronized with the NameNode. It doesn’t need to fetch the changes periodically because it receives a strem of file system edits. from the NameNode. It holds the current state in-memory and just need to save this to an image file to create a new checkpoint

The Backup node provides the same checkpointing functionality as the Checkpoint node, as well as maintaining an in-memory, up-to-date copy of the file system namespace that is always synchronized with the active NameNode state. Along with accepting a journal stream of file system edits from the NameNode and persisting this to disk, the Backup node also applies those edits into its own copy of the namespace in memory, thus creating a backup of the namespace.

The Backup node does not need to download fsimage and edits files from the active NameNode in order to create a checkpoint, as would be required with a Checkpoint node or Secondary NameNode, since it already has an up-to-date state of the namespace state in memory. The Backup node checkpoint process is more efficient as it only needs to save the namespace into the local fsimage file and reset edits.

As the Backup node maintains a copy of the namespace in memory, its RAM requirements are the same as the NameNode.

The NameNode supports one Backup node at a time. No Checkpoint nodes may be registered if a Backup node is in use. Using multiple Backup nodes concurrently will be supported in the future.

The Backup node is configured in the same manner as the Checkpoint node. It is started with bin/hdfs namenode -backup.

The location of the Backup (or Checkpoint) node and its accompanying web interface are configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address configuration variables.

Use of a Backup node provides the option of running the NameNode with no persistent storage, delegating all responsibility for persisting the state of the namespace to the Backup node. To do this, start the NameNode with the -importCheckpoint option, along with specifying no persistent storage directories of type edits dfs.namenode.edits.dir for the NameNode configuration.


Secondary Namenode:
In Hadoop 1.x and 2.x, the secondary namenode means the same. It does CPU intensive tasks for Namenode. In more details, it combines the Edit log and fs_image and returns the consolidated file to Namenode. Namenode then loads that file into RAM. But, secondary namenode doesn’t provide failover capabilities.  So, in case of Namenode failure, Hadoop admins have to manually recover the data from Secondary Namenode.

Now, what’s Standby Namenode?
In Hadoop 2.0, with the introduction of HA, the Standby Namenode came into picture. The standby namenode is the node that removes the problem of SPOF (Single Point Of Failure) that was there in Hadoop 1.x. The standby namenode provides automatic failover in case Active Namenode (can be simply called ‘Namenode’ if HA is not enabled) fails.

Moreover, enabling HA is not mandatory. But, when it is enabled, you can’t use Secondary Namenode. So, either Secondary Namenode is enabled OR Standby Namenode is enabled.


A DataNode stores data in the [HadoopFileSystem]. A functional filesystem has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode; spinning until that service comes up. It then responds to requests from the NameNode for filesystem operations.

Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode, talk directly to the DataNode to access the files. TaskTracker instances can, indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.

DataNode instances can talk to each other, which is what they do when they are replicating data.

  • There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers, rather than multiple disks on the same server.

  • An ideal configuration is for a server to have a DataNode, a TaskTracker, and then physical disks one TaskTracker slot per CPU. This will allow every TaskTracker 100% of a CPU, and separate disks to read and write data.

  • Avoid using NFS for data storage in production system.



An SPOF is a single part of a system, which, if it fails, takes down the entire system.

In Hadoop, the NameNode is a SPOF. DNS may also be SPOF, unless you have failover. The network may be a SPOF, depending on is redundancy, as can be the room’s power supply.

People can be SPOFs too. If only one person knows the database admin password, you can be sure they are on vacation when the password is needed. If only one developer understands how the mission-critical MapReduce algorithm works, they are the SPOF.

There is always a SPOF. Even when you think you have anticipated and eliminated every SPOF, there is one out there, waiting. How you find it? You wait.