In this article I will explain how datanode maintains metadata information in directory configured using dfs.datanode.name.dir in hdfs-site.xml file. In my case dfs.datanode.name.dir is configured to /hadoop/hdfs/datanode location. So lets start with listing on this directory.
ls -1 /hadoop/hdfs/datanode current in_use.lock
There are two entries namely
This is lock file held by datanode process. It is used to prevent concurrent modification of directory by multiple datanode processes.
current: This is directory. Lets do tree listing on this
tree current/ current/ |-- BP-1469059006-127.0.0.1-1449042391563 | |-- current | | |-- VERSION | | |-- finalized | | | `-- subdir0 | | | `-- subdir0 | | | |-- blk_1073741825 | | | `-- blk_1073741825_1001.meta | | |-- rbw | |-- dncp_block_verification.log.curr | |-- dncp_block_verification.log.prev | `-- tmp `-- VERSION
There are lot of files and directories, lets explore one by one.
This is a Storage information file with following content:
#Wed Dec 02 13:16:39 IST 2015 storageID=DS-c25c62e1-a512-451e-87b2-e9175afca9f4 clusterID=CID-59abe9cc-89c7-4cf8-ada2-6c6409c98c97 cTime=0 datanodeUuid=ad7ecbe4-b4a2-4b52-8146-5240ec849119 storageType=DATA_NODE layoutVersion=-56
You can refer to
org.apache.hadoop.hdfs.server.common.Storage.java for more information.
It is unique to the datanode, and same across all storage directories on datanode. Namenode uses this id, to uniquely identify the datanode.
It identifies a cluster, and it has to be unique during the life time of a cluster. This is important for federated deployment. Introduced in HDFS-1365
creation time of file system, this field is updated during HDFS upgrades.
Unique identifier of a datanode, introduced in HDFS-5233
It’ll be DATA_NODE.
Layout version of storage data. Whenever new features related to metadata are added to HDFS project, this version is changed.
This is unique block pool id, where BP stands for Block Pool, it is followed by unique random integer, IP address of namenode and block pool creation time.Block pool collects a set of blocks whihc belongs to a namespace.
This directory contains block which are completed. Each block file contains hdfs data.
This directory contains blocks that are still being written to by HDFS client. Here rbw stands for replic being written.
This file tracks the last time each block was verified by comparing its contents against the checksum. This file is rolled periodically, so
dncp_block_verification.log.curr is current file and
dncp_block_verification.log.prev this is old file which has been rolled back.
Background block verification work happens in ascending order of last verification time.