HDFS Metadata - Datanode
In this article I will explain how the datanode maintains metadata in the directory configured by dfs.datanode.data.dir in the hdfs-site.xml file.
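To check which directory a datanode will use, you can read the property straight out of hdfs-site.xml. The snippet below is a minimal sketch, not part of Hadoop: the XML is a made-up example that reuses the path from this article, and datanode_dirs is a hypothetical helper name. It assumes the stock property name dfs.datanode.data.dir and a value that may list several directories separated by commas.

```python
import xml.etree.ElementTree as ET

# Made-up hdfs-site.xml content for illustration; a real file has more properties.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/hadoop/hdfs/datanode</value>
  </property>
</configuration>
"""

def datanode_dirs(xml_text, prop="dfs.datanode.data.dir"):
    """Return the directories listed under `prop`, or [] if it is not set."""
    root = ET.fromstring(xml_text)
    for p in root.findall("property"):
        if p.findtext("name", "").strip() == prop:
            value = p.findtext("value", "")
            # The value may be a comma-separated list of directories.
            return [d.strip() for d in value.split(",") if d.strip()]
    return []

print(datanode_dirs(SAMPLE_HDFS_SITE))
```

On a real node you would pass the contents of your own hdfs-site.xml instead of the sample string.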
In my case the property is set to /hadoop/hdfs/datanode, so let's start by listing that directory.
ls -1 /hadoop/hdfs/datanode
current
in_use.lock
There are two entries:
in_use.lock:
This is a lock file held by the datanode process. It prevents multiple datanode processes from modifying the directory concurrently.
current:
This is a directory. Let's do a tree listing of it.
tree current/
current/
|-- BP-1469059006-127.0.0.1-1449042391563
|   |-- current
|   |   |-- VERSION
|   |   |-- finalized
|   |   |   `-- subdir0
|   |   |       `-- subdir0
|   |   |           |-- blk_1073741825
|   |   |           `-- blk_1073741825_1001.meta
|   |   |-- rbw
|   |-- dncp_block_verification.log.curr
|   |-- dncp_block_verification.log.prev
|   `-- tmp
`-- VERSION
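Before walking through the tree, here is a rough illustration of how a lock file such as in_use.lock keeps a second process out of a storage directory. This is only a sketch of the idea in Python using fcntl.flock on a Unix-like system. It is not the datanode's actual (Java) locking code, and try_lock and the temporary directory are hypothetical stand-ins.

```python
import fcntl
import os
import tempfile

def try_lock(storage_dir):
    """Try to take an exclusive lock on <storage_dir>/in_use.lock.

    Returns the open file object on success (keep it open to hold the lock),
    or None if someone else already holds the lock.
    """
    f = open(os.path.join(storage_dir, "in_use.lock"), "a+")
    try:
        # Non-blocking exclusive lock: fails immediately if already locked.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        f.close()
        return None
    return f

demo_dir = tempfile.mkdtemp()   # stand-in for a datanode storage directory
first = try_lock(demo_dir)      # the first holder gets the lock
second = try_lock(demo_dir)     # a second attempt is refused
print(first is not None, second is None)
first.close()                   # closing the file releases the lock
```

The lock lives only as long as the holder keeps the file open, which is why a stale in_use.lock file left on disk after a crash does not by itself block a restart in this scheme.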
There are a lot of files and directories, so let's explore them one by one.
VERSION:
This is a storage information file with the following content:
#Wed Dec 02 13:16:39 IST 2015
storageID=DS-c25c62e1-a512-451e-87b2-e9175afca9f4
clusterID=CID-59abe9cc-89c7-4cf8-ada2-6c6409c98c97
cTime=0
datanodeUuid=ad7ecbe4-b4a2-4b52-8146-5240ec849119
storageType=DATA_NODE
layoutVersion=-56
You can refer to org.apache.hadoop.hdfs.server.common.StorageInfo.java and org.apache.hadoop.hdfs.server.common.Storage.java for more information.
storageID:
It is unique to the datanode and is the same across all storage directories on that datanode. The namenode uses this ID to uniquely identify the datanode.
clusterID:
It identifies a cluster and has to stay unique for the lifetime of the cluster. This is important for federated deployments. It was introduced in HDFS-1365.
cTime:
Creation time of the file system. This field is updated during HDFS upgrades.
datanodeUuid:
Unique identifier of a datanode, introduced in HDFS-5233.
storageType:
For a datanode storage directory this is always DATA_NODE.
layoutVersion:
Layout version of the storage data. Whenever a new metadata-related feature is added to HDFS, this version changes.
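The VERSION file is plain key=value text with a leading # comment line, so it is easy to read programmatically. Here is a minimal sketch that parses the content shown above into a dictionary. parse_version is a hypothetical helper, and it handles only the simple key=value lines seen here.

```python
# Content copied from the VERSION file shown above.
VERSION_TEXT = """\
#Wed Dec 02 13:16:39 IST 2015
storageID=DS-c25c62e1-a512-451e-87b2-e9175afca9f4
clusterID=CID-59abe9cc-89c7-4cf8-ada2-6c6409c98c97
cTime=0
datanodeUuid=ad7ecbe4-b4a2-4b52-8146-5240ec849119
storageType=DATA_NODE
layoutVersion=-56
"""

def parse_version(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields

info = parse_version(VERSION_TEXT)
print(info["storageType"], int(info["layoutVersion"]))
```

Comparing clusterID values parsed this way is a quick check when a datanode refuses to join a cluster after a namenode reformat.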
BP-randomInteger-NameNodeIpAddress-creationTime:
This is the unique block pool ID. BP stands for Block Pool, and it is followed by a unique random integer, the IP address of the namenode, and the block pool creation time. A block pool is the set of blocks that belong to a single namespace.
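The three parts of the block pool ID can be pulled apart with a regular expression. This is a small sketch using the ID from the listing above. parse_block_pool_id is a hypothetical helper, and it assumes the creation time is in milliseconds since the Unix epoch, which is consistent with the date in the VERSION file of this example.

```python
import re
from datetime import datetime, timezone

# BP-<randomInteger>-<NameNodeIpAddress>-<creationTime>
BP_PATTERN = re.compile(r"^BP-(\d+)-(.+)-(\d+)$")

def parse_block_pool_id(bpid):
    """Split a block pool ID into its random part, namenode IP and creation time."""
    m = BP_PATTERN.match(bpid)
    if m is None:
        raise ValueError(f"not a block pool id: {bpid!r}")
    rand, ip, ctime_ms = m.groups()
    # Assumption: creation time is milliseconds since the Unix epoch.
    created = datetime.fromtimestamp(int(ctime_ms) / 1000, tz=timezone.utc)
    return {"random": int(rand), "namenode_ip": ip, "created": created}

bp = parse_block_pool_id("BP-1469059006-127.0.0.1-1449042391563")
print(bp["namenode_ip"], bp["created"].isoformat())
```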
finalized:
This directory contains blocks that are complete. Each block file contains HDFS data.
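In the listing above, the block file blk_1073741825 sits next to blk_1073741825_1001.meta. The .meta file holds the checksums for the block, and the extra number in its name is the block's generation stamp. Here is a minimal sketch that pairs block files with their meta files by name. pair_blocks is a hypothetical helper that works on a list of file names, such as the output of os.listdir on a subdir.

```python
import re

BLOCK_RE = re.compile(r"^blk_(\d+)$")
META_RE = re.compile(r"^blk_(\d+)_(\d+)\.meta$")

def pair_blocks(filenames):
    """Map block ID -> (block file, meta file or None, generation stamp or None)."""
    metas = {}
    for name in filenames:
        m = META_RE.match(name)
        if m:
            metas[int(m.group(1))] = (name, int(m.group(2)))
    pairs = {}
    for name in filenames:
        m = BLOCK_RE.match(name)
        if m:
            block_id = int(m.group(1))
            meta_name, gen_stamp = metas.get(block_id, (None, None))
            pairs[block_id] = (name, meta_name, gen_stamp)
    return pairs

pairs = pair_blocks(["blk_1073741825", "blk_1073741825_1001.meta"])
print(pairs)
```

A block that comes back with None for its meta file would be worth a closer look, since the datanode needs the checksums to verify the data.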
rbw:
This directory contains blocks that are still being written by an HDFS client. Here rbw stands for "replica being written".
dncp_block_verification.log.*:
This file tracks the last time each block was verified by comparing its contents against its checksum. The file is rolled periodically, so dncp_block_verification.log.curr is the current file and dncp_block_verification.log.prev is the older file that has been rolled over.
Background block verification work happens in ascending order of last verification time.
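The ordering rule can be pictured as a priority queue keyed on the last verification time. The sketch below only illustrates that rule and is not the datanode's scanner code. verification_order is a hypothetical name, and the timestamps and the second and third block names are made up.

```python
import heapq

def verification_order(last_verified):
    """Yield block names, least recently verified first."""
    heap = [(ts, name) for name, ts in last_verified.items()]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        yield name

# Made-up last-verification times (arbitrary units).
last_verified = {
    "blk_1073741825": 300,
    "blk_1073741826": 100,
    "blk_1073741827": 200,
}
order = list(verification_order(last_verified))
print(order)
```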