Hadoop Study Notes 14: Hadoop-HDFS FSDataset Source Code




II. Physical concepts of FSVolume and FSDir

 [Figure 2: physical layout of FSVolume and FSDir]



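  1. The Block class only represents the identity of a block, as its fields show; it does not represent the block file.
  2. blk_1150083481087817002 is a block; %hadoop_home%/dfs/data/current/blk_1150083481087817002 is a block file.
  3. A block consists of the block data blk_1150083481087817002 and the block metadata blk_1150083481087817002_1007.meta. Unless stated otherwise, "block" in this blog series refers only to the block data blk_1150083481087817002.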
public class Block implements Writable, Comparable<Block> {

	// convert a block file name to a block id
	static long filename2id(String name) {
		return Long.parseLong(name.substring("blk_".length()));
	}

	// convert the block id to a block file name
	public String getBlockName() {
		return "blk_" + String.valueOf(blockId);
	}

	private long blockId;			// block id, e.g. 1150083481087817002
	private long numBytes;		// block size in bytes
	// generation stamp, starting from 1000L (e.g. 1007); when two blocks have the
	// same blockId (and hence the same hashCode), generationStamp is compared
	private long generationStamp;

	public Block() {
		this(0, 0, 0);
	}

	public boolean equals(Object o) {
		if (!(o instanceof Block)) {
			return false;
		}
		final Block that = (Block) o;
		return this.blockId == that.blockId
				&& GenerationStamp.equalsWithWildcard(this.generationStamp,
						that.generationStamp);
	}

	public int hashCode() {
		return 37 * 17 + (int) (blockId ^ (blockId >>> 32));
	}
}
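
The naming rules for blocks, block files and metadata files can be tried out in a small standalone sketch. `BlockNames` and `metaFileName` are hypothetical names introduced only for illustration (they are not Hadoop classes), and the `blk_<id>_<generationStamp>.meta` pattern is inferred from the example file name blk_1150083481087817002_1007.meta rather than taken from the Hadoop source.

```java
// Standalone sketch; BlockNames is a hypothetical helper, not a Hadoop class.
class BlockNames {
    static final String PREFIX = "blk_";

    // block file name -> block id, mirroring Block.filename2id
    static long filename2id(String name) {
        return Long.parseLong(name.substring(PREFIX.length()));
    }

    // block id -> block file name, mirroring Block.getBlockName
    static String blockName(long blockId) {
        return PREFIX + blockId;
    }

    // Assumed pattern, inferred from blk_1150083481087817002_1007.meta
    static String metaFileName(long blockId, long generationStamp) {
        return blockName(blockId) + "_" + generationStamp + ".meta";
    }

    public static void main(String[] args) {
        long id = filename2id("blk_1150083481087817002");
        System.out.println(id);
        System.out.println(blockName(id));
        System.out.println(metaFileName(id, 1007L));
    }
}
```

The round trip only works for data file names; a metadata file name would need its generation stamp suffix stripped before parsing.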





// mapping between a block and its block file
	static class BlockAndFile implements Comparable<BlockAndFile> {
		final Block block;
		// absolute path, e.g. %hadoop_home%/dfs/data/current/blk_1150083481087817002
		final File pathfile;

		BlockAndFile(File fullpathname, Block block) {
			this.pathfile = fullpathname;
			this.block = block;
		}

		public int compareTo(BlockAndFile o) {
			return this.block.compareTo(o.block);
		}
	}



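  1. DatanodeBlockInfo holds a block's on-disk information: the volume (FSVolume) that stores it, its file, and its detach state.
  2. Detach state: during an upgrade the system creates a snapshot. The block files and block metadata files in the snapshot are hard links to the same content as the files in current. If a file in current were modified without being detached first, the change would also show up in the snapshot, so the hard link has to be broken before writing. The method is simple: copy the file into a temporary directory, then rename the temporary copy over the corresponding file in current. After that, the file in current and the file in the snapshot are detached. This technique is also called copy-on-write and is an effective way to improve system performance, because data is copied only when it is actually modified. DatanodeBlockInfo.detachBlock performs the detach for both the data file and the metadata file of a Block.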
    class DatanodeBlockInfo {
      private FSVolume volume;       // the FSVolume that holds the block
      private File     file;         // the block file
      private boolean detached;      // copy-on-write done for block

      DatanodeBlockInfo(FSVolume vol, File file) {
        this.volume = vol;
        this.file = file;
        detached = false;
      }

      /**
       * 1. Copy specified file into a temporary file.
       * 2. Then rename the temporary file to the original name.
       * This will cause any hardlinks to the original file to be removed.
       * The temporary files are created in the detachDir.
       * The temporary files will be recovered (especially on Windows) on datanode restart.
       */
      private void detachFile(File file, Block b) throws IOException {
        // ... (body omitted in this excerpt)
      }

      /**
       * Returns true if this block was copied, otherwise returns false.
       */
      boolean detachBlock(Block block, int numLinks) throws IOException {
        // ... (body omitted in this excerpt)
      }
    }
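
The copy-then-rename idea behind detach can be reproduced with plain `java.nio.file` calls. The following is a minimal sketch under stated assumptions, not the Hadoop implementation: `DetachSketch`, the file names and the `detach` directory are hypothetical, and the file system is assumed to support hard links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration of detach via copy + rename; not Hadoop code.
class DetachSketch {

    // Replace 'file' with a private copy so that other hard links
    // (for example the snapshot's) no longer share its content.
    static void detachFile(Path file, Path detachDir) throws IOException {
        Files.createDirectories(detachDir);
        Path tmp = detachDir.resolve(file.getFileName());
        Files.copy(file, tmp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    // Creates a file plus a hard link to it, detaches the file, overwrites it,
    // and returns what the hard link (the "snapshot") still contains.
    static String snapshotContentAfterWrite() {
        try {
            Path root = Files.createTempDirectory("detach-demo");
            Path current = root.resolve("blk_1");
            Path snapshot = root.resolve("blk_1.snapshot");
            Files.write(current, "v1".getBytes(StandardCharsets.UTF_8));
            Files.createLink(snapshot, current);          // snapshot shares the content
            detachFile(current, root.resolve("detach"));  // break the sharing
            Files.write(current, "v2".getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(snapshot), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(snapshotContentAfterWrite());
    }
}
```

On a file system with hard-link support the snapshot keeps its original content after the write; removing the `detachFile` call would let the overwrite show through the link.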


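  1. FSDir is a directory that stores blocks.
  2. FSDir is a tree structure; its root is %hadoop_home%/dfs/data/current.
  3. When an FSDir is initialized, it recursively initializes all child FSDirs under %hadoop_home%/dfs/data/current, building the FSDir tree.
  4. Important FSDir methods:
  • addBlock: adds a block to this FSDir and returns the corresponding block file.
  • getBlockAndFileInfo: collects every BlockAndFile under this FSDir.
  • getVolumeMap: collects the mapping from each block under this FSDir to its DatanodeBlockInfo.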
    // a directory that stores blocks
    	class FSDir {
    		File dir; // directory backing this FSDir; the outermost one is .../current
    		int numBlocks = 0; // number of blocks directly under this FSDir
    		FSDir children[]; // an FSDir may contain child FSDirs
    		int lastChildIdx = 0; // index of the child directory that received the previous block

    		// build the FSDir tree on initialization
    		public FSDir(File dir) throws IOException {
    			this.dir = dir;
    			this.children = null;
    			File[] files = FileUtil.listFiles(dir);
    			int numChildren = 0;
    			for (int idx = 0; idx < files.length; idx++) {
    				if (files[idx].isDirectory()) {
    					numChildren++;
    				} else if (Block.isBlockFilename(files[idx])) {
    					numBlocks++;
    				}
    			}
    			if (numChildren > 0) {
    				children = new FSDir[numChildren];
    				int curdir = 0;
    				for (int idx = 0; idx < files.length; idx++) {
    					if (files[idx].isDirectory()) {
    						// recursively initialize the child FSDir
    						children[curdir] = new FSDir(files[idx]);
    						curdir++;
    					}
    				}
    			}
    		}

    		public File addBlock(Block b, File src) throws IOException {
    			// First try without creating subdirectories
    			File file = addBlock(b, src, false, false);
    			return (file != null) ? file : addBlock(b, src, true, true);
    		}

    		private File addBlock(Block b, File src, boolean createOk,
    				boolean resetIdx) throws IOException {
    			// the DataNode first stores blocks directly under the current/
    			// subdirectory of the storage path
    			if (numBlocks < maxBlocksPerDir) {
    				// src: under tmp
    				// dest: under current
    				File dest = new File(dir, b.getBlockName());
    				// metaData: under tmp
    				// newmeta: under current
    				File metaData = getMetaFile(src, b);
    				File newmeta = getMetaFile(dest, b);
    				// move the metadata file and the block file from tmp to current
    				if (!metaData.renameTo(newmeta) || !src.renameTo(dest)) {
    					throw new IOException("could not move files for " + b
    							+ " from tmp to " + dest.getAbsolutePath());
    				}
    				numBlocks += 1;
    				return dest;
    			}
    			// Once current/ already holds maxBlocksPerDir blocks, maxBlocksPerDir
    			// subdirectories are created under current/ and one of them is chosen
    			// to store the block. If the chosen subdirectory also holds
    			// maxBlocksPerDir blocks, maxBlocksPerDir subdirectories are created
    			// under it in turn and one of those is chosen. This continues
    			// recursively until the storage path no longer has room for a block.
    			// maxBlocksPerDir defaults to 64 and can be changed in the DataNode
    			// configuration through the dfs.datanode.numblocks option.
    			if (lastChildIdx < 0 && resetIdx) {
    				// reset so that all children will be checked
    				lastChildIdx = random.nextInt(children.length);
    			}
    			if (lastChildIdx >= 0 && children != null) {
    				// Check if any child-tree has room for a block.
    				for (int i = 0; i < children.length; i++) {
    					int idx = (lastChildIdx + i) % children.length;
    					File file = children[idx].addBlock(b, src, false, resetIdx);
    					if (file != null) {
    						lastChildIdx = idx;
    						return file;
    					}
    				}
    				lastChildIdx = -1;
    			}
    			if (!createOk) {
    				return null;
    			}
    			if (children == null || children.length == 0) {
    				children = new FSDir[maxBlocksPerDir];
    				for (int idx = 0; idx < maxBlocksPerDir; idx++) {
    					children[idx] = new FSDir(new File(dir,
    							DataStorage.BLOCK_SUBDIR_PREFIX + idx));
    				}
    			}
    			// now pick a child randomly for creating a new set of subdirs.
    			lastChildIdx = random.nextInt(children.length);
    			return children[lastChildIdx].addBlock(b, src, true, false);
    		}

    		// collect every BlockAndFile under this FSDir
    		void getBlockAndFileInfo(TreeSet<BlockAndFile> blockSet) {
    			// recurse into child FSDirs
    			if (children != null) {
    				for (int i = 0; i < children.length; i++) {
    					children[i].getBlockAndFileInfo(blockSet);
    				}
    			}
    			File blockFiles[] = dir.listFiles();
    			for (int i = 0; i < blockFiles.length; i++) {
    				if (Block.isBlockFilename(blockFiles[i])) {
    					long genStamp = FSDataset.getGenerationStampFromFile(
    							blockFiles, blockFiles[i]);
    					Block block = new Block(blockFiles[i],
    							blockFiles[i].length(), genStamp);
    					blockSet.add(new BlockAndFile(blockFiles[i]
    							.getAbsoluteFile(), block));
    				}
    			}
    		}

    		// build the mapping from Block to DatanodeBlockInfo
    		void getVolumeMap(HashMap<Block, DatanodeBlockInfo> volumeMap, FSVolume volume) {
    			// recurse into child FSDirs
    			if (children != null) {
    				for (int i = 0; i < children.length; i++) {
    					children[i].getVolumeMap(volumeMap, volume);
    				}
    			}
    			File blockFiles[] = dir.listFiles();
    			if (blockFiles != null) {
    				for (int i = 0; i < blockFiles.length; i++) {
    					if (Block.isBlockFilename(blockFiles[i])) {
    						long genStamp = FSDataset.getGenerationStampFromFile(
    								blockFiles, blockFiles[i]);
    						volumeMap.put(
    								new Block(blockFiles[i],
    										blockFiles[i].length(), genStamp),
    								new DatanodeBlockInfo(volume, blockFiles[i]));
    					}
    				}
    			}
    		}
    	}
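
To see how the two-pass addBlock fills the directory tree, the placement logic can be modelled in memory without touching the disk. This is a simplified sketch, not the Hadoop class: `DirNode` is a hypothetical stand-in for FSDir, the limit is lowered from 64 to 4 so the fan-out is visible after a few blocks, and the `subdir` prefix is assumed as the value of DataStorage.BLOCK_SUBDIR_PREFIX.

```java
import java.util.Random;

// In-memory model of the FSDir placement policy; DirNode is hypothetical.
class DirNode {
    static final int MAX_BLOCKS_PER_DIR = 4;       // stand-in for the default of 64
    static final Random RANDOM = new Random(42);   // fixed seed for repeatable runs

    final String path;
    int numBlocks = 0;
    DirNode[] children;
    int lastChildIdx = 0;

    DirNode(String path) {
        this.path = path;
    }

    // Returns the path at which the block would be stored.
    String addBlock(String blockName) {
        // First try without creating subdirectories.
        String placed = addBlock(blockName, false, false);
        return (placed != null) ? placed : addBlock(blockName, true, true);
    }

    private String addBlock(String blockName, boolean createOk, boolean resetIdx) {
        if (numBlocks < MAX_BLOCKS_PER_DIR) {
            numBlocks++;
            return path + "/" + blockName;
        }
        if (lastChildIdx < 0 && resetIdx) {
            // Reset so that all children are checked again.
            lastChildIdx = RANDOM.nextInt(children.length);
        }
        if (lastChildIdx >= 0 && children != null) {
            // Look for a child subtree that still has room.
            for (int i = 0; i < children.length; i++) {
                int idx = (lastChildIdx + i) % children.length;
                String placed = children[idx].addBlock(blockName, false, resetIdx);
                if (placed != null) {
                    lastChildIdx = idx;
                    return placed;
                }
            }
            lastChildIdx = -1;
        }
        if (!createOk) {
            return null;
        }
        if (children == null || children.length == 0) {
            children = new DirNode[MAX_BLOCKS_PER_DIR];
            for (int idx = 0; idx < MAX_BLOCKS_PER_DIR; idx++) {
                children[idx] = new DirNode(path + "/subdir" + idx);
            }
        }
        // Pick a child at random to start the next level.
        lastChildIdx = RANDOM.nextInt(children.length);
        return children[lastChildIdx].addBlock(blockName, true, false);
    }

    public static void main(String[] args) {
        DirNode root = new DirNode("current");
        for (int i = 0; i < 22; i++) {
            System.out.println(root.addBlock("blk_" + i));
        }
    }
}
```

With a limit of 4, the first 4 blocks land directly in current, the next 16 fill the four first-level subdirectories, and only then does a second level appear. The first pass (createOk = false) is what keeps the tree shallow: new subdirectories are created only after every existing directory has been found full.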



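  1. An FSVolume corresponds to one Storage on the DataNode. A DataNode can be configured with several Storages, so one DataNode contains several FSVolumes.
  2. Important FSVolume methods:
  • getDfsUsed (disk space used by DFS), getCapacity (disk size), getAvailable (disk space available).
  • addBlock: adds a block to the FSVolume by calling FSDir.addBlock.
  • getVolumeMap: collects the mapping from each block under this FSVolume to its DatanodeBlockInfo by calling FSDir.getVolumeMap.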
    // an FSVolume corresponds to one Storage
    	// a DataNode can be configured with several Storages, so it contains several FSVolumes
    	class FSVolume {
    		private File currentDir;
    		private FSDir dataDir;
    		private File tmpDir;
    		private File blocksBeingWritten; // clients write here
    		private File detachDir; // copy on write for blocks in snapshot
    		private DF usage;
    		private DU dfsUsage;
    		// set by the dfs.datanode.du.reserved property (example value: 1024)
    		private long reserved;

    		// initialize an FSVolume
    		FSVolume(File currentDir, Configuration conf) throws IOException {
    			this.reserved = conf.getLong("dfs.datanode.du.reserved", 0);
    			this.dataDir = new FSDir(currentDir);
    			this.currentDir = currentDir;
    			// the fields below are derived from parent, i.e. %hadoop_home%/dfs/data
    			File parent = currentDir.getParentFile();
    			this.detachDir = new File(parent, "detach");
    			// remove all blocks from "tmp" directory. These were either created
    			// by pre-append clients (0.18.x) or are part of replication
    			// request.
    			// They can be safely removed.
    			this.tmpDir = new File(parent, "tmp");
    			if (tmpDir.exists()) {
    				FileUtil.fullyDelete(tmpDir);
    			}
    			// Files that were being written when the datanode was last shutdown
    			// should not be deleted.
    			blocksBeingWritten = new File(parent, "blocksBeingWritten");
    			this.usage = new DF(parent, conf);
    			this.dfsUsage = new DU(parent, conf);
    		}

    		// getDfsUsed(), getCapacity() and getAvailable() are built on the
    		// DU (dfsUsage) and DF (usage) helpers above; bodies omitted here

    		File addBlock(Block b, File f) throws IOException {
    			File blockFile = dataDir.addBlock(b, f);
    			File metaFile = getMetaFile(blockFile, b);
    			// after the add, the DFS usage grows by the block and metadata sizes
    			dfsUsage.incDfsUsed(b.getNumBytes() + metaFile.length());
    			return blockFile;
    		}

    		void getVolumeMap(HashMap<Block, DatanodeBlockInfo> volumeMap) {
    			dataDir.getVolumeMap(volumeMap, this);
    		}
    	}
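
The excerpt reads `reserved` from dfs.datanode.du.reserved but does not show where it enters the space figures. One plausible reading, stated here as an assumption rather than as the Hadoop implementation, is that the reserved amount is subtracted from the raw capacity and that the available figure is capped by what the file system reports as free. `VolumeSpace` and its parameter names are hypothetical.

```java
// Hypothetical arithmetic sketch; not taken from the Hadoop source.
class VolumeSpace {

    // Capacity usable by DFS: raw capacity minus the reserved amount, never negative.
    static long capacity(long rawCapacity, long reserved) {
        return (reserved > rawCapacity) ? 0 : rawCapacity - reserved;
    }

    // Space still available to DFS: what is left of the DFS capacity,
    // capped by what the file system itself reports as free, never negative.
    static long available(long rawCapacity, long reserved, long dfsUsed, long fsAvailable) {
        long remaining = capacity(rawCapacity, reserved) - dfsUsed;
        long available = Math.min(remaining, fsAvailable);
        return Math.max(available, 0);
    }

    public static void main(String[] args) {
        // 1000 bytes raw, 100 reserved, 300 used by DFS, 800 free on disk
        System.out.println(available(1000, 100, 300, 800));
    }
}
```

The cap matters when files outside DFS fill the disk: the DFS bookkeeping may still show room, but the file system cannot supply it.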


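  1. DF reports the space status of the disk that contains dirPath; the equivalent Unix shell command is: df -k path.
  2. DU implements the Unix du command and reports how much disk space the file or directory dirPath occupies.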
    public class DF extends Shell {
      /** Default DF refresh interval. */
      public static final long DF_INTERVAL_DEFAULT = 3 * 1000;
      private final String dirPath;	// directory on which the df command is run
      private final File dirFile;	// the same directory, as a File
      private String filesystem;	// disk device name
      private String mount;	// mount point of the disk

      // initialize dirPath and dirFile
      public DF(File path, long dfInterval) throws IOException {
        this.dirPath = path.getCanonicalPath();
        this.dirFile = path.getCanonicalFile();
      }

      // getCapacity(), getUsed() and getAvailable() each read the
      // corresponding figure from dirFile; bodies omitted here
    }

    public class DU extends Shell {
      private String  dirPath;		// directory on which the du command is run
    }
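
The figures that DF exposes can also be obtained from the standard `java.io.File` space queries. The sketch below shows that approach; `DiskSpace` is a hypothetical class, and whether the Hadoop version discussed here uses exactly these calls is an assumption based on the `dirFile.get*()` shorthand in the excerpt.

```java
import java.io.File;

// Hypothetical df-like helper built on java.io.File; not the Hadoop DF class.
class DiskSpace {
    private final File dir;

    DiskSpace(File dir) {
        this.dir = dir;
    }

    // Total size of the partition that contains dir, in bytes.
    long capacity() {
        return dir.getTotalSpace();
    }

    // Bytes in use on that partition (total minus unallocated).
    long used() {
        return dir.getTotalSpace() - dir.getFreeSpace();
    }

    // Bytes this JVM could actually write, which may be less than the free space.
    long available() {
        return dir.getUsableSpace();
    }

    public static void main(String[] args) {
        DiskSpace df = new DiskSpace(new File(System.getProperty("java.io.tmpdir")));
        System.out.println("capacity  = " + df.capacity());
        System.out.println("used      = " + df.used());
        System.out.println("available = " + df.available());
    }
}
```

These calls describe the whole partition, which is the df view; the du view (space occupied by one directory tree) needs a walk over the files instead.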



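  1. FSVolumeSet manages all FSVolumes of a DataNode.
  2. Important FSVolumeSet methods:
  • getVolumeMap: collects the mapping from block to DatanodeBlockInfo across FSVolume[], by calling FSVolume.getVolumeMap on each volume.
  • getDfsUsed (disk space used), getCapacity (disk size), getRemaining (disk space available): each sums the corresponding FSVolume value over all volumes.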
    	static class FSVolumeSet {
    		FSVolume[] volumes = null;
    		int curVolume = 0;

    		FSVolumeSet(FSVolume[] volumes) {
    			this.volumes = volumes;
    		}

    		// round-robin over the volumes, skipping those without enough room
    		synchronized FSVolume getNextVolume(long blockSize) throws IOException {
    			// make sure we are not out of bounds
    			if (curVolume >= volumes.length) {
    				curVolume = 0;
    			}
    			int startVolume = curVolume;
    			while (true) {
    				FSVolume volume = volumes[curVolume];
    				curVolume = (curVolume + 1) % volumes.length;
    				if (volume.getAvailable() > blockSize) {
    					return volume;
    				}
    				if (curVolume == startVolume) {
    					throw new DiskOutOfSpaceException("Insufficient space for an additional block");
    				}
    			}
    		}

    		// getDfsUsed(), getCapacity() and getRemaining() sum the per-volume
    		// values; bodies omitted here

    		synchronized void getVolumeMap(HashMap<Block, DatanodeBlockInfo> volumeMap) {
    			for (int idx = 0; idx < volumes.length; idx++) {
    				volumes[idx].getVolumeMap(volumeMap);
    			}
    		}
    	}
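
The round-robin choice in getNextVolume can be exercised in isolation by replacing each FSVolume with a plain number for its available space. `RoundRobinVolumes` is a hypothetical stand-in, and an IllegalStateException takes the place of DiskOutOfSpaceException so the sketch has no Hadoop dependency.

```java
// Hypothetical model of FSVolumeSet.getNextVolume using plain numbers.
class RoundRobinVolumes {
    private final long[] available;   // available bytes per volume
    private int cur = 0;              // index of the next volume to try

    RoundRobinVolumes(long[] available) {
        this.available = available.clone();
    }

    // Returns the index of the next volume with more than blockSize bytes free.
    synchronized int next(long blockSize) {
        if (cur >= available.length) {
            cur = 0;
        }
        int start = cur;
        while (true) {
            int candidate = cur;
            cur = (cur + 1) % available.length;
            if (available[candidate] > blockSize) {
                return candidate;
            }
            if (cur == start) {
                // Every volume was tried once and none had room.
                throw new IllegalStateException("Insufficient space for an additional block");
            }
        }
    }

    public static void main(String[] args) {
        RoundRobinVolumes set = new RoundRobinVolumes(new long[] {10, 100, 50});
        for (int i = 0; i < 3; i++) {
            System.out.println(set.next(20));
        }
    }
}
```

Because the cursor advances before the space check, a volume that is too full is skipped without stalling the rotation, and successive blocks are spread across the remaining volumes.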



static class ActiveFile {
		final File file;
		final List<Thread> threads = new ArrayList<Thread>(2);

		ActiveFile(File f, List<Thread> list) {
			this(f, false);
			if (list != null) {
				threads.addAll(list);
			}
		}
}



  1. FSDataset manages a set of data blocks, which it does through an FSVolumeSet.
  2. FSDataset implements FSDatasetInterface, which is the DataNode's abstraction over the underlying storage.
    public class FSDataset implements FSConstants, FSDatasetInterface {
    	FSVolumeSet volumes;
    	HashMap<Block, DatanodeBlockInfo> volumeMap = new HashMap<Block, DatanodeBlockInfo>();
    	private HashMap<Block, ActiveFile> ongoingCreates = new HashMap<Block, ActiveFile>();

    	// the constructor initializes volumes and volumeMap
    	public FSDataset(DataStorage storage, Configuration conf) throws IOException {
    		FSVolume[] volArray = new FSVolume[storage.getNumStorageDirs()];
    		for (int idx = 0; idx < storage.getNumStorageDirs(); idx++) {
    			volArray[idx] = new FSVolume(storage.getStorageDir(idx)
    					.getCurrentDir(), conf);
    		}
    		volumes = new FSVolumeSet(volArray);
    		volumes.getVolumeMap(volumeMap);
    	}

    	//=================================== block-based methods: begin (signatures only) ===================================
    	public synchronized File getBlockFile(Block b) throws IOException;
    	protected File getMetaFile(Block b) throws IOException;
    	public long getMetaDataLength(Block b) throws IOException;
    	// returns the metadata as a MetaDataInputStream, which carries the stream together with its length
    	public MetaDataInputStream getMetaDataInputStream(Block b) throws IOException;
    	public InputStream getBlockInputStream(Block b) throws IOException;
    	// returns an InputStream on the block file, positioned to read from seekOffset
    	public InputStream getBlockInputStream(Block b, long seekOffset) throws IOException;
    	public BlockInputStreams getTmpInputStreams(Block b, long blkoff, long ckoff) throws IOException;
    	// when the DataNode needs to open a write stream for the block with ID 3148782637964391313,
    	// it creates tmp/blk_3148782637964391313 as the temporary data file
    	public BlockWriteStreams writeToBlock(Block b, boolean isRecovery, boolean isReplicationRequest) throws IOException;
    	// taking blk_3148782637964391313 as an example: when the DataNode commits the block with ID
    	// 3148782637964391313, it moves tmp/blk_3148782637964391313 into a directory under current
    	public void finalizeBlock(Block b) throws IOException;
    	public void updateBlock(Block oldblock, Block newblock) throws IOException;
    	public void unfinalizeBlock(Block b) throws IOException;
    	//=================================== block-based methods: end ===================================

    	// getDfsUsed(), getCapacity() and getRemaining() each delegate to the
    	// same-named method on volumes; bodies omitted here
    }
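
The write-then-finalize life cycle described in the comments on writeToBlock and finalizeBlock comes down to creating a file under tmp and later moving it into current. The sketch below models only that move: `BlockLifecycleSketch` is hypothetical, the metadata file and the subdirectory fan-out are left out, and the method names merely echo the FSDataset ones.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical model of the tmp -> current life cycle; not the Hadoop FSDataset.
class BlockLifecycleSketch {
    private final Path tmpDir;
    private final Path currentDir;

    BlockLifecycleSketch(Path storageRoot) throws IOException {
        tmpDir = Files.createDirectories(storageRoot.resolve("tmp"));
        currentDir = Files.createDirectories(storageRoot.resolve("current"));
    }

    // Counterpart of writeToBlock: the data first lands in a temporary file under tmp/.
    Path writeToBlock(long blockId, byte[] data) throws IOException {
        return Files.write(tmpDir.resolve("blk_" + blockId), data);
    }

    // Counterpart of finalizeBlock: the finished file is moved from tmp/ into current/.
    Path finalizeBlock(long blockId) throws IOException {
        String name = "blk_" + blockId;
        return Files.move(tmpDir.resolve(name), currentDir.resolve(name));
    }

    // Runs one write + finalize cycle in a fresh temporary directory and reports
    // whether the file ended up in current/ and is gone from tmp/.
    static boolean movedToCurrent() {
        try {
            Path root = Files.createTempDirectory("dataset-demo");
            BlockLifecycleSketch dataset = new BlockLifecycleSketch(root);
            Path tmpFile = dataset.writeToBlock(3148782637964391313L, new byte[] {1, 2, 3});
            Path finalFile = dataset.finalizeBlock(3148782637964391313L);
            return !Files.exists(tmpFile) && Files.exists(finalFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(movedToCurrent());
    }
}
```

Keeping unfinished data under tmp means a crash in the middle of a write never leaves a partial file among the finalized blocks in current.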



