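The problem arose because the storage disks on the cluster's data nodes differ in size, so after some time in use the smaller disks ran short of space.
In fact, configuring a volume choosing policy early on would have solved this. Some posts online claim the policy has no effect, but it does work in Hadoop 2.0.1, the version used in CDH4.6.
To find the exact place in the code, I consulted the following Hadoop design notes.
Reference: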
http://blog.csdn.net/chenpingbupt/article/details/7972589
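The document states:
On a DN's disk, each DN has three directories: current, tmp and rbw. current holds finalized replicas, tmp holds temporary replicas, and rbw holds rbw, rwr and rur replicas. When a replica is first created at the request of a DFS client, it is placed in rbw. When it is first created during block replication or cluster balancing, it is placed in tmp. Once a replica is finalized, it is moved to current. When a DN restarts, the replicas in tmp are deleted, those in rbw are loaded in the rwr state, and those in current are loaded in the finalized state.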
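The directory rules quoted above can be summed up in a small sketch. This is only an illustration of the rules as described: the class, enum and method names are hypothetical and are not Hadoop classes.

```java
// Hypothetical sketch, not Hadoop code: encodes the directory rules
// described in the design notes.
class ReplicaDirSketch {
  // How a replica came to be created on this DataNode.
  enum Origin { CLIENT_WRITE, REPLICATION, BALANCER }

  // Directory in which a newly created replica is placed.
  static String initialDir(Origin origin) {
    return origin == Origin.CLIENT_WRITE ? "rbw" : "tmp";
  }

  // What happens to a replica found in the given directory after a restart.
  static String afterRestart(String dir) {
    switch (dir) {
      case "tmp":
        return "deleted";
      case "rbw":
        return "loaded as rwr";
      case "current":
        return "loaded as finalized";
      default:
        throw new IllegalArgumentException("unknown directory: " + dir);
    }
  }

  public static void main(String[] args) {
    for (Origin o : Origin.values()) {
      System.out.println(o + " -> " + initialDir(o));
    }
    System.out.println("rbw after restart: " + afterRestart("rbw"));
  }
}
```

Either way, the replica ends up in current once it is finalized, so the volume chosen at creation time decides which disk fills up.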
So we start from the point where the tmp or rbw files are created.
1. See the Java class BlockPoolSlice
/**
 * A block pool slice represents a portion of a block pool stored on a volume.
 * Taken together, all BlockPoolSlices sharing a block pool ID across a
 * cluster represent a single block pool.
 *
 * This class is synchronized by {@link FsVolumeImpl}.
 */
class BlockPoolSlice {
  private final String bpid;
  private final FsVolumeImpl volume; // volume to which this BlockPool belongs to
  private final File currentDir; // StorageDirectory/current/bpid/current
  private final LDir finalizedDir; // directory store Finalized replica
  private final File rbwDir; // directory store RBW replica
  private final File tmpDir; // directory store Temporary replica

  /**
   * Temporary files. They get moved to the finalized block directory when
   * the block is finalized.
   */
  File createTmpFile(Block b) throws IOException {
    File f = new File(tmpDir, b.getBlockName());
    return DatanodeUtil.createTmpFile(b, f);
  }

  /**
   * RBW files. They get moved to the finalized block directory when
   * the block is finalized.
   */
  File createRbwFile(Block b) throws IOException {
    File f = new File(rbwDir, b.getBlockName());
    return DatanodeUtil.createTmpFile(b, f);
  }
2. The implementation of DatanodeUtil.createTmpFile
/** Provide utility methods for Datanode. */
@InterfaceAudience.Private
public class DatanodeUtil {
  public static final String UNLINK_BLOCK_SUFFIX = ".unlinked";

  public static final String DISK_ERROR = "Possible disk error: ";

  /** Get the cause of an I/O exception if caused by a possible disk error
   * @param ioe an I/O exception
   * @return cause if the I/O exception is caused by a possible disk error;
   *         null otherwise.
   */
  static IOException getCauseIfDiskError(IOException ioe) {
    if (ioe.getMessage() != null && ioe.getMessage().startsWith(DISK_ERROR)) {
      return (IOException) ioe.getCause();
    } else {
      return null;
    }
  }

  /**
   * Create a new file.
   * @throws IOException
   * if the file already exists or if the file cannot be created.
   */
  public static File createTmpFile(Block b, File f) throws IOException {
    if (f.exists()) {
      throw new IOException("Failed to create temporary file for " + b
          + ". File " + f + " should not be present, but is.");
    }
    // Create the zero-length temp file
    final boolean fileCreated;
    try {
      fileCreated = f.createNewFile();
    } catch (IOException ioe) {
      throw new IOException(DISK_ERROR + "Failed to create " + f, ioe);
    }
    if (!fileCreated) {
      throw new IOException("Failed to create temporary file for " + b
          + ". File " + f + " should be creatable, but is already present.");
    }
    return f;
  }
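When this method is called to create a data block, the storage-path selection policy we care about is nowhere to be seen: the directory has already been decided by the caller.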
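To make that concrete, here is a minimal standalone sketch of the only thing the helper relies on, java.io.File.createNewFile, run against a directory the caller has already picked. The class name, the scratch directory and the file name blk_1 are made up for the illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical sketch: the caller hands over a directory, and file creation
// itself involves no choice between disks.
class CreateNewFileSketch {
  // Makes a throwaway directory standing in for an rbw or tmp directory.
  static File scratchDir() {
    try {
      File dir = Files.createTempDirectory("rbw-sketch").toFile();
      dir.deleteOnExit();
      return dir;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Calls createNewFile twice on dir/name and returns both results.
  static boolean[] createTwice(File dir, String name) {
    File f = new File(dir, name);
    try {
      boolean first = f.createNewFile();   // true: the file did not exist yet
      boolean second = f.createNewFile();  // false: the file is already there
      f.deleteOnExit();
      return new boolean[] { first, second };
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] r = createTwice(scratchDir(), "blk_1");
    System.out.println("first=" + r[0] + ", second=" + r[1]);
  }
}
```

The second call returning false matches the "should be creatable, but is already present" branch in the helper above.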
3. Next, let's find where createRbwFile is called
/**************************************************
 * FSDataset manages a set of data blocks. Each block
 * has a unique name and an extent on disk.
 *
 ***************************************************/
@InterfaceAudience.Private
class FsDatasetImpl implements FsDatasetSpi<FsVolumeImpl> {
  static final Log LOG = LogFactory.getLog(FsDatasetImpl.class);

  @Override // FsDatasetSpi
  public synchronized ReplicaInPipeline createRbw(ExtendedBlock b)
      throws IOException {
    ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(),
        b.getBlockId());
    if (replicaInfo != null) {
      throw new ReplicaAlreadyExistsException("Block " + b +
          " already exists in state " + replicaInfo.getState() +
          " and thus cannot be created.");
    }
    // create a new block
    FsVolumeImpl v = volumes.getNextVolume(b.getNumBytes());
    // create a rbw file to hold block in the designated volume
    File f = v.createRbwFile(b.getBlockPoolId(), b.getLocalBlock());
    ReplicaBeingWritten newReplicaInfo = new ReplicaBeingWritten(b.getBlockId(),
        b.getGenerationStamp(), v, f.getParentFile());
    volumeMap.add(b.getBlockPoolId(), newReplicaInfo);
    return newReplicaInfo;
  }
Here we find the volumes we care about: these are the configured storage paths.
4. How volumes is initialized
volumes is initialized in the constructor, from volArray
  /**
   * An FSDataset has a directory where it loads its data files.
   */
  FsDatasetImpl(DataNode datanode, DataStorage storage, Configuration conf
      ) throws IOException {
    this.datanode = datanode;
    // The number of volumes required for operation is the total number
    // of volumes minus the number of failed volumes we can tolerate.
    final int volFailuresTolerated =
        conf.getInt(DFSConfigKeys.DFS_DATANODE_FAILED_VOLUMES_TOLERATED_KEY,
            DFSConfigKeys.DFS_DATANODE_FAILED_VOLUMES_TOLERATED_DEFAULT);

    String[] dataDirs = conf.getTrimmedStrings(DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY);

    int volsConfigured = (dataDirs == null) ? 0 : dataDirs.length;
    int volsFailed = volsConfigured - storage.getNumStorageDirs();
    this.validVolsRequired = volsConfigured - volFailuresTolerated;

    if (volFailuresTolerated < 0 || volFailuresTolerated >= volsConfigured) {
      throw new DiskErrorException("Invalid volume failure "
          + " config value: " + volFailuresTolerated);
    }
    if (volsFailed > volFailuresTolerated) {
      throw new DiskErrorException("Too many failed volumes - "
          + "current valid volumes: " + storage.getNumStorageDirs()
          + ", volumes configured: " + volsConfigured
          + ", volumes failed: " + volsFailed
          + ", volume failures tolerated: " + volFailuresTolerated);
    }

    final List<FsVolumeImpl> volArray = new ArrayList<FsVolumeImpl>(
        storage.getNumStorageDirs());
    for (int idx = 0; idx < storage.getNumStorageDirs(); idx++) {
      final File dir = storage.getStorageDir(idx).getCurrentDir();
      volArray.add(new FsVolumeImpl(this, storage.getStorageID(), dir, conf));
      LOG.info("Added volume - " + dir);
    }
    volumeMap = new ReplicaMap(this);

    @SuppressWarnings("unchecked")
    final VolumeChoosingPolicy<FsVolumeImpl> blockChooserImpl =
        ReflectionUtils.newInstance(conf.getClass(
            DFSConfigKeys.DFS_DATANODE_FSDATASET_VOLUME_CHOOSING_POLICY_KEY,
            RoundRobinVolumeChoosingPolicy.class,
            VolumeChoosingPolicy.class), conf);
    volumes = new FsVolumeList(volArray, volsFailed, blockChooserImpl);
    volumes.getVolumeMap(volumeMap);

    File[] roots = new File[storage.getNumStorageDirs()];
    for (int idx = 0; idx < storage.getNumStorageDirs(); idx++) {
      roots[idx] = storage.getStorageDir(idx).getCurrentDir();
    }
    asyncDiskService = new FsDatasetAsyncDiskService(datanode, roots);
    registerMBean(storage.getStorageID());
  }
    final List<FsVolumeImpl> volArray = new ArrayList<FsVolumeImpl>(
        storage.getNumStorageDirs());
    for (int idx = 0; idx < storage.getNumStorageDirs(); idx++) {
      final File dir = storage.getStorageDir(idx).getCurrentDir();
      volArray.add(new FsVolumeImpl(this, storage.getStorageID(), dir, conf));
      LOG.info("Added volume - " + dir);
    }
At this point we have found the storage paths we need; from here it is much easier to find how one of them gets chosen.
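One detail in the constructor is worth spelling out: the policy class is read from configuration and falls back to RoundRobinVolumeChoosingPolicy when the key is not set. The sketch below mimics that lookup with plain java.util.Properties and reflection. It is an assumption-laden stand-in: PolicySketch and the two policy classes are hypothetical, and Hadoop's own Configuration and ReflectionUtils are not used here.

```java
import java.util.Properties;

// Hypothetical stand-in for Hadoop's VolumeChoosingPolicy; not the real interface.
interface PolicySketch {
  String name();
}

class RoundRobinPolicySketch implements PolicySketch {
  public String name() { return "round-robin"; }
}

class AvailableSpacePolicySketch implements PolicySketch {
  public String name() { return "available-space"; }
}

class PolicyLoaderSketch {
  // Property name taken from the configuration notes in this post.
  static final String KEY = "dfs.datanode.fsdataset.volume.choosing.policy";

  // Instantiates the configured policy, or the round-robin one if unset.
  static PolicySketch load(Properties conf) {
    String className = conf.getProperty(KEY, RoundRobinPolicySketch.class.getName());
    try {
      return Class.forName(className)
          .asSubclass(PolicySketch.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot load policy " + className, e);
    }
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(load(conf).name()); // key unset: falls back to the default
    conf.setProperty(KEY, AvailableSpacePolicySketch.class.getName());
    System.out.println(load(conf).name());
  }
}
```

So a DataNode whose configuration never mentions the key will quietly use round robin.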
5. Path selection starts from getNextVolume
class FsVolumeList {
  /**
   * Read access to this unmodifiable list is not synchronized.
   * This list is replaced on modification holding "this" lock.
   */
  volatile List<FsVolumeImpl> volumes = null;

  private final VolumeChoosingPolicy<FsVolumeImpl> blockChooser;
  private volatile int numFailedVolumes;

  FsVolumeList(List<FsVolumeImpl> volumes, int failedVols,
      VolumeChoosingPolicy<FsVolumeImpl> blockChooser) {
    this.volumes = Collections.unmodifiableList(volumes);
    this.blockChooser = blockChooser;
    this.numFailedVolumes = failedVols;
  }

  int numberOfFailedVolumes() {
    return numFailedVolumes;
  }

  /**
   * Get next volume. Synchronized to ensure {@link #curVolume} is updated
   * by a single thread and next volume is chosen with no concurrent
   * update to {@link #volumes}.
   * @param blockSize free space needed on the volume
   * @return next volume to store the block in.
   */
  synchronized FsVolumeImpl getNextVolume(long blockSize) throws IOException {
    return blockChooser.chooseVolume(volumes, blockSize);
  }
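Since getNextVolume only delegates to the policy, the default round-robin choice is what produced the original symptom. The simulation below is a simplified sketch of round-robin placement, not the Hadoop RoundRobinVolumeChoosingPolicy source: capacities and block sizes are in arbitrary units and all names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical simulation of round-robin block placement across volumes
// of different capacity.
class RoundRobinFillSketch {
  // Places `blocks` equal-sized blocks round robin and returns the space
  // used on each volume.
  static long[] fill(long[] capacity, long blockSize, int blocks) {
    long[] used = new long[capacity.length];
    int cur = 0;
    for (int b = 0; b < blocks; b++) {
      // Skip volumes that cannot hold the block; give up after a full lap.
      int tried = 0;
      while (capacity[cur] - used[cur] < blockSize) {
        cur = (cur + 1) % capacity.length;
        if (++tried == capacity.length) {
          throw new IllegalStateException("out of space on every volume");
        }
      }
      used[cur] += blockSize;
      cur = (cur + 1) % capacity.length;
    }
    return used;
  }

  public static void main(String[] args) {
    long[] capacity = { 100, 1000 };
    long[] used = fill(capacity, 10, 20);
    // Both volumes received 10 blocks, so the small one is already full.
    System.out.println(Arrays.toString(used)); // [100, 100]
  }
}
```

Each volume gets the same number of blocks regardless of its size, so the small disk reaches 100% while the large one is at 10%.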
6. Following on, chooseVolume is called on blockChooser, whose type is VolumeChoosingPolicy; the method is implemented in the class below:
/**
 * A DN volume choosing policy which takes into account the amount of free
 * space on each of the available volumes when considering where to assign a
 * new replica allocation. By default this policy prefers assigning replicas to
 * those volumes with more available free space, so as to over time balance the
 * available space of all the volumes within a DN.
 */
public class AvailableSpaceVolumeChoosingPolicy<V extends FsVolumeSpi>
    implements VolumeChoosingPolicy<V>, Configurable {

  private static final Log LOG = LogFactory.getLog(AvailableSpaceVolumeChoosingPolicy.class);

  private static final Random RAND = new Random();

  private long balancedSpaceThreshold = DFS_DATANODE_AVAILABLE_SPACE_VOLUME_CHOOSING_POLICY_BALANCED_SPACE_THRESHOLD_DEFAULT;
  private float balancedPreferencePercent = DFS_DATANODE_AVAILABLE_SPACE_VOLUME_CHOOSING_POLICY_BALANCED_SPACE_PREFERENCE_FRACTION_DEFAULT;
7. The policy itself is implemented like this:
  @Override
  public synchronized V chooseVolume(List<V> volumes,
      final long replicaSize) throws IOException {
    if (volumes.size() < 1) {
      throw new DiskOutOfSpaceException("No more available volumes");
    }

    AvailableSpaceVolumeList volumesWithSpaces =
        new AvailableSpaceVolumeList(volumes);

    if (volumesWithSpaces.areAllVolumesWithinFreeSpaceThreshold()) {
      // If they're actually not too far out of whack, fall back on pure round
      // robin.
      V volume = roundRobinPolicyBalanced.chooseVolume(volumes, replicaSize);
      if (LOG.isDebugEnabled()) {
        LOG.debug("All volumes are within the configured free space balance " +
            "threshold. Selecting " + volume + " for write of block size " +
            replicaSize);
      }
      return volume;
    } else {
      V volume = null;
      // If none of the volumes with low free space have enough space for the
      // replica, always try to choose a volume with a lot of free space.
      long mostAvailableAmongLowVolumes = volumesWithSpaces
          .getMostAvailableSpaceAmongVolumesWithLowAvailableSpace();

      List<V> highAvailableVolumes = extractVolumesFromPairs(
          volumesWithSpaces.getVolumesWithHighAvailableSpace());
      List<V> lowAvailableVolumes = extractVolumesFromPairs(
          volumesWithSpaces.getVolumesWithLowAvailableSpace());

      float preferencePercentScaler =
          (highAvailableVolumes.size() * balancedPreferencePercent) +
          (lowAvailableVolumes.size() * (1 - balancedPreferencePercent));
      float scaledPreferencePercent =
          (highAvailableVolumes.size() * balancedPreferencePercent) /
          preferencePercentScaler;
      if (mostAvailableAmongLowVolumes < replicaSize ||
          RAND.nextFloat() < scaledPreferencePercent) {
        volume = roundRobinPolicyHighAvailable.chooseVolume(
            highAvailableVolumes,
            replicaSize);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Volumes are imbalanced. Selecting " + volume +
              " from high available space volumes for write of block size "
              + replicaSize);
        }
      } else {
        volume = roundRobinPolicyLowAvailable.chooseVolume(
            lowAvailableVolumes,
            replicaSize);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Volumes are imbalanced. Selecting " + volume +
              " from low available space volumes for write of block size "
              + replicaSize);
        }
      }
      return volume;
    }
  }
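The scaling step in the imbalanced branch is easier to follow with numbers. The sketch below lifts just the two arithmetic expressions from chooseVolume into a standalone method; the class and method names are mine, and the volume counts are made-up inputs.

```java
// Hypothetical helper reproducing the scaledPreferencePercent arithmetic
// from chooseVolume, with the list sizes passed in as plain counts.
class ScaledPreferenceSketch {
  // Probability that a new replica goes to the "high available space" group.
  static float scaledPreference(int highCount, int lowCount, float fraction) {
    float scaler = (highCount * fraction) + (lowCount * (1 - fraction));
    return (highCount * fraction) / scaler;
  }

  public static void main(String[] args) {
    float fraction = 0.75f; // default preference fraction from the config notes
    // One roomy volume against three nearly full ones: 0.75 / 1.5
    System.out.println(scaledPreference(1, 3, fraction)); // 0.5
    // Two against two: 1.5 / 2.0
    System.out.println(scaledPreference(2, 2, fraction)); // 0.75
    // Three roomy volumes against one nearly full: 2.25 / 2.5
    System.out.println(scaledPreference(3, 1, fraction));
  }
}
```

With equal group sizes the probability equals the configured fraction; when the roomy group is outnumbered, the group as a whole is picked less often, although each roomy volume is still favored over each nearly full one.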
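This took nearly 3 days. Reading through the code by eye alone is really tiring; it would have been much easier if I could have stepped through it in a debugger.
Related configuration notes.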
Property: dfs.datanode.fsdataset.volume.choosing.policy

Property: dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold
Default: 10737418240
Description: Only used when the dfs.datanode.fsdataset.volume.choosing.policy is set to org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy. This setting controls how much DN volumes are allowed to differ in terms of bytes of free disk space before they are considered imbalanced. If the free space of all the volumes are within this range of each other, the volumes will be considered balanced and block assignments will be done on a pure round robin basis.

Property: dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction
Default: 0.75f
Description: Only used when the dfs.datanode.fsdataset.volume.choosing.policy is set to org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy. This setting controls what percentage of new block allocations will be sent to volumes with more available disk space than others. This setting should be in the range 0.0 - 1.0, though in practice 0.5 - 1.0, since there should be no reason to prefer that volumes with less available disk space receive more block allocations.
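To get a feel for the balanced-space threshold, here is a rough sketch of the check it describes: compare the largest and smallest free space and see whether the gap fits within the threshold. This is my reading of the description, not the Hadoop source; the names are hypothetical, and whether the real comparison includes the boundary value is an assumption.

```java
// Hypothetical sketch of the "are all volumes within the threshold" test.
class BalancedThresholdSketch {
  // Default from the configuration notes: 10737418240 bytes = 10 GiB.
  static final long DEFAULT_THRESHOLD = 10737418240L;

  // True if the free space of all volumes differs by at most `threshold` bytes.
  // Treating the boundary as balanced is an assumption of this sketch.
  static boolean withinThreshold(long[] freeBytes, long threshold) {
    if (freeBytes.length == 0) {
      return true; // nothing to compare
    }
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long f : freeBytes) {
      min = Math.min(min, f);
      max = Math.max(max, f);
    }
    return max - min <= threshold;
  }

  public static void main(String[] args) {
    long gib = 1L << 30;
    // 5 GiB apart: balanced, so plain round robin would be used.
    System.out.println(withinThreshold(new long[] { 50 * gib, 55 * gib }, DEFAULT_THRESHOLD));
    // 495 GiB apart: imbalanced, so the roomier volume would be preferred.
    System.out.println(withinThreshold(new long[] { 5 * gib, 500 * gib }, DEFAULT_THRESHOLD));
  }
}
```

With disks of very different sizes, the free-space gap is usually far above 10 GiB, so the available-space branch is the one that applies.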
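Also attached are analyses of some other classes:
FSDataset: all operations related to data blocks live in the FSDataset family of classes. For a detailed analysis see http://caibinbupt.iteye.com/blog/284365
DataXceiverServer: the server that handles streaming reads and writes of data blocks; the processing logic is carried out by DataXceiver. For a detailed analysis see http://caibinbupt.iteye.com/blog/284979
DataXceiver: the thread that handles streaming reads and writes of data blocks. For a detailed analysis see http://caibinbupt.iteye.com/blog/284979
There are also the less common flows that handle operations other than reads and writes. For a detailed analysis see http://caibinbupt.iteye.com/blog/286533
BlockReceiver: performs the streaming write of a data block. For a detailed analysis see http://caibinbupt.iteye.com/blog/286259
BlockSender: performs the streaming read of a data block.
DataBlockScanner: periodically verifies the data block files. For a detailed analysis see http://caibinbupt.iteye.com/blog/286650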