The Offline Image Viewer is a tool that dumps the contents of hdfs fsimage files to a human-readable format, in order to allow offline analysis and examination of a Hadoop cluster's namespace. The tool can process very large image files relatively quickly, converting them to one of several output formats. It handles the layout formats included with Hadoop versions 16 and up; if it is unable to process an image file, it exits cleanly. The Offline Image Viewer is entirely offline in its operation and does not require a running cluster.
The tool offers several output processors:
1 Ls is the default output processor. It closely mimics the format of the lsr command, including the same fields: directory or file flag, permissions, replication, owner, group, file size, modification date, and full path. Unlike the lsr command, the root path is included. One important difference from lsr is that this output is not sorted by directory name and contents; rather, files are listed in the order in which they are stored in the fsimage file. The output therefore cannot be directly compared against that of the lsr command. The Ls processor uses information contained within the INode blocks to calculate file sizes and ignores the -skipBlocks option.
2 Indented provides a more complete view of the fsimage's contents, including all of the information in the image, such as the image version, generation stamp, and the inode and block listings. This processor uses indentation to organize the output hierarchically. The lsr format remains better suited to quick human comprehension.
3 Delimited produces one line per file, consisting of the path, replication, modification time, access time, block size, number of blocks, file size, namespace quota, diskspace quota, permissions, username, and group name. If run against an fsimage that does not contain any of these fields, the field's column is still included, but with no data recorded. The default record delimiter is a tab, but this can be changed with the -delimiter command-line argument. This processor is designed to create output that is easily analyzed by other tools, such as Apache Pig (see the example invocation after this list). See the Analyzing Results section below for further information on using this processor to analyze the contents of fsimage files.
4 XML creates an XML document of the fsimage, including all of the information within it, much like the lsr processor. The output is amenable to automated processing and analysis with XML tools. Due to the verbosity of XML syntax, this processor also generates the largest amount of output.
5 FileDistribution is the tool for analyzing file sizes in the namespace image. To run it, one defines a range of integers [0, maxSize] by specifying maxSize and a step. The range is divided into segments [0, s[1], ..., s[n-1], maxSize], and the processor calculates how many files in the system fall into each segment [s[i-1], s[i]). Note that files larger than maxSize always fall into the very last segment. The output is a tab-separated two-column table: Size and NumFiles, where Size marks the start of a segment and NumFiles is the number of files whose size falls into that segment (see the example invocation after this list).
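For illustration, the Delimited and FileDistribution processors might be invoked as follows. This is only a sketch: the output file names are arbitrary, and the -maxSize and -step flags assumed for FileDistribution are not covered by the option table below and may differ between Hadoop versions.
bash$ bin/hdfs oiv -i fsimage -o fsimage.csv -p Delimited -delimiter ","
bash$ bin/hdfs oiv -i fsimage -o fsimage_hist.txt -p FileDistribution -maxSize 1073741824 -step 1048576
The second command, if the assumed flags are available, would bucket file sizes over [0, 1 GB] in 1 MB steps.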
The simplest usage is to specify just an input and an output file, via the -i and -o command-line switches:
bash$ bin/hdfs oiv -i fsimage -o fsimage.txt
This creates a file named fsimage.txt in the current directory using the Ls output processor. For very large image files, this may take some time.
The output processor is selected with the -p switch. For example:
bash$ bin/hdfs oiv -i fsimage -o fsimage.xml -p XML
或者
bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -p Indented
This runs the tool with the XML or Indented processor, respectively.
The -skipBlocks option prevents the tool from explicitly enumerating all of the blocks that make up each file in the namespace. This is useful for file systems with very large files. Enabling this option can significantly decrease the size of the resulting output, since individual blocks are not included. Note, however, that the Ls processor needs to enumerate the blocks, and so overrides this option.
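For instance, an XML dump without per-block detail might be produced like this (the file names are illustrative):
bash$ bin/hdfs oiv -i fsimage -o fsimage.xml -p XML -skipBlocks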
Consider the following contents of a hypothetical namespace:
drwxr-xr-x - theuser supergroup 0 2009-03-16 21:17 /anotherDir
-rw-r--r-- 3 theuser supergroup 286631664 2009-03-16 21:15 /anotherDir/biggerfile
-rw-r--r-- 3 theuser supergroup 8754 2009-03-16 21:17 /anotherDir/smallFile
drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem
drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser
drwxr-xr-x - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem
drwx-wx-wx - theuser supergroup 0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
drwxr-xr-x - theuser supergroup 0 2009-03-16 21:12 /one
drwxr-xr-x - theuser supergroup 0 2009-03-16 21:12 /one/two
drwxr-xr-x - theuser supergroup 0 2009-03-16 21:16 /user
drwxr-xr-x - theuser supergroup 0 2009-03-16 21:19 /user/theuser
Applying the Offline Image Viewer with the default Ls processor against this file results in the following output:
machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -o fsimage.txt
drwxr-xr-x - theuser supergroup 0 2009-03-16 14:16 /
drwxr-xr-x - theuser supergroup 0 2009-03-16 14:17 /anotherDir
drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem
drwxr-xr-x - theuser supergroup 0 2009-03-16 14:12 /one
drwxr-xr-x - theuser supergroup 0 2009-03-16 14:16 /user
-rw-r--r-- 3 theuser supergroup 286631664 2009-03-16 14:15 /anotherDir/biggerfile
-rw-r--r-- 3 theuser supergroup 8754 2009-03-16 14:17 /anotherDir/smallFile
drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser
drwxr-xr-x - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem
drwx-wx-wx - theuser supergroup 0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
drwxr-xr-x - theuser supergroup 0 2009-03-16 14:12 /one/two
drwxr-xr-x - theuser supergroup 0 2009-03-16 14:19 /user/theuser
Similarly, applying the Indented processor produces output that begins like this:
machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
FSImage
ImageVersion = -19
NamespaceID = 2109123098
GenerationStamp = 1003
INodes [NumInodes = 12]
Inode
INodePath =
Replication = 0
ModificationTime = 2009-03-16 14:16
AccessTime = 1969-12-31 16:00
BlockSize = 0
Blocks [NumBlocks = -1]
NSQuota = 2147483647
DSQuota = -1
Permissions
Username = theuser
GroupName = supergroup
PermString = rwxr-xr-x
...remaining output omitted...
Flag | Description |
-i|--inputFile input file | Required. The input fsimage file to process. |
-o|--outputFile output file | Required. The output filename. If the specified file already exists, it is silently overwritten. |
-p|--processor processor | Specify the image processor to apply. Currently valid options are Ls (default), XML and Indented. |
-skipBlocks | Do not enumerate individual blocks within files. This saves processing time and output file size. The Ls processor reads the blocks to determine file sizes and ignores this option. |
-printToScreen | Pipe the processor's output to the console as well as to the specified file. |
-delimiter arg | When used with the Delimited processor, replaces the default tab delimiter with the string specified by arg. |
-h|--help | Display usage information and exit. |
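As a final usage sketch, -printToScreen can be combined with any processor to mirror the output to the console (the output file name here is illustrative):
bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -printToScreen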
The Offline Image Viewer makes it easy to gather large amounts of data about the hdfs namespace. This information can then be used to explore file system usage patterns, to find specific files that match arbitrary criteria, or for other types of namespace analysis. The Delimited image processor generates output that is amenable to further processing by tools such as Apache Pig. Pig is a particularly good choice for analyzing these data, as it can handle the output generated from a small fsimage but also scales up to consume data from extremely large file systems.
The Delimited processor generates lines of text separated, by default, by tabs, including all of the fields described above. The example scripts below demonstrate how to use this output to accomplish three tasks: determine the number of files each user has created on the file system, find files that were created but never accessed, and find probable duplicates of large files by comparing file sizes.
Each of the following scripts assumes you have generated an output file named foo using the Delimited processor, and that Pig's analysis results will be stored in a directory named results.
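For example, foo could be produced like this (a sketch; the input fsimage path is illustrative):
bash$ bin/hdfs oiv -i fsimage -o foo -p Delimited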
This script processes each path within the namespace, groups them by the file owner and determines the total number of files each user owns.
numFilesOfEachUser.pig:
-- This script determines the total number of files each user has in
-- the namespace. Its output is of the form:
-- username, totalNumFiles
-- Load all of the fields from the file
A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
replication:int,
modTime:chararray,
accessTime:chararray,
blockSize:long,
numBlocks:int,
fileSize:long,
NamespaceQuota:int,
DiskspaceQuota:int,
perms:chararray,
username:chararray,
groupname:chararray);
-- Grab just the path and username
B = FOREACH A GENERATE path, username;
-- Generate the sum of the number of paths for each user
C = FOREACH (GROUP B BY username) GENERATE group, COUNT(B.path);
-- Save results
STORE C INTO '$outputFile';
This script can be run against pig with the following command:
bin/pig -x local -param inputFile=../foo -param outputFile=../results ../numFilesOfEachUser.pig
The output file's content will be similar to that below:
bart 1
lisa 16
homer 28
marge 2456
This script finds files that were created but whose access times were never changed, meaning they were never opened or viewed.
neverAccessed.pig:
-- This script generates a list of files that were created but never
-- accessed, based on their AccessTime
-- Load all of the fields from the file
A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
replication:int,
modTime:chararray,
accessTime:chararray,
blockSize:long,
numBlocks:int,
fileSize:long,
NamespaceQuota:int,
DiskspaceQuota:int,
perms:chararray,
username:chararray,
groupname:chararray);
-- Grab just the path and last time the file was accessed
B = FOREACH A GENERATE path, accessTime;
-- Drop all the paths that don't have the default assigned last-access time
C = FILTER B BY accessTime == '1969-12-31 16:00';
-- Drop the accessTimes, since they're all the same
D = FOREACH C GENERATE path;
-- Save results
STORE D INTO '$outputFile';
This script can be run against pig with the following command and its output file's content will be a list of files that were created but never viewed afterwards.
bin/pig -x local -param inputFile=../foo -param outputFile=../results ../neverAccessed.pig
This script groups files together based on their size, drops any that are less than 100 MB, and returns a list of the file size, the number of files found, and a tuple of the file paths. This can be used to find likely duplicates within the filesystem namespace.
probableDuplicates.pig:
-- This script finds probable duplicate files greater than 100 MB by
-- grouping together files based on their byte size. Files of this size
-- with exactly the same number of bytes can be considered probable
-- duplicates, but should be checked further, either by comparing the
-- contents directly or by another proxy, such as a hash of the contents.
-- The script's output is of the form:
-- fileSize numProbableDuplicates {(probableDup1), (probableDup2)}
-- Load all of the fields from the file
A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
replication:int,
modTime:chararray,
accessTime:chararray,
blockSize:long,
numBlocks:int,
fileSize:long,
NamespaceQuota:int,
DiskspaceQuota:int,
perms:chararray,
username:chararray,
groupname:chararray);
-- Grab the pathname and filesize
B = FOREACH A generate path, fileSize;
-- Drop files smaller than 100 MB
C = FILTER B by fileSize > 100L * 1024L * 1024L;
-- Gather all the files of the same byte size
D = GROUP C by fileSize;
-- Generate path, num of duplicates, list of duplicates
E = FOREACH D generate group AS fileSize, COUNT(C) as numDupes, C.path AS files;
-- Drop all the files where there are only one of them
F = FILTER E by numDupes > 1L;
-- Sort by the size of the files
G = ORDER F by fileSize;
-- Save results
STORE G INTO '$outputFile';
This script can be run against pig with the following command:
bin/pig -x local -param inputFile=../foo -param outputFile=../results ../probableDuplicates.pig
The output file's content will be similar to that below:
1077288632 2 {(/user/tennant/work1/part-00501),(/user/tennant/work1/part-00993)}
1077288664 4 {(/user/tennant/work0/part-00567),(/user/tennant/work0/part-03980),(/user/tennant/work1/part-00725),(/user/eccelston/output/part-03395)}
1077288668 3 {(/user/tennant/work0/part-03705),(/user/tennant/work0/part-04242),(/user/tennant/work1/part-03839)}
1077288698 2 {(/user/tennant/work0/part-00435),(/user/eccelston/output/part-01382)}
1077288702 2 {(/user/tennant/work0/part-03864),(/user/eccelston/output/part-03234)}
Each line includes the file size in bytes that was found to be duplicated, the number of duplicates found, and a list of the duplicated paths. Files less than 100MB are ignored, providing a reasonable likelihood that files of these exact sizes may be duplicates.
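To confirm that a group of same-sized files really are duplicates, their contents can be compared via a hash. A minimal sketch using hadoop fs -cat and md5sum, with paths taken from the sample output above:
bash$ bin/hadoop fs -cat /user/tennant/work1/part-00501 | md5sum
bash$ bin/hadoop fs -cat /user/tennant/work1/part-00993 | md5sum
Matching digests indicate the files are almost certainly identical.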