Mapped onto our business, these are roughly the scenarios involved:
Intermediate results from real-time computing engines
Log files
The many output files produced by the multiple reducers of a MapReduce job
Multiple files from reported/uploaded data
One option is to keep cached files and task state in third-party external storage such as Redis, holding the intermediate state of jobs there to reduce HDFS reads/writes and file creation (very impractical; ignore this option).
Another option is HAR files, or merging the files into a single file and then re-loading it into Hive.
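For the HAR route, Hive itself can archive a partition's files into a single HAR. A minimal sketch, assuming a hypothetical partitioned table db_name.table_name with a partition column dt (the names and date are placeholders, not from the original text):
SET hive.archive.enabled = true;   -- allow ALTER TABLE ... ARCHIVE in this session
ALTER TABLE db_name.table_name ARCHIVE PARTITION (dt = '2024-01-01');
-- roll back if needed:
-- ALTER TABLE db_name.table_name UNARCHIVE PARTITION (dt = '2024-01-01');
Archiving only cuts the number of objects the NameNode has to track; the data stays queryable through the HAR layer, just somewhat more slowly. For the merge-and-reload route, settings like the following let the Hive job merge its own output files: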
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 134217728;
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
INSERT OVERWRITE TABLE db_name.table_name
SELECT *
FROM db_name.table_name;
This method essentially uses a Hive job to read the data of a table or partition and overwrite it back into the same path. You must give the merge job parameters like the ones shown above to control the number and size of the files written to HDFS.
Merging the small files of a non-partitioned table:
SET mapreduce.job.reduces = <table_size_MB/256>;
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
INSERT OVERWRITE TABLE db_name.table_name
SELECT *
FROM db_name.table_name
SORT BY 1;
Merging the small files of a single table partition:
SET mapreduce.job.reduces = <table_size_MB/256>;
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
INSERT OVERWRITE TABLE db_name.table_name
PARTITION (part_col = '' )
SELECT col1, col2, ..., coln
FROM db_name.table_name
WHERE part_col = ''
SORT BY 1;
Merging the small files of a range of table partitions:
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 134217728;
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.dynamic.partition = true;
INSERT OVERWRITE TABLE db_name.table_name
PARTITION (part_col)
SELECT col1, col2, ..., coln, part_col
FROM db_name.table_name
WHERE part_col BETWEEN '' AND '' ;
After its own SQL job finishes, Hive launches a separate MapReduce task to merge the small output files.
However, this only applies to files created by Hive itself; it has no effect on data that arrives by other routes, such as Sqoop imports into a Hive table or files written directly to HDFS.
Another lever is to define the output file count and size by setting the number of reducers, which reduces the number of output files; a sketch follows.
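A minimal sketch of this approach, assuming hypothetical db_name.source_table and db_name.target_table; the byte counts are only illustrative:
-- aim for roughly 256 MB of data per reducer, so each reducer writes one reasonably sized file
SET hive.exec.reducers.bytes.per.reducer = 268435456;
-- or pin the count explicitly: SET mapreduce.job.reduces = <table_size_MB/256>;
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
INSERT OVERWRITE TABLE db_name.target_table
SELECT *
FROM db_name.source_table
DISTRIBUTE BY rand();   -- spread rows evenly across reducers so the output files come out similar in size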
The fsimage file is the NameNode's image of the filesystem metadata, generally called a checkpoint; it is a snapshot of the entire file system taken when the NameNode starts. Operations such as MapReduce jobs modify the edit log and, in turn, the fsimage, so the fsimage is effectively a directory inventory of the whole HDFS. By analyzing it we can work out how small files are distributed across HDFS.
CREATE TABLE fsimage_info_csv(
path string,
replication int,
modificationtime string,
accesstime string,
preferredblocksize bigint,
blockscount int,
filesize bigint,
nsquota string,
dsquota string,
permission string,
username string,
groupname string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://Direction_Wind/apps/hive/warehouse/Direction_Wind.db/fsimage_info_csv';
hdfs dfsadmin -fetchImage /data
hdfs oiv -i /data/fsimage_0000000003621277730 -t /temp/dir -o /data/fs_distribution -p Delimited -delimiter "," (the -t flag spills intermediate data to temporary files; without it everything is processed in memory and can easily OOM)
(hdfs oiv -p FileDistribution -i fsimage_0000000003621277730 -o fs_distribution )
hdfs dfs -put /data/fs_distribution hdfs://Direction_Wind/apps/hive/warehouse/Direction_Wind.db/fsimage_info_csv/
In Hive: MSCK REPAIR TABLE fsimage_info_csv;
hdfs oiv -p FileDistribution -i fsimage_0000000003621277730 -o fs_distribution
Overall file status at the time (3.21):
totalFiles = 64324882
totalDirectories = 3895729
totalBlocks = 62179776
totalSpace = 331986259384110
maxFileSize = 269556045187
-- Count, per five-level directory prefix, the number of "small" files (under 4194304 bytes = 4 MB)
SELECT
    dir_path,
    COUNT(*) AS small_file_num
FROM
    (SELECT
        relative_size,
        dir_path
    FROM
        (SELECT
            (CASE WHEN filesize < 4194304 THEN 'small' ELSE 'large' END) AS relative_size,
            concat('/', split(PATH,'\/')[1], '/', split(PATH,'\/')[2], '/',
                   split(PATH,'\/')[3], '/', split(PATH,'\/')[4], '/',
                   split(PATH,'\/')[5]) AS dir_path
        FROM
            Direction_Wind.fsimage_info_csv
        WHERE
            permission NOT LIKE 'd%'        -- skip directory entries
        ) t1
    WHERE
        relative_size = 'small'
    ) t2
GROUP BY
    dir_path
ORDER BY
    small_file_num DESC
LIMIT 1000;
Because the table was large, several splitting approaches were used along the way to spread the statistics work out.
from (
select `path`,`replication`,`modificationtime`,`accesstime`,`preferredblocksize`,`blockscount`,`filesize`,`nsquota`,`dsquota`,`permission`,`username`,`groupname` ,floor(rand() * 8) as part from fsimage_info_csv
) t
insert into fsimage_info_csv_pt1 select `path`,`replication`,`modificationtime`,`accesstime`,`preferredblocksize`,`blockscount`,`filesize`,`nsquota`,`dsquota`,`permission`,`username`,`groupname` where part = 0
insert into fsimage_info_csv_pt2 select `path`,`replication`,`modificationtime`,`accesstime`,`preferredblocksize`,`blockscount`,`filesize`,`nsquota`,`dsquota`,`permission`,`username`,`groupname` where part = 1
insert into fsimage_info_csv_pt3 select `path`,`replication`,`modificationtime`,`accesstime`,`preferredblocksize`,`blockscount`,`filesize`,`nsquota`,`dsquota`,`permission`,`username`,`groupname` where part = 2
insert into fsimage_info_csv_pt4 select `path`,`replication`,`modificationtime`,`accesstime`,`preferredblocksize`,`blockscount`,`filesize`,`nsquota`,`dsquota`,`permission`,`username`,`groupname` where part = 3
insert into fsimage_info_csv_pt5 select `path`,`replication`,`modificationtime`,`accesstime`,`preferredblocksize`,`blockscount`,`filesize`,`nsquota`,`dsquota`,`permission`,`username`,`groupname` where part = 4
insert into fsimage_info_csv_pt6 select `path`,`replication`,`modificationtime`,`accesstime`,`preferredblocksize`,`blockscount`,`filesize`,`nsquota`,`dsquota`,`permission`,`username`,`groupname` where part = 5
insert into fsimage_info_csv_pt7 select `path`,`replication`,`modificationtime`,`accesstime`,`preferredblocksize`,`blockscount`,`filesize`,`nsquota`,`dsquota`,`permission`,`username`,`groupname` where part = 6
insert into fsimage_info_csv_pt8 select `path`,`replication`,`modificationtime`,`accesstime`,`preferredblocksize`,`blockscount`,`filesize`,`nsquota`,`dsquota`,`permission`,`username`,`groupname` where part = 7;
insert overwrite table fsimage_info_csv_partition2 partition(pt)
select *
, concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2]) as pt
from fsimage_info_csv;
select dir_path,countn from (
select dir_path,sum(countn) as countn
from (
SELECT
dir_path
,count(*) as countn
from (
select concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2]
) as dir_path
FROM
Direction_Wind.fsimage_info_csv_pt1
where concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2] ) != '/apps/hive'
) t1
group by dir_path
union all
SELECT
dir_path
,count(*) as countn
from (
select concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2]
) as dir_path
FROM
Direction_Wind.fsimage_info_csv_pt2
where concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2] ) != '/apps/hive'
) t1
group by dir_path
union all
SELECT
dir_path
,count(*) as countn
from (
select concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2]
) as dir_path
FROM
Direction_Wind.fsimage_info_csv_pt3
where concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2] ) != '/apps/hive'
) t1
group by dir_path
union all
SELECT
dir_path
,count(*) as countn
from (
select concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2]
) as dir_path
FROM
Direction_Wind.fsimage_info_csv_pt4
where concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2] ) != '/apps/hive'
) t1
group by dir_path
union all
SELECT
dir_path
,count(*) as countn
from (
select concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2]
) as dir_path
FROM
Direction_Wind.fsimage_info_csv_pt5
where concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2] ) != '/apps/hive'
) t1
group by dir_path
union all
SELECT
dir_path
,count(*) as countn
from (
select concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2]
) as dir_path
FROM
Direction_Wind.fsimage_info_csv_pt6
where concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2] ) != '/apps/hive'
) t1
group by dir_path
union all
SELECT
dir_path
,count(*) as countn
from (
select concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2]
) as dir_path
FROM
Direction_Wind.fsimage_info_csv_pt7
where concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2] ) != '/apps/hive'
) t1
group by dir_path
union all
SELECT
dir_path
,count(*) as countn
from (
select concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2]
) as dir_path
FROM
Direction_Wind.fsimage_info_csv_pt8
where concat('/',split(PATH,'\/')[1], '/',split(PATH,'\/')[2] ) != '/apps/hive'
) t1
group by dir_path
) unionp
group by dir_path
) orderp
order by countn desc
limit 30
This step simply adds a WHERE condition on filesize to the SQL above; a sketch follows.
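A sketch of that variant, reusing the 4194304-byte (4 MB) threshold and the /apps/hive exclusion from the queries above, grouped by the first two path components:
SELECT concat('/', split(PATH,'\/')[1], '/', split(PATH,'\/')[2]) AS dir_path,
       count(*) AS small_file_num
FROM Direction_Wind.fsimage_info_csv
WHERE permission NOT LIKE 'd%'                                          -- skip directory entries
  AND filesize < 4194304                                                -- the added filesize condition
  AND concat('/', split(PATH,'\/')[1], '/', split(PATH,'\/')[2]) != '/apps/hive'
GROUP BY concat('/', split(PATH,'\/')[1], '/', split(PATH,'\/')[2])
ORDER BY small_file_num DESC
LIMIT 30;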
Sorted results with the Hive warehouse paths excluded:
One drawback of using Hive to compact the small files in a table is that, if the table contains both small and large files, the large files have to be read and rewritten to disk along with the small ones. As described in the previous section, there is no way to process only the small files in a table while leaving the large files untouched.
FileCrusher uses a MapReduce job to merge the small files in one or more directories without touching the large files. It supports tables in the following file formats:
It can also compress the merged files, whether or not the originals were compressed, which reduces the storage they occupy. By default, FileCrusher compresses its output with Snappy.
FileCrusher does not depend on Hive and does not operate at the level of Hive tables; it works directly on the HDFS data. You generally pass it the directories to be merged and the storage file format as input parameters.
To make it easier to compact Hive tables with FileCrusher, we created a wrapper script that resolves the relevant Hive table parameters and passes them to FileCrusher correctly.
The crush_partition.sh script takes a table name (or a partition) as an argument and performs the following tasks:
When FileCrusher runs, it merges and compresses the files that qualify for compaction into larger files, then replaces the original small files with the merged ones. The merged files are named in the form:
crushed_file--
The original files are not deleted; they are moved to a backup directory, whose path is printed to the terminal when the job finishes. The originals keep their absolute paths under the backup directory, so if a rollback is ever needed it is easy to work out where to copy them back from. For example, if the original small files were:
/user/hive/warehouse/prod.db/user_transactions/000000_1
/user/hive/warehouse/prod.db/user_transactions/000000_2
after merging they become a single file:
/user/hive/warehouse/prod.db/user_transactions/crushed_file-20161118102300-0-0
and the original files are moved to the backup directory, with their original paths preserved:
/user/admin/filecrush_backup/user/hive/warehouse/prod.db/user_transactions/000000_1
/user/admin/filecrush_backup/user/hive/warehouse/prod.db/user_transactions/000000_2
The full GitHub path of the crush_partition.sh script mentioned in this article is:
https://github.com/asdaraujo/filecrush/tree/master/bin
The script is invoked as follows:
Syntax: crush_partition.sh <db_name> <table_name> <partition_spec> [compression] [threshold] [max_reduces]
The parameters are:
db_name - (required) the database that holds the table
table_name - (required) the table to be compacted
partition_spec - (required) the specification of the partition to be compacted; valid values are: