Hadoop Data Migration Tool: DistCp

I recently needed to migrate data between two clusters, so this post records how to use DistCp.

Official documentation

1. Overview

DistCp (distributed copy) is a tool for copying data within and between large clusters. It uses MapReduce to implement file distribution, error handling and recovery, and report generation.

Because the copy work is spread across many parallel map tasks, DistCp can reach aggregate throughput on the order of terabytes per hour even on gigabit networking, which makes it very efficient for migrating big-data environments.

Note: the /etc/hosts file on the source HDFS cluster's nodes must contain entries for the hosts of the target HDFS cluster.
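For example, entries like the following would be added on each source node (the IP addresses and worker hostnames below are illustrative, not from a real deployment):

# /etc/hosts on each node of the source cluster (hypothetical addresses)
192.168.1.100   cdh-master
192.168.1.101   cdh-worker1
192.168.1.102   cdh-worker2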

2. Usage

1. Common usage

To migrate Hive data from the old cluster hdfs://hadoop-master:8020 to a brand-new cluster hdfs://cdh-master:8020, run the following on the target cluster:

Note: CDH environments expose the HDFS NameNode RPC port as 8020, while vanilla Apache Hadoop environments commonly use 9000.
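If you are unsure which port a cluster actually uses, the NameNode addresses can be read straight from the cluster configuration (a quick check, assuming shell access to a cluster node):

# Print the default filesystem URI (includes the RPC port)
hdfs getconf -confKey fs.defaultFS
# Print the NameNode HTTP address (used later for webhdfs://)
hdfs getconf -confKey dfs.namenode.http-address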

hadoop distcp <src> <dst>
hadoop distcp hdfs://hadoop-master:8020/user/hive/* hdfs://cdh-master:8020/user/hive
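After the job finishes, it is worth sanity-checking the copy by comparing total sizes and file counts on both clusters (reusing the example paths above):

# Compare total size of the copied tree on both clusters
hadoop fs -du -s -h hdfs://hadoop-master:8020/user/hive
hadoop fs -du -s -h hdfs://cdh-master:8020/user/hive
# Compare directory, file, and byte counts
hadoop fs -count hdfs://hadoop-master:8020/user/hive
hadoop fs -count hdfs://cdh-master:8020/user/hive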

2. Migrating between different Hadoop versions

When the two clusters run different Hadoop versions, use the hftp or webhdfs protocol. The source cluster is addressed as hftp://<dfs.http.address>/<path>, where dfs.http.address defaults to <namenode>:50070.

Since the newer webhdfs protocol replaced hftp, both the source and destination addresses can use the HTTP-based webhdfs protocol, which is fully compatible across versions.

hadoop distcp <src> <dst>

# Example
sudo -u hdfs hadoop distcp webhdfs://hadoop-master:50070/tmp/hive-export/ \
webhdfs://cdh-master:9870/tmp/hive-export/

Note that webhdfs:// URIs use the NameNode HTTP port (50070 on Hadoop 2.x, 9870 on Hadoop 3.x), not the RPC port.

3. With parameters

hadoop distcp -D dfs.checksum.type=CRC32 -skipcrccheck -update -m 500 \
hdfs://hadoop-master:8020/user/hive/warehouse/cp_data.db/test \
hdfs://cdh-master:8020/data/hive/cp_data.db/test

(The generic -D option must come before the DistCp-specific options, or ToolRunner will not pick it up.)
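The -skipcrccheck flag is typically needed here because Hadoop 2.x defaults to CRC32C checksums while older releases used CRC32, so cross-version checksum comparisons fail spuriously; -D dfs.checksum.type=CRC32 writes the destination files with CRC32 checksums to match the source. To spot-check a single file after copying, you can compare checksums manually (the part-00000 file name below is illustrative; the values only match when both clusters use the same checksum type and block size):

# Print the HDFS checksum of the same file on both clusters
hadoop fs -checksum hdfs://hadoop-master:8020/user/hive/warehouse/cp_data.db/test/part-00000
hadoop fs -checksum hdfs://cdh-master:8020/data/hive/cp_data.db/test/part-00000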

3. Common parameters

-update: Overwrite if the source and destination sizes differ. The sole criterion for overwriting is file size: if the source and destination file sizes differ, the source file replaces the destination file. (A typical incremental re-sync with -update is sketched after this list.)
-skipcrccheck: Skip CRC checks. Unnecessary when both clusters run the same Hadoop version; -skipcrccheck only takes effect when combined with -update.
-m: Maximum number of simultaneous copies, i.e. the number of maps used for the copy. Note that a larger number of maps does not necessarily yield higher throughput.
-i: Ignore failures. This option provides more accurate copy statistics than the default, and it preserves the logs of failed copy operations, which is useful for debugging. Finally, a failing map will not cause the job to fail before all split attempts are exhausted.
-overwrite: Overwrite the destination. If a map fails and -i is not specified, all files in that split, not just the ones that failed, are copied again.
-log <logdir>: Write logs to <logdir>. DistCp logs each attempt to copy each file as map output. If a map fails and is re-executed, the logs from the failed attempt are not preserved.
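A common pattern is to run a full copy first and then re-run DistCp with -update to pick up only missing or changed files, optionally adding -delete to remove destination files that no longer exist on the source (-delete only applies together with -update or -overwrite; the paths reuse the earlier examples):

# Incremental re-sync: copy only missing/changed files, and
# delete destination files that were removed on the source
hadoop distcp -update -delete \
hdfs://hadoop-master:8020/user/hive \
hdfs://cdh-master:8020/user/hive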

4. Option reference (from the built-in DistCp usage text):

distcp OPTIONS [source_path...] <target_path>
-append Reuse existing data in target files and append new data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB
-blocksperchunk <arg> If set to a positive value, files with more blocks than this value will be split into chunks of <blocksperchunk> blocks to be transferred in parallel, and reassembled on the destination. By default, <blocksperchunk> is 0 and the files will be transmitted in their entirety without splitting. This switch is only applicable when the source file system implements getBlockLocations method and the target file system implements concat method
-copybuffersize <arg> Size of the copy buffer to use. By default, <copybuffersize> is 8192B.
-delete Delete from target, files missing in source
-diff <arg> Use snapshot diff report to identify the difference between source and target
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied to <= n
-filters <arg> The path to a file containing a list of strings for paths to be excluded from the copy.
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs are saved
-m <arg> Max number of concurrent maps to use for copy
-mapredSslConf <arg> Configuration for ssl config file, to use with hftps://. Must be in the classpath.
-numListstatusThreads <arg> Number of threads to use for building file listing (max 40).
-overwrite Choose to overwrite target files unconditionally, even if they exist.
-p <arg> preserve status (rbugpcaxt) (replication, block-size, user, group, permission, checksum-type, ACL, XATTR, timestamps). If -p is specified with no <arg>, then preserves replication, block size, user, group, permission, checksum type and timestamps. raw.* xattrs are preserved when both the source and destination paths are in the /.reserved/raw hierarchy (HDFS only). raw.* xattr preservation is independent of the -p flag. Refer to the DistCp documentation for more details.
-rdiff <arg> Use target snapshot diff report to identify changes made on target
-sizelimit <arg> (Deprecated!) Limit number of files copied to <= n bytes
-skipcrccheck Whether to skip CRC checks between source and target paths.
-strategy <arg> Copy strategy to use. Default is dividing work based on file sizes
-tmp <arg> Intermediate work path to be used for atomic commit
-update Update target, copying only missing files or directories
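As a concrete illustration of the -p flag, its status letters can be concatenated. For example, to preserve block size, user, group, and permission during an incremental copy (reusing the earlier example paths):

# Preserve block size (b), user (u), group (g), and permission (p)
hadoop distcp -update -pbugp \
hdfs://hadoop-master:8020/user/hive \
hdfs://cdh-master:8020/user/hive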
