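DistCp is short for Distributed Copy in the context of Apache Hadoop. It is a tool for copying large amounts of data or files within a cluster or between clusters. Under the hood, DistCp uses MapReduce to distribute and copy the data, which means the work is spread across the available nodes in the cluster. This makes it a more efficient copy tool than a copy that runs on a single node.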
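The basic syntax takes a source path and a destination path:
hadoop distcp hdfs://namenode:port/source hdfs://namenode:port/destination
For example, the following command copies the access_logs directory to a new location on the same cluster:
hadoop distcp hdfs://quickstart.cloudera:8020/user/access_logs hdfs://quickstart.cloudera:8020/user/destination_access_logs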
15/12/01 17:13:07 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://quickstart.cloudera:8020/user/access_logs], targetPath=hdfs://quickstart.cloudera:8020/user/destination_access_logs, targetPathExists=false, preserveRawXattrs=false, filtersFile='null'}
15/12/01 17:13:07 INFO client.RMProxy: Connecting to ResourceManager at /
15/12/01 17:13:08 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 2; dirCnt = 1
15/12/01 17:13:08 INFO tools.SimpleCopyListing: Build file listing completed.
15/12/01 17:13:08 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
15/12/01 17:13:08 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
15/12/01 17:13:08 INFO tools.DistCp: Number of paths in the copy list: 2
15/12/01 17:13:08 INFO tools.DistCp: Number of paths in the copy list: 2
15/12/01 17:13:08 INFO client.RMProxy: Connecting to ResourceManager at /
15/12/01 17:13:09 INFO mapreduce.JobSubmitter: number of splits:2
15/12/01 17:13:09 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1449017643353_0001
15/12/01 17:13:10 INFO impl.YarnClientImpl: Submitted application application_1449017643353_0001
15/12/01 17:13:10 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1449017643353_0001/
15/12/01 17:13:10 INFO tools.DistCp: DistCp job-id: job_1449017643353_0001
15/12/01 17:13:10 INFO mapreduce.Job: Running job: job_1449017643353_0001
15/12/01 17:13:20 INFO mapreduce.Job: Job job_1449017643353_0001 running in uber mode : false
15/12/01 17:13:20 INFO mapreduce.Job: map 0% reduce 0%
15/12/01 17:13:32 INFO mapreduce.Job: map 50% reduce 0%
15/12/01 17:13:34 INFO mapreduce.Job: map 100% reduce 0%
15/12/01 17:13:34 INFO mapreduce.Job: Job job_1449017643353_0001 completed successfully
15/12/01 17:13:35 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=228770
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=39594819
        HDFS: Number of bytes written=39593868
        HDFS: Number of read operations=28
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=7
    Job Counters
        Launched map tasks=2
        Other local map tasks=2
        Total time spent by all maps in occupied slots (ms)=20530
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=20530
        Total vcore-seconds taken by all map tasks=20530
        Total megabyte-seconds taken by all map tasks=21022720
    Map-Reduce Framework
        Map input records=2
        Map output records=0
        Input split bytes=276
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=94
        CPU time spent (ms)=1710
        Physical memory (bytes) snapshot=257175552
        Virtual memory (bytes) snapshot=3006455808
        Total committed heap usage (bytes)=121503744
    File Input Format Counters
        Bytes Read=675
    File Output Format Counters
        Bytes Written=0
Once the job completes, list the destination directory to confirm that the files were copied:
hadoop fs -ls /user/destination_access_logs
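When there are several sources, the -f flag reads the source paths from a file:
hadoop distcp -f hdfs://namenode:port/sourceListFile hdfs://namenode:port/destination
Multiple source paths can also be passed directly on the command line, with the destination as the last argument:
hadoop distcp hdfs://namenode:port/source1 hdfs://namenode:port/source2 hdfs://namenode:port/source3 hdfs://namenode:port/destination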
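Both forms above copy several sources in one job. For the -f form, the list file holds one source URI per line. The following is a minimal sketch of preparing such a file locally; the helper name write_source_list and the temporary file are illustrative assumptions, and the upload step is shown only as a comment because it needs a running cluster.

```shell
# Hypothetical helper (not part of Hadoop): write one source URI per line.
# $1 = local output file, remaining arguments = source URIs.
write_source_list() {
  out="$1"
  shift
  printf '%s\n' "$@" > "$out"
}

list_file="$(mktemp)"
write_source_list "$list_file" \
  hdfs://namenode:port/source1 \
  hdfs://namenode:port/source2 \
  hdfs://namenode:port/source3
cat "$list_file"

# On a real cluster the list would then be uploaded and passed to -f, e.g.:
#   hadoop fs -put "$list_file" /sourceListFile
#   hadoop distcp -f hdfs://namenode:port/sourceListFile hdfs://namenode:port/destination
```

Keeping the list in a file makes it easier to review or regenerate the set of sources than a long command line would.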
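The -update flag copies only the files that are missing at the destination or differ from the source:
hadoop distcp -update hdfs://namenode:port/source hdfs://namenode:port/destination
The -overwrite flag overwrites files that already exist at the destination:
hadoop distcp -overwrite hdfs://namenode:port/source hdfs://namenode:port/destination
The -i flag tells distcp to ignore failures during the copy and continue with the remaining files:
hadoop distcp -i hdfs://namenode:port/source hdfs://namenode:port/destination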
To set the maximum number of map tasks that can be allocated to a distcp run, use the -m flag:
hadoop distcp -m 5 hdfs://namenode:port/source hdfs://namenode:port/destination
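When the map limit comes from a script variable, it can help to validate it and print the resulting command before running it. The sketch below does only that; build_distcp_cmd is a hypothetical helper, not something Hadoop provides, and it prints the command rather than executing it.

```shell
# Hypothetical helper (not part of Hadoop): print the distcp command for a
# given map-task cap so it can be reviewed before it is run.
# $1 = maximum number of maps, $2 = source, $3 = destination.
build_distcp_cmd() {
  max_maps="$1"
  src="$2"
  dst="$3"
  # Reject empty values, values containing non-digit characters, and a literal 0.
  case "$max_maps" in
    ''|*[!0-9]*|0)
      echo "max maps must be a positive integer" >&2
      return 1
      ;;
  esac
  printf 'hadoop distcp -m %s %s %s\n' "$max_maps" "$src" "$dst"
}

build_distcp_cmd 5 hdfs://namenode:port/source hdfs://namenode:port/destination
```

Printing first is a deliberate choice here: a copy job can move a lot of data, so a dry run of the command line is cheap insurance.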
For example, the following command runs the same copy with at most one map task:
hadoop distcp -m 1 hdfs://quickstart.cloudera:8020/user/access_logs hdfs://quickstart.cloudera:8020/user/destination_access_logs_3
15/12/01 17:19:33 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=1, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://quickstart.cloudera:8020/user/access_logs], targetPath=hdfs://quickstart.cloudera:8020/user/destination_access_logs_3, targetPathExists=false, preserveRawXattrs=false, filtersFile='null'}
15/12/01 17:19:33 INFO client.RMProxy: Connecting to ResourceManager at /
15/12/01 17:19:34 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 2; dirCnt = 1
15/12/01 17:19:34 INFO tools.SimpleCopyListing: Build file listing completed.
15/12/01 17:19:34 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
15/12/01 17:19:34 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
15/12/01 17:19:34 INFO tools.DistCp: Number of paths in the copy list: 2
15/12/01 17:19:34 INFO tools.DistCp: Number of paths in the copy list: 2
15/12/01 17:19:34 INFO client.RMProxy: Connecting to ResourceManager at /
15/12/01 17:19:35 INFO mapreduce.JobSubmitter: number of splits:1
15/12/01 17:19:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1449017643353_0003
15/12/01 17:19:35 INFO impl.YarnClientImpl: Submitted application application_1449017643353_0003
15/12/01 17:19:35 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1449017643353_0003/
15/12/01 17:19:35 INFO tools.DistCp: DistCp job-id: job_1449017643353_0003
15/12/01 17:19:35 INFO mapreduce.Job: Running job: job_1449017643353_0003
15/12/01 17:19:44 INFO mapreduce.Job: Job job_1449017643353_0003 running in uber mode : false
15/12/01 17:19:44 INFO mapreduce.Job: map 0% reduce 0%
15/12/01 17:19:52 INFO mapreduce.Job: map 100% reduce 0%
15/12/01 17:19:52 INFO mapreduce.Job: Job job_1449017643353_0003 completed successfully
15/12/01 17:19:52 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=114389
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=39594404
        HDFS: Number of bytes written=39593868
        HDFS: Number of read operations=20
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=5
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5686
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=5686
        Total vcore-seconds taken by all map tasks=5686
        Total megabyte-seconds taken by all map tasks=5822464
    Map-Reduce Framework
        Map input records=2
        Map output records=0
        Input split bytes=138
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=45
        CPU time spent (ms)=1250
        Physical memory (bytes) snapshot=123002880
        Virtual memory (bytes) snapshot=1504280576
        Total committed heap usage (bytes)=60751872
    File Input Format Counters
        Bytes Read=398
    File Output Format Counters
        Bytes Written=0
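Comparing the two runs above, the default run launched 2 map tasks and the -m 1 run launched 1, while both report the same number of HDFS bytes written (39593868). As a rough sketch, assuming the job output was saved to a file, a small helper could pull a single counter out of the log for this kind of comparison. The function name distcp_counter and the temporary file are illustrative and not part of Hadoop; the sample lines are copied from the first run.

```shell
# Hypothetical helper: print the value of one counter from saved DistCp output.
# $1 = counter label exactly as it appears in the log, $2 = log file.
distcp_counter() {
  awk -F= -v label="$1" '
    { name = $1; sub(/^[ \t]+/, "", name) }   # counters may be indented in the log
    name == label { print $2; exit }
  ' "$2"
}

# Sample counter lines taken from the first run above.
log_file="$(mktemp)"
cat > "$log_file" <<'EOF'
HDFS: Number of bytes written=39593868
Launched map tasks=2
EOF

distcp_counter "Launched map tasks" "$log_file"             # prints 2
distcp_counter "HDFS: Number of bytes written" "$log_file"  # prints 39593868
```

Matching on the full label avoids confusing similar counters, such as the FILE and HDFS variants of "Number of bytes written".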
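In this example, we saw how to use the distcp command in Apache Hadoop to copy large amounts of data. For more help and details on the distcp command and all of its available options, check the built-in help with the following command: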
hadoop distcp