Using the hadoop distcp command

hadoop distcp -update -skipcrccheck -m $num_map  $old_table_location  $new_table_location

A brief introduction: http://blog.csdn.net/stark_summer/article/details/45869945
How do we copy table data between two clusters?

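  1. Copy the table structure.
  2. Get the Location of the old table, then the Location of the new table, and copy the data between them with the hadoop distcp command shown at the top.
  3. Run msck repair table new_table to repair the new table's partition metadata (required for partitioned tables).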
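The three steps can be sketched as a small dry-run script. This is only an illustration: plan_table_copy is a hypothetical helper, the map count of 20 is an arbitrary example value, and the original does not show how the table structure was copied, so CREATE TABLE ... LIKE is an assumption that only works when one metastore can see both tables. The function prints the commands instead of running them.

```shell
# Dry-run sketch of the three-step table copy: prints commands, runs nothing.
# Arguments: $1 old table, $2 new table, $3 old Location,
#            $4 new Location, $5 maximum number of map tasks.
plan_table_copy() {
  # Step 1: copy the table structure (assumption: CREATE TABLE ... LIKE).
  echo "hive -e 'CREATE TABLE $2 LIKE $1'"
  # Step 2: copy the files. With -update, the contents of the source
  # directory are synchronized into the target directory.
  echo "hadoop distcp -update -skipcrccheck -m $5 $3 $4"
  # Step 3: register partitions found on disk (partitioned tables only).
  echo "hive -e 'MSCK REPAIR TABLE $2'"
}

plan_table_copy fdm.t441 fdm.t444 \
  hdfs://hadoop:9000/warehouse/fdm.db/t441 \
  hdfs://hadoop60:9000/warehouse/fdm.db/t444 \
  20
```

Printing the plan first makes it easy to check the two Locations before any data moves.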
Here are the steps in practice:
hive> select *
    > from t441;
OK
30      beijing dongdong        man
40      shanghai        lisi    woman
Time taken: 0.078 seconds
hive> desc formatted t441;
OK
# col_name              data_type               comment             
                 
id                      int                     None                
city                    string                  None                
name                    string                  None                
sex                     string                  None                
                 
# Detailed Table Information             
Database:               fdm                      
Owner:                  root                     
CreateTime:             Mon May 01 09:09:36 PDT 2017     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop:9000/warehouse/fdm.db/t441         
Table Type:             MANAGED_TABLE            

From the output above we can read the old table's Location.
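When this lookup needs to be scripted, the Location line can be pulled out of the desc formatted output with awk. A minimal sketch, assuming the output keeps the "Location:" label in the first column as shown above; extract_location is a hypothetical helper name, and the sample lines are copied from the output above rather than read from a live Hive client.

```shell
# Extract the Location value from `desc formatted` output read on stdin.
extract_location() {
  # Fields are split on runs of spaces or tabs; stop at the first match.
  awk '$1 == "Location:" { print $2; exit }'
}

# Sample input taken from the output above. On a real client this would be:
#   hive -e 'desc formatted fdm.t441' | extract_location
printf '%s\n' \
  'Owner:                  root                     ' \
  'Location:               hdfs://hadoop:9000/warehouse/fdm.db/t441         ' \
  'Table Type:             MANAGED_TABLE            ' | extract_location
```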
Next, we get the new table's Location.

hive> desc formatted t444;
OK
# col_name              data_type               comment             
                 
id                      int                     None                
city                    string                  None                
name                    string                  None                
sex                     string                  None                
                 
# Detailed Table Information             
Database:               fdm                      
Owner:                  root                     
CreateTime:             Mon May 01 09:56:57 PDT 2017     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop60:9000/warehouse/fdm.db/t444       
Table Type:             MANAGED_TABLE  

Finally, we copy the data:

[root@hadoop local]# hadoop distcp -update -skipcrccheck hdfs://hadoop:9000/warehouse/fdm.db/t441  hdfs://hadoop60:9000/warehouse/fdm.db/t444 ; 
17/05/01 10:09:10 INFO tools.DistCp: srcPaths=[hdfs://hadoop:9000/warehouse/fdm.db/t441]
17/05/01 10:09:10 INFO tools.DistCp: destPath=hdfs://hadoop60:9000/warehouse/fdm.db/t444
17/05/01 10:09:10 INFO tools.DistCp: sourcePathsCount=2
17/05/01 10:09:10 INFO tools.DistCp: filesToCopyCount=1
17/05/01 10:09:10 INFO tools.DistCp: bytesToCopyCount=47.0
17/05/01 10:09:11 INFO mapred.JobClient: Running job: job_201705010710_0010
17/05/01 10:09:12 INFO mapred.JobClient:  map 0% reduce 0%
17/05/01 10:09:17 INFO mapred.JobClient:  map 100% reduce 0%
17/05/01 10:09:17 INFO mapred.JobClient: Job complete: job_201705010710_0010
17/05/01 10:09:17 INFO mapred.JobClient: Counters: 22
17/05/01 10:09:17 INFO mapred.JobClient:   Map-Reduce Framework
17/05/01 10:09:17 INFO mapred.JobClient:     Spilled Records=0
17/05/01 10:09:17 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=289374208
17/05/01 10:09:17 INFO mapred.JobClient:     Map input records=1
17/05/01 10:09:17 INFO mapred.JobClient:     SPLIT_RAW_BYTES=152
17/05/01 10:09:17 INFO mapred.JobClient:     Map output records=0
17/05/01 10:09:17 INFO mapred.JobClient:     Physical memory (bytes) snapshot=38797312
17/05/01 10:09:17 INFO mapred.JobClient:     Map input bytes=130
17/05/01 10:09:17 INFO mapred.JobClient:     CPU time spent (ms)=130
17/05/01 10:09:17 INFO mapred.JobClient:     Total committed heap usage (bytes)=16252928
17/05/01 10:09:17 INFO mapred.JobClient:   distcp
17/05/01 10:09:17 INFO mapred.JobClient:     Bytes copied=47
17/05/01 10:09:17 INFO mapred.JobClient:     Bytes expected=47
17/05/01 10:09:17 INFO mapred.JobClient:     Files copied=1
17/05/01 10:09:17 INFO mapred.JobClient:   File Input Format Counters 
17/05/01 10:09:17 INFO mapred.JobClient:     Bytes Read=230
17/05/01 10:09:17 INFO mapred.JobClient:   FileSystemCounters
17/05/01 10:09:17 INFO mapred.JobClient:     HDFS_BYTES_READ=429
17/05/01 10:09:17 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=53786
17/05/01 10:09:17 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=47
17/05/01 10:09:17 INFO mapred.JobClient:   File Output Format Counters 
17/05/01 10:09:17 INFO mapred.JobClient:     Bytes Written=0
17/05/01 10:09:17 INFO mapred.JobClient:   Job Counters 
17/05/01 10:09:17 INFO mapred.JobClient:     Launched map tasks=1
17/05/01 10:09:17 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
17/05/01 10:09:17 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
17/05/01 10:09:17 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4939
17/05/01 10:09:17 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0

We then check the data in the new table:

hive> select *  
    > from t444;
OK
30      beijing dongdong        man
40      shanghai        lisi    woman
Time taken: 0.069 seconds

OK, the data has been copied successfully.

Of course, the source table and the target table can have different names:

hadoop  distcp  -m 200  -pb  -bandwidth 40 -update -delete  hdfs://10.198.21.227:8020/user/mart_mobile/app.db/app_jxkh_county_tmp    hdfs://172.21.2.137:8020/user/mart_cfo/app.db/app_jxkh_county
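The flags in this command are easier to read one at a time. The sketch below rebuilds the same command with one comment per flag and only prints it; print_sync_cmd is a hypothetical helper name, and the flag descriptions are a summary of DistCp's options that should be checked against the usage text of the installed Hadoop version.

```shell
# Print (not run) a distcp synchronization command with the flags used above.
# Arguments: $1 max map tasks, $2 per-map bandwidth in MB/s,
#            $3 source path, $4 target path.
print_sync_cmd() {
  maps="$1"; bandwidth="$2"; src="$3"; dst="$4"
  set -- hadoop distcp
  set -- "$@" -m "$maps"              # upper bound on simultaneous map tasks
  set -- "$@" -pb                     # preserve the block size of each file
  set -- "$@" -bandwidth "$bandwidth" # throttle each map task, in MB/s
  set -- "$@" -update                 # copy only files missing or different at the target
  set -- "$@" -delete                 # delete target files that are absent from the source
  echo "$@" "$src" "$dst"
}

print_sync_cmd 200 40 \
  hdfs://10.198.21.227:8020/user/mart_mobile/app.db/app_jxkh_county_tmp \
  hdfs://172.21.2.137:8020/user/mart_cfo/app.db/app_jxkh_county
```

Because -delete removes files from the target, it is worth reviewing the printed paths before running the real command.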


[bdp_client@BJLFRZ-Client-50-10 total_env]$ hadoop fs -ls hdfs://172.21.2.137:8020/user/mart_cfo/app.db/app_jxkh_county
Found 2 items
-rw-r--r--   3 mart_cfo mart_cfo      57340 2019-04-20 05:41 hdfs://172.21.2.137:8020/user/mart_cfo/app.db/app_jxkh_county/000000_0
-rw-r--r--   3 mart_cfo mart_cfo      57770 2019-04-20 05:41 hdfs://172.21.2.137:8020/user/mart_cfo/app.db/app_jxkh_county/part-m-00000
[bdp_client@BJLFRZ-Client-50-10 total_env]$ hadoop fs -ls hdfs://10.198.21.227:8020/user/mart_mobile/app.db/app_jxkh_county_tmp 
Found 2 items
-rwxrwxrwx   3 mart_mobile mart_mobile      57340 2019-01-17 11:01 hdfs://10.198.21.227:8020/user/mart_mobile/app.db/app_jxkh_county_tmp/000000_0
-rwxrwxrwx   3 mart_mobile mart_mobile      57770 2019-04-01 11:03 hdfs://10.198.21.227:8020/user/mart_mobile/app.db/app_jxkh_county_tmp/part-m-00000
[bdp_client@BJLFRZ-Client-50-10 total_env]$ hadoop fs -du -s hdfs://172.21.2.137:8020/user/mart_cfo/app.db/app_jxkh_county
115110  hdfs://172.21.2.137:8020/user/mart_cfo/app.db/app_jxkh_county
[bdp_client@BJLFRZ-Client-50-10 total_env]$ hadoop fs -du -s hdfs://10.198.21.227:8020/user/mart_mobile/app.db/app_jxkh_county_tmp
115110  hdfs://10.198.21.227:8020/user/mart_mobile/app.db/app_jxkh_county_tmp
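The size comparison above can also be automated. A minimal sketch, assuming the first field of each hadoop fs -du -s line is the total size in bytes, as in the output above; du_bytes and same_size are hypothetical helper names, and the sample lines are copied from this session. Matching sizes are a quick sanity check, not proof that the contents are identical.

```shell
# Read the byte count (first field) from one line of `hadoop fs -du -s` output.
du_bytes() {
  awk '{ print $1; exit }'
}

# Compare two `du -s` lines; print the result and return 1 on a mismatch.
same_size() {
  a=$(printf '%s\n' "$1" | du_bytes)
  b=$(printf '%s\n' "$2" | du_bytes)
  if [ "$a" = "$b" ]; then
    echo "match: $a bytes"
  else
    echo "mismatch: $a vs $b"
    return 1
  fi
}

# On a real client each argument would be "$(hadoop fs -du -s <path>)".
same_size \
  '115110  hdfs://10.198.21.227:8020/user/mart_mobile/app.db/app_jxkh_county_tmp' \
  '115110  hdfs://172.21.2.137:8020/user/mart_cfo/app.db/app_jxkh_county'
```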
