Notes on Syncing Data Between Hadoop Clusters with distcp

1. Viewing local HDFS files

# List the partition, the database directory, and the table directory
# (hdfs dfs replaces the deprecated hadoop dfs form)
hdfs dfs -ls /usr/hive/warehouse/dwf.db/dwf_user_kuanbiao_full_1d/dt=2019-09-21
hdfs dfs -ls /usr/hive/warehouse/dwf.db/
hdfs dfs -ls /usr/hive/warehouse/dwf.db/dwf_user_kuanbiao_full_1d
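
Before copying, it can also help to check how much data the partition holds; a -du example, not part of the original notes:

hdfs dfs -du -h /usr/hive/warehouse/dwf.db/dwf_user_kuanbiao_full_1d/dt=2019-09-21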

2. Syncing local HDFS files to the remote cluster

hadoop distcp -update -skipcrccheck /usr/hive/warehouse/dwf.db/dwf_user_kuanbiao_full_1d/dt=2019-09-21/* hdfs://emr1-ip:4007/usr/hive/warehouse/dwf.db/dwf_user_kuanbiao_full_1d/
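
For large partitions it may be worth tuning the copy's parallelism and per-map bandwidth with distcp's standard -m and -bandwidth options. A minimal sketch; the values 20 maps and 50 MB/s are illustrative, not from the original job:

# Hypothetical tuning: 20 map tasks, capped at ~50 MB/s per map
hadoop distcp -update -skipcrccheck -m 20 -bandwidth 50 /usr/hive/warehouse/dwf.db/dwf_user_kuanbiao_full_1d/dt=2019-09-21/* hdfs://emr1-ip:4007/usr/hive/warehouse/dwf.db/dwf_user_kuanbiao_full_1d/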

3. distcp between different Hadoop versions

When the two clusters run different Hadoop versions, distcp has to pull the remote HDFS files into the local HDFS; syncing in the opposite direction does not work. In practice this means running the job on the newer cluster (here, the local CDH 6 side) and reading from the older one.
Remote Hadoop version:

[hadoop@10 ~]$ hadoop version
Hadoop 2.7.3
Subversion Unknown -r Unknown
Compiled by hadoop on 2018-12-07T03:27Z
Compiled with protoc 2.5.0
From source with checksum f33d17d31f3269acb993fd0f5ee9bdd
This command was run using /usr/local/service/hadoop/share/hadoop/common/hadoop-common-2.7.3.jar

Local Hadoop version:

[hadoop@ai-etl-c2-11 hive2hbase]$ hadoop version
Hadoop 3.0.0-cdh6.3.0
Source code repository http://github.com/cloudera/hadoop -r 7f07ef8e6df428a8eb53009dc8d9a249dbbb50ad
Compiled by jenkins on 2019-07-18T17:09Z
Compiled with protoc 2.5.0
From source with checksum 48a7f6a4c240f5772750c02cdebff6d3
This command was run using /opt/cloudera/parcels/CDH-6.3.0-1.cdh6.3.0.p0.1279813/jars/hadoop-common-3.0.0-cdh6.3.0.jar

Sync command:

/bin/hadoop distcp -update -skipcrccheck hdfs://emr2-ip:4007/usr/hive/warehouse/tmp.db/tmp_user_push_personas_full_1d_textfile/* hdfs://emr3-ip:8020/user/hive/warehouse/dwf.db/dwf_user_push_personas_full_1d/
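
If the RPC versions are too far apart for hdfs:// reads to work at all, a common fallback (not used in these notes) is to read the source over webhdfs, which is version-independent. The port below assumes the Hadoop 2.x default NameNode HTTP port 50070:

/bin/hadoop distcp -update -skipcrccheck webhdfs://emr2-ip:50070/usr/hive/warehouse/tmp.db/tmp_user_push_personas_full_1d_textfile/* hdfs://emr3-ip:8020/user/hive/warehouse/dwf.db/dwf_user_push_personas_full_1d/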

Note: using -update -skipcrccheck saves a lot of unnecessary trouble here: clusters on different Hadoop versions can compute file checksums differently, so distcp's post-copy CRC comparison may fail even when the data is identical. In practice you can first delete the files on the local (destination) side and then run the sync.

4. The sync shell script

[hadoop@ai-etl-c2-11 hive2hbase]$ cat hive2hbase_real.sh 
#!/bin/bash
# Date variables (not used by the commands below; kept for later partitioned loads)
export yesterday=$(date -d 'yesterday' +%Y-%m-%d)
export today=$(date "+%Y-%m-%d %H:%M:%S")

# Clear the destination table directory, then pull the data over from the remote cluster
/bin/hadoop fs -rm -r /user/hive/warehouse/dwf.db/dwf_user_push_personas_full_1d/*
/bin/hadoop distcp -update -skipcrccheck hdfs://emr2-ip:4007/usr/hive/warehouse/tmp.db/tmp_user_push_personas_full_1d_textfile/* hdfs://emr3-ip:8020/user/hive/warehouse/dwf.db/dwf_user_push_personas_full_1d/

# Optional follow-up: load the synced table into the HBase-backed table (currently disabled)
#/bin/hive -e "use dwf;insert into table hbaseinner_user_push_personas_full_1d
#select userinfo_id key,userinfo_id,name,user_name,idcode,birthday,sex,age,symbol,age_stage,memberlevel,register_time,is_blacklist,is_subscribe,verify_type,is_sold,is_residue,register_sc,terminal_type,brand,net_type,register_province,register_city,receive_province,receive_city,receive_county,active_time_range_30d,bid_times_all,bid_sales_all,bid_min_amt,bid_max_amt,bid_avg_amt,sold_cnt_all,residue_cnt_all,residue_amt_all,residue_rate_all,unit_price_all,sold_pay_rate,dt
#from dwf_user_push_personas_full_1d
#where userinfo_id is not null;"
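
The delete-then-copy pattern in this script has a sharp edge: if distcp fails after the delete, the destination stays empty until the next run. Below is a hardened sketch that stages the copy first and only swaps it in on success; the __staging path and the overall structure are assumptions, not the script actually deployed:

#!/bin/bash
set -euo pipefail

SRC='hdfs://emr2-ip:4007/usr/hive/warehouse/tmp.db/tmp_user_push_personas_full_1d_textfile'
DEST='hdfs://emr3-ip:8020/user/hive/warehouse/dwf.db/dwf_user_push_personas_full_1d'
STAGE="${DEST}__staging"   # hypothetical staging directory

# Copy into the staging directory first; old data is only removed if distcp succeeds
/bin/hadoop fs -rm -r -f "$STAGE"
if /bin/hadoop distcp -update -skipcrccheck "$SRC/*" "$STAGE/"; then
    /bin/hadoop fs -rm -r -f "$DEST"/*
    /bin/hadoop fs -mv "$STAGE"/* "$DEST"/
    /bin/hadoop fs -rm -r -f "$STAGE"
else
    echo "distcp failed; destination left untouched" >&2
    exit 1
fi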

5. Scheduling with cron

[hadoop@ai-etl-c2-11 hive2hbase]$ crontab -l
13 8 * * * sh /home/hadoop/hive2hbase/hive2hbase_real.sh >> /home/hadoop/hive2hbase/hive2hbase-cron.log 2>&1 &
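
Two small hardening notes on this entry: the trailing & is redundant (cron already runs jobs asynchronously), and a lock keeps a slow sync from overlapping the next run. A possible variant, assuming flock(1) is available on the host:

13 8 * * * flock -n /tmp/hive2hbase.lock sh /home/hadoop/hive2hbase/hive2hbase_real.sh >> /home/hadoop/hive2hbase/hive2hbase-cron.log 2>&1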

 
