Airbnb's open-source reAir tool: usage and source code analysis (Part 1)

Original post: http://www.cnblogs.com/jiangxiaoxian/p/9917790.html

reAir provides both batch replication and incremental replication. In this post we look at batch replication first.

How to run batch replication:

cd reair
./gradlew shadowjar -p main -x test
# A local table list still needs the file:/// prefix; if you pass --table-list ~/reair/local_table_list directly, the file must live on HDFS!
hadoop jar main/build/libs/airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.batch.hive.MetastoreReplicationJob --config-files my_config_file.xml --table-list file:///reair/local_table_list

1. table_list contents: one db_name.table_name per line for each table to replicate

db_name1.table_name1
db_name1.table_name2
db_name2.table_name3
...
2. my_config_file.xml configuration

<?xml version="1.0"?>
<configuration>

  <property>
    <name>airbnb.reair.clusters.src.name</name>
    <value>ns8</value>
    <description>
      Name of the source cluster. It can be an arbitrary string and is used in
      logs, tags, etc.
    </description>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.metastore.url</name>
    <value>thrift://192.168.200.140:9083</value>
    <description>Source metastore Thrift URL.</description>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.root</name>
    <value>hdfs://ns8/</value>
    <description>Source cluster HDFS root. Note trailing slash.</description>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.tmp</name>
    <value>hdfs://ns8/tmp</value>
    <description>
      Directory for temporary files on the source cluster.
    </description>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.name</name>
    <value>ns1</value>
    <description>
      Name of the destination cluster. It can be an arbitrary string and is used
      in logs, tags, etc.
    </description>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.metastore.url</name>
    <value>thrift://dev04:9083</value>
    <description>Destination metastore Thrift URL.</description>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.root</name>
    <value>hdfs://ns1/</value>
    <description>Destination cluster HDFS root. Note trailing slash.</description>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.tmp</name>
    <value>hdfs://ns1/tmp</value>
    <description>
      Directory for temporary files on the destination cluster. Table / partition
      data is copied to this location before it is moved to the final location,
      so it should be on the same filesystem as the final location.
    </description>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.output.dir</name>
    <value>/tmp/replica</value>
    <description>
      This configuration must be provided. It gives the location where each
      stage's MR job stores its output.
    </description>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.metastore.blacklist</name>
    <value>testdb:test.*,tmp_.*:.*</value>
    <description>
      Comma-separated regex blacklist: dbname_regex:tablename_regex,...
    </description>
  </property>

  <property>
    <name>airbnb.reair.batch.metastore.parallelism</name>
    <value>5</value>
    <description>
      The parallelism to use for jobs requiring metastore calls. This translates
      to the number of mappers or reducers in the relevant jobs.
    </description>
  </property>

  <property>
    <name>airbnb.reair.batch.copy.parallelism</name>
    <value>5</value>
    <description>
      The parallelism to use for jobs that copy files. This translates to the
      number of reducers in the relevant jobs.
    </description>
  </property>

  <property>
    <name>airbnb.reair.batch.overwrite.newer</name>
    <value>true</value>
    <description>
      Whether the batch job will overwrite newer tables/partitions on the
      destination. Default is true.
    </description>
  </property>

  <property>
    <name>mapreduce.map.speculative</name>
    <value>false</value>
    <description>
      Speculative execution is currently not supported for batch replication.
    </description>
  </property>

  <property>
    <name>mapreduce.reduce.speculative</name>
    <value>false</value>
    <description>
      Speculative execution is currently not supported for batch replication.
    </description>
  </property>

</configuration>

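The blacklist format above (comma-separated dbname_regex:tablename_regex pairs) is easiest to see with a small example. The Java snippet below is a minimal sketch of that interpretation, assuming a table is skipped only when both its database name and its table name match one of the pairs; the class is illustrative and is not part of reAir.

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative only: shows how a "dbRegex:tableRegex,..." blacklist could be applied.
public class BlacklistFilter {
    private final List<Pattern[]> pairs;

    public BlacklistFilter(String blacklist) {
        // e.g. "testdb:test.*,tmp_.*:.*"
        pairs = Arrays.stream(blacklist.split(","))
                .map(entry -> entry.split(":", 2))
                .map(p -> new Pattern[] { Pattern.compile(p[0]), Pattern.compile(p[1]) })
                .collect(Collectors.toList());
    }

    /** A table is blacklisted when both the db regex and the table regex match. */
    public boolean isBlacklisted(String db, String table) {
        return pairs.stream().anyMatch(
                p -> p[0].matcher(db).matches() && p[1].matcher(table).matches());
    }

    public static void main(String[] args) {
        BlacklistFilter f = new BlacklistFilter("testdb:test.*,tmp_.*:.*");
        System.out.println(f.isBlacklisted("testdb", "test_users")); // true
        System.out.println(f.isBlacklisted("prod", "orders"));       // false
    }
}
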
reAir batch replication runs in three stages:

Stage 1

Mapper:
  1. Read the user-supplied .xml configuration (in the setup method)
  2. Read the contents of the user-supplied table-list:
    1). Filter out table names that match the blacklist
    2). Build a TaskEstimate object, which encapsulates a TaskEstimate.TaskType.xxx operation chosen by various strategies
    3). If the src table is partitioned, perform the corresponding create/alter/noop on the table through the dest metastore Thrift client
    4). If the table is partitioned, CHECK_PARTITION tasks are also added (one task per partition, e.g. dt=20180801)
    5). Return a List of result strings, fields separated by \t
        result structure:      TaskType isUpdateMeta isUpdateData srcPath destPath db_name table_name partition_name
        TaskEstimate object:   TaskType isUpdateMeta isUpdateData srcPath destPath
        HiveObjectSpec object: db_name table_name partition_name
    6). Map output: the result strings described above, one record per table/partition task

Supplementary notes on mapper steps 2.2)-2.3) (a simplified decision sketch follows this list):
    1). COPY_UNPARTITION_TABLE
    2). NO_OP
    3). COPY_PARTITION_TABLE
        1)). CREATE (if the dest has no such table); ALTER (if the dest already has the table, ExistingTable, and it differs from the expectTable we would generate from srcTable); NO_OP (default: do nothing)
        2)). CHECK_PARTITION (generate a task for every partition of the table, e.g. dt=20180101, dt=20180102, ...)
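
To make the CREATE/ALTER/NO_OP decision in 3.1)) concrete, here is a minimal sketch of the comparison involved. The class, enum, and helper names are hypothetical simplifications rather than reAir's actual API, and the expected-table construction is only an assumption about what such a helper would do.

import org.apache.hadoop.hive.metastore.api.Table;

// Illustrative sketch of the CREATE / ALTER / NO_OP decision described in 3.1));
// the class and helper names are hypothetical, not reAir's actual API.
public class TableActionSketch {

    public enum MetadataAction { CREATE, ALTER, NO_OP }

    public static MetadataAction decide(Table srcTable, Table existingDestTable,
                                        String destRoot) {
        if (existingDestTable == null) {
            return MetadataAction.CREATE;          // dest has no such table
        }
        Table expected = buildExpectedTable(srcTable, destRoot);
        // If the destination table differs from what we expect, its metadata needs
        // to be brought back in line (a real comparison would ignore fields such as
        // createTime).
        return expected.equals(existingDestTable) ? MetadataAction.NO_OP
                                                  : MetadataAction.ALTER;
    }

    // Hypothetical helper: the expected destination table is the source table with
    // its data location rewritten to point at the destination cluster.
    private static Table buildExpectedTable(Table srcTable, String destRoot) {
        Table expected = srcTable.deepCopy();
        expected.getSd().setLocation(
                destRoot + srcTable.getDbName() + ".db/" + srcTable.getTableName());
        return expected;
    }
}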
Reduce:
    1. Deserialize the map value into a (TaskEstimate, HiveObjectSpec) pair
    2. If taskType == CHECK_PARTITION, re-run the task-type estimation for the spec (repeating the map-side check logic)
    3. The output result is identical to the map value


Stage 2

Mapper:
    1. Deserialize Stage 1's result; if estimate.TaskType == COPY_PARTITION_TABLE or COPY_UNPARTITION_TABLE and the data needs to be updated, run updateDir(); otherwise NO_OP
    2. updateDir() (see the sketch after this list):
        1). Clean destPath and recreate it
        2). Recursively list all non-hidden files under srcPath
    3. Output: one record per source file found in 2)
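
A minimal sketch of what updateDir() boils down to in terms of the standard Hadoop FileSystem API, under the assumption that "hidden" means file names starting with "." or "_"; this is not the actual reAir implementation.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Simplified sketch of "clean the dest dir, then enumerate non-hidden source files".
public class UpdateDirSketch {

    public static List<Path> updateDir(Configuration conf, Path srcPath, Path destPath)
            throws IOException {
        // 1) Remove any stale destination directory and recreate it empty.
        FileSystem destFs = destPath.getFileSystem(conf);
        destFs.delete(destPath, true /* recursive */);
        destFs.mkdirs(destPath);

        // 2) Recursively list every non-hidden file under srcPath.
        FileSystem srcFs = srcPath.getFileSystem(conf);
        List<Path> filesToCopy = new ArrayList<>();
        RemoteIterator<LocatedFileStatus> it = srcFs.listFiles(srcPath, true /* recursive */);
        while (it.hasNext()) {
            Path file = it.next().getPath();
            String name = file.getName();
            if (!name.startsWith(".") && !name.startsWith("_")) {  // skip hidden/temp files
                filesToCopy.add(file);
            }
        }
        return filesToCopy;
    }
}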

Reduce:
    1. Deserialize the value
    2. Copy with validation (up to three attempts)
        1). Success log line: COPIED srcPathFileName DestPath srcFileSize "" currentSystemTime
        2). Failure (or failed-validation) log line: SKIPPED srcPathFileName DestPath srcFileSize Exception.toString currentSystemTime
        
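Roughly, the reducer's copy-and-validate loop can be pictured as below. This is a sketch only: it assumes that validation means comparing file sizes and uses a tab as the log field separator, neither of which is confirmed by the original post.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Rough sketch of "copy with up to three attempts, then log COPIED or SKIPPED".
public class RetryCopySketch {

    public static String copyWithRetries(Configuration conf, Path src, Path dest) {
        try {
            FileSystem srcFs = src.getFileSystem(conf);
            FileSystem destFs = dest.getFileSystem(conf);
            long srcSize = srcFs.getFileStatus(src).getLen();

            for (int attempt = 1; attempt <= 3; attempt++) {
                try {
                    FileUtil.copy(srcFs, src, destFs, dest,
                            false /* deleteSource */, true /* overwrite */, conf);
                    if (destFs.getFileStatus(dest).getLen() == srcSize) {
                        // Fields follow the success log format described above.
                        return String.join("\t", "COPIED", src.getName(), dest.toString(),
                                Long.toString(srcSize), "",
                                Long.toString(System.currentTimeMillis()));
                    }
                } catch (IOException retryable) {
                    // fall through and retry
                }
            }
            return String.join("\t", "SKIPPED", src.getName(), dest.toString(),
                    Long.toString(srcSize), "copy failed after 3 attempts",
                    Long.toString(System.currentTimeMillis()));
        } catch (IOException e) {
            return String.join("\t", "SKIPPED", src.getName(), dest.toString(),
                    "0", e.toString(), Long.toString(System.currentTimeMillis()));
        }
    }
}
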

Stage 3

Stage 3 has only a mapper task.
Mapper:
    1. The input is Stage 1's output
    2. Deserialize Stage 1's result to get the Estimate.TaskType, check the src/dest metadata, and perform the corresponding CREATE/ALTER/NOOP operation according to the strategy
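
For step 2, here is a hedged sketch of how such a CREATE/ALTER/NO_OP commit could be issued against the destination metastore with the standard HiveMetaStoreClient API; the decision logic is simplified and this is not reAir's actual code.

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

// Simplified sketch of applying a CREATE / ALTER / NO_OP decision against the
// destination metastore.
public class CommitMetadataSketch {

    public static void commit(String destThriftUrl, Table expectedTable) throws Exception {
        HiveConf conf = new HiveConf();
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, destThriftUrl);
        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            String db = expectedTable.getDbName();
            String name = expectedTable.getTableName();
            if (!client.tableExists(db, name)) {
                client.createTable(expectedTable);                  // CREATE
            } else if (!client.getTable(db, name).equals(expectedTable)) {
                client.alter_table(db, name, expectedTable);        // ALTER
            }                                                       // else NO_OP
        } finally {
            client.close();
        }
    }
}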

Official diagram for reference:

[Figure: official reAir batch replication diagram]


