sqoop import --hcatalog: Source Code Walkthrough

Introduction

Sqoop is an open-source tool for moving data between Hadoop (Hive) and traditional relational databases (MySQL, PostgreSQL, ...). It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and it can also export data from HDFS back into a relational database.
Our online systems run on MySQL and analysis is done in Hive, so the most common Sqoop task for us is importing MySQL data into Hive. This article therefore follows the import path through Sqoop's source code.
The Sqoop version in use is sqoop-1.4.6-cdh5.12.1.

The Sqoop command we run most often looks like this:

sqoop import  --connect  \
jdbc:mysql://mysql_host/test_db?tinyInt1isBit=false \
--username xxx --password xxx  \
--table test_table \
--hcatalog-database test_db \
--hcatalog-table test_table \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m 5 --create-hcatalog-table   \
--map-column-hive createTime=timestamp \
--hive-overwrite  --split-by id --fetch-size 10000 --null-string '\\N' --null-non-string '\\N'

The command itself is not explained here. To run a sqoop command there has to be a launcher script: under Sqoop's bin directory there is a script named sqoop, and that is what the command above invokes. Reading the script shows that the sqoop command simply runs the class org.apache.sqoop.Sqoop, whose main method is easy to find in the source:

public static void main(String [] args) {
    if (args.length == 0) {
      System.err.println("Try 'sqoop help' for usage.");
      System.exit(1);
    }

    int ret = runTool(args);
    System.exit(ret);
  }

The call chain is main -> runTool -> runTool (the overload that takes a Configuration).
The runTool method:

public static int runTool(String [] args, Configuration conf) {
  // Expand the options
  String[] expandedArgs = null;
  try {
    expandedArgs = OptionsFileUtil.expandArguments(args);
  } catch (Exception ex) {
    LOG.error("Error while expanding arguments", ex);
    System.err.println(ex.getMessage());
    System.err.println("Try 'sqoop help' for usage.");
    return 1;
  }

  String toolName = expandedArgs[0];
  Configuration pluginConf = SqoopTool.loadPlugins(conf);
  SqoopTool tool = SqoopTool.getTool(toolName);
  if (null == tool) {
    System.err.println("No such sqoop tool: " + toolName
        + ". See 'sqoop help'.");
    return 1;
  }


  Sqoop sqoop = new Sqoop(tool, pluginConf);
  return runSqoop(sqoop,
      Arrays.copyOfRange(expandedArgs, 1, expandedArgs.length));
}

expandedArgs = OptionsFileUtil.expandArguments(args); handles extra arguments: if some of the command-line options are kept in an options file, they are loaded and expanded here.
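
As a rough illustration of the idea (this is not the actual OptionsFileUtil code), expanding an options file amounts to splicing the non-comment lines of the file named by --options-file into the argument array:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class OptionsFileSketch {

  // Simplified stand-in for OptionsFileUtil.expandArguments: splice the
  // contents of "--options-file <path>" into the argument list.
  public static String[] expandArguments(String[] args) throws IOException {
    List<String> expanded = new ArrayList<>();
    for (int i = 0; i < args.length; i++) {
      if ("--options-file".equals(args[i]) && i + 1 < args.length) {
        for (String line : Files.readAllLines(Paths.get(args[++i]))) {
          line = line.trim();
          if (line.isEmpty() || line.startsWith("#")) {
            continue;  // skip blank lines and comments
          }
          expanded.add(line);  // the real code also handles quoting and splitting
        }
      } else {
        expanded.add(args[i]);
      }
    }
    return expanded.toArray(new String[0]);
  }
}
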
Configuration pluginConf = SqoopTool.loadPlugins(conf);
SqoopTool tool = SqoopTool.getTool(toolName);
These lines load the SqoopTool. Sqoop provides several functions such as import, export, and listing databases, and each of them is modeled as a SqoopTool; the tool classes live under the org.apache.sqoop.tool package and are not covered in detail here.
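
Conceptually the tool lookup is a registry from tool name to tool class; the sketch below is a hypothetical simplification, not the real SqoopTool registration code (the listed class names do exist under org.apache.sqoop.tool):

import java.util.HashMap;
import java.util.Map;

public class ToolRegistrySketch {

  // Hypothetical registry: tool name -> tool class name.
  private static final Map<String, String> TOOLS = new HashMap<>();

  static {
    TOOLS.put("import", "org.apache.sqoop.tool.ImportTool");
    TOOLS.put("export", "org.apache.sqoop.tool.ExportTool");
    TOOLS.put("list-databases", "org.apache.sqoop.tool.ListDatabasesTool");
  }

  // Resolve the first command-line token ("import" in our example) to a tool.
  public static String getToolClass(String toolName) {
    return TOOLS.get(toolName);  // null means "No such sqoop tool"
  }
}
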
Next comes the runSqoop method:

public static int runSqoop(Sqoop sqoop, String [] args) {
    String[] toolArgs = sqoop.stashChildPrgmArgs(args);
    try {
      return ToolRunner.run(sqoop.getConf(), sqoop, toolArgs);
    } catch (Exception e) {
      LOG.error("Got exception running Sqoop: " + e.toString());
      e.printStackTrace();
      rethrowIfRequired(toolArgs, e);
      return 1;
    }
  }

It delegates to ToolRunner.run, which turns out to be org.apache.hadoop.util.ToolRunner, a class from the Hadoop libraries. Of course Sqoop does not hand control to Hadoop without doing any work of its own; if it did, there would be no reason for Sqoop to exist. Looking more closely at org.apache.sqoop.Sqoop, it implements Hadoop's Tool interface and overrides run, so ToolRunner.run ends up calling back into this class's run method.
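
For reference, this is the standard Hadoop Tool/ToolRunner pattern that Sqoop follows. A minimal standalone example (MyTool is a made-up class; only the Hadoop API is real):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Minimal Tool implementation: ToolRunner parses generic Hadoop options
// (-D, -fs, ...) and then calls back into run() with the remaining args,
// exactly as it does for org.apache.sqoop.Sqoop.
public class MyTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    System.out.println("remaining args: " + args.length);
    return 0;  // exit status, just like Sqoop's tool.run(options)
  }

  public static void main(String[] args) throws Exception {
    int ret = ToolRunner.run(new Configuration(), new MyTool(), args);
    System.exit(ret);
  }
}
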
The run method:

@Override
  /**
   * Actual main entry-point for the program
   */
  public int run(String [] args) {
    if (options.getConf() == null) {
      // Configuration wasn't initialized until after the ToolRunner
      // got us to this point. ToolRunner gave Sqoop itself a Conf
      // though.
      options.setConf(getConf());
    }

    try {
      options = tool.parseArguments(args, null, options, false);
      tool.appendArgs(this.childPrgmArgs);
      tool.validateOptions(options);
    } catch (Exception e) {
      ...
    }

    return tool.run(options);
  }

options = tool.parseArguments(args, null, options, false);
tool.appendArgs(this.childPrgmArgs);
tool.validateOptions(options);
These lines parse the command-line arguments with Apache Commons CLI.
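
As a standalone illustration of that parsing step, here is plain Commons CLI usage with a couple of made-up options in the spirit of Sqoop's import options (this is not Sqoop's actual option setup):

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

public class CliSketch {

  public static void main(String[] args) throws ParseException {
    // Define a few options similar in spirit to Sqoop's import options.
    Options options = new Options();
    options.addOption(null, "connect", true, "JDBC connect string");
    options.addOption(null, "table", true, "source table");
    options.addOption("m", "num-mappers", true, "number of map tasks");

    // Parse and read back the values.
    CommandLineParser parser = new DefaultParser();
    CommandLine cmd = parser.parse(options, args);
    System.out.println("connect = " + cmd.getOptionValue("connect"));
    System.out.println("mappers = " + cmd.getOptionValue("num-mappers", "4"));
  }
}
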
return tool.run(options);
Since we are running import, we jump straight to the run method of the ImportTool class:

@Override
  /** {@inheritDoc} */
  public int run(SqoopOptions options) {
    HiveImport hiveImport = null;

    if (allTables) {
      ...
    }

    if (!init(options)) {
      return 1;
    }

    codeGenerator.setManager(manager);

    try {
      if (options.doHiveImport()) {
        hiveImport = new HiveImport(options, manager, options.getConf(), false);
      }

      // Import a single table (or query) the user specified.
      importTable(options, options.getTableName(), hiveImport);
    } catch (IllegalArgumentException iea) {
        ...
    } catch (IOException ioe) {
      ...
    } catch (ImportException ie) {
      ...
    } catch (AvroSchemaMismatchException e) {
      ...
    } finally {
      destroy(options);
    }

    return 0;
  }

if (!init(options)) runs the init method, which matters: Sqoop supports different data sources, and the manager set up here is used throughout the rest of the flow. It is not described in detail in this article.
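
A heavily simplified, hypothetical sketch of that idea: the JDBC connect string decides which ConnManager implementation handles the import (the real selection logic lives in Sqoop's manager factories and covers far more cases):

// Hypothetical illustration of manager selection by JDBC URL prefix; the
// real logic lives in Sqoop's manager factories (e.g. DefaultManagerFactory).
public class ManagerSelectionSketch {

  public static String pickManager(String connectString) {
    if (connectString.startsWith("jdbc:mysql:")) {
      return "MySQLManager";
    } else if (connectString.startsWith("jdbc:postgresql:")) {
      return "PostgresqlManager";
    } else if (connectString.startsWith("jdbc:oracle:")) {
      return "OracleManager";
    }
    return "GenericJdbcManager";  // fallback for other JDBC sources
  }

  public static void main(String[] args) {
    // For the command at the top of this article this resolves to MySQLManager.
    System.out.println(pickManager("jdbc:mysql://mysql_host/test_db"));
  }
}
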
Step into the importTable method (ImportTool class):

protected boolean importTable(SqoopOptions options, String tableName,
      HiveImport hiveImport) throws IOException, ImportException {
    String jarFile = null;

    // Generate the ORM code for the tables.
    jarFile = codeGenerator.generateORM(options, tableName);

    Path outputPath = getOutputPath(options, tableName);

    // Do the actual import.
    ImportJobContext context = new ImportJobContext(tableName, jarFile,
        options, outputPath);

    // If we're doing an incremental import, set up the
    // filtering conditions used to get the latest records.
    if (!initIncrementalConstraints(options, context)) {
      return false;
    }

    if (options.isDeleteMode()) {
      deleteTargetDir(context);
    }

    if (null != tableName) {
      manager.importTable(context);
    } else {
      manager.importQuery(context);
    }

    if (options.isAppendMode()) {
      AppendUtils app = new AppendUtils(context);
      app.append();
    } else if (options.getIncrementalMode() == SqoopOptions.IncrementalMode.DateLastModified) {
      lastModifiedMerge(options, context);
    }

    // If the user wants this table to be in Hive, perform that post-load.
    if (options.doHiveImport()) {
      // For Parquet file, the import action will create hive table directly via
      // kite. So there is no need to do hive import as a post step again.
      if (options.getFileLayout() != SqoopOptions.FileLayout.ParquetFile) {
        hiveImport.importTable(tableName, options.getHiveTableName(), false);
      }
    }

    saveIncrementalState(options);

    return true;
  }

jarFile = codeGenerator.generateORM(options, tableName);
This goes into the generateORM method of org.apache.sqoop.tool.CodeGenTool, which generates a Java source file, then compiles and packages it:

public String generateORM(SqoopOptions options, String tableName)
      throws IOException {
      
      ...
      
    CompilationManager compileMgr = new CompilationManager(options);
    ClassWriter classWriter = new ClassWriter(options, manager, tableName,
        compileMgr);
    classWriter.generate();
    compileMgr.compile();
    compileMgr.jar();
    String jarFile = compileMgr.getJarFilename();
    this.generatedJarFiles.add(jarFile);
    return jarFile;
}

classWriter.generate(); generates the Java source file.
compileMgr.compile(); compiles it.
compileMgr.jar(); packages the result into a jar.
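
To make the compile-and-jar step concrete, here is a minimal standalone sketch using plain JDK APIs rather than Sqoop's CompilationManager; the file name test_table.java is assumed to match the generated class for our example table:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class CompileAndJarSketch {

  public static void main(String[] args) throws IOException {
    // Compile the generated source (what compileMgr.compile() does via javac);
    // requires a JDK, not just a JRE.
    JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
    int rc = javac.run(null, null, null, "test_table.java");
    if (rc != 0) {
      throw new IOException("compilation failed");
    }

    // Package the class file into a jar (what compileMgr.jar() does).
    Path classFile = Paths.get("test_table.class");
    try (JarOutputStream jar =
             new JarOutputStream(Files.newOutputStream(Paths.get("test_table.jar")))) {
      jar.putNextEntry(new JarEntry(classFile.getFileName().toString()));
      jar.write(Files.readAllBytes(classFile));
      jar.closeEntry();
    }
  }
}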

Back in the importTable method (ImportTool class), we keep following the code.
From the argument-parsing code and the command shown earlier, tableName cannot be null, so manager.importTable(context); is the branch that runs. That call goes to MySQLManager's importTable, and from there to the importTable method of its parent class, SqlManager.
The importTable method (SqlManager class):

 public void importTable(com.cloudera.sqoop.manager.ImportJobContext context)
      throws IOException, ImportException {
    String tableName = context.getTableName();
    String jarFile = context.getJarFile();
    SqoopOptions opts = context.getOptions();

    context.setConnManager(this);

    ImportJobBase importer;
    if (opts.getHBaseTable() != null) {
      // Import to HBase.
      ...
    } else if (opts.getAccumuloTable() != null) {
       // Import to Accumulo.
       ...
    } else {
      // Import to HDFS.
      importer = new DataDrivenImportJob(opts, context.getInputFormat(),
              context);
    }

    checkTableImportOptions(context);

    String splitCol = getSplitColumn(opts, tableName);
    importer.runImport(tableName, jarFile, splitCol, opts.getConf());
  }

Since we are importing into Hive, DataDrivenImportJob is used. checkTableImportOptions validates the command options, mainly split-by and num-mappers. The next step is the runImport method (ImportJobBase; DataDrivenImportJob extends ImportJobBase).
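
Before looking at runImport, a rough sketch of what --split-by id with -m 5 amounts to: the minimum and maximum of the split column are queried, the range is cut into num-mappers slices, and each map task gets a bounded WHERE clause (the real logic in DataDrivenDBInputFormat is more careful about types and rounding):

import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

  // Turn [minId, maxId] into numSplits WHERE clauses over the split column.
  public static List<String> splitClauses(String col, long minId, long maxId, int numSplits) {
    List<String> clauses = new ArrayList<>();
    long step = Math.max(1, (maxId - minId + 1) / numSplits);
    long lo = minId;
    for (int i = 0; i < numSplits; i++) {
      long hi = (i == numSplits - 1) ? maxId + 1 : lo + step;
      clauses.add(col + " >= " + lo + " AND " + col + " < " + hi);
      lo = hi;
    }
    return clauses;
  }

  public static void main(String[] args) {
    // With --split-by id and -m 5 over ids 1..1000, each mapper reads ~200 rows.
    splitClauses("id", 1, 1000, 5).forEach(System.out::println);
  }
}
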
The runImport method (ImportJobBase class):

public void runImport(String tableName, String ormJarFile, String splitByCol,
      Configuration conf) throws IOException, ImportException {
    
    ......

    loadJars(conf, ormJarFile, tableClassName);

    Job job = createJob(conf);
    try {
      // Set the external jar to use for the job.
      job.getConfiguration().set("mapred.jar", ormJarFile);
      if (options.getMapreduceJobName() != null) {
        job.setJobName(options.getMapreduceJobName());
      }

      propagateOptionsToJob(job);
      configureInputFormat(job, tableName, tableClassName, splitByCol);
      configureOutputFormat(job, tableName, tableClassName);
      configureMapper(job, tableName, tableClassName);
      configureNumTasks(job);
      cacheJars(job, getContext().getConnManager());

      jobSetup(job);
      setJob(job);
      boolean success = runJob(job);
      if (!success) {
        throw new ImportException("Import job failed!");
      }

      completeImport(job);

      if (options.isValidationEnabled()) {
        validateImport(tableName, conf, job);
      }
      this.endTime = new Date().getTime();

      String publishClassName = conf.get(ConfigurationConstants.DATA_PUBLISH_CLASS);
      if (!StringUtils.isEmpty(publishClassName)) {
        try {
          Class publishClass =  Class.forName(publishClassName);
          Object obj = publishClass.newInstance();
          if (obj instanceof SqoopJobDataPublisher) {
            SqoopJobDataPublisher publisher = (SqoopJobDataPublisher) obj;
            if (options.doHiveImport() || options.getHCatTableName() != null) {
              // We need to publish the details
              SqoopJobDataPublisher.Data data =
                      new SqoopJobDataPublisher.Data(options, tableName, startTime, endTime);
              publisher.publish(data);
            }
          } else {
            LOG.warn("Publisher class not an instance of SqoopJobDataPublisher. Ignoring...");
          }
        } catch (Exception ex) {
          LOG.warn("Unable to publish data to publisher " + ex.getMessage(), ex);
        }
      }
    } catch (InterruptedException ie) {
      throw new IOException(ie);
    } catch (ClassNotFoundException cnfe) {
      throw new IOException(cnfe);
    } finally {
      unloadJars();
      jobTeardown(job);
    }
  }

All of the key work happens here: this is where the MapReduce job is created and submitted.
configureOutputFormat(job, tableName, tableClassName);
The configureOutputFormat method (ImportJobBase class):

/**
   * Configure the output format to use for the job.
   */
  @Override
  protected void configureOutputFormat(Job job, String tableName,
      String tableClassName) throws ClassNotFoundException, IOException {

    job.setOutputFormatClass(getOutputFormatClass());

    if (isHCatJob) {
      LOG.debug("Configuring output format for HCatalog  import job");
      SqoopHCatUtilities.configureImportOutputFormat(options, job,
        getContext().getConnManager(), tableName, job.getConfiguration());
      return;
    }

    if (options.getFileLayout() == SqoopOptions.FileLayout.SequenceFile) {
      job.getConfiguration().set("mapred.output.value.class", tableClassName);
    }

    if (options.shouldUseCompression()) {
      FileOutputFormat.setCompressOutput(job, true);

      String codecName = options.getCompressionCodec();
      Class codecClass;
      if (codecName == null) {
        codecClass = GzipCodec.class;
      } else {
        Configuration conf = job.getConfiguration();
        codecClass = CodecMap.getCodec(codecName, conf).getClass();
      }
      FileOutputFormat.setOutputCompressorClass(job, codecClass);

      if (options.getFileLayout() == SqoopOptions.FileLayout.SequenceFile) {
        SequenceFileOutputFormat.setOutputCompressionType(job,
          CompressionType.BLOCK);
      }

      // SQOOP-428: Avro expects not a fully qualified class name but a "short"
      // name instead (e.g. "snappy") and it needs to be set in a custom
      // configuration option called "avro.output.codec".
      // The default codec is "deflate".
      if (options.getFileLayout() == SqoopOptions.FileLayout.AvroDataFile) {
        if (codecName != null) {
          String shortName =
            CodecMap.getCodecShortNameByName(codecName, job.getConfiguration());
          // Avro only knows about "deflate" and not "default"
          if (shortName.equalsIgnoreCase("default")) {
            shortName = "deflate";
          }
          job.getConfiguration().set(AvroJob.OUTPUT_CODEC, shortName);
        } else {
          job.getConfiguration()
            .set(AvroJob.OUTPUT_CODEC, DataFileConstants.DEFLATE_CODEC);
        }
      }

      if (options.getFileLayout() == SqoopOptions.FileLayout.ParquetFile) {
        if (codecName != null) {
          Configuration conf = job.getConfiguration();
          String shortName = CodecMap.getCodecShortNameByName(codecName, conf);
          if (!shortName.equalsIgnoreCase("default")) {
            conf.set(ParquetJob.CONF_OUTPUT_CODEC, shortName);
          }
        }
      }
    }

    Path outputPath = context.getDestination();
    FileOutputFormat.setOutputPath(job, outputPath);
  }

Since we are using HCatalog, execution goes straight into this branch:

if (isHCatJob) {
  LOG.debug("Configuring output format for HCatalog  import job");
  SqoopHCatUtilities.configureImportOutputFormat(options, job,
    getContext().getConnManager(), tableName, job.getConfiguration());
  return;
}

The configureImportOutputFormat method (SqoopHCatUtilities class) in turn calls
SqoopHCatUtilities.instance().configureHCat(opts, job, connMgr, dbTable, job.getConfiguration());
The configureHCat method (SqoopHCatUtilities class) mainly sets up basic table metadata and creates the HCatalog table.
createHCatTable -> launchHCatCli: creates the HCatalog table. The CREATE TABLE statement is built in createHCatTable; launchHCatCli then writes it to a file and executes it with the HCatalog CLI.
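
With --create-hcatalog-table and the storage stanza from the command at the top of this article, the DDL handed to the HCatalog CLI looks roughly like the string built below. This is a hand-written sketch of the statement's shape, not Sqoop's createHCatTable code; the column list would really come from the source table's metadata and --map-column-hive:

public class HCatDdlSketch {

  public static void main(String[] args) {
    // Example columns; in reality they come from the MySQL table's metadata.
    String columns = "`id` bigint, `name` varchar(64), `createTime` timestamp";
    String stanza = "stored as orc tblproperties (\"orc.compress\"=\"SNAPPY\")";

    String ddl = "create table `test_db`.`test_table` (" + columns + ") " + stanza;
    System.out.println(ddl);  // written to a temp script file and run via the HCatalog CLI
  }
}
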
boolean success = runJob(job); submits the MapReduce job.
