Introduction
Sqoop is an open-source tool used mainly to transfer data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (such as MySQL, Oracle, or Postgres) into HDFS, and it can also move data from HDFS back into a relational database.
Our online systems run on MySQL while our analytics run on Hive, so the most frequent Sqoop use case for us is importing MySQL data into Hive. This article therefore takes import as the example and traces through the Sqoop source code.
The Sqoop version in use here is sqoop-1.4.6-cdh5.12.1.
The Sqoop command we run most often looks like this:
sqoop import --connect \
jdbc:mysql://mysql_host/test_db?tinyInt1isBit=false \
--username xxx --password xxx \
--table test_table \
--hcatalog-database test_db \
--hcatalog-table test_table \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m 5 --create-hcatalog-table \
--map-column-hive createTime=timestamp \
--hive-overwrite --split-by id --fetch-size 10000 --null-string '\\N' --null-non-string '\\N'
The command itself is not explained here. To run a sqoop command there has to be a launcher script: the bin directory of the Sqoop installation contains a script named sqoop, and that is what the command above invokes. Reading the script shows that a sqoop command ultimately executes the class org.apache.sqoop.Sqoop, whose main function is easy to find in the source:
public static void main(String [] args) {
if (args.length == 0) {
System.err.println("Try 'sqoop help' for usage.");
System.exit(1);
}
int ret = runTool(args);
System.exit(ret);
}
The call chain is main -> runTool -> runTool (the second runTool is the overload that also takes a Configuration).
The runTool method:
public static int runTool(String [] args, Configuration conf) {
// Expand the options
String[] expandedArgs = null;
try {
expandedArgs = OptionsFileUtil.expandArguments(args);
} catch (Exception ex) {
LOG.error("Error while expanding arguments", ex);
System.err.println(ex.getMessage());
System.err.println("Try 'sqoop help' for usage.");
return 1;
}
String toolName = expandedArgs[0];
Configuration pluginConf = SqoopTool.loadPlugins(conf);
SqoopTool tool = SqoopTool.getTool(toolName);
if (null == tool) {
System.err.println("No such sqoop tool: " + toolName
+ ". See 'sqoop help'.");
return 1;
}
Sqoop sqoop = new Sqoop(tool, pluginConf);
return runSqoop(sqoop,
Arrays.copyOfRange(expandedArgs, 1, expandedArgs.length));
}
expandedArgs = OptionsFileUtil.expandArguments(args); handles extra argument sources: when command-line parameters are kept in an options file, this is where they get loaded and expanded into the argument list, as sketched below.
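A simplified illustration of that expansion, assuming the standard --options-file flag (this is not the actual OptionsFileUtil code, which additionally handles quoting and multi-token lines):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: splice the contents of an options file into the
// argument list wherever --options-file appears.
public class OptionsFileDemo {
  static String[] expand(String[] args) throws IOException {
    List<String> out = new ArrayList<String>();
    for (int i = 0; i < args.length; i++) {
      if ("--options-file".equals(args[i]) && i + 1 < args.length) {
        for (String line : Files.readAllLines(Paths.get(args[++i]))) {
          String token = line.trim();
          if (!token.isEmpty() && !token.startsWith("#")) { // skip blanks and comments
            out.add(token);
          }
        }
      } else {
        out.add(args[i]);
      }
    }
    return out.toArray(new String[0]);
  }
}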
Configuration pluginConf = SqoopTool.loadPlugins(conf);
SqoopTool tool = SqoopTool.getTool(toolName);
These two lines load the SqoopTool. Sqoop has several functions, such as importing, exporting, and listing databases, and each of them is modeled as a SqoopTool; the concrete tools live in the org.apache.sqoop.tool package and are not covered in detail here, but the idea is sketched below.
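Conceptually, SqoopTool.getTool(toolName) looks the name up in a registry of tool classes and instantiates the match. A minimal sketch of that idea (the registration and plugin handling in the real code is more involved; the class names in the map are the real tool classes):
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a name -> tool-class registry, the idea behind
// SqoopTool.getTool(toolName).
public class ToolRegistryDemo {
  private static final Map<String, String> TOOLS = new HashMap<String, String>();
  static {
    TOOLS.put("import", "org.apache.sqoop.tool.ImportTool");
    TOOLS.put("export", "org.apache.sqoop.tool.ExportTool");
    TOOLS.put("list-databases", "org.apache.sqoop.tool.ListDatabasesTool");
  }

  static Object getTool(String name) throws Exception {
    String className = TOOLS.get(name);
    // unknown tool name -> null, which the caller reports as an error
    return className == null ? null : Class.forName(className).newInstance();
  }
}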
Next comes the runSqoop method:
public static int runSqoop(Sqoop sqoop, String [] args) {
String[] toolArgs = sqoop.stashChildPrgmArgs(args);
try {
return ToolRunner.run(sqoop.getConf(), sqoop, toolArgs);
} catch (Exception e) {
LOG.error("Got exception running Sqoop: " + e.toString());
e.printStackTrace();
rethrowIfRequired(toolArgs, e);
return 1;
}
}
runSqoop simply delegates to ToolRunner.run. Following that call leads into org.apache.hadoop.util.ToolRunner, which is a Hadoop class. Of course Sqoop does not just hand everything over to Hadoop at this point; if it called into Hadoop without doing any work of its own, there would be no reason for Sqoop to exist. Looking more closely at org.apache.sqoop.Sqoop, it implements Hadoop's Tool interface and overrides the run method, so ToolRunner.run ends up calling back into the run method of this very class.
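For readers unfamiliar with this Hadoop idiom, here is a minimal, self-contained example of the Tool/ToolRunner pattern that Sqoop follows (MyTool is a made-up class used only for illustration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustration of the Tool/ToolRunner pattern: ToolRunner strips the generic
// Hadoop options (-D, -conf, ...), injects the Configuration, and then calls
// the tool's run method with the remaining arguments.
public class MyTool extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // already populated by ToolRunner
    System.out.println("remaining args: " + args.length);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new MyTool(), args);
    System.exit(exitCode);
  }
}
In Sqoop, runSqoop above plays the role of this main method, and org.apache.sqoop.Sqoop itself plays the role of MyTool.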
The run method:
@Override
/**
* Actual main entry-point for the program
*/
public int run(String [] args) {
if (options.getConf() == null) {
// Configuration wasn't initialized until after the ToolRunner
// got us to this point. ToolRunner gave Sqoop itself a Conf
// though.
options.setConf(getConf());
}
try {
options = tool.parseArguments(args, null, options, false);
tool.appendArgs(this.childPrgmArgs);
tool.validateOptions(options);
} catch (Exception e) {
...
}
return tool.run(options);
}
options = tool.parseArguments(args, null, options, false);
tool.appendArgs(this.childPrgmArgs);
tool.validateOptions(options);
These three lines parse the command-line arguments, mainly by means of Apache Commons CLI.
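For context, bare Commons CLI usage looks roughly like this (a standalone illustration; Sqoop wraps the same machinery in its own ToolOptions/RelatedOptions helpers and stores the parsed result in a SqoopOptions object):
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.Options;

// Standalone Commons CLI example: declare options, parse args, read values.
public class CliDemo {
  public static void main(String[] args) throws Exception {
    Options opts = new Options();
    opts.addOption("c", "connect", true, "JDBC connect string");
    opts.addOption("t", "table", true, "table to import");
    opts.addOption("m", "num-mappers", true, "number of map tasks");

    CommandLineParser parser = new GnuParser();
    CommandLine cmd = parser.parse(opts, args);

    System.out.println("connect = " + cmd.getOptionValue("connect"));
    System.out.println("table   = " + cmd.getOptionValue("table"));
  }
}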
return tool.run(options);
Since the command here is an import, jump straight to the run method of the ImportTool class:
@Override
/** {@inheritDoc} */
public int run(SqoopOptions options) {
HiveImport hiveImport = null;
if (allTables) {
...
}
if (!init(options)) {
return 1;
}
codeGenerator.setManager(manager);
try {
if (options.doHiveImport()) {
hiveImport = new HiveImport(options, manager, options.getConf(), false);
}
// Import a single table (or query) the user specified.
importTable(options, options.getTableName(), hiveImport);
} catch (IllegalArgumentException iea) {
...
} catch (IOException ioe) {
...
} catch (ImportException ie) {
...
} catch (AvroSchemaMismatchException e) {
...
} finally {
destroy(options);
}
return 0;
}
if (!init(options)) runs the init method, which also matters a lot: Sqoop supports different data sources, and the manager set up inside this method is used throughout the whole flow. It is not covered in detail here either, but the gist of the decision is sketched below.
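A simplified sketch of what that decision amounts to (not Sqoop's actual ManagerFactory code; the names returned here are the real ConnManager implementations, while the selection logic is reduced to its essence):
// Simplified sketch: the JDBC connect string decides which ConnManager
// implementation handles the rest of the job.
public class ManagerPickDemo {
  static String pickManager(String connectString) {
    if (connectString.startsWith("jdbc:mysql:")) {
      return "MySQLManager";        // the case for the command in this article
    } else if (connectString.startsWith("jdbc:postgresql:")) {
      return "PostgresqlManager";
    } else if (connectString.startsWith("jdbc:oracle:")) {
      return "OracleManager";
    }
    return "GenericJdbcManager";    // generic JDBC fallback driven by --driver
  }
}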
Step into the importTable method (ImportTool class):
protected boolean importTable(SqoopOptions options, String tableName,
HiveImport hiveImport) throws IOException, ImportException {
String jarFile = null;
// Generate the ORM code for the tables.
jarFile = codeGenerator.generateORM(options, tableName);
Path outputPath = getOutputPath(options, tableName);
// Do the actual import.
ImportJobContext context = new ImportJobContext(tableName, jarFile,
options, outputPath);
// If we're doing an incremental import, set up the
// filtering conditions used to get the latest records.
if (!initIncrementalConstraints(options, context)) {
return false;
}
if (options.isDeleteMode()) {
deleteTargetDir(context);
}
if (null != tableName) {
manager.importTable(context);
} else {
manager.importQuery(context);
}
if (options.isAppendMode()) {
AppendUtils app = new AppendUtils(context);
app.append();
} else if (options.getIncrementalMode() == SqoopOptions.IncrementalMode.DateLastModified) {
lastModifiedMerge(options, context);
}
// If the user wants this table to be in Hive, perform that post-load.
if (options.doHiveImport()) {
// For Parquet file, the import action will create hive table directly via
// kite. So there is no need to do hive import as a post step again.
if (options.getFileLayout() != SqoopOptions.FileLayout.ParquetFile) {
hiveImport.importTable(tableName, options.getHiveTableName(), false);
}
}
saveIncrementalState(options);
return true;
}
jarFile = codeGenerator.generateORM(options, tableName);
This goes into the generateORM method of the org.apache.sqoop.tool.CodeGenTool class, which generates a Java source file, compiles it, and packages it into a jar:
public String generateORM(SqoopOptions options, String tableName)
throws IOException {
...
CompilationManager compileMgr = new CompilationManager(options);
ClassWriter classWriter = new ClassWriter(options, manager, tableName,
compileMgr);
classWriter.generate();
compileMgr.compile();
compileMgr.jar();
String jarFile = compileMgr.getJarFilename();
this.generatedJarFiles.add(jarFile);
return jarFile;
}
classWriter.generate(); generates the Java source file
compileMgr.compile(); compiles that Java file
compileMgr.jar(); packages the result into a jar
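For intuition, the generated record class for the test_table in the example command is roughly shaped like the sketch below (hand-written for an assumed pair of columns id and name; the real generated code extends SqoopRecord and contains much more, such as delimiter handling, parse and clone methods):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Hand-written sketch of a Sqoop-generated record class: one field per column,
// JDBC (de)serialization for the database side, Writable for the Hadoop side.
public class test_table implements DBWritable, Writable {
  private Integer id;
  private String name;

  public void readFields(ResultSet rs) throws SQLException {
    // input format side: one JDBC row -> one record object
    this.id = rs.getInt(1);
    this.name = rs.getString(2);
  }

  public void write(PreparedStatement ps) throws SQLException {
    // export side: one record object -> one statement parameter set
    ps.setInt(1, this.id);
    ps.setString(2, this.name);
  }

  public void readFields(DataInput in) throws IOException { /* Hadoop serialization */ }
  public void write(DataOutput out) throws IOException { /* Hadoop serialization */ }
}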
Back in the importTable method (ImportTool class), keep following the code.
Looking at the argument-parsing code together with the command shown at the top, tableName is certainly not null, so manager.importTable(context); is the branch that executes. That leads to the importTable method of the MySQLManager class, and from there to importTable in its parent class, SqlManager.
The importTable method (SqlManager class):
public void importTable(com.cloudera.sqoop.manager.ImportJobContext context)
throws IOException, ImportException {
String tableName = context.getTableName();
String jarFile = context.getJarFile();
SqoopOptions opts = context.getOptions();
context.setConnManager(this);
ImportJobBase importer;
if (opts.getHBaseTable() != null) {
// Import to HBase.
...
} else if (opts.getAccumuloTable() != null) {
// Import to Accumulo.
...
} else {
// Import to HDFS.
importer = new DataDrivenImportJob(opts, context.getInputFormat(),
context);
}
checkTableImportOptions(context);
String splitCol = getSplitColumn(opts, tableName);
importer.runImport(tableName, jarFile, splitCol, opts.getConf());
}
Since the target is Hive, the importer used here is DataDrivenImportJob. checkTableImportOptions validates the command parameters, mainly split-by and num-mappers (a sketch of what those two eventually turn into follows below). The next step is the runImport method, defined in ImportJobBase, which DataDrivenImportJob extends.
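A simplified sketch of how a numeric split column and -m 5 become one bounded query condition per mapper (this is not the actual DataDrivenDBInputFormat code, which also handles text, date, and unevenly distributed keys):
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: divide [min, max] of the split column into numMappers
// contiguous ranges; each mapper then runs SELECT ... WHERE <its condition>.
public class SplitDemo {
  static List<String> buildSplitConditions(String col, long min, long max, int numMappers) {
    List<String> conditions = new ArrayList<String>();
    long step = Math.max(1, (max - min + 1) / numMappers);
    long lo = min;
    for (int i = 0; i < numMappers; i++) {
      long hi = (i == numMappers - 1) ? max : lo + step - 1;
      conditions.add(col + " >= " + lo + " AND " + col + " <= " + hi);
      lo = hi + 1;
    }
    return conditions;
  }

  public static void main(String[] args) {
    // e.g. id in [1, 100] with 5 mappers -> "id >= 1 AND id <= 20", ..., "id >= 81 AND id <= 100"
    System.out.println(buildSplitConditions("id", 1, 100, 5));
  }
}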
The runImport method (ImportJobBase class):
public void runImport(String tableName, String ormJarFile, String splitByCol,
Configuration conf) throws IOException, ImportException {
......
loadJars(conf, ormJarFile, tableClassName);
Job job = createJob(conf);
try {
// Set the external jar to use for the job.
job.getConfiguration().set("mapred.jar", ormJarFile);
if (options.getMapreduceJobName() != null) {
job.setJobName(options.getMapreduceJobName());
}
propagateOptionsToJob(job);
configureInputFormat(job, tableName, tableClassName, splitByCol);
configureOutputFormat(job, tableName, tableClassName);
configureMapper(job, tableName, tableClassName);
configureNumTasks(job);
cacheJars(job, getContext().getConnManager());
jobSetup(job);
setJob(job);
boolean success = runJob(job);
if (!success) {
throw new ImportException("Import job failed!");
}
completeImport(job);
if (options.isValidationEnabled()) {
validateImport(tableName, conf, job);
}
this.endTime = new Date().getTime();
String publishClassName = conf.get(ConfigurationConstants.DATA_PUBLISH_CLASS);
if (!StringUtils.isEmpty(publishClassName)) {
try {
Class publishClass = Class.forName(publishClassName);
Object obj = publishClass.newInstance();
if (obj instanceof SqoopJobDataPublisher) {
SqoopJobDataPublisher publisher = (SqoopJobDataPublisher) obj;
if (options.doHiveImport() || options.getHCatTableName() != null) {
// We need to publish the details
SqoopJobDataPublisher.Data data =
new SqoopJobDataPublisher.Data(options, tableName, startTime, endTime);
publisher.publish(data);
}
} else {
LOG.warn("Publisher class not an instance of SqoopJobDataPublisher. Ignoring...");
}
} catch (Exception ex) {
LOG.warn("Unable to publish data to publisher " + ex.getMessage(), ex);
}
}
} catch (InterruptedException ie) {
throw new IOException(ie);
} catch (ClassNotFoundException cnfe) {
throw new IOException(cnfe);
} finally {
unloadJars();
jobTeardown(job);
}
}
All the important work happens in this method: both how the MR job is created and how it is submitted are implemented here.
configureOutputFormat(job, tableName, tableClassName);
The configureOutputFormat method (ImportJobBase class):
/**
* Configure the output format to use for the job.
*/
@Override
protected void configureOutputFormat(Job job, String tableName,
String tableClassName) throws ClassNotFoundException, IOException {
job.setOutputFormatClass(getOutputFormatClass());
if (isHCatJob) {
LOG.debug("Configuring output format for HCatalog import job");
SqoopHCatUtilities.configureImportOutputFormat(options, job,
getContext().getConnManager(), tableName, job.getConfiguration());
return;
}
if (options.getFileLayout() == SqoopOptions.FileLayout.SequenceFile) {
job.getConfiguration().set("mapred.output.value.class", tableClassName);
}
if (options.shouldUseCompression()) {
FileOutputFormat.setCompressOutput(job, true);
String codecName = options.getCompressionCodec();
Class<? extends CompressionCodec> codecClass;
if (codecName == null) {
codecClass = GzipCodec.class;
} else {
Configuration conf = job.getConfiguration();
codecClass = CodecMap.getCodec(codecName, conf).getClass();
}
FileOutputFormat.setOutputCompressorClass(job, codecClass);
if (options.getFileLayout() == SqoopOptions.FileLayout.SequenceFile) {
SequenceFileOutputFormat.setOutputCompressionType(job,
CompressionType.BLOCK);
}
// SQOOP-428: Avro expects not a fully qualified class name but a "short"
// name instead (e.g. "snappy") and it needs to be set in a custom
// configuration option called "avro.output.codec".
// The default codec is "deflate".
if (options.getFileLayout() == SqoopOptions.FileLayout.AvroDataFile) {
if (codecName != null) {
String shortName =
CodecMap.getCodecShortNameByName(codecName, job.getConfiguration());
// Avro only knows about "deflate" and not "default"
if (shortName.equalsIgnoreCase("default")) {
shortName = "deflate";
}
job.getConfiguration().set(AvroJob.OUTPUT_CODEC, shortName);
} else {
job.getConfiguration()
.set(AvroJob.OUTPUT_CODEC, DataFileConstants.DEFLATE_CODEC);
}
}
if (options.getFileLayout() == SqoopOptions.FileLayout.ParquetFile) {
if (codecName != null) {
Configuration conf = job.getConfiguration();
String shortName = CodecMap.getCodecShortNameByName(codecName, conf);
if (!shortName.equalsIgnoreCase("default")) {
conf.set(ParquetJob.CONF_OUTPUT_CODEC, shortName);
}
}
}
}
Path outputPath = context.getDestination();
FileOutputFormat.setOutputPath(job, outputPath);
}
Since the command uses hcatalog, execution goes straight into this branch:
if (isHCatJob) {
LOG.debug("Configuring output format for HCatalog import job");
SqoopHCatUtilities.configureImportOutputFormat(options, job,
getContext().getConnManager(), tableName, job.getConfiguration());
return;
}
The configureImportOutputFormat method (SqoopHCatUtilities class) calls:
SqoopHCatUtilities.instance().configureHCat(opts, job, connMgr, dbTable,job.getConfiguration());
The configureHCat method (SqoopHCatUtilities class)
configureHCat mainly sets up basic table metadata and creates the HCatalog table.
createHCatTable -> launchHCatCli: these create the HCatalog table. createHCatTable builds the CREATE TABLE statement, and launchHCatCli writes that statement to a file and then executes it with the HCatalog command-line tool.
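For the command at the top of this article, the statement handed to the HCatalog CLI would look roughly like the one below. The column list is only an assumed illustration (it is derived from the source table's schema and from --map-column-hive), while the storage clause is taken verbatim from --hcatalog-storage-stanza:
create table `test_db`.`test_table` (
  `id` bigint,
  `createTime` timestamp,
  ...
)
stored as orc tblproperties ("orc.compress"="SNAPPY")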
boolean success = runJob(job); submits the MR job.