Sqoop Notes

Basic commands

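import-all-tables: import every table in the source database

--connect: JDBC connection URL

--username: MySQL user name

--password: MySQL password

--hive-database: the Hive database to import into

-m: number of parallel map tasks used for the import; the default is 4. For small data sets, setting it to 1 is often faster. A table without a primary key can only be imported with a single map task unless a split column is specified, for example: -m 1

--create-hive-table: makes the job fail if the target Hive table already exists. With --hive-import, Sqoop creates a missing table on its own, so this option is not what makes the table appear in Hive.

--hive-import: import the data into Hive

--hive-overwrite: overwrite the existing data in the Hive table

--map-column-hive: override the SQL-to-Hive type mapping for the listed columns, for example: --map-column-hive cost="DECIMAL",date="DATE"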
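
To show how the options above fit together, here is a minimal sketch of one import-all-tables call. The JDBC URL, user name, Hive database name, and the MYSQL_PASSWORD environment variable are all assumptions made up for illustration; the command is only assembled, not executed, so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: import every table of one MySQL database into Hive.
# The JDBC URL, user, and Hive database name are placeholders.
set -euo pipefail

# Read the password from the environment so it is not hard-coded here.
MYSQL_PASSWORD="${MYSQL_PASSWORD:-changeme}"

import_all_cmd=(
  sqoop import-all-tables
  --connect "jdbc:mysql://db.example.com:3306/sales"
  --username "etl_user"
  --password "$MYSQL_PASSWORD"
  --hive-import
  --hive-database "sales_raw"
  --hive-overwrite
  -m 1
)

# Quick sanity check: show the tool name and how many words were assembled.
echo "${import_all_cmd[0]} ${import_all_cmd[1]}: ${#import_all_cmd[@]} words"

# Uncomment to run the import once the placeholders are replaced:
# "${import_all_cmd[@]}"
```

Keeping the command in an array avoids quoting problems when a value contains spaces, and -m 1 keeps the sketch safe for tables without a primary key.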

Hive arguments

Argument                     Description
--hive-home                  Override $HIVE_HOME.
--hive-import                Import tables into Hive (uses Hive's default delimiters if none are set).
--hive-overwrite             Overwrite existing data in the Hive table.
--create-hive-table          If set, the job fails if the target Hive table exists. By default this property is false.
--hive-table                 Sets the table name to use when importing to Hive.
--hive-drop-import-delims    Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement    Replaces \n, \r, and \01 in string fields with a user-defined string when importing to Hive.
--hive-partition-key         Name of the Hive field that partitions are sharded on.
--hive-partition-value       String value that serves as the partition key for the data imported into Hive in this job.
--map-column-hive            Override the default mapping from SQL type to Hive type for configured columns.

Job commands

Job management arguments:
--create          Create a new saved job
--delete          Delete a saved job
--exec            Run a saved job
--help            Print usage instructions
--list            List saved jobs
--meta-connect    Specify JDBC connect string for the metastore
--show            Show the parameters for a saved job
--verbose         Print more information while working
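
As an illustration of the job arguments listed above, the following sketch walks through the life cycle of a saved job. The job name, JDBC URL, user, table, and password file path are hypothetical. The run helper only prints each command, so nothing is executed until its body is changed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dry-run helper: prints a command instead of executing it.
# Change the body to "$@" to really run the commands.
run() { echo "$*"; }

JOB_NAME="daily_orders_import"   # hypothetical job name

# Define the job. The lone "--" separates the job options from the
# tool (here: import) and that tool's own arguments.
run sqoop job --create "$JOB_NAME" -- import \
  --connect "jdbc:mysql://db.example.com:3306/sales" \
  --username "etl_user" \
  --password-file "/user/etl/mysql.password" \
  --table "orders" \
  --hive-import \
  -m 1

run sqoop job --list                  # list all saved jobs
run sqoop job --show "$JOB_NAME"      # show the stored parameters
run sqoop job --exec "$JOB_NAME"      # run the saved job
run sqoop job --delete "$JOB_NAME"    # remove it again
```

A saved job is convenient for recurring imports because the parameters are stored once and the scheduler only needs the --exec call.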

Incremental imports

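Sqoop uses the table's primary key to split the data evenly across map tasks. For parallel imports you can also set the split column and related options yourself; see the official documentation for details.
--check-column names the column to examine, and --incremental selects the incremental mode. Only two values are valid: append and lastmodified. With --incremental append, Sqoop imports the rows whose --check-column value is greater than the value given by --last-value. With --incremental lastmodified, Sqoop imports the rows whose --check-column value (a timestamp) is more recent than the timestamp given by --last-value.

--incremental append --check-column num_id --last-value 0 imports by a monotonically increasing column, which is usually the id column.
The other way is to go by a timestamp, for example:
--incremental lastmodified --check-column created --last-value '2012-02-01 11:00:00'
--hive-table T_R_ACCOUNTACTIVE_DAY names the Hive table to import into.
--where sets the import condition. Restricting it to rows whose time falls within the previous day makes it possible to import one day's increment per daily run.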
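
The two approaches described above can be sketched as follows: an append-mode import driven by num_id, and a plain import restricted to yesterday's rows through --where. The JDBC URL, user, source table name, and the where_for_day helper are assumptions for illustration; the MySQL DATE() function and GNU date syntax are assumed as well. The commands are printed rather than executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Dry-run helper: prints a command instead of executing it.
# Change the body to "$@" to really run the commands.
run() { echo "$*"; }

CONNECT="jdbc:mysql://db.example.com:3306/sales"   # placeholder JDBC URL
SRC_TABLE="account_active_day"                     # placeholder source table

# Build a WHERE clause that selects the rows of one calendar day.
# $1 = timestamp column, $2 = day as YYYY-MM-DD.
where_for_day() {
  echo "DATE($1) = '$2'"
}

# Variant 1: append mode, driven by an increasing numeric column.
run sqoop import --connect "$CONNECT" --username etl_user -P \
  --table "$SRC_TABLE" \
  --hive-import --hive-table T_R_ACCOUNTACTIVE_DAY \
  --incremental append --check-column num_id --last-value 0 \
  -m 1

# Variant 2: plain import restricted to yesterday's rows via --where.
# "date -d yesterday" is GNU date syntax.
YESTERDAY="$(date -d yesterday +%F)"
run sqoop import --connect "$CONNECT" --username etl_user -P \
  --table "$SRC_TABLE" \
  --hive-import --hive-table T_R_ACCOUNTACTIVE_DAY \
  --where "$(where_for_day created "$YESTERDAY")" \
  -m 1
```

With variant 1 the next run has to start from the highest num_id already imported, so the --last-value of 0 is only right for the first run.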

Scheduled jobs

Command:

crontab -e

This opens your personal crontab file in edit mode (it is created if it does not exist yet).
A crontab line has the following format:
{minute} {hour} {day-of-month} {month} {day-of-week} {full-path-to-shell-script}
----------
o minute: 0-59
o hour: 0-23
o day-of-month: 1-31
o month: 1-12 (1 is January, 12 is December)
o day-of-week: 0-7 (Sunday can be 0 or 7)

Crontab examples

1. Run at 00:01 every day
1 0 * * * /home/linrui/XXXX.sh
2. Run a backup job at 23:59 every weekday
59 23 * * 1,2,3,4,5 /home/linrui/XXXX.sh
Or, equivalently:
59 23 * * 1-5 /home/linrui/XXXX.sh
3. Run a command every minute
*/1 * * * * /home/linrui/XXXX.sh
4. Run at 14:10 on the 1st of every month
10 14 1 * * /home/linrui/XXXX.sh
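
Entries like the ones above can also be installed without opening an editor. The sketch below merges one entry into an existing crontab and skips it if an identical line is already there. The add_entry helper and the log file path are hypothetical; the script only previews the result, and the install step is left as a comment.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Entry to install: run the import script at 00:01 every day and keep a log.
ENTRY='1 0 * * * /home/linrui/XXXX.sh >> /tmp/sqoop_import.log 2>&1'

# Read an existing crontab on stdin and print it with the entry appended,
# unless an identical line is already present.
add_entry() {
  local entry="$1" current
  current="$(cat)"
  if [ -n "$current" ]; then
    printf '%s\n' "$current"
  fi
  if ! grep -Fxq -- "$entry" <<<"$current"; then
    printf '%s\n' "$entry"
  fi
}

# Preview the result for an empty crontab.
printf '' | add_entry "$ENTRY"

# To install for real (this replaces the user's crontab with the merged text):
# { crontab -l 2>/dev/null || true; } | add_entry "$ENTRY" | crontab -
```

Redirecting the script's output to a log file is worth the extra characters, because cron jobs otherwise fail silently.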

A very strange error

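While importing data I ran into a problem: the import reported no errors and the corresponding files showed up in HDFS, but SHOW TABLES in Hive did not list the newly imported table. There are two ways around this. One is to create the table manually in Hive with a CREATE statement and then run a SELECT on it, at which point the table turns out to contain the data. From this I guessed (and it is only a guess) that the table's metadata was not written to Hive after the import. After some reading I found that Hive stores its metadata in Derby by default, so the other option I considered was replacing Derby with MySQL.
Method: http://blog.sina.com.cn/s/blog_3fe961ae0101925l.html
Once the metadata was stored in MySQL, the import worked without problems.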
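
One possible explanation, consistent with the guess above but not verified here, is that embedded Derby keeps its metastore_db directory relative to the working directory of the process, so Sqoop and the Hive shell can end up with different metastores when started from different directories. The sketch below prints the hive-site.xml properties usually involved in pointing the metastore at MySQL. The host, database name, user, and password are placeholders, and the print_metastore_props helper is hypothetical. The linked post remains the reference for the full procedure.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the hive-site.xml properties that point the metastore at MySQL.
# $1 = MySQL host, $2 = metastore database, $3 = MySQL user, $4 = password.
# The output belongs inside the <configuration> element of hive-site.xml.
print_metastore_props() {
  cat <<EOF
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://${1}:3306/${2}?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <!-- Connector/J 5.x class name; 8.x uses com.mysql.cj.jdbc.Driver -->
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>${3}</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>${4}</value>
</property>
EOF
}

# All four arguments are placeholders.
print_metastore_props db.example.com hive_metastore hive CHANGE_ME
```

The MySQL JDBC driver jar also has to be on Hive's classpath for these settings to take effect.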