Sqoop Incremental Import

Sqoop supports two incremental import modes: append and lastmodified.

The main Sqoop parameters involved are:

--check-column: the column that the incremental import relies on to detect new rows, usually an auto-increment primary key id or a timestamp

--incremental: the import mode (append or lastmodified)

--last-value: the largest check-column value from the previous import, which is the starting point for this one

  • Append mode

1. Create a table with an auto-increment primary key:

create table test(
id int(20) primary key not null AUTO_INCREMENT,
name varchar(32)
)charset=utf8;

2. Insert data:

insert into test(id,name) values(1,'xiaozhao');
insert into test(id,name) values(2,'xiaozhang');
insert into test(id,name) values(3,'xiaosun');
insert into test(id,name) values(5,'xiaoli');
insert into test(id,name) values(6,'xiaozhou');
insert into test(id,name) values(4,'xiaowu');

3. Use Sqoop to import the table into HDFS under /test:

 sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test -m 1 --target-dir /test/test

4. Insert more data:

insert into test(id,name) values(7,'xiaozheng');
insert into test(id,name) values(8,'xiaowang');

5. Display the data:

mysql> select * from test;
+----+-----------+
| id | name      |
+----+-----------+
|  1 | xiaozhao  |
|  2 | xiaozhang |
|  3 | xiaosun   |
|  4 | xiaowu    |
|  5 | xiaoli    |
|  6 | xiaozhou  |
|  7 | xiaozheng |
|  8 | xiaowang  |
+----+-----------+
8 rows in set (0.00 sec)

6. Run an incremental import in append mode:

sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test -m 1 --check-column id --incremental append --last-value 5 --target-dir /test/test

7. Check the result:

hadoop:hadoop:/home/hadoop:>hadoop fs -ls test
18/01/28 12:12:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your pla
Found 3 items
-rw-r--r--   1 hadoop supergroup          0 2018-01-28 11:49 test/_SUCCESS
-rw-r--r--   1 hadoop supergroup         62 2018-01-28 11:49 test/part-m-00000
-rw-r--r--   1 hadoop supergroup         34 2018-01-28 12:11 test/part-m-00001
hadoop:hadoop:/home/hadoop:>hadoop fs -cat test/part-m-00001
18/01/28 12:12:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your pla
6,xiaozhou
7,xiaozheng
8,xiaowang
hadoop:hadoop:/home/hadoop:>hadoop fs -cat test/part-m-00000
18/01/28 12:13:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your pla
1,xiaozhao
2,xiaozhang
3,xiaosun
4,xiaowu
5,xiaoli
6,xiaozhou

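PS: Note that I set --last-value to 5, so Sqoop started importing from id 6. In other words, the lower bound is exclusive (an open interval). Row 6 had already been written by the first import, which is why it appears in both part-m-00000 and part-m-00001; using --last-value 6 would have avoided the duplicate.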
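
The exclusive lower bound can be illustrated without a cluster. The sketch below is not Sqoop itself: it uses awk on a local copy of the table rows to mimic the filter that append mode appears to apply in the run above (id strictly greater than --last-value), and then derives the value to pass as --last-value on the next run. The function names select_new_rows and next_last_value are made up for this illustration.

```shell
#!/usr/bin/env bash
# Rows of the test table as id,name (same data as in the walkthrough).
rows='1,xiaozhao
2,xiaozhang
3,xiaosun
4,xiaowu
5,xiaoli
6,xiaozhou
7,xiaozheng
8,xiaowang'

# Mimic append mode: keep rows whose id is strictly greater than last_value.
select_new_rows() {
  local last_value="$1"
  awk -F, -v last="$last_value" '$1 + 0 > last + 0'
}

# The largest id seen so far is the candidate --last-value for the next run.
next_last_value() {
  awk -F, 'NR == 1 || $1 + 0 > max { max = $1 + 0 } END { print max }'
}

printf '%s\n' "$rows" | select_new_rows 5   # rows with id 6, 7 and 8
printf '%s\n' "$rows" | next_last_value     # largest id in the data
```

With last_value set to 5 the filter returns ids 6 to 8, matching part-m-00001 above; with 6 it would return only 7 and 8.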

  • Lastmodified mode

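Unlike append mode, this mode takes a timestamp column as the check column and imports rows according to their modification time. It also lets you choose how the incremental data is stored in HDFS: --append behaves like append mode and adds the new rows as additional files, while --merge-key merges the new rows into the existing data, so that the final result here is a single file: part-r-00000.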

1. Create a SQL table with a timestamp column:

create table test2(
id int,
name varchar(32),
lasttime timestamp default CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)charset=utf8;

2. Insert data:

insert into test2(id,name) values(1,'xiaozhao');
insert into test2(id,name) values(2,'xiaozhang');
insert into test2(id,name) values(3,'xiaosun');
insert into test2(id,name) values(4,'xiaowu');
insert into test2(id,name) values(5,'xiaoli');
insert into test2(id,name) values(6,'xiaozhou');

3. Import into HDFS:

sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2

4. Insert more data:

insert into test2(id,name) values(7,'xiaozheng');

5. Sqoop incremental import (--append):


sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2 --check-column lasttime --incremental lastmodified --last-value "2018-01-28 12:24:40" --append 

6. Check the result:


hadoop:hadoop:/home/hadoop:>hadoop fs -cat /test/test2/part-m-00000

18/01/28 12:28:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,xiaozhao,2018-01-28 12:23:58.0
2,xiaozhang,2018-01-28 12:23:58.0
3,xiaosun,2018-01-28 12:24:00.0
4,xiaowu,2018-01-28 12:24:29.0
5,xiaoli,2018-01-28 12:24:39.0
6,xiaozhou,2018-01-28 12:24:40.0

hadoop:hadoop:/home/hadoop:>hadoop fs -cat /test/test2/part-m-00001
18/01/28 12:35:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
6,xiaozhou,2018-01-28 12:24:40.0
7,xiaozheng,2018-01-28 12:29:28.0

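There is a problem here. The --last-value I specified corresponds to the first import (it equals the lasttime of row 6, the last row imported at that point), but in lastmodified mode the lower bound on the timestamp is inclusive (a closed interval), so row 6 is imported again and the data contains a small amount of redundancy. In addition, every incremental import produces a new file, which is not ideal. To address both problems, use the other option: --merge-key. It merges rows that share the same key, runs a full MapReduce job that produces part-r-00000, and also picks up updates to existing rows.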
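
The inclusive lower bound can be reproduced locally. The following is a sketch with awk, not Sqoop's own logic: it assumes the filter is "lasttime greater than or equal to --last-value", which is what the output above suggests. The function name select_modified_rows is hypothetical, and the timestamps are written as MySQL prints them (without the trailing .0 that appears in the HDFS files).

```shell
#!/usr/bin/env bash
# Rows of test2 as id,name,lasttime after row 7 was inserted.
rows='1,xiaozhao,2018-01-28 12:23:58
2,xiaozhang,2018-01-28 12:23:58
3,xiaosun,2018-01-28 12:24:00
4,xiaowu,2018-01-28 12:24:29
5,xiaoli,2018-01-28 12:24:39
6,xiaozhou,2018-01-28 12:24:40
7,xiaozheng,2018-01-28 12:29:28'

# Mimic lastmodified mode: keep rows whose timestamp is >= last_value.
# Appending "" forces awk to compare the values as strings; timestamps in
# this fixed format sort lexicographically in chronological order.
select_modified_rows() {
  local last_value="$1"
  awk -F, -v last="$last_value" '($3 "") >= (last "")'
}

# Same --last-value as in step 5: row 6 is selected again along with row 7.
printf '%s\n' "$rows" | select_modified_rows '2018-01-28 12:24:40'
```

Because the comparison is inclusive, a row whose timestamp equals the boundary is picked up a second time, which is the redundancy seen in part-m-00001.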

7. Modify data:

update test2 set name = 'MARK' where id = 1;

8. View the data:

mysql> select * from test2;
+------+-----------+---------------------+
| id   | name      | lasttime            |
+------+-----------+---------------------+
|    1 | MARK      | 2018-01-28 12:44:37 |
|    2 | xiaozhang | 2018-01-28 12:23:58 |
|    3 | xiaosun   | 2018-01-28 12:24:00 |
|    4 | xiaowu    | 2018-01-28 12:24:29 |
|    5 | xiaoli    | 2018-01-28 12:24:39 |
|    6 | xiaozhou  | 2018-01-28 12:24:40 |
|    7 | xiaozheng | 2018-01-28 12:29:28 |
+------+-----------+---------------------+
7 rows in set (0.00 sec)

9. Sqoop incremental import (--merge-key):

 sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2 --check-column lasttime --incremental lastmodified --last-value "2018-01-28 12:24:40" --merge-key id 

10. Check the result:

hadoop:hadoop:/home/hadoop:>hadoop fs -cat /test/test2/part-r-00000
18/01/28 12:52:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,MARK,2018-01-28 12:44:37.0
2,xiaozhang,2018-01-28 12:23:58.0
3,xiaosun,2018-01-28 12:24:00.0
4,xiaowu,2018-01-28 12:24:29.0
5,xiaoli,2018-01-28 12:24:39.0
6,xiaozhou,2018-01-28 12:24:40.0
7,xiaozheng,2018-01-28 12:29:28.0
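
To make the effect of --merge-key id concrete, here is a rough local imitation of the merge, assuming its rule is "for each id, a row from the new import replaces any older row with the same id". It is only an awk and sort sketch and not the MapReduce job Sqoop actually runs; merge_by_key and the variables old and new are names invented for this example.

```shell
#!/usr/bin/env bash
# Data already in /test/test2 before step 9: part-m-00000 plus part-m-00001,
# including the duplicate of row 6 left behind by the --append run.
old='1,xiaozhao,2018-01-28 12:23:58.0
2,xiaozhang,2018-01-28 12:23:58.0
3,xiaosun,2018-01-28 12:24:00.0
4,xiaowu,2018-01-28 12:24:29.0
5,xiaoli,2018-01-28 12:24:39.0
6,xiaozhou,2018-01-28 12:24:40.0
6,xiaozhou,2018-01-28 12:24:40.0
7,xiaozheng,2018-01-28 12:29:28.0'

# Rows selected by the incremental run (lasttime >= --last-value).
new='1,MARK,2018-01-28 12:44:37.0
6,xiaozhou,2018-01-28 12:24:40.0
7,xiaozheng,2018-01-28 12:29:28.0'

# Merge on the first field: later input wins, so new rows override old ones.
merge_by_key() {
  # $1 = old rows, $2 = new rows
  printf '%s\n%s\n' "$1" "$2" |
    awk -F, '{ latest[$1] = $0 } END { for (k in latest) print latest[k] }' |
    sort -t, -k1,1n
}

merge_by_key "$old" "$new"
```

The output has one row per id, with id 1 carrying the updated name MARK, which is the same content as part-r-00000 above.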
