Sqoop's incremental import comes in two modes: append and lastmodified.
The main Sqoop parameters involved are:
--check-column: the column that drives the incremental import, usually an auto-increment primary key or a timestamp
--incremental: the import mode (append or lastmodified)
--last-value: the maximum value from the previous import, i.e. the value this run starts from
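Putting the three flags together, a minimal append-mode command looks roughly like this (the connection string, table, and paths here are placeholders, not values from this walkthrough):
sqoop import \
  --connect "jdbc:mysql://<host>:3306/<db>" --username <user> --password <pwd> \
  --table <table> -m 1 --target-dir <hdfs-dir> \
  --check-column id --incremental append --last-value 0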
1. Create a table with an auto-increment primary key:
create table test(
id int(20) primary key not null AUTO_INCREMENT,
name varchar(32)
)charset=utf8;
2. Insert some rows:
insert into test(id,name) values(1,'xiaozhao');
insert into test(id,name) values(2,'xiaozhang');
insert into test(id,name) values(3,'xiaosun');
insert into test(id,name) values(5,'xiaoli');
insert into test(id,name) values(6,'xiaozhou');
insert into test(id,name) values(4,'xiaowu');
3. Use Sqoop to import the table into HDFS under /test:
sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test -m 1 --target-dir /test/test
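To confirm the initial snapshot landed, you can list and print the files Sqoop wrote (exact part-file names may vary):
hadoop fs -ls /test/test
hadoop fs -cat /test/test/part-m-00000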
4. Insert more rows:
insert into test(id,name) values(7,'xiaozheng');
insert into test(id,name) values(8,'xiaowang');
5. Show the data:
mysql> select * from test;
+----+-----------+
| id | name      |
+----+-----------+
|  1 | xiaozhao  |
|  2 | xiaozhang |
|  3 | xiaosun   |
|  4 | xiaowu    |
|  5 | xiaoli    |
|  6 | xiaozhou  |
|  7 | xiaozheng |
|  8 | xiaowang  |
+----+-----------+
8 rows in set (0.00 sec)
6. Incremental import in append mode:
sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test -m 1 --check-column id --incremental append --last-value 5 --target-dir /test/test
7. View the results:
hadoop:hadoop:/home/hadoop:>hadoop fs -ls test
18/01/28 12:12:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2018-01-28 11:49 test/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 62 2018-01-28 11:49 test/part-m-00000
-rw-r--r-- 1 hadoop supergroup 34 2018-01-28 12:11 test/part-m-00001
hadoop:hadoop:/home/hadoop:>hadoop fs -cat test/part-m-00001
18/01/28 12:12:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
6,xiaozhou
7,xiaozheng
8,xiaowang
hadoop:hadoop:/home/hadoop:>hadoop fs -cat test/part-m-00000
18/01/28 12:13:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,xiaozhao
2,xiaozhang
3,xiaosun
4,xiaowu
5,xiaoli
6,xiaozhou
PS: Notice that I set --last-value to 5, and Sqoop starts importing from id 6. In other words, the lower bound is exclusive (an open interval).
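Conceptually, append mode applies a strict greater-than filter on the check column. The following is only a sketch of the effective query, not Sqoop's literal generated SQL:
select id, name from test where id > 5;  -- pulls 6, 7, 8; id = 5 itself is excluded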
The lastmodified mode differs from append in that the check column can be a timestamp, so rows are imported in time order. It also lets you choose how the incremental data lands on HDFS: --append behaves just like append mode and simply adds new files, while --merge-key merges the increment into the existing data, leaving the final result in a single file, part-r-00000.
1. Create a table with a timestamp column:
create table test2(
id int,
name varchar(32),
lasttime timestamp default CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)charset=utf8;
2. Insert some rows:
insert into test2(id,name) values(1,'xiaozhao');
insert into test2(id,name) values(2,'xiaozhang');
insert into test2(id,name) values(3,'xiaosun');
insert into test2(id,name) values(4,'xiaowu');
insert into test2(id,name) values(5,'xiaoli');
insert into test2(id,name) values(6,'xiaozhou');
3. Import into HDFS:
sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2
4. Insert another row:
insert into test2(id,name) values(7,'xiaozheng');
5. Sqoop incremental import (--append):
sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2 --check-column lasttime --incremental lastmodified --last-value "2018-01-28 12:24:40" --append
6. View the results:
hadoop:hadoop:/home/hadoop:>hadoop fs -cat /test/test2/part-m-00000
18/01/28 12:28:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,xiaozhao,2018-01-28 12:23:58.0
2,xiaozhang,2018-01-28 12:23:58.0
3,xiaosun,2018-01-28 12:24:00.0
4,xiaowu,2018-01-28 12:24:29.0
5,xiaoli,2018-01-28 12:24:39.0
6,xiaozhou,2018-01-28 12:24:40.0
hadoop:hadoop:/home/hadoop:>hadoop fs -cat /test/test2/part-m-00001
18/01/28 12:35:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
6,xiaozhou,2018-01-28 12:24:40.0
7,xiaozheng,2018-01-28 12:29:28.0
There is a problem here. The --last-value I passed is the latest timestamp from the first import, yet with --append the time bound is inclusive (a closed interval), so row 6 is imported again, leaving a small amount of duplicate data. Moreover, every incremental run produces yet another file, which is clearly not ideal. The alternative, --merge-key, addresses both problems: it merges duplicate rows on the given key through a full MapReduce pass, writing the result to a single file, part-r-00000, and it applies updates to existing rows as well.
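Conceptually, a lastmodified run covers a half-open time window: inclusive at --last-value, exclusive at the job's start time. Again, this is a sketch of the effective query, not Sqoop's literal generated SQL:
select id, name, lasttime
from test2
where lasttime >= '2018-01-28 12:24:40'  -- inclusive lower bound: row 6 comes back
  and lasttime < current_timestamp;      -- upper bound frozen at job start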
7. Update a row:
update test2 set name = 'MARK' where id = 1;
8. View the data:
mysql> select * from test2;
+------+-----------+---------------------+
| id   | name      | lasttime            |
+------+-----------+---------------------+
|    1 | MARK      | 2018-01-28 12:44:37 |
|    2 | xiaozhang | 2018-01-28 12:23:58 |
|    3 | xiaosun   | 2018-01-28 12:24:00 |
|    4 | xiaowu    | 2018-01-28 12:24:29 |
|    5 | xiaoli    | 2018-01-28 12:24:39 |
|    6 | xiaozhou  | 2018-01-28 12:24:40 |
|    7 | xiaozheng | 2018-01-28 12:29:28 |
+------+-----------+---------------------+
7 rows in set (0.00 sec)
9. Sqoop incremental import (--merge-key):
sqoop import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2 --check-column lasttime --incremental lastmodified --last-value "2018-01-28 12:24:40" --merge-key id
10. View the results:
hadoop:hadoop:/home/hadoop:>hadoop fs -cat /test/test2/part-r-00000
18/01/28 12:52:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,MARK,2018-01-28 12:44:37.0
2,xiaozhang,2018-01-28 12:23:58.0
3,xiaosun,2018-01-28 12:24:00.0
4,xiaowu,2018-01-28 12:24:29.0
5,xiaoli,2018-01-28 12:24:39.0
6,xiaozhou,2018-01-28 12:24:40.0
7,xiaozheng,2018-01-28 12:29:28.0
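Finally, tracking --last-value by hand is tedious and error-prone. Sqoop's saved jobs record the last value in their metastore and advance it automatically after each successful run. A minimal sketch (the job name test2_inc is arbitrary, and Sqoop may prompt for the password on --exec unless the metastore is configured to store it):
sqoop job --create test2_inc -- import --connect "jdbc:mysql://localhost:3306/wl?useUnicode=true&characterEncoding=utf-8" --username root --password 123456 --table test2 -m 1 --target-dir /test/test2 --check-column lasttime --incremental lastmodified --merge-key id
sqoop job --exec test2_inc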