Sqoop2's incremental import is straightforward: it synchronizes new rows based on the changing value of a chosen check column. If the export target is a directory that already contains files, append mode must be set to true. The relevant options from the official documentation:
[Figure 1: screenshot of the incremental-read options in the official Sqoop2 documentation]
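Conceptually, an incremental run reads only the rows whose check-column value exceeds the stored last value. For the example below (Check column: id, Last value: 3), that is roughly equivalent to the following query:

-- Rough equivalent of the incremental read configured later in this walkthrough:
SELECT * FROM public.web_customer WHERE id > 3;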
1. Full synchronization
The table contains 3 rows, as shown below:
[Figure 2: query result listing the 3 existing rows of web_customer]
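Since the screenshot is not reproduced here, the row count can be checked directly in psql. The database name web below is hypothetical; substitute your own:

$ psql -d web -c "SELECT count(*) FROM public.web_customer;"
 count
-------
     3
(1 row)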
1.1 Create the job

sqoop:000> create job -f PostgreSQL_Link -t HDFS_Link
Creating job for links with from name PostgreSQL_Link and to name HDFS_Link
Please fill following values to create new job object
Name: Web_PSQL_To_HDFS

Database source

Schema name: public
Table name: web_customer
SQL statement: 
Column names: 
There are currently 0 values in the list:
element# 
Partition column: id
Partition column nullable: 
Boundary query: 

Incremental read

Check column: 
Last value: 

Target configuration

Override null value: 
Null value: 
File format: 
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
  2 : PARQUET_FILE
Choose: 0
Compression codec: 
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 1
Custom codec: 
Output directory: /user/DataSource/PostgreSQL/WEB_CUSTOMER
Append mode: 

Throttling resources

Extractors: 
Loaders: 

Classpath configuration

Extra mapper jars: 
There are currently 0 values in the list:
element# 
New job was successfully created with validation status OK  and name Web_PSQL_To_HDFS
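Before running, the stored configuration can be reviewed from the shell. Name-based lookup with -n works in the 1.99.7 shell; older 1.99.x releases may only accept a numeric id via -j, so verify against your version:

sqoop:000> show job -n Web_PSQL_To_HDFS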

1.2 Run the job

sqoop:000> start job -s -n Web_PSQL_To_HDFS
Submission details
Job Name: Web_PSQL_To_HDFS
Server URL: http://localhost:12000/sqoop/
Created by: hadoop
Creation date: 2017-12-25 10:10:39 CST
Lastly updated by: hadoop
External ID: job_1513929521869_0014
        http://hdp04:8088/proxy/application_1513929521869_0014/
2017-12-25 10:10:39 CST: BOOTING  - Progress is not available
2017-12-25 10:10:51 CST: RUNNING  - 0.00 %
2017-12-25 10:11:01 CST: RUNNING  - 35.00 %
2017-12-25 10:11:14 CST: SUCCEEDED 
Counters:
        org.apache.hadoop.mapreduce.FileSystemCounter
                FILE_LARGE_READ_OPS: 0
                FILE_WRITE_OPS: 0
                HDFS_READ_OPS: 10
                HDFS_BYTES_READ: 1501
                HDFS_LARGE_READ_OPS: 0
                FILE_READ_OPS: 0
                FILE_BYTES_WRITTEN: 3420790
                FILE_BYTES_READ: 0
                HDFS_WRITE_OPS: 10
                HDFS_BYTES_WRITTEN: 297
        org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
                BYTES_WRITTEN: 0
        org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
                BYTES_READ: 0
        org.apache.hadoop.mapreduce.JobCounter
                TOTAL_LAUNCHED_MAPS: 10
                MB_MILLIS_MAPS: 115728384
                VCORES_MILLIS_MAPS: 56508
                SLOTS_MILLIS_MAPS: 56508
                OTHER_LOCAL_MAPS: 10
                MILLIS_MAPS: 56508
        org.apache.sqoop.submission.counter.SqoopCounters
                ROWS_READ: 3                                     -- 3 rows read
                ROWS_WRITTEN: 3                                  -- 3 rows written
        org.apache.hadoop.mapreduce.TaskCounter
                SPILLED_RECORDS: 0
                MERGED_MAP_OUTPUTS: 0
                VIRTUAL_MEMORY_BYTES: 41340801024
                MAP_INPUT_RECORDS: 0
                SPLIT_RAW_BYTES: 1501
                MAP_OUTPUT_RECORDS: 3
                FAILED_SHUFFLE: 0
                PHYSICAL_MEMORY_BYTES: 3848200192
                GC_TIME_MILLIS: 2249
                CPU_MILLISECONDS: 32280
                COMMITTED_HEAP_BYTES: 3965714432
Job executed successfully

The web_customer table holds only these 3 rows, and all of them were synchronized successfully.
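The result can also be verified on HDFS. Because the job chose the DEFAULT compression codec, the part files are compressed; hdfs dfs -text decompresses them on the fly, unlike -cat. Part-file names vary per run:

$ hdfs dfs -ls /user/DataSource/PostgreSQL/WEB_CUSTOMER
$ hdfs dfs -text /user/DataSource/PostgreSQL/WEB_CUSTOMER/* | wc -l
3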
2. Incremental synchronization
2.1 Insert new data into the table
The 4th row is newly inserted, as shown below:
[Figure 3: query result showing the newly inserted 4th row of web_customer]
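The new row is added with an ordinary INSERT. The column list and values below are hypothetical; only the table name and the id column come from this walkthrough:

web=# INSERT INTO public.web_customer (id, name) VALUES (4, 'new customer');  -- hypothetical columns/values
INSERT 0 1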
2.2 Create the job

sqoop:000> create job -f PostgreSQL_Link -t HDFS_Link
Creating job for links with from name PostgreSQL_Link and to name HDFS_Link
Please fill following values to create new job object
Name: Web_PSQL_To_HDFS_I

Database source

Schema name: public
Table name: web_customer
SQL statement: 
Column names: 
There are currently 0 values in the list:
element# 
Partition column: id
Partition column nullable: 
Boundary query: 

Incremental read

Check column: id         -- the column whose value is tracked between runs
Last value: 3            -- the id of the last row captured by the full sync

Target configuration

Override null value: 
Null value: 
File format: 
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
  2 : PARQUET_FILE
Choose: 0
Compression codec: 
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 1
Custom codec: 
Output directory: /user/DataSource/PostgreSQL/WEB_CUSTOMER
Append mode: true                         -- required to write alongside the full-sync files; without it the job fails because the output directory already contains files.

Throttling resources

Extractors: 
Loaders: 

Classpath configuration

Extra mapper jars: 
There are currently 0 values in the list:
element# 
New job was successfully created with validation status OK  and name Web_PSQL_To_HDFS_I
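Since this job differs from Web_PSQL_To_HDFS only in the Incremental read section and Append mode, cloning the original job and editing those three fields is a quicker alternative to retyping everything. Flag support varies across 1.99.x releases (older shells take -j with a numeric id), so check your version:

sqoop:000> clone job -n Web_PSQL_To_HDFS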

2.3 Run the job

sqoop:000> start job -s -n Web_PSQL_To_HDFS_I
Submission details
Job Name: Web_PSQL_To_HDFS_I
Server URL: http://localhost:12000/sqoop/
Created by: hadoop
Creation date: 2017-12-25 10:43:47 CST
Lastly updated by: hadoop
External ID: job_1513929521869_0015
        http://hdp04:8088/proxy/application_1513929521869_0015/
2017-12-25 10:43:47 CST: BOOTING  - Progress is not available
2017-12-25 10:43:59 CST: RUNNING  - 0.00 %
2017-12-25 10:44:09 CST: RUNNING  - 45.00 %
2017-12-25 10:44:23 CST: SUCCEEDED 
Counters:
        org.apache.hadoop.mapreduce.FileSystemCounter
                FILE_LARGE_READ_OPS: 0
                FILE_WRITE_OPS: 0
                HDFS_READ_OPS: 10
                HDFS_BYTES_READ: 1351
                HDFS_LARGE_READ_OPS: 0
                FILE_READ_OPS: 0
                FILE_BYTES_WRITTEN: 3423100
                FILE_BYTES_READ: 0
                HDFS_WRITE_OPS: 10
                HDFS_BYTES_WRITTEN: 129
        org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
                BYTES_WRITTEN: 0
        org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
                BYTES_READ: 0
        org.apache.hadoop.mapreduce.JobCounter
                TOTAL_LAUNCHED_MAPS: 10
                MB_MILLIS_MAPS: 113948672
                VCORES_MILLIS_MAPS: 55639
                SLOTS_MILLIS_MAPS: 55639
                OTHER_LOCAL_MAPS: 10
                MILLIS_MAPS: 55639
        org.apache.sqoop.submission.counter.SqoopCounters
                ROWS_READ: 1                           -- 1 new row read
                ROWS_WRITTEN: 1                        -- 1 new row written
        org.apache.hadoop.mapreduce.TaskCounter
                SPILLED_RECORDS: 0
                MERGED_MAP_OUTPUTS: 0
                VIRTUAL_MEMORY_BYTES: 41336942592
                MAP_INPUT_RECORDS: 0
                SPLIT_RAW_BYTES: 1351
                MAP_OUTPUT_RECORDS: 1
                FAILED_SHUFFLE: 0
                PHYSICAL_MEMORY_BYTES: 4070748160
                GC_TIME_MILLIS: 2363
                CPU_MILLISECONDS: 31730
                COMMITTED_HEAP_BYTES: 4415553536
Job executed successfully
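A final check on HDFS should now show 4 rows, spread across the original files and the newly appended ones:

$ hdfs dfs -text /user/DataSource/PostgreSQL/WEB_CUSTOMER/* | wc -l
4

For the next incremental run, Last value must reflect the newest synchronized id (4). Depending on the release, Sqoop2 may advance it automatically after a successful run; if yours does not, update the job by hand (name-based lookup as in the 1.99.7 shell):

sqoop:000> update job -n Web_PSQL_To_HDFS_I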