SQOOP
Sqoop是一款开源的工具,主要用于在Hadoop生态系统与传统的数据库(如mysql)间进行数据的传递,可以将一个关系型数据库(例如 : MySQL ,Oracle ,Postgres等)中的数据导进到Hadoop的HDFS中,也可以将HDFS的数据导进到关系型数据库中。
关于Sqoop版本,Sqoop1.99.7是属于sqoop2,Sqoop1的最高版本为1.4.7。Sqoop2较之Sqoop1来说,引入CS架构,部署变的复杂了,但是耦合性降低了。
我这里使用Hadoop版本是2.8.1,相应地需要使用Sqoop2。
数据源
网上开源数据颇多,我们这里以2008年美航班起降时间数据为原料,字段详见
http://stat-computing.org/dataexpo/2009/the-data.html
简单分析
规定delay = departure delay
- 常规mysql查询(数据的下载和建表见后)
-
average delay最大的10个起飞城市
mysql> select Origin,avg(DepDelay) from airline2008 group by Origin order by avg(DepDelay) desc limit 10; +--------+---------------+ | Origin | avg(DepDelay) | +--------+---------------+ | ACK | 27.3720 | | PUB | 27.0000 | | CEC | 23.0941 | | PIR | 22.8000 | | AKN | 21.5603 | | OTH | 21.0893 | | SPI | 20.3290 | | LMT | 19.3796 | | ACY | 18.0690 | | EWR | 17.6917 | +--------+---------------+ 10 rows in set (18.91 sec) //check: select DepDelay,ArrDelay from airline2008 where Origin = "ACK" order by DepDelay desc limit 10; select avg(DepDelay) from airline2008 where Origin = "ACK";
-
-
在XX地delay超2h航班最多的一天
mysql> select Month,DayOfMonth,count(*) as delatOver2hCount from airline2008 where Origin = "DCA" and DepDelay > 120 group by Month,DayOfMonth order by count(*) desc limit 10; +-------+------------+------------------+ | Month | DayOfMonth | delatOver2hCount | +-------+------------+------------------+ | 7 | 13 | 37 | | 6 | 4 | 33 | | 6 | 16 | 27 | | 7 | 23 | 26 | | 7 | 27 | 26 | | 5 | 31 | 26 | | 12 | 20 | 23 | | 5 | 2 | 15 | | 12 | 21 | 15 | | 7 | 6 | 15 | +-------+------------+------------------+ 10 rows in set (4.82 sec)
- 如果数据规模大,则需要导入HDFS,利用Hive等大数据组件对其进行分析。这里用Sqoop将MySQL数据转移至HDFS。
数据预处理
- 下载csv
http://stat-computing.org/dataexpo/2009/2008.csv.bz2
使用bzip2 -d zipfile.bz2
解压,并更名为airline2008
- mysql建库建表
CREATE DATABASE IF NOT EXISTS local_db default charset utf8 COLLATE utf8_general_ci;
create table airline2008 (
Year int,
Month int,
DayofMonth int,
DayOfWeek int,
DepTime int,
CRSDepTime int,
ArrTime int,
CRSArrTime int,
UniqueCarrier varchar(5),
FlightNum int,
TailNum varchar(8),
ActualElapsedTime int,
CRSElapsedTime int,
AirTime int,
ArrDelay int,
DepDelay int,
Origin varchar(3),
Dest varchar(3),
Distance int,
TaxiIn int,
TaxiOut int,
Cancelled int,
CancellationCode varchar(1),
Diverted varchar(1),
CarrierDelay int,
WeatherDelay int,
NASDelay int,
SecurityDelay int,
LateAircraftDelay int
);
- 导入mysql表
mysqlimport --ignore-lines 1 --fields-terminated-by=, --local -uroot -p local_db airline2008.csv
并且添加自增主键(可选:加FIRST可以将id这列置于首位,更便于查看)
ALTER TABLE 'airline2008' ADD 'id' INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
详见
https://dev.mysql.com/doc/refman/5.6/en/mysqlimport.html
-
sqoop导入hdfs
安装sqoop2
上官网找到sqoop1.99.7,解压并创建环境变量$SQOOP_HOME
修改sqoop配置
修改
$SQOOP_HOME/conf/sqoop.properties
如下:#修改指向我的hadoop安装目录 org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/usr/local/lib/hadoop-2.8.1/etc/hadoop
修改
$SQOOP_HOME/server/catalina.properties
如下:common.loader= /usr/local/lib/hadoop-2.8.1/share/hadoop/common/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/common/lib/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/hdfs/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/hdfs/lib/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/mapreduce/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/mapreduce/lib/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/tools/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/tools/lib/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/yarn/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/yarn/lib/*.jar, /usr/local/lib/hadoop-2.8.1/share/hadoop/httpfs/tomcat/lib/*.jar
下载mysql-connector的jar包,把jar包丢到到$SQOOP_HOME/server/lib
-
Shell基本操作
键入命令sqoop2-shell
或者sqoop server start
进入sqoop2-shellsqoop2-tool verify 验证是否配置成功 sqoop.sh server start 开启server sqoop.sh server stop 关闭server sqoop.sh client 开启客户端
-
Shell关键操作
主要是connector/link/job的创建和维护-
创建link
sqoop:000> set server --host localhost --port 12000 --webapp sqoop Server is set successfully sqoop:000> show version --all client version: Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016 server version: Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016 API versions: [v1] sqoop:000> create link -connector generic-jdbc-connector Creating link for connector with name generic-jdbc-connector Please fill following values to create new link object Name: mysql-link Database connection Driver class: com.mysql.jdbc.Driver Connection String: jdbc:mysql://localhost/local_db Username: root Password: ****** Fetch Size: Connection Properties: There are currently 0 values in the map: entry# protocol=tcp There are currently 1 values in the map: protocol = tcp entry# SQL Dialect Identifier enclose: Mon Jan 22 16:00:38 CST 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification. New link was successfully created with validation status WARNING and name mysql-link sqoop:000> create link -connector hdfs-connector Creating link for connector with name hdfs-connector Please fill following values to create new link object Name: hdfs-link HDFS cluster URI: hdfs://localhost:9000 Conf directory: Additional configs:: There are currently 0 values in the map: entry# New link was successfully created with validation status OK and name hdfs-link
-
创建job
sqoop:000> create job -f "mysql-link" -t "hdfs-link" Creating job for links with from name mysql-link and to name hdfs-link Please fill following values to create new job object Name: mysql-hdfs Database source Schema name: local_db Table name: airline2008 SQL statement: Column names: There are currently 0 values in the list: element# Partition column: id ##!!!!!明天改下 Partition column nullable: Boundary query: Incremental read Check column: Last value: Target configuration Override null value: Null value: File format: 0 : TEXT_FILE 1 : SEQUENCE_FILE 2 : PARQUET_FILE Choose: 0 Compression codec: 0 : NONE 1 : DEFAULT 2 : DEFLATE 3 : GZIP 4 : BZIP2 5 : LZO 6 : LZ4 7 : SNAPPY 8 : CUSTOM Choose: 0 Custom codec: Output directory: /hadoop/sqoop Append mode: Throttling resources Extractors: Loaders: Classpath configuration Extra mapper jars: There are currently 0 values in the list: element# New job was successfully created with validation status OK and name mysql-hdfs
-
开始job
sqoop:000> start job -n mysql-hdfs
即可以完成MySQL数据导入HDFS。
-
由于airline2008表过大,导出时间略长,上次结果忘记保存了,这里使用本地之前保存的Demo表(Department)做演示,结果如下:
sqoop:000> start job -n mysql-hdfs Submission details Job Name: mysql-hdfs Server URL: http://localhost:12000/sqoop/ Created by: hadoop Creation date: 2018-01-23 18:09:07 CST Lastly updated by: hadoop External ID: job_local1619449777_0003 http://localhost:8080/ Source Connector schema: Schema{name= hive . departments ,columns=[ Text{name=dept_no,nullable=true,type=TEXT,charSize=null}, Text{name=dept_name,nullable=true,type=TEXT,charSize=null}]} 2018-01-23 18:09:07 CST: BOOTING - Progress is not available $ hdfs dfs -ls /hadoop/sqoop Found 10 items -rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/0d38a9bd-3740-4c45-b73f-d34b70679596.txt -rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/1b1db902-1595-4756-b552-c146a49600ea.txt -rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/1c847041-46fa-420d-a8fa-feb74e82a7f6.txt -rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/1fdb7a44-23fd-4cf0-9b44-f7293e72fdc8.txt -rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/5862a9bb-4b07-46d4-9dfa-f588b3e20d19.txt -rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/58e7cde4-e129-4fee-a08b-48a00e026e5d.txt -rw-r--r-- 3 hadoop supergroup 189 2018-01-23 18:09 /hadoop/sqoop/8cd3761a-6070-474f-be08-e6c2609bc6ff.txt -rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/9eae0db2-6171-4856-9182-e4137dfd6911.txt -rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/cec1349f-2e18-4227-98dd-4f50ef6abd2d.txt -rw-r--r-- 3 hadoop supergroup 0 2018-01-23 18:09 /hadoop/sqoop/dd7e8b46-2406-436d-8e47-3f39f643a4e3.txt $ hdfs dfs -cat /hadoop/sqoop/8cd3761a-6070-474f-be08-e6c2609bc6ff.txt 'd001','Marketing' 'd002','Finance' 'd003','Human Resources' 'd004','Production' 'd005','Development' 'd006','Quality Management' 'd007','Sales' 'd008','Research' 'd009','Customer Service'
-
-
注意事项
create link时,hdfs端口要和hadoop的core-site里设置的一致
-
core-site.xml中,可能需要添加如下配置
hadoop.proxyuser.root.hosts * hadoop.proxyuser.root.groups * -
记得开启verbose,方便在出错时及时查错
set option --name verbose --value true