Sqoop2的操作小记

SQOOP

Sqoop是一款开源的工具,主要用于在Hadoop生态系统与传统的数据库(如mysql)间进行数据的传递,可以将一个关系型数据库(例如 : MySQL ,Oracle ,Postgres等)中的数据导进到Hadoop的HDFS中,也可以将HDFS的数据导进到关系型数据库中。
关于Sqoop版本,Sqoop1.99.7是属于sqoop2,Sqoop1的最高版本为1.4.7。Sqoop2较之Sqoop1来说,引入CS架构,部署变的复杂了,但是耦合性降低了。
我这里使用Hadoop版本是2.8.1,相应地需要使用Sqoop2。

数据源

网上开源数据颇多,我们这里以2008年美航班起降时间数据为原料,字段详见

http://stat-computing.org/dataexpo/2009/the-data.html

简单分析

规定delay = departure delay

  • 常规mysql查询(数据的下载和建表见后)
    1. average delay最大的10个起飞城市

       mysql> select Origin,avg(DepDelay) from airline2008 group by Origin order by avg(DepDelay) desc limit 10;
       +--------+---------------+
       | Origin | avg(DepDelay) |
       +--------+---------------+
       | ACK    |       27.3720 |
       | PUB    |       27.0000 |
       | CEC    |       23.0941 |
       | PIR    |       22.8000 |
       | AKN    |       21.5603 |
       | OTH    |       21.0893 |
       | SPI    |       20.3290 |
       | LMT    |       19.3796 |
       | ACY    |       18.0690 |
       | EWR    |       17.6917 |
       +--------+---------------+
       10 rows in set (18.91 sec)
      
       //check:
       select DepDelay,ArrDelay from airline2008 where Origin = "ACK" order by DepDelay desc limit 10;
       select avg(DepDelay) from airline2008 where Origin = "ACK";
      
  1. 在XX地delay超2h航班最多的一天

         mysql> select Month,DayOfMonth,count(*) as delatOver2hCount from airline2008 where Origin = "DCA" and DepDelay > 120 group by Month,DayOfMonth order by count(*) desc limit 10;
         +-------+------------+------------------+
         | Month | DayOfMonth | delatOver2hCount |
         +-------+------------+------------------+
         |     7 |         13 |               37 |
         |     6 |          4 |               33 |
         |     6 |         16 |               27 |
         |     7 |         23 |               26 |
         |     7 |         27 |               26 |
         |     5 |         31 |               26 |
         |    12 |         20 |               23 |
         |     5 |          2 |               15 |
         |    12 |         21 |               15 |
         |     7 |          6 |               15 |
         +-------+------------+------------------+
         10 rows in set (4.82 sec)
    
  • 如果数据规模大,则需要导入HDFS,利用Hive等大数据组件对其进行分析。这里用Sqoop将MySQL数据转移至HDFS。

数据预处理

  1. 下载csv

http://stat-computing.org/dataexpo/2009/2008.csv.bz2

使用bzip2 -d zipfile.bz2解压,并更名为airline2008

  1. mysql建库建表

CREATE DATABASE IF NOT EXISTS local_db default charset utf8 COLLATE utf8_general_ci;

create table airline2008 (
  Year int,
  Month int,
  DayofMonth int,
  DayOfWeek int,
  DepTime  int,
  CRSDepTime int,
  ArrTime int,
  CRSArrTime int,
  UniqueCarrier varchar(5),
  FlightNum int,
  TailNum varchar(8),
  ActualElapsedTime int,
  CRSElapsedTime int,
  AirTime int,
  ArrDelay int,
  DepDelay int,
  Origin varchar(3),
  Dest varchar(3),
  Distance int,
  TaxiIn int,
  TaxiOut int,
  Cancelled int,
  CancellationCode varchar(1),
  Diverted varchar(1),
  CarrierDelay int,
  WeatherDelay int,
  NASDelay int,
  SecurityDelay int,
  LateAircraftDelay int
);
  1. 导入mysql表
    mysqlimport --ignore-lines 1 --fields-terminated-by=, --local -uroot -p local_db airline2008.csv
    并且添加自增主键(可选:加FIRST可以将id这列置于首位,更便于查看)
    ALTER TABLE 'airline2008' ADD 'id' INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
    详见

https://dev.mysql.com/doc/refman/5.6/en/mysqlimport.html

  1. sqoop导入hdfs

    • 安装sqoop2
      上官网找到sqoop1.99.7,解压并创建环境变量$SQOOP_HOME

    • 修改sqoop配置

    修改 $SQOOP_HOME/conf/sqoop.properties 如下:

     #修改指向我的hadoop安装目录  
     org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/usr/local/lib/hadoop-2.8.1/etc/hadoop
    

    修改 $SQOOP_HOME/server/catalina.properties 如下:

     common.loader=
     /usr/local/lib/hadoop-2.8.1/share/hadoop/common/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/common/lib/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/hdfs/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/hdfs/lib/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/mapreduce/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/mapreduce/lib/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/tools/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/tools/lib/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/yarn/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/yarn/lib/*.jar,
     /usr/local/lib/hadoop-2.8.1/share/hadoop/httpfs/tomcat/lib/*.jar
    

    下载mysql-connector的jar包,把jar包丢到到$SQOOP_HOME/server/lib

    • Shell基本操作
      键入命令 sqoop2-shell 或者 sqoop server start 进入sqoop2-shell

        sqoop2-tool verify 验证是否配置成功
        sqoop.sh server start 开启server
        sqoop.sh server stop 关闭server
        sqoop.sh client 开启客户端
      
    • Shell关键操作
      主要是connector/link/job的创建和维护

      • 创建link

          sqoop:000> set server --host localhost --port 12000 --webapp sqoop
          Server is set successfully
          sqoop:000> show version --all
          client version:
            Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb 
            Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
          server version:
            Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb 
            Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
          API versions:
            [v1]
          sqoop:000> create link -connector generic-jdbc-connector
          Creating link for connector with name generic-jdbc-connector
          Please fill following values to create new link object
          Name: mysql-link
          
          Database connection
          
          Driver class: com.mysql.jdbc.Driver
          Connection String: jdbc:mysql://localhost/local_db
          Username: root
          Password: ******
          Fetch Size: 
          Connection Properties: 
          There are currently 0 values in the map:
          entry# protocol=tcp
          There are currently 1 values in the map:
          protocol = tcp
          entry# 
          
          SQL Dialect
          
          Identifier enclose:  
          Mon Jan 22 16:00:38 CST 2018 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
          New link was successfully created with validation status WARNING and name mysql-link
          
          sqoop:000> create link -connector hdfs-connector
          Creating link for connector with name hdfs-connector
          Please fill following values to create new link object
          Name: hdfs-link
          
          HDFS cluster
          
          URI: hdfs://localhost:9000
          Conf directory: 
          Additional configs:: 
          There are currently 0 values in the map:
          entry# 
          New link was successfully created with validation status OK and name hdfs-link
        
      • 创建job

          sqoop:000> create job -f "mysql-link" -t "hdfs-link"
          Creating job for links with from name mysql-link and to name hdfs-link
          Please fill following values to create new job object
          Name: mysql-hdfs
          
          Database source
          
          Schema name: local_db
          Table name: airline2008
          SQL statement: 
          Column names: 
          There are currently 0 values in the list:
          element# 
          Partition column: id ##!!!!!明天改下
          Partition column nullable: 
          Boundary query: 
          
          Incremental read
          
          Check column: 
          Last value: 
          
          Target configuration
          
          Override null value: 
          Null value: 
          File format: 
            0 : TEXT_FILE
            1 : SEQUENCE_FILE
            2 : PARQUET_FILE
          Choose: 0
          Compression codec: 
            0 : NONE
            1 : DEFAULT
            2 : DEFLATE
            3 : GZIP
            4 : BZIP2
            5 : LZO
            6 : LZ4
            7 : SNAPPY
            8 : CUSTOM
          Choose: 0
          Custom codec: 
          Output directory: /hadoop/sqoop
          Append mode: 
          
          Throttling resources
          
          Extractors: 
          Loaders: 
          
          Classpath configuration
          
          Extra mapper jars: 
          There are currently 0 values in the list:
          element# 
          New job was successfully created with validation status OK  and name mysql-hdfs
        
      • 开始job

          sqoop:000> start job -n mysql-hdfs
        

        即可以完成MySQL数据导入HDFS。

      • 由于airline2008表过大,导出时间略长,上次结果忘记保存了,这里使用本地之前保存的Demo表(Department)做演示,结果如下:

          sqoop:000> start job -n mysql-hdfs
          Submission details
          Job Name: mysql-hdfs
          Server URL: http://localhost:12000/sqoop/
          Created by: hadoop
          Creation date: 2018-01-23 18:09:07 CST
          Lastly updated by: hadoop
          External ID: job_local1619449777_0003
                  http://localhost:8080/
          Source Connector schema: Schema{name= hive . departments ,columns=[
                  Text{name=dept_no,nullable=true,type=TEXT,charSize=null},
                  Text{name=dept_name,nullable=true,type=TEXT,charSize=null}]}
          2018-01-23 18:09:07 CST: BOOTING  - Progress is not available
          
          
          $ hdfs dfs -ls /hadoop/sqoop
          Found 10 items
          -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/0d38a9bd-3740-4c45-b73f-d34b70679596.txt
          -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/1b1db902-1595-4756-b552-c146a49600ea.txt
          -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/1c847041-46fa-420d-a8fa-feb74e82a7f6.txt
          -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/1fdb7a44-23fd-4cf0-9b44-f7293e72fdc8.txt
          -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/5862a9bb-4b07-46d4-9dfa-f588b3e20d19.txt
          -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/58e7cde4-e129-4fee-a08b-48a00e026e5d.txt
          -rw-r--r--   3 hadoop supergroup        189 2018-01-23 18:09 /hadoop/sqoop/8cd3761a-6070-474f-be08-e6c2609bc6ff.txt
          -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/9eae0db2-6171-4856-9182-e4137dfd6911.txt
          -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/cec1349f-2e18-4227-98dd-4f50ef6abd2d.txt
          -rw-r--r--   3 hadoop supergroup          0 2018-01-23 18:09 /hadoop/sqoop/dd7e8b46-2406-436d-8e47-3f39f643a4e3.txt
          
          $ hdfs dfs -cat /hadoop/sqoop/8cd3761a-6070-474f-be08-e6c2609bc6ff.txt
          'd001','Marketing'
          'd002','Finance'
          'd003','Human Resources'
          'd004','Production'
          'd005','Development'
          'd006','Quality Management'
          'd007','Sales'
          'd008','Research'
          'd009','Customer Service'
        
    • 注意事项

      • create link时,hdfs端口要和hadoop的core-site里设置的一致

      • core-site.xml中,可能需要添加如下配置

          
            hadoop.proxyuser.root.hosts
            *
          
          
            hadoop.proxyuser.root.groups
            *
          
        
      • 记得开启verbose,方便在出错时及时查错

        set option --name verbose --value true
        

你可能感兴趣的:(Sqoop2的操作小记)