If you run into problems, the official user guide is usually the quickest way to resolve them: http://sqoop.apache.org/
I. Sqoop1: installing and configuring sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
1. Download and extract the tarball; a sketch of the commands follows.
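The mirror URL and install directory below are assumptions; substitute your own download source and path:
wget https://archive.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /home/hadoop/apps/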
2. Edit the configuration file:
cd $SQOOP_HOME/conf
mv sqoop-env-template.sh sqoop-env.sh
Open sqoop-env.sh and edit the following lines:
export HADOOP_COMMON_HOME=/home/hadoop/apps/hadoop-2.6.1/
export HADOOP_MAPRED_HOME=/home/hadoop/apps/hadoop-2.6.1/
export HIVE_HOME=/home/hadoop/apps/hive-1.2.1
3. Configure the environment variables:
These must be set, otherwise sqoop commands will fail:
vi /etc/profile
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
source /etc/profile
4. Download the MySQL JDBC driver (mysql-connector-java):
wget http://central.maven.org/maven2/mysql/mysql-connector-java/8.0.11/mysql-connector-java-8.0.11.jar
Move mysql-connector-java-8.0.11.jar into Sqoop's lib directory:
mv mysql-connector-java-8.0.11.jar $SQOOP_HOME/lib
5. Verify Sqoop
The following command prints the Sqoop version and confirms the installation:
sqoop version
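As a quick smoke test of the MySQL driver, listing databases works well; the host and credentials below are placeholders, so replace them with real values:
sqoop list-databases \
  --connect jdbc:mysql://192.168.101.12:3306/ \
  --username root \
  --password 123456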
II. Sqoop2: installing and configuring sqoop-1.99.7-bin-hadoop200.tar.gz
1. Download and extract the tarball; again, a sketch of the commands follows.
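The mirror URL and target directory are assumptions; adjust them to your environment:
wget https://archive.apache.org/dist/sqoop/1.99.7/sqoop-1.99.7-bin-hadoop200.tar.gz
tar -zxvf sqoop-1.99.7-bin-hadoop200.tar.gz -C /home/hadoop/sqoop/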
2. Create two directories that the configuration below relies on:
mkdir /home/hadoop/sqoop/sqoop-1.99.7-bin-hadoop200/extra
mkdir /home/hadoop/sqoop/sqoop-1.99.7-bin-hadoop200/logs
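The extra directory becomes SQOOP_SERVER_EXTRA_LIB in the next step; it is where the server picks up additional jars such as JDBC drivers. Because the examples later use the generic-jdbc-connector against Oracle, the Oracle driver has to be copied there; the jar name below is only an example, use the one matching your database version:
cp ojdbc6.jar /home/hadoop/sqoop/sqoop-1.99.7-bin-hadoop200/extra/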
3. Configure the environment variables:
vi /etc/profile
export SQOOP_HOME=/home/hadoop/sqoop/sqoop-1.99.7-bin-hadoop200
export PATH=$PATH:$SQOOP_HOME/bin
export SQOOP_SERVER_EXTRA_LIB=$SQOOP_HOME/extra
export CATALINA_BASE=$SQOOP_HOME/server
export LOGDIR=$SQOOP_HOME/logs/
source /etc/profile
4. Edit the Sqoop configuration file:
cd /home/hadoop/sqoop/sqoop-1.99.7-bin-hadoop200/conf
vi sqoop.properties
Point the Hadoop configuration directory at your installation:
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/usr/local/hadoop/hadoop-2.7.3/etc/hadoop
In the conf directory, add a catalina.properties file listing the jar paths of the local Hadoop installation in common.loader, as shown below:
common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/common/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/common/lib/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/hdfs/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/hdfs/lib/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/mapreduce/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/mapreduce/lib/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/tools/lib/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/yarn/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/yarn/lib/*.jar,/usr/local/hadoop/hadoop-2.7.3/share/hadoop/httpfs/tomcat/lib/*.jar
Note: replace every occurrence of /usr/local/hadoop/hadoop-2.7.3 above with your actual Hadoop installation path; a regex replace in Notepad++ makes this quick.
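Before starting the server it is worth validating the configuration. Sqoop2 ships a verification tool for this; assuming $SQOOP_HOME/bin is already on PATH:
sqoop2-tool verify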
5. Sqoop2 requires a change to $HADOOP_HOME/etc/hadoop/core-site.xml, otherwise start job fails with the error below.
This is one of Sqoop2's more frustrating quirks: the error message has little to do with the actual cause.
Exception: org.apache.sqoop.common.SqoopException Message: GENERIC_HDFS_CONNECTOR_0007:Invalid input/output directory - Unexpected exception
Append the following to core-site.xml (it can be copied in directly):
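Per the Sqoop2 installation guide, the required addition is the Hadoop proxyuser configuration for the user that runs the Sqoop server; hadoop is assumed here, matching the /home/hadoop paths used throughout:
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>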
6. Start the server and the client:
sqoop.sh server start
sqoop.sh client
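If the client runs on a different machine from the server, point it at the server first; port 12000 and the sqoop webapp name are the defaults, and the host below is the one used elsewhere in this walkthrough:
sqoop:000> set server --host 192.168.101.11 --port 12000 --webapp sqoop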
7. Using Sqoop2:
Once the client is connected to the server, show connector lists the available connectors.
1. Create links to each source or target store:
1) Create the HDFS link
sqoop:000> create link --connector hdfs-connector
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: hdfs-link ## link name; anything will do
HDFS cluster
URI: hdfs://192.168.101.11:9000 ## HDFS address; see fs.defaultFS in Hadoop's core-site.xml
Conf directory: /home/hadoop/hadoop-2.6.5/etc/hadoop ## Hadoop configuration directory
Additional configs::
There are currently 0 values in the map:
entry# ## just press Enter
New link was successfully created with validation status OK and name hdfs-link
sqoop:000> show link
+-----------+----------------+---------+
| Name | Connector Name | Enabled |
+-----------+----------------+---------+
| hdfs-link | hdfs-connector | true |
+-----------+----------------+---------+
2) Create the Oracle link
sqoop:000> create link --connector generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: oracle-link ## link name; anything will do
Database connection
Driver class: oracle.jdbc.driver.OracleDriver ## Oracle JDBC driver class
Connection String: jdbc:oracle:thin:@192.168.101.9:1521:orcl ## Oracle connection string
Username: scott ## Oracle username
Password: ***** # password
Fetch Size: # press Enter to accept the defaults from here on
Connection Properties:
There are currently 0 values in the map:
entry#
SQL Dialect
Identifier enclose: ## SQL identifier delimiter; enter a single space here to avoid errors
New link was successfully created with validation status OK and name oracle-link
sqoop:000> show link
+-------------+------------------------+---------+
| Name | Connector Name | Enabled |
+-------------+------------------------+---------+
| oracle-link | generic-jdbc-connector | true |
| hdfs-link | hdfs-connector | true |
+-------------+------------------------+---------+
3) Create a job that imports data from Oracle into HDFS
sqoop:000> create job -f oracle-link -t hdfs-link
Creating job for links with from name oracle-link and to name hdfs-link
Please fill following values to create new job object
Name: oracle2hdfs # job name
Database source
Schema name: scott # Oracle schema, i.e. the username
Table name: emp # table to import
SQL statement: ## custom SQL; leave empty to import the whole table
Column names:
There are currently 0 values in the list:
element#
Partition column: empno ## column used to split the work across mappers; a primary key or timestamp column works well
Partition column nullable:
Boundary query:
Incremental read
Check column:
Last value:
Target configuration
Override null value:
Null value:
File format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
2 : PARQUET_FILE
Choose: 0 # file format to write to HDFS; TEXT_FILE is fine
Compression codec:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
8 : CUSTOM
Choose: 0 # no compression by default
Custom codec:
Output directory: /data ## target directory in HDFS; note that for a full import it must be empty
Append mode: ## append mode; the default is a full import
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name oracle2hdfs
sqoop:000>
4) Create a job that exports data from HDFS into Oracle
sqoop:000> create job -f hdfs-link -t oracle-link
Creating job for links with from name hdfs-link and to name oracle-link
Please fill following values to create new job object
Name: hdfs2oracle
Input configuration
Input directory: /data ## source directory in HDFS
Override null value:
Null value:
Incremental import
Incremental type:
0 : NONE
1 : NEW_FILES
Choose: 0 # 0 (NONE) by default; presumably works together with the Incremental import setting above to control incremental loading
Last imported date:
Database target
Schema name: scott # Oracle schema/user
Table name: emp2 # target Oracle table; its structure must be created in advance
Column names:
There are currently 0 values in the list:
element#
Staging table:
Clear stage table:
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name hdfs2oracle
sqoop:000>
sqoop:000> show job
+----+-------------+--------------------------------------+--------------------------------------+---------+
| Id | Name | From Connector | To Connector | Enabled |
+----+-------------+--------------------------------------+--------------------------------------+---------+
| 8 | oracle2hdfs | oracle-link (generic-jdbc-connector) | hdfs-link (hdfs-connector) | true |
| 9 | hdfs2oracle | hdfs-link (hdfs-connector) | oracle-link (generic-jdbc-connector) | true |
+----+-------------+--------------------------------------+--------------------------------------+---------+
sqoop:000>
5) Run the job
start job -name oracle2hdfs
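The shell can also report on or stop a running job, using the same -name flag as start job:
sqoop:000> status job -name oracle2hdfs
sqoop:000> stop job -name oracle2hdfs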
6) Update a job
update job --name hdfs-to-mysql-001 ## re-opens the interactive prompts for the named job; the name here comes from a separate HDFS-to-MySQL test
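Other maintenance commands in the 1.99.7 shell follow the same pattern; the names below refer to the objects created earlier:
sqoop:000> update link --name oracle-link
sqoop:000> delete job --name oracle2hdfs
sqoop:000> delete link --name hdfs-link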
During testing, exporting a table of a little over two million rows from HDFS to MySQL this way only loaded about half of the data; the other half went missing. Sqoop2 does not seem as dependable as Sqoop1.
Note: for other usage, see the documentation on the official site.