Note: when deploying Sqoop and Hive, install them on the same node; otherwise importing data with Sqoop will fail.
Example of the error:
Database Class Loader started - derby.database.classpath=''
19/05/28 14:37:16 ERROR bonecp.BoneCP: Unable to start/stop JMX
java.security.AccessControlException: access denied ("javax.management.MBeanTrustPermission" "register")
...
Check the Sqoop version (here it is Sqoop 1.4.7-cdh6.2.0):
# sqoop version
Warning: /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/05/27 15:19:58 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.2.0
Sqoop 1.4.7-cdh6.2.0
git commit id
Compiled by jenkins on Thu Mar 14 00:00:45 PDT 2019
Install the JDBC driver for the target database:
Place it in the directory /opt/cloudera/parcels/CDH/lib/sqoop/lib:
# ll | grep -i sql | grep -i jar
lrwxrwxrwx 1 root root 33 3月 14 15:50 hsqldb-1.8.0.10.jar -> ../../../jars/hsqldb-1.8.0.10.jar
-rw-r--r-- 1 root root 990927 4月 11 09:09 mysql-connector-java.jar
-rw-r--r-- 1 root root 660074 5月 24 18:23 sqljdbc42.jar
-rw-r--r-- 1 root root 19230 5月 24 17:51 sqoop-sqlserver-1.0.jar
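A minimal sketch of putting the MySQL driver in place (the source path /tmp/mysql-connector-java.jar is an assumption; substitute wherever the downloaded jar actually sits):
# cp /tmp/mysql-connector-java.jar /opt/cloudera/parcels/CDH/lib/sqoop/lib/
# chmod 644 /opt/cloudera/parcels/CDH/lib/sqoop/lib/mysql-connector-java.jar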
-- Import MySQL data into Hive with Sqoop:
-- List the databases in MySQL:
# sqoop list-databases --connect jdbc:mysql://172.16.1.87:3310/yjp_office --username 'root' --password 'yijiupi'
-- List the tables:
# sqoop list-tables --connect jdbc:mysql://172.16.1.87:3310/yjp_office --username 'root' --password 'yijiupi'
-- Run a SELECT statement:
sqoop eval --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" --username root --password yijiupi --query "select * from hr_employees limit 10"
sqoop import-all-tables --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" --username root --password yijiupi --hive-import --hive-database yjp_lz_crm_office --m 2
sqoop import-all-tables --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" --username root --password yijiupi --hive-import --hive-database yjp_lz_crm_czbank --m 2
Notes:
If some of the tables have no primary key, the parallel import breaks; add the --autoreset-to-one-mapper option, as in the sketch below.
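A sketch based on the import-all-tables command shown above, with the fallback option added so tables without a primary key drop to a single mapper instead of failing:
# sqoop import-all-tables --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" --username root --password yijiupi --hive-import --hive-database yjp_lz_crm_office --m 2 --autoreset-to-one-mapper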
Error message:
19/05/27 14:00:50 ERROR tool.ImportAllTablesTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://nameservice1/user/hdfs/hr_bizuservisitbasicdata already exists
-- Fix:
[root@datanode3 ~]# hadoop fs -rm -r hr_bizuservisitbasicdata
19/05/27 14:01:27 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/root/hr_bizuservisitbasicdata' to trash at: hdfs://nameservice1/user/root/.Trash/Current/user/root/hr_bizuservisitbasicdata
Error message:
19/05/27 14:41:14 ERROR tool.ImportAllTablesTool: Encountered IOException running import job: java.io.IOException: Generating splits for a textual index column allowed only in case of "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter
-- Cause:
When Sqoop imports a MySQL table without an explicit number of mappers, it imports in parallel.
For a parallel import, Sqoop picks a splitter based on the type of the table's primary key; see the getSplitter method of DataDrivenDBInputFormat.
TextSplitter first converts the string to a BigDecimal and then back to a string, and this round trip can produce garbled text and special characters.
Reference:
https://blog.csdn.net/MuQianHuanHuoZhe/article/details/80585672
-- Solutions:
Set --m to 1.
Do not split the mappers on a String-typed primary key.
When importing small tables, specify -m 1; a single mapper both saves resources and avoids the string primary-key splitting problem. If a text split column is unavoidable, see the sketch below.
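If a text split column really cannot be avoided, the property named in the error message can be passed as a generic Hadoop argument (it has to come directly after the tool name, before the tool-specific options); a sketch reusing the connection details from above:
# sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" --username root --password yijiupi --hive-import --hive-database yjp_lz_crm_office --m 2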
Import a single table into Hive:
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/database_name \
--username root \
--password 123456 \
--table table_name \
--outdir ${HOME}/.sqoop/java \
--delete-target-dir \
--hive-import \
--hive-overwrite \
--null-string '\\N' \
--null-non-string '\\N' \
-m 8 \
--direct
Import a whole database into Hive:
sqoop import-all-tables \
--connect jdbc:mysql://127.0.0.1:3306/database_name \
--username root \
--password 123456 \
--outdir ${HOME}/.sqoop/java \
--hive-import \
--hive-overwrite \
--null-string '\\N' \
--null-non-string '\\N' \
-m 8 \
--direct
--outdir: directory for the generated Java files
--delete-target-dir: delete the HDFS target directory if it already exists
--hive-import: import straight into Hive
--hive-overwrite: overwrite the whole table (the table is created if it does not exist)
--null-string: a NULL in a string column is stored as NULL in Hive as well
--null-non-string: a NULL in a non-string column is stored as NULL in Hive as well
-m 8: 8 parallel mappers
--direct: direct mode, uses mysqldump to speed things up (a local test with 130,000 rows, 74 MB, was about 10% faster)
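Because --direct shells out to mysqldump, it is worth checking that the binary is present on every node that runs the map tasks before relying on it (a minimal sketch):
# command -v mysqldump || echo "mysqldump not found: --direct imports will fail on this node"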
-- Merge sharded MySQL databases and tables with Sqoop:
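One possible approach, sketched under the assumption that the shards are databases shard_0 .. shard_3 on the same MySQL instance, each holding a table named orders, and that the merged Hive table is yjp_merged.orders (all of these names are assumptions): import each shard into the same Hive table with --hive-import but without --hive-overwrite, so every run appends to the table.
for i in 0 1 2 3
do
sqoop import \
--connect "jdbc:mysql://172.16.4.150:3306/shard_${i}" \
--username root \
--password yjp@2019 \
--table orders \
--target-dir /tmp/shard_${i}/orders \
--delete-target-dir \
--hive-import \
--hive-table yjp_merged.orders \
--null-string '\\N' \
--null-non-string '\\N' \
-m 1
done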
-- File formats for data stored on HDFS:
--as-avrodatafile
--as-binaryfile
--as-parquetfile
--as-sequencefile
--as-textfile (default)
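For example, a sketch that lands a single table on HDFS as Parquet rather than text (the target directory /tmp/table_name_parquet is an assumption):
sqoop import \
--connect jdbc:mysql://127.0.0.1:3306/database_name \
--username root \
--password 123456 \
--table table_name \
--target-dir /tmp/table_name_parquet \
--delete-target-dir \
--as-parquetfile \
-m 1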
-- Script used in the project:
# cat mysql_sqoop.sh
#!/bin/bash
#Set the RDBMS topic (database) name as the first argument
#Set the RDBMS table name(s) as the second and following arguments
echo "<----------------Import topic name is [$1]---------------->"
echo "<----------------Import table name is [$2]---------------->"
if [ $# -lt 2 ]
then
echo "Please Import table name as the First params!"
echo "Please Import topic name as the Second params!"
exit 1;
elif [ $# -gt 2 ]
then
echo "input more then 2 params , Now the topic is [$1] <--> sqoop table name is [$2] "
fi
#Set the topic name
db_name="$1"
#Set the RDBMS connection params
#mysql_connstr="jdbc:mysql://10.19.196.129:3312/${db_name}?tinyInt1isBit=false"
mysql_connstr="jdbc:mysql://172.16.4.150:3306/${db_name}?tinyInt1isBit=false"
username="root"
password="yjp@2019"
echo "${mysql_connstr}"
#Set the source table in RDBMS
table_name=""
#Loop over the table names passed in and run a sqoop import for each
for ((i=2;i<=$#;i++))
do
table_name=${!i}
#Temporary HDFS staging directory for each pull
dir="/tmp/${db_name}/${table_name}"
target_dir=$(echo $dir | tr '[A-Z]' '[a-z]')
# hadoop fs -mkdir -p $target_dir
# hadoop fs -chmod 777 $target_dir
#---------------------------------------------------------
#Import init data from RDBMS into HDFS
job_name="import_job_${db_name}_${table_name}"
hive_table="${db_name}.${table_name}"
echo "${hive_table}"
#Check whether the sqoop output directory exists; delete it if it does
hadoop fs -test -e $target_dir
if [ $? -eq 0 ]
then
echo "<-------------[$target_dir] is exist, Now delete it !------------->"
hadoop fs -rm -r $target_dir
else
echo "<-------------[$target_dir] is not exist, No need delete it !----------->"
fi
#On the first run the distributed sqoop job needs to be created; delete any existing job of the same name first
sqoop job --delete ${job_name}
sqoop import \
--connect ${mysql_connstr} \
--username ${username} \
--password ${password} \
-m 1 \
--table ${table_name} \
--null-string '\\N' \
--null-non-string '\\N' \
--input-null-string '\\N' \
--input-null-non-string '\\N' \
--hive-drop-import-delims \
--hive-import \
--hive-overwrite \
--target-dir $target_dir \
--create-hive-table \
--hive-table ${hive_table}
#echo "<----------------Now Execute Job name is [$job_name]---------------->"
#sqoop job --exec ${job_name}
echo "<----------------sqoop job: ${job_name} complete!---------------->"
# Reset table_name after this iteration finishes
table_name=""
dir=""
done
# End of the loop
exit 0
# cat sqoop_track.sh
sudo -u hive /root/migration/mysql_sqoop.sh track dim_track_app_group dim_track_app_info
# cat run_track.sh
nohup /root/migration/sqoop_track.sh 1>>/root/migration/track.out 2>>/root/migration/track.err &
-- Migrate an entire MySQL database into Hive:
sudo -u hive sqoop import-all-tables --connect "jdbc:mysql://172.16.4.150:3306/track?tinyInt1isBit=false" --username root --password yjp@2019 --hive-import --hive-overwrite --hive-database yjp_track --m 1 --null-string '\\N' --null-non-string '\\N' --input-null-string '\\N' --input-null-non-string '\\N' --hive-drop-import-delims
Reference:
https://www.jianshu.com/p/bb78ccd0252f