Sqoop: importing MySQL data into Hive

Note: Sqoop and Hive must be deployed on the same node; otherwise Sqoop will throw an error when importing data.
An example of the resulting error:
Database Class Loader started - derby.database.classpath=''
19/05/28 14:37:16 ERROR bonecp.BoneCP: Unable to start/stop JMX
java.security.AccessControlException: access denied ("javax.management.MBeanTrustPermission" "register")
...


Check the Sqoop version (here it is Sqoop 1.4.7-cdh6.2.0):
# sqoop version
Warning: /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/05/27 15:19:58 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7-cdh6.2.0
Sqoop 1.4.7-cdh6.2.0
git commit id 
Compiled by jenkins on Thu Mar 14 00:00:45 PDT 2019

Install the JDBC driver for the source database:
Place the driver jar under: /opt/cloudera/parcels/CDH/lib/sqoop/lib
# ll | grep -i sql | grep -i jar
lrwxrwxrwx 1 root root     33 3月  14 15:50 hsqldb-1.8.0.10.jar -> ../../../jars/hsqldb-1.8.0.10.jar
-rw-r--r-- 1 root root 990927 4月  11 09:09 mysql-connector-java.jar
-rw-r--r-- 1 root root 660074 5月  24 18:23 sqljdbc42.jar
-rw-r--r-- 1 root root  19230 5月  24 17:51 sqoop-sqlserver-1.0.jar
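
If the connector jar is not already there, copy it in manually. A minimal sketch, assuming the MySQL driver has been downloaded to /tmp (the file name and version are illustrative):

# copy the MySQL JDBC driver into Sqoop's lib directory on this node
cp /tmp/mysql-connector-java-5.1.47.jar /opt/cloudera/parcels/CDH/lib/sqoop/lib/mysql-connector-java.jar
# make sure the users that run sqoop can read it
chmod 644 /opt/cloudera/parcels/CDH/lib/sqoop/lib/mysql-connector-java.jar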


-- Import MySQL data into Hive with Sqoop:
-- List the databases on the MySQL server:
# sqoop list-databases --connect jdbc:mysql://172.16.1.87:3310/yjp_office --username 'root' --password 'yijiupi'
-- List the tables in the database:
# sqoop list-tables --connect jdbc:mysql://172.16.1.87:3310/yjp_office --username 'root' --password 'yijiupi'
-- Run a SELECT statement against MySQL:
sqoop eval --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" --username root --password yijiupi --query "select * from hr_employees limit 10"

sqoop import-all-tables --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" --username root --password yijiupi --hive-import --hive-database yjp_lz_crm_office --m 2

sqoop import-all-tables --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" --username root --password yijiupi --hive-import --hive-database yjp_lz_crm_czbank --m 2

Notes:
If some of the tables have no primary key, the parallel import will fail. Add the --autoreset-to-one-mapper option so that such tables fall back to a single mapper; see the example below.
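
A minimal sketch of the whole-database import with that option added (connection details reused from the commands above):

sqoop import-all-tables \
  --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" \
  --username root --password yijiupi \
  --hive-import \
  --hive-database yjp_lz_crm_office \
  --autoreset-to-one-mapper \
  -m 2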


Error message:
19/05/27 14:00:50 ERROR tool.ImportAllTablesTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://nameservice1/user/hdfs/hr_bizuservisitbasicdata already exists
-- Fix:
[root@datanode3 ~]# hadoop fs -rm -r hr_bizuservisitbasicdata
19/05/27 14:01:27 INFO fs.TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/root/hr_bizuservisitbasicdata' to trash at: hdfs://nameservice1/user/root/.Trash/Current/user/root/hr_bizuservisitbasicdata
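
If several table directories were left behind by a failed run, they can be cleaned up in one pass before re-running the import. A minimal sketch, assuming the directories sit under the importing user's HDFS home directory (the table list is illustrative):

# remove stale per-table output directories left by a failed import-all-tables run
for t in hr_bizuservisitbasicdata hr_employees; do
    hadoop fs -rm -r "$t"
done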

Error message:
19/05/27 14:41:14 ERROR tool.ImportAllTablesTool: Encountered IOException running import job: java.io.IOException: Generating splits for a textual index column allowed only in case of "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" property passed as a parameter

-- Cause:

When Sqoop imports a MySQL table without an explicit number of mappers, it imports in parallel.
During a parallel import Sqoop chooses a splitter based on the type of the split column (see the getSplitter method of DataDrivenDBInputFormat).
TextSplitter converts the string to BigDecimal and back to string, and this round trip can produce garbled text and special characters.
 
Reference:
https://blog.csdn.net/MuQianHuanHuoZhe/article/details/80585672 
 
-- Solutions:
Set the number of mappers to 1, or pass -Dorg.apache.sqoop.splitter.allow_text_splitter=true as the error message suggests.
Do not use a string-typed primary key as the split column.
For small tables specify -m 1: a single mapper both saves resources and avoids the problem of splitting on a string primary key.
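
Both workarounds as minimal sketches (the split column name in the second one is an assumption):

# Option 1: explicitly allow splitting on a text column (the property named in the error message)
sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" \
  --username root --password yijiupi \
  --hive-import --hive-database yjp_lz_crm_office -m 2

# Option 2: for a single table, split on a numeric column instead of the string primary key
sqoop import \
  --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" \
  --username root --password yijiupi \
  --table hr_employees --split-by id \
  --hive-import --hive-database yjp_lz_crm_office -m 2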

Import a single table into Hive:

sqoop import  \
  --connect jdbc:mysql://127.0.0.1:3306/database_name \
  --username root \
  --password 123456 \
  --table table_name \
  --outdir ${HOME}/.sqoop/java \
  --delete-target-dir \
  --hive-import \
  --hive-overwrite \
  --null-string '\\N' \
  --null-non-string '\\N' \
  -m 8 \
  --direct

Import an entire database into Hive:
sqoop import-all-tables \
  --connect jdbc:mysql://127.0.0.1:3306/database_name \
  --username root \
  --password 123456 \
  --outdir ${HOME}/.sqoop/java \
  --hive-import \
  --hive-overwrite \
  --null-string '\\N' \
  --null-non-string '\\N'  \
  -m 8 \
  --direct

--outdir: directory for the generated Java class files
--delete-target-dir: delete the HDFS target directory if it already exists
--hive-import: import straight into Hive
--hive-overwrite: overwrite the whole table (the table is created if it does not exist)
--null-string: write NULL values in string columns as NULL in Hive
--null-non-string: write NULL values in non-string columns as NULL in Hive
-m 8: 8 parallel map tasks
--direct: direct mode, uses mysqldump to speed things up (a local test with about 130,000 rows / 74 MB was roughly 10% faster)

-- Merging sharded MySQL databases and tables into one Hive table with Sqoop:
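
No command was recorded for this step. A minimal sketch of one approach, assuming the shards share the same table layout and that repeated --hive-import runs without --hive-overwrite append to the same Hive table (the shard hosts, database names and table name are illustrative):

# import the same logical table from every shard into one Hive table
for shard_db in "jdbc:mysql://172.16.4.151:3306/yjp_office_0" \
                "jdbc:mysql://172.16.4.152:3306/yjp_office_1"; do
    sqoop import \
      --connect "${shard_db}" \
      --username root --password yijiupi \
      --table hr_employees \
      --delete-target-dir \
      --hive-import \
      --hive-database yjp_lz_crm_office \
      --hive-table hr_employees \
      -m 1
done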

-- File format of the data written to HDFS:
--as-avrodatafile
--as-parquetfile
--as-sequencefile
--as-textfile (default)
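
For example, a minimal sketch of a plain HDFS import in Parquet format (connection details reused from above; the target directory is illustrative):

sqoop import \
  --connect "jdbc:mysql://172.16.1.87:3310/yjp_office" \
  --username root --password yijiupi \
  --table hr_employees \
  --as-parquetfile \
  --target-dir /tmp/yjp_office/hr_employees_parquet \
  -m 1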

 


-- Script used in the project:
# cat mysql_sqoop.sh 
#!/bin/bash

#Set the RDBMS topic (database) name as the script's first parameter
#Set the RDBMS table name(s) as the following parameters


echo "<----------------Import topic name is [$1]---------------->"
echo "<----------------Import table name is [$2]---------------->"

if [ $# -lt 2 ]
then
   echo "Please pass the topic (database) name as the first param!"
   echo "Please pass at least one table name after it!"
   exit 1;
elif [ $# -gt 2 ]
then
   echo "More than 2 params given; the topic is [$1] <--> sqoop table names start with [$2]"
fi

#Set the topic name
db_name="$1"

#Set the RDBMS connection params
#mysql_connstr="jdbc:mysql://10.19.196.129:3312/${db_name}?tinyInt1isBit=false"
mysql_connstr="jdbc:mysql://172.16.4.150:3306/${db_name}?tinyInt1isBit=false"

username="root"
password="yjp@2019"

echo "${mysql_connstr}"



#Set the source table in RDBMS
table_name=""

#Loop over the table names passed in and run a sqoop extraction job for each

for ((i=2;i<=$#;i++))
do
table_name=${!i}

#Temporary HDFS staging directory for each pull
dir="/tmp/${db_name}/${table_name}"
target_dir=$(echo $dir | tr '[A-Z]' '[a-z]') 


#    hadoop fs -mkdir -p  $target_dir
#    hadoop fs -chmod 777 $target_dir

#---------------------------------------------------------
#Import init data from RDBMS into   HDFS
job_name="import_job_${db_name}_${table_name}"
hive_table="${db_name}.${table_name}"

echo "${hive_table}"

#Check whether the sqoop output directory already exists; delete it if it does
hadoop fs -test -e $target_dir
if [ $? -eq 0 ] 
then 
  echo "<-------------[$target_dir]  is exist, Now delete it !------------->"
  hadoop fs -rm -r  $target_dir
else 
  echo "<-------------[$target_dir]  is not exist, No need delete it !----------->"
fi 

#A saved sqoop job would have to be created on the first run; delete any existing job with this name first
sqoop job --delete ${job_name} 


sqoop   import   \
--connect  ${mysql_connstr}    \
--username ${username}  \
--password ${password}  \
-m 1    \
--table ${table_name}      \
--null-string '\\N'        \
--null-non-string '\\N'     \
--input-null-string '\\N'    \
--input-null-non-string '\\N' \
--hive-drop-import-delims      \
--hive-import                   \
--hive-overwrite                 \
--target-dir  $target_dir         \
--create-hive-table                \
--hive-table ${hive_table} 
#echo "<----------------Now Execute Job name is [$job_name]---------------->"

#sqoop job --exec ${job_name} 

 echo "<----------------sqoop job: ${job_name} complete!---------------->" 

 
# Reset table_name after this iteration finishes
table_name=""
dir=""
done
# End of loop


exit 0

# cat sqoop_track.sh 
sudo -u hive /root/migration/mysql_sqoop.sh track dim_track_app_group dim_track_app_info

# cat run_track.sh 
nohup /root/migration/sqoop_track.sh 1>>/root/migration/track.out 2>>/root/migration/track.err &

-- Migrate an entire MySQL database into Hive:
sudo -u hive sqoop import-all-tables --connect "jdbc:mysql://172.16.4.150:3306/track?tinyInt1isBit=false" --username root --password yjp@2019 --hive-import --hive-overwrite --hive-database yjp_track --m 1 --null-string '\\N' --null-non-string '\\N' --input-null-string '\\N' --input-null-non-string '\\N' --hive-drop-import-delims
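
Once the job finishes, a quick sanity check is to list the imported tables in the target Hive database; a minimal sketch (any Hive client works, the hive CLI call below is just one option):

sudo -u hive hive -e "USE yjp_track; SHOW TABLES;"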

Reference:

https://www.jianshu.com/p/bb78ccd0252f
