Sqoop: SQL to Hadoop
Scenario: the data lives in an RDBMS. How do we analyze it with Hive or Hadoop?
1) RDBMS ==> Hadoop
2) Hadoop ==> RDBMS
Doing this by hand means writing MapReduce jobs with custom InputFormat/OutputFormat; that is exactly the plumbing Sqoop handles for you.
Sqoop: a bridge between RDBMS and Hadoop
Sqoop 1.x: 1.4.7
Under the hood it runs MapReduce, and the jobs are map-only (no reduce phase).
ruozedata.person ===> HDFS
The database side is accessed over JDBC.
Sqoop 2.x: 1.99.7
RDBMS <==> Hadoop; import/export are named from Hadoop's point of view:
Import: RDBMS ==> Hadoop
Export: Hadoop ==> RDBMS
wget http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.7.0.tar.gz
tar -zxvf sqoop-1.4.6-cdh5.7.0.tar.gz -C ~/app
Add Sqoop to the environment (e.g. in ~/.bash_profile):
export SQOOP_HOME=/home/hadoop/app/sqoop-1.4.6-cdh5.7.0
export PATH=$SQOOP_HOME/bin:$PATH
Configure $SQOOP_HOME/conf/sqoop-env.sh (copy it from sqoop-env-template.sh if it does not exist):
export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0
export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0
export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.7.0
Copy the MySQL JDBC driver jar into $SQOOP_HOME/lib.
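For example (the connector version here is only an illustration):
cp ~/software/mysql-connector-java-5.1.27-bin.jar $SQOOP_HOME/lib/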
sqoop list-tables \
--connect jdbc:mysql://localhost:3306/ruozedata_basic03 \
--username root --password root
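sqoop list-databases works the same way and is a quick check that the connection and driver are set up correctly:
sqoop list-databases \
--connect jdbc:mysql://localhost:3306 \
--username root --password root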
RDBMS ==> HDFS
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--table emp -m 2 \
--mapreduce-job-name FromMySQLToHDFS \
--delete-target-dir \
--columns "EMPNO,ENAME,JOB,SAL,COMM" \
--target-dir EMP_COLUMN_WHERE \
--fields-terminated-by '\t' \
--null-string '' --null-non-string '0' \
--where 'SAL>2000'
Defaults:
-m 4: four map tasks (the table needs a primary key, or supply --split-by / -m 1)
emp: the target directory defaults to the table name under the user's HDFS home directory
emp.jar: Sqoop code-generates a Java class for the table (emp.java) and compiles it into emp.jar to run the MapReduce job
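A minimal sketch relying on those defaults (same emp table as above):
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--table emp
The output lands under the user's HDFS home directory in ./emp, split across up to 4 part-m-* files.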
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
-m 2 \
--mapreduce-job-name FromMySQLToHDFS \
--delete-target-dir \
--target-dir EMP_COLUMN_QUERY \
--fields-terminated-by '\t' \
--null-string '' --null-non-string '0' \
--query "SELECT * FROM emp WHERE EMPNO>=7566 AND \$CONDITIONS" \
--split-by 'EMPNO'
sqoop eval \
--connect jdbc:mysql://localhost:3306/imooc_project \
--username root --password root \
--query 'select * from day_video_access_topn_stat'
Options can also be kept in a file, e.g. emp.opt (one option or value per line):
import
--connect
jdbc:mysql://localhost:3306/imooc_project
--username
root
--password
root
--table
day_video_access_topn_stat
--delete-target-dir
sqoop --options-file emp.opt
sqoop export \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
-m 2 \
--mapreduce-job-name FromHDFSToMySQL \
--table emp_demo \
--export-dir /user/hadoop/emp
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--table emp -m 2 \
--mapreduce-job-name FromMySQLToHive \
--delete-target-dir \
--hive-database ruozedata \
--hive-table ruozedata_emp_partition \
--hive-import \
--hive-partition-key 'pt' \
--hive-partition-value '2018-08-08' \
--fields-terminated-by '\t' --hive-overwrite
--create-hive-table is not recommended; create the Hive table yourself first, then use Sqoop to import into it.
create table ruozedata_emp_partition
(empno int, ename string, job string, mgr int, hiredate string, salary double, comm double, deptno int)
partitioned by (pt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Hive to MySQL
Hive's data lives on HDFS, so exporting Hive to MySQL is just exporting HDFS to MySQL (HDFS2MySQL == Hive2MySQL); see the sketch below.
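A minimal sketch, pointing an ordinary sqoop export at the table's warehouse directory; the warehouse path and the tab-delimited Hive table ruozedata_emp are assumptions, and emp_demo is the MySQL table from the export example above:
sqoop export \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
-m 2 \
--table emp_demo \
--export-dir /user/hive/warehouse/ruozedata.db/ruozedata_emp \
--input-fields-terminated-by '\t'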
sqoop job --create ruozejob -- \
import \
--connect jdbc:mysql://localhost:3306/imooc_project \
--username root --password root \
--table day_video_access_topn_stat -m 2 \
--mapreduce-job-name FromMySQLToHDFS \
--delete-target-dir
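The saved job can then be listed, inspected and executed by name:
sqoop job --list
sqoop job --show ruozejob
sqoop job --exec ruozejob
sqoop job --delete ruozejob removes it when it is no longer needed.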
For scheduled loads, drive the saved sqoop job (or a sqoop --options-file invocation) from crontab.
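For example, a crontab entry that runs ruozejob every day at 02:00 (the install path and log file are only illustrative):
0 2 * * * /home/hadoop/app/sqoop-1.4.6-cdh5.7.0/bin/sqoop job --exec ruozejob >> /home/hadoop/logs/ruozejob.log 2>&1
Note that sqoop job prompts for the database password at execution time unless a --password-file is supplied or sqoop.metastore.client.record.password is enabled in sqoop-site.xml.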
Requirement: find the top 3 most popular products in each area.
1) MySQL: city_info (static dimension table)
2) MySQL: product_info (static dimension table)
DROP TABLE product_info;
CREATE TABLE `product_info` (
`product_id` int(11) DEFAULT NULL,
`product_name` varchar(255) DEFAULT NULL,
`extend_info` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
3) Hive: user_click, the user behavior log, partitioned by date
user_id int
session_id string
action_time string
city_id int
product_id int
Implementation:
1) import city_info into Hive
2) import product_info into Hive
3) join the three tables and take the TOP 3 per area (grouped by area); the result table is partitioned by day — see the sketch below
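A minimal HiveQL sketch of step 3, assuming the Hive tables are named city_info (city_id, city_name, area), product_info (product_id, product_name, extend_info) and user_click as above, with user_click partitioned by a day string column (these names are assumptions):
SELECT product_id, product_name, area, click_count, `rank`, day
FROM (
  SELECT t.product_id,
         p.product_name,
         t.area,
         t.click_count,
         row_number() OVER (PARTITION BY t.area ORDER BY t.click_count DESC) AS `rank`,
         t.day
  FROM (
        -- clicks per (area, product) for the day
        SELECT u.product_id, c.area, u.day, count(1) AS click_count
        FROM user_click u
        JOIN city_info c ON u.city_id = c.city_id
        WHERE u.day = '20180808'
        GROUP BY u.product_id, c.area, u.day
       ) t
  JOIN product_info p ON t.product_id = p.product_id
) s
WHERE `rank` <= 3;
The output of this query is what gets written into the day-partitioned result table and later exported to MySQL.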
The final result fields:
product_id    product ID
product_name  product name
area          area
click_count   number of clicks / visits
rank          rank within the area
day           date
The statistics are then exported to MySQL, e.g. for day=20180808.
Hive to MySQL must be idempotent: before re-exporting a day, delete that day's rows first (DELETE FROM xxx WHERE day='20180808'); otherwise a re-run appends the same records again and rows end up duplicated (e.g. id=10, id=11).
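A minimal sketch of that two-step flow, assuming the MySQL result table is named area_product_click_count_top3 (a hypothetical name), that its columns match the exported fields in order, and that the day's data sits in a tab-delimited HDFS directory (the path below is an assumption):
# 1) clear the day's rows so a re-run does not duplicate data
sqoop eval \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--query "DELETE FROM area_product_click_count_top3 WHERE day='20180808'"
# 2) export the day's data from HDFS into MySQL
sqoop export \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root --password root \
--table area_product_click_count_top3 \
--export-dir /user/hive/warehouse/ruozedata.db/area_product_click_count_top3/day=20180808 \
--input-fields-terminated-by '\t' \
-m 2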