Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Introduction: the passage above is quoted from the official documentation. The idea is straightforward: Sqoop pulls data out of a traditional relational database into Hadoop, and can also export it back into the RDBMS. Sqoop is now a top-level Apache project, so it should be reasonably mature.
For usage details, see the official documentation: http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html
The Hadoop I am running is the 0.20.x line, which Sqoop does not support; Sqoop supports the CDH3 build. That turns out not to matter: copying the CDH3 hadoop-core jar into Sqoop's lib directory works around it.
wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-CDH3B4.tar.gz
wget http://archive.cloudera.com/cdh/3/sqoop-1.2.0-CDH3B4.tar.gz
tar -zxvf sqoop-1.2.0-CDH3B4.tar.gz
tar -zxvf hadoop-0.20.2-CDH3B4.tar.gz
# drop the CDH3 hadoop-core jar into Sqoop's lib directory
cp hadoop-0.20.2-CDH3B4/hadoop-core-0.20.2-CDH3B4.jar sqoop-1.2.0-CDH3B4/lib/
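A quick sanity check that the jar swap took effect, sketched below; any Sqoop subcommand that loads the Hadoop classes would do just as well.

# if the CDH3 hadoop-core jar was picked up, this prints build info
# instead of failing with a NoClassDefFoundError
sqoop-1.2.0-CDH3B4/bin/sqoop version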
vi .bash_profile
# add the following:
export SQOOP_HOME=/home/hadoop/sqoop-1.2.0-CDH3B4
export PATH=$PATH:$SQOOP_HOME/bin
export CLASSPATH=$CLASSPATH:$SQOOP_HOME/ojdbc6.jar:$SQOOP_HOME/lib/*
At first I kept getting ClassNotFoundException even though the jars were clearly sitting in lib, which was maddening; eventually I worked out the trick above of forcing the jars onto the CLASSPATH myself.
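Note that the CLASSPATH above points at ojdbc6.jar: the Oracle JDBC driver is not bundled with Sqoop and has to be downloaded from Oracle's site separately. A sketch, where ~/ojdbc6.jar is just a placeholder for wherever you saved the download:

# copy the manually downloaded Oracle driver into the two paths the
# CLASSPATH references, then reload the profile so the variables take effect
cp ~/ojdbc6.jar $SQOOP_HOME/
cp ~/ojdbc6.jar $SQOOP_HOME/lib/
source ~/.bash_profile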
sqoop import --connect jdbc:oracle:thin:@192.168.1.65/wkm --username wkm --password wkm --target-dir /user/hadoop/wkmData2 --table C_PICRECORD_ALL -m 1

Run the command above. The options are:
--connect     JDBC URL of the source database (host plus database/service name)
--username    database user
--password    database password
--target-dir  output location on HDFS; Sqoop creates the directory itself, so we don't need to worry about it
--table       with Oracle the table name must be UPPERCASE, because Sqoop looks it up in the Oracle dictionary tables
-m 1          run with a single mapper; to use more than one, the table must have a primary key (see the parallel sketch just below)
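For reference, a minimal sketch of what a parallel import might look like, assuming the table has an evenly distributed numeric primary-key column (ID here is hypothetical, as is the wkmData4 output directory):

sqoop import \
  --connect jdbc:oracle:thin:@192.168.1.65/wkm \
  --username wkm --password wkm \
  --target-dir /user/hadoop/wkmData4 \
  --table C_PICRECORD_ALL \
  --split-by ID \
  -m 4
# --split-by names the column whose value range Sqoop partitions across
# mappers; ID is an assumed primary key, and the target dir must not exist yet

With -m 4, Sqoop issues four range queries over the split column, so how evenly the key is distributed matters as much as the mapper count.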
Output:
Note: if your Hadoop cluster runs on Mesos, do not hand out all of the CPU and memory resources, or the job will hang at map 0% reduce 0%.
------------------------------------------------ 1 CPU, 1 GB RAM ----------------------------------------------------------------------------
13/04/16 11:31:05 INFO mapreduce.ImportJobBase: Beginning import of C_PICRECORD_ALL
13/04/16 11:31:05 INFO manager.OracleManager: Time zone has been set to GMT
13/04/16 11:31:08 INFO mapred.JobClient: Running job: job_201304161006_0004
13/04/16 11:31:09 INFO mapred.JobClient: map 0% reduce 0%
13/04/16 11:31:34 INFO mapred.JobClient: map 100% reduce 0%
13/04/16 11:40:42 INFO mapred.JobClient: Job complete: job_201304161006_0004
13/04/16 11:40:42 INFO mapred.JobClient: Counters: 18
13/04/16 11:40:42 INFO mapred.JobClient:   Job Counters
13/04/16 11:40:42 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=563490
13/04/16 11:40:42 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/16 11:40:42 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/04/16 11:40:42 INFO mapred.JobClient:     Launched map tasks=1
13/04/16 11:40:42 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/04/16 11:40:42 INFO mapred.JobClient:   File Output Format Counters
13/04/16 11:40:42 INFO mapred.JobClient:     Bytes Written=503848416
13/04/16 11:40:42 INFO mapred.JobClient:   FileSystemCounters
13/04/16 11:40:42 INFO mapred.JobClient:     HDFS_BYTES_READ=87
13/04/16 11:40:42 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=27953
13/04/16 11:40:42 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=503848416
13/04/16 11:40:42 INFO mapred.JobClient:   File Input Format Counters
13/04/16 11:40:42 INFO mapred.JobClient:     Bytes Read=0
13/04/16 11:40:42 INFO mapred.JobClient:   Map-Reduce Framework
13/04/16 11:40:42 INFO mapred.JobClient:     Map input records=3133366
13/04/16 11:40:42 INFO mapred.JobClient:     Physical memory (bytes) snapshot=99123200
13/04/16 11:40:42 INFO mapred.JobClient:     Spilled Records=0
13/04/16 11:40:42 INFO mapred.JobClient:     CPU time spent (ms)=169400
13/04/16 11:40:42 INFO mapred.JobClient:     Total committed heap usage (bytes)=18546688
13/04/16 11:40:42 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1521811456
13/04/16 11:40:42 INFO mapred.JobClient:     Map output records=3133366
13/04/16 11:40:42 INFO mapred.JobClient:     SPLIT_RAW_BYTES=87
13/04/16 11:40:43 INFO mapreduce.ImportJobBase: Transferred 480.5073 MB in 577.2022 seconds (852.456 KB/sec)
13/04/16 11:40:43 INFO mapreduce.ImportJobBase: Retrieved 3133366 records.
13/04/16 11:40:43 INFO util.AppendUtils: Appending to directory kakouData
------------------------------------ 2 CPU, 1.8 GB RAM -------------------------------------------
13/04/16 13:57:41 INFO mapreduce.ImportJobBase: Beginning import of C_PICRECORD_ALL
13/04/16 13:57:41 INFO manager.OracleManager: Time zone has been set to GMT
13/04/16 13:57:43 INFO mapred.JobClient: Running job: job_201304161356_0001
13/04/16 13:57:44 INFO mapred.JobClient: map 0% reduce 0%
13/04/16 13:58:10 INFO mapred.JobClient: map 100% reduce 0%
13/04/16 14:07:14 INFO mapred.JobClient: Job complete: job_201304161356_0001
13/04/16 14:07:14 INFO mapred.JobClient: Counters: 18
13/04/16 14:07:14 INFO mapred.JobClient:   Job Counters
13/04/16 14:07:14 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=557391
13/04/16 14:07:14 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/16 14:07:14 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/04/16 14:07:14 INFO mapred.JobClient:     Launched map tasks=1
13/04/16 14:07:14 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/04/16 14:07:14 INFO mapred.JobClient:   File Output Format Counters
13/04/16 14:07:14 INFO mapred.JobClient:     Bytes Written=503848416
13/04/16 14:07:14 INFO mapred.JobClient:   FileSystemCounters
13/04/16 14:07:14 INFO mapred.JobClient:     HDFS_BYTES_READ=87
13/04/16 14:07:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=27436
13/04/16 14:07:14 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=503848416
13/04/16 14:07:14 INFO mapred.JobClient:   File Input Format Counters
13/04/16 14:07:14 INFO mapred.JobClient:     Bytes Read=0
13/04/16 14:07:14 INFO mapred.JobClient:   Map-Reduce Framework
13/04/16 14:07:14 INFO mapred.JobClient:     Map input records=3133366
13/04/16 14:07:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=139059200
13/04/16 14:07:14 INFO mapred.JobClient:     Spilled Records=0
13/04/16 14:07:14 INFO mapred.JobClient:     CPU time spent (ms)=178130
13/04/16 14:07:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=17367040
13/04/16 14:07:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1521811456
13/04/16 14:07:14 INFO mapred.JobClient:     Map output records=3133366
13/04/16 14:07:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=87
13/04/16 14:07:14 INFO mapreduce.ImportJobBase: Transferred 480.5073 MB in 573.1275 seconds (858.5166 KB/sec)
13/04/16 14:07:14 INFO mapreduce.ImportJobBase: Retrieved 3133366 records.
I allocated different amounts of scheduler resources for the two runs above, yet the speed barely changed. That is probably tied to the number of map and reduce tasks, or my data set is simply not large enough. From the logs we can see the table holds 3,133,366 records in total; the import took just under 10 minutes, and the data occupies a little over 480 MB on HDFS.
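To double-check the result, the HDFS shell can confirm the size and row count of what was imported; a sketch using the standard hadoop fs commands against the target directory from earlier:

# list the part files Sqoop wrote and report their sizes
hadoop fs -ls /user/hadoop/wkmData2
hadoop fs -du /user/hadoop/wkmData2
# the row count should match the "Retrieved 3133366 records" log line
hadoop fs -cat /user/hadoop/wkmData2/part-m-* | wc -l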
More to come...