----------------------------------
I. Introduction
II. Environment
III. Configuration
1. Local mode
2. MapReduce mode
IV. Testing
----------------------------------
I. Introduction
Pig is a platform for processing large-scale datasets, similar to Google's Sawzall, and was contributed to Apache by Yahoo!. Although the MapReduce framework centers on just two functions, Map and Reduce, going from writing a program to deploying and running it on a cluster still costs users a fair amount of time. Pig simplifies the development of MapReduce jobs and makes it much more convenient to drive data processing on Hadoop clusters with thousands of machines. At Yahoo!, roughly 40% of MapReduce jobs are currently run through Pig.
Pig has two execution modes: local mode and MapReduce mode.
In local mode, all files and all execution stay on the local machine. Local mode can process small amounts of data quickly, which makes it convenient for testing a script's functionality; enter it with the command "pig -x local". MapReduce mode is the mode used for real work with Pig.
II. Environment
OS: CentOS 6.4, 32-bit
Package: pig-0.12.1.tar.gz
This walkthrough builds on the previous two posts.
Both local mode and MapReduce mode are configured on the master host (192.168.2.101), which already runs the Hadoop cluster.
III. Configuration
1. Local mode (configured as root)
# tar zxvf pig-0.12.1.tar.gz -C /usr/
# mv /usr/pig-0.12.1/ /usr/pig
# vim /etc/profile    //add Pig's path to the system environment variables
PIG_HOME=/usr/pig
PATH=$PATH:$PIG_HOME/bin
export PIG_HOME PATH
# . /etc/profile
# jps    //processes on the master host
2822 JobTracker
2628 NameNode
5562 HMaster
2757 SecondaryNameNode
7865 Jps
# pig -x local    //local mode
2014-06-16 21:02:24,784 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.1 (r1585011) compiled Apr 05 2014, 01:41:34
2014-06-16 21:02:24,785 [main] INFO org.apache.pig.Main - Logging error messages to: /usr/pig/pig_1402977744780.log
2014-06-16 21:02:24,855 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2014-06-16 21:02:25,126 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>
2. MapReduce mode (configured as root)
# cd /usr/hadoop/
# tar cvf hadoop-conf.tar conf/    //pack Hadoop's conf directory and copy it into Pig's conf directory
# cp hadoop-conf.tar /usr/pig/conf/
# cd /usr/pig/conf/
# tar xvf hadoop-conf.tar
# vim /etc/profile    //add PIG_CLASSPATH to the system environment variables
export PIG_CLASSPATH=/usr/pig/conf/conf
# . /etc/profile
# jps
2822 JobTracker
2628 NameNode
5562 HMaster
2757 SecondaryNameNode
7865 Jps
# pig    //MapReduce mode
2014-06-16 21:33:01,054 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.1 (r1585011) compiled Apr 05 2014,
2014-06-16 21:33:01,055 [main] INFO org.apache.pig.Main - Logging error messages to: /usr/pig/conf/pig_1402979581052.log
2014-06-16 21:33:01,103 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2014-06-16 21:33:01,541 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
2014-06-16 21:33:01,966 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to
grunt>
IV. Testing
1. A simple test
# pig
grunt> ls /usr/hadoop
hdfs://master:9000/usr/hadoop/tmp    <dir>
grunt> cd ..
grunt> ls
hdfs://master:9000/user/hadoop    <dir>
grunt> cd hadoop
grunt> ls
hdfs://master:9000/user/hadoop/out    <dir>
hdfs://master:9000/user/hadoop/test    <dir>
grunt> cd test
grunt> ls
hdfs://master:9000/user/hadoop/test/test1.txt<r 1>    11
hdfs://master:9000/user/hadoop/test/test2.txt<r 1>    13
grunt> cat test1.txt
hello word
grunt> copyToLocal test1.txt 1.txt    //copy a file from HDFS to the local filesystem
[root@master ~]# ll 1.txt    //check it locally
-rwxrwxrwx. 1 root root 11 Jun 16 22:57 1.txt
grunt> sh jps    //run a system command from within grunt
2822 JobTracker
8700 RunJar
2628 NameNode
5562 HMaster
2757 SecondaryNameNode
9156 Jps
2. Testing with the leaked CSDN password file
[hadoop@master ~]$ ll    //the uploaded file we need
-rwxrwxrwx. 1 hadoop hadoop 287238395 Jun 17 01:26 csdn.sql
[hadoop@master ~]$ wc -l csdn.sql    //over six million records, and in plain text no less
6428632 csdn.sql
[hadoop@master ~]$ head csdn.sql    //view the first 10 lines
zdg # 12344321 # [email protected]
LaoZheng # 670203313747 # [email protected]
fstao # 730413 # [email protected]
huwolf # 2535263 # [email protected]
cadcjl # KIC43dk6! # [email protected]
netsky # s12345 # [email protected]
Michael # apple # [email protected]
siclj # lj7202 # [email protected]
jinbuhuan # 12345 # [email protected]
Eie # hebeibdh # [email protected]
As you can see, each line has three columns (username, password, and email) separated by "#". What we want to do now is drop the username and password and keep only the email column.
[hadoop@master ~]$ pig -x local
grunt> A = LOAD '/home/hadoop/csdn.sql'
>> USING PigStorage('#')
>> AS (id, pw, em);
grunt> B = FOREACH A
>> GENERATE em;
grunt> STORE B INTO '/home/hadoop/email'
>> USING PigStorage();
[hadoop@master ~]$ ll
-rwxrwxrwx. 1 hadoop hadoop 287238395 Jun 17 01:26 csdn.sql
drwxrwxr-x. 2 hadoop hadoop      4096 Jun 17 01:51 email
[hadoop@master ~]$ cd email/
[hadoop@master email]$ ll
-rwxrwxrwx. 1 hadoop hadoop 14935615 Jun 17 01:50 part-m-00000
-rwxrwxrwx. 1 hadoop hadoop 14954292 Jun 17 01:50 part-m-00001
-rwxrwxrwx. 1 hadoop hadoop 14831079 Jun 17 01:50 part-m-00002
-rwxrwxrwx. 1 hadoop hadoop 14802578 Jun 17 01:50 part-m-00003
-rwxrwxrwx. 1 hadoop hadoop 14600189 Jun 17 01:50 part-m-00004
-rwxrwxrwx. 1 hadoop hadoop 14591448 Jun 17 01:50 part-m-00005
-rwxrwxrwx. 1 hadoop hadoop 14573905 Jun 17 01:50 part-m-00006
-rwxrwxrwx. 1 hadoop hadoop 14750540 Jun 17 01:51 part-m-00007
-rwxrwxrwx. 1 hadoop hadoop  8682256 Jun 17 01:51 part-m-00008
[hadoop@master email]$ head part-m-00000
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
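The projection the Pig script performs (split each line on "#", keep only the third field) can be sketched in plain Python. This is just an illustration of the same logic, not how Pig runs it; the `extract_emails` helper name is my own, and the sample records are taken from the `head csdn.sql` output above.

```python
def extract_emails(lines, sep="#"):
    """Keep only the third sep-delimited field (the email) from each record."""
    emails = []
    for line in lines:
        fields = [f.strip() for f in line.split(sep)]
        if len(fields) >= 3:          # skip malformed records
            emails.append(fields[2])
    return emails

sample = [
    "zdg # 12344321 # [email protected]",
    "LaoZheng # 670203313747 # [email protected]",
]
print(extract_emails(sample))  # → ['[email protected]', '[email protected]']
```

Note one difference: Pig's `PigStorage('#')` splits on the bare `#` character and leaves the surrounding spaces in the field, whereas this sketch strips them.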
Error 1: ERROR 2997: Encountered IOException.
grunt> ls
2014-06-16 23:55:37,897 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. File or directory null does not exist.
Details at logfile: /root/pig_1402984352164.log
grunt> ls /    //giving ls an explicit path fixes it
hdfs://master:9000/user    <dir>
hdfs://master:9000/usr    <dir>
grunt> cd /user
grunt> ls
hdfs://master:9000/user/hadoop    <dir>
grunt> cd hadoop
grunt> ls
hdfs://master:9000/user/hadoop/out    <dir>
hdfs://master:9000/user/hadoop/test    <dir>