Pig is a platform for large-scale data analysis built on top of Hadoop. It provides a SQL-like language called Pig Latin, whose compiler turns SQL-like analysis requests into a series of optimized MapReduce jobs.
Lab environment: CentOS 7 with Hadoop already installed.
Download:
http://pig.apache.org/
wget http://mirror.bit.edu.cn/apache/pig/pig-0.16.0/pig-0.16.0.tar.gz
tar -zvxf pig-0.16.0.tar.gz
mv pig-0.16.0 pig
Edit /etc/profile:
export PIG_HOME=/usr/local/pig
export PIG_CLASSPATH=${PIG_HOME}/conf/
export PATH=.:${PIG_HOME}/bin:$PATH
source /etc/profile
Link Pig to Hadoop:
Go to $PIG_HOME/conf, open pig.properties with vi, and add:
fs.defaultFS=hdfs://hadoop-master:9000
mapreduce.jobtracker.address=hadoop-master:9001
cd $PIG_HOME/bin
./pig -- enter the grunt shell
quit; -- exit grunt
ls / -- list a directory
cd aa -- change into a directory
cat a.txt -- view a file
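Grunt can also pass commands straight to HDFS with fs and to the local shell with sh; two illustrative examples (the path is arbitrary):
fs -ls /
sh date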
Pig Latin is a fairly simple language that is executed statement by statement.
When a command spans multiple lines, a \ can be placed at the end of each line.
cat /whr/daily/stats/2017/03/21/cmd
a = LOAD '/whr/daily/stats/2017/03/21/cmd' USING PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:int);
describe a;
b = GROUP a BY(col2);
describe b;
c = FOREACH b GENERATE COUNT(a.col2); -- count the records in each col2 group
dump c;
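Note that c holds one count per distinct col2 value. To count the total number of records in the file, a minimal variant (reusing relation a from above) groups everything into a single group:
g = GROUP a ALL;
total = FOREACH g GENERATE COUNT(a);
dump total;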
Syntax notes (a short example follows this list):
- LOAD loads a file; PigStorage specifies the column delimiter
- STORE writes a result set out
- explain shows the logical or physical execution plan
- describe shows the schema of a relation
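For instance, the relation c built above could be written out and its plan inspected as follows; the output path here is only a placeholder:
STORE c INTO '/tmp/cmd_counts' USING PigStorage(',');
explain c;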
Grunt shortcuts:
- dump: \d
- describe: \de
- explain: \e
- illustrate: \i
- quit: \q
Pig can also be embedded in a Java program through the PigServer API. A cleaned-up version of the word-count example:
import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class WordCount {
    public static void main(String[] args) {
        try {
            // run against the cluster in MapReduce mode
            PigServer pigServer = new PigServer(ExecType.MAPREDUCE);
            pigServer.registerJar("/mylocation/tokenize.jar");
            runMyQuery(pigServer, "myinput.txt");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void runMyQuery(PigServer pigServer, String inputFile) throws IOException {
        pigServer.registerQuery("A = load '" + inputFile + "' using TextLoader();");
        pigServer.registerQuery("B = foreach A generate flatten(tokenize($0));");
        // B has a single field (the token), so group by $0
        pigServer.registerQuery("C = group B by $0;");
        pigServer.registerQuery("D = foreach C generate flatten(group), COUNT(B.$0);");
        pigServer.store("D", "myoutput");
    }
}
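For comparison, the same word count written directly as a Pig Latin script; a minimal sketch, with 'myinput.txt' and 'myoutput' as placeholder paths just as in the Java version:
A = LOAD 'myinput.txt' USING TextLoader();
B = FOREACH A GENERATE FLATTEN(TOKENIZE($0));
C = GROUP B BY $0;
D = FOREACH C GENERATE FLATTEN(group), COUNT(B.$0);
STORE D INTO 'myoutput';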
The steps below repeat the setup on a lab environment running Hadoop 1.x with pig-0.13.0. First make sure the Hadoop host resolves and the cluster is running:
vi /etc/hosts
192.168.40.2 f764... hadoop
cd /app/hadoop-1.2.2/bin
./start-all.sh
cd /home/shiyanlou/install-pack
tar -xzf pig-0.13.0.tar.gz
mv pig-0.13.0 /app
sudo vi /etc/profile
Contents:
export PIG_HOME=/app/pig-0.13.0
export PIG_CLASSPATH=/app/hadoop-1.1.2/conf
export PATH=$PATH:$PIG_HOME/bin
source /etc/profile
echo $PATH
pig
quit
cd /home/shiyanlou/install-pack/class7
unzip website_log.zip
ll
hadoop fs -mkdir /class7/input
hadoop fs -copyFromLocal website_log.txt /class7/input
hadoop fs -cat /class7/input/website_log.txt | less
pig
Enter the following in grunt (the -- comments are optional and may be omitted):
-- load the access log from HDFS, splitting on spaces and keeping only the ip column
records = LOAD 'hdfs://hadoop:9000/class7/input/website_log.txt' USING PigStorage(' ') AS (ip:chararray);
-- group by ip and count the clicks for each ip
records_b = GROUP records BY ip;
records_c = FOREACH records_b GENERATE group, COUNT(records) AS click;
-- sort by click count and keep the 10 ips with the most clicks
records_d = ORDER records_c BY click DESC;
top10 = LIMIT records_d 10;
-- save the result into the class7 directory on HDFS
STORE top10 INTO 'hdfs://hadoop:9000/class7/out';
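To take a quick look at top10 in grunt before (or instead of) writing it to HDFS, dump works as well:
dump top10;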
View the results:
quit
hadoop fs -ls /class7/out
hadoop fs -cat /class7/out/part-r-00000
Some more advanced syntax:
http://www.cnblogs.com/siwei1988/archive/2012/08/06/2624912.html