Pig: A Getting-Started Example

The test data lives in /home/hadoop/luogankun/workspace/sync_data/pig.
The data in person.txt is comma-separated:

1,zhangsan,112
2,lisi,113
3,wangwu,114
4,zhaoliu,115

 

The data in score.txt is tab-separated:

1       20
2       30
3       40
5       50

 

When running in MapReduce mode, Pig operates on files in HDFS, so the files must be uploaded to HDFS first:

cd /home/hadoop/luogankun/workspace/sync_data/pig
hadoop fs -put person.txt input/pig/person.txt
hadoop fs -put score.txt input/pig/score.txt

 

 

Loading the files (from HDFS)

a = load 'input/pig/person.txt' using PigStorage(',') as (id:int, name:chararray, age:int);
b = load 'input/pig/score.txt' using PigStorage('\t') as (id:int, score:int);
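A rough Python sketch (not Pig itself) of what the two load statements do: PigStorage splits each line on the given delimiter, and the as-clause casts the fields to the declared types. The load helper and its (name, caster) schema format are illustrative inventions, not a Pig API:

```python
# Illustrative analogue of: load '...' using PigStorage(delim) as (schema)
def load(lines, delim, schema):
    # schema mirrors "id:int, name:chararray" as (name, caster) pairs
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delim)
        rows.append(tuple(caster(f) for (_name, caster), f in zip(schema, fields)))
    return rows

person = load(["1,zhangsan,112", "2,lisi,113"], ",",
              [("id", int), ("name", str), ("age", int)])
print(person)  # [(1, 'zhangsan', 112), (2, 'lisi', 113)]
```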

 

 

Inspecting the schema

describe a
a: {id: int,name: chararray,age: int}

describe b
b: {id: int,score: int}

 

Viewing the data

dump a
(1,zhangsan,112)
(2,lisi,113)
(3,wangwu,114)
(4,zhaoliu,115)

dump b
(1,20)
(2,30)
(3,40)
(5,50)

Note that dump launches a MapReduce job.

 

Filtering

Select the rows of person whose id is less than 4:

aa = filter a by id < 4;

dump aa;
(1,zhangsan,112)
(2,lisi,113)
(3,wangwu,114)

Equality in Pig is written with ==, for example: aa = filter a by id == 4;
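The two filter conditions above can be illustrated in plain Python (this is a sketch of the semantics, not Pig code), using the sample rows of person:

```python
# Rows of person as (id, name, age) tuples
person = [(1, "zhangsan", 112), (2, "lisi", 113),
          (3, "wangwu", 114), (4, "zhaoliu", 115)]

# "filter a by id < 4" keeps rows whose first field is below 4
aa = [row for row in person if row[0] < 4]
# "filter a by id == 4" is the equality form
eq = [row for row in person if row[0] == 4]

print(aa)  # the three rows with id 1, 2, 3
print(eq)  # [(4, 'zhaoliu', 115)]
```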

 

Joining relations

c = join a by id left, b by id;

describe c
c: {a::id: int,a::name: chararray,a::age: int,b::id: int,b::score: int}

Two colons separate the relation alias from the field name; a single colon separates a field from its type.

dump c

(1,zhangsan,112,1,20)
(2,lisi,113,2,30)
(3,wangwu,114,3,40)
(4,zhaoliu,115,,)

Because this is a left (outer) join, every row of a is kept, four rows in total, and the fourth row has no score since b contains no entry for id 4. Note that b's row with id 5 has no match in a and therefore does not appear.
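The left outer join semantics can be sketched in plain Python (not Pig) with the tutorial's sample rows; unmatched rows of a are padded with nulls on b's side:

```python
a = [(1, "zhangsan", 112), (2, "lisi", 113),
     (3, "wangwu", 114), (4, "zhaoliu", 115)]
b = [(1, 20), (2, 30), (3, 40), (5, 50)]

# Index b's rows by join key (the id in the first field)
b_by_id = {row[0]: row for row in b}

c = []
for row in a:  # every row of a is kept in a left join
    match = b_by_id.get(row[0], (None, None))  # nulls when b has no match
    c.append(row + match)

print(c[-1])  # (4, 'zhaoliu', 115, None, None) -- no score for id 4
```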

 

Projecting columns with foreach

d =foreach c generate a::id as id, a::name as name, b::score as score, a::age as age;



describe d;
d: {id: int,name: chararray,score: int,age: int}

dump d
(1,zhangsan,20,112)
(2,lisi,30,113)
(3,wangwu,40,114)
(4,zhaoliu,,115)

Note: in a foreach statement, a space on at least one side of the equals sign is sufficient; with no space on either side of it, the statement fails to parse.
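The foreach ... generate above is a projection that selects and reorders columns, mapping (id, name, age, id, score) to (id, name, score, age). A plain-Python sketch (not Pig) over two rows of c:

```python
# Two rows of the joined relation c: (a::id, a::name, a::age, b::id, b::score)
c = [(1, "zhangsan", 112, 1, 20),
     (4, "zhaoliu", 115, None, None)]

# Project to (id, name, score, age), as the generate clause does
d = [(aid, name, score, age) for (aid, name, age, _bid, score) in c]

print(d)  # [(1, 'zhangsan', 20, 112), (4, 'zhaoliu', None, 115)]
```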

 

Storing the result on HDFS

store d into 'output/pig/person_score' using PigStorage(',');   -- the file exported to HDFS is comma-delimited

hadoop fs -ls output/pig/person_score
hadoop fs -cat output/pig/person_score/part-r-00000
1,zhangsan,20,112
2,lisi,30,113
3,wangwu,40,114
4,zhaoliu,,115



hadoop fs -rmr output/pig/person_score

store d into 'output/pig/person_score';     -- with no using clause the exported file is tab-delimited
hadoop fs -ls output/pig/person_score
hadoop fs -cat output/pig/person_score/part-r-00000
1 zhangsan 20 112
2 lisi 30 113
3 wangwu 40 114
4 zhaoliu  115

(In the last row the empty score field sits between two consecutive tabs.)
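The two store variants differ only in the field delimiter. A small Python sketch (not Pig) of how a row with a null score is serialized under each delimiter; Pig writes a null field as an empty string between delimiters:

```python
def fmt(row, delim):
    # Serialize one tuple the way PigStorage does: nulls become empty fields
    return delim.join("" if f is None else str(f) for f in row)

row = (4, "zhaoliu", None, 115)  # the left-join row with no score
print(fmt(row, ","))   # 4,zhaoliu,,115
print(fmt(row, "\t"))  # tab-delimited, with an empty field for the score
```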

 

 

Running Pig from a script file

Put all of the Pig statements above into a single .pig script and execute it:
/home/hadoop/luogankun/workspace/shell/pig/person_score.pig

a = load 'input/pig/person.txt' using PigStorage(',') as (id:int, name:chararray, age:int);
b = load 'input/pig/score.txt' using PigStorage('\t') as (id:int, score:int);
c = join a by id left, b by id;
d = foreach c generate a::id as id, a::name as name, b::score as score, a::age as age;
store d into 'output/pig/person_score2' using PigStorage(',');

 

Run the person_score.pig script:

cd /home/hadoop/luogankun/workspace/shell/pig
pig person_score.pig

 

 

Passing parameters to a Pig script

Script location: /home/hadoop/luogankun/workspace/shell/pig/mulit_params_demo01.pig

log = LOAD '$input' AS (user:chararray, time:long, query:chararray);
lmt = LIMIT log $size;
DUMP lmt;

 

Upload the data to HDFS:

cd /home/hadoop/luogankun/workspace/shell/pig
hadoop fs -put excite-small.log input/pig/excite-small.log

Option 1: pass each parameter individually

pig -param input=input/pig/excite-small.log -param size=4 mulit_params_demo01.pig

Option 2: store the parameters in a text file

/home/hadoop/luogankun/workspace/shell/pig/mulit_params.txt:

input=input/pig/excite-small.log
size=5

pig -param_file mulit_params.txt mulit_params_demo01.pig
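Either way, parameter passing is text substitution: before the script runs, Pig replaces $input and $size in the script body with the supplied values. Python's string.Template uses the same $name syntax, so it can sketch the effect (this is an illustration, not how Pig is implemented):

```python
from string import Template

# The script body with its $-placeholders, as in mulit_params_demo01.pig
script = ("log = LOAD '$input' AS (user:chararray, time:long, query:chararray);\n"
          "lmt = LIMIT log $size;\n"
          "DUMP lmt;")

# Values as given via -param (or one per line in a -param_file)
params = {"input": "input/pig/excite-small.log", "size": "4"}

substituted = Template(script).substitute(params)
print(substituted)  # placeholders replaced, ready to execute
```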

 
