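Goals:

Goal 1: For each student, count how many teachers have taught them

Method 1: DISTINCT first, then count

      - DISTINCT removes duplicate tuples across the whole relation

Method 2: Group first

      - Nested FOREACH

      - Use DISTINCT inside the nested block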

 

First, create a data source file and upload it to HDFS:

[hadoop@hadoop1 ~]$ cat score.txt 
James,Network,Tiger,100
James,Database,Tiger,99
James,PDE,Yao,95
Vincent,Network,Tiger,95
Vincent,PDE,Yao,98
Vincent,PDE,
NocWei,PDE,Yao,100
[hadoop@hadoop1 ~]$ hadoop fs -put score.txt /score.txt
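
As a rough illustration of how these lines turn into four-field tuples once a schema is applied, here is a minimal Python sketch (it is not Pig code). It assumes that empty or missing trailing fields come through as null, which is consistent with the (Vincent,PDE,,) row that DUMP A prints later in this post. The names `FIELDS` and `parse_line` are hypothetical and exist only for this example.

```python
# Hypothetical helper: mimics splitting one line of score.txt into
# (student, course, teacher, score). This is a sketch, not Pig itself.
FIELDS = ("student", "course", "teacher", "score")

def parse_line(line, delimiter=","):
    parts = line.rstrip("\n").split(delimiter)
    # Pad short rows with None and drop any extra columns.
    parts = (parts + [None] * len(FIELDS))[:len(FIELDS)]
    # Assumption: an empty field is treated as null (None here).
    student, course, teacher, score = (p if p else None for p in parts)
    return (student, course, teacher, int(score) if score is not None else None)

lines = [
    "James,Network,Tiger,100",
    "James,Database,Tiger,99",
    "James,PDE,Yao,95",
    "Vincent,Network,Tiger,95",
    "Vincent,PDE,Yao,98",
    "Vincent,PDE,",
    "NocWei,PDE,Yao,100",
]
rows = [parse_line(line) for line in lines]
for row in rows:
    print(row)
```

The incomplete line `Vincent,PDE,` is the one to watch: it has no teacher and no score, and it affects the counts in Goal 1.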

 

[hadoop@hadoop1 ~]$ pig
grunt> A = LOAD '/score.txt' USING PigStorage(',') AS (student,course,teacher,score:int);
grunt> DESCRIBE A;
grunt> B = FOREACH A GENERATE student, teacher;                      -- keep only student and teacher; drop the other fields
grunt> DESCRIBE B;                                                   -- show B's schema: it now has only two fields
grunt> C = DISTINCT B;                                               -- remove duplicate (student, teacher) pairs
grunt> D = FOREACH ( GROUP C BY student ) GENERATE group AS student , COUNT(C);
grunt> DUMP D;

# Result (Vincent's 3 includes the row whose teacher field is empty)
(James,2)
(NocWei,1)
(Vincent,3)
grunt>
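
To make the logic of Method 1 easier to check by hand, the following Python sketch mirrors the same steps: project (student, teacher), deduplicate the pairs, then count pairs per student. It is an illustration only, not how Pig executes the query. The function name `teachers_per_student` is hypothetical, and the empty teacher is modeled as `None`.

```python
from collections import Counter

# Same data as score.txt; the incomplete row is modeled with None values.
rows = [
    ("James", "Network", "Tiger", 100),
    ("James", "Database", "Tiger", 99),
    ("James", "PDE", "Yao", 95),
    ("Vincent", "Network", "Tiger", 95),
    ("Vincent", "PDE", "Yao", 98),
    ("Vincent", "PDE", None, None),
    ("NocWei", "PDE", "Yao", 100),
]

def teachers_per_student(rows):
    # B + C: keep (student, teacher) and remove duplicate pairs.
    pairs = {(student, teacher) for student, _course, teacher, _score in rows}
    # D: group the distinct pairs by student and count them.
    counts = Counter(student for student, _teacher in pairs)
    return dict(sorted(counts.items()))

result = teachers_per_student(rows)
print(result)  # {'James': 2, 'NocWei': 1, 'Vincent': 3}
```

Because the pair (Vincent, None) is distinct from Vincent's other pairs, it is counted as a third "teacher". If that is not wanted, one option is to filter out rows with a missing teacher before deduplicating.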

 

Method 2: group first, then use DISTINCT inside a nested FOREACH

grunt> E = group B by student;
grunt> F = foreach E                                
>> {                                            
>> T = B.teacher;                               
>> uniq = DISTINCT T;                           
>> generate group as student,COUNT(uniq) as cnt;
>> }
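
The nested FOREACH above can be sketched in Python as follows: group the teachers by student, deduplicate within each group, then count. This is an illustration under one stated assumption: Pig's COUNT is documented to skip tuples whose first field is null, so the sketch does not count the empty teacher. If that holds on your Pig version, Vincent's count here would be 2 rather than the 3 produced by Method 1; it is worth confirming with DUMP F. The function name `teachers_per_student_grouped` is hypothetical.

```python
from collections import defaultdict

# (student, teacher) pairs, i.e. relation B; the empty teacher is None.
pairs = [
    ("James", "Tiger"),
    ("James", "Tiger"),
    ("James", "Yao"),
    ("Vincent", "Tiger"),
    ("Vincent", "Yao"),
    ("Vincent", None),
    ("NocWei", "Yao"),
]

def teachers_per_student_grouped(pairs):
    # E = group B by student
    groups = defaultdict(list)
    for student, teacher in pairs:
        groups[student].append(teacher)

    counts = {}
    for student, teachers in groups.items():
        uniq = set(teachers)  # uniq = DISTINCT T
        # Assumption: like Pig's COUNT, skip null (None) values.
        counts[student] = sum(1 for t in uniq if t is not None)
    return dict(sorted(counts.items()))

result = teachers_per_student_grouped(pairs)
print(result)
```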

 

 

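Goal 2: Find the top two students in each course

Step 1: GROUP BY

- GROUP BY nests each course's records in a bag

Step 2: ORDER BY

- Sort inside a nested FOREACH

Step 3: LIMIT

- Used together with ORDER BY

Step 4: FLATTEN

- Removes the nesting (un-nests the bag)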

grunt> A = LOAD '/score.txt' USING PigStorage(',') as (student,course,teacher,score:int);
grunt> dump A;
(James,Network,Tiger,100)
(James,Database,Tiger,99)
(James,PDE,Yao,95)
(Vincent,Network,Tiger,95)
(Vincent,PDE,Yao,98)
(Vincent,PDE,,)
(NocWei,PDE,Yao,100)
grunt> B = FOREACH A GENERATE student,course,score;
grunt> dump B;
(James,Network,100)
(James,Database,99)
(James,PDE,95)
(Vincent,Network,95)
(Vincent,PDE,98)
(Vincent,PDE,)
(NocWei,PDE,100)
grunt> C = group B by course;
grunt> dump C;
(PDE,{(NocWei,PDE,100),(Vincent,PDE,),(Vincent,PDE,98),(James,PDE,95)})
(Network,{(Vincent,Network,95),(James,Network,100)})
(Database,{(James,Database,99)})
grunt> D = FOREACH C                        
>> {                                    
>> sorted = ORDER B BY score DESC;      
>> top = LIMIT sorted 2;                
>> GENERATE group AS course, top AS top;
>> }
grunt> dump D;
(Database,{(James,Database,99)})
(Network,{(James,Network,100),(Vincent,Network,95)})
(PDE,{(NocWei,PDE,100),(Vincent,PDE,98)})
grunt> E = FOREACH D GENERATE course,FLATTEN(top);                      -- un-nest the bag so each top student becomes one flat row
grunt> dump E;
(Database,James,Database,99)
(Network,James,Network,100)
(Network,Vincent,Network,95)
(PDE,NocWei,PDE,100)
(PDE,Vincent,PDE,98)
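
The four steps of Goal 2 (group, order, limit, flatten) can be mirrored in a short Python sketch to check the expected rows. It is only an illustration of the logic, not of how Pig runs the job. The function name `top_n_per_course` is hypothetical, and the sketch assumes a missing score sorts after every real score in descending order, which matches the output shown above.

```python
from collections import defaultdict

# (student, course, score), i.e. relation B; the missing score is None.
rows = [
    ("James", "Network", 100),
    ("James", "Database", 99),
    ("James", "PDE", 95),
    ("Vincent", "Network", 95),
    ("Vincent", "PDE", 98),
    ("Vincent", "PDE", None),
    ("NocWei", "PDE", 100),
]

def top_n_per_course(rows, n=2):
    # Step 1: group the records by course.
    groups = defaultdict(list)
    for student, course, score in rows:
        groups[course].append((student, course, score))

    flat = []
    for course in sorted(groups):
        # Step 2: sort by score, highest first; None goes last.
        ranked = sorted(
            groups[course],
            key=lambda r: (r[2] is not None, r[2] if r[2] is not None else 0),
            reverse=True,
        )
        # Step 3: keep the first n. Step 4: flatten into one row per student.
        for student, c, score in ranked[:n]:
            flat.append((course, student, c, score))
    return flat

for row in top_n_per_course(rows):
    print(row)
```

As in the Pig output, the course name appears twice in each row: once from the group key and once from the original tuple. Projecting only student and score before flattening would avoid the repetition.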