## Pig Data Types
### 1: Simple (scalar) types (e.g. int, long, float, double, chararray, bytearray)
### 2: Complex types (map, tuple, bag)
2.1 map: a set of key/value pairs that maps chararray keys to data elements.
2.2 tuple: a fixed-length, ordered collection of Pig data elements. (A tuple corresponds to a row in SQL, and a tuple's fields correspond to SQL columns.)
A tuple constant is written with parentheses marking the tuple structure and commas separating the fields, e.g. ('bob',55).
2.3 bag: an unordered collection of tuples; because a bag is unordered, you cannot fetch a tuple from it by position.
A bag constant is delimited by braces, with the tuples inside separated by commas, e.g. {('bob',55),('sally',52),('john',25)}.
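A minimal sketch of the three complex-type constants inside a Pig script (the relation names and the input file here are hypothetical, just to have something to project from):
```
-- hypothetical one-column input
A = LOAD 'data.txt' AS (f1:chararray);
B = FOREACH A GENERATE
        ('bob', 55)               AS t:(name:chararray, age:int),  -- tuple constant
        {('bob',55),('sally',52)} AS b:bag{},                      -- bag constant
        ['name'#'bob']            AS m:map[];                      -- map constant
DESCRIBE B;
```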
### 3: Pig vs. database tables
3.1 relation --> table: a relation is a bag.
3.2 bag --> (no direct SQL equivalent): a bag is a collection of tuples, written with {}.
3.3 tuple --> row: tuples are not required to contain the same number of fields, and fields at the same position in different tuples are not required to share a data type (pretty loose, right?).
## 1. Data Preparation (upload the data to HDFS)
### 1: student.txt
```
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
```
### 2: student_details.txt
```
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
```
### 3: customers.txt
```
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
```

Load it (the salary column holds decimal values, so it is declared as double rather than int):

```
customers1 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')
    as (id:int, name:chararray, age:int, address:chararray, salary:double);
```
### 4: orders.txt
```
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
```

Load it:

```
orders = LOAD 'hdfs://mycluster/test/test_data/orders.txt' USING PigStorage(',')
    as (oid:int, date:chararray, customer_id:int, amount:int);
```
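Before any of these LOAD statements will work, the four files must be on HDFS. A sketch of the upload (the /test/test_data target path comes from the LOAD statements above; the local file names are assumptions):
```
hadoop fs -mkdir -p /test/test_data
hadoop fs -put student.txt student_details.txt customers.txt orders.txt /test/test_data/
```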
## 2. Loading Data
### 1: Load the data
```
student = LOAD '/test/test_data/student.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

-- the cluster can also be named explicitly (hdfs://mycluster)
student_details = LOAD 'hdfs://mycluster/test/test_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
```
## 3. Diagnostic Operators
The LOAD statement simply loads the data into the specified relation in Apache Pig. To verify that a LOAD statement did what you expect, you use the diagnostic operators. Pig Latin provides four of them:
1. DUMP (execute the plan and print the relation)
2. DESCRIBE (view the relation's schema)
3. EXPLAIN (view the logical, physical, and MapReduce execution plans)
4. ILLUSTRATE (step through the statements on sample data, presented as tables)
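For example, run them against the student relation loaded above (a sketch; all four print to the console):
```
DUMP student;        -- execute the plan and print every tuple
DESCRIBE student;    -- print the relation's schema
EXPLAIN student;     -- show the logical, physical, and MapReduce plans
ILLUSTRATE student;  -- walk sample rows through each statement, shown as tables
```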
## 4. The Pig GROUP Operator
### 1: Group the records/tuples of a relation by age
```
-- group by a single column
group_data = GROUP student BY age;
-- group by multiple columns
group_data2 = GROUP student_details BY (age, city);
```
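A grouped relation pairs each key with a bag of the matching tuples; a typical follow-up (a minimal sketch against the relations above, with hypothetical alias names) aggregates over each bag, e.g. counting students per age:
```
age_groups = GROUP student_details BY age;
-- "group" holds the grouping key; the bag keeps the name of the grouped relation
age_counts = FOREACH age_groups GENERATE group AS age, COUNT(student_details) AS cnt;
DUMP age_counts;
```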
## 5. Pig JOIN
1: Self-join
2: Inner join
3: Outer join: left join, right join, and full join
Note: an outer join differs from an inner join in that it returns all the rows from at least one of the relations, even where there is no match.
### 5.1 Self-join
A self-join joins a table with itself, as if it were two separate relations, temporarily renaming at least one of them. In Apache Pig, a self-join is usually performed by loading the same data more than once under different aliases (names). So load the contents of customers.txt as two relations and join them on id:
```
customers1 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')
    as (id:int, name:chararray, age:int, address:chararray, salary:double);
customers2 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')
    as (id:int, name:chararray, age:int, address:chararray, salary:double);
customers3 = JOIN customers1 BY id, customers2 BY id;
```
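Every id matches its own copy, so customers3 simply doubles each row's columns, with the field names disambiguated by alias. A quick sanity check (sketch; the schema comment is approximate):
```
DESCRIBE customers3;
-- customers3: {customers1::id: int, customers1::name: chararray, ...,
--              customers2::id: int, customers2::name: chararray, ...}
DUMP customers3;
```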
### 5.2 Inner join
The inner join is the most frequently used; it is also called an equijoin. An inner join returns a row whenever there is a match in both tables: it creates a new relation by combining the column values of two relations (say A and B) according to a join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the predicate; for each matched pair, the column values of A and B are combined into one result row.
```
customer_orders = JOIN customers1 BY id, orders BY customer_id;
```
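With the sample data above, only customers 2, 3, and 4 appear in orders, so (ignoring row order) the result should look roughly like this sketch:
```
DUMP customer_orders;
-- (2,Khilan,25,Delhi,1500.0,101,2009-11-20 00:00:00,2,1560)
-- (3,kaushik,23,Kota,2000.0,102,2009-10-08 00:00:00,3,3000)
-- (3,kaushik,23,Kota,2000.0,100,2009-10-08 00:00:00,3,1500)
-- (4,Chaitali,25,Mumbai,6500.0,103,2008-05-20 00:00:00,4,2060)
```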
### 5.3 Outer join
Unlike an inner join, an outer join returns all rows from at least one of the relations. An outer join can be performed in three ways:
```
-- left outer join: all customers, matched orders where they exist
outer_left = JOIN customers1 BY id LEFT OUTER, orders BY customer_id;
-- right outer join: all orders, matched customers where they exist
outer_right = JOIN customers1 BY id RIGHT OUTER, orders BY customer_id;
-- full outer join: all rows from both relations
outer_full = JOIN customers1 BY id FULL OUTER, orders BY customer_id;
```
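For example, in outer_left the customers with no orders (ids 1, 5, 6, 7 in the sample data) still appear, with nulls (printed as empty fields) in the four order columns; a sketch:
```
DUMP outer_left;
-- (1,Ramesh,32,Ahmedabad,2000.0,,,,)
-- (5,Hardik,27,Bhopal,8500.0,,,,)
-- ...the matched customers (2, 3, 4) carry their order fields as in the inner join
```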
### 5.4 Join on multiple keys
A pair of rows joins only when all of the listed keys match, position for position:
```
-- cast amount so both key pairs compare as the same type (salary is double)
customer_orders = JOIN customers1 BY (id, salary), orders BY (customer_id, (double)amount);
```
## Pig De-duplication
Note: Pig's DISTINCT operator de-duplicates whole tuples (entire rows); to de-duplicate on a single field you have to combine GROUP BY with a nested FOREACH, as sketched below.
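A minimal sketch of both, assuming the customers1 relation loaded above (the aliases here are hypothetical):
```
-- whole-tuple de-duplication: drops only rows identical in every field
unique_customers = DISTINCT customers1;

-- per-field de-duplication: group by the field, then keep one tuple per group
by_city = GROUP customers1 BY address;
one_per_city = FOREACH by_city {
    first = LIMIT customers1 1;   -- take one tuple from each group's bag
    GENERATE FLATTEN(first);
};
DUMP one_per_city;
```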