pig 基础.md

## pig的数据类型

###1:基本数据类型

 ###2:复杂类型(map、 tuple、 bag)

    2.1 map: 是一种chararray 和数据元素之间的键值对映射。

    2.2 tuple 是一定长的,包含有序pig数据元素的集合。(一个tuple相当于sql中的一行,而tuple的字段相当于sql中的列。)

              tuple常量使用圆括号来指示tuple结构,使用逗号来划分tuple中的字段。如(‘bob’,55)

    2.3 bag:是一个无序的tuple集合,因为它无序,所以无法通过位置获取bag中的tuple。 

                bag常量是通过花括号进行划分的,bag中的tuple用逗号来分隔,如{(‘bob’,55),(‘sally’,52),(‘john’,25)}。



  ###3 pig 与 数据库表的对比

3.1 关系(relation)--> 表  :一个关系是一个包

3.2  包(bag)-->            : 一个bag是一个tuple的集合,bag使用的是{}

3.3  元组( tuple)--> 行        :并不要求每一个“元组”都含有相同数量的字段,并且也不会要求各“元组”中在相同位置处的字段具有相同的数据类型(太随意了,是吧?)

## 一数据准备(上传数据到hdfs),

### 1: student.txt

```

001,Rajiv,Reddy,21,9848022337,Hyderabad

002,siddarth,Battacharya,22,9848022338,Kolkata

003,Rajesh,Khanna,22,9848022339,Delhi

004,Preethi,Agarwal,21,9848022330,Pune

005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar

006,Archana,Mishra,23,9848022335,Chennai

007,Komal,Nayak,24,9848022334,trivendram

008,Bharathi,Nambiayar,24,9848022333,Chennai

```

### 2:student_details.txt

```

001,Rajiv,Reddy,21,9848022337,Hyderabad

002,siddarth,Battacharya,22,9848022338,Kolkata

003,Rajesh,Khanna,22,9848022339,Delhi

004,Preethi,Agarwal,21,9848022330,Pune

005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar

006,Archana,Mishra,23,9848022335,Chennai

007,Komal,Nayak,24,9848022334,trivendram

008,Bharathi,Nambiayar,24,9848022333,Chennai

```

### 3customers.txt

```

1,Ramesh,32,Ahmedabad,2000.00

2,Khilan,25,Delhi,1500.00

3,kaushik,23,Kota,2000.00

4,Chaitali,25,Mumbai,6500.00 

5,Hardik,27,Bhopal,8500.00

6,Komal,22,MP,4500.00

7,Muffy,24,Indore,10000.00

customers1 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')

   as (id:int, name:chararray, age:int, address:chararray, salary:int);

``` 

### 4 orders.txt

```

102,2009-10-08 00:00:00,3,3000

100,2009-10-08 00:00:00,3,1500

101,2009-11-20 00:00:00,2,1560

103,2008-05-20 00:00:00,4,2060

 orders = LOAD 'hdfs://mycluster/test/test_data/orders.txt' USING PigStorage(',')

   as (oid:int, date:chararray, customer_id:int, amount:int);

```

## 二: 加载数据

    1:load 数据

```

student = LOAD '/test/test_data/student_data.txt' USING PigStorage(',')

    as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

# 指定集群(hdfs://mycluster)

student_details = LOAD 'hdfs://mycluster/test/test_data/student_details.txt' USING PigStorage(',')

   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

```


## 三 诊断运算符(Diagnostic)

 Load 语句会简单地将数据加载到Apache Pig中的指定关系中。要验证Load语句的执行,必须使用Diagnostic运算符。Pig Latin提供四种不同类型的诊断运算符:

1 Dump 运算符 (执行)

2 Describe 运算符   (查看数据schema)

3 Explanation 运算符

4 Illustrate 运算符  ( 以表的形式呈现数据)

## 四 Pig Group 运算符

###1 让我们按照年龄关系中的记录/元组进行分组

```

# 按一列分组

group_data = group student by age;

# 按多列分组   

group_data = Group student_details by (age, city);

```

## 五 pig join 

1: Self-join

2: Inner-join

3: Outer-join − left join, right join, and full join

       注:Outer-join 与 inner-join 的区别就是至少返回一个关系中的所有行。

###5.1 self join

```

# self-join用于将表与其自身连接,就像表是两个关系一样,临时重命名至少一个关系。通常,在Apache Pig中,为了执行self-join,我们将在不同的别名(名称)下多次加载相同的数据。那么,将文件 customers.txt 的内容加载为两个表,如下所示。customers1 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')

   as (id:int, name:chararray, age:int, address:chararray, salary:int);

customers2 = LOAD 'hdfs://mycluster/test/test_data/customers.txt' USING PigStorage(',')

   as (id:int, name:chararray, age:int, address:chararray, salary:int);


customers3 = join customers1 by id, customers2 by id

```

### 5.2 Inner join(内部链接)

Inner Join使用较为频繁;它也被称为等值连接。当两个表中都存在匹配时,内部连接将返回行。基于连接谓词(join-predicate),通过组合两个关系(例如A和B)的列值来创建新关系。查询将A的每一行与B的每一行进行比较,以查找满足连接谓词的所有行对。当连接谓词被满足时,A和B的每个匹配的行对的列值被组合成结果行。

```

 coustomer_orders = JOIN customers1 BY id, orders BY customer_id;

```

### 5.3   Outer Join

Outer Join:与inner join不同,outer join返回至少一个关系中的所有行。outer join操作以三种方式执行:

```

# Left-Join

 outer_left = join customers1 by id left outer, orders by customer_id;

#Right-Join

  outer_right= join customers1 by id right outer, orders by customer_id;

#Full-JJoin

outer_full= join customers1 by id full outer, orders by customer_id;

```

### 5.4 多条件链接

```

coustomer_orders = JOIN customers1 BY (id, salary) , orders BY customer_id, amount);

```

## pig 去重

注:pig的distinct 按照元组去重(正行),想要实现按照字段去重就需要借助groupBy

你可能感兴趣的:(pig 基础.md)