As a language, Pig Latin ships with a sizeable list of built-in functions. The categories are detailed below.
1:Eval Functions
a:AVG
Computes the average. Valid for int, long, float, double and bytearray.
The result type is long, long, double, double and double, respectively.
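AVG is normally applied to a bag produced by GROUP. A minimal sketch, assuming the same hypothetical 'student' input that the MAX and MIN examples use:

```pig
-- 'student' is an assumed input file: name, session, gpa
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
-- Collect each student's rows into one bag
B = GROUP A BY name;
-- Average the gpa column of each bag
X = FOREACH B GENERATE group, AVG(A.gpa);
DUMP X;
```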
b:CONCAT
Concatenates two fields into one.
CONCAT (expression, expression)
A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
X = FOREACH A GENERATE CONCAT(f2,f3);
DUMP X;
(opensource)
(mapreduce)
(piglatin)
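c:COUNT
Counts the elements in a bag, skipping tuples whose first field is null.
d:COUNT_STAR
Counts the elements in a bag, including tuples with null values.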
X = FOREACH B GENERATE COUNT_STAR(A);
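Both functions need a preceding GROUP ALL or GROUP BY. The sketch below puts them side by side so the difference in null handling is visible; the file name 'data' and the field names are assumptions made for illustration:

```pig
-- 'data' is an assumed input file in which f2 may contain nulls
A = LOAD 'data' AS (f1:int, f2:int);
-- Put every row into a single bag
B = GROUP A ALL;
-- COUNT skips tuples whose first field is null, COUNT_STAR counts them too
X = FOREACH B GENERATE COUNT(A.f2) AS non_null, COUNT_STAR(A.f2) AS all_rows;
DUMP X;
```

If f2 never contains a null, the two columns should be equal.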
e:DIFF
Compares two fields (bags) in a tuple and returns the tuples that appear in one bag but not the other.
DIFF (expression, expression)
A = LOAD 'bag_data' AS (B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});
DUMP A;
({(8,9),(0,1)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7)},{(2,2),(3,7)})
DESCRIBE A;
a: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}
X = FOREACH A GENERATE DIFF(B1,B2);
DUMP X;
({(0,1),(1,1)})
({})
({(6,7),(2,2)})
f:IsEmpty
Checks whether a bag or map is empty.
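A common use of IsEmpty is filtering the result of a COGROUP to find keys that are missing on one side. A minimal sketch, where both file names and the schemas are assumptions:

```pig
-- 'ssn.txt' and 'students.txt' are assumed input files
SSN = LOAD 'ssn.txt' USING PigStorage() AS (ssn:long);
SSN_NAME = LOAD 'students.txt' USING PigStorage() AS (ssn:long, name:chararray);
-- One output tuple per ssn, holding a bag from each input
X = COGROUP SSN BY ssn, SSN_NAME BY ssn;
-- Keep only the ssn values that have no matching student record
Y = FILTER X BY IsEmpty(SSN_NAME);
DUMP Y;
```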
g:MAX
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group,MAX(A.gpa);
DUMP X;
(John,4.0F)
(Mary,4.0F)
h:MIN
Used together with GROUP.
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group, MIN(A.gpa);
DUMP X;
(John,3.7F)
(Mary,3.8F)
i:SIZE
Computes the number of elements. What counts as an element depends on the data type.
A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
X = FOREACH A GENERATE SIZE(f1);
DUMP X;
(6L)
(6L)
(3L)
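For int, long, float and double the result is always 1.
chararray: the number of characters
bytearray: the number of bytes
tuple: the number of fields
bag: the number of tuples
map: the number of key/value pairs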
j:SUM
Example
A = LOAD 'data' AS (owner:chararray, pet_type:chararray, pet_num:int);
DUMP A;
(Alice,turtle,1)
(Alice,goldfish,5)
(Alice,cat,2)
(Bob,dog,2)
(Bob,cat,2)
B = GROUP A BY owner;
DUMP B;
(Alice,{(Alice,turtle,1),(Alice,goldfish,5),(Alice,cat,2)})
(Bob,{(Bob,dog,2),(Bob,cat,2)})
X = FOREACH B GENERATE group, SUM(A.pet_num);
DUMP X;
(Alice,8L)
(Bob,4L)
k:TOKENIZE
Splits a string and outputs a bag of words.
TOKENIZE(expression [, 'field_delimiter'])
Example
A = LOAD 'data' AS (f1:chararray);
DUMP A;
(Here is the first string.)
(Here is the second string.)
(Here is the third string.)
X = FOREACH A GENERATE TOKENIZE(f1);
DUMP X;
({(Here),(is),(the),(first),(string.)})
({(Here),(is),(the),(second),(string.)})
({(Here),(is),(the),(third),(string.)})
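TOKENIZE is usually paired with FLATTEN so that each word becomes its own row. The classic word count can be sketched as follows, reusing the 'data' input from the example above; the alias names are my own:

```pig
A = LOAD 'data' AS (f1:chararray);
-- One output row per word instead of one bag per line
B = FOREACH A GENERATE FLATTEN(TOKENIZE(f1)) AS word;
-- Group identical words together and count each group
C = GROUP B BY word;
D = FOREACH C GENERATE group AS word, COUNT(B) AS occurrences;
DUMP D;
```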
2:Load/Store Functions
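Rules for handling compression
PigStorage and TextLoader support gzip and bzip.
BinStorage does not support compression.
Compressed files are recognized by a file name ending in .gz or .bz.
A = load 'myinput.gz';
store A into 'myoutput.gz';
A = load 'myinput.bz';
store A into 'myoutput.bz';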
BinStorage
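Loads and stores data in a machine-readable format. It does not support compression, and it accepts files, directories and globs as input.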
a = load 'b.txt' as (id, f);
b = group a by id;
store b into 'g' using BinStorage();
a = load 'g/part*' using BinStorage() as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;
Data loaded this way can fail on type casts, so it is better to have the load function convert it through a converter:
a = load 'g/part*' using BinStorage('Utf8StorageConverter') as (id,d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;
JsonLoader,JsonStorage
Load and store JSON data.
JsonLoader( ['schema'] )
JsonStorage( )
a = load 'a.json' using JsonLoader('a0:int,a1:{(a10:int,a11:chararray)},a2:(a20:double,a21:bytearray),a3:[chararray]');
a = load 'a.json' using JsonLoader();
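The second form above, with no schema, only works if Pig can find the schema elsewhere. As far as I know, JsonStorage writes a .pig_schema file next to the data and JsonLoader() picks it up. A round trip might look like this; the file names and the simplified schema are assumptions:

```pig
-- Load with an explicit schema (a simplified, assumed schema)
a = LOAD 'a.json' USING JsonLoader('a0:int,a1:chararray');
-- Store as JSON; the schema is expected to be written alongside the data
STORE a INTO 'out.json' USING JsonStorage();
-- Load again without a schema, relying on the stored .pig_schema
b = LOAD 'out.json' USING JsonLoader();
DESCRIBE b;
```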
PigDump
Stores data in UTF-8 format.
PigDump()
STORE X INTO 'output' USING PigDump();
PigStorage
Loads and stores data as structured text files.
PigStorage( [field_delimiter] , ['options'] )
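On LOAD, PigStorage expects fields to be separated by a field delimiter: the tab character ('\t') by default, or another character you specify.
On STORE, it writes fields separated by the field delimiter (tab or another character) and ends each record with the record delimiter ('\n').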
A = LOAD 'student' USING PigStorage('\t') AS (name: chararray, age:int,gpa: float);
A = LOAD 'student' AS (name: chararray, age:int, gpa: float);
STORE X INTO 'output' USING PigStorage('*');
a = load '1.txt' as (a0:{t:(m:map[int],d:double)});
{([foo#1,bar#2],34.0),([white#3,yellow#4],45.0)} : valid
{([foo#badint],baddouble)} : conversion fail for badint/baddouble, get
{([foo#],)}
{} : valid, empty bag
TextLoader
Loads unstructured data in UTF-8 format.
TextLoader()
Supports compression. Cannot be used to store data.
A = LOAD 'data' USING TextLoader();
3:Math Functions
The following math functions are supported:
ABS
ACOS
ASIN
ATAN
CBRT
CEIL
Returns the nearest integer that is greater than or equal to the value of the expression.
x CEIL(x)
4.6 5
3.5 4
2.4 3
1.0 1
-1.0 -1
-2.4 -2
-3.5 -3
-4.6 -4
COS
COSH
EXP
FLOOR
Returns the nearest integer that is less than or equal to the value of the expression, the opposite of CEIL.
x FLOOR(x)
4.6 4
3.5 3
2.4 2
1.0 1
-1.0 -1
-2.4 -3
-3.5 -4
-4.6 -5
LOG
LOG10
RANDOM
ROUND
Rounds the expression to the nearest integer (round half up).
x ROUND(x)
4.6 5
3.5 4
2.4 2
1.0 1
-1.0 -1
SIN
SINH
SQRT
TAN
TANH
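To compare the three rounding functions on your own data, they can be applied in a single FOREACH. A minimal sketch, in which the 'numbers' file and the alias names are assumptions:

```pig
-- 'numbers' is an assumed input file with one double per line
A = LOAD 'numbers' AS (x:double);
-- Apply all three rounding functions to the same value
B = FOREACH A GENERATE x, CEIL(x) AS up, FLOOR(x) AS down, ROUND(x) AS nearest;
DUMP B;
```

Feeding it the values from the tables above (4.6, 3.5, 2.4 and so on) is a quick way to check them against your Pig version.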
4:String Functions
String functions are, naturally, indispensable.
INDEXOF
LAST_INDEX_OF
LCFIRST
LOWER
REGEX_EXTRACT
Extracts the part of a string that matches a group in a regular expression.
REGEX_EXTRACT (string, regex, index)
Example: extract the IP address (returns '192.168.1.5')
REGEX_EXTRACT('192.168.1.5:8020', '(.*)\:(.*)', 1);
REGEX_EXTRACT_ALL
REPLACE
STRSPLIT
SUBSTRING
TRIM
UCFIRST
UPPER
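Several of these functions can be combined in one FOREACH. The sketch below splits host:port strings like the one in the REGEX_EXTRACT example; the 'hosts' input file and all alias names are assumptions:

```pig
-- 'hosts' is an assumed input file with one "ip:port" string per line
A = LOAD 'hosts' AS (addr:chararray);
B = FOREACH A GENERATE
      REGEX_EXTRACT(addr, '(.*):(.*)', 1) AS host,  -- text before the colon
      REGEX_EXTRACT(addr, '(.*):(.*)', 2) AS port,  -- text after the colon
      LOWER(TRIM(addr)) AS normalized,              -- strip whitespace, lower-case
      INDEXOF(addr, ':', 0) AS colon_pos;           -- index of the first colon
DUMP B;
```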
5:Tuple,Bag,Map Functions
TOTUPLE
Converts one or more expressions to a tuple.
TOTUPLE(expression [, expression ...])
a = LOAD 'student' AS (f1:chararray, f2:int, f3:float);
DUMP a;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
b = FOREACH a GENERATE TOTUPLE(f1,f2,f3);
DUMP b;
((John,18,4.0))
((Mary,19,3.8))
((Bill,20,3.9))
((Joe,18,3.8))
TOBAG
Converts one or more expressions to a bag.
TOBAG(expression [, expression ...])
a = LOAD 'student' AS (f1:chararray, f2:int, f3:float);
DUMP a;
(John,18,4.0)
(Mary,19,3.8)
(Bill,20,3.9)
(Joe,18,3.8)
b = FOREACH a GENERATE TOBAG(f1,f3);
DUMP b;
({(John),(4.0)})
({(Mary),(3.8)})
({(Bill),(3.9)})
({(Joe),(3.8)})
TOMAP
Converts key/value expression pairs into a map.
TOMAP(key-expression, value-expression [, key-expression, value-expression ...])
A = load 'students' as (name:chararray, age:int, gpa:float);
B = foreach A generate TOMAP(name, gpa);
store B into 'results';
Input (students)
joe smith 20 3.5
amy chen 22 3.2
leo allen 18 2.1
Output (results)
[joe smith#3.5]
[amy chen#3.2]
[leo allen#2.1]
TOP
Returns the top n tuples from a bag of tuples.
TOP(topN,column,relation)
A = LOAD 'data' as (first: chararray, second: chararray);
B = GROUP A BY (first, second);
C = FOREACH B GENERATE FLATTEN(group), COUNT(A) AS count;
D = GROUP C BY first; -- group again, this time by first only
topResults = FOREACH D {
    -- retain the top 10 occurrences of 'second' within each 'first'
    result = TOP(10, 2, C);
    GENERATE FLATTEN(result);
}