Introduction to Pig Latin

Pig Latin is a data flow language: each processing step produces a new data set, or relation.

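The following statements are legal:

A = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
A = filter A by dividends > 0;
A = foreach A generate UPPER(symbol);

However, reusing the same alias for every step is not good practice, because it makes the script harder to read and debug.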

1. The load command

Reading a file from the local file system or HDFS is the simplest case. By default, load uses PigStorage:

divs = load '/data/examples/NYSE_dividends';

divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);

To read data from HBase, use the HBaseStorage function:

divs = load 'NYSE_dividends' using HBaseStorage();

PigStorage also accepts a field delimiter, and the schema can declare types:

a = load 'A' using PigStorage(',') as (a1:int, a2:int, a3:int);

2. The store command

store processed into '/data/examples/processed';

store processed into 'processed' using
      HBaseStorage();


3. The dump command

Prints the contents of a relation to the screen, for example: dump processed;

4. Relational operators

foreach

foreach applies a set of expressions to every record and sends the results on to the next operator.

For example, the following code loads an entire record, but then removes all but the user and id fields from each record:

A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray,
      preferences:map[]);
B = foreach A generate user, id;

The .. operator selects a range of fields:

prices    = load 'NYSE_daily' as (exchange, symbol, date, open,
                high, low, close, volume, adj_close);
beginning = foreach prices generate ..open; -- produces exchange, symbol, date, open
middle    = foreach prices generate open..close; -- produces open, high, low, close
end       = foreach prices generate volume..; -- produces volume, adj_close
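
As a rough illustration of what the .. ranges select, the following Python sketch models the schema as a list of field names and a range as an inclusive slice. The helper field_range is hypothetical and written only for this illustration; it is not part of Pig.

```python
# Field names of the NYSE_daily relation, in schema order.
FIELDS = ["exchange", "symbol", "date", "open", "high",
          "low", "close", "volume", "adj_close"]


def field_range(fields, start=None, end=None):
    """Return the inclusive range start..end; a missing bound is open-ended."""
    first = 0 if start is None else fields.index(start)
    last = len(fields) - 1 if end is None else fields.index(end)
    return fields[first:last + 1]


beginning = field_range(FIELDS, end="open")              # models ..open
middle = field_range(FIELDS, start="open", end="close")  # models open..close
ending = field_range(FIELDS, start="volume")             # models volume..
```

Both bounds are inclusive, which is why open..close includes close itself.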

Simple arithmetic expressions are also allowed, with fields referenced by name or by position:

prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
             volume, adj_close);
gain   = foreach prices generate close - open;
gain2  = foreach prices generate $6 - $3;

Conditional test (the ?: operator)

2 == 2 ? 1 : 4 --returns 1 
2 == 3 ? 1 : 4 --returns 4 
null == 2 ? 1 : 4 -- returns null
2 == 2 ? 1 : 'fred' -- type error; both values must be of the same type
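
The null case is the one that tends to surprise people. A minimal Python sketch of the semantics listed above, with None standing in for Pig's null; pig_eq and bincond are hypothetical helper names used only for this illustration:

```python
def pig_eq(left, right):
    """Model Pig's ==: comparing anything with null yields null (None)."""
    if left is None or right is None:
        return None
    return left == right


def bincond(condition, if_true, if_false):
    """Model Pig's `condition ? if_true : if_false`."""
    # Pig rejects mismatched branch types when it checks the script;
    # this sketch raises at call time instead.
    if type(if_true) is not type(if_false):
        raise TypeError("both values must be of the same type")
    if condition is None:  # a null condition produces null, not the else branch
        return None
    return if_true if condition else if_false


first = bincond(pig_eq(2, 2), 1, 4)      # 1
second = bincond(pig_eq(2, 3), 1, 4)     # 4
third = bincond(pig_eq(None, 2), 1, 4)   # None
```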

Extracting data from complex types (map, tuple, bag)

Use # to look up a key in a map, and . to project fields from a tuple or a bag.

bball = load 'baseball' as (name:chararray, team:chararray,
          position:bag{t:(p:chararray)}, bat:map[]);
avg = foreach bball generate bat#'batting_average';


A = load 'input' as (t:tuple(x:int, y:int));
B = foreach A generate t.x, t.$1;


A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.(x, y);

User Defined Functions (UDFs) 

...

Naming fields

A field produced by an expression has no name unless one is assigned with as:

divs     = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                date:chararray, dividends:float);
in_cents = foreach divs generate dividends * 100.0 as dividend, dividends * 100.0; 
describe in_cents;
in_cents: {dividend: double,double}

Filter

filter selects which records are kept.


divs        = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                  date:chararray, dividends:float);
startswithcm = filter divs by symbol matches 'CM.*';
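
Pig's matches operator uses Java regular expressions and has to match the whole string, not just a part of it. The Python sketch below approximates that with re.fullmatch; the symbol list is made-up sample data, and the matches helper is hypothetical.

```python
import re


def matches(value, pattern):
    """Approximate Pig's `matches`: the pattern must cover the entire value."""
    if value is None:
        return None  # a null field gives a null result
    return re.fullmatch(pattern, value) is not None


# Made-up symbols for illustration only.
symbols = ["CMA", "CM", "ACM", "XCMB", None]

# filter keeps a record only when the condition is true, so both false
# and null results are dropped.
startswithcm = [s for s in symbols if matches(s, "CM.*")]
```

ACM and XCMB contain CM but do not start with it, so they are filtered out; a pattern such as '.*CM.*' would be needed to keep them.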

Group

group collects all records with the same value for the provided key together into a bag.

daily = load 'NYSE_daily' as (exchange, stock);
grpd  = group daily by stock;
store grpd into 'by_group';

The schema of grpd, as reported by describe grpd:

grpd: {group: bytearray,daily: {exchange: bytearray,stock: bytearray}}


daily = load 'NYSE_daily' as (exchange, stock, date, dividends);
grpd  = group daily by (exchange, stock);
avg   = foreach grpd generate group, AVG(daily.dividends);
describe grpd;
grpd: {group: (exchange: bytearray,stock: bytearray),daily: {exchange: bytearray,
    stock: bytearray,date: bytearray,dividends: bytearray}}
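
To make the shape of a grouped relation concrete, here is a small Python sketch that models group ... by (exchange, stock) followed by AVG. The records are made-up sample data and group_by and avg are hypothetical helpers. For simplicity the key is always a tuple here, whereas Pig uses a bare value when grouping on a single field.

```python
from collections import OrderedDict


def group_by(records, key_fields):
    """Collect records that share the same key into one 'bag' (a list)."""
    groups = OrderedDict()
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups.setdefault(key, []).append(record)
    return groups


def avg(values):
    """Average that skips nulls (None), as Pig's AVG does."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Made-up records for illustration only.
daily = [
    {"exchange": "NYSE", "stock": "CMA", "dividends": 0.5},
    {"exchange": "NYSE", "stock": "CMA", "dividends": 1.5},
    {"exchange": "NYSE", "stock": "CVX", "dividends": 2.0},
    {"exchange": "NYSE", "stock": "CVX", "dividends": None},
]

grpd = group_by(daily, ["exchange", "stock"])
averages = [(key, avg([r["dividends"] for r in bag]))
            for key, bag in grpd.items()]
```

Each entry of grpd pairs a key with the full original records, which mirrors the group and daily fields in the schema above.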

Order by

daily          = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
                    date:chararray, open:float, high:float, low:float,
                    close:float, volume:int, adj_close:float);
bydatensymbol  = order daily by date, symbol;

Distinct

distinct removes duplicate records.

daily   = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
uniq    = distinct daily;

JOIN

daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
            volume, adj_close);
divs  = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd   = join daily by symbol, divs by symbol;


daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
            volume, adj_close);
divs  = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd   = join daily by (symbol, date), divs by (symbol, date);
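
The following Python sketch models what an inner join on (symbol, date) produces. It is a simplified hash join over made-up records, not Pig's actual implementation, and inner_join is a hypothetical helper. It reflects two behaviours of Pig joins: records whose key contains a null never match, and output fields are prefixed with the alias of the relation they came from.

```python
from collections import defaultdict


def inner_join(left, right, keys, left_alias, right_alias):
    """Join two lists of dict records on the given key fields."""
    index = defaultdict(list)
    for row in right:
        key = tuple(row[k] for k in keys)
        if None not in key:  # null keys never match anything
            index[key].append(row)

    joined = []
    for row in left:
        key = tuple(row[k] for k in keys)
        for match in index.get(key, []):
            out = {}
            for name, value in row.items():
                out[left_alias + "::" + name] = value
            for name, value in match.items():
                out[right_alias + "::" + name] = value
            joined.append(out)
    return joined


# Made-up records for illustration only.
daily = [
    {"symbol": "CMA", "date": "2009-01-02", "close": 20.0},
    {"symbol": "CVX", "date": "2009-01-02", "close": 70.0},
    {"symbol": None, "date": "2009-01-02", "close": 1.0},
]
divs = [
    {"symbol": "CMA", "date": "2009-01-02", "dividends": 0.5},
    {"symbol": "CMA", "date": "2009-01-05", "dividends": 0.4},
    {"symbol": None, "date": "2009-01-02", "dividends": 0.1},
]

jnd = inner_join(daily, divs, ["symbol", "date"], "daily", "divs")
```

Only the CMA record for 2009-01-02 appears in both inputs with a non-null key, so jnd holds a single record.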

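Left, right, and full outer joins are also supported, for example: jnd = join daily by (symbol, date) left outer, divs by (symbol, date);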

Limit

divs    = load 'NYSE_dividends';
first10 = limit divs 10;

Sample

divs = load 'NYSE_dividends';
some = sample divs 0.1;

This is equivalent to: filter divs by RANDOM() <= 0.1;
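
Because sampling is just a random filter, the number of records returned is only approximately the requested fraction. A minimal Python sketch of the idea; the sample helper and its rng parameter are hypothetical and exist only for this illustration:

```python
import random


def sample(records, fraction, rng=None):
    """Keep each record independently with probability `fraction`."""
    rng = rng or random.Random()
    return [record for record in records if rng.random() <= fraction]


records = list(range(1000))

# A fixed seed makes the sketch repeatable; Pig itself does not take a seed here.
some = sample(records, 0.1, random.Random(42))
everything = sample(records, 1.0)
```

Running it with different seeds gives result sizes that vary around 100 rather than exactly 100.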

Parallel

daily   = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
            volume, adj_close);
bysymbl = group daily by symbol parallel 10;


The parallel clause runs this group with 10 reducers.

User Defined Functions

-- assumes piggybank.jar has already been made available with the register command
divs      = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
                date:chararray, dividends:float);
backwards = foreach divs generate
                org.apache.pig.piggybank.evaluation.string.Reverse(symbol);


register 'production.py' using jython as bballudfs;
players  = load 'baseball' as (name:chararray, team:chararray,
                pos:bag{t:(p:chararray)}, bat:map[]);
nonnull  = filter players by bat#'slugging_percentage' is not null and
                bat#'on_base_percentage' is not null;
calcprod = foreach nonnull generate name, bballudfs.production(
                (float)bat#'slugging_percentage',
                (float)bat#'on_base_percentage');
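
The post does not show production.py itself. A plausible sketch is given below, assuming that production is simply the sum of slugging percentage and on-base percentage; that formula is an assumption, not something stated above. When Pig runs the script through Jython it supplies the outputSchema decorator, so the fallback definition exists only to let the file run under a plain Python interpreter.

```python
# production.py (hypothetical contents)

try:
    outputSchema  # provided by Pig when the script is registered using jython
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so the file also loads outside Pig."""
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("production:float")
def production(slugging_pct, onbase_pct):
    # Assumed formula: slugging percentage plus on-base percentage.
    return slugging_pct + onbase_pct
```

The null filter in the Pig script matters here: without it, a missing map key would reach the function as None and the addition would fail.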

Calling static Java functions

For example, to use Java's Integer class to translate decimal values to hexadecimal values, you could do:

define hex InvokeForString('java.lang.Integer.toHexString', 'int');
divs  = load 'NYSE_daily' as (exchange, symbol, date, open, high, low,
            close, volume, adj_close);
nonnull = filter divs by volume is not null;
inhex = foreach nonnull generate symbol, hex((int)volume);
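
For readers unfamiliar with Integer.toHexString, the Python sketch below mimics what it returns: a lowercase, unpadded hexadecimal string, with negative values shown in 32-bit two's complement. The function name to_hex_string is hypothetical.

```python
def to_hex_string(value):
    """Mimic java.lang.Integer.toHexString for a 32-bit signed int."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in a 32-bit signed int")
    # Masking maps negative numbers onto their two's complement bit pattern.
    return format(value & 0xFFFFFFFF, "x")


examples = [to_hex_string(v) for v in (0, 255, 4096, -1)]
```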


Reposted from: http://blog.csdn.net/chendaya/article/details/8559309