Pig Latin is a dataflow language: each step produces a new data set, or relation.
The following statements are legal:
A = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
A = filter A by dividends > 0;
A = foreach A generate UPPER(symbol);
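However, reusing the alias A like this is not good practice: each statement creates a new relation named A, and the earlier ones can no longer be referenced.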
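1、The load command
Reading a file from the local filesystem or HDFS is the simplest case; by default load uses the PigStorage function, which reads tab-delimited text.
divs = load '/data/examples/NYSE_dividends';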
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
To read data from HBase, use the HBaseStorage load function:
divs = load 'NYSE_dividends' using HBaseStorage();
PigStorage takes the field delimiter as an argument, and the as clause declares the schema:
a = load 'A' using PigStorage(',') as (a1:int, a2:int, a3:int);
2、The store command
store processed into '/data/examples/processed';
store processed into 'processed' using
HBaseStorage();
3、The dump command
Prints the contents of a relation to the screen.
4、Operators
foreach
foreach applies a set of expressions to every record and sends the resulting records on to the next operator.
For example, the following code loads an entire record, but then removes all but the user and id fields from each record:
A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray,
preferences:map[]);
B = foreach A generate user, id;
The .. operator: ranges of fields
prices = load 'NYSE_daily' as (exchange, symbol, date, open,
high, low, close, volume, adj_close);
beginning = foreach prices generate ..open; -- produces exchange, symbol, date, open
middle = foreach prices generate open..close; -- produces open, high, low, close
end = foreach prices generate volume..; -- produces volume, adj_close
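The range rules above (both ends inclusive, an omitted end meaning "from the first field" or "to the last field") can be modelled in a few lines of Python. This is only a sketch of the semantics, not how Pig implements them; the helper name field_range is made up for illustration.

```python
# Field names of the prices relation, in schema order (from the load statement above).
FIELDS = ["exchange", "symbol", "date", "open", "high",
          "low", "close", "volume", "adj_close"]

def field_range(start=None, end=None, fields=FIELDS):
    """Return the fields selected by start..end, both ends inclusive.

    start=None models '..end'; end=None models 'start..'.
    field_range is a hypothetical helper, not part of Pig.
    """
    lo = 0 if start is None else fields.index(start)
    hi = len(fields) - 1 if end is None else fields.index(end)
    return fields[lo:hi + 1]

beginning = field_range(end="open")       # ..open
middle = field_range("open", "close")     # open..close
ending = field_range(start="volume")      # volume..
```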
Simple expressions, such as arithmetic, can also be used:
prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
gain = foreach prices generate close - open;
gain2 = foreach prices generate $6 - $3;
Conditional test (bincond: condition ? value_if_true : value_if_false)
2 == 2 ? 1 : 4 --returns 1
2 == 3 ? 1 : 4 --returns 4
null == 2 ? 1 : 4 -- returns null
2 == 2 ? 1 : 'fred' -- type error; both values must be of the same type
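The four cases above can be reproduced with a small Python model, using None in place of Pig's null. The function names bincond and eq are made up for this sketch. One difference to keep in mind: Pig rejects the mismatched types when it checks the script, while this model raises the error at call time.

```python
def eq(a, b):
    """Null-aware equality: comparing anything with null gives null."""
    if a is None or b is None:
        return None
    return a == b

def bincond(condition, if_true, if_false):
    """Model of `condition ? if_true : if_false` (hypothetical helper)."""
    # Both result values must have the same type, whatever the condition is.
    if type(if_true) is not type(if_false):
        raise TypeError("both values must be of the same type")
    # A null condition gives a null result instead of picking a branch.
    if condition is None:
        return None
    return if_true if condition else if_false

first = bincond(eq(2, 2), 1, 4)
second = bincond(eq(2, 3), 1, 4)
third = bincond(eq(None, 2), 1, 4)
```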
Extracting data from complex types (map, tuple, bag)
Use # for a map, and . for a tuple or a bag.
bball = load 'baseball' as (name:chararray, team:chararray,
position:bag{t:(p:chararray)}, bat:map[]);
avg = foreach bball generate bat#'batting_average';
A = load 'input' as (t:tuple(x:int, y:int));
B = foreach A generate t.x, t.$1;
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.(x, y);
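As a rough analogy, a Pig map behaves like a Python dict, a tuple like a Python tuple, and a bag like a list of tuples. The sketch below uses made-up values to show what each projection returns. Note that projecting a bag yields another bag, not a scalar.

```python
# One record's worth of made-up values.
bat = {"batting_average": 0.3, "games": 150}   # a Pig map
t = (3, 4)                                      # a Pig tuple with fields (x, y)
b = [(1, 2), (5, 6)]                            # a Pig bag of (x, y) tuples

avg = bat.get("batting_average")   # bat#'batting_average'
missing = bat.get("home_runs")     # an absent key yields null (None)
tx, ty = t[0], t[1]                # t.x and t.$1
bx = [(x,) for (x, y) in b]        # b.x: still a bag, now of one-field tuples
bxy = [(x, y) for (x, y) in b]     # b.(x, y)
```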
User Defined Functions (UDFs)
。。
Naming fields: a field produced by an expression has no name unless one is assigned with as
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
in_cents = foreach divs generate dividends * 100.0 as dividend, dividends * 100.0;
describe in_cents;
in_cents: {dividend: double,double}
Filter
Selects which records to keep.
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
startswithcm = filter divs by symbol matches 'CM.*';
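One detail worth spelling out: matches tests the pattern against the whole string, which is why the example needs the trailing .* to express "starts with CM". The Python sketch below models that with re.fullmatch on a made-up list of symbols. Pig uses Java regular expressions, so this is only an approximation for simple patterns like this one.

```python
import re

def matches(value, pattern):
    """Model of Pig's `value matches pattern`: the whole string must match."""
    if value is None:
        return None  # null in, null out; filter drops such records
    return re.fullmatch(pattern, value) is not None

symbols = ["CME", "CMO", "ACM", "IBM", None]   # made-up data

startswithcm = [s for s in symbols if matches(s, "CM.*")]
exactly_cm = [s for s in symbols if matches(s, "CM")]   # no .* means nothing but "CM" passes
```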
Group
Collects all records with the same value for the provided key together into a bag.
daily = load 'NYSE_daily' as (exchange, stock);
grpd = group daily by stock;
store grpd into 'by_group';
describe grpd;
grpd: {group: bytearray,daily: {exchange: bytearray,stock: bytearray}}
daily = load 'NYSE_daily' as (exchange, stock, date, dividends);
grpd = group daily by (exchange, stock);
avg = foreach grpd generate group, AVG(daily.dividends);
describe grpd;
grpd: {group: (exchange: bytearray,stock: bytearray),daily: {exchange: bytearray,
stock: bytearray,date: bytearray,dividends: bytearray}}
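To make the shape of grpd concrete, here is a Python sketch of grouping by (exchange, stock) followed by the AVG step. The data is made up, and group_by is a hypothetical helper. Pig's AVG also skips null values, which this sketch does not model.

```python
from collections import defaultdict

# Made-up records: (exchange, stock, date, dividends)
daily = [
    ("NYSE", "CME", "2009-01-02", 0.5),
    ("NYSE", "CME", "2009-04-01", 1.5),
    ("NYSE", "IBM", "2009-02-06", 0.4),
]

def group_by(records, key):
    """Return (group, bag) pairs: one bag of whole records per key value."""
    bags = defaultdict(list)
    for rec in records:
        bags[key(rec)].append(rec)
    return list(bags.items())

# grpd = group daily by (exchange, stock);
grpd = group_by(daily, key=lambda r: (r[0], r[1]))

# avg = foreach grpd generate group, AVG(daily.dividends);
avg = [(group, sum(r[3] for r in bag) / len(bag)) for group, bag in grpd]
```

Each element of grpd pairs the key with the full original records, which matches the describe output above: the bag keeps all four fields, including the ones used as the key.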
Order by
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float,
close:float, volume:int, adj_close:float);
bydatensymbol = order daily by date, symbol;
Distinct
Removes duplicate records.
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
uniq = distinct daily;
JOIN
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by symbol, divs by symbol;
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date), divs by (symbol, date);
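Left and right outer joins: adding left outer or right outer after the first relation's key keeps that side's unmatched records, with nulls filled in for the other side.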
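The difference between an inner join and a left outer join can be sketched in Python as follows. The data is made up and the join helper is hypothetical; it joins on a single key and treats None as null, following Pig's rule that null keys never match.

```python
def join(left, right, left_key, right_key, left_outer=False, right_width=0):
    """Model of a join on one key. Records are tuples; None stands for null."""
    index = {}
    for rrec in right:
        key = right_key(rrec)
        if key is not None:            # null keys never match anything
            index.setdefault(key, []).append(rrec)
    out = []
    for lrec in left:
        matched = index.get(left_key(lrec), [])
        for rrec in matched:
            out.append(lrec + rrec)    # output record = left fields + right fields
        if not matched and left_outer:
            out.append(lrec + (None,) * right_width)   # pad right side with nulls
    return out

# Made-up records: (exchange, symbol, date, value)
daily = [("NYSE", "CME", "2009-01-02", 10.0), ("NYSE", "IBM", "2009-01-02", 20.0)]
divs = [("NYSE", "CME", "2009-01-02", 0.5)]

def by_symbol(rec):
    return rec[1]

inner = join(daily, divs, by_symbol, by_symbol)
left = join(daily, divs, by_symbol, by_symbol, left_outer=True, right_width=4)
```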
Limit
divs = load 'NYSE_dividends';
first10 = limit divs 10;
Sample
divs = load 'NYSE_dividends';
some = sample divs 0.1;
This is equivalent to: some = filter divs by RANDOM() <= 0.1;
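Because sampling is just a random filter, the result size is only approximately 10% of the input and changes from run to run. A minimal Python sketch of the same idea, with a made-up input and a fixed seed used only to make the run repeatable:

```python
import random

def sample(records, fraction, rng=random):
    """Keep each record independently with probability `fraction`."""
    return [rec for rec in records if rng.random() <= fraction]

divs = list(range(10000))          # stand-in for the rows of NYSE_dividends
rng = random.Random(42)            # seeded generator, an assumption for repeatability
some = sample(divs, 0.1, rng)      # roughly, not exactly, 1000 rows
```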
Parallel
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
volume, adj_close);
bysymbl = group daily by symbol parallel 10;
This runs the group with 10 reducers.
User-defined functions
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
backwards = foreach divs generate
org.apache.pig.piggybank.evaluation.string.Reverse(symbol);
register 'production.py' using jython as bballudfs;
players = load 'baseball' as (name:chararray, team:chararray,
pos:bag{t:(p:chararray)}, bat:map[]);
nonnull = filter players by bat#'slugging_percentage' is not null and
bat#'on_base_percentage' is not null;
calcprod = foreach nonnull generate name, bballudfs.production(
(float)bat#'slugging_percentage',
(float)bat#'on_base_percentage');
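The post does not show production.py itself. A minimal sketch of what such a file might contain is below, assuming the metric is simply slugging percentage plus on-base percentage. Pig makes an outputSchema decorator available to scripts registered with using jython; the fallback definition here is an addition so the file can also be run and tested under plain Python.

```python
# production.py: hypothetical contents, not taken from the original post.
try:
    outputSchema  # supplied by Pig when registered with `using jython`
except NameError:
    def outputSchema(schema):
        """Stand-in for Pig's decorator so the file runs outside Pig."""
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator

@outputSchema("production:float")
def production(slugging_pct, onbase_pct):
    """Assumed metric: slugging percentage plus on-base percentage."""
    return slugging_pct + onbase_pct
```

The schema string tells Pig the name and type of the returned field. The filter on nulls in the script above matters because the addition would fail on None inputs.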
Calling static Java functions
For example, to use Java's Integer class to convert decimal values to hexadecimal:
define hex InvokeForString('java.lang.Integer.toHexString', 'int');
divs = load 'NYSE_daily' as (exchange, symbol, date, open, high, low,
close, volume, adj_close);
nonnull = filter divs by volume is not null;
inhex = foreach nonnull generate symbol, hex((int)volume);
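For reference, Integer.toHexString formats its argument as an unsigned 32-bit value in lowercase hex with no prefix. The Python sketch below mimics that behaviour so the expected contents of inhex are easier to picture. The function name to_hex_string is made up.

```python
def to_hex_string(value):
    """Mimic java.lang.Integer.toHexString for a 32-bit int."""
    # Masking to 32 bits reproduces Java's unsigned view of negative ints.
    return format(value & 0xFFFFFFFF, "x")

volume_hex = to_hex_string(255)
negative_hex = to_hex_string(-1)
```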
Reposted from: http://blog.csdn.net/chendaya/article/details/8559309