- flatten
players = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]);
pos = foreach players generate name, flatten(position) as position;
bypos = group pos by position;
Jorge Posada,New York Yankees,{(Catcher),(Designated_hitter)},...
==>
Jorge Posada,Catcher
Jorge Posada,Designated_hitter
Note: a foreach with a flatten produces a cross product of every record in the bag with all of the other expressions in the generate statement. If more than one bag is flattened, the cross product is taken across the members of every flattened bag as well as the other expressions in the generate statement.
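For example, if players also had a second bag field (say, a hypothetical nicknames bag), flattening both in one generate would emit one record per (position, nickname) combination:
crossed = foreach players generate name, flatten(position), flatten(nicknames); -- nicknames is hypothetical here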
If the bag is empty, no records are produced for that input record. You can avoid losing such records by substituting a placeholder bag:
noempty = foreach players generate name,
    ((position is null or IsEmpty(position)) ? {('unknown')} : position) as position;
Flatten can also be applied to a tuple. In this case, it does not produce a cross product; instead, it elevates each field in the tuple to a top-level field.
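A minimal sketch of the tuple case, assuming a hypothetical input 'games' with a tuple-valued score field:
games = load 'games' as (name:chararray, score:tuple(home:int, away:int));
flat = foreach games generate name, flatten(score); -- yields three top-level fields: name, home, away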
- Nested foreach
daily = load 'NYSE_daily' as (exchange, symbol); -- not interested in other fields
grpd = group daily by exchange;
uniqcnt = foreach grpd {
    sym = daily.symbol;        -- project symbol out of the grouped bag
    uniq_sym = distinct sym;   -- distinct within each group
    generate group, COUNT(uniq_sym);
};
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
grpd = group divs by symbol;
top3 = foreach grpd {
    sorted = order divs by dividends desc;  -- sort each group's records by dividend
    top = limit sorted 3;                   -- keep the top three per symbol
    generate group, flatten(top);
};
Note: only distinct, filter, limit, and order are supported inside a nested foreach.
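A quick sketch of the nested filter form, reusing grpd from above (the 0.1f cutoff is arbitrary):
highcnt = foreach grpd {
    high = filter divs by dividends > 0.1f; -- nested filter runs per group
    generate group, COUNT(high);
};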
- fragment-replicate join
jnd = join daily by (exchange, symbol), divs by (exchange, symbol) using 'replicated';
Pig implements the fragment-replicate join by loading the replicated input into Hadoop’s distributed cache. All but the first relation will be loaded into memory.
Note: fragment-replicate join supports only inner and left outer joins.
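A sketch of the left outer variant, with daily and divs as loaded above:
jnd_left = join daily by (exchange, symbol) left outer, divs by (exchange, symbol) using 'replicated';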
- skew join
In many data sets, a few keys have three or more orders of magnitude more records than the other keys. The result is one or two reducers that take much longer than the rest.
Skew join works by first sampling one input of the join. In that sample it identifies any keys with so many records that it estimates they will not all fit into memory. Then, in a second MapReduce job, it does the join. For all records except those flagged in the sample, it does a standard join, collecting records with the same key onto the same reducer. The keys flagged as too large are treated differently: based on how many records were seen for a given key, those records are split across multiple reducers, with the number of reducers chosen from Pig’s estimate of how widely the data must be split so that each reducer can fit its share into memory. For the join input that is not split, the records for those keys are replicated to every reducer that holds a portion of the key.
users = load 'users' as (name:chararray, city:chararray);
cinfo = load 'cityinfo' as (city:chararray, population:int);
jnd = join cinfo by city, users by city using 'skewed';
Note: skew join can be done on inner or outer joins; however, it can take only two join inputs. Pig looks at the record sizes in the sample and assumes it can use 30% of the JVM’s heap to materialize records that will be joined. This fraction can be controlled via the pig.skewedjoin.reduce.memusage parameter.
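For example, to lower that fraction (0.1 here is just an illustrative value):
set pig.skewedjoin.reduce.memusage 0.1;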
- merge join
daily = load 'NYSE_daily_sorted' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
divs = load 'NYSE_dividends_sorted' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
jnd = join daily by symbol, divs by symbol using 'merge';
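Note: merge join requires both inputs to be already sorted on the join key (hence the _sorted files). Pig first builds an index over the second input by sampling it, then does the join in the map phase by seeking into that input, so no reduce phase is needed.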