- Relation and Field
Pig Latin is a dataflow language. Each processing step results in a new data set, or relation.
A = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
-- A is a relation; exchange, symbol, date, and dividends are its fields
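A minimal sketch of the dataflow idea, assuming the same NYSE_dividends file. The relation names divs, grpd, and avgs are illustrative only. Each statement takes the previous relation as input and yields a new one:

```pig
-- load the raw data into the relation divs
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
-- grouping produces a new relation, grpd, keyed by symbol
grpd = group divs by symbol;
-- projecting and aggregating produces another new relation, avgs
avgs = foreach grpd generate group, AVG(divs.dividends);
dump avgs;
```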
- Case Sensitivity
Keywords in Pig Latin are not case-sensitive; for example, LOAD is equivalent to load. Relation and field names, however, are case-sensitive, so A and a are different relations. UDF names are also case-sensitive, so COUNT is not the same UDF as count.
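The rules above can be sketched as follows. The file name and the relation names A, a, grpd, and cnt are assumptions chosen for illustration:

```pig
-- load and LOAD are the same keyword
A = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
a = LOAD 'NYSE_dividends' as (exchange, symbol, date, dividends);
-- A and a are two distinct relations because relation names are case-sensitive
grpd = group A by symbol;
-- COUNT is the built-in UDF; writing count here would not refer to it
cnt = foreach grpd generate group, COUNT(A);
```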
- Comments
A = load 'foo'; --this is a single-line comment
/*
* This is a multiline comment.
*/
B = load /* a comment in the middle */'bar';
- Load
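Built-in load functions include PigStorage, the default, which reads tab-delimited text unless given another delimiter, and TextLoader, which reads each line as a single field.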
divs = load '/data/examples/NYSE_dividends'; -- tab-delimited file
divs = load 'NYSE_dividends' using HBaseStorage(); -- load from HBase
divs = load 'NYSE_dividends' using PigStorage(','); -- comma-separated text data
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends); -- specify the schema
divs = load 'datadir'; -- read all files recursively in datadir
divs = load 'datadir/part-2012-*'; -- read multiple files in datadir
- Store
store processed into '/data/examples/processed';
store processed into 'processed' using HBaseStorage();
store processed into 'processed' using PigStorage(',');
Note: when writing to a filesystem, processed will be a directory containing part files rather than a single file. The number of part files depends on the parallelism of the last job before the store. If that job has reduces, the number is determined by the parallel level set for that job. If it is a map-only job, the number is determined by the number of maps, which is controlled by Hadoop and not by Pig.
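As a hedged sketch of the reduce case, the parallel clause on a reduce-side operator such as group sets the number of reducers for that job. The value 10, the relation names, and the output path below are assumptions for illustration:

```pig
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
-- group runs in the reduce phase; parallel 10 asks for 10 reducers
grpd = group divs by symbol parallel 10;
avgs = foreach grpd generate group, AVG(divs.dividends);
-- the store follows a job with 10 reducers, so the output directory
-- would be expected to contain 10 part files
store avgs into 'average_dividend';
```

Without a reduce-side operator before the store, the parallel clause has no effect on the number of part files, since the number of maps is decided by Hadoop.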
- Dump
dump processed; -- send the contents of processed to the console