Pig example: finding the yearly maximum temperature with Pig Latin (tested on NCDC weather data)

Start the Hadoop cluster: start-all.sh

Installing Pig is not covered here; plenty of guides are available online.

Start the Grunt shell

Use the help command to view the help information:

View the file system commands available in the Grunt shell:

grunt> fs 

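Prepare the test data (download the test data)

The example computes the yearly maximum temperature from NCDC weather data. The data is prepared as follows (to keep testing simple, each line contains only the year, the temperature, and the data quality code, separated by colons):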

hadoop@master:~$ head ncdc_data.txt
1953:122:5
1953:83:5
1953:44:5
1953:33:5
1953:50:5
1953:44:5
1953:39:5
1953:33:5
1953:33:5
1953:33:5
hadoop@master:~$ wc -l ncdc_data.txt
321146 ncdc_data.txt
hadoop@master:~$ 

Copy ncdc_data.txt into HDFS from the Grunt shell

grunt> copyFromLocal ncdc_data.txt ./  
grunt> ls
hdfs://master:8020/user/hadoop/11	
hdfs://master:8020/user/hadoop/22	
hdfs://master:8020/user/hadoop/32	
hdfs://master:8020/user/hadoop/hadooplog	
hdfs://master:8020/user/hadoop/ncdc_data.txt	3352528
hdfs://master:8020/user/hadoop/output	
hdfs://master:8020/user/hadoop/passwd	2019
grunt> 


Finding the yearly maximum temperature with Pig Latin

Load the weather data


grunt> A = LOAD 'ncdc_data.txt' USING PigStorage(':') AS (year:int, temp:int, quality:int);
Session:
grunt> A = LOAD 'ncdc_data.txt' USING PigStorage(':') AS (year:int, temp:int, quality:int);
2015-02-07 16:52:16,351 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-02-07 16:52:16,457 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt> describe A 
A: {year: int,temp: int,quality: int}
grunt> 
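To make the LOAD step concrete, here is a minimal Python sketch of what PigStorage(':') with the schema (year:int, temp:int, quality:int) does to one input line: split on the delimiter, then cast each field to an integer. The names parse_line and to_int are hypothetical helpers for illustration only, not part of Pig. The sketch assumes that a field which cannot be cast, or is missing, becomes null (None here), which is how Pig is generally documented to treat failed casts on load.

```python
from typing import Optional, Tuple

Record = Tuple[Optional[int], Optional[int], Optional[int]]


def to_int(field: str) -> Optional[int]:
    """Cast one text field to int; return None when it is not a valid integer."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_line(line: str, delimiter: str = ":") -> Record:
    """Split one input line into (year, temp, quality), padding missing fields."""
    fields = line.rstrip("\n").split(delimiter)
    # Pad short lines with empty strings so they become None after casting.
    year, temp, quality = (fields + ["", "", ""])[:3]
    return to_int(year), to_int(temp), to_int(quality)


print(parse_line("1953:122:5"))  # (1953, 122, 5)
```

The tuple mirrors the schema that describe A prints: three int fields named year, temp, and quality.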
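
Filter the data: drop records carrying the 9999 missing-value marker and keep only quality codes 0, 1, 4, 5, and 9

grunt> B = FILTER A BY temp != 9999 AND ((chararray)quality matches '[01459]');

Or, equivalently:
B = FILTER A BY temp != 9999 AND (
                   quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
Session: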
grunt> B = FILTER A BY temp != 9999 AND ((chararray)quality matches '[01459]');
grunt> describe B
B: {year: int,temp: int,quality: int}
grunt> 
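The two FILTER forms above can be compared with a small Python sketch. It relies on the fact that Pig's matches operator must match the whole string, which corresponds to re.fullmatch in Python. The function names and the sample pairs are made up for illustration, and null values are not modelled.

```python
import re

# Same character class as in the Pig statement.
QUALITY_PATTERN = re.compile(r"[01459]")


def keep_by_regex(temp: int, quality: int) -> bool:
    """Mirror of: temp != 9999 AND ((chararray)quality matches '[01459]')."""
    return temp != 9999 and QUALITY_PATTERN.fullmatch(str(quality)) is not None


def keep_by_comparison(temp: int, quality: int) -> bool:
    """Mirror of the explicit chain of quality == N comparisons joined by OR."""
    return temp != 9999 and quality in (0, 1, 4, 5, 9)


# Hypothetical (temp, quality) pairs, not taken from the real data set.
samples = [(122, 5), (9999, 1), (83, 2), (44, 0), (50, 15)]
kept = [pair for pair in samples if keep_by_regex(*pair)]
print(kept)  # [(122, 5), (44, 0)]
```

Note that a quality of 15 is rejected by both forms: the pattern has to match the entire string "15", not just one of its characters.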

Group the weather data by year


grunt> C = GROUP B BY year;
Session:
grunt> C = GROUP B BY year;
grunt> describe C          
C: {group: int,B: {(year: int,temp: int,quality: int)}}
grunt> 
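The schema printed by describe C, {group: int, B: {(year, temp, quality)}}, says that each output row holds a grouping key plus a bag of all the original tuples sharing that key. A rough Python analogue, with a dict standing in for the relation and a list standing in for the bag (group_by_year and the sample records are hypothetical):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[int, int, int]  # (year, temp, quality)


def group_by_year(records: List[Record]) -> Dict[int, List[Record]]:
    """Collect whole records into one list ("bag") per year."""
    groups: Dict[int, List[Record]] = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    return dict(groups)


# Illustrative records only.
records = [(1953, 122, 5), (1953, 83, 5), (1949, 367, 1)]
grouped = group_by_year(records)
for year, bag in sorted(grouped.items()):
    print(year, bag)
```

As in Pig, the year still appears inside every tuple of the bag; grouping does not strip the key from the records.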

Iterate over the groups and compute the maximum temperature for each year (group)


grunt> D = FOREACH C GENERATE group, MAX(B.temp) AS max_temp;
Session:
grunt> D = FOREACH C GENERATE group, MAX(B.temp) AS max_temp;
grunt> describe D                                            
D: {group: int,max_temp: int}
grunt> 
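In the FOREACH statement, B.temp projects the temp field out of every tuple in the bag, and MAX reduces that projection to a single value per group. A minimal Python sketch of the same idea follows; max_temp_per_year and the input dict are hypothetical, and the sort is only there to make the printed order predictable (Pig gives no ordering guarantee without ORDER BY).

```python
from typing import Dict, List, Tuple

Record = Tuple[int, int, int]  # (year, temp, quality)


def max_temp_per_year(grouped: Dict[int, List[Record]]) -> List[Tuple[int, int]]:
    """Return (group, max_temp) pairs, one per year."""
    return [
        (year, max(temp for _, temp, _ in bag))  # project temp, then take the max
        for year, bag in sorted(grouped.items())
    ]


# Illustrative grouped input only.
grouped = {
    1953: [(1953, 122, 5), (1953, 83, 5)],
    1949: [(1949, 367, 1)],
}
result = max_temp_per_year(grouped)
print(result)  # [(1949, 367), (1953, 122)]
```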

Print the result


grunt> DUMP D;
Session:
grunt> dump D;
2015-02-07 16:56:15,488 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
2015-02-07 16:56:15,538 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2015-02-07 16:56:15,650 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-02-07 16:56:15,652 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2015-02-07 16:56:15,655 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-02-07 16:56:15,655 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2015-02-07 16:56:15,691 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-02-07 16:56:15,728 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at master/10.15.43.214:8032
2015-02-07 16:56:15,755 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2015-02-07 16:56:15,756 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-02-07 16:56:15,757 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2015-02-07 16:56:15,757 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2015-02-07 16:56:15,758 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=3352528
2015-02-07 16:56:15,759 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2015-02-07 16:56:15,759 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2015-02-07 16:56:15,808 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job4573583805590959622.jar
2015-02-07 16:56:21,212 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job4573583805590959622.jar created
2015-02-07 16:56:21,358 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2015-02-07 16:56:21,374 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2015-02-07 16:56:21,374 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2015-02-07 16:56:21,374 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2015-02-07 16:56:21,534 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-02-07 16:56:21,536 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at master/10.15.43.214:8032
2015-02-07 16:56:22,728 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-02-07 16:56:22,728 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-02-07 16:56:22,758 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2015-02-07 16:56:23,029 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2015-02-07 16:56:23,263 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1423285333973_0005
2015-02-07 16:56:24,011 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1423285333973_0005
2015-02-07 16:56:25,174 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://master:8088/proxy/application_1423285333973_0005/
2015-02-07 16:56:25,174 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1423285333973_0005
2015-02-07 16:56:25,174 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases A,B,C,D
2015-02-07 16:56:25,174 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: A[1,4],A[-1,-1],B[2,4],D[4,4],C[3,4] C: D[4,4],C[3,4] R: D[4,4]
2015-02-07 16:56:25,589 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-02-07 16:56:25,589 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1423285333973_0005]
2015-02-07 16:56:47,391 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 24% complete
2015-02-07 16:56:47,392 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1423285333973_0005]
2015-02-07 16:56:49,394 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2015-02-07 16:56:49,395 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1423285333973_0005]
2015-02-07 16:56:59,908 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1423285333973_0005]
2015-02-07 16:57:05,580 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-02-07 16:57:05,581 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.4.0	0.13.0	hadoop	2015-02-07 16:56:15	2015-02-07 16:57:05	GROUP_BY,FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1423285333973_0005	1	1	11	11	11	11	9	9	9	9	A,B,C,D	GROUP_BY,COMBINER	hdfs://master:8020/tmp/temp624427084/tmp-368727104,

Input(s):
Successfully read 321146 records (3352891 bytes) from: "hdfs://master:8020/user/hadoop/ncdc_data.txt"

Output(s):
Successfully stored 43 records (430 bytes) in: "hdfs://master:8020/tmp/temp624427084/tmp-368727104"

Counters:
Total records written : 43
Total bytes written : 430
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1423285333973_0005


2015-02-07 16:57:05,604 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-02-07 16:57:05,623 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-02-07 16:57:05,645 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-02-07 16:57:05,645 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1901,317)
(1902,261)
(1903,278)
(1904,194)
(1905,278)
(1906,283)
(1907,300)
(1908,322)
(1909,350)
(1910,322)
(1911,322)
(1912,411)
(1913,361)
(1914,378)
(1915,411)
(1916,289)
(1917,478)
(1918,450)
(1919,428)
(1920,344)
(1921,417)
(1922,400)
(1923,394)
(1924,456)
(1925,322)
(1926,411)
(1928,161)
(1929,178)
(1930,311)
(1931,450)
(1932,322)
(1933,411)
(1934,300)
(1935,311)
(1936,389)
(1937,339)
(1938,411)
(1939,433)
(1940,433)
(1941,462)
(1942,278)
(1949,367)
(1953,400)
grunt> 
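When reading the numbers above, it helps to remember the units. Assuming this file follows the usual NCDC convention of recording air temperature in tenths of a degree Celsius (the post itself does not state the unit), a value such as 317 means 31.7 degrees. A small sketch of the conversion, with tenths_to_celsius as a hypothetical helper:

```python
def tenths_to_celsius(raw_temp: int) -> float:
    """Convert a temperature stored in tenths of a degree Celsius to degrees."""
    return raw_temp / 10.0


# A few (year, max_temp) pairs copied from the DUMP output above.
for year, raw in [(1901, 317), (1917, 478), (1953, 400)]:
    print(f"{year}: {tenths_to_celsius(raw):.1f} degrees C")
```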

Store the result to a file


grunt> STORE D INTO 'max_temp' USING PigStorage(':');
Session:
grunt> STORE D INTO 'max_temp' USING PigStorage(':');
2015-02-07 16:59:53,558 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-02-07 16:59:53,600 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,FILTER
2015-02-07 16:59:53,603 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2015-02-07 16:59:53,610 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-02-07 16:59:53,611 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2015-02-07 16:59:53,613 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-02-07 16:59:53,613 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2015-02-07 16:59:53,628 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-02-07 16:59:53,631 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at master/10.15.43.214:8032
2015-02-07 16:59:53,633 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2015-02-07 16:59:53,633 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-02-07 16:59:53,634 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Reduce phase detected, estimating # of required reducers.
2015-02-07 16:59:53,634 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2015-02-07 16:59:53,637 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=3352528
2015-02-07 16:59:53,637 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2015-02-07 16:59:53,637 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2015-02-07 16:59:53,655 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job6820753826732922606.jar
2015-02-07 16:59:57,504 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job6820753826732922606.jar created
2015-02-07 16:59:57,539 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2015-02-07 16:59:57,545 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2015-02-07 16:59:57,545 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2015-02-07 16:59:57,545 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2015-02-07 16:59:57,581 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-02-07 16:59:57,583 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at master/10.15.43.214:8032
2015-02-07 16:59:58,411 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-02-07 16:59:58,411 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2015-02-07 16:59:58,413 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2015-02-07 16:59:58,557 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2015-02-07 16:59:58,731 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1423285333973_0006
2015-02-07 16:59:58,797 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1423285333973_0006
2015-02-07 16:59:58,801 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://master:8088/proxy/application_1423285333973_0006/
2015-02-07 16:59:58,802 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1423285333973_0006
2015-02-07 16:59:58,802 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases A,B,C,D
2015-02-07 16:59:58,802 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: A[1,4],A[-1,-1],B[2,4],D[4,4],C[3,4] C: D[4,4],C[3,4] R: D[4,4]
2015-02-07 16:59:58,807 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2015-02-07 16:59:58,807 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1423285333973_0006]
2015-02-07 17:00:18,496 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2015-02-07 17:00:18,496 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1423285333973_0006]
2015-02-07 17:00:26,004 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1423285333973_0006]
2015-02-07 17:00:29,138 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-02-07 17:00:29,139 [main] INFO  org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics: 

HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
2.4.0	0.13.0	hadoop	2015-02-07 16:59:53	2015-02-07 17:00:29	GROUP_BY,FILTER

Success!

Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	MedianMapTime	MaxReduceTime	MinReduceTime	AvgReduceTime	MedianReducetime	Alias	Feature	Outputs
job_1423285333973_0006	1	1	10	10	10	10	4	4	4	4	A,B,C,D	GROUP_BY,COMBINER	hdfs://master:8020/user/hadoop/max_temp,

Input(s):
Successfully read 321146 records (3352891 bytes) from: "hdfs://master:8020/user/hadoop/ncdc_data.txt"

Output(s):
Successfully stored 43 records (387 bytes) in: "hdfs://master:8020/user/hadoop/max_temp"

Counters:
Total records written : 43
Total bytes written : 387
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1423285333973_0006


2015-02-07 17:00:29,162 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
grunt> 
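The "387 bytes" reported in the log can be checked by hand. With PigStorage(':'), each record is written as its fields joined by a colon and ended by a newline. Every result in this run has a four-digit year and a three-digit temperature, so each line should be 9 bytes, and 43 records give 387 bytes. A short sketch of that arithmetic (format_record is a hypothetical helper that imitates the text layout, not Pig's actual writer):

```python
def format_record(year: int, max_temp: int, delimiter: str = ":") -> str:
    """Render one (group, max_temp) tuple the way a delimited text line looks."""
    return f"{year}{delimiter}{max_temp}\n"


line = format_record(1901, 317)
bytes_per_line = len(line.encode("utf-8"))
record_count = 43  # "Total records written" in the log above

print(repr(line))                      # '1901:317\n'
print(bytes_per_line * record_count)   # 387
```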

View the result


grunt> cat max_temp
Session:

grunt> cat max_temp
1901:317
1902:261
1903:278
1904:194
1905:278
1906:283
1907:300
1908:322
1909:350
1910:322
1911:322
1912:411
1913:361
1914:378
1915:411
1916:289
1917:478
1918:450
1919:428
1920:344
1921:417
1922:400
1923:394
1924:456
1925:322
1926:411
1928:161
1929:178
1930:311
1931:450
1932:322
1933:411
1934:300
1935:311
1936:389
1937:339
1938:411
1939:433
1940:433
1941:462
1942:278
1949:367
1953:400
grunt> 
