Hadoop workshop homework.
For privacy, the blog post will not show source code at all, only the job output logs and counters.
- Copy the packaged jar file to the Hadoop cluster:
[root@n1 hadoop-examples]# scp [email protected]:~/prog/hadoop/cdh4-examples/cdh4-examples.jar .
Password:
cdh4-examples.jar 100% 46KB 46.0KB/s 00:00
- Copy the input data into HDFS:
$ scp NASA_access_log_Jul95.gz [email protected]:/root/hadoop-examples
[email protected]'s password:
NASA_access_log_Jul95.gz 100% 20MB 19.7MB/s 00:00
[root@n1 hadoop-examples]# gunzip -d NASA_access_log_Jul95.gz
[root@n1 hadoop-examples]# hadoop fs -mkdir nasa_access_log
[root@n1 hadoop-examples]# hadoop fs -copyFromLocal NASA_access_log_Jul95 ./nasa_access_log/
Scenario 1 output:
[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar demo.LogProcessor nasa_access_log output 2
13/07/13 00:14:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:14:57 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 00:14:58 INFO mapred.JobClient: Running job: job_201307122107_0009
13/07/13 00:14:59 INFO mapred.JobClient: map 0% reduce 0%
13/07/13 00:15:17 INFO mapred.JobClient: map 5% reduce 0%
13/07/13 00:15:18 INFO mapred.JobClient: map 14% reduce 0%
13/07/13 00:15:21 INFO mapred.JobClient: map 28% reduce 0%
13/07/13 00:15:25 INFO mapred.JobClient: map 44% reduce 0%
13/07/13 00:15:27 INFO mapred.JobClient: map 68% reduce 0%
13/07/13 00:15:30 INFO mapred.JobClient: map 78% reduce 0%
13/07/13 00:15:34 INFO mapred.JobClient: map 87% reduce 0%
13/07/13 00:15:36 INFO mapred.JobClient: map 96% reduce 0%
13/07/13 00:15:39 INFO mapred.JobClient: map 100% reduce 0%
13/07/13 00:15:54 INFO mapred.JobClient: map 100% reduce 84%
13/07/13 00:15:56 INFO mapred.JobClient: map 100% reduce 100%
13/07/13 00:15:59 INFO mapred.JobClient: Job complete: job_201307122107_0009
13/07/13 00:15:59 INFO mapred.JobClient: Counters: 33
13/07/13 00:15:59 INFO mapred.JobClient: File System Counters
13/07/13 00:15:59 INFO mapred.JobClient: FILE: Number of bytes read=21497514
13/07/13 00:15:59 INFO mapred.JobClient: FILE: Number of bytes written=31791353
13/07/13 00:15:59 INFO mapred.JobClient: FILE: Number of read operations=0
13/07/13 00:15:59 INFO mapred.JobClient: FILE: Number of large read operations=0
13/07/13 00:15:59 INFO mapred.JobClient: FILE: Number of write operations=0
13/07/13 00:15:59 INFO mapred.JobClient: HDFS: Number of bytes read=205308182
13/07/13 00:15:59 INFO mapred.JobClient: HDFS: Number of bytes written=2139772
13/07/13 00:15:59 INFO mapred.JobClient: HDFS: Number of read operations=4
13/07/13 00:15:59 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/07/13 00:15:59 INFO mapred.JobClient: HDFS: Number of write operations=2
13/07/13 00:15:59 INFO mapred.JobClient: Job Counters
13/07/13 00:15:59 INFO mapred.JobClient: Launched map tasks=2
13/07/13 00:15:59 INFO mapred.JobClient: Launched reduce tasks=2
13/07/13 00:15:59 INFO mapred.JobClient: Data-local map tasks=2
13/07/13 00:15:59 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=63399
13/07/13 00:15:59 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=26747
13/07/13 00:15:59 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 00:15:59 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 00:15:59 INFO mapred.JobClient: Map-Reduce Framework
13/07/13 00:15:59 INFO mapred.JobClient: Map input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient: Map output records=1871988
13/07/13 00:15:59 INFO mapred.JobClient: Map output bytes=43967362
13/07/13 00:15:59 INFO mapred.JobClient: Input split bytes=278
13/07/13 00:15:59 INFO mapred.JobClient: Combine input records=0
13/07/13 00:15:59 INFO mapred.JobClient: Combine output records=0
13/07/13 00:15:59 INFO mapred.JobClient: Reduce input groups=81621
13/07/13 00:15:59 INFO mapred.JobClient: Reduce shuffle bytes=10171946
13/07/13 00:15:59 INFO mapred.JobClient: Reduce input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient: Reduce output records=81621
13/07/13 00:15:59 INFO mapred.JobClient: Spilled Records=5615964
13/07/13 00:15:59 INFO mapred.JobClient: CPU time spent (ms)=43710
13/07/13 00:15:59 INFO mapred.JobClient: Physical memory (bytes) snapshot=767377408
13/07/13 00:15:59 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3596718080
13/07/13 00:15:59 INFO mapred.JobClient: Total committed heap usage (bytes)=397082624
13/07/13 00:15:59 INFO mapred.JobClient: demo.LogProcessorMap$LOG_PROCESSOR_COUNTER
13/07/13 00:15:59 INFO mapred.JobClient: BAD_RECORDS=1871988
# of Good Records :1871988
Scenario 2 output:
[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar demo.genericwritable.LogProcessor nasa_access_log output 2
13/07/13 00:17:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:17:28 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 00:17:29 INFO mapred.JobClient: Running job: job_201307122107_0011
13/07/13 00:17:30 INFO mapred.JobClient: map 0% reduce 0%
13/07/13 00:17:43 INFO mapred.JobClient: map 24% reduce 0%
13/07/13 00:17:45 INFO mapred.JobClient: map 33% reduce 0%
13/07/13 00:17:46 INFO mapred.JobClient: map 49% reduce 0%
13/07/13 00:17:48 INFO mapred.JobClient: map 57% reduce 0%
13/07/13 00:17:49 INFO mapred.JobClient: map 66% reduce 0%
13/07/13 00:17:51 INFO mapred.JobClient: map 75% reduce 0%
13/07/13 00:17:54 INFO mapred.JobClient: map 87% reduce 0%
13/07/13 00:17:57 INFO mapred.JobClient: map 99% reduce 0%
13/07/13 00:17:59 INFO mapred.JobClient: map 100% reduce 0%
13/07/13 00:18:12 INFO mapred.JobClient: map 100% reduce 50%
13/07/13 00:18:15 INFO mapred.JobClient: map 100% reduce 69%
13/07/13 00:18:18 INFO mapred.JobClient: map 100% reduce 70%
13/07/13 00:18:20 INFO mapred.JobClient: map 100% reduce 83%
13/07/13 00:18:21 INFO mapred.JobClient: map 100% reduce 84%
13/07/13 00:18:25 INFO mapred.JobClient: map 100% reduce 86%
13/07/13 00:18:26 INFO mapred.JobClient: map 100% reduce 100%
13/07/13 00:18:30 INFO mapred.JobClient: Job complete: job_201307122107_0011
13/07/13 00:18:30 INFO mapred.JobClient: Counters: 32
13/07/13 00:18:30 INFO mapred.JobClient: File System Counters
13/07/13 00:18:30 INFO mapred.JobClient: FILE: Number of bytes read=70122269
13/07/13 00:18:30 INFO mapred.JobClient: FILE: Number of bytes written=103466795
13/07/13 00:18:30 INFO mapred.JobClient: FILE: Number of read operations=0
13/07/13 00:18:30 INFO mapred.JobClient: FILE: Number of large read operations=0
13/07/13 00:18:30 INFO mapred.JobClient: FILE: Number of write operations=0
13/07/13 00:18:30 INFO mapred.JobClient: HDFS: Number of bytes read=205308182
13/07/13 00:18:30 INFO mapred.JobClient: HDFS: Number of bytes written=86859890
13/07/13 00:18:30 INFO mapred.JobClient: HDFS: Number of read operations=4
13/07/13 00:18:30 INFO mapred.JobClient: HDFS: Number of large read operations=0
13/07/13 00:18:30 INFO mapred.JobClient: HDFS: Number of write operations=2
13/07/13 00:18:30 INFO mapred.JobClient: Job Counters
13/07/13 00:18:30 INFO mapred.JobClient: Launched map tasks=2
13/07/13 00:18:30 INFO mapred.JobClient: Launched reduce tasks=2
13/07/13 00:18:30 INFO mapred.JobClient: Data-local map tasks=2
13/07/13 00:18:30 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=47028
13/07/13 00:18:30 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=44185
13/07/13 00:18:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 00:18:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 00:18:30 INFO mapred.JobClient: Map-Reduce Framework
13/07/13 00:18:30 INFO mapred.JobClient: Map input records=1891715
13/07/13 00:18:30 INFO mapred.JobClient: Map output records=3743976
13/07/13 00:18:30 INFO mapred.JobClient: Map output bytes=168829257
13/07/13 00:18:30 INFO mapred.JobClient: Input split bytes=278
13/07/13 00:18:30 INFO mapred.JobClient: Combine input records=0
13/07/13 00:18:30 INFO mapred.JobClient: Combine output records=0
13/07/13 00:18:30 INFO mapred.JobClient: Reduce input groups=81621
13/07/13 00:18:30 INFO mapred.JobClient: Reduce shuffle bytes=33609934
13/07/13 00:18:30 INFO mapred.JobClient: Reduce input records=3743976
13/07/13 00:18:30 INFO mapred.JobClient: Reduce output records=81621
13/07/13 00:18:30 INFO mapred.JobClient: Spilled Records=11231928
13/07/13 00:18:30 INFO mapred.JobClient: CPU time spent (ms)=51290
13/07/13 00:18:30 INFO mapred.JobClient: Physical memory (bytes) snapshot=914145280
13/07/13 00:18:30 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4566802432
13/07/13 00:18:30 INFO mapred.JobClient: Total committed heap usage (bytes)=573489152
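The counters of the two runs are easier to compare once they are pulled out of the log text. Below is a minimal sketch for doing that, not part of the homework code; the names `parse_counters`, `TIMESTAMP_RE` and `COUNTER_RE` are made up for illustration, and the parser assumes each entry starts with a timestamp in the `13/07/13 00:15:59` form seen above.

```python
import re

# Each JobClient entry starts with a timestamp such as "13/07/13 00:15:59".
TIMESTAMP_RE = re.compile(r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")
# A counter entry has the form "<name>=<integer>" after the logger name.
COUNTER_RE = re.compile(r"mapred\.JobClient: (.+?)=(\d+)")


def parse_counters(log_text):
    """Return {counter name: value} for the counter entries in a JobClient log."""
    # Collapse whitespace first so the parser works whether the log has one
    # entry per line or has been wrapped onto a few long lines.
    flat = " ".join(log_text.split())
    counters = {}
    for entry in TIMESTAMP_RE.split(flat):
        match = COUNTER_RE.search(entry)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


# A few entries copied from the Scenario 1 log.
sample = """
13/07/13 00:15:59 INFO mapred.JobClient: Map output records=1871988
13/07/13 00:15:59 INFO mapred.JobClient: Reduce input groups=81621
13/07/13 00:15:59 INFO mapred.JobClient: Reduce output records=81621
13/07/13 00:15:59 INFO mapred.JobClient: Spilled Records=5615964
"""

counters = parse_counters(sample)
# One output record per reduce group.
print(counters["Reduce output records"] == counters["Reduce input groups"])  # True
# Spilled records are exactly three times the map output records.
print(counters["Spilled Records"] == 3 * counters["Map output records"])  # True
```

The same 3x ratio between `Spilled Records` and `Map output records` holds in Scenario 2 (11231928 and 3743976), and Scenario 2 emits exactly twice as many map output records as Scenario 1 while producing the same 81621 reduce groups. The logs alone do not say why the records are spilled more than once.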
Scenario 3 (Hadoop streaming MapReduce)
Copy the Python script to the Hadoop cluster:
$ scp logProcessor.py [email protected]:/root/hadoop-examples
[email protected]'s password:
logProcessor.py 100% 470 0.5KB/s 00:00
Output:
[root@n1 hadoop-examples]# hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar -input nasa_access_log -output output -mapper 'python logProcessor.py' -reducer aggregate -file logProcessor.py
packageJobJar: [logProcessor.py, /tmp/hadoop-root/hadoop-unjar641255321819856404/] [] /tmp/streamjob5121005386227726797.jar tmpDir=null
13/07/13 00:34:05 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:34:05 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/13 00:34:06 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
13/07/13 00:34:06 INFO streaming.StreamJob: Running job: job_201307122107_0015
13/07/13 00:34:06 INFO streaming.StreamJob: To kill this job, run:
13/07/13 00:34:06 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=n1.example.com:8021 -kill job_201307122107_0015
13/07/13 00:34:06 INFO streaming.StreamJob: Tracking URL: http://n1.example.com:50030/jobdetails.jsp?jobid=job_201307122107_0015
13/07/13 00:34:07 INFO streaming.StreamJob: map 0% reduce 0%
13/07/13 00:34:24 INFO streaming.StreamJob: map 11% reduce 0%
13/07/13 00:34:25 INFO streaming.StreamJob: map 25% reduce 0%
13/07/13 00:34:27 INFO streaming.StreamJob: map 39% reduce 0%
13/07/13 00:34:28 INFO streaming.StreamJob: map 52% reduce 0%
13/07/13 00:34:31 INFO streaming.StreamJob: map 75% reduce 0%
13/07/13 00:34:33 INFO streaming.StreamJob: map 87% reduce 0%
13/07/13 00:34:34 INFO streaming.StreamJob: map 100% reduce 0%
13/07/13 00:34:46 INFO streaming.StreamJob: map 100% reduce 100%
13/07/13 00:34:50 INFO streaming.StreamJob: Job complete: job_201307122107_0015
13/07/13 00:34:50 INFO streaming.StreamJob: Output: output
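The `-reducer aggregate` option used above hands the reduce side to Hadoop's built-in aggregate package, so the mapper has to tag each output key with the aggregation it wants, for example `LongValueSum:<key>` followed by a tab and a number. The sketch below only illustrates that output format and is not the `logProcessor.py` used in this run. It assumes the input is in Common Log Format and that the job sums response bytes per host; `LOG_RE`, `map_line` and `run` are hypothetical names.

```python
import re
import sys

# Common Log Format, e.g.
# 199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\d+|-)\s*$')


def map_line(line):
    """Return one aggregate-package record for a log line, or None if it is malformed."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    host, size = match.group(1), match.group(5)
    # A "-" in the size field means no body was sent; count it as 0 bytes.
    size = 0 if size == "-" else int(size)
    # The "LongValueSum:" prefix asks the aggregate reducer to sum the values per key.
    return "LongValueSum:%s\t%d" % (host, size)


def run(lines, out):
    """Write one record per well-formed input line; malformed lines are skipped."""
    for line in lines:
        record = map_line(line)
        if record is not None:
            out.write(record + "\n")


sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
print(map_line(sample))

# When used as the -mapper script, replace the demo above with:
# run(sys.stdin, sys.stdout)
```

Because the aggregate reducer does the summing, the script needs no reducer of its own, which keeps the streaming job to a single small file shipped with `-file`.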