A large volume of customer pass-through data motivates the use of Hadoop.
Framework
User defines:
a.
b. mapper functions -- applied to the original input data; generate the intermediate result
c. reducer functions -- applied to the grouped intermediate results of the mapper functions
Hadoop handles the logistics:
a. Hadoop distributes the mapper functions to the data (the "computation close to the data" rule)
b. mapper functions emit their results as key/value pairs
c. all values that share the same key are grouped and passed to the same reducer.
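The map, group-by-key, reduce flow above can be sketched in plain Python, with no Hadoop involved. This is only a local illustration under my own assumptions: the names mapper, shuffle, reducer and run_job are hypothetical, and the code is not the wordcount_mapper.py / wordcount_reducer.py from the assignment.

```python
from collections import defaultdict


def mapper(line):
    """Emit one (word, 1) pair for every word in a line of input."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key; a stand-in for the grouping Hadoop performs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Collapse all values of one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run mapper -> shuffle -> reducer over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    groups = shuffle(pairs)
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_job(["a rose is a rose", "is it"]))
```

In a real job the shuffle step is done by the framework, so the user only writes the mapper and the reducer.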
programming assignment 1 - troubleshooting:
1. hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_new \
-mapper /home/cloudera/wordcount_mapper.py \
-reducer /home/cloudera/wordcount_reducer.py
It reports the following error:
INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
16/04/08 21:21:00 INFO ipc.Client: Retrying connect to server: quickstart.cloudera/127.0.0.1:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
I fixed it by restarting the yarn-resourcemanager service:
$ sudo service hadoop-yarn-resourcemanager stop
$ sudo service hadoop-yarn-resourcemanager start
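Before resubmitting the job, it may help to check that something is actually listening on the ResourceManager port. A minimal sketch, assuming the address 127.0.0.1:8032 from the log lines above; port_is_open is a hypothetical helper of mine, not part of Hadoop.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Address and port taken from the "Connecting to ResourceManager" log line.
    print(port_is_open("127.0.0.1", 8032))
```

A result of False would be consistent with the "Retrying connect to server" messages; True only shows that the port accepts connections, not that the service is healthy.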
Hadoop rules of thumb
one mapper per data split
one reducer per compute core (for parallelism)
We can cascade Map/Reduce jobs, i.e. chain them together. When you need to perform a sequence of tasks, each consuming the previous output, cascading jobs is possible, and using composite keys can help make that a good alternative.
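A rough sketch of two cascaded jobs with a composite key, again in plain Python rather than on a cluster. The data, the (user, page) key and every function name here are my own hypothetical choices for illustration: job 1 counts visits per (user, page), and job 2 takes job 1's output and keeps each user's most visited page.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map -> group-by-key -> reduce pass in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


# Job 1: count visits per composite key (user, page).
def map_visit(record):
    user, page = record
    yield (user, page), 1


def reduce_count(key, values):
    yield key, sum(values)


# Job 2: re-key job 1's output by user and keep the most visited page.
def map_by_user(record):
    (user, page), count = record
    yield user, (count, page)


def reduce_top_page(user, values):
    count, page = max(values)
    yield user, (page, count)


if __name__ == "__main__":
    visits = [("ann", "/home"), ("ann", "/docs"), ("ann", "/docs"), ("bob", "/home")]
    stage1 = run_job(visits, map_visit, reduce_count)
    stage2 = run_job(stage1, map_by_user, reduce_top_page)
    print(stage2)
```

The composite key lets the first job do the per-pair counting, so the second job only has to regroup by one half of the key.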
lesson 11 - slides
lesson 12 - slides