2018-01-29 11-12 Map/Reduce framework

The large volume of customers' pass-through data motivates the use of Hadoop.

Framework

The user defines:

    a. key-value pairs    -- our basic unit of data and analysis

    b. mapper functions    -- applied to the original input data; they generate intermediate data (a minimal sketch follows this list)

    c. reducer functions    -- applied to the grouped intermediate results of the mapper functions
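To make items b and c concrete, here is a minimal Python sketch of what a streaming word-count mapper and reducer could look like. It is only an illustration; the actual wordcount_mapper.py and wordcount_reducer.py used in the assignment may differ.

# wordcount_mapper.py (sketch): read raw text from stdin and emit one
# tab-separated (word, 1) pair per word. Hadoop streaming treats the text
# before the first tab as the key and the rest as the value.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# wordcount_reducer.py (sketch): input arrives sorted by key, so all lines
# for one word are adjacent and can be summed in a single pass.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, _, count = line.strip().partition("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))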

Hadoop handles the logistics:

    a. Hadoop distributes mapper functions to the data (the "computation close to data" rule)

    b. mapper functions emit their results as (key, value) pairs, which Hadoop groups by key --> intermediate result (see the toy simulation after this list)

    c. each key's group is passed to the same reducer.
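The grouping described in items b and c can be illustrated with a toy, single-machine simulation of the shuffle step between mapper and reducer. The sample data and names below are made up purely for illustration.

# Toy simulation of Hadoop's shuffle/sort: mapper output is sorted by key,
# identical keys are grouped together, and each group goes to one reducer call.
from itertools import groupby

mapper_output = [("apple", 1), ("banana", 1), ("apple", 1), ("apple", 1)]

# Shuffle/sort: bring pairs with the same key next to each other.
shuffled = sorted(mapper_output, key=lambda kv: kv[0])

# Reduce: every group sharing a key is handled by the same reducer.
for key, group in groupby(shuffled, key=lambda kv: kv[0]):
    print("%s\t%d" % (key, sum(v for _, v in group)))
# prints: apple 3, banana 1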


programming assignment 1 - troubleshooting:

1. hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
     -input /user/cloudera/input \
     -output /user/cloudera/output_new \
     -mapper /home/cloudera/wordcount_mapper.py \
     -reducer /home/cloudera/wordcount_reducer.py

It reported this error:

INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032

16/04/08 21:21:00 INFO ipc.Client: Retrying connect to server: quickstart.cloudera/127.0.0.1:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

I fixed it by restarting the yarn-resourcemanager service:

$ sudo service hadoop-yarn-resourcemanager stop
$ sudo service hadoop-yarn-resourcemanager start


Hadoop rules of thumb:

one mapper per data split

one reducer per compute core (for parallelism)


We can cascade Map/Reduce jobs or chain them together. When you need to perform a sequence of tasks and outputs, cascading jobs are possible, and using composite keys can help make that a good alternative (see the sketch below).
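As a rough illustration of the composite-key idea in a streaming job, a mapper can join two fields into a single key so that the next stage's reducer receives all records for that field combination together. The comma-separated input format and field positions below are assumptions for the sketch, not part of the assignment.

# composite_key_mapper.py (hypothetical sketch): emit "field1,field2" as one
# composite key so records sharing both fields are grouped at the same reducer.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) < 3:
        continue  # skip malformed or empty lines (assumption about the input)
    composite_key = "%s,%s" % (fields[0], fields[1])
    print("%s\t%s" % (composite_key, fields[2]))

Because streaming treats the text before the first tab as the key, the reducer of the next stage then sees all values for each (field1, field2) combination in one group.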

lesson 11 - slides

lesson 12 - slides
