Big Data Systems and Large-Scale Data Analysis: Homework 2

  • Big Data Systems and Large-Scale Data Analysis: Homework 2
    • Problem Description
    • Hadoop Programming
    • Program Source Code

Big Data Systems and Large-Scale Data Analysis: Homework 2

Problem Description

Homework 2: Hadoop Programming

  1. Overall task
    • Input file:
      • Plain text
      • Each line has three fields, source destination time, separated by spaces
      • source and destination are strings that contain no internal whitespace
      • time is a floating-point number representing a duration in seconds
      • Meaning: a line may represent one phone call, one website visit, etc.
      • The input may contain noise: any line that does not match this format must be discarded, and the program must still run correctly
    • MapReduce computation: aggregate statistics for every source-destination pair
    • Output:
      • source destination count average-time
      • One line per source-destination pair (note: the same pair in the opposite order is treated as a different pair)
      • Each line reports the number of calls and the average call time, kept to 3 decimal places (e.g. 2.300); a small worked example follows
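
    For example, given the hypothetical input below (the third line is noise and is dropped):

      a b 1.2
      a b 3.4
      bad line
      b a 5.0

    the expected output would be:

      a b 2 2.300
      b a 1 5.000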

Hadoop Programming

  1. MapReduce computation: aggregate statistics for each source-destination pair

    • Mapper:

      • Takes input key-value pairs and emits intermediate key-value pairs
      • The output is sorted and then handed to each Reducer
      • Number of map tasks: determined by the input size, normally the total number of blocks across all input files
      • combine(): runs after the map to reduce the load on the reducers
    • Partition:

      • Effectively classifies the Mapper output first, then distributes it evenly across the Reducers
      • The default, HashPartitioner, is usually sufficient (a minimal sketch appears at the end of this item)
      • The default hash-and-modulo scheme only aims to balance the work across the reduce tasks
    • Shuffle:

      • The main function of the shuffle phase is fetchOutputs(), which copies the map-phase output to the local disk of the reduce node
    • Reducer:

      • Reduces the set of intermediate values associated with one key to a smaller set of values
      • Three phases:
        • Shuffle:
          • The Reducer's input is the already-sorted output of the Mappers
          • For each Reducer, the framework fetches the relevant partition of every Mapper's output over HTTP
        • Sort:
          • The framework groups the Reducer's input by key
          • Shuffle and sort proceed concurrently: map outputs are merged as they are fetched
          • Secondary sort: when the rule for grouping intermediate keys differs from the rule reduce needs, a secondary sort is used to control how the intermediate keys are grouped
        • Reduce:
          • The reduce method is called once for each key and its list of values in the grouped input
          • Number of reduce tasks: 0.95 or 1.75 × (number of nodes × per-node maximum)
            • Too few: nodes end up overloaded
            • Too many: hurts the shuffle phase
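
    For reference, here is a minimal sketch of what the default HashPartitioner does (it mirrors Hadoop's org.apache.hadoop.mapreduce.lib.partition.HashPartitioner):

      import org.apache.hadoop.mapreduce.Partitioner;

      // Mask off the sign bit of the key's hash, then take it modulo the
      // number of reduce tasks, so keys are spread evenly across reducers.
      public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
      }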
  2. Approaches:

    1. Plain map + reduce: simple and brute-force
    2. map + combiner + reduce:
      1. The combiner emits a Text(), and the reducer parses the received Text() (the source code below takes this route)
      2. The combiner emits a custom Writable type, and the reducer receives that same type (a sketch of such a type follows this list)
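
    A minimal sketch of approach 2.2, assuming a hypothetical CountTimeWritable value type that carries the running count and total time; it would replace the string packing and parsing done in the Text-based version below:

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.Writable;

      // Hypothetical value type: combiner and reducer exchange
      // (count, total-time) pairs directly instead of parsing Text.
      public class CountTimeWritable implements Writable {
        private int count;
        private float totalTime;

        public CountTimeWritable() {}  // Hadoop requires a no-arg constructor

        public void set(int count, float totalTime) {
          this.count = count;
          this.totalTime = totalTime;
        }

        public int getCount() { return count; }
        public float getTotalTime() { return totalTime; }

        @Override  // serialize the fields in a fixed order
        public void write(DataOutput out) throws IOException {
          out.writeInt(count);
          out.writeFloat(totalTime);
        }

        @Override  // deserialize the fields in the same order
        public void readFields(DataInput in) throws IOException {
          count = in.readInt();
          totalTime = in.readFloat();
        }
      }

    The mapper would then be declared as Mapper<Object, Text, Text, CountTimeWritable>, and the combiner and reducer would sum getCount() and getTotalTime() directly.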
  3. Notes:

    • Text:
      • Calling String line = value.toString() directly can yield garbled output; this is caused by the Writable type Text.
      • Text is the Writable wrapper for String, but the two differ: Text is a UTF-8 encoded Writable, while a Java String holds Unicode characters. value.toString() therefore assumes the underlying bytes are UTF-8, so data that was originally GBK-encoded becomes garbled once read into a Text and converted this way.
      • The correct approach is to take the byte array of the incoming Text value (value.getBytes()) and use the String constructor String(byte[] bytes, int offset, int length, Charset charset), which decodes the given byte subarray with the specified charset to build a new String (see the helper sketched after this list).
      • If map/reduce must emit output in some other encoding, you have to implement your own OutputFormat and specify the encoding there; the default TextOutputFormat cannot do it.
    • Running several Linux commands in sequence:
      • && keeps executing while commands succeed and stops at the first error
      • || keeps executing while commands fail and stops at the first success
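
    For example, assuming the input data is GBK-encoded, the map method could decode the value with a helper like this (note value.getLength(): the backing byte array may be longer than the valid data):

      import java.nio.charset.Charset;
      import org.apache.hadoop.io.Text;

      // Decode a Text whose raw bytes are GBK-encoded.
      public static String decodeGbk(Text value) {
        return new String(value.getBytes(), 0, value.getLength(),
                          Charset.forName("GBK"));
      }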
  4. Testing:

    1. Start Hadoop
      • start-dfs.sh
      • start-yarn.sh
    2. Example: using the bundled WordCount.java
      1. edit WordCount.java (have a look at the code)
      2. edit WordCount-manifest.txt (have a look at this)
      3. compile and generate the jar
        • rm -f *.class *.jar
        • javac WordCount.java
        • jar cfm WordCount.jar WordCount-manifest.txt WordCount*.class
      4. remove the output HDFS directory, then run the MapReduce job
        • hdfs dfs -rm -f -r /hw2/output
        • hadoop jar ./WordCount.jar /hw2/example-input.txt /hw2/output
      5. display the output
        • hdfs dfs -cat '/hw2/output/part-*'
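    3. The homework program below builds and runs the same way; assuming an analogous Hw2Part1-manifest.txt and an input file at /hw2/input.txt (both hypothetical, adjust to your setup):
      • javac Hw2Part1.java
      • jar cfm Hw2Part1.jar Hw2Part1-manifest.txt Hw2Part1*.class
      • hdfs dfs -rm -f -r /hw2/output
      • hadoop jar ./Hw2Part1.jar /hw2/input.txt /hw2/output
      • hdfs dfs -cat '/hw2/output/part-*'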

Program Source Code


import java.io.IOException;
import java.util.StringTokenizer;
import java.util.regex.Pattern;
import java.text.DecimalFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Hw2Part1 counts the records for each source-destination pair
 * and computes the average time.
 * 
 * @author guest
 * @version 1.0
 */
public class Hw2Part1 {

  /**
   * This is the Mapper class
   * @author guest
   * @version 1.0
   */
  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, Text> {

    private Text newKey = new Text();
    private Text outValue = new Text();
    /**
     * This is the map method.
     * 
     * @author guest
     * @version 1.0
     */
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // TextInputFormat delivers one line per call; splitting on '\n'
      // keeps the parsing robust if several lines ever arrive at once
      StringTokenizer line = new StringTokenizer(value.toString(), "\n");

      int count = 1;  // each valid line contributes one record
      while (line.hasMoreTokens()) {
        String tmp = line.nextToken(); // one input line
        StringTokenizer str = new StringTokenizer(tmp); // split on whitespace
        // condition 1: exactly three fields
        if (str.countTokens() != 3){
            continue;
        }

        String source = str.nextToken(); 
        String destination = str.nextToken();
        // condition 2: time must be a well-formed float
        // (reject noise like ".", "1.2.3", or a bare sign)
        String test = str.nextToken();
        Pattern pattern = Pattern.compile("^[-+]?\\d*\\.?\\d+$");
        if (!(pattern.matcher(test).matches())){
            continue;
        }
        float time = Float.valueOf(test);

        newKey.set(source + " " + destination);
        outValue.set(Integer.toString(count) + " " + Float.toString(time));
        context.write(newKey, outValue);
      }
    }
  }

  /**
   * This is the class to combine.
   * 
   * @author guest
   * @version 1.0
   */
  public static class FloatAvgCombiner
       extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    /**
     * This is the reduce method.
     * 
     * @author guest
     * @version 1.0
     */
    public void reduce(Text key, Iterable<Text> values,
                       Context context
                       ) throws IOException, InterruptedException {
      float sum = 0;
      int count = 0;

      // each incoming value is "count avg"; recover the total time
      // so that combining stays associative
      for (Text line : values){
        String tmp = line.toString();
        StringTokenizer str = new StringTokenizer(tmp);
        int c = Integer.valueOf(str.nextToken());
        float avg = Float.valueOf(str.nextToken());
        sum += avg * c;
        count += c;
      }

      result.set(Integer.toString(count) + " " + Float.toString(sum/count));
      context.write(key, result);
    }
  }

  /**
   * This is the Reducer class.
   * 
   * @author guest
   * @version 1.0
   */
  public static class FloatAvgReducer
       extends Reducer<Text, Text, Text, Text> {

    private Text result_key= new Text();
    private Text result_value= new Text();

    /**
     * This is the reduce method.
     * 
     * @author guest
     * @version 1.0
     */
    public void reduce(Text key, Iterable<Text> values,
                       Context context
                       ) throws IOException, InterruptedException {
      float sum = 0;
      int count = 0;

      for (Text line : values){
        String tmp = line.toString(); // each value is "count avg"
        StringTokenizer str = new StringTokenizer(tmp); // split on whitespace
        int c = Integer.valueOf(str.nextToken());
        float avg = Float.valueOf(str.nextToken());
        sum += avg * c; 
        count += c;
      }

      // generate result key
      result_key.set(key);

      // generate result value: average rounded to 3 decimal places;
      // the pattern "0.000" keeps a leading zero for averages below 1
      double avg_result = (double)(sum / count);
      avg_result = (double)(Math.round(avg_result * 1000)/1000.0);
      DecimalFormat df = new DecimalFormat("0.000");
      String avg_print = df.format(avg_result);
      result_value.set(Integer.toString(count) + " " + avg_print);

      context.write(result_key, result_value);
    }
  }

  /**
   * This is the main method.
   * 
   * @param args one or more input paths followed by the output path
   * @throws Exception
   */
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 2) {
      System.err.println("Usage: hw2part1 <in> [<in>...] <out>");
      System.exit(2);
    }

    Job job = Job.getInstance(conf, "count and avg");

    job.setJarByClass(Hw2Part1.class);

    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(FloatAvgCombiner.class);
    job.setReducerClass(FloatAvgReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // add the input paths as given by command line
    for (int i = 0; i < otherArgs.length - 1; ++i) {
      FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }

    // add the output path as given by the command line
    FileOutputFormat.setOutputPath(job,
      new Path(otherArgs[otherArgs.length - 1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
