Hadoop custom partitioning & an introduction to data skew (again using the WordCount example)

Contents

    • Preface
    • Generate a large word file
      • Input (14 blocks)
    • MR job (2 reducers, default partitioner)
      • MR log
      • Default HashPartitioner
      • Output
    • Custom partitioner
      • Output
      • Application logs
        • 14 maps, 2 reduces (data skew)
        • Successful application log
    • Thoughts on fixing data skew
    • Appendix
      • Java source code
    • A YARN issue

Preface

This article again uses WordCount as the example, this time to practice custom partitioning. It helps to first read "hadoop 简单的MapReduce源码分析(源码&流程&word count日志)" to get a basic understanding of how Hadoop MapReduce works under the hood, and then come back and work through this article to deepen that understanding. The experiment also gives a very direct look at data skew, one of the most common big-data problems.

Generate a large word file

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.Random;

/**
 * Generate a large word file.
 */
public static void getBigFile() {
    try {
        String filePath = "/Users/mubi/test_data/words_skew.txt";
        File file = new File(filePath);
        if (!file.exists()) {   // create the file if it does not exist yet
            file.createNewFile();
            System.out.println("File created, start writing");
        }
        FileWriter fw = new FileWriter(file);       // writer for the file
        BufferedWriter bw = new BufferedWriter(fw);

        Random random = new Random();
        for (int i = 0; i < 1024 * 1024 * 256; i++) {
            String wordTmp;
            // every even index gets the word "hello", which becomes the skewed key
            if ((i & 1) == 0) {
                wordTmp = "hello";
            } else {
                // otherwise a random word "hello0" .. "hello99"
                int randomIndex = random.nextInt(100);
                wordTmp = "hello" + randomIndex;
            }
            bw.write(wordTmp);      // write one word
            bw.newLine();           // new line
        }
        bw.close();
        fw.close();
        System.out.println("File writing finished");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
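
After the file has been generated locally, upload it to HDFS so that it sits at the input path used by the driver below (the target path is assumed to match the driver's dst variable):

  • hadoop fs -put /Users/mubi/test_data/words_skew.txt hdfs://localhost:9000/input/words_skew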

Input (14 blocks)

[Image 1]

MR job (2 reducers, default partitioner)
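
For this first run, the driver (full source in the appendix) only sets the number of reduce tasks and does not call setPartitionerClass, so Hadoop falls back to its default HashPartitioner, shown further below:

// relevant driver lines for the default-partitioner run
job.setNumReduceTasks(2);
// job.setPartitionerClass(...) is deliberately not called here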

MR log

20/05/08 22:30:39 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=8177945944
                FILE: Number of bytes written=11655827679
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=1865681612
                HDFS: Number of bytes written=1606
                HDFS: Number of read operations=48
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=4
        Job Counters 
                Launched map tasks=14
                Launched reduce tasks=2
                Data-local map tasks=14
                Total time spent by all maps in occupied slots (ms)=1016987
                Total time spent by all reduces in occupied slots (ms)=210559
                Total time spent by all map tasks (ms)=1016987
                Total time spent by all reduce tasks (ms)=210559
                Total vcore-seconds taken by all map tasks=1016987
                Total vcore-seconds taken by all reduce tasks=210559
                Total megabyte-seconds taken by all map tasks=1041394688
                Total megabyte-seconds taken by all reduce tasks=215612416
        Map-Reduce Framework
                Map input records=268435456
                Map output records=268435456
                Map output bytes=2939368746
                Map output materialized bytes=3476239826
                Input split bytes=1442
                Combine input records=0
                Combine output records=0
                Reduce input groups=101
                Reduce shuffle bytes=3476239826
                Reduce input records=268435456
                Reduce output records=101
                Spilled Records=899921774
                Shuffled Maps =28
                Failed Shuffles=0
                Merged Map outputs=28
                GC time elapsed (ms)=16953
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=3240099840
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=1865680170
        File Output Format Counters 
                Bytes Written=1606
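
Note how the counters line up with the setup: Launched map tasks=14 matches the 14 input blocks, Launched reduce tasks=2 matches the configured reduce count, and Map input records=268435456 is exactly the 1024 * 1024 * 256 lines written by getBigFile.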

Default HashPartitioner

/** Partition keys by their {@link Object#hashCode()}. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}

Here numReduceTasks=2, and there are in fact 101 distinct keys in total ("hello" plus "hello0" through "hello99"), so each reducer ends up with roughly 50 keys in its output.
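
To see exactly where each key would go, a small probe like the following can be used (my own sketch, assuming the Hadoop client libraries are on the classpath; it simply feeds a few sample keys from the generated file into the default partitioner):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionProbe {
    public static void main(String[] args) {
        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<>();
        int numReduceTasks = 2;
        // a few sample keys from words_skew.txt
        for (String k : new String[]{"hello", "hello0", "hello1", "hello90", "hello99"}) {
            int p = partitioner.getPartition(new Text(k), new IntWritable(1), numReduceTasks);
            System.out.println(k + " -> reducer " + p);
        }
    }
}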

Output

Two output files are produced, and each contains a similar number of keys, so the two reducers do roughly the same amount of work (there can of course still be some imbalance, where one reducer happens to get much more data than the other).

  • hadoop fs -cat hdfs://localhost:9000/output/wcoutput/part-r-00000
hello1	1341325
hello10	1343473
hello12	1344220
hello14	1340394
hello16	1343187
hello18	1343303
hello21	1343380
hello23	1342026
hello25	1341196
hello27	1342497
hello29	1340774
hello3	1341453
hello30	1341756
hello32	1340401
hello34	1340890
hello36	1341383
hello38	1344090
hello41	1343010
hello43	1340936
hello45	1342636
hello47	1342940
hello49	1342628
hello5	1341706
hello50	1343112
hello52	1343793
hello54	1341610
hello56	1342225
hello58	1341521
hello61	1340893
hello63	1343427
hello65	1342251
hello67	1341848
hello69	1342101
hello7	1342317
hello70	1341821
hello72	1341972
hello74	1342041
hello76	1342539
hello78	1343379
hello81	1343054
hello83	1341773
hello85	1342390
hello87	1344358
hello89	1341553
hello9	1343188
hello90	1342887
hello92	1343126
hello94	1341217
hello96	1341221
hello98	1343300
  • hadoop fs -cat hdfs://localhost:9000/output/wcoutput/part-r-00001
hello	134217728
hello0	1342894
hello11	1341587
hello13	1343587
hello15	1339478
hello17	1341919
hello19	1342120
hello2	1342875
hello20	1342815
hello22	1343697
hello24	1342658
hello26	1340078
hello28	1340114
hello31	1343454
hello33	1343156
hello35	1342257
hello37	1340605
hello39	1341536
hello4	1341234
hello40	1341259
hello42	1341287
hello44	1341361
hello46	1343004
hello48	1342011
hello51	1341286
hello53	1342911
hello55	1341548
hello57	1343366
hello59	1342538
hello6	1341054
hello60	1342632
hello62	1342056
hello64	1342303
hello66	1342864
hello68	1342509
hello71	1343520
hello73	1343064
hello75	1343421
hello77	1342444
hello79	1342260
hello8	1343224
hello80	1342051
hello82	1341672
hello84	1341337
hello86	1341051
hello88	1341963
hello91	1341783
hello93	1340030
hello95	1343021
hello97	1341265
hello99	1341048

Custom partitioner

Put hello and hello[0, 90) into one partition, and the remaining hello[90, 100) into the other:

static class MyPartioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text text, IntWritable intWritable, int i) {
        String key = text.toString();
        if (key.equals("hello")) {
            return 0;
        } else {
            // key is "helloN": N < 90 goes to partition 0, N >= 90 to partition 1
            int index = Integer.valueOf(key.substring("hello".length()));
            if (index < 90) {
                return 0;
            }
            return 1;
        }
    }
}
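
The partitioner then has to be registered on the job together with a matching reducer count; the corresponding two lines from the full driver in the appendix are:

// set the number of reduce tasks
job.setNumReduceTasks(2);
// plug in the custom partitioner
job.setPartitionerClass(MyPartioner.class);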

Output

hello and hello[0, 90) all land in partition 0, and the remaining hello[90, 100) land in partition 1, so the custom partitioner clearly does what it was written to do.

  • hadoop fs -cat hdfs://localhost:9000/output/wcoutput/part-r-00000
hello	134217728
hello0	1342894
hello1	1341325
hello10	1343473
hello11	1341587
hello12	1344220
hello13	1343587
hello14	1340394
hello15	1339478
hello16	1343187
hello17	1341919
hello18	1343303
hello19	1342120
hello2	1342875
hello20	1342815
hello21	1343380
hello22	1343697
hello23	1342026
hello24	1342658
hello25	1341196
hello26	1340078
hello27	1342497
hello28	1340114
hello29	1340774
hello3	1341453
hello30	1341756
hello31	1343454
hello32	1340401
hello33	1343156
hello34	1340890
hello35	1342257
hello36	1341383
hello37	1340605
hello38	1344090
hello39	1341536
hello4	1341234
hello40	1341259
hello41	1343010
hello42	1341287
hello43	1340936
hello44	1341361
hello45	1342636
hello46	1343004
hello47	1342940
hello48	1342011
hello49	1342628
hello5	1341706
hello50	1343112
hello51	1341286
hello52	1343793
hello53	1342911
hello54	1341610
hello55	1341548
hello56	1342225
hello57	1343366
hello58	1341521
hello59	1342538
hello6	1341054
hello60	1342632
hello61	1340893
hello62	1342056
hello63	1343427
hello64	1342303
hello65	1342251
hello66	1342864
hello67	1341848
hello68	1342509
hello69	1342101
hello7	1342317
hello70	1341821
hello71	1343520
hello72	1341972
hello73	1343064
hello74	1342041
hello75	1343421
hello76	1342539
hello77	1342444
hello78	1343379
hello79	1342260
hello8	1343224
hello80	1342051
hello81	1343054
hello82	1341672
hello83	1341773
hello84	1341337
hello85	1342390
hello86	1341051
hello87	1344358
hello88	1341963
hello89	1341553
hello9	1343188
  • hadoop fs -cat hdfs://localhost:9000/output/wcoutput/part-r-00001
hello90	1342887
hello91	1341783
hello92	1343126
hello93	1340030
hello94	1341217
hello95	1343021
hello96	1341221
hello97	1341265
hello98	1343300
hello99	1341048

Application logs

http://localhost:8088/cluster/apps/ACCEPTED

[Image 2]

[Image 3]

14 maps, 2 reduces (data skew)

[Image 4]

  • Because of the custom partitioner, one reducer now has to process far more data than the other (i.e. data skew).

One reducer finishes quickly, while the other keeps running for quite a bit longer.

[Image 5]
[Image 6]

Successful application log

[Image 7]
The two reducers' execution times differ a lot: one took about one and a half minutes, the other about five minutes.
[Image 8]

Thoughts on fixing data skew

Some common directions to consider (a sketch of the key-salting idea from item 3b follows the TODO below):

  1. Give the job more resources: make sure even the most skewed task has enough to run normally.
  2. Increase the number of reducers: with more tasks, the skew per task becomes somewhat smaller.
  3. Try to reduce the skew itself:
    a. Analyze the cause and avoid it: a custom partitioner, redesigning the key, etc.
    b. Filter out data that is not actually needed before the analysis, or "salt" the skewed key with a random prefix/suffix.

// TODO: put these approaches into practice
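
As a starting point for 3b, here is a minimal sketch (my own illustration, not part of the original job; it assumes the same imports as the WC class in the appendix plus java.util.Random) of a first-stage mapper that salts the hot key "hello" so its records spread over several reducers; a second job would then strip the suffix and aggregate the partial counts:

static class SaltedWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // hypothetical fan-out factor: the hot key is split over this many sub-keys
    private static final int SALT_BUCKETS = 10;
    private final Random random = new Random();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split(" ")) {
            String outKey = word;
            if ("hello".equals(word)) {
                // e.g. "hello" becomes "hello#0" ... "hello#9"
                outKey = word + "#" + random.nextInt(SALT_BUCKETS);
            }
            context.write(new Text(outKey), new IntWritable(1));
        }
    }
}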

Appendix:

Java source code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.util.Iterator;


/**
 * @Author mubi
 * @Date 2020/4/28 23:39
 */
public class WC {

    /**
     * The map function.
     *
     * The four generic type parameters mean:
     * KeyIn        key of the Mapper's input: the byte offset of the line in the file (0, 11, ...)
     * ValueIn      value of the Mapper's input: the text of the line
     * KeyOut       key of the Mapper's output: a "word" from the line
     * ValueOut     value of the Mapper's output: the count "1" for that word
     */
    static class WordCountMapper extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // convert the text handed to us by the map task into a String
            String line = value.toString();
            // split the line into words on spaces
            String[] words = line.split(" ");

            // emit each word as <word, 1>
            for(String word: words){
                // use the word as the key and 1 as the value, so the shuffle sends identical words to the same reduce task
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }

    /**
     * The reduce function.
     *
     * The four generic type parameters mean:
     * KeyIn        key of the Reducer's input: a "word"
     * ValueIn      value of the Reducer's input: the list of counts for that word
     * KeyOut       key of the Reducer's output: the distinct "word"
     * ValueOut     value of the Reducer's output: the total count of that word
     */
    static class WordCountReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context)throws IOException, InterruptedException {
            int count = 0;
            Iterator<IntWritable> it = values.iterator();
            while(it.hasNext()){
                count += it.next().get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    static class MyPartioner extends Partitioner<Text, IntWritable>{

        @Override
        public int getPartition(Text text, IntWritable intWritable, int i) {
            String key = text.toString();
            if(key.equals("hello")){
                return 0;
            }else {
                int index = Integer.valueOf(key.substring("hello".length()));
                if(index < 90){
                    return 0;
                }
                return 1;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // input path
        String dst = "hdfs://localhost:9000/input/words_skew";
//        String dst = "hdfs://localhost:9000" + args[1];

        // hadoop fs -cat hdfs://localhost:9000/output/wcoutput/part-r-00000
        // output path; it must not already exist (not even as an empty directory)
        String dstOut = "hdfs://localhost:9000/output/wcoutput";
//        String dstOut = "hdfs://localhost:9000" + args[2];

        Configuration hadoopConfig = new Configuration();
        hadoopConfig.set("fs.hdfs.impl",
                org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()
        );

        hadoopConfig.set("fs.file.impl",
                org.apache.hadoop.fs.LocalFileSystem.class.getName()
        );

        // if the output directory already exists, delete it first
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://localhost:9000"), hadoopConfig);
        Path outputPath = new Path("/output/wcoutput");
        if(fileSystem.exists(outputPath)){
            fileSystem.delete(outputPath,true);
        }

        Job job = new Job(hadoopConfig);
        // the next line is needed when the job is packaged as a jar and run with:
        // hadoop jar wordcount-1.0-SNAPSHOT.jar WC
        job.setJarByClass(WC.class);
        job.setJobName("WC");

        // input and output paths of the job
        FileInputFormat.addInputPath(job, new Path(dst));
        FileOutputFormat.setOutputPath(job, new Path(dstOut));

        // use the custom Mapper and Reducer as the handlers of the two phases
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // set the combiner (left commented out here)
//        job.setCombinerClass(WordCountReducer.class);

        // set the number of reduce tasks
        job.setNumReduceTasks(2);
        // register the custom partitioner
        job.setPartitionerClass(MyPartioner.class);

        // key/value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // key/value types of the final (reduce) output; the reducer emits IntWritable counts
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // run the job and wait until it completes
        job.waitForCompletion(true);
        System.out.println("Job Finished");
        System.exit(0);
    }
}

A YARN issue

YARN may report an error like 1/1 local-dirs are bad: /Users/mubi/hadoop/hdfs/tmp/nm-local-dir; 1/1 log-di, which comes from the NodeManager's disk health check. Locally I worked around it by raising the disk-utilization threshold in yarn-site.xml as follows:

<property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>99</value>
</property>
