This post again takes word count as the example, this time to put custom partitioning into practice. It helps to first read the earlier post "hadoop 简单的MapReduce源码分析(源码&流程&word count日志)" to build some understanding of how Hadoop MapReduce works underneath, and then come back to this post and follow along hands-on to deepen that understanding. The experiment also gives a very direct look at data skew, a common problem in big data processing.
First, generate a skewed input file: every other line is the word hello, and the remaining lines are random words hello0 through hello99.
/**
 * Generate a large, skewed input file.
 */
public static void getBigFile() {
    try {
        String filePath = "/Users/mubi/test_data/words_skew.txt";
        File file = new File(filePath);
        if (!file.exists()) { // create the file if it does not exist yet
            file.createNewFile();
            System.out.println("file created, start writing");
        }
        FileWriter fw = new FileWriter(file);
        BufferedWriter bw = new BufferedWriter(fw);
        Random random = new Random();
        for (int i = 0; i < 1024 * 1024 * 256; i++) {
            String wordTmp;
            // every even index is the word "hello", which becomes the skewed (hot) key
            if ((i & 1) == 0) {
                wordTmp = "hello";
            } else {
                // otherwise a random word hello0 ~ hello99
                int randomIndex = random.nextInt(100);
                wordTmp = "hello" + randomIndex;
            }
            bw.write(wordTmp); // write one word
            bw.newLine();      // one word per line
        }
        bw.close();
        fw.close();
        System.out.println("file written");
    } catch (Exception e) {
        e.printStackTrace();
    }
}
After uploading the generated file to HDFS (the driver below reads hdfs://localhost:9000/input/words_skew; e.g. via hadoop fs -put) and running the word count job with two reducers, the job counters look like this:
20/05/08 22:30:39 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=8177945944
FILE: Number of bytes written=11655827679
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1865681612
HDFS: Number of bytes written=1606
HDFS: Number of read operations=48
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Job Counters
Launched map tasks=14
Launched reduce tasks=2
Data-local map tasks=14
Total time spent by all maps in occupied slots (ms)=1016987
Total time spent by all reduces in occupied slots (ms)=210559
Total time spent by all map tasks (ms)=1016987
Total time spent by all reduce tasks (ms)=210559
Total vcore-seconds taken by all map tasks=1016987
Total vcore-seconds taken by all reduce tasks=210559
Total megabyte-seconds taken by all map tasks=1041394688
Total megabyte-seconds taken by all reduce tasks=215612416
Map-Reduce Framework
Map input records=268435456
Map output records=268435456
Map output bytes=2939368746
Map output materialized bytes=3476239826
Input split bytes=1442
Combine input records=0
Combine output records=0
Reduce input groups=101
Reduce shuffle bytes=3476239826
Reduce input records=268435456
Reduce output records=101
Spilled Records=899921774
Shuffled Maps =28
Failed Shuffles=0
Merged Map outputs=28
GC time elapsed (ms)=16953
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=3240099840
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1865680170
File Output Format Counters
Bytes Written=1606
With no partitioner configured, MapReduce distributes keys to reducers with the default HashPartitioner:
/** Partition keys by their {@link Object#hashCode()}. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

}
Here numReduceTasks = 2 and there are in fact 101 distinct keys, so each reducer's output file ends up with roughly 50 keys. The two output files hold a similar number of keys, i.e. in terms of distinct keys the two reducers do comparable work (there can still be imbalance in the amount of data: the hot key hello, with 134,217,728 occurrences, i.e. half of all map output records, lands entirely on whichever reducer its hash assigns it to). The two reducer output files are listed after the sketch below; the first holds 50 keys and does not contain hello, the second holds 51 keys including hello.
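As a quick, hedged illustration (this snippet is not from the original post, and the class name DefaultPartitionDemo is just an illustrative choice), the default partition index can be computed locally for a few sample keys; with numReduceTasks = 2 the assignment depends only on Text.hashCode(), not on the word itself:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class DefaultPartitionDemo {
    public static void main(String[] args) {
        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<>();
        int numReduceTasks = 2; // same as job.setNumReduceTasks(2) in the driver below
        for (String word : new String[]{"hello", "hello0", "hello1", "hello99"}) {
            int partition = partitioner.getPartition(new Text(word), new IntWritable(1), numReduceTasks);
            System.out.println(word + " -> reducer " + partition);
        }
    }
}

Running something like this shows which part file each key would land in; with the default hash split the 101 keys happen to divide almost evenly, which is why the two outputs below each hold about 50 keys.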
hello1 1341325
hello10 1343473
hello12 1344220
hello14 1340394
hello16 1343187
hello18 1343303
hello21 1343380
hello23 1342026
hello25 1341196
hello27 1342497
hello29 1340774
hello3 1341453
hello30 1341756
hello32 1340401
hello34 1340890
hello36 1341383
hello38 1344090
hello41 1343010
hello43 1340936
hello45 1342636
hello47 1342940
hello49 1342628
hello5 1341706
hello50 1343112
hello52 1343793
hello54 1341610
hello56 1342225
hello58 1341521
hello61 1340893
hello63 1343427
hello65 1342251
hello67 1341848
hello69 1342101
hello7 1342317
hello70 1341821
hello72 1341972
hello74 1342041
hello76 1342539
hello78 1343379
hello81 1343054
hello83 1341773
hello85 1342390
hello87 1344358
hello89 1341553
hello9 1343188
hello90 1342887
hello92 1343126
hello94 1341217
hello96 1341221
hello98 1343300
hello 134217728
hello0 1342894
hello11 1341587
hello13 1343587
hello15 1339478
hello17 1341919
hello19 1342120
hello2 1342875
hello20 1342815
hello22 1343697
hello24 1342658
hello26 1340078
hello28 1340114
hello31 1343454
hello33 1343156
hello35 1342257
hello37 1340605
hello39 1341536
hello4 1341234
hello40 1341259
hello42 1341287
hello44 1341361
hello46 1343004
hello48 1342011
hello51 1341286
hello53 1342911
hello55 1341548
hello57 1343366
hello59 1342538
hello6 1341054
hello60 1342632
hello62 1342056
hello64 1342303
hello66 1342864
hello68 1342509
hello71 1343520
hello73 1343064
hello75 1343421
hello77 1342444
hello79 1342260
hello8 1343224
hello80 1342051
hello82 1341672
hello84 1341337
hello86 1341051
hello88 1341963
hello91 1341783
hello93 1340030
hello95 1343021
hello97 1341265
hello99 1341048
Now define a custom partitioner so that hello together with hello[0, 90) goes to one partition, and the remaining hello[90, 100) goes to the other:
static class MyPartioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text text, IntWritable intWritable, int i) {
        String key = text.toString();
        if (key.equals("hello")) {
            return 0;
        } else {
            int index = Integer.valueOf(key.substring("hello".length()));
            if (index < 90) {
                return 0;
            }
            return 1;
        }
    }
}
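As a quick sanity check (a hedged fragment, not from the original post; it assumes MyPartioner plus Hadoop's Text and IntWritable are on the classpath), the partitioner can be called directly; the value argument and the reducer count do not change the result here:

// call the partitioner directly to confirm the key -> partition mapping
MyPartioner partitioner = new MyPartioner();
System.out.println(partitioner.getPartition(new Text("hello"), new IntWritable(1), 2));   // 0
System.out.println(partitioner.getPartition(new Text("hello89"), new IntWritable(1), 2)); // 0
System.out.println(partitioner.getPartition(new Text("hello90"), new IntWritable(1), 2)); // 1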
With the partitioner registered in the driver via job.setPartitionerClass(MyPartioner.class) and job.setNumReduceTasks(2) (full code below), hello and hello[0, 90) are all sent to partition 0 and the remaining hello[90, 100) to partition 1, so the custom partitioner clearly puts each key exactly where we chose. In the output listed below (the two part files one after the other), the first file contains 91 keys, hello and hello0 through hello89, including the hot key, while the second contains only the 10 keys hello90 through hello99:
hello 134217728
hello0 1342894
hello1 1341325
hello10 1343473
hello11 1341587
hello12 1344220
hello13 1343587
hello14 1340394
hello15 1339478
hello16 1343187
hello17 1341919
hello18 1343303
hello19 1342120
hello2 1342875
hello20 1342815
hello21 1343380
hello22 1343697
hello23 1342026
hello24 1342658
hello25 1341196
hello26 1340078
hello27 1342497
hello28 1340114
hello29 1340774
hello3 1341453
hello30 1341756
hello31 1343454
hello32 1340401
hello33 1343156
hello34 1340890
hello35 1342257
hello36 1341383
hello37 1340605
hello38 1344090
hello39 1341536
hello4 1341234
hello40 1341259
hello41 1343010
hello42 1341287
hello43 1340936
hello44 1341361
hello45 1342636
hello46 1343004
hello47 1342940
hello48 1342011
hello49 1342628
hello5 1341706
hello50 1343112
hello51 1341286
hello52 1343793
hello53 1342911
hello54 1341610
hello55 1341548
hello56 1342225
hello57 1343366
hello58 1341521
hello59 1342538
hello6 1341054
hello60 1342632
hello61 1340893
hello62 1342056
hello63 1343427
hello64 1342303
hello65 1342251
hello66 1342864
hello67 1341848
hello68 1342509
hello69 1342101
hello7 1342317
hello70 1341821
hello71 1343520
hello72 1341972
hello73 1343064
hello74 1342041
hello75 1343421
hello76 1342539
hello77 1342444
hello78 1343379
hello79 1342260
hello8 1343224
hello80 1342051
hello81 1343054
hello82 1341672
hello83 1341773
hello84 1341337
hello85 1342390
hello86 1341051
hello87 1344358
hello88 1341963
hello89 1341553
hello9 1343188
hello90 1342887
hello91 1341783
hello92 1343126
hello93 1340030
hello94 1341217
hello95 1343021
hello96 1341221
hello97 1341265
hello98 1343300
hello99 1341048
http://localhost:8088/cluster/apps/ACCEPTED
One reducer finishes quickly, while the other keeps running for noticeably longer: the two reduce tasks took roughly 1.5 minutes and 5 minutes respectively, a large gap. This is data skew made visible: the reducer that received the hot key hello has to process far more records than the other.
Some approaches that are commonly considered for this kind of skew: map-side pre-aggregation with a combiner (the driver below already contains a commented-out job.setCombinerClass(WordCountReducer.class) line for this), or spreading the hot key over several reducers by salting it in the mapper; a hedged sketch of the salting idea follows.
// TODO try these out concretely
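The sketch below is not from the original post; it only illustrates the salting idea, assuming it sits next to the WC mapper/reducer classes below (same imports, plus java.util.Random). The mapper appends a random suffix to the hot key hello so that its records spread over both reducers; a second, much cheaper aggregation step would later strip the suffix and add the partial counts back together. The class name SaltedWordCountMapper and the "#" separator are purely illustrative choices.

static class SaltedWordCountMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {

    private final Random random = new Random();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split(" ")) {
            String outKey = word;
            if ("hello".equals(word)) {
                // emit "hello#0" or "hello#1" so the hot key no longer piles up on a single reducer
                outKey = word + "#" + random.nextInt(2);
            }
            context.write(new Text(outKey), new IntWritable(1));
        }
    }
}

For reference, the complete word count program used above, including the custom partitioner, is listed next.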
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.util.Iterator;

/**
 * @Author mubi
 * @Date 2020/4/28 23:39
 */
public class WC {

    /**
     * map function
     *
     * The four generic types are:
     * KeyIn    the Mapper's input key: the byte offset of each line (0, 11, ...)
     * ValueIn  the Mapper's input value: the text of the line
     * KeyOut   the Mapper's output key: a "word" from the line
     * ValueOut the Mapper's output value: the count "1" for that word
     */
    static class WordCountMapper extends
            Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // convert the line handed to us by the map task into a String
            String line = value.toString();
            // split the line into words on spaces
            String[] words = line.split(" ");
            // emit each word as <word, 1>
            for (String word : words) {
                // the word is the key and the count is the value, so the shuffle groups
                // by word and all identical words reach the same reduce task
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }

    /**
     * reduce function
     *
     * The four generic types are:
     * KeyIn    the Reducer's input key: a "word"
     * ValueIn  the Reducer's input values: the list of counts for that word
     * KeyOut   the Reducer's output key: the distinct "word"
     * ValueOut the Reducer's output value: the total count of the word
     */
    static class WordCountReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            int count = 0;
            Iterator<IntWritable> it = values.iterator();
            while (it.hasNext()) {
                count += it.next().get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    static class MyPartioner extends Partitioner<Text, IntWritable> {

        @Override
        public int getPartition(Text text, IntWritable intWritable, int i) {
            String key = text.toString();
            if (key.equals("hello")) {
                return 0;
            } else {
                int index = Integer.valueOf(key.substring("hello".length()));
                if (index < 90) {
                    return 0;
                }
                return 1;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // input path
        String dst = "hdfs://localhost:9000/input/words_skew";
        // String dst = "hdfs://localhost:9000" + args[1];
        // hadoop fs -cat hdfs://localhost:9000/output/wcoutput/part-r-00000

        // output path; it must not exist yet (not even as an empty directory)
        String dstOut = "hdfs://localhost:9000/output/wcoutput";
        // String dstOut = "hdfs://localhost:9000" + args[2];

        Configuration hadoopConfig = new Configuration();
        hadoopConfig.set("fs.hdfs.impl",
                org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()
        );
        hadoopConfig.set("fs.file.impl",
                org.apache.hadoop.fs.LocalFileSystem.class.getName()
        );

        // if the output directory already exists, delete it first
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://localhost:9000"), hadoopConfig);
        Path outputPath = new Path("/output/wcoutput");
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        Job job = Job.getInstance(hadoopConfig);
        // needed when the job is packaged as a jar and run with:
        // hadoop jar wordcount-1.0-SNAPSHOT.jar WC
        job.setJarByClass(WC.class);
        job.setJobName("WC");

        // input and output paths used when the job runs
        FileInputFormat.addInputPath(job, new Path(dst));
        FileOutputFormat.setOutputPath(job, new Path(dstOut));

        // the custom Mapper and Reducer classes for the two phases
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // combiner (optional)
        // job.setCombinerClass(WordCountReducer.class);
        // number of reduce tasks
        job.setNumReduceTasks(2);
        // custom partitioner
        job.setPartitionerClass(MyPartioner.class);
        // key and value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // key and value types of the final (reduce) output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // run the job and wait until it finishes
        job.waitForCompletion(true);
        System.out.println("Job Finished");
        System.exit(0);
    }
}
While running the job you may hit the error 1/1 local-dirs are bad: /Users/mubi/hadoop/hdfs/tmp/nm-local-dir; 1/1 log-di
This is a disk utilization problem: the NodeManager marks its local and log dirs as bad once disk usage exceeds the health checker's threshold. Locally I fixed it by editing yarn-site.xml and simply raising the threshold:
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>99</value>
</property>