First, let's clarify a few concepts:
1. Partition
1) Partition:
By default, records are partitioned by a hash of the key, but this can easily cause "data skew" (most of the data lands on the same reducer, hurting performance),
so sometimes a custom partitioner is needed.
2) What a partition is: it determines which reducer each key/value pair is sent to.
Which key goes to which reducer is decided by the Partitioner
(override getPartition(Text key, Text value, int numPartitions)).
3) How do you write a custom partitioner?
Just define a class that extends Partitioner and overrides its getPartition method, then register it by calling setPartitionerClass on the Job.
4) The default partitioner
The default Partitioner is HashPartitioner: it takes the key's hash value modulo the number of reducers to pick the target reducer. This guarantees that identical keys always end up on the same reducer (see the sketch after this list).
5) Execution flow
The map output is distributed to the reducers by the partitioner. If a Combiner is set, the map output is first merged by the Combiner, then partitioned, and the merged data is sent to the reducers.
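For reference, the default HashPartitioner described in 4) is roughly equivalent to the one-liner below; the & Integer.MAX_VALUE only clears the sign bit so the modulo result is never negative:

import org.apache.hadoop.mapreduce.Partitioner;

// Roughly what Hadoop's built-in HashPartitioner does.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}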
2. Grouping
1) Concept:
Grouping decides which keys are placed in the same group, i.e. handled by a single reduce() call.
2) Custom grouping comparator
Define a class that extends WritableComparator and override compare() to set the grouping strategy;
you also need to declare the custom grouping class:
job.setGroupingComparatorClass(SencondarySortGroupComparator.class); // custom grouping
3) Sorting within each group (an optimization)
This means providing a custom RawComparator-style (byte-level) comparator; the framework has a default, so this step can be skipped.
4) How do you customize the in-group sort? As follows:
Extend WritableComparator and override compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
you also need to declare it:
job.setSortComparatorClass(SencondarySortComparator.class); // custom in-group sort
Let's work through an example first to get a better feel for secondary sort.
Secondary sort means: for records whose first field is equal, sort them by the second field.
For example, an e-commerce platform records the amount of every order placed by every user. The requirement is to sort all order amounts belonging to the same user,
and the user names in the output must be sorted as well.
account           cost
hadoop@apache 200
hive@apache 550
yarn@apache 580
hive@apache 159
hadoop@apache 300
hive@apache 258
hadoop@apache 300
The result after secondary sort:
account           cost
hadoop@apache 200
hadoop@apache 300
hadoop@apache 300
hive@apache 159
hive@apache 258
hive@apache 550
yarn@apache 580
The code:
a. Implement the custom Writable key class
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key: account + cost. compareTo() orders by account first, then cost,
// which is exactly the secondary-sort order we want.
public class AccountBean implements WritableComparable<AccountBean> {
    private Text accout;
    private IntWritable cost;

    public AccountBean() {
        setAccout(new Text());
        setCost(new IntWritable());
    }

    public AccountBean(Text accout, IntWritable cost) {
        this.accout = accout;
        this.cost = cost;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        accout.write(out);
        cost.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        accout.readFields(in);
        cost.readFields(in);
    }

    @Override
    public int compareTo(AccountBean o) {
        int tmp = accout.compareTo(o.accout);
        if (tmp == 0) {
            return cost.compareTo(o.cost);
        }
        return tmp;
    }

    public Text getAccout() {
        return accout;
    }

    public void setAccout(Text accout) {
        this.accout = accout;
    }

    public IntWritable getCost() {
        return cost;
    }

    public void setCost(IntWritable cost) {
        this.cost = cost;
    }

    @Override
    public String toString() {
        return accout + "\t" + cost;
    }
}
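One optional refinement, not in the original code: this example does not need hashCode()/equals() on AccountBean, because the custom partitioner in step b hashes the account field itself. If you instead relied on the default HashPartitioner with this composite key, you would want a hashCode based on the account only, so that all of a user's orders still reach the same reducer. A minimal sketch under that assumption:

// Hypothetical additions to AccountBean, only needed with the default HashPartitioner.
@Override
public int hashCode() {
    return accout.hashCode();   // partition on account only, keeping one user's orders together
}

@Override
public boolean equals(Object obj) {
    if (!(obj instanceof AccountBean)) return false;
    AccountBean other = (AccountBean) obj;
    return accout.equals(other.accout) && cost.equals(other.cost);
}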
b. Custom partitioner: partition by account. Based on the key (or value) and the number of reduce tasks, it decides which reduce task each output record is ultimately handled by.

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class SencondarySortPartition extends Partitioner<AccountBean, NullWritable> {
    @Override
    public int getPartition(AccountBean key, NullWritable value, int numPartitions) {
        // Hash only the account field so that all orders of one user land in the same partition.
        return (key.getAccout().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
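Note that a custom partitioner only has a visible effect when the job runs with more than one reduce task; with the default single reducer everything falls into partition 0. To actually exercise it you could raise the reducer count in the driver, e.g. (a hypothetical addition, not present in the driver below):

job.setNumReduceTasks(3); // hypothetical: three reduce tasks, one output file per partition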
c. Custom grouping comparator: group by account. Keys that compare as equal here are placed in the same group, i.e. they are handled by a single reduce() call.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class SencondarySortGroupComparator extends WritableComparator {
    public SencondarySortGroupComparator() {
        super(AccountBean.class, true); // true: instantiate keys so the object-level compare below is used
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        AccountBean acc1 = (AccountBean) a;
        AccountBean acc2 = (AccountBean) b;
        return acc1.getAccout().compareTo(acc2.getAccout()); // same account => same group
    }
}
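With this comparator the reducer is invoked once per account on the sample data above: hadoop@apache (costs 200, 300, 300), hive@apache (159, 258, 550) and yarn@apache (580). Because the composite key still sorts by account and then cost, the records inside each call already arrive in ascending cost order.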
d. Custom raw comparator: implements the byte-level key sort that produces the ordering inside each group. It is purely an optimization and can be omitted.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

public class SencondarySortComparator extends WritableComparator {
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    private static final IntWritable.Comparator INTWRITABLE_COMPARATOR = new IntWritable.Comparator();

    public SencondarySortComparator() {
        super(AccountBean.class);
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            // Length of the serialized account (Text) field: vint length prefix + string bytes.
            int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
            int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
            // Compare the account field first...
            int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            if (cmp != 0) {
                return cmp;
            }
            // ...then, for equal accounts, compare the cost (IntWritable) field.
            return INTWRITABLE_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                                                  b2, s2 + firstL2, l2 - firstL2);
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Optionally register this comparator as the default one for AccountBean:
    // static {
    //     WritableComparator.define(AccountBean.class, new SencondarySortComparator());
    // }
}
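In fact this class is optional: if setSortComparatorClass is never called, the framework falls back to a generic WritableComparator that deserializes the keys and calls AccountBean.compareTo(), which already sorts by account and then cost. The raw version above merely avoids that deserialization. An object-level equivalent would look like the following sketch (not part of the original code; imports as in the grouping comparator above):

// Hypothetical, simpler (non-raw) sort comparator: delegate to AccountBean.compareTo().
public class SimpleSortComparator extends WritableComparator {
    protected SimpleSortComparator() {
        super(AccountBean.class, true); // true: create key instances for the object-level compare
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((AccountBean) a).compareTo((AccountBean) b);
    }
}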
e. Write the Mapper

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SencondarySortMapper extends Mapper<LongWritable, Text, AccountBean, NullWritable> {
    private AccountBean acc = new AccountBean();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line looks like: "hadoop@apache 200".
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
            acc.setAccout(new Text(st.nextToken()));
            acc.setCost(new IntWritable(Integer.parseInt(st.nextToken())));
            // The composite bean itself is the map output key; no separate value is needed.
            context.write(acc, NullWritable.get());
        }
    }
}
f. Write the Reducer

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SencondarySortReducer extends Reducer<AccountBean, NullWritable, AccountBean, NullWritable> {
    @Override
    protected void reduce(AccountBean key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // One reduce() call per account. While iterating the values, the framework refills
        // the key object with the current record, so writing the key on every iteration
        // emits that account's costs in sorted order.
        for (NullWritable nullWritable : values) {
            context.write(key, NullWritable.get());
        }
    }
}
g. Write the driver (main class)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SencondarySortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Delete the output directory if it already exists, otherwise the job fails.
        Path outfile = new Path("file:///D:/outtwo1");
        FileSystem fs = outfile.getFileSystem(conf);
        if (fs.exists(outfile)) {
            fs.delete(outfile, true);
        }

        Job job = Job.getInstance(conf);
        job.setJarByClass(SencondarySortDriver.class);
        job.setJobName("Sencondary Sort");
        job.setMapperClass(SencondarySortMapper.class);
        job.setReducerClass(SencondarySortReducer.class);
        job.setOutputKeyClass(AccountBean.class);
        job.setOutputValueClass(NullWritable.class);

        // Declare the custom partitioner and grouping comparator.
        job.setPartitionerClass(SencondarySortPartition.class);
        job.setGroupingComparatorClass(SencondarySortGroupComparator.class);
        // job.setSortComparatorClass(SencondarySortComparator.class); // optional: custom in-group (raw) sort comparator

        FileInputFormat.addInputPath(job, new Path("file:///D:/测试数据/二次排序/"));
        FileOutputFormat.setOutputPath(job, outfile);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
h. Run result
hadoop@apache 200
hadoop@apache 300
hadoop@apache 300
hive@apache 159
hive@apache 258
hive@apache 550
yarn@apache 580