First, a review of two core points.
The MapReduce programming model splits a data processing flow into two phases:
Phase 1: read the raw data and turn it into key-value pairs (the map method)
Phase 2: group the key-value pairs from phase 1 by key and aggregate each group (the reduce method)
Concrete implementations (software) of this programming model: the MapReduce framework in Hadoop, and Spark.
The MapReduce framework in Hadoop:
Phase 1 of the model is implemented by the map task, phase 2 by the reduce task (a minimal sketch of both follows this review).
map task:
Read data: InputFormat --> TextInputFormat reads text files
           --> SequenceFileInputFormat reads Sequence files --> DBInputFormat reads from a database
Process data: the map task calls the map() method of the Mapper class on each record
Partition: the key-value pairs produced by the map phase are spread over several reduce tasks to share the load; the map task calls the getPartition() method of the Partitioner class to decide which reduce task each pair goes to. Sort: the key-value pairs are ordered by calling the key's compareTo() method.
reduce task:
Read data: each reduce task pulls its own "partition" of data from the map tasks' output files over HTTP to local disk, then merges the per-map-task partition files (merge sort). Process data: it calls the compare() method of the GroupingComparator to decide which key-value pairs in the merged data belong to the same group, then hands each group to one call of the reduce() method of the Reducer class for aggregation.
Write results: the OutputFormat component writes the result key-value pairs out
OutputFormat --> TextOutputFormat writes text files (one key-value pair per line, separated by \t)
             --> SequenceFileOutputFormat writes Sequence files (serializes the key-value objects straight into the file) --> DBOutputFormat writes to a database
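To make the two phases and the framework pieces above concrete, here is a minimal word-count style sketch. It is my own illustration, not part of the join example below; it assumes Hadoop's org.apache.hadoop.mapreduce API, and the class names are made up:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
    // Phase 1 (map task): read one line of text, emit a (word, 1) pair per word
    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text word = new Text();
        private final IntWritable one = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String w : value.toString().trim().split("\\s+")) {
                word.set(w);
                context.write(word, one);
            }
        }
    }
    // Phase 2 (reduce task): the framework has already partitioned, sorted and grouped
    // the pairs by key; each reduce() call aggregates one group
    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

The partitioning, sorting, and grouping described above all happen between these two classes: by default the keys are sorted and grouped with Text's own compareTo().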
Now for the join itself. It is similar in spirit to a left join in SQL, roughly the same idea:
1. First, write the bean:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// One bean type that can carry fields from either the order table or the user table;
// tableName records which table a record came from.
public class JoinBean implements Writable {
private String orderId;
private String userId;
private String userName;
private int userAge;
private String userFriend;
private String tableName;
public void set(String orderId, String userId, String userName, int userAge, String userFriend, String tableName) {
this.orderId = orderId;
this.userId = userId;
this.userName = userName;
this.userAge = userAge;
this.userFriend = userFriend;
this.tableName = tableName;
}
public String getTableName() {
return tableName;
}
public void setTableName(String tableName) {
this.tableName = tableName;
}
public String getOrderId() {
return orderId;
}
public void setOrderId(String orderId) {
this.orderId = orderId;
}
public String getUserId() {
return userId;
}
public void setUserId(String userId) {
this.userId = userId;
}
public String getUserName() {
return userName;
}
public void setUserName(String userName) {
this.userName = userName;
}
public int getUserAge() {
return userAge;
}
public void setUserAge(int userAge) {
this.userAge = userAge;
}
public String getUserFriend() {
return userFriend;
}
public void setUserFriend(String userFriend) {
this.userFriend = userFriend;
}
@Override
public String toString() {
return this.orderId + "," + this.userId + "," + this.userAge + "," + this.userName + "," + this.userFriend;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(this.orderId);
out.writeUTF(this.userId);
out.writeUTF(this.userName);
out.writeInt(this.userAge);
out.writeUTF(this.userFriend);
out.writeUTF(this.tableName);
}
@Override
public void readFields(DataInput in) throws IOException {
this.orderId = in.readUTF();
this.userId = in.readUTF();
this.userName = in.readUTF();
this.userAge = in.readInt();
this.userFriend = in.readUTF();
this.tableName = in.readUTF();
}
}
2. Then write the MR job. The map side turns records from both files into the same kind of bean, keyed by userId with the bean as the value; the reduce side receives all beans with the same userId and joins them.
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {
public static class ReduceSideJoinMapper extends Mapper<LongWritable, Text, Text, JoinBean> {
String fileName = null;
JoinBean bean = new JoinBean();
Text k = new Text();
/**
 * The map task calls setup() once before it starts calling map() repeatedly on each line.
 */
@Override
protected void setup(Context context)
throws IOException, InterruptedException {
FileSplit inputSplit = (FileSplit) context.getInputSplit();
fileName = inputSplit.getPath().getName();
}
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// Assumed (illustrative) input layouts, inferred from how the fields are used below:
//   order files: orderId,userId,...
//   user files:  userId,userName,age,friend
String[] fields = value.toString().split(",");
if (fileName.startsWith("order")) {
bean.set(fields[0], fields[1], "NULL", -1, "NULL", "order");
} else {
bean.set("NULL", fields[0], fields[1], Integer.parseInt(fields[2]), fields[3], "user");
}
k.set(bean.getUserId());
context.write(k, bean);
}
}
public static class ReduceSideJoinReducer extends Reducer<Text, JoinBean, JoinBean, NullWritable> {
@Override
protected void reduce(Text key, Iterable<JoinBean> beans, Context context)
throws IOException, InterruptedException {
ArrayList<JoinBean> orderList = new ArrayList<>();
JoinBean userBean = null;
try {
// Separate the two kinds of records: collect the order beans, keep the user bean aside.
// Hadoop reuses the same bean object while iterating the values, so each one is copied first.
for (JoinBean bean : beans) {
if ("order".equals(bean.getTableName())) {
JoinBean newBean = new JoinBean();
BeanUtils.copyProperties(newBean, bean);
orderList.add(newBean);
}else{
userBean = new JoinBean();
BeanUtils.copyProperties(userBean, bean);
}
}
// Fill each order bean with the user's fields (guarding against a missing user record) and write it out
for (JoinBean bean : orderList) {
if (userBean != null) {
bean.setUserName(userBean.getUserName());
bean.setUserAge(userBean.getUserAge());
bean.setUserFriend(userBean.getUserFriend());
}
context.write(bean, NullWritable.get());
}
} catch (IllegalAccessException | InvocationTargetException e) {
e.printStackTrace();
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(ReduceSideJoin.class);
job.setMapperClass(ReduceSideJoinMapper.class);
job.setReducerClass(ReduceSideJoinReducer.class);
job.setNumReduceTasks(2); // two reduce tasks, so the output directory will contain two part-r-xxxxx files
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(JoinBean.class);
job.setOutputKeyClass(JoinBean.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.setInputPaths(job, new Path("F:\\mrdata\\join\\input"));
FileOutputFormat.setOutputPath(job, new Path("F:\\mrdata\\join\\out1"));
job.waitForCompletion(true);
}
}
Having finished this, I am still not entirely comfortable, because it is not very efficient: the reduce task has to iterate over the values and cache the order beans in memory, since there is no guarantee which table's record arrives first. If the record you actually need (the user record) were guaranteed to come first, each order record could be joined and written out immediately, without caching anything. To get that guarantee the sort order has to change: instead of sorting only by userId, the table name has to go into the key as well, since the key is what drives the sort. But a composite key breaks the data distribution, because the partitioner would then hash on the whole key, so you also need a custom Partitioner that partitions by userId only, and a GroupingComparator that groups by userId only. In other words, the Partitioner + compareTo + GroupingComparator combination implements this efficiently (a rough sketch is below). In a couple of days I will put together what I have learned and try it out.
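Here is a rough sketch of what that combination could look like; this is my reading of the standard secondary-sort trick, not tested code. Assume the mapper emits a Text key of the form userId + "\t" + flag, where the flag is "0" for the user record and "1" for order records, so the default Text sort puts the user record first within each group; the class names are made up, and these classes would sit inside ReduceSideJoin next to the mapper and reducer:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition by the userId part of the composite key only, so a user's record
// and all of that user's orders land on the same reduce task.
public static class UserIdPartitioner extends Partitioner<Text, JoinBean> {
    @Override
    public int getPartition(Text key, JoinBean value, int numPartitions) {
        String userId = key.toString().split("\t")[0];
        return (userId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group by the userId part only, so one reduce() call still sees the user record
// plus all of that user's orders, even though the full keys differ in the flag.
public static class UserIdGroupingComparator extends WritableComparator {
    protected UserIdGroupingComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ua = a.toString().split("\t")[0];
        String ub = b.toString().split("\t")[0];
        return ua.compareTo(ub);
    }
}

// Wiring in main(), in addition to the settings already there:
// job.setPartitionerClass(UserIdPartitioner.class);
// job.setGroupingComparatorClass(UserIdGroupingComparator.class);

With this in place, the reducer can treat the first value of each group as the user record, copy its name, age, and friend fields once, and then write out every following order bean immediately, with no list cached in memory.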