MapReduce framework review and practice: the MapReduce join algorithm (day seventeen)

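First, a review of two core points.
The MapReduce programming model splits a computation into two phases:
Phase 1: read the raw data and turn it into key-value pairs (the map method)
Phase 2: group the key-value pairs from phase 1 by key and aggregate each group (the reduce method)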
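
To make the two phases concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop code: the class name TwoPhaseSketch and both method names are made up for illustration, and word counting is just an assumed example workload.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the programming model; no Hadoop classes involved.
class TwoPhaseSketch {

    // Phase 1 ("map"): turn each raw line into (word, 1) key-value pairs.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Phase 2 ("reduce"): group the pairs by key and sum the values of each group.
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            result.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b c");
        System.out.println(reducePhase(mapPhase(lines)));
    }
}
```

In a real job the framework performs the grouping between the two phases; here the TreeMap stands in for that step.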

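Concrete implementations (software) of the MapReduce programming model: the MapReduce framework in Hadoop, and Spark.
In Hadoop's MapReduce framework:
phase 1 of the programming model is implemented by the map task, and phase 2 is implemented by the reduce task.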

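map task:
Reading data: InputFormat --> TextInputFormat reads text files
--> SequenceFileInputFormat reads Sequence files
--> DBInputFormat reads from a database
Processing data: the map task calls the map() method of the Mapper class to process the data

Partitioning: the key-value data produced in the map phase is distributed across several reduce tasks to share the load; the map task calls the getPartition() method of the Partitioner class to decide which reduce task each record goes to
Sorting the key-value data: the key's compareTo() method is called to sort the key-value data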
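
The partition and sort steps can be sketched without Hadoop as follows. The formula in getPartition is the one used by Hadoop's default HashPartitioner, but here it is applied to a plain String as a stand-in for a real key type, so the partition numbers will not necessarily match those of a Text key. The class name PartitionSortSketch and the method partitionAndSort are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for what a map task does after map(): partition, then sort.
class PartitionSortSketch {

    // Same formula as Hadoop's default HashPartitioner, applied to a String key.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Put every key into the bucket of its reduce task, then sort each bucket with compareTo().
    static List<List<String>> partitionAndSort(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(getPartition(key, numReduceTasks)).add(key);
        }
        for (List<String> bucket : buckets) {
            Collections.sort(bucket); // String.compareTo() decides the order
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("u003", "u001", "u002", "u001");
        System.out.println(partitionAndSort(keys, 2));
    }
}
```

Because the partition depends only on the key, equal keys always land in the same bucket, which is what lets a single reduce task see every record for a given key.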

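reduce task:
Reading data: it downloads the data of its own partition over HTTP from the data files produced by the map tasks onto local disk, then merges the several files of the same partition (merge sort)
Processing data: it calls the compare() method of the GroupingComparator to decide which key-value pairs in the file belong to the same group, then passes that group to the reduce() method of the Reducer class, which aggregates it in one call
Writing results: it calls the OutputFormat component to write the result key-value data out
OutputFormat --> TextOutputFormat writes text files (one key-value pair per line, separated by \t)
--> SequenceFileOutputFormat writes Sequence files (serializes the key-value objects directly into the file)
--> DBOutputFormat writes to a database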
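
The grouping step on the reduce side can be sketched like this: walk through the already-sorted records, and keep adding records to the current group as long as the comparator says the keys are equal. This is a simplified in-memory illustration, not the framework's actual code; GroupingSketch and reduceGroups are hypothetical names, records are modeled as {key, value} string arrays, and the "reduce" here just joins the values with commas. Each output line uses a tab between key and value, as TextOutputFormat does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of grouping sorted key-value records and reducing each group once.
class GroupingSketch {

    // sorted: records as {key, value}, already sorted by key.
    static List<String> reduceGroups(List<String[]> sorted, Comparator<String> groupingComparator) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            String groupKey = sorted.get(i)[0];
            StringBuilder values = new StringBuilder(sorted.get(i)[1]);
            int j = i + 1;
            // compare() == 0 means "same group", so the record joins the current reduce() call
            while (j < sorted.size() && groupingComparator.compare(groupKey, sorted.get(j)[0]) == 0) {
                values.append(',').append(sorted.get(j)[1]);
                j++;
            }
            out.add(groupKey + "\t" + values); // one key-value pair per line, tab-separated
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> sorted = Arrays.asList(
                new String[] {"a", "1"}, new String[] {"a", "2"}, new String[] {"b", "3"});
        for (String line : reduceGroups(sorted, Comparator.<String>naturalOrder())) {
            System.out.println(line);
        }
    }
}
```

Since only adjacent records are compared, the grouping is only correct when the input is sorted in an order compatible with the comparator, which is why the merge sort comes first.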

 

Now for the join, which is roughly what a left join does in SQL:

1. First, write the bean

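import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class JoinBean implements Writable {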

	private String orderId;
	private String userId;
	private String userName;
	private int userAge;
	private String userFriend;
	private String tableName;

	public void set(String orderId, String userId, String userName, int userAge, String userFriend, String tableName) {
		this.orderId = orderId;
		this.userId = userId;
		this.userName = userName;
		this.userAge = userAge;
		this.userFriend = userFriend;
		this.tableName = tableName;
	}

	public String getTableName() {
		return tableName;
	}

	public void setTableName(String tableName) {
		this.tableName = tableName;
	}

	public String getOrderId() {
		return orderId;
	}

	public void setOrderId(String orderId) {
		this.orderId = orderId;
	}

	public String getUserId() {
		return userId;
	}

	public void setUserId(String userId) {
		this.userId = userId;
	}

	public String getUserName() {
		return userName;
	}

	public void setUserName(String userName) {
		this.userName = userName;
	}

	public int getUserAge() {
		return userAge;
	}

	public void setUserAge(int userAge) {
		this.userAge = userAge;
	}

	public String getUserFriend() {
		return userFriend;
	}

	public void setUserFriend(String userFriend) {
		this.userFriend = userFriend;
	}

	@Override
	public String toString() {
		return this.orderId + "," + this.userId + "," + this.userAge + "," + this.userName + "," + this.userFriend;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(this.orderId);
		out.writeUTF(this.userId);
		out.writeUTF(this.userName);
		out.writeInt(this.userAge);
		out.writeUTF(this.userFriend);
		out.writeUTF(this.tableName);

	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.orderId = in.readUTF();
		this.userId = in.readUTF();
		this.userName = in.readUTF();
		this.userAge = in.readInt();
		this.userFriend = in.readUTF();
		this.tableName = in.readUTF();

	}

}
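
What write() and readFields() have to guarantee can be checked with a round trip through plain java.io streams. The sketch below uses a cut-down, hypothetical MiniBean with three fields (it does not implement Hadoop's Writable, so it runs without Hadoop on the classpath), and the helper roundTrip is an assumption for illustration. The key point is that readFields() must read the fields in exactly the order write() wrote them. Note also that writeUTF throws a NullPointerException when given null, which is presumably why the mapper in the next step fills unused String fields with the "NULL" placeholder.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical round-trip check for a Writable-style bean; not part of the original job.
class WritableRoundTripSketch {

    // Cut-down stand-in for JoinBean with the same write/readFields pattern.
    static class MiniBean {
        String userId;
        int userAge;
        String tableName;

        void write(DataOutput out) throws IOException {
            out.writeUTF(userId);
            out.writeInt(userAge);
            out.writeUTF(tableName);
        }

        // Must read the fields in the same order write() wrote them.
        void readFields(DataInput in) throws IOException {
            userId = in.readUTF();
            userAge = in.readInt();
            tableName = in.readUTF();
        }
    }

    // Serialize the bean to bytes and rebuild a fresh copy from those bytes.
    static MiniBean roundTrip(MiniBean source) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            source.write(new DataOutputStream(buffer));
            MiniBean copy = new MiniBean();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MiniBean bean = new MiniBean();
        bean.userId = "u001";
        bean.userAge = 18;
        bean.tableName = "user";
        MiniBean copy = roundTrip(bean);
        System.out.println(copy.userId + "," + copy.userAge + "," + copy.tableName);
    }
}
```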

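2. Then write the MR job: the map side turns the records of both files into the same bean type, with userId as the key and the bean as the value; the reduce side receives all beans with the same userId and joins them.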

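import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;

// BeanUtils is Apache Commons BeanUtils: copyProperties(dest, orig)
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

	public static class ReduceSideJoinMapper extends Mapper<LongWritable, Text, Text, JoinBean> {
		String fileName = null;
		JoinBean bean = new JoinBean();
		Text k = new Text();

		/**
		 * The map task calls setup() once before it processes any data, and only
		 * after that calls map() repeatedly, once per line.
		 */
		@Override
		protected void setup(Context context)
				throws IOException, InterruptedException {
			FileSplit inputSplit = (FileSplit) context.getInputSplit();
			fileName = inputSplit.getPath().getName();
		}

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {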

			String[] fields = value.toString().split(",");

			if (fileName.startsWith("order")) {
				bean.set(fields[0], fields[1], "NULL", -1, "NULL", "order");
			} else {
				bean.set("NULL", fields[0], fields[1], Integer.parseInt(fields[2]), fields[3], "user");
			}
			k.set(bean.getUserId());
			context.write(k, bean);

		}

	}

	public static class ReduceSideJoinReducer extends Reducer<Text, JoinBean, JoinBean, NullWritable> {

		@Override
		protected void reduce(Text key, Iterable<JoinBean> beans, Context context)
				throws IOException, InterruptedException {
			ArrayList<JoinBean> orderList = new ArrayList<>();
			JoinBean userBean = null;

			try {
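				// Separate the two kinds of records. Hadoop reuses the same value object
				// across iterations, so each bean is copied before it is kept.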
				for (JoinBean bean : beans) {
					if ("order".equals(bean.getTableName())) {
						JoinBean newBean = new JoinBean();
						BeanUtils.copyProperties(newBean, bean);
						orderList.add(newBean);
					}else{
						userBean = new JoinBean();
						BeanUtils.copyProperties(userBean, bean);
					}

				}
				
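				// Join the data and write it out
				for (JoinBean bean : orderList) {
					// Left-join behavior: an order with no matching user keeps its placeholder fields
					if (userBean != null) {
						bean.setUserName(userBean.getUserName());
						bean.setUserAge(userBean.getUserAge());
						bean.setUserFriend(userBean.getUserFriend());
					}

					context.write(bean, NullWritable.get());

				}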
			} catch (IllegalAccessException | InvocationTargetException e) {
				e.printStackTrace();
			}

		}

	}
	
	
	public static void main(String[] args) throws Exception {

		
		Configuration conf = new Configuration();  
		
		Job job = Job.getInstance(conf);

		job.setJarByClass(ReduceSideJoin.class);

		job.setMapperClass(ReduceSideJoinMapper.class);
		job.setReducerClass(ReduceSideJoinReducer.class);
		
		job.setNumReduceTasks(2);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(JoinBean.class);
		
		job.setOutputKeyClass(JoinBean.class);
		job.setOutputValueClass(NullWritable.class);

		FileInputFormat.setInputPaths(job, new Path("F:\\mrdata\\join\\input"));
		FileOutputFormat.setOutputPath(job, new Path("F:\\mrdata\\join\\out1"));

		// Exit with a non-zero status if the job fails
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}

 

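Having finished this, I am still uneasy, because it is inefficient: the reduce task has to iterate over the group and buffer the order beans in memory, since it cannot know which file's records the iterator will return first. If the user record always came first, each order record after it could be joined and written out immediately. That requires changing the sort order. The job above sorts by userId only, so the table name (which affects the sort) has to go into the key together with userId. But then partitioning breaks too, because by default the whole key (table name and userId) decides where a record is sent; the partitioner must therefore use userId alone, and grouping must compare userId alone. A Partitioner + compareTo + GroupingComparator combination implements this efficiently. In a couple of days I will put together what I have learned and try it.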
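
As a rough preview of that idea, here is an in-memory sketch in plain Java. It is not the Hadoop implementation: SecondarySortJoinSketch, Record, sameGroup and join are hypothetical names, getPartition and sameGroup only mimic the roles a Partitioner and a GroupingComparator would play, and the payload field is an assumed simplification (user name for user records, order id for order records).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the secondary-sort join; none of these names come from Hadoop.
class SecondarySortJoinSketch {

    // Composite key plus payload: userId decides partition and group, tableName only refines the sort.
    static class Record implements Comparable<Record> {
        final String userId;
        final String tableName; // "user" or "order"
        final String payload;   // user name, or order id

        Record(String userId, String tableName, String payload) {
            this.userId = userId;
            this.tableName = tableName;
            this.payload = payload;
        }

        // Sort by userId first; within one userId the "user" record precedes all "order" records.
        @Override
        public int compareTo(Record other) {
            int byUser = userId.compareTo(other.userId);
            if (byUser != 0) {
                return byUser;
            }
            return Integer.compare(rank(tableName), rank(other.tableName));
        }

        private static int rank(String tableName) {
            return "user".equals(tableName) ? 0 : 1;
        }
    }

    // Partition on userId alone, so both tables' records for one user reach the same reduce task.
    static int getPartition(Record key, int numReduceTasks) {
        return (key.userId.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Grouping rule: records belong to the same group when their userId matches.
    static boolean sameGroup(Record a, Record b) {
        return a.userId.equals(b.userId);
    }

    // Streams through the sorted records, holding only the current user's name in memory.
    static List<String> join(List<Record> records) {
        List<Record> sorted = new ArrayList<>(records);
        Collections.sort(sorted);
        List<String> out = new ArrayList<>();
        Record previous = null;
        String userName = null;
        for (Record r : sorted) {
            if (previous == null || !sameGroup(previous, r)) {
                userName = null; // a new group starts
            }
            if ("user".equals(r.tableName)) {
                userName = r.payload;
            } else {
                out.add(r.payload + "," + r.userId + "," + (userName == null ? "NULL" : userName));
            }
            previous = r;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<>();
        records.add(new Record("u1", "order", "o1"));
        records.add(new Record("u1", "user", "tom"));
        System.out.println(join(records));
    }
}
```

Compared with the job above, no list of order beans is buffered: once the user record has been seen, every following order of the same group can be written out right away.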
