Summary of MR project optimizations

---------------------------------MR runtime parameter tuning

MapReduce task parameter tuning

Hadoop optimization, part 1: HDFS/MapReduce

MapReduce-related parameters

The official MapReduce documentation

The articles above can serve as references for platform-internal tuning. In my experience, though, parameter tuning only pays off if you understand MR internals; tuning blindly tends to make things worse, not better.

The parameters can be set in code like this:

configuration.setDouble(Job.SHUFFLE_INPUT_BUFFER_PERCENT, 0.25);
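
For context, here is a minimal driver sketch that sets a few common knobs on the Configuration before the job is created. The keys are standard Hadoop 2.x names; the values are purely illustrative, not recommendations:

Configuration configuration = new Configuration();
// size of the map-side sort buffer in MB (default is 100)
configuration.setInt("mapreduce.task.io.sort.mb", 200);
// compress intermediate map output to reduce shuffle traffic
configuration.setBoolean("mapreduce.map.output.compress", true);
// fraction of the reducer heap used to buffer shuffled map output
configuration.setDouble(Job.SHUFFLE_INPUT_BUFFER_PERCENT, 0.25);
Job job = Job.getInstance(configuration, "tuned-job");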

---------------------------------Removing unnecessary reduce

After the map phase come copy, merge and reduce: many steps that consume a lot of resources. If the job only extracts some data and does per-record computation, with no aggregation needed, there is no reason to run a reduce phase at all.

Reduce can be disabled like this:

job.setNumReduceTasks(0);
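
For context, a minimal sketch of the driver for such a map-only job, wired up with the ClearDataFixMapper shown further below; inputPath and outputPath are placeholders for the real job paths:

Job job = Job.getInstance(configuration, "clear-data-fix");
job.setJarByClass(ClearDataFixMapper.class);
job.setMapperClass(ClearDataFixMapper.class);
// no reducer: map output goes straight to HDFS, skipping copy/merge/reduce
job.setNumReduceTasks(0);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
job.waitForCompletion(true);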

For the detailed MapReduce execution flow, see:

Hadoop Map/Reduce execution flow explained

Hadoop MapReduce execution process explained


With reduce disabled, the map task itself can write to multiple output paths:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ClearDataFixMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
	private MultipleOutputs<LongWritable, Text> mos;
	// project-specific column separator constant
	private static final String SEP_CLMN = BaseConstant.DATA_SEPARATOR_COLUMN;

	@Override
	protected void setup(Context context) throws IOException, InterruptedException {
		mos = new MultipleOutputs<LongWritable, Text>(context);
	}

	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// placeholders: resultVal is the cleaned record and uri the relative output
		// path chosen for it; both come from application-specific logic omitted here
		Text resultVal = value;
		String uri = "cleaned/part";
		mos.write(key, resultVal, uri);
	}

	@Override
	protected void cleanup(Context context) throws IOException, InterruptedException {
		mos.close();
	}
}
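
One side note, added here as an assumption about this kind of setup rather than something from the original job: the write(key, value, baseOutputPath) variant used above does not require registering named outputs in the driver, but each map task still creates an empty default part-m-* file through the normal output format. Wrapping the output format lazily avoids those empty files (TextOutputFormat here stands for whatever output format the job actually uses, e.g. SequenceFileOutputFormat):

LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);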

This raises a new problem: because mos writes its output independently in every map task, a growing output volume quickly turns into a small-files problem. There are two ways to deal with it. The first is to use Hadoop's IO to merge the data before the task runs. This approach has two drawbacks: it cannot conveniently merge SequenceFiles, and it issues a large number of requests to the NameNode, which may end in a RemoteException. The merge looks like this:

		// count the total input size and merge the small files into one
		long blockSize = 0;
		long totalSize = 0;
		FileStatus[] statuses = hadoopFs.globStatus(new Path(inputPath + "*"));
		// create (or overwrite) the target file that the small files are appended to
		FSDataOutputStream outstream = hadoopFs.create(new Path(inputPath + "infile"), true);
		FSDataInputStream inputStream = null;
		for (FileStatus status : statuses) {
			totalSize += status.getLen();
			blockSize = status.getBlockSize();
			if (!status.getPath().toString().contains("infile")) {
				inputStream = hadoopFs.open(status.getPath());
				IOUtils.copyBytes(inputStream, outstream, configuration, false);
				inputStream.close();
				outstream.flush();
				// delete each source file once it has been merged
				hadoopFs.delete(status.getPath(), true);
			}
		}
		if (outstream != null) {
			outstream.close();
		}

The second approach is to use CombineSequenceFileInputFormat:

// Note: if the input path is a directory, do not append a trailing / to it
MultipleInputs.addInputPath(job, new Path(inputPath), CombineSequenceFileInputFormat.class);
// Split sizes for the map phase; these must be set, otherwise everything ends up in a single map task
FileInputFormat.setMinInputSplitSize(job, 1);
FileInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 64);
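
The same limits can also be set directly on the Configuration; to the best of my knowledge these are the keys that setMinInputSplitSize/setMaxInputSplitSize write in Hadoop 2.x, so either form should behave the same:

configuration.setLong("mapreduce.input.fileinputformat.split.minsize", 1);
configuration.setLong("mapreduce.input.fileinputformat.split.maxsize", 1024 * 1024 * 64);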

---------------------------------Improving the data-sending part: when sending data in bulk, always make sure the resources are closed afterwards. Here the data is sent with HttpClient; a demo follows:

import java.io.IOException;
import java.util.Map;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.HttpStatus;
import org.apache.http.ParseException;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class KVSender2 {
	private static Log log = LogFactory.getLog(KVSender2.class);

	public static void KVSender(String url, Map<String, String> params) {
		CloseableHttpClient httpClient = null;
		CloseableHttpResponse response = null;
		HttpGet request = new HttpGet(url);
		request.setConfig(RequestConfig.custom().setSocketTimeout(2000).setConnectTimeout(2000).build());

		// obtain the client instance
		httpClient = HttpClients.createDefault();

		// execute the request and handle the response
		try {
			// the key/value pairs to send are carried as request headers
			for (Map.Entry<String, String> entry : params.entrySet()) {
				request.setHeader(entry.getKey(), entry.getValue());
			}
			response = httpClient.execute(request);

			// any status code above 200 is treated as a failure and retried once
			if (response.getStatusLine().getStatusCode() > HttpStatus.SC_OK) {
				log.info("status code is " + response.getStatusLine().getStatusCode());
				// close the failed response before resending, then keep the retry response
				EntityUtils.consume(response.getEntity());
				response.close();
				response = httpClient.execute(request);
			}
			// fully consume the entity so the connection can be released
			EntityUtils.consume(response.getEntity());
		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (ParseException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// release resources
			if (response != null) {
				try {
					response.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
			if (httpClient != null) {
				try {
					httpClient.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
			}
			request.releaseConnection();
		}
	}
}
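
A sketch of how the sender might be called, for instance from a task's cleanup; the URL and the header name/value are made-up placeholders, the real job fills in its own fields:

Map<String, String> params = new HashMap<String, String>();
// hypothetical header carrying the data to send
params.put("record-count", "12345");
KVSender2.KVSender("http://collector.example.com/ingest", params);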

