---------------------------------MR runtime parameter tuning
MapReduce Task Parameter Tuning
Hadoop Optimization, Part 1: HDFS/MapReduce
MapReduce-related Parameters
The official MapReduce documentation
The references above are fine starting points for platform-internal tuning, but in my experience, tweaking parameters blindly without a deep understanding of how MR works tends to make things worse rather than better.
Parameters can be set in code like this:
configuration.setDouble(Job.SHUFFLE_INPUT_BUFFER_PERCENT, 0.25);
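The same setting can also be applied through its configuration key; the property name below is the Hadoop 2.x name and is shown only as an illustration:
// Equivalent to the constant above: set the reduce-side shuffle buffer fraction by key
configuration.setDouble("mapreduce.reduce.shuffle.input.buffer.percent", 0.25);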
---------------------------------Eliminating unnecessary reduce
After map come copy, merge and reduce: extra stages that take time and resources. If the job only extracts or reshapes data and does no aggregation, there is no reason to run a reduce phase at all.
Reduce can be disabled like this:
job.setNumReduceTasks(0);
For the detailed MapReduce execution flow, see:
Hadoop Map/Reduce Execution Flow Explained
Hadoop MapReduce Execution Process Explained
With reduce disabled, the map task can write to multiple output paths on its own:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ClearDataFixMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private MultipleOutputs<LongWritable, Text> mos;
    private static final String SEP_CLMN = BaseConstant.DATA_SEPARATOR_COLUMN;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        mos = new MultipleOutputs<LongWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The original snippet elides how the record is cleaned and how its
        // output path is chosen; resultVal and uri are placeholders for that logic.
        Text resultVal = value;
        String uri = "cleaned";   // base output path for this record
        // Route the record to a file under the chosen base path instead of context.write()
        mos.write(key, resultVal, uri);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closing MultipleOutputs flushes and closes every per-path writer
        mos.close();
    }
}
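For completeness, here is a minimal driver sketch for the map-only job above; the input/output paths, job name and output key/value classes are assumptions, and LazyOutputFormat is an optional extra that suppresses the empty default part-* files when all records go through MultipleOutputs:
// Map-only driver sketch (paths, job name and output types are assumptions)
Job job = Job.getInstance(configuration, "clear-data-fix");
job.setJarByClass(ClearDataFixMapper.class);
job.setMapperClass(ClearDataFixMapper.class);
job.setNumReduceTasks(0);                                  // map-only: no shuffle, no reduce
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileOutputFormat.setOutputPath(job, new Path(outputPath));
// Optional: avoid empty default part-* files when everything is written via MultipleOutputs
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
job.waitForCompletion(true);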
This brings its own problem: since every map task writes through its own MultipleOutputs, the number of small files explodes once the output volume grows. There are two ways to deal with it. The first is to merge the data with Hadoop's IO before the task runs, but this approach has two drawbacks: it cannot conveniently merge SequenceFiles, and it sends a lot of requests to the NameNode, which may end in a RemoteException. The merge looks like this:
// Collect size statistics and merge the small files into a single "infile"
long blockSize = 0;
long totalSize = 0;
FileStatus[] statuses = hadoopFs.globStatus(new Path(inputPath + "*"));
// create(..., true) creates (or overwrites) the merged target file
FSDataOutputStream outstream = hadoopFs.create(new Path(inputPath + "infile"), true);
FSDataInputStream inputStream = null;
for (FileStatus status : statuses) {
    totalSize += status.getLen();
    blockSize = status.getBlockSize();
    if (!status.getPath().toString().contains("infile")) {
        // Append this file's bytes to the merged output
        inputStream = hadoopFs.open(status.getPath());
        IOUtils.copyBytes(inputStream, outstream, configuration, false);
        inputStream.close();
        outstream.flush();
        // Delete the source file once it has been merged
        hadoopFs.delete(status.getPath(), true);
    }
}
if (outstream != null) {
    outstream.close();
}
The second approach is to use CombineSequenceFileInputFormat:
// Note: if the input path is a directory, do not append a trailing /
MultipleInputs.addInputPath(job, new Path(inputPath), CombineSequenceFileInputFormat.class);
// The split size limits must be set, otherwise everything ends up in a single map task
FileInputFormat.setMinInputSplitSize(job, 1);
FileInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 64);
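If the job has only a single input path, the same effect can be had without MultipleInputs by setting the input format on the job directly (a sketch under that assumption):
// Single-input variant of the CombineSequenceFileInputFormat setup
job.setInputFormatClass(CombineSequenceFileInputFormat.class);
FileInputFormat.addInputPath(job, new Path(inputPath));
FileInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 64);   // ~64 MB per combined split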
import java.io.IOException;
import java.util.Map;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class KVSender2 {

    private static Log log = LogFactory.getLog(KVSender2.class);

    public static void KVSender(String url, Map<String, String> params) {
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;
        HttpGet request = new HttpGet(url);
        // Fail fast: 2s socket timeout and 2s connect timeout
        request.setConfig(RequestConfig.custom().setSocketTimeout(2000).setConnectTimeout(2000).build());
        // Create the HTTP client
        httpClient = HttpClients.createDefault();
        try {
            // Pass the key/value pairs as request headers
            for (Map.Entry<String, String> entry : params.entrySet()) {
                request.setHeader(entry.getKey(), entry.getValue());
            }
            // Execute the request and check the status code
            response = httpClient.execute(request);
            if (response.getStatusLine().getStatusCode() > HttpStatus.SC_OK) {
                // Anything above 200 is treated as a failure: log it and resend once,
                // consuming the retry response so its connection is released
                log.info("status code is " + response.getStatusLine().getStatusCode());
                CloseableHttpResponse retryResponse = httpClient.execute(request);
                EntityUtils.consume(retryResponse.getEntity());
                retryResponse.close();
            }
            // Fully consume the entity so the connection goes back to the pool
            EntityUtils.consume(response.getEntity());
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release resources
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (httpClient != null) {
                try {
                    httpClient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            request.releaseConnection();
        }
    }
}
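A hypothetical call site, for illustration only (the URL and the header names are placeholders, not from the original):
// Hypothetical usage: send a couple of key/value pairs as HTTP headers
Map<String, String> params = new HashMap<String, String>();    // needs java.util.HashMap
params.put("jobName", "clear-data-fix");                       // placeholder key/value
params.put("status", "finished");                              // placeholder key/value
KVSender2.KVSender("http://example.com/notify", params);       // placeholder URL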