Note: all code and configuration parameters mentioned in this article are based on a Hadoop 2.5.0-cdh5.2.0 environment.
MapReduce (MR) programs often need to access external files, such as external jar packages or data files. For the former, you can copy the jars into Hadoop's lib directory (in this article's CDH environment, the actual path is /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/lib/). This is clearly clumsy, especially on a cluster with many nodes. For the latter, you can serialize the file content into the Configuration and deserialize it again when the MR job uses it; once the data grows beyond a few megabytes, this is not advisable either.
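For reference, here is a minimal sketch of that Configuration-based approach for small data. The class name ConfDataSketch and the key name "my.small.data" are illustrative only and not part of the original code:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: stuff the content of a small local file into the job Configuration
// on the client, then read it back in every task. Key name is illustrative.
public class ConfDataSketch
{
    // Client side: serialize the file content into the Configuration before submitting the job.
    public static void putIntoConf(final Configuration conf, final String localFile)
            throws IOException
    {
        final String content = new String(Files.readAllBytes(Paths.get(localFile)),
                StandardCharsets.UTF_8);
        conf.set("my.small.data", content);
    }

    // Task side: each Mapper (or Reducer) gets the content back from its Configuration
    // and can then use it in map()/reduce().
    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text>
    {
        private String data;

        @Override
        public void setup(final Context context)
        {
            data = context.getConfiguration().get("my.small.data");
        }
    }
}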
To make up for the shortcomings of the approaches above, you can use either of the two methods described below. To be clear up front: both methods are essentially built on Hadoop's Distributed Cache (DC) mechanism. A quick search will turn up plenty of introductions to DC, so it is not covered here.
Method 1: use the files and libjars options of the GenericOptionsParser tool
For example: hadoop jar TestLibJar.jar -libjars sqljdbc4-1.0.jar -files mydata.dat /tmp/input /tmp/output
The files given in these options are local files. When the job is submitted, they are uploaded to ${yarn.app.mapreduce.am.staging-dir}/${user}/.staging/${job_id} on HDFS, then distributed and cached under ${yarn.nodemanager.local-dirs}/usercache/${user}/filecache on each NodeManager. The files on HDFS are removed automatically when the job finishes, but the cached copies on the NodeManagers may not be cleaned up right away, because cleanup is decided by several parameters working together, such as yarn.nodemanager.localizer.cache.target-size-mb and yarn.nodemanager.localizer.cache.cleanup.interval-ms.
The files specified in the command above can then be used in the MR code, for example:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestDistruteCache extends Configured implements Tool
{
    private static final Logger LOG = LoggerFactory.getLogger(TestDistruteCache.class);

    // Reads the first line of a file that the Distributed Cache has symlinked
    // into the task's working directory, so only the file name is needed.
    public static void readFile(final String file)
    {
        try
        {
            final Path path = new Path(file);
            final BufferedReader reader = new BufferedReader(new FileReader(new File(
                    path.getName())));
            LOG.info("The first line is: {}", reader.readLine());
            reader.close();
        }
        catch (final Exception e)
        {
            e.printStackTrace();
        }
    }

    public static void operatorSqlServer()
    {
        // ......
    }

    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text>
    {
        @Override
        public void setup(final Context context) throws IOException
        {
            operatorSqlServer(); // operates on a SQL Server database, so it needs sqljdbc4-1.0.jar (shipped via -libjars)
            readFile("mydata.dat"); // access mydata.dat (shipped via -files)
        }

        @Override
        public void map(final LongWritable key, final Text value, final Context context)
                throws IOException, InterruptedException
        {
            context.write(key, value);
        }
    }

    public static class Reduce extends Reducer<LongWritable, Text, LongWritable, Text>
    {
        @Override
        public void setup(final Context context)
        {
            operatorSqlServer(); // operates on a SQL Server database, so it needs sqljdbc4-1.0.jar (shipped via -libjars)
            readFile("mydata.dat"); // access mydata.dat (shipped via -files)
        }

        @Override
        public void reduce(final LongWritable key, final Iterable<Text> values,
                final Context context) throws IOException, InterruptedException
        {
            for (final Text value : values)
            {
                context.write(key, value);
            }
        }
    }

    public int run(final String[] args) throws Exception
    {
        final Configuration conf = getConf();
        final Job job = Job.getInstance(conf);
        job.setJarByClass(TestDistruteCache.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(2);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(final String[] args) throws Exception
    {
        ToolRunner.run(new Configuration(), new TestDistruteCache(), args);
    }
}
If either option needs to specify multiple files, simply separate the file names with commas, as shown in the example below.
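For instance (the extra jar mylib.jar and the data file names here are purely illustrative): hadoop jar TestLibJar.jar -libjars sqljdbc4-1.0.jar,mylib.jar -files mydata1.dat,mydata2.dat /tmp/input /tmp/output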
Method 2: the non-option approach, using the DC API
This method distributes the HDFS files specified in the code and caches them under ${yarn.nodemanager.local-dirs}/filecache on each NodeManager; the cache cleanup mechanism is the same as in Method 1. Accessing the external files is, of course, also done in code. Enough talk, here is the code:
// Imports are the same as in Method 1, plus java.net.URI.

public class TestDistruteCache extends Configured implements Tool
{
    private static final Logger LOG = LoggerFactory.getLogger(TestDistruteCache.class);

    // Reads the first line of a cached HDFS file; the Distributed Cache symlinks it
    // into the task's working directory under its base name.
    public static void readFile(final URI file)
    {
        try
        {
            final Path path = new Path(file);
            final BufferedReader reader = new BufferedReader(new FileReader(new File(
                    path.getName())));
            LOG.debug("The first line is: {}", reader.readLine());
            reader.close();
        }
        catch (final Exception e)
        {
            e.printStackTrace();
        }
    }

    // Reads the first cached file through its alias (test.dat); see run() below.
    public static void readFile()
    {
        try
        {
            final Path path = new Path("test.dat");
            final BufferedReader reader = new BufferedReader(new FileReader(new File(
                    path.getName())));
            LOG.debug("The first line is: {}", reader.readLine());
            reader.close();
        }
        catch (final Exception e)
        {
            e.printStackTrace();
        }
    }

    public static void operatorSqlServer()
    {
        // ......
    }

    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text>
    {
        @Override
        public void setup(final Context context)
        {
            operatorSqlServer(); // operates on a SQL Server database, so it needs sqljdbc4-1.0.jar (added to the classpath in run())
            readFile(); // access test.dat, the alias of /tmp/mydata1.dat
        }

        @Override
        public void map(final LongWritable key, final Text value, final Context context)
                throws IOException, InterruptedException
        {
            context.write(key, value);
        }
    }

    public static class Reduce extends Reducer<LongWritable, Text, LongWritable, Text>
    {
        private URI[] caches = null;

        @Override
        public void setup(final Context context)
        {
            try
            {
                operatorSqlServer(); // operates on a SQL Server database, so it needs sqljdbc4-1.0.jar (added to the classpath in run())
                caches = context.getCacheFiles();
                readFile(caches[1]); // access mydata2.dat, the second cache file added in run()
            }
            catch (final IOException e)
            {
                e.printStackTrace();
            }
        }

        @Override
        public void reduce(final LongWritable key, final Iterable<Text> values,
                final Context context) throws IOException, InterruptedException
        {
            for (final Text value : values)
            {
                context.write(key, value);
            }
        }
    }

    public int run(final String[] args) throws Exception
    {
        final Configuration conf = getConf();
        final Job job = Job.getInstance(conf);
        job.setJarByClass(TestDistruteCache.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(2);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.addCacheFile(new URI("/tmp/mydata1.dat#test.dat")); // cache an HDFS file under the alias test.dat
        job.addCacheFile(new URI("/tmp/mydata2.dat"));
        job.addArchiveToClassPath(new Path("/tmp/sqljdbc4-1.0.jar"));
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(final String[] args) throws Exception
    {
        ToolRunner.run(new Configuration(), new TestDistruteCache(), args);
    }
}
The code is broadly similar to that of Method 1, with a few differences:
1. run() has three extra lines, which add two data files and one jar file to the DC;
2. The first data file is given an alias (test.dat); the DC exposes it as a symlink with that name in the task's working directory, so the newly added no-argument readFile() can access the data file directly through the alias.
To sum up:
1. In MR, if the external data file you need is smaller than a few megabytes, access it through the Configuration; otherwise use one of the two methods described in this article;
2. In MR, if you need to access an external jar package, use one of the two methods described in this article;
3. The two methods are essentially the same, but note one detail: the files to be distributed live in different places. Method 1 uses local files, Method 2 uses HDFS files.