How to Access External JARs and Data Files in MapReduce

Note: all code and configuration parameters in this article are based on Hadoop 2.5.0-cdh5.2.0.

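MapReduce (MR) programs often need to access external files, such as external JARs or data files. For the former, you can copy the JAR into Hadoop's lib directory (in the CDH environment used here, the actual path is /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/lib/). This is clearly clumsy, especially on a cluster with many nodes, because the JAR has to be copied to every node. For the latter, you can serialize the contents of the data file into the Configuration and deserialize them when the MR task needs them. This is not recommended either once the data grows beyond a few MB.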
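The Configuration approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not the article's own code: java.util.Properties stands in for Hadoop's Configuration so that the sketch runs without a cluster, and the key name example.side.data and the class name are hypothetical. The idea is that the driver encodes the bytes as a string value, and the task decodes the same value later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Properties;

public class ConfigPayloadSketch {
    // Hypothetical key name, chosen only for this illustration.
    static final String KEY = "example.side.data";

    // Driver side: Base64-encode the bytes so they fit in a string-valued setting.
    static void store(Properties conf, byte[] data) {
        conf.setProperty(KEY, Base64.getEncoder().encodeToString(data));
    }

    // Task side: decode the setting back into the original bytes.
    static byte[] load(Properties conf) {
        String encoded = conf.getProperty(KEY);
        if (encoded == null) {
            return new byte[0];
        }
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // stand-in for the job Configuration
        store(conf, "id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8));
        String restored = new String(load(conf), StandardCharsets.UTF_8);
        System.out.print(restored);
    }
}
```

Because the encoded value travels with the job configuration to every task, its size adds to every task's startup cost, which is why this only suits small files.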

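To address these shortcomings, use either of the two methods below. Note first that both are built on Hadoop's Distributed Cache (DC) mechanism. Introductions to DC are easy to find online, so it is not covered here.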

Method 1: Use the -files and -libjars options of GenericOptionsParser

For example: hadoop jar TestLibJar.jar -libjars sqljdbc4-1.0.jar -files mydata.dat /tmp/input /tmp/output

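The options name local files. When the job is submitted, the files are uploaded to ${yarn.app.mapreduce.am.staging-dir}/${user}/.staging/${job_id} on HDFS, then distributed to and cached on each NodeManager under ${yarn.nodemanager.local-dirs}/usercache/${user}/filecache. The files on HDFS are removed automatically when the job finishes, but the cached copies on the NodeManagers may not be removed right away, because cleanup is decided by several parameters together, for example yarn.nodemanager.localizer.cache.target-size-mb.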
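To make the two path templates above concrete, here is a small sketch that expands them. The class and method names are hypothetical, the job ID is a made-up placeholder, and the values /user and /yarn/nm are simply the ones observed on the test cluster described in this article; other clusters will differ.

```java
public class CachePathSketch {
    // HDFS staging directory: ${staging-dir}/${user}/.staging/${job_id}
    static String stagingDir(String stagingRoot, String user, String jobId) {
        return stagingRoot + "/" + user + "/.staging/" + jobId;
    }

    // Per-user cache on a NodeManager: ${local-dir}/usercache/${user}/filecache
    static String userCacheDir(String localDir, String user) {
        return localDir + "/usercache/" + user + "/filecache";
    }

    public static void main(String[] args) {
        // Placeholder job ID, for illustration only.
        String jobId = "job_0000000000000_0001";
        System.out.println(stagingDir("/user", "root", jobId));
        System.out.println(userCacheDir("/yarn/nm", "root"));
    }
}
```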

Files specified with these options can be used in MR code, as in the following example:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestDistruteCache extends Configured implements Tool
{
    private static final Logger LOG = LoggerFactory.getLogger(TestDistruteCache.class);

    // Logs the first line of a distributed file. The file is linked into the
    // task's working directory, so its base name is enough to open it.
    public static void readFile(final String file)
    {
        final Path path = new Path(file);
        try (BufferedReader reader = new BufferedReader(new FileReader(new File(
            path.getName()))))
        {
            LOG.info("The first line is: {}", reader.readLine());
        }
        catch (final IOException e)
        {
            LOG.error("Failed to read " + file, e);
        }
    }

    public static void operatorSqlServer()
    {
        //......;
    }

    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text>
    {
        @Override
        public void setup(final Context context) throws IOException
        {
            operatorSqlServer(); // works with a SQL Server database, so it needs sqljdbc4-1.0.jar
            readFile("mydata.dat"); // access mydata.dat
        }

        @Override
        public void map(final LongWritable key, final Text value, final Context context)
            throws IOException, InterruptedException
        {
            context.write(key, value);
        }
    }

    public static class Reduce extends Reducer<LongWritable, Text, LongWritable, Text>
    {
        @Override
        public void setup(final Context context)
        {
            operatorSqlServer(); // works with a SQL Server database, so it needs sqljdbc4-1.0.jar
            readFile("mydata.dat"); // access mydata.dat
        }

        // Must be named reduce (not reducer) to override Reducer.reduce.
        @Override
        public void reduce(final LongWritable key, final Iterable<Text> values,
                           final Context context) throws IOException, InterruptedException
        {
            for (final Text value : values)
            {
                context.write(key, value);
            }
        }
    }

    @Override
    public int run(final String[] args) throws Exception
    {
        final Configuration conf = getConf();

        final Job job = Job.getInstance(conf);
        job.setJarByClass(TestDistruteCache.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(2);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Report job failure through the exit code.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(final String[] args) throws Exception
    {
        System.exit(ToolRunner.run(new Configuration(), new TestDistruteCache(), args));
    }
}

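In the example above, the options specify one JAR and one data file, and the code uses both. On my test cluster, both files were distributed to /yarn/nm/usercache/root/filecache/, and while the job was running they were also visible under /user/root/.staging/ on HDFS.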

To pass several files to one option, separate the file names with commas.
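As a rough illustration of the comma-separated form, the sketch below assembles the argument list for such a command. The class name, the helper buildArgs, and the extra file names (extra.jar, lookup.dat) are hypothetical; only the -libjars and -files options and the file names from the earlier command come from this article. The generic options are placed before the input and output paths, as in the command shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GenericOptionsArgsSketch {
    // Builds the arguments: generic options first, then input and output paths.
    static List<String> buildArgs(List<String> jars, List<String> files,
                                  String input, String output) {
        List<String> args = new ArrayList<>();
        if (!jars.isEmpty()) {
            args.add("-libjars");
            args.add(String.join(",", jars)); // commas only, no spaces
        }
        if (!files.isEmpty()) {
            args.add("-files");
            args.add(String.join(",", files));
        }
        args.add(input);
        args.add(output);
        return args;
    }

    public static void main(String[] args) {
        List<String> built = buildArgs(
            Arrays.asList("sqljdbc4-1.0.jar", "extra.jar"),   // extra.jar is hypothetical
            Arrays.asList("mydata.dat", "lookup.dat"),        // lookup.dat is hypothetical
            "/tmp/input", "/tmp/output");
        System.out.println("hadoop jar TestLibJar.jar " + String.join(" ", built));
    }
}
```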

Method 2: Use the DC API instead of command-line options

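This method distributes the HDFS files named in the code and caches them on each NodeManager under ${yarn.nodemanager.local-dirs}/filecache. The cache cleanup mechanism is the same as in Method 1. The external files are, of course, also accessed through code. Here is the code: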

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestDistruteCache extends Configured implements Tool
{
    private static final Logger LOG = LoggerFactory.getLogger(TestDistruteCache.class);

    // Logs the first line of a cached file, located by the base name of its URI.
    public static void readFile(final URI file)
    {
        final Path path = new Path(file);
        try (BufferedReader reader = new BufferedReader(new FileReader(new File(
            path.getName()))))
        {
            LOG.debug("The first line is: {}", reader.readLine());
        }
        catch (final IOException e)
        {
            LOG.error("Failed to read " + file, e);
        }
    }

    // Logs the first line of the cached file registered under the alias test.dat.
    public static void readFile()
    {
        final Path path = new Path("test.dat");
        try (BufferedReader reader = new BufferedReader(new FileReader(new File(
            path.getName()))))
        {
            LOG.debug("The first line is: {}", reader.readLine());
        }
        catch (final IOException e)
        {
            LOG.error("Failed to read test.dat", e);
        }
    }

    public static void operatorSqlServer()
    {
        //......;
    }

    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text>
    {
        @Override
        public void setup(final Context context)
        {
            operatorSqlServer(); // works with a SQL Server database, so it needs sqljdbc4-1.0.jar
            readFile(); // access test.dat
        }

        @Override
        public void map(final LongWritable key, final Text value, final Context context)
            throws IOException, InterruptedException
        {
            context.write(key, value);
        }
    }

    public static class Reduce extends Reducer<LongWritable, Text, LongWritable, Text>
    {
        private URI[] caches = null;

        @Override
        public void setup(final Context context)
        {
            try
            {
                operatorSqlServer(); // works with a SQL Server database, so it needs sqljdbc4-1.0.jar
                caches = context.getCacheFiles();
                // Index 1 is the second file registered with addCacheFile in run().
                readFile(caches[1]); // access mydata2.dat
            }
            catch (final IOException e)
            {
                LOG.error("Failed to list cache files", e);
            }
        }

        // Must be named reduce (not reducer) to override Reducer.reduce.
        @Override
        public void reduce(final LongWritable key, final Iterable<Text> values,
                           final Context context) throws IOException, InterruptedException
        {
            for (final Text value : values)
            {
                context.write(key, value);
            }
        }
    }

    @Override
    public int run(final String[] args) throws Exception
    {
        final Configuration conf = getConf();

        final Job job = Job.getInstance(conf);
        job.setJarByClass(TestDistruteCache.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(2);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Register two HDFS data files (the first with the alias test.dat)
        // and one JAR with the Distributed Cache.
        job.addCacheFile(new URI("/tmp/mydata1.dat#test.dat"));
        job.addCacheFile(new URI("/tmp/mydata2.dat"));
        job.addArchiveToClassPath(new Path("/tmp/sqljdbc4-1.0.jar"));

        // Report job failure through the exit code.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(final String[] args)
        throws Exception
    {
        System.exit(ToolRunner.run(new Configuration(), new TestDistruteCache(), args));
    }
}

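The code is largely the same as in Method 1, with a few differences:

1. run has three extra lines, which add two data files and one JAR file to the DC;

2. The first data file is given an alias (test.dat), so the new readFile() can access the data file directly by that alias.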
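The alias rule can be illustrated with a small standalone sketch: the name a task uses to open a cached file is the URI fragment (the part after #) when one is given, and the file's base name otherwise. The class and method names here are hypothetical, and the code only mimics the naming rule; it does not call any Hadoop API.

```java
import java.net.URI;

public class CacheLinkNameSketch {
    // Returns the name under which a cached file is expected to appear
    // in the task's working directory.
    static String linkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment; // explicit alias, e.g. test.dat
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1); // base name of the file
    }

    public static void main(String[] args) {
        System.out.println(linkName(URI.create("/tmp/mydata1.dat#test.dat"))); // prints test.dat
        System.out.println(linkName(URI.create("/tmp/mydata2.dat")));          // prints mydata2.dat
    }
}
```

This matches how the example reads its files: readFile() opens test.dat for the first file, while readFile(URI) falls back to the base name mydata2.dat for the second.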

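To summarize:

1. In MR, if the external data file you need is smaller than a few MB, pass it through the Configuration; otherwise use one of the two methods described in this article;

2. In MR, if you need an external JAR, use one of the two methods described in this article;

3. The two methods are essentially the same, but note one detail: the files to distribute live in different places. Method 1 takes local files, while Method 2 takes HDFS files.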
