How to include third-party jars in Hadoop via -libjars

When launching a job with the "hadoop jar" command, we can pass the "-libjars" option and use the ToolRunner utility to parse the arguments and run the job. This is one of the recommended approaches, because it cleanly decouples a job's dependencies from the Hadoop classpath and lets each job set its own -libjars list. After the job is submitted, these jars are copied to Hadoop's shared filesystem (HDFS, under the /tmp folder), from which the TaskTrackers can download them locally and load them in the task child JVMs.

-libjars must list the full paths of all jars the job depends on, and these jars must exist on the local filesystem of the machine submitting the job (they do not need to be present on every node in the cluster); HDFS paths are not supported yet. Jars that are already included via HADOOP_CLASSPATH or mapred.child.env do not need to be repeated in -libjars. Because -libjars requires full paths, it becomes unwieldy when there are many jars, so in practice the jars shared by multiple jobs are configured through HADOOP_CLASSPATH or mapred.child.env, and only the few extra jars a particular job needs are passed via the -libjars option.

1. If you do not implement the Tool/ToolRunner pattern, you can add the following argument-parsing code to the MapReduce driver's main method:

CommandLine commandLine = new GenericOptionsParser(configuration, args).getCommandLine();
String[] tmpArgs = commandLine.getArgs();
The complete code is as follows:
 public static void main(String[] args) throws Exception {
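        // Submit the job as user "hadoop" by setting the HADOOP_USER_NAME property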
        Properties properties = System.getProperties();
        properties.setProperty("HADOOP_USER_NAME", "hadoop");

        Configuration configuration = new Configuration();
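        // Client-side settings for submitting to the remote cluster: NameNode address,
        // the job jar built locally, the YARN framework, the ResourceManager host,
        // and cross-platform submission (needed when submitting from a Windows client)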
        configuration.set("fs.defaultFS", "hdfs://hadoop000:8020");
        configuration.set("dfs.client.use.datanode.hostname", "true");
        configuration.set("mapred.jar", "D:\\ruoze_g7\\ruoze\\hadoop-project\\target\\hadoop-project-1.0-customer-dependencies.jar");
        configuration.set("mapreduce.framework.name", "yarn");
        configuration.set("yarn.resourcemanager.hostname", "hadoop000");
        configuration.set("mapreduce.app-submission.cross-platform", "true");

        CommandLine commandLine = new GenericOptionsParser(configuration, args).getCommandLine();
        String[] tmpArgs = commandLine.getArgs();

        Job job = Job.getInstance(configuration, "ETLApp");

        job.setJarByClass(ETLApp.class);

        job.setMapperClass(MyMapper.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

//        String input = args[1];
        String input = "hdfs://hadoop000:8020/hadoop/project/input/access/day=";
        // On the server, the date argument arrives as args[2] (after -libjars and its value)
        String date = args[2];
        if (StringUtils.isNotEmpty(date)) {
            input =  input + date;
        }

        // String output = args[2];
        String output = "hdfs://hadoop000:8020/hadoop/project/wide/access/day=";

        if (StringUtils.isNotEmpty(date)) {
            output =  output + date;
        }

        FileUtils.deleteTarget(output, configuration);
//        String cacheFilePath = args[3];
        String  cacheFilePath =  "hdfs://hadoop000:8020/hadoop/project/input/base/ip.txt";

        // Load the lookup file into the distributed cache
        job.addCacheFile(new URI(cacheFilePath));

        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        boolean result = job.waitForCompletion(true);

        // Print the custom counters recorded in the "etl" group
        CounterGroup etl = job.getCounters().getGroup("etl");
        Iterator<Counter> iterator = etl.iterator();
        while (iterator.hasNext()){
            Counter next = iterator.next();
            System.out.println("----------" + next.getName() + " : " + next.getValue());
        }
        System.exit(result ? 0:1);
    }
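The "etl" counter group read at the end is populated inside the mapper (the mapper code is not shown in this post). A minimal sketch of how such a counter would typically be incremented in MyMapper's map method; the counter name parse_error is hypothetical:

        // inside MyMapper.map(...): record a bad record in the "etl" counter group
        // ("parse_error" is a hypothetical counter name, used only for illustration)
        context.getCounter("etl", "parse_error").increment(1);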

Package the code, upload it to the server, and copy the dependency jar to the corresponding path. With the invocation below, the program reads "20190921" via args[2]:

hadoop jar hadoop-project-1.0.jar com.wxx.bigdata.mr.ETLApp -libjars /home/hadoop/lib/thirdjar/hadoop/fastjson-1.2.51.jar 20190921

With this command, the job no longer throws an exception about the missing third-party dependency.
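Note that GenericOptionsParser consumes -libjars and its value, so for the command above the remaining arguments returned by getCommandLine().getArgs() should contain only "20190921". If you prefer not to depend on the raw args index, the date can equally be read from tmpArgs (a small sketch under that assumption):

// GenericOptionsParser strips -libjars and its value, so for the command above
// tmpArgs should contain only {"20190921"}
String date = tmpArgs[0];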
2. Implement the Tool interface and its run method, and launch the job with ToolRunner
For details, see: http://grepalex.com/2013/02/25/hadoop-libjars/
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/util/ToolRunner.html
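
A minimal sketch of this pattern is shown below. The class name ETLDriver is chosen only for illustration (it is not the ETLApp driver above), and the input and output paths are taken from the remaining arguments rather than derived from a date:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ETLDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // args contains only the job-specific arguments; generic options such as
        // -libjars and -D have already been consumed and applied by ToolRunner
        Job job = Job.getInstance(getConf(), "ETLApp");
        job.setJarByClass(ETLDriver.class);
        job.setMapperClass(MyMapper.class);          // the mapper from the project above
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner wraps GenericOptionsParser, so -libjars works without extra parsing code
        int exitCode = ToolRunner.run(new Configuration(), new ETLDriver(), args);
        System.exit(exitCode);
    }
}

It is launched the same way as the command above, with -libjars placed before the job-specific arguments.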

 
