A while back, while working on a piece of company business, I ran into the following requirement: take data stored on HDFS by day and by hour, preprocess it, and then store the results on HDFS by day... From this, the MapReduce input path and output path are, respectively:
/user/data/yyyy/MM/dd/HH/
/user/out/yyyy/MM/dd/
While designing the code, I found that FileInputFormat.addInputPath() was not up to this task, so, after digging through the APIs and other references, I arrived at FileInputFormat.setInputPaths() as the solution. Below, I summarize and walk through how MapReduce handles input and output.
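Before going into the details, here is a minimal sketch of how that requirement can be met with setInputPaths() and a glob; the class name DailyDriver and the hard-coded date are placeholders for this example, and the Mapper/Reducer setup is omitted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyDriver {
    public static void main(String[] args) throws Exception {
        String day = "2016/08/01"; // hypothetical date; in practice taken from args
        Job job = Job.getInstance(new Configuration(), "daily preprocess");
        job.setJarByClass(DailyDriver.class);
        // A single glob expands to all 24 hourly input directories of the day:
        // /user/data/2016/08/01/00/ ... /user/data/2016/08/01/23/
        FileInputFormat.setInputPaths(job, new Path("/user/data/" + day + "/*"));
        // Results are written once per day.
        FileOutputFormat.setOutputPath(job, new Path("/user/out/" + day));
        // Mapper/Reducer configuration omitted; this only shows the path handling.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}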
FileInputFormat.addInputPath() is the method we most commonly use to set a MapReduce job's input path. In fact, FileInputFormat has two methods of this kind:
static void addInputPath(Job job, Path path)
static void addInputPaths(Job job, String commaSeparatedPaths)
The former adds one path at a time, so multiple paths take repeated calls:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileInputFormat.addInputPath(job, new Path(args[1]));
FileInputFormat.addInputPath(job, new Path(args[2]));
The latter adds several paths at once as a comma-separated string:
String paths = strings[0] + "," + strings[1];
FileInputFormat.addInputPaths(job, paths);
MultipleInputs.addInputPath() comes in two variants:
static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass)
static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass, Class<? extends Mapper> mapperClass)
The former specifies an InputFormat per path:
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class);
The latter additionally assigns a Mapper to each path, so different Mappers can process different kinds of files:
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, MultiPathMR.MultiMap1.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, MultiPathMR.MultiMap2.class);
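The MultiPathMR class referenced here is not shown in the original post; the following is only a plausible sketch of it, assuming two text inputs with different field separators. Both Mappers must emit the same key/value types so the Reducer sees a uniform stream:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical container class for the two Mappers used with MultipleInputs.
public class MultiPathMR {
    // Handles, say, comma-separated records from the first path.
    public static class MultiMap1 extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split(",")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Handles, say, tab-separated records from the second path.
    public static class MultiMap2 extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\t")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}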
FileInputFormat has three methods for setting input paths:
static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
static void setInputPaths(Job job, Path... inputPaths)
static void setInputPaths(Job job, String commaSeparatedPaths)
setInputPathFilter() registers a custom PathFilter for the input, while setInputPaths() sets one or more input paths in a single call. The paths may contain globs, using the following wildcards:
| Wildcard | Description |
| --- | --- |
| * | Matches zero or more characters |
| ? | Matches a single character |
| [ab] | Matches a single character in the set {a, b} |
| [^ab] | Matches a single character not in the set {a, b} |
| [a-b] | Matches a single character in the closed range [a, b], ordered lexicographically |
| [^a-b] | Matches a single character not in the closed range [a, b] |
| {a,b} | Matches either expression a or expression b |
| \c | Matches the metacharacter c literally |
For example, the following glob covers all of a day's hourly directories:
/user/yyyy/mm/dd/*/
setInputPaths() accepts a single Path (which may itself contain a glob like the one above):
FileInputFormat.setInputPaths(job, new Path(strings[0]));
an array of Paths:
Path[] paths = {new Path(strings[0]), new Path(strings[1])};
FileInputFormat.setInputPaths(job, paths);
or a comma-separated string of paths:
String paths = strings[0] + "," + strings[1];
FileInputFormat.setInputPaths(job, paths);
Note that unlike addInputPath(), each call to setInputPaths() replaces any previously set input paths rather than appending to them.
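setInputPathFilter(), listed above but not demonstrated, registers a PathFilter that is applied on top of FileInputFormat's built-in filter for hidden files (names starting with "." or "_"). A minimal sketch, where the TmpFilter class and the .tmp suffix are assumptions for illustration:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: skip temporary files, accept everything else.
public class TmpFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return !path.getName().endsWith(".tmp");
    }
}

It is then attached in the driver with FileInputFormat.setInputPathFilter(job, TmpFilter.class);.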
Turning to output: MultipleOutputs lets a job write to multiple files, and its write() method comes in three forms:
void write(KEYOUT key, VALUEOUT value, String baseOutputPath)
<K, V> void write(String namedOutput, K key, V value)
<K, V> void write(String namedOutput, K key, V value, String baseOutputPath)
Take a simple word count as an example, with the following input (one record per line):
hello,world
hello,hadoop
hello,spark
The Mapper and Reducer:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultiOutMR {

    public static class MultiOutMapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text outKey = new Text();
        private IntWritable outValue = new IntWritable(1);

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line holds comma-separated words; emit (word, 1) for each.
            String[] line = value.toString().trim().split(",");
            for (String word : line) {
                outKey.set(word);
                context.write(outKey, outValue);
            }
        }
    }

    public static class MultiOutReducer extends Reducer<Text, IntWritable, Text, LongWritable> {
        private LongWritable count = new LongWritable();
        private MultipleOutputs<Text, LongWritable> outputs;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            outputs = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            count.set(sum);

            Configuration conf = context.getConfiguration();
            String type = conf.get("type");
            if (type.equalsIgnoreCase("namedOutput")) {
                // Route by named output: "hello" records to one file, the rest to "IT".
                if (key.toString().equals("hello")) {
                    outputs.write("hello", key, count);
                } else {
                    outputs.write("IT", key, count);
                }
            } else if (type.equalsIgnoreCase("baseOutputPath")) {
                // Route by base output path: one output file per distinct key.
                outputs.write(key, count, key.toString());
            } else {
                // Combine both: a named output plus a per-key base path.
                if (key.toString().equals("hello")) {
                    outputs.write("hello", key, count, key.toString());
                } else {
                    outputs.write("IT", key, count, key.toString());
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // MultipleOutputs must be closed, or output files may be left incomplete.
            outputs.close();
        }
    }
}
The driver:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] strings) throws Exception {
        Configuration conf = getConf();
        conf.set("type", strings[2]);

        Job job = Job.getInstance(conf, "Multiple Output");
        job.setJarByClass(Driver.class);

        job.setMapperClass(MultiOutMR.MultiOutMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(MultiOutMR.MultiOutReducer.class);
        // The base-output-path form of write() uses the job's output key/value classes.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Named outputs are only needed when the reducer writes through them.
        if (!strings[2].equalsIgnoreCase("baseOutputPath")) {
            MultipleOutputs.addNamedOutput(job, "hello", TextOutputFormat.class, Text.class, LongWritable.class);
            MultipleOutputs.addNamedOutput(job, "IT", TextOutputFormat.class, Text.class, LongWritable.class);
        }

        FileInputFormat.addInputPath(job, new Path(strings[0]));
        FileOutputFormat.setOutputPath(job, new Path(strings[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: <input> <output> <type>");
            System.out.println("Type:\n"
                    + "namedOutput - the named output name.\n"
                    + "baseOutputPath - base output path to write the record to. Note: the framework will generate a unique filename for the baseOutputPath.\n"
                    + "all - combines namedOutput and baseOutputPath.");
            System.exit(1);
        }
        System.exit(ToolRunner.run(conf, new Driver(), otherArgs));
    }
}
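A note on what to expect from this job: in namedOutput mode it should leave files such as hello-r-00000 and IT-r-00000 in the output directory, while in baseOutputPath mode it should leave one file per distinct key (hello-r-00000, world-r-00000, and so on). Because the reducer never calls context.write(), the default part-r-* files are still created but stay empty; wrapping the output format with LazyOutputFormat.setOutputFormatClass() is the usual way to suppress them.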
References:
http://blog.zaloni.com/using-globs-and-wildcards-with-mapreduce