Solved: specifying a custom InputFormat Java class in Hadoop Streaming
I wanted to use my own input format class with Hadoop Streaming.
The advice I found online says:
How do I provide my own input/output format with streaming?
At least as late as version 0.14, Hadoop does not support multiple jar files. So, when specifying your own custom classes you will have to pack them along with the streaming jar and use the custom jar instead of the default hadoop streaming jar.
I couldn't find any other solution, so I decided to build it myself.
1) Running grep TextInputFormat over the source tree shows that the streaming code lives under hadoop-0.20.2/src/contrib/streaming/src/java/org/apache/hadoop/streaming:
Looking at StreamJob.java:
inputFormatSpec_ = (String)cmdLine.getOptionValue("inputformat");
if (inReaderSpec_ == null && inputFormatSpec_ == null) {
  fmt = TextInputFormat.class;
} else if (inputFormatSpec_ != null) {
  if (inputFormatSpec_.equals(TextInputFormat.class.getName())
      || inputFormatSpec_.equals(TextInputFormat.class.getCanonicalName())
      || inputFormatSpec_.equals(TextInputFormat.class.getSimpleName())) {
    fmt = TextInputFormat.class;
  } else if (inputFormatSpec_.equals(KeyValueTextInputFormat.class.getName())
      || inputFormatSpec_.equals(KeyValueTextInputFormat.class.getCanonicalName())
      || inputFormatSpec_.equals(KeyValueTextInputFormat.class.getSimpleName())) {
  } else if (inputFormatSpec_.equals(SequenceFileInputFormat.class.getName())
      || inputFormatSpec_.equals(org.apache.hadoop.mapred.SequenceFileInputFormat.class.getCanonicalName())
      || inputFormatSpec_.equals(org.apache.hadoop.mapred.SequenceFileInputFormat.class.getSimpleName())) {
  } else if (inputFormatSpec_.equals(SequenceFileAsTextInputFormat.class.getName())
      || inputFormatSpec_.equals(SequenceFileAsTextInputFormat.class.getCanonicalName())
      || inputFormatSpec_.equals(SequenceFileAsTextInputFormat.class.getSimpleName())) {
    fmt = SequenceFileAsTextInputFormat.class;
  } else {
    c = StreamUtil.goodClassOrNull(jobConf_, inputFormatSpec_, defaultPackage);
    if (c != null) {
      fmt = c;
    } else {
      fail("-inputformat : class not found : " + inputFormatSpec_);
    }
String defaultPackage = this.getClass().getPackage().getName();
package org.apache.hadoop.streaming;
From the source you can see that Hadoop Streaming automatically looks for the -inputformat class in its own package, org.apache.hadoop.streaming: that package name is what StreamJob passes as defaultPackage to StreamUtil.goodClassOrNull.
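In other words, when the value given to -inputformat contains no package qualifier, the streaming package is prepended before the class is loaded, which is exactly why dropping a class into org.apache.hadoop.streaming makes a bare name like SdfInputFormat resolvable. A rough sketch of that resolution behavior (the class and method names below are mine, for illustration only, not the actual Hadoop source):

import org.apache.hadoop.conf.Configuration;

// Illustration only, not the real StreamUtil code: an unqualified
// -inputformat value gets the default package prepended, then the class
// is loaded through the job configuration.
class InputFormatResolutionSketch {
  static Class<?> resolve(Configuration conf, String spec, String defaultPackage) {
    String name = (spec.indexOf('.') == -1) ? defaultPackage + "." + spec : spec;
    try {
      // e.g. "SdfInputFormat" -> "org.apache.hadoop.streaming.SdfInputFormat"
      return conf.getClassByName(name);
    } catch (ClassNotFoundException e) {
      return null; // StreamJob reports "-inputformat : class not found" on null
    }
  }
}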
2) So I rewrote SdfInputFormat.java to put it in the org.apache.hadoop.streaming package.
SdfInputFormat.java:
package org.apache.hadoop.streaming;
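// NOTE: only the package declaration above appears in the original post; the
// body below is a hypothetical sketch of what a custom InputFormat for
// streaming 1.0.0 (old org.apache.hadoop.mapred API) could look like, not the
// real SdfInputFormat.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class SdfInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    // Placeholder reader: a real SdfInputFormat would return a RecordReader
    // that emits one complete SDF record per value instead of one text line.
    return new LineRecordReader(job, (FileSplit) split);
  }
}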
Wrong approach 1: put SdfInputFormat.java alongside the other streaming source files and compile them all together:
(At first this failed with a "package org.apache.commons.logging does not exist" error; after searching online for a while I stumbled on the fact that hadoop-0.20.2/lib/* contains the missing jars.)
javac -classpath ~/hadoop-1.0.0/share/hadoop/hadoop-core-1.0.0.jar:classes-line/:/home/mjiang/hadoop-0.20.2/lib/* -d classes-streaming/ ~/Downloads/hadoop-1.0.0/hadoop-1.0.0/src/contrib/streaming/src/java/org/apache/hadoop/streaming/* -Xlint:deprecation
jar -cvf jar-streaming/hadoop-streaming-1.0.0-my.jar -C classes-streaming/ .
Run it:
hadoop jar jar-streaming/hadoop-streaming-1.0.0-my.jar HadoopStreaming -input "user/mjiang/target-seq/sdfgz.seq" -output "test/output" -mapper 'babel -isdf -ofpt -xh -xf FP4' -inputformat SdfInputFormat -numReduceTasks 0
hadoop jar ~/hadoop-1.0.0/share/hadoop/contrib/streaming/hadoop-streaming-1.0.0.jar
This failed with a "could not find the main class" error.
My guess is that something went wrong in the build (compilation order or something else); I never figured it out.
~~~~~~~~~~~
The approach that works: add the already-compiled SdfInputFormat.class directly into the existing streaming jar.
jar uf hadoop-streaming-1.0.0.jar -C classes-streaming org/apache/hadoop/streaming/SdfInputFormat.class
(Getting the jar uf invocation right also took a few tries.)
OK, it works now.