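A SequenceFile can be used to pack many small files into a single file, with each file name stored as the key and the file contents as the value. (A MapFile is the sorted form of a SequenceFile.) The program is as follows: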
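import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class testSequenceFile {
    // args[0]: directory containing the small files
    // args[1]: path of the SequenceFile to create
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus[] files = fs.listStatus(new Path(args[0]));
        Text key = new Text();
        Text value = new Text();
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]), key.getClass(), value.getClass());
        try {
            for (FileStatus file : files) {
                if (file.isDir()) {
                    continue; // only regular files can be opened and packed
                }
                // Key: the file name. Value: the whole file content (assumed to be text).
                key.set(file.getPath().getName());
                byte[] buffer = new byte[(int) file.getLen()];
                InputStream in = fs.open(file.getPath());
                try {
                    IOUtils.readFully(in, buffer, 0, buffer.length);
                } finally {
                    IOUtils.closeStream(in); // close the input even if the read fails
                }
                value.set(buffer);
                System.out.println(key.toString() + "\n" + value.toString());
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer); // flush and close the SequenceFile
        }
    }
}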
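Note that a SequenceFile is a binary file, so commands such as cat, more, and less cannot display its contents as text. Use the -text option of the hadoop fs command instead: it looks at the file's leading bytes (its magic number) to detect the file type and converts the contents to text where it can, as shown below: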
KeXie@KeXie-PC ~/hadoop-0.20.2
$ hadoop fs -cat soutput
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text*org.apache.hadoop.io.compress.DefaultCodec▒A▒=▒▒=U▒2▒,a.txtx▒▒▒L▒,*▒▒,,M▒▒▒▒<▒A#b.txtx▒▒L▒H▒▒▒y▒\▒▒▒y▒\@6n:c.txtx▒▒+*▒▒,,M▒▒▒▒%▒▒

KeXie@KeXie-PC ~/hadoop-0.20.2
$ hadoop fs -text soutput
a.txt xie chen liang quan
b.txt chen chen wen an wen
c.txt mo an an
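
The raw -cat output above begins with the bytes SEQ, which is the marker that type detection can key on. As a rough illustration of that idea (not the actual Hadoop implementation), the sketch below checks the first bytes of a local copy of the file using only the Java standard library. The class name SeqMagicCheck and its methods are hypothetical names made up for this example, and it assumes the file was first copied to the local disk, for instance with hadoop fs -get.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; it is not part of Hadoop.
class SeqMagicCheck {

    // A SequenceFile header starts with the ASCII bytes 'S', 'E', 'Q',
    // followed by a single version byte.
    static boolean isSequenceFile(byte[] header) {
        return header != null
                && header.length >= 4
                && header[0] == 'S'
                && header[1] == 'E'
                && header[2] == 'Q';
    }

    // Returns the version byte, or -1 when the header is not a SequenceFile header.
    static int version(byte[] header) {
        return isSequenceFile(header) ? header[3] : -1;
    }

    // Reads up to four bytes from the stream; the result is shorter
    // if the stream ends early.
    static byte[] readHeader(InputStream in) throws IOException {
        byte[] buf = new byte[4];
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                break; // end of stream before four bytes were read
            }
            off += n;
        }
        return Arrays.copyOf(buf, off);
    }

    // args[0]: path of a local file to inspect
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            byte[] header = readHeader(in);
            System.out.println(isSequenceFile(header)
                    ? "SequenceFile, version " + version(header)
                    : "not a SequenceFile");
        }
    }
}
```

Run against a local copy of soutput, this should report a SequenceFile, while a plain text file such as a.txt should be reported as not a SequenceFile.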