The implementing file is mahout-distribution-0.6/integration/src/main/java/org/apache/mahout/text/SequenceFilesFromDirectory.java.
Why convert to SequenceFile
Raw documents cannot be processed by Hadoop directly; they must first be converted into SequenceFile format. That conversion is the "sequencing" step.
Class definition
The SequenceFile format is provided by Hadoop. When we write a Hadoop job class ourselves, it usually looks like this:

public class MyJob extends Configured implements Tool {
    // ...
}
Mahout provides an AbstractJob class; its declaration is:
public abstract class AbstractJob extends Configured implements Tool
SequenceFilesFromDirectory is built on this AbstractJob pattern.
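The Configured/Tool shape can be sketched without a Hadoop classpath. In the sketch below, Tool and Configured are simplified stand-ins for org.apache.hadoop.util.Tool and org.apache.hadoop.conf.Configured, and MyJobSketch is an invented name; this only illustrates the pattern, not Mahout's actual code:

```java
// Simplified stand-ins for the Hadoop classes (the real code would import
// org.apache.hadoop.util.Tool and org.apache.hadoop.conf.Configured).
interface Tool {
    int run(String[] args) throws Exception; // 0 = success, non-zero = failure
}

class Configured {
    // In Hadoop this holds a Configuration object; elided in this sketch.
}

// A job class follows the same shape as AbstractJob subclasses:
// extend Configured for configuration access, implement Tool for run().
public class MyJobSketch extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Argument parsing and job setup would go here
        // (AbstractJob does this in parseArguments).
        return args.length >= 2 ? 0 : 1; // e.g., require input and output paths
    }

    public static void main(String[] args) throws Exception {
        // Hadoop would normally invoke this via ToolRunner.run(...).
        System.exit(new MyJobSketch().run(args));
    }
}
```

The value of the pattern is that ToolRunner handles generic Hadoop options before handing the remaining arguments to run().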
Input argument handling
The most important method here is parseDirectories, which sets inputPath and outputPath:
this.inputPath = new Path(cmdLine.getValue(inputOption).toString());
this.outputPath = new Path(cmdLine.getValue(outputOption).toString());
If the arguments do not include an input path and an output path, the program exits and prints the reason.
parseDirectories is called from parseArguments, which also registers the options with a GroupBuilder.
The SequenceFile writer used here is ChunkedWriter, whose chunk size can be specified as an input argument.
PathFilter
When Hadoop's FileSystem.listStatus is called with a PathFilter as its second argument, it invokes the filter's accept method on every entry, as the following code shows:
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException {
  ArrayList<FileStatus> results = new ArrayList<FileStatus>();
  listStatus(results, f, filter);
  return results.toArray(new FileStatus[results.size()]);
}

private void listStatus(ArrayList<FileStatus> results, Path f, PathFilter filter)
    throws IOException {
  FileStatus listing[] = listStatus(f);
  if (listing != null) {
    for (int i = 0; i < listing.length; i++) {
      if (filter.accept(listing[i].getPath())) {
        results.add(listing[i]);
      }
    }
  }
}
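The filtering loop above can be mimicked without Hadoop. In this sketch, PathFilter and the String-based listing are hypothetical stand-ins for the real org.apache.hadoop.fs types; only the accept-and-collect pattern matches the code above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in for org.apache.hadoop.fs.PathFilter: decides per path whether
// listStatus should include it in the returned results.
interface PathFilter {
    boolean accept(String path);
}

public class ListStatusSketch {
    // Mirrors FileSystem.listStatus(Path, PathFilter): list everything,
    // then keep only the entries the filter accepts.
    static List<String> listStatus(List<String> listing, PathFilter filter) {
        List<String> results = new ArrayList<>();
        for (String entry : listing) {
            if (filter.accept(entry)) {
                results.add(entry);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("a.txt", "_logs", "b.txt");
        // Example filter: skip underscore-prefixed entries.
        List<String> kept = listStatus(listing, p -> !p.startsWith("_"));
        System.out.println(kept); // [a.txt, b.txt]
    }
}
```

Note that the contract leaves the filter free to do arbitrary work inside accept, which is exactly what Mahout exploits next.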
The accept implementation used by SequenceFilesFromDirectory logs each child of the current path and hands it to process, always returning false so that listStatus itself collects nothing:

@Override
public final boolean accept(Path current) {
  log.debug("CURRENT: {}", current.getName());
  try {
    for (FileStatus fst : fs.listStatus(current)) {
      log.debug("CHILD: {}", fst.getPath().getName());
      process(fst, current);
    }
  } catch (IOException ioe) {
    throw new IllegalStateException(ioe);
  }
  return false;
}
PrefixAdditionFilter's process method recurses into subdirectories, passing a new PrefixAdditionFilter with an extended prefix to the nested listStatus call, and writes each regular file's contents through the ChunkedWriter:

@Override
protected void process(FileStatus fst, Path current) throws IOException {
  FileSystem fs = getFs();
  ChunkedWriter writer = getWriter();
  if (fst.isDir()) {
    String dirPath = getPrefix() + Path.SEPARATOR + current.getName()
        + Path.SEPARATOR + fst.getPath().getName();
    fs.listStatus(fst.getPath(),
        new PrefixAdditionFilter(getConf(), dirPath, getOptions(), writer, getCharset(), fs));
  } else {
    InputStream in = null;
    try {
      in = fs.open(fst.getPath());
      StringBuilder file = new StringBuilder();
      for (String aFit : new FileLineIterable(in, getCharset(), false)) {
        file.append(aFit).append('\n');
      }
      String name = current.getName().equals(fst.getPath().getName())
          ? current.getName()
          : current.getName() + Path.SEPARATOR + fst.getPath().getName();
      writer.write(getPrefix() + Path.SEPARATOR + name, file.toString());
    } finally {
      Closeables.closeQuietly(in);
    }
  }
}
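The net effect of that recursion is one record per document, keyed by a prefixed relative path. A hypothetical, Hadoop-free sketch of the same traversal using java.nio.file (PrefixWalkSketch is an invented name; Mahout's real code walks HDFS via listStatus instead):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the traversal done by PrefixAdditionFilter: walk a directory
// tree and, for each regular file, record a key of the form
// <prefix>/<relative path> mapped to the file's full contents, one record
// per document (mirroring writer.write(getPrefix() + Path.SEPARATOR + name, ...)).
public class PrefixWalkSketch {
    static Map<String, String> collect(Path root, String prefix) throws IOException {
        Map<String, String> records = new TreeMap<>();
        try (var paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isRegularFile(p)) {
                    // Build the prefixed key from the path relative to the root.
                    String name = root.relativize(p).toString().replace('\\', '/');
                    records.put(prefix + "/" + name, Files.readString(p));
                }
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("docs");
        Files.writeString(root.resolve("a.txt"), "hello\n");
        Path sub = Files.createDirectories(root.resolve("sub"));
        Files.writeString(sub.resolve("b.txt"), "world\n");
        System.out.println(collect(root, "chunk"));
    }
}
```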
ChunkedWriter's write operation is actually carried out by a SequenceFile.Writer, as the ChunkedWriter source below clearly shows:
public void write(String key, String value) throws IOException {
  if (currentChunkSize > maxChunkSizeInBytes) {
    Closeables.closeQuietly(writer);
    currentChunkID++;
    writer = new SequenceFile.Writer(fs, conf, getPath(currentChunkID), Text.class, Text.class);
    currentChunkSize = 0;
  }
  Text keyT = new Text(key);
  Text valueT = new Text(value);
  currentChunkSize += keyT.getBytes().length + valueT.getBytes().length; // Overhead
  writer.append(keyT, valueT);
}
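The rollover logic can be isolated from the SequenceFile machinery. The following sketch (ChunkRolloverSketch is an invented name, not Mahout code) mirrors the size check above; note that because the check happens before the record is counted, a chunk is only closed on the write after its size has already passed the limit, so chunks can overshoot maxChunkSizeInBytes:

```java
// Sketch of ChunkedWriter's rollover: accumulate key+value byte sizes and
// start a new chunk once the running total has exceeded the limit, exactly
// as the write() method above does before delegating to SequenceFile.Writer.
public class ChunkRolloverSketch {
    private final int maxChunkSizeInBytes;
    private int currentChunkId = 0;
    private int currentChunkSize = 0;

    public ChunkRolloverSketch(int maxChunkSizeInBytes) {
        this.maxChunkSizeInBytes = maxChunkSizeInBytes;
    }

    // Returns the chunk id this record lands in (the real code would
    // close the old SequenceFile.Writer and open chunk-<id> here).
    public int write(String key, String value) {
        if (currentChunkSize > maxChunkSizeInBytes) {
            currentChunkId++;
            currentChunkSize = 0;
        }
        currentChunkSize += key.getBytes().length + value.getBytes().length;
        return currentChunkId;
    }

    public static void main(String[] args) {
        ChunkRolloverSketch w = new ChunkRolloverSketch(10);
        System.out.println(w.write("k", "hello")); // 0 (size now 6)
        System.out.println(w.write("k", "hello")); // 0 (size now 12, over limit)
        System.out.println(w.write("k", "x"));     // 1 (rolled over)
    }
}
```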