This chapter covers the final step of the Nutch source-code walkthrough: building an index on a Solr server with the command "bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ -filter -normalize".
First, look at the relevant fragment of the nutch launcher script:
elif [ "$COMMAND" = "solrindex" ] ; then
CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1"
shift
The solrindex branch runs IndexingJob's main function; ToolRunner's GenericOptionsParser turns the -D option into the configuration property solr.server.url, so the first argument "http://localhost:8983/solr" is stored under that name.
IndexingJob::main
public static void main(String[] args) throws Exception {
final int res = ToolRunner.run(NutchConfiguration.create(),
new IndexingJob(), args);
System.exit(res);
}
public int run(String[] args) throws Exception {
index(crawlDb, linkDb, segments, noCommit, deleteGone, params, filter, normalize, addBinaryContent, base64);
return 0;
}
public void index(Path crawlDb, Path linkDb, List<Path> segments,
boolean noCommit, boolean deleteGone, String params,
boolean filter, boolean normalize, boolean addBinaryContent,
boolean base64) throws IOException {
final JobConf job = new NutchJob(getConf());
job.setJobName("Indexer");
IndexWriters writers = new IndexWriters(getConf());
IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job, addBinaryContent);
...
final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-"
+ new Random().nextInt());
FileOutputFormat.setOutputPath(job, tmp);
RunningJob indexJob = JobClient.runJob(job);
writers.open(job, "commit");
writers.commit();
}
The solrindex command thus ends up in IndexingJob's run function, which parses the remaining command-line options (crawldb path, -linkdb, -dir, -filter, -normalize and so on) and then calls index. index first instantiates the configured index writers through IndexWriters (here a SolrIndexWriter, contributed by the indexer-solr plugin), calls initMRJob to set up the job, points the job's output at a temporary directory, runs the job, and afterwards opens the writers again to commit. A simplified sketch of the option parsing in run is given below.
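The real parsing in IndexingJob.run handles more options (-params, -noCommit, -deleteGone, -addBinaryContent, -base64, ...); the following is only a hypothetical, simplified sketch of how the command from the beginning of the chapter maps onto the parameters of index:
// Hypothetical simplified sketch, not the actual Nutch source
// (assumes org.apache.hadoop.fs.Path/FileSystem/FileStatus and java.util imports).
Path crawlDb = null;
Path linkDb = null;
List<Path> segments = new ArrayList<Path>();
boolean filter = false, normalize = false;
for (int i = 0; i < args.length; i++) {
  if ("-linkdb".equals(args[i])) {
    linkDb = new Path(args[++i]);                  // crawl/linkdb/
  } else if ("-dir".equals(args[i])) {
    Path dir = new Path(args[++i]);                // crawl/segments/
    FileSystem fs = dir.getFileSystem(getConf());
    for (FileStatus stat : fs.listStatus(dir)) {   // every subdirectory is one segment
      segments.add(stat.getPath());
    }
  } else if ("-filter".equals(args[i])) {
    filter = true;
  } else if ("-normalize".equals(args[i])) {
    normalize = true;
  } else if (crawlDb == null) {
    crawlDb = new Path(args[i]);                   // first positional argument: crawl/crawldb/
  } else {
    segments.add(new Path(args[i]));               // further positional arguments: single segments
  }
}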
IndexerMapReduce::initMRJob
public static void initMRJob(Path crawlDb, Path linkDb,
Collection<Path> segments, JobConf job, boolean addBinaryContent) {
for (final Path segment : segments) {
FileInputFormat.addInputPath(job, new Path(segment,
CrawlDatum.FETCH_DIR_NAME));
FileInputFormat.addInputPath(job, new Path(segment,
CrawlDatum.PARSE_DIR_NAME));
FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));
if (addBinaryContent) {
FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
}
}
FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
if (linkDb != null) {
Path currentLinkDb = new Path(linkDb, LinkDb.CURRENT_NAME);
FileInputFormat.addInputPath(job, currentLinkDb);
}
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(IndexerMapReduce.class);
job.setReducerClass(IndexerMapReduce.class);
job.setOutputFormat(IndexerOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setMapOutputValueClass(NutchWritable.class);
job.setOutputValueClass(NutchWritable.class);
}
This sets the job's inputs to the crawl_fetch, crawl_parse, parse_data, parse_text and (optionally) content directories under crawl/segments/*/, the current directory under crawl/crawldb, and the current directory under crawl/linkdb. Mapper and Reducer are both IndexerMapReduce, and the output format is IndexerOutputFormat; each of these is examined in turn below.
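Put together, the indexing job therefore reads a crawl directory laid out roughly as follows (the segment name is just an illustrative timestamp):
crawl/
  crawldb/current/              CrawlDatum for every known url
  linkdb/current/               Inlinks (incoming anchors) per url
  segments/20150622131602/
    crawl_fetch/                CrawlDatum written while fetching
    crawl_parse/                CrawlDatum written while parsing
    parse_data/                 ParseData (metadata, outlinks, ...)
    parse_text/                 ParseText (extracted plain text)
    content/                    Content (raw fetched bytes, read only with -addBinaryContent)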
IndexerMapReduce::map
public void map(Text key, Writable value,
OutputCollector<Text, NutchWritable> output, Reporter reporter)
throws IOException {
String urlString = filterUrl(normalizeUrl(key.toString()));
if (urlString == null) {
return;
} else {
key.set(urlString);
}
output.collect(key, new NutchWritable(value));
}
The map function is straightforward: it normalizes and filters the url and then passes the value on to the Reducer wrapped in a NutchWritable. filterUrl and normalizeUrl are thin wrappers around Nutch's URLFilters and URLNormalizers plugins and only act when -filter or -normalize was given.
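A rough sketch of these two helpers (simplified from the Nutch source; the real code also logs failures):
// simplified sketch of IndexerMapReduce's url helpers
private String normalizeUrl(String url) {
  if (!normalize) {
    return url;
  }
  try {
    // apply all configured urlnormalizer plugins in the indexer scope
    return urlNormalizers.normalize(url, URLNormalizers.SCOPE_INDEXER);
  } catch (Exception e) {
    return null;
  }
}
private String filterUrl(String url) {
  if (!filter) {
    return url;
  }
  try {
    // apply all configured urlfilter plugins; null means the url is rejected
    return urlFilters.filter(url);
  } catch (Exception e) {
    return null;
  }
}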
IndexerMapReduce::reduce
public void reduce(Text key, Iterator<NutchWritable> values,
OutputCollector<Text, NutchIndexAction> output, Reporter reporter)
throws IOException {
Inlinks inlinks = null;
CrawlDatum dbDatum = null;
CrawlDatum fetchDatum = null;
Content content = null;
ParseData parseData = null;
ParseText parseText = null;
while (values.hasNext()) {
final Writable value = values.next().get(); // unwrap
if (value instanceof Inlinks) {
inlinks = (Inlinks) value;
} else if (value instanceof CrawlDatum) {
final CrawlDatum datum = (CrawlDatum) value;
if (CrawlDatum.hasDbStatus(datum)) {
dbDatum = datum;
} else if (CrawlDatum.hasFetchStatus(datum)) {
if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
fetchDatum = datum;
}
} else if (CrawlDatum.STATUS_LINKED == datum.getStatus()
|| CrawlDatum.STATUS_SIGNATURE == datum.getStatus()
|| CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
continue;
}
} else if (value instanceof ParseData) {
parseData = (ParseData) value;
if (deleteRobotsNoIndex) {
String robotsMeta = parseData.getMeta("robots");
if (robotsMeta != null
&& robotsMeta.toLowerCase().indexOf("noindex") != -1) {
output.collect(key, DELETE_ACTION);
return;
}
}
} else if (value instanceof ParseText) {
parseText = (ParseText) value;
} else if (value instanceof Content) {
content = (Content)value;
}
}
...
NutchDocument doc = new NutchDocument();
doc.add("id", key.toString());
final Metadata metadata = parseData.getContentMeta();
doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));
doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));
final Parse parse = new ParseImpl(parseText, parseData);
float boost = 1.0f;
boost = this.scfilters.indexerScore(key, doc, dbDatum, fetchDatum, parse,
inlinks, boost);
doc.setWeight(boost);
doc.add("boost", Float.toString(boost));
fetchDatum.setSignature(dbDatum.getSignature());
final Text url = (Text) dbDatum.getMetaData().get(
Nutch.WRITABLE_REPR_URL_KEY);
String urlString = filterUrl(normalizeUrl(url.toString()));
url.set(urlString);
fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
if (content != null) {
String binary;
if (base64) {
binary = Base64.encodeBase64String(content.getContent());
} else {
binary = new String(content.getContent());
}
doc.add("binaryContent", binary);
}
NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
output.collect(key, action);
}
The reduce function first sorts the incoming values by origin: the CrawlDatum stored in crawl_fetch and crawl_parse under crawl/segments/*/, the ParseData from parse_data, the ParseText from parse_text, the Content from content, the CrawlDatum from crawl/crawldb/current, and the Inlinks from crawl/linkdb/current.
The elided part checks whether the record has to be deleted (emitting a delete action) or skipped.
reduce then creates a NutchDocument and fills in its fields: id is the url, segment is the segment name (the directory name under crawl/segments), digest is the page signature, boost is the document score computed by indexerScore, and binaryContent is the raw, unparsed content including markup (only added when -addBinaryContent was given). Finally the document is wrapped in a NutchIndexAction with the ADD action and emitted.
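For illustration, a document emitted here might carry values like the following (all values are made up):
id            = http://example.com/page.html
segment       = 20150622131602
digest        = 0b5e1f3a9c2d4e6f8a1b3c5d7e9f0a2b
boost         = 1.0
binaryContent = <html><head><title>Example</title></head>...   (only with -addBinaryContent)
Further fields such as title, content, url and anchor are contributed by the indexing-filter plugins invoked through this.filters.filter.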
Next, look at how IndexerOutputFormat handles the emitted NutchIndexAction; the temporary output directory set above is essentially a placeholder, the actual output goes to the index writers.
IndexerOutputFormat::getRecordWriter
public RecordWriter<Text, NutchIndexAction> getRecordWriter(
FileSystem ignored, JobConf job, String name, Progressable progress)
throws IOException {
final IndexWriters writers = new IndexWriters(job);
writers.open(job, name);
return new RecordWriter<Text, NutchIndexAction>() {
public void close(Reporter reporter) throws IOException {
writers.close();
}
public void write(Text key, NutchIndexAction indexAction)
throws IOException {
if (indexAction.action == NutchIndexAction.ADD) {
writers.write(indexAction.doc);
} else if (indexAction.action == NutchIndexAction.DELETE) {
writers.delete(key.toString());
}
}
};
}
writers is created as a SolrIndexWriter: its open function establishes the connection to the Solr server, and its write function then sends the documents to the Solr server for indexing. Both are examined below.
SolrIndexWriter::open
public void open(JobConf job, String name) throws IOException {
solrClients = SolrUtils.getSolrClients(job);
init(solrClients, job);
}
public static ArrayList<SolrClient> getSolrClients(JobConf job) throws MalformedURLException {
String[] urls = job.getStrings(SolrConstants.SERVER_URL);
ArrayList<SolrClient> solrClients = new ArrayList<SolrClient>();
for (int i = 0; i < urls.length; i++) {
SolrClient sc = new HttpSolrClient(urls[i]);
solrClients.add(sc);
}
return solrClients;
}
SolrIndexWriter's open function reads the Solr server address from the solr.server.url property, creates an HttpSolrClient per configured url (getStrings splits a comma-separated list, so several servers can in principle be listed), and then calls init to initialize the writer.
SolrIndexWriter::write
public void write(NutchDocument doc) throws IOException {
final SolrInputDocument inputDoc = new SolrInputDocument();
for (final Entry<String, NutchField> e : doc) {
for (final Object val : e.getValue().getValues()) {
Object val2 = val;
if (val instanceof Date) {
val2 = DateUtil.getThreadLocalDateFormat().format(val);
}
if (e.getKey().equals("content") || e.getKey().equals("title")) {
val2 = SolrUtils.stripNonCharCodepoints((String) val);
}
inputDoc.addField(solrMapping.mapKey(e.getKey()), val2, e.getValue()
.getWeight());
String sCopy = solrMapping.mapCopyKey(e.getKey());
if (sCopy != e.getKey()) {
inputDoc.addField(sCopy, val);
}
}
}
inputDoc.setDocumentBoost(doc.getWeight());
inputDocs.add(inputDoc);
totalAdds++;
if (inputDocs.size() + numDeletes >= batchSize) {
push();
}
}
The main job of write is to iterate over the document's fields, map each field name through the solrindex mapping (solrMapping), and add the values to a SolrInputDocument that is collected in inputDocs; once the batch size is reached, push sends the batch to the Solr server.
SolrIndexWriter::write->push
public void push() throws IOException {
UpdateRequest req = new UpdateRequest();
req.add(inputDocs);
req.setAction(AbstractUpdateRequest.ACTION.OPTIMIZE, false, false);
req.setParams(params);
for (SolrClient solrClient : solrClients) {
NamedList<Object> res = solrClient.request(req);
}
inputDocs.clear();
}
push builds an UpdateRequest containing the batched documents and sends it to the Solr server through HttpSolrClient's request function.
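Stripped of the Nutch plumbing, this is equivalent to the following standalone SolrJ sketch (assuming the same SolrJ generation as the code above; url and field values are illustrative):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class PushSketch {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr");
    try {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "http://example.com/page.html");   // made-up document
      doc.addField("title", "Example");
      UpdateRequest req = new UpdateRequest();
      req.add(doc);
      client.request(req);   // same call path as push(): HttpSolrClient.request(...)
      client.commit();       // IndexingJob triggers the commit later via writers.commit()
    } finally {
      client.close();
    }
  }
}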
SolrIndexWriter::write->push->HttpSolrClient::request
public NamedList<Object> request(final SolrRequest request, final ResponseParser processor)
throws SolrServerException, IOException {
...
}
The processor passed in defaults to BinaryResponseParser; createMethod wraps the request into an HTTP request, and executeMethod executes it, receives the response and processes it.
SolrIndexWriter::write->push->HttpSolrClient::request->createMethod
protected HttpRequestBase createMethod(final SolrRequest request, String collection) throws IOException, SolrServerException {
SolrParams params = request.getParams();
Collection<ContentStream> streams = requestWriter.getContentStreams(request);
String path = requestWriter.getPath(request);
ResponseParser parser = request.getResponseParser();
ModifiableSolrParams wparams = new ModifiableSolrParams(params);
wparams.set(CommonParams.WT, parser.getWriterType());
wparams.set(CommonParams.VERSION, parser.getVersion());
String basePath = baseUrl;
if (SolrRequest.METHOD.POST == request.getMethod() || SolrRequest.METHOD.PUT == request.getMethod()) {
String url = basePath + path;
boolean hasNullStreamName = false;
if (streams != null) {
for (ContentStream cs : streams) {
if (cs.getName() == null) {
hasNullStreamName = true;
break;
}
}
}
...
String fullQueryUrl = url + wparams.toQueryString();
HttpEntityEnclosingRequestBase postOrPut = SolrRequest.METHOD.POST == request.getMethod() ?
new HttpPost(fullQueryUrl) : new HttpPut(fullQueryUrl);
final ContentStream[] contentStream = new ContentStream[1];
for (ContentStream content : streams) {
contentStream[0] = content;
break;
}
if (contentStream[0] instanceof RequestWriter.LazyContentStream) {
Long size = contentStream[0].getSize();
postOrPut.setEntity(new InputStreamEntity(contentStream[0].getStream(), size == null ? -1 : size) {
@Override
public Header getContentType() {
return new BasicHeader("Content-Type", contentStream[0].getContentType());
}
@Override
public boolean isRepeatable() {
return false;
}
});
} else {
Long size = contentStream[0].getSize();
postOrPut.setEntity(new InputStreamEntity(contentStream[0].getStream(), size == null ? -1 : size) {
@Override
public Header getContentType() {
return new BasicHeader("Content-Type", contentStream[0].getContentType());
}
@Override
public boolean isRepeatable() {
return false;
}
});
}
return postOrPut;
}
}
throw new SolrServerException("Unsupported method: " + request.getMethod());
}
getPath returns the request path, e.g. /update for indexing and /select for queries; parser defaults to BinaryResponseParser, and basePath is the Solr server address, e.g. http://127.0.0.1:8983/solr/testCore. The content streams returned by requestWriter.getContentStreams carry the serialized update request (the default RequestWriter serializes it as XML); the loop over them only checks whether any stream lacks a name. fullQueryUrl is the final url, e.g. http://127.0.0.1:8983/solr/testCore/update?wt=javabin&version=2; an HttpPost or HttpPut is created for that url, its entity is set from the content stream, and the request object is returned.
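To make the url construction concrete, a small sketch of how wparams produces the query-string part (the base url is made up):
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.ModifiableSolrParams;

public class QueryUrlSketch {
  public static void main(String[] args) {
    ModifiableSolrParams wparams = new ModifiableSolrParams();
    wparams.set(CommonParams.WT, "javabin");   // BinaryResponseParser.getWriterType()
    wparams.set(CommonParams.VERSION, "2");    // BinaryResponseParser.getVersion()
    String fullQueryUrl = "http://127.0.0.1:8983/solr/testCore/update" + wparams.toQueryString();
    System.out.println(fullQueryUrl);          // .../update?wt=javabin&version=2
  }
}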
SolrIndexWriter::write->push->HttpSolrClient::request->executeMethod
protected NamedList<Object> executeMethod(HttpRequestBase method, final ResponseParser processor)
throws SolrServerException {
...
}
httpClient defaults to a SystemDefaultHttpClient; its execute function sends the request to the server and returns the response. httpStatus is the returned status code (e.g. 200 or 404), ctHeader the returned Content-Type header and contentType the document type, and EntityUtils' getContentCharSet yields the response encoding. executeMethod finally calls BinaryResponseParser's processResponse to handle the response body.
processResponse in turn calls JavaBinCodec's unmarshal function to decode the javabin-encoded body.
SolrIndexWriter::write->push->HttpSolrClient::request->executeMethod->JavaBinCodec::unmarshal
public Object unmarshal(InputStream is) throws IOException {
FastInputStream dis = FastInputStream.wrap(is);
return readVal(dis);
}
public Object readVal(DataInputInputStream dis) throws IOException {
tagByte = dis.readByte();
switch (tagByte >>> 5) {
case STR >>> 5:
return readStr(dis);
case SINT >>> 5:
return readSmallInt(dis);
case SLONG >>> 5:
return readSmallLong(dis);
case ARR >>> 5:
return readArray(dis);
case ORDERED_MAP >>> 5:
return readOrderedMap(dis);
case NAMED_LST >>> 5:
return readNamedList(dis);
case EXTERN_STRING >>> 5:
return readExternString(dis);
}
switch (tagByte) {
case NULL:
return null;
case DATE:
return new Date(dis.readLong());
case INT:
return dis.readInt();
case BOOL_TRUE:
return Boolean.TRUE;
case BOOL_FALSE:
return Boolean.FALSE;
case FLOAT:
return dis.readFloat();
case DOUBLE:
return dis.readDouble();
case LONG:
return dis.readLong();
case BYTE:
return dis.readByte();
case SHORT:
return dis.readShort();
case MAP:
return readMap(dis);
case SOLRDOC:
return readSolrDocument(dis);
case SOLRDOCLST:
return readSolrDocumentList(dis);
case BYTEARR:
return readByteArray(dis);
case ITERATOR:
return readIterator(dis);
case END:
return END_OBJ;
case SOLRINPUTDOC:
return readSolrInputDocument(dis);
case ENUM_FIELD_VALUE:
return readEnumFieldValue(dis);
case MAP_ENTRY:
return readMapEntry(dis);
}
throw new RuntimeException("Unknown type " + tagByte);
}
public SimpleOrderedMap<Object> readOrderedMap(DataInputInputStream dis) throws IOException {
int sz = readSize(dis);
SimpleOrderedMap<Object> nl = new SimpleOrderedMap<>();
for (int i = 0; i < sz; i++) {
String name = (String) readVal(dis);
Object val = readVal(dis);
nl.add(name, val);
}
return nl;
}
Assuming the top-level tag is ORDERED_MAP, the HTTP response is decoded by readOrderedMap, which reads name/value pairs one by one and returns them as a SimpleOrderedMap.
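For a successful add request the decoded result typically looks like the following nested map (values illustrative):
{responseHeader={status=0,QTime=12}}
i.e. a SimpleOrderedMap whose single responseHeader entry is itself a SimpleOrderedMap holding the status code and the server-side processing time in milliseconds.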