1. Overview
The Fetch step mainly takes URLs from the fetch list, fetches and parses them, and in the process produces the crawl_parse, crawl_fetch, parse_data, and parse_text directories. This article walks through the overall flow of Fetch, focusing on how each of these directories is produced and what it contains. The producer/consumer model inside Fetch is not covered here.
2. Walkthrough
In the fetch() method of the Fetcher class, the job that carries out the fetch operation is configured. The key call is
job.setOutputFormat(GeneralChannelFetcherOutputFormat.class);
which controls how every one of the directories discussed below is produced. (GeneralChannelFetcherOutputFormat is code modified from the original Nutch source.)
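For orientation, here is a rough sketch of what that job setup looks like. It is modeled on the stock Nutch 1.x Fetcher.fetch() with the modified output format swapped in; the method and variable names (runFetchJob, segment, threads) are illustrative, not the project's actual code.

// A sketch of the fetch job setup, modeled on stock Nutch 1.x Fetcher.fetch();
// names here are illustrative, only the output format line is the point.
private void runFetchJob(Path segment, int threads) throws IOException {
  JobConf job = new NutchJob(getConf());
  job.setJobName("fetch " + segment);

  job.setInt("fetcher.threads.fetch", threads);
  job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());

  // input: the fetch list produced by the Generator
  FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.GENERATE_DIR_NAME));
  job.setInputFormat(InputFormat.class);                 // the fetcher's own InputFormat

  job.setMapRunnerClass(GeneralChannelFetcher.class);    // runs the fetch threads

  // output: the segment directory; the output format below decides
  // which sub-directories (crawl_fetch, content, ...) get written
  FileOutputFormat.setOutputPath(job, segment);
  job.setOutputFormat(GeneralChannelFetcherOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(NutchWritable.class);

  JobClient.runJob(job);
}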
The fetching itself is carried out by the run() method of the FetchThread class.
ProtocolOutput output = protocol.getProtocolOutput(fit.url, fit.datum);
ProtocolStatus status = output.getStatus();
Content content = output.getContent();
These lines fetch the raw source of the URL and put the result into a Content object. Next, the appropriate action is taken according to the status information.
switch (status.getCode()) {

  case ProtocolStatus.SUCCESS:        // got a page
    pstatus = output(fit.url, fit.datum, content, status,
                     CrawlDatum.STATUS_FETCH_SUCCESS);
When the status is SUCCESS, the output() method is called first. output() is another important method, so let's take a look at it.
private ParseStatus output(Text key, CrawlDatum datum,
                           Content content, ProtocolStatus pstatus, int status) {

  datum.setStatus(status);
  datum.setFetchTime(System.currentTimeMillis());
  if (pstatus != null) datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);
  // The lines above record the fetch state on the datum (the map value), e.g. the fetch time.

  ParseResult parseResult = null;
  if (content != null) {
    Metadata metadata = content.getMetadata();
    // add segment to metadata
    metadata.set(Nutch.SEGMENT_NAME_KEY, segmentName);
    // add score to content metadata so that ParseSegment can pick it up.
    try {
      scfilters.passScoreBeforeParsing(key, datum, content);
    } catch (Exception e) {
      if (LOG.isWarnEnabled()) {
        e.printStackTrace(LogUtil.getWarnStream(LOG));
        LOG.warn("Couldn't pass score, url " + key + " (" + e + ")");
      }
    }
    /* Note: Fetcher will only follow meta-redirects coming from the
     * original URL. */
    if (parsing && status == CrawlDatum.STATUS_FETCH_SUCCESS) {
      try {
        parseResult = this.parseUtil.parse(content);   // parse the fetched page source
      } catch (Exception e) {
        LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
      }

      if (parseResult == null) {
        byte[] signature =
          SignatureFactory.getSignature(getConf()).calculate(content,
              new ParseStatus().getEmptyParse(conf));
        datum.setSignature(signature);
      }
    }

    /* Store status code in content so we can read this value during
     * parsing (as a separate job) and decide to parse or not.
     */
    content.getMetadata().add(Nutch.FETCH_STATUS_KEY, Integer.toString(status));
  }

  // From here on, the class configured via setOutputFormat comes into play.
  try {
    output.collect(key, new NutchWritable(datum));
    if (content != null && storingContent)
      output.collect(key, new NutchWritable(content));
    if (parseResult != null) {
      for (Entry<Text, Parse> entry : parseResult) {
        Text url = entry.getKey();
        Parse parse = entry.getValue();
        ParseStatus parseStatus = parse.getData().getStatus();

        if (!parseStatus.isSuccess()) {
          LOG.warn("Error parsing: " + key + ": " + parseStatus);
          parse = parseStatus.getEmptyParse(getConf());
        }

        // Calculate page signature. For non-parsing fetchers this will
        // be done in ParseSegment
        byte[] signature =
          SignatureFactory.getSignature(getConf()).calculate(content, parse);
        // Ensure segment name and score are in parseData metadata
        parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY,
            segmentName);
        parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
            StringUtil.toHexString(signature));
        // Pass fetch time to content meta
        parse.getData().getContentMeta().set(Nutch.FETCH_TIME_KEY,
            Long.toString(datum.getFetchTime()));
        if (url.equals(key))
          datum.setSignature(signature);
        try {
          scfilters.passScoreAfterParsing(url, content, parse);
        } catch (Exception e) {
          if (LOG.isWarnEnabled()) {
            e.printStackTrace(LogUtil.getWarnStream(LOG));
            LOG.warn("Couldn't pass score, url " + key + " (" + e + ")");
          }
        }
        output.collect(url, new NutchWritable(
            new ParseImpl(new ParseText(parse.getText()),
                parse.getData(), parse.isCanonical())));
      }
    }
  } catch (IOException e) {
    if (LOG.isFatalEnabled()) {
      e.printStackTrace(LogUtil.getFatalStream(LOG));
      LOG.fatal("fetcher caught:" + e.toString());
    }
  }
The calls to output.collect() are what trigger the generation of the corresponding files. Let's now look at what GeneralChannelFetcherOutputFormat does.
public RecordWriter<Text, NutchWritable> getRecordWriter(final FileSystem fs,
    final JobConf job,
    final String name,
    final Progressable progress) throws IOException {

  Path out = FileOutputFormat.getOutputPath(job);
  final Path fetch =
    new Path(new Path(out, CrawlDatum.FETCH_DIR_NAME),
        name);   /* crawl_fetch is a MapFile of key-datum pairs:
                    each URL together with its fetch status */
  final Path content =
    new Path(new Path(out, Content.DIR_NAME), name);

  final CompressionType compType =
    SequenceFileOutputFormat.getOutputCompressionType(job);

  final MapFile.Writer fetchOut =
    new MapFile.Writer(job, fs, fetch.toString(), Text.class, CrawlDatum.class,
        compType, progress);
  return new RecordWriter<Text, NutchWritable>() {
    private MapFile.Writer contentOut;
    private RecordWriter<Text, Parse> parseOut;

    {
      if (GeneralChannelFetcher.isStoringContent(job)) {
        contentOut = new MapFile.Writer(job, fs, content.toString(),
            Text.class, Content.class,
            compType, progress);
      }

      if (GeneralChannelFetcher.isParsing(job)) {
        parseOut = new GeneralChannelParseOutputFormat().getRecordWriter(fs, job, name, progress);
      }
    }

    public void write(Text key, NutchWritable value)
        throws IOException {

      Writable w = value.get();

      if (w instanceof CrawlDatum)
        fetchOut.append(key, w);
      else if (w instanceof Content)
        contentOut.append(key, w);
      else if (w instanceof Parse)
        parseOut.write(key, (Parse) w);
    }

    public void close(Reporter reporter) throws IOException {
      fetchOut.close();
      if (contentOut != null) {
        contentOut.close();
      }
      if (parseOut != null) {
        parseOut.close(reporter);
      }
    }
  };
}
}
From this we can see that records are written to different directories depending on the concrete type wrapped in the NutchWritable value.
if (w instanceof CrawlDatum)
  fetchOut.append(key, w);
else if (w instanceof Content)
  contentOut.append(key, w);
else if (w instanceof Parse)
  parseOut.write(key, (Parse) w);
This snippet shows that crawl_fetch holds the value (the CrawlDatum) with its fetch status information, and content holds the raw page source. The remaining files in the segment are produced by another class, GeneralChannelParseOutputFormat.
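As a side note, here is a minimal sketch (not part of the original code) that reads these two directories back to check what was written. The part-00000 file name and the segment layout are assumptions based on the writers above.

// Minimal sketch: inspect crawl_fetch and content of one segment part.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;

public class SegmentInspector {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path segment = new Path(args[0]);   // e.g. a segment directory (hypothetical path)

    // crawl_fetch: Text -> CrawlDatum (URL and its fetch status)
    MapFile.Reader fetchReader =
        new MapFile.Reader(fs, new Path(segment, "crawl_fetch/part-00000").toString(), conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (fetchReader.next(url, datum)) {
      System.out.println(url + "\t" + CrawlDatum.getStatusName(datum.getStatus()));
    }
    fetchReader.close();

    // content: Text -> Content (raw page bytes)
    MapFile.Reader contentReader =
        new MapFile.Reader(fs, new Path(segment, "content/part-00000").toString(), conf);
    Content content = new Content();
    while (contentReader.next(url, content)) {
      System.out.println(url + "\t" + content.getContent().length + " bytes");
    }
    contentReader.close();
  }
}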
Let's take a look at that class next.
public RecordWriter<Text, Parse> getRecordWriter(FileSystem fs, JobConf job,
    String name, Progressable progress) throws IOException {

  this.filters = new URLFilters(job);
  this.normalizers = new URLNormalizers(job, URLNormalizers.SCOPE_OUTLINK);
  this.scfilters = new ScoringFilters(job);
  final int interval = job.getInt("db.fetch.interval.default", 2592000);
  final boolean ignoreExternalLinks = job.getBoolean("db.ignore.external.links", false);
  int maxOutlinksPerPage = job.getInt("db.max.outlinks.per.page", 100);
  final int maxOutlinks = (maxOutlinksPerPage < 0) ? Integer.MAX_VALUE
                                                   : maxOutlinksPerPage;
  final CompressionType compType = SequenceFileOutputFormat.getOutputCompressionType(job);
  Path out = FileOutputFormat.getOutputPath(job);

  Path text = new Path(new Path(out, ParseText.DIR_NAME), name);
  Path data = new Path(new Path(out, ParseData.DIR_NAME), name);
  Path crawl = new Path(new Path(out, CrawlDatum.PARSE_DIR_NAME), name);

  final String[] parseMDtoCrawlDB = job.get("db.parsemeta.to.crawldb", "").split(" *,*");

  final MapFile.Writer textOut =
    new MapFile.Writer(job, fs, text.toString(), Text.class, ParseText.class,
        CompressionType.RECORD, progress);

  final MapFile.Writer dataOut =
    new MapFile.Writer(job, fs, data.toString(), Text.class, ParseData.class,
        compType, progress);

  final SequenceFile.Writer crawlOut =
    SequenceFile.createWriter(fs, job, crawl, Text.class, CrawlDatum.class,
        compType, progress);

  return new RecordWriter<Text, Parse>() {

    public void write(Text key, Parse parse)
        throws IOException {

      // common second-level domain labels (used for domain extraction below)
      String[] secondleveldoamin = new String[]{"org", "com", "edu", "net", "ac", "gov"};
      String fromUrl = key.toString();
      String fromHost = null;
      String toHost = null;
      String fromdomain = null;
      String todomain = null;
      textOut.append(key, new ParseText(parse.getText()));

      ParseData parseData = parse.getData();
      // recover the signature prepared by Fetcher or ParseSegment
      String sig = parseData.getContentMeta().get(Nutch.SIGNATURE_KEY);
      if (sig != null) {
        byte[] signature = StringUtil.fromHexString(sig);
        if (signature != null) {
          // append a CrawlDatum with a signature
          CrawlDatum d = new CrawlDatum(CrawlDatum.STATUS_SIGNATURE, 0);
          d.setSignature(signature);
          crawlOut.append(key, d);
        }
      }

      // see if the parse metadata contain things that we'd like
      // to pass to the metadata of the crawlDB entry
      CrawlDatum parseMDCrawlDatum = null;
      for (String mdname : parseMDtoCrawlDB) {
        String mdvalue = parse.getData().getParseMeta().get(mdname);
        if (mdvalue != null) {
          if (parseMDCrawlDatum == null) parseMDCrawlDatum = new CrawlDatum(
              CrawlDatum.STATUS_PARSE_META, 0);
          parseMDCrawlDatum.getMetaData().put(new Text(mdname),
              new Text(mdvalue));
        }
      }
      if (parseMDCrawlDatum != null) crawlOut.append(key, parseMDCrawlDatum);

      try {
        ParseStatus pstatus = parseData.getStatus();
        if (pstatus != null && pstatus.isSuccess() &&
            pstatus.getMinorCode() == ParseStatus.SUCCESS_REDIRECT) {
          String newUrl = pstatus.getMessage();
          int refreshTime = Integer.valueOf(pstatus.getArgs()[1]);
          try {
            newUrl = normalizers.normalize(newUrl,
                URLNormalizers.SCOPE_FETCHER);
          } catch (MalformedURLException mfue) {
            newUrl = null;
          }
          if (newUrl != null) newUrl = filters.filter(newUrl);
          String url = key.toString();
          if (newUrl != null && !newUrl.equals(url)) {
            String reprUrl =
              URLUtil.chooseRepr(url, newUrl,
                  refreshTime < Fetcher.PERM_REFRESH_TIME);
            CrawlDatum newDatum = new CrawlDatum();
            newDatum.setStatus(CrawlDatum.STATUS_LINKED);
            if (reprUrl != null && !reprUrl.equals(newUrl)) {
              newDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY,
                  new Text(reprUrl));
            }
            crawlOut.append(new Text(newUrl), newDatum);
          }
        }
      } catch (URLFilterException e) {
        // ignore
      }

      // collect outlinks for subsequent db update
      Outlink[] links = parseData.getOutlinks();
      int outlinksToStore = Math.min(maxOutlinks, links.length);
      if (ignoreExternalLinks) {
        try { /* Modified here: external links are filtered by domain,
               * i.e. only outlinks with the same domain as the source URL
               * are kept; stock Nutch keeps outlinks with the same host. */
          fromHost = new URL(fromUrl).getHost().toLowerCase();
          String[] fromHosts = fromHost.split("\\.");
          int i = fromHosts.length - 1;
          if (fromHosts[i].equals("cn")) {
            for (int j = 0; j < secondleveldoamin.length; j++) {
              if (fromHosts[i - 1].equals(secondleveldoamin[j])) {
                fromdomain = fromHosts[i - 2];
                break;
              } else {
                continue;
              }
            }
            if (fromdomain == null)
              fromdomain = fromHosts[i - 1];
          }
          if (fromHosts[i].equals("org") || fromHosts[i].equals("com")
              || fromHosts[i].equals("net"))
            fromdomain = fromHosts[i - 1];
        } catch (MalformedURLException e) {
          fromHost = null;
        }
      } else {
        fromHost = null;
      }

      int validCount = 0;
      CrawlDatum adjust = null;
      List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
      List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
      for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
        String toUrl = links[i].getToUrl();
        // ignore links to self (or anchors within the page)
        if (fromUrl.equals(toUrl)) {
          continue;
        }
        if (ignoreExternalLinks) {
          try {
            toHost = new URL(toUrl).getHost().toLowerCase();
            String[] toHosts = toHost.split("\\.");
            int k = toHosts.length - 1;
            if (toHosts[k].equals("cn")) {
              for (int j = 0; j < secondleveldoamin.length; j++) {
                if (toHosts[k - 1].equals(secondleveldoamin[j])) {
                  todomain = toHosts[k - 2];
                  break;
                } else {
                  continue;
                }
              }
              if (todomain == null)
                todomain = toHosts[k - 1];
            }
            if (toHosts[k].equals("org") || toHosts[k].equals("com")
                || toHosts[k].equals("net"))
              todomain = toHosts[k - 1];
          } catch (MalformedURLException e) {
            toHost = null;
          }
          if (todomain == null || !todomain.equals(fromdomain)) { // external links
            continue;                                             // skip it
          }
          // if (toHost == null || !toHost.equals(fromHost)) {
          //   continue;
          // }
        }
        try {
          toUrl = normalizers.normalize(toUrl,
              URLNormalizers.SCOPE_OUTLINK);   // normalize the url
          toUrl = filters.filter(toUrl);       // filter the url
          if (toUrl == null) {
            continue;
          }
        } catch (Exception e) {
          continue;
        }
        CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
        Text targetUrl = new Text(toUrl);
        try {
          scfilters.initialScore(targetUrl, target);
        } catch (ScoringFilterException e) {
          LOG.warn("Cannot filter init score for url " + key +
              ", using default: " + e.getMessage());
          target.setScore(0.0f);
        }
        targets.add(new SimpleEntry(targetUrl, target));
        outlinkList.add(links[i]);
        validCount++;
      }

      try {
        // compute score contributions and adjustment to the original score
        adjust = scfilters.distributeScoreToOutlinks((Text) key, parseData,
            targets, null, links.length);
      } catch (ScoringFilterException e) {
        LOG.warn("Cannot distribute score from " + key + ": " + e.getMessage());
      }
      for (Entry<Text, CrawlDatum> target : targets) {
        crawlOut.append(target.getKey(), target.getValue());
      }
      if (adjust != null) crawlOut.append(key, adjust);

      Outlink[] filteredLinks = outlinkList.toArray(new Outlink[outlinkList.size()]);
      parseData = new ParseData(parseData.getStatus(), parseData.getTitle(),
          filteredLinks, parseData.getContentMeta(),
          parseData.getParseMeta());
      dataOut.append(key, parseData);

      if (!parse.isCanonical()) {
        CrawlDatum datum = new CrawlDatum();
        datum.setStatus(CrawlDatum.STATUS_FETCH_SUCCESS);
        String timeString = parse.getData().getContentMeta().get(Nutch.FETCH_TIME_KEY);
        try {
          datum.setFetchTime(Long.parseLong(timeString));
        } catch (Exception e) {
          LOG.warn("Can't read fetch time for: " + key);
          datum.setFetchTime(System.currentTimeMillis());
        }
        crawlOut.append(key, datum);
      }
    }

    public void close(Reporter reporter) throws IOException {
      textOut.close();
      dataOut.close();
      crawlOut.close();
    }
  };
}
}
Broadly speaking, this class does the following. It produces the crawl_parse, parse_text, and parse_data directories. parse_text contains the text extracted from each page. The most important content of crawl_parse is the outlink information taken from ParseData and written out as formatted entries; these outlinks are marked with CrawlDatum.STATUS_LINKED.
crawl_parse also contains some other records, but if all you need are the outlinks, selecting the entries marked LINKED is enough, as sketched below.
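Here is a minimal sketch of doing exactly that: it scans a crawl_parse SequenceFile and keeps only the entries whose status is CrawlDatum.STATUS_LINKED. The class name and the path argument are mine, not the project's.

// Minimal sketch: extract outlinks from crawl_parse by filtering on STATUS_LINKED.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class OutlinkDumper {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // e.g. <segment>/crawl_parse/part-00000 (hypothetical path)
    Path crawlParse = new Path(args[0]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, crawlParse, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      if (datum.getStatus() == CrawlDatum.STATUS_LINKED) {
        System.out.println(url);   // an outlink discovered during parsing
      }
    }
    reader.close();
  }
}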
There is also a configurable switch in this code, ignoreExternalLinks (the db.ignore.external.links property). This boolean decides whether outlinks that point outside the source site are discarded. The collected outlinks are used to update the crawldb; you can additionally set db.update.additions.allowed to control whether those outlinks are actually added to the crawldb.
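Purely as an illustration, the two switches could also be set programmatically like this (normally they live in conf/nutch-site.xml); NutchConfiguration.create() is the standard Nutch helper assumed here.

// Illustrative only: toggling the two properties discussed above.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class LinkSettings {
  public static Configuration configure() {
    Configuration conf = NutchConfiguration.create();
    conf.setBoolean("db.ignore.external.links", true);      // keep only same-site outlinks
    conf.setBoolean("db.update.additions.allowed", true);   // allow new outlinks into the crawldb on updatedb
    return conf;
  }
}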
When ignoreExternalLinks is set to true, you can change the selection rule to pick exactly the outlinks you want. Stock Nutch keeps outlinks with the same host as the source page; the code above keeps outlinks with the same domain.
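To make that rule concrete, here is a small self-contained sketch of the domain extraction that the code above performs inline; the class and the helper name getDomain are mine, not the project's.

// Standalone sketch of the inline domain-extraction rule used above.
import java.net.MalformedURLException;
import java.net.URL;

public class DomainRule {
  private static final String[] SECOND_LEVEL = {"org", "com", "edu", "net", "ac", "gov"};

  // Returns the registrable label of a URL's host, e.g. "example" for
  // www.example.com or "tsinghua" for www.tsinghua.edu.cn; null if unknown.
  static String getDomain(String url) {
    try {
      String[] labels = new URL(url).getHost().toLowerCase().split("\\.");
      int i = labels.length - 1;
      if (labels[i].equals("cn")) {
        for (String sld : SECOND_LEVEL) {
          if (labels[i - 1].equals(sld)) {
            return labels[i - 2];          // e.g. foo.edu.cn -> "foo"
          }
        }
        return labels[i - 1];              // e.g. foo.cn -> "foo"
      }
      if (labels[i].equals("org") || labels[i].equals("com") || labels[i].equals("net")) {
        return labels[i - 1];              // e.g. foo.com -> "foo"
      }
      return null;
    } catch (MalformedURLException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // an outlink is kept only when the two domains match
    System.out.println(getDomain("http://www.example.com/a").equals(
                       getDomain("http://news.example.com/b")));   // true
  }
}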
Once the outlinks have been selected, a new ParseData is constructed with the array of selected outlinks and written out, which produces the parse_data directory.
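Finally, a minimal sketch (path and class name assumed) of reading parse_data back and listing the filtered outlinks that ended up in each ParseData record:

// Minimal sketch: dump titles and filtered outlinks from a parse_data MapFile.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;

public class ParseDataDumper {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // e.g. <segment>/parse_data/part-00000 (hypothetical part name)
    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);

    Text url = new Text();
    ParseData data = new ParseData();
    while (reader.next(url, data)) {
      System.out.println(url + " -> " + data.getTitle());
      for (Outlink link : data.getOutlinks()) {   // only the filtered outlinks remain
        System.out.println("  outlink: " + link.getToUrl());
      }
    }
    reader.close();
  }
}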