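As the Crawl class in the previous post showed, a crawl proceeds one stage at a time. So we start with the Injector (org.apache.nutch.crawl.Injector):

    // initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);

The code makes it obvious that Nutch is built on top of Hadoop, although it uses the old API (org.apache.hadoop.mapred).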
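Main functions of the Injector:
1. Normalize and filter the URLs in the seed URL files, and store the result in a temporary directory.
2. Merge that result with the existing crawldb/current to produce a new crawldb, which replaces the old one.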
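The two steps above can be sketched in plain Java without Hadoop. This is only an illustration: the class and method names are hypothetical, the "http/https only" rule stands in for whatever URL filters are configured, and the choice that existing entries win over injected ones is an assumption, since the InjectReducer source is not shown in this post.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InjectSketch {

    /** Step 1: normalize one seed line; return null when the URL is rejected. */
    static String normalizeAndFilter(String line) {
        String url = line.trim();
        if (url.isEmpty()) {
            return null; // nothing to inject
        }
        try {
            URI uri = new URI(url).normalize(); // removes "." and ".." path segments
            String scheme = uri.getScheme();
            // Hypothetical filter rule: accept only http and https URLs.
            boolean accepted = scheme != null
                && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
            return accepted ? uri.toString() : null;
        } catch (URISyntaxException e) {
            return null; // malformed URLs are dropped
        }
    }

    /** Step 2: merge the surviving URLs into a copy of the existing db. */
    static Map<String, String> merge(Map<String, String> existingDb, List<String> seedLines) {
        Map<String, String> merged = new TreeMap<>(existingDb);
        for (String line : seedLines) {
            String url = normalizeAndFilter(line);
            if (url != null) {
                // Assumption: an entry that is already in the db is kept as-is.
                merged.putIfAbsent(url, "unfetched");
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> db = new TreeMap<>();
        db.put("http://example.com/", "fetched");
        List<String> seeds = Arrays.asList(
            "http://example.com/", "http://example.org/a/../b", "ftp://example.net/");
        System.out.println(merge(db, seeds));
    }
}
```

In the real Injector each step is a separate MapReduce job, and the values are CrawlDatum objects rather than status strings.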
    public void inject(Path crawlDb, Path urlDir) throws IOException {
      SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
      long start = System.currentTimeMillis();
      if (LOG.isInfoEnabled()) {
        LOG.info("Injector: starting at " + sdf.format(start));
        LOG.info("Injector: crawlDb: " + crawlDb);
        LOG.info("Injector: urlDir: " + urlDir);
      }

      // create a temporary directory for the intermediate MapReduce output
      Path tempDir = new Path(getConf().get("mapred.temp.dir", ".")
          + "/inject-temp-"
          + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

      // map text input file to a <url,CrawlDatum> file
      if (LOG.isInfoEnabled()) {
        LOG.info("Injector: Converting injected urls to crawl db entries.");
      }
      JobConf sortJob = new NutchJob(getConf());
      sortJob.setJobName("inject " + urlDir);
      FileInputFormat.addInputPath(sortJob, urlDir);
      sortJob.setMapperClass(InjectMapper.class);

      FileOutputFormat.setOutputPath(sortJob, tempDir);
      sortJob.setOutputFormat(SequenceFileOutputFormat.class);
      sortJob.setOutputKeyClass(Text.class);
      // the output value type is CrawlDatum.class
      sortJob.setOutputValueClass(CrawlDatum.class);
      sortJob.setLong("injector.current.time", System.currentTimeMillis());

      // submit the job
      RunningJob mapJob = JobClient.runJob(sortJob);

      long urlsInjected = mapJob.getCounters().findCounter("injector", "urls_injected").getValue();
      long urlsFiltered = mapJob.getCounters().findCounter("injector", "urls_filtered").getValue();
      LOG.info("Injector: total number of urls rejected by filters: " + urlsFiltered);
      LOG.info("Injector: total number of urls injected after normalization and filtering: "
          + urlsInjected);

      // merge with existing crawl db
      if (LOG.isInfoEnabled()) {
        LOG.info("Injector: Merging injected urls into crawl db.");
      }
      JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
      FileInputFormat.addInputPath(mergeJob, tempDir);
      mergeJob.setReducerClass(InjectReducer.class);
      JobClient.runJob(mergeJob);
      CrawlDb.install(mergeJob, crawlDb);

      // clean up: delete the temporary directory
      FileSystem fs = FileSystem.get(getConf());
      fs.delete(tempDir, true);

      long end = System.currentTimeMillis();
      LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: "
          + TimingUtil.elapsedTime(start, end));
    }
Let's first look at InjectMapper, which processes the URL files. Its configure method sets up everything the mapper needs:
    public void configure(JobConf job) {
      this.jobConf = job;
      // initialize the URL normalizers
      urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_INJECT);
      interval = jobConf.getInt("db.fetch.interval.default", 2592000);
      // initialize the URL filters
      filters = new URLFilters(jobConf);
      // initialize the scoring filters
      scfilters = new ScoringFilters(jobConf);
      // initial score for newly injected URLs
      scoreInjected = jobConf.getFloat("db.score.injected", 1.0f);
      curTime = job.getLong("injector.current.time", System.currentTimeMillis());
    }
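The configure method reads each setting with a fallback default. The sketch below mimics that lookup with java.util.Properties instead of Hadoop's JobConf (the helper getInt here is a hypothetical stand-in, not the Hadoop method), and shows that the default fetch interval of 2592000 seconds corresponds to 30 days.

```java
import java.time.Duration;
import java.util.Properties;

final class ConfDefaultsSketch {

    /** Hypothetical stand-in for a "get int with default" configuration lookup. */
    static int getInt(Properties conf, String key, int defaultValue) {
        String raw = conf.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    /** Converts a fetch interval given in seconds to whole days. */
    static long intervalDays(int seconds) {
        return Duration.ofSeconds(seconds).toDays();
    }

    public static void main(String[] args) {
        Properties conf = new Properties(); // nothing configured, so defaults apply
        int interval = getInt(conf, "db.fetch.interval.default", 2592000);
        System.out.println(interval + " s = " + intervalDays(interval) + " days");

        conf.setProperty("db.fetch.interval.default", "86400"); // override: one day
        System.out.println(getInt(conf, "db.fetch.interval.default", 2592000) + " s");
    }
}
```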
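The findExtensions method loads the URL normalizer strategy. Different scopes can be configured with different strategies: the scope-specific keys urlnormalizer.order.<scope> and urlnormalizer.scope.<scope> are checked first, and the order falls back to the global urlnormalizer.order.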
    /**
     * searches a list of suitable url normalizer plugins for the given scope.
     *
     * @param scope
     *          Scope for which we seek a url normalizer plugin.
     * @return List - List of extensions to be used for this scope. If none,
     *         returns null.
     * @throws PluginRuntimeException
     */
    private List<Extension> findExtensions(String scope) {
      String[] orders = null;
      String orderlist = conf.get("urlnormalizer.order." + scope);
      if (orderlist == null)
        orderlist = conf.get("urlnormalizer.order");
      if (orderlist != null && !orderlist.trim().equals("")) {
        orders = orderlist.trim().split("\\s+");
      }
      String scopelist = conf.get("urlnormalizer.scope." + scope);
      Set<String> impls = null;
      if (scopelist != null && !scopelist.trim().equals("")) {
        String[] names = scopelist.split("\\s+");
        impls = new HashSet<String>(Arrays.asList(names));
      }
      Extension[] extensions = this.extensionPoint.getExtensions();
      HashMap<String, Extension> normalizerExtensions = new HashMap<String, Extension>();
      for (int i = 0; i < extensions.length; i++) {
        Extension extension = extensions[i];
        if (impls != null && !impls.contains(extension.getClazz()))
          continue;
        normalizerExtensions.put(extension.getClazz(), extension);
      }
      List<Extension> res = new ArrayList<Extension>();
      if (orders == null) {
        res.addAll(normalizerExtensions.values());
      } else {
        // first add those explicitly named in correct order
        for (int i = 0; i < orders.length; i++) {
          Extension e = normalizerExtensions.get(orders[i]);
          if (e != null) {
            res.add(e);
            normalizerExtensions.remove(orders[i]);
          }
        }
        // then add all others in random order
        res.addAll(normalizerExtensions.values());
      }
      return res;
    }
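To make the selection and ordering logic of findExtensions easier to follow, here is a minimal standalone sketch that works on class-name strings instead of Extension objects. The class name, method name, and the example normalizer names are hypothetical. One deliberate difference: it uses a LinkedHashSet, so the "all others" tail keeps registration order, whereas the HashMap in the original gives no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NormalizerOrderSketch {

    /**
     * @param available class names of all registered normalizers
     * @param scopeList whitespace-separated names allowed in this scope, or null for all
     * @param orderList whitespace-separated names in the desired order, or null for none
     */
    static List<String> resolve(List<String> available, String scopeList, String orderList) {
        Set<String> allowed = null;
        if (scopeList != null && !scopeList.trim().isEmpty()) {
            allowed = new HashSet<>(Arrays.asList(scopeList.trim().split("\\s+")));
        }
        // Keep only the normalizers that are allowed in this scope.
        Set<String> candidates = new LinkedHashSet<>();
        for (String name : available) {
            if (allowed == null || allowed.contains(name)) {
                candidates.add(name);
            }
        }
        List<String> result = new ArrayList<>();
        if (orderList != null && !orderList.trim().isEmpty()) {
            // First the explicitly ordered ones; names that are not active are skipped.
            for (String name : orderList.trim().split("\\s+")) {
                if (candidates.remove(name)) {
                    result.add(name);
                }
            }
        }
        // Then everything that was not mentioned in the order list.
        result.addAll(candidates);
        return result;
    }

    public static void main(String[] args) {
        List<String> available = Arrays.asList("x.Regex", "x.Basic", "x.Pass");
        System.out.println(resolve(available, null, "x.Basic x.Regex x.Missing"));
        System.out.println(resolve(available, "x.Basic x.Pass", null));
    }
}
```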
The related urlnormalizer configuration properties:
    <!-- URL normalizer properties -->
    <property>
      <name>urlnormalizer.order</name>
      <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
      <description>Order in which normalizers will run. If any of these isn't
      activated it will be silently skipped. If other normalizers not on the
      list are activated, they will run in random order after the ones
      specified here are run.
      </description>
    </property>

    <property>
      <name>urlnormalizer.regex.file</name>
      <value>regex-normalize.xml</value>
      <description>Name of the config file used by the RegexUrlNormalizer class.
      </description>
    </property>

    <property>
      <name>urlnormalizer.loop.count</name>
      <value>1</value>
      <description>Optionally loop through normalizers several times, to make
      sure that all transformations have been performed.
      </description>
    </property>
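The effect of urlnormalizer.loop.count is easiest to see with a normalizer that does not finish its work in one pass. The following sketch is an assumption about how such a loop could look, not the Nutch implementation: normalizers are modelled as UnaryOperator<String>, and the slash-collapsing rule is a made-up example that fixes only one "//" per pass.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.UnaryOperator;

final class NormalizerLoopSketch {

    /** Applies all normalizers in order, up to loopCount times, stopping early at a fixed point. */
    static String normalize(String url, List<UnaryOperator<String>> normalizers, int loopCount) {
        String current = url;
        for (int pass = 0; pass < loopCount; pass++) {
            String before = current;
            for (UnaryOperator<String> normalizer : normalizers) {
                if (current == null) {
                    return null; // a normalizer rejected the URL
                }
                current = normalizer.apply(current);
            }
            if (Objects.equals(before, current)) {
                break; // nothing changed in this pass, further passes are pointless
            }
        }
        return current;
    }

    /** Made-up normalizer: collapses only the first "//" that is not part of "://". */
    static final UnaryOperator<String> COLLAPSE_ONE_SLASH =
        s -> s.replaceFirst("(?<!:)//", "/");

    public static void main(String[] args) {
        List<UnaryOperator<String>> chain = Arrays.asList(String::trim, COLLAPSE_ONE_SLASH);
        String url = "http://example.com/a//b//c";
        System.out.println(normalize(url, chain, 1)); // one pass leaves a "//" behind
        System.out.println(normalize(url, chain, 3)); // more passes finish the job
    }
}
```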
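The URL filters are initialized next. As the code shows, the extension point is obtained from the plugin repository (PluginRepository):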
    ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(
        URLFilter.X_POINT_ID);
    if (point == null)
      throw new RuntimeException(URLFilter.X_POINT_ID + " not found.");
    /**
     * @return a cached instance of the plugin repository
     */
    public static synchronized PluginRepository get(Configuration conf) {
      String uuid = NutchConfiguration.getUUID(conf);
      if (uuid == null) {
        uuid = "nonNutchConf@" + conf.hashCode(); // fallback
      }
      PluginRepository result = CACHE.get(uuid);
      // not cached yet: create it and remember it
      if (result == null) {
        result = new PluginRepository(conf);
        CACHE.put(uuid, result);
      }
      return result;
    }
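The get method above is a per-configuration cache: one repository per configuration UUID, created lazily under a lock. A minimal sketch of the same pattern, with a hypothetical class that is keyed by a plain string instead of a Configuration, might look like this:

```java
import java.util.HashMap;
import java.util.Map;

final class RepositoryCacheSketch {

    private static final Map<String, RepositoryCacheSketch> CACHE = new HashMap<>();

    final String key;

    private RepositoryCacheSketch(String key) {
        this.key = key; // in the real class, the expensive plugin scan happens here
    }

    /** Returns the cached instance for this key, creating it on first use. */
    static synchronized RepositoryCacheSketch get(String key) {
        RepositoryCacheSketch result = CACHE.get(key);
        if (result == null) {
            result = new RepositoryCacheSketch(key);
            CACHE.put(key, result);
        }
        return result;
    }

    public static void main(String[] args) {
        RepositoryCacheSketch a = get("conf-A");
        RepositoryCacheSketch b = get("conf-A");
        RepositoryCacheSketch c = get("conf-B");
        System.out.println((a == b) + " " + (a == c)); // same key shares, other key does not
    }
}
```

Because get is synchronized, two threads asking for the same key at the same time cannot both construct a repository.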
    public PluginRepository(Configuration conf) throws RuntimeException {
      // initialize the map of activated plugins
      fActivatedPlugins = new HashMap<String, Plugin>();
      // initialize the map of extension points
      fExtensionPoints = new HashMap<String, ExtensionPoint>();
      this.conf = conf;
      // read config: whether plugins are activated automatically
      this.auto = conf.getBoolean("plugin.auto-activation", true);
      // read config: the directories where plugins are stored
      String[] pluginFolders = conf.getStrings("plugin.folders");
      // helper that walks the plugin directories and finds plugin.xml
      // (each plugin has exactly one plugin.xml), then builds the map of
      // PluginDescriptor objects from those manifests
      PluginManifestParser manifestParser = new PluginManifestParser(conf, this);
      Map<String, PluginDescriptor> allPlugins = manifestParser
          .parsePluginFolder(pluginFolders);
      // regular expression for plugins to exclude
      Pattern excludes = Pattern.compile(conf.get("plugin.excludes", ""));
      // regular expression for plugins to include
      Pattern includes = Pattern.compile(conf.get("plugin.includes", ""));
      // filter out the plugins that do not apply
      Map<String, PluginDescriptor> filteredPlugins = filter(excludes, includes,
          allPlugins);
      // check the dependencies between plugins
      fRegisteredPlugins = getDependencyCheckedPlugins(filteredPlugins,
          this.auto ? allPlugins : filteredPlugins);
      // install the extension points
      installExtensionPoints(fRegisteredPlugins);
      try {
        installExtensions(fRegisteredPlugins);
      } catch (PluginRuntimeException e) {
        LOG.error(e.toString());
        throw new RuntimeException(e.getMessage());
      }
      displayStatus();
    }
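The body of filter(excludes, includes, allPlugins) is not shown in this post. The sketch below is an assumption about what such a filter could do: a plugin id is kept when it fully matches the include pattern and does not fully match the exclude pattern. The class name and the pattern values are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

final class PluginFilterSketch {

    /** Keeps ids that match includes and do not match excludes (whole-string match, an assumption). */
    static List<String> filter(Pattern excludes, Pattern includes, Collection<String> pluginIds) {
        List<String> kept = new ArrayList<>();
        for (String id : pluginIds) {
            if (!includes.matcher(id).matches()) {
                continue; // not on the include list
            }
            if (excludes.matcher(id).matches()) {
                continue; // explicitly excluded
            }
            kept.add(id);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList(
            "protocol-http", "protocol-ftp", "urlfilter-regex", "parse-html", "parse-tika");
        // Illustrative values for plugin.includes and plugin.excludes.
        Pattern includes = Pattern.compile("protocol-http|urlfilter-regex|parse-(html|tika)");
        Pattern excludes = Pattern.compile("parse-tika");
        System.out.println(filter(excludes, includes, all));
    }
}
```

Under this whole-string assumption, an empty include pattern (the fallback default in the constructor above) would match no plugin id at all, so plugin.includes has to be set for anything to load.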
    /**
     * Returns a list of all found plugin descriptors.
     *
     * @param pluginFolders
     *          folders to search plugins from
     * @return A {@link Map} of all found {@link PluginDescriptor}s.
     */
    public Map<String, PluginDescriptor> parsePluginFolder(String[] pluginFolders) {
      Map<String, PluginDescriptor> map = new HashMap<String, PluginDescriptor>();

      if (pluginFolders == null) {
        throw new IllegalArgumentException("plugin.folders is not defined");
      }

      for (String name : pluginFolders) {
        File directory = getPluginFolder(name);
        if (directory == null) {
          continue;
        }
        LOG.info("Plugins: looking in: " + directory.getAbsolutePath());
        for (File oneSubFolder : directory.listFiles()) {
          if (oneSubFolder.isDirectory()) {
            String manifestPath = oneSubFolder.getAbsolutePath() + File.separator
                + "plugin.xml";
            try {
              LOG.debug("parsing: " + manifestPath);
              PluginDescriptor p = parseManifestFile(manifestPath);
              map.put(p.getPluginId(), p);
            } catch (MalformedURLException e) {
              LOG.warn(e.toString());
            } catch (SAXException e) {
              LOG.warn(e.toString());
            } catch (IOException e) {
              LOG.warn(e.toString());
            } catch (ParserConfigurationException e) {
              LOG.warn(e.toString());
            }
          }
        }
      }
      return map;
    }
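The directory walk in parsePluginFolder can be tried out without Nutch. The sketch below only locates the manifests and returns the names of the subfolders that contain a plugin.xml; it does not parse them. The class, the folder names, and the demo method are hypothetical. Unlike the original, which attempts to parse every subfolder and logs a warning on failure, this version checks that the manifest exists, and it also guards against listFiles() returning null.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ManifestScanSketch {

    /** Returns the sorted names of direct subfolders of pluginRoot that contain a plugin.xml. */
    static List<String> findManifests(File pluginRoot) {
        List<String> found = new ArrayList<>();
        File[] children = pluginRoot.listFiles();
        if (children == null) {
            return found; // not a directory, or it could not be read
        }
        for (File child : children) {
            if (child.isDirectory() && new File(child, "plugin.xml").isFile()) {
                found.add(child.getName());
            }
        }
        Collections.sort(found); // listFiles() gives no ordering guarantee
        return found;
    }

    /** Builds a throwaway plugin directory with made-up plugin names and scans it. */
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("plugins");
            for (String name : new String[] {"urlfilter-regex", "urlnormalizer-basic"}) {
                Files.createFile(Files.createDirectories(root.resolve(name)).resolve("plugin.xml"));
            }
            Files.createDirectories(root.resolve("no-manifest")); // folder without plugin.xml
            Files.createFile(root.resolve("README.txt"));         // plain file, ignored
            return findManifests(root.toFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```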