1. Adapting the code
I have been using Nutch 1.0 for a while and wanted to work it into my own full-text search application. Nutch itself ships only with shell scripts for running on Linux, and all those scripts really do is set up the classpath and pick which class to invoke. So to call Nutch from your own program, you only need to rework the entry points of a few of its main classes.
For example, the class responsible for crawling is org.apache.nutch.crawl.Crawl, which exposes nothing but a main method, so we obviously cannot call it directly. The fix is simply to write a static method that accepts the same parameters as main.
Here is my example.
Add a static method to org.apache.nutch.crawl.Crawl:
/**
 * //20100104lqt
 * @param realPath directory containing the seed URL file
 * @param urltxt   name of the seed URL file
 * @param pdir     crawl output directory (defaults to crawl-&lt;date&gt;)
 * @param pdepth   crawl depth (defaults to 5)
 * @param pthreads number of fetcher threads (defaults to fetcher.threads.fetch)
 * @param ptopN    max URLs to fetch per round (defaults to unlimited)
 */
public static void run(String realPath, String urltxt, String pdir,
    String pdepth, String pthreads, String ptopN) throws IOException {
  Configuration conf = NutchConfiguration.create();
  conf.addResource("crawl-tool.xml");
  JobConf job = new NutchJob(conf);

  Path rootUrlDir = null;
  Path dir = new Path("crawl-" + getDate());
  int threads = job.getInt("fetcher.threads.fetch", 10);
  int depth = 5;
  long topN = Long.MAX_VALUE;

  if (urltxt != null)
    rootUrlDir = new Path(realPath, urltxt); // resolve the seed file under realPath
  if (pdir != null)
    dir = new Path(pdir);
  if (pdepth != null)
    depth = Integer.parseInt(pdepth);
  if (pthreads != null)
    threads = Integer.parseInt(pthreads);
  if (ptopN != null)
    topN = Long.parseLong(ptopN);

  FileSystem fs = FileSystem.get(job);

  if (LOG.isInfoEnabled()) {
    LOG.info("crawl started in: " + dir);
    LOG.info("rootUrlDir = " + rootUrlDir);
    LOG.info("threads = " + threads);
    LOG.info("depth = " + depth);
    if (topN != Long.MAX_VALUE)
      LOG.info("topN = " + topN);
  }

  Path crawlDb = new Path(dir + "/crawldb");
  Path linkDb = new Path(dir + "/linkdb");
  Path segments = new Path(dir + "/segments");
  Path indexes = new Path(dir + "/indexes");
  Path index = new Path(dir + "/index");
  Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + getDate());

  Injector injector = new Injector(conf);
  Generator generator = new Generator(conf);
  Fetcher fetcher = new Fetcher(conf);
  ParseSegment parseSegment = new ParseSegment(conf);
  CrawlDb crawlDbTool = new CrawlDb(conf);
  LinkDb linkDbTool = new LinkDb(conf);
  Indexer indexer = new Indexer(conf);
  DeleteDuplicates dedup = new DeleteDuplicates(conf);
  IndexMerger merger = new IndexMerger(conf);

  // initialize crawlDb
  injector.inject(crawlDb, rootUrlDir);
  int i;
  for (i = 0; i < depth; i++) {
    // generate new segment
    Path segment = generator.generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis());
    if (segment == null) {
      LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
      break;
    }
    // fetch it
    fetcher.fetch(segment, threads,
        org.apache.nutch.fetcher.Fetcher.isParsing(conf));
    if (!Fetcher.isParsing(job)) {
      parseSegment.parse(segment); // parse it, if needed
    }
    // update crawldb
    crawlDbTool.update(crawlDb, new Path[] { segment }, true, true);
  }
  if (i > 0) {
    linkDbTool.invert(linkDb, segments, true, true, false); // invert links

    if (indexes != null) {
      // Delete old indexes
      if (fs.exists(indexes)) {
        LOG.info("Deleting old indexes: " + indexes);
        fs.delete(indexes, true);
      }
      // Delete old index
      if (fs.exists(index)) {
        LOG.info("Deleting old merged index: " + index);
        fs.delete(index, true);
      }
    }

    // index, dedup & merge
    FileStatus[] fstats = fs.listStatus(segments,
        HadoopFSUtil.getPassDirectoriesFilter(fs));
    indexer.index(indexes, crawlDb, linkDb,
        Arrays.asList(HadoopFSUtil.getPaths(fstats)));
    if (indexes != null) {
      dedup.dedup(new Path[] { indexes });
      fstats = fs.listStatus(indexes,
          HadoopFSUtil.getPassDirectoriesFilter(fs));
      merger.merge(HadoopFSUtil.getPaths(fstats), index, tmpDir);
    }
  } else {
    LOG.warn("No URLs to fetch - check your seed list and URL filters.");
  }
  if (LOG.isInfoEnabled()) {
    LOG.info("crawl finished: " + dir);
  }
}
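If the original command-line entry point should keep working, main can be reduced to a thin wrapper over the new method. A minimal sketch; the positional argument handling below is illustrative only, not Nutch's original flag parsing:

public static void main(String args[]) throws Exception {
  if (args.length < 2) {
    System.err.println("Usage: Crawl <seedDir> <seedFile> [dir] [depth] [threads] [topN]");
    return;
  }
  // positional arguments instead of the original -dir/-depth/-threads/-topN flags
  Crawl.run(args[0], args[1],
      args.length > 2 ? args[2] : null,
      args.length > 3 ? args[3] : null,
      args.length > 4 ? args[4] : null,
      args.length > 5 ? args[5] : null);
}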
2. Calling it
I call it from a JSP:
try {
    Crawl.run("c:\\testproject", "url.txt", "C:\\crawled", "3", "4", null);
} catch (Exception e) {
    e.printStackTrace();
}
The parameters can be passed in from the JSP.
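For instance, the values can come straight off the request instead of being hardcoded. A quick sketch; the request parameter names here are my own invention:

<%
// hypothetical parameters, e.g. crawl.jsp?dir=C:\crawled&depth=3&threads=4
try {
    String dir     = request.getParameter("dir");
    String depth   = request.getParameter("depth");
    String threads = request.getParameter("threads");
    Crawl.run("c:\\testproject", "url.txt", dir, depth, threads, null);
} catch (Exception e) {
    e.printStackTrace();
}
%>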
The other Nutch commands can be adapted in exactly the same way, and with that Nutch can be embedded in your own application.
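For example, the inject step on its own could be wrapped following the same pattern as Crawl.run above. A minimal sketch; the method name and parameters are my own:

// hypothetical wrapper around the injection step used inside Crawl.run
public static void runInject(String crawlDbPath, String urlDirPath)
    throws IOException {
  Configuration conf = NutchConfiguration.create();
  conf.addResource("crawl-tool.xml");
  Injector injector = new Injector(conf);
  // the same call Crawl.run uses to initialize the crawldb
  injector.inject(new Path(crawlDbPath), new Path(urlDirPath));
}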
Use JDK 1.6. Put Nutch's plugins directory under the source folder of the B/S (web) project, put all the configuration files from conf straight into the src directory, put the jars from lib into the lib directory under the web project's webroot, and put nutch-1.0.job under src as well.
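The resulting layout looks roughly like this. A sketch assuming an Eclipse/MyEclipse-style web project; the folder names MyProject and WebRoot are placeholders, and in a standard web app the jars end up under WEB-INF/lib:

MyProject/
    src/
        nutch-site.xml, crawl-tool.xml, ...   (everything from conf/)
        nutch-1.0.job
        plugins/                              (Nutch's plugins directory)
    WebRoot/
        WEB-INF/
            lib/                              (jars from Nutch's lib/)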
If you still hit exception-style compile errors after adding all the jars and setting the compiler to JDK 1.6, the JDK 1.6 class library has probably not been added to the project's build path.
[Figure: Nutch's original directory layout]
[Figure: where the files sit after being added to the B/S project]
[Figure: adding the JDK 1.6 class library]