Heritrix: "The requested operation cannot be performed on a file with a user-mapped section open"

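A roundup of common Heritrix problems
Why does this error occur? (In the log below, 警告 and 严重 are the WARNING and SEVERE log levels, and the Windows message in parentheses reads "The requested operation cannot be performed on a file with a user-mapped section open.")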

06/14/2007 11:07:38 +0800 警告 org.archive.io.ReplayCharSequenceFactory$MultiByteReplayCharSequence decodeToFile D:\eclipse\workspace\heritrixProject\jobs\163-20070614025526671\scratch\tt13http.ris.UTF-16BE already exists
06/14/2007 11:07:40 +0800 警告 org.archive.io.ReplayCharSequenceFactory$MultiByteReplayCharSequence decodeToFile D:\eclipse\workspace\heritrixProject\jobs\163-20070614025526671\scratch\tt5http.ris.UTF-16BE already exists
06/14/2007 11:07:40 +0800 严重 org.archive.io.ReplayCharSequenceFactory$MultiByteReplayCharSequence deleteFile Deleting D:\eclipse\workspace\heritrixProject\jobs\163-20070614025526671\scratch\tt5http.ris.UTF-16BE because of java.io.FileNotFoundException: D:\eclipse\workspace\heritrixProject\jobs\163-20070614025526671\scratch\tt5http.ris.UTF-16BE (请求的操作无法在使用用户映射区域打开的文件上执行。)
06/14/2007 11:07:40 +0800 严重 org.archive.crawler.extractor.ExtractorHTML extract Failed get of replay char sequence in ToeThread #5: http://mobile.163.com/0011/product/0011000B/product/000D/0BKN/0BKP/0HHY.html
java.io.FileNotFoundException: D:\eclipse\workspace\heritrixProject\jobs\163-20070614025526671\scratch\tt5http.ris.UTF-16BE (请求的操作无法在使用用户映射区域打开的文件上执行。)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
at org.archive.io.ReplayCharSequenceFactory$MultiByteReplayCharSequence.decodeToFile(ReplayCharSequenceFactory.java:868)
at org.archive.io.ReplayCharSequenceFactory$MultiByteReplayCharSequence.decode(ReplayCharSequenceFactory.java:746)
at org.archive.io.ReplayCharSequenceFactory$MultiByteReplayCharSequence.<init>(ReplayCharSequenceFactory.java:682)
at org.archive.io.ReplayCharSequenceFactory$MultiByteReplayCharSequence.<init>(ReplayCharSequenceFactory.java:673)
at org.archive.io.ReplayCharSequenceFactory.getReplayCharSequence(ReplayCharSequenceFactory.java:127)
at org.archive.io.RecordingOutputStream.getReplayCharSequence(RecordingOutputStream.java:460)
at org.archive.io.RecordingInputStream.getReplayCharSequence(RecordingInputStream.java:357)
at org.archive.util.HttpRecorder.getReplayCharSequence(HttpRecorder.java:296)
at org.archive.crawler.extractor.ExtractorHTML.extract(ExtractorHTML.java:497)
at org.archive.crawler.extractor.Extractor.innerProcess(Extractor.java:67)
at org.archive.crawler.framework.Processor.process(Processor.java:103)
at org.archive.crawler.framework.ToeThread.processCrawlUri(ToeThread.java:304)
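at org.archive.crawler.framework.ToeThread.run(ToeThread.java:153)

Cause: on the local machine, the buffer allocated to each Heritrix thread is too small, so the thread has to spill to disk frequently. The Windows message indicates that the scratch file is still memory-mapped when Heritrix tries to write it again. To avoid the problem, increase the value of the recorder-in-buffer-bytes property in settings: the default is 65536 (64 KB), and after raising it to 1048576 (1 MB) the error is no longer reported. Raising the same value directly in order.xml also works.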
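For anyone who would rather script the order.xml change than edit it by hand, here is a minimal sketch using only the JDK's DOM parser. The class `OrderXmlPatch` and its methods are hypothetical helpers, not part of Heritrix, and the sample XML layout (an `integer` element with a `name` attribute inside `controller`) is an assumption, so check it against your own order.xml first. The sketch only changes the value in memory; writing the document back to disk is left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; it is not part of Heritrix.
final class OrderXmlPatch {

    // Parses XML text into a DOM tree, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Sets the text of every element whose name attribute equals settingName
    // and returns how many elements were changed.
    static int setSetting(Document doc, String settingName, String newValue) {
        NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
        int changed = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (settingName.equals(element.getAttribute("name"))) {
                element.setTextContent(newValue);
                changed++;
            }
        }
        return changed;
    }

    // Returns the text of the first element with the given name attribute, or null.
    static String getSetting(Document doc, String settingName) {
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (settingName.equals(element.getAttribute("name"))) {
                return element.getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed layout: the real order.xml may nest or type this element differently.
        String sample = "<crawl-order><controller>"
                + "<integer name=\"recorder-in-buffer-bytes\">65536</integer>"
                + "</controller></crawl-order>";
        Document doc = parse(sample);
        int changed = setSetting(doc, "recorder-in-buffer-bytes", "1048576");
        System.out.println(changed + " element(s) updated, new value: "
                + getSetting(doc, "recorder-in-buffer-bytes"));
    }
}
```

Matching on the `name` attribute instead of the element name keeps the sketch independent of whether the setting is stored as an integer or a long element.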






6/17/2007 09:45:26 +0000 严重 org.archive.crawler.framework.CrawlController initialize On crawl: sohu_news Unable to setup crawl modules
java.lang.ClassCastException: org.archive.crawler.settings.ModuleType cannot be cast to org.archive.crawler.framework.Frontier
        at org.archive.crawler.framework.CrawlController.setupCrawlModules(CrawlController.java:654)
        at org.archive.crawler.framework.CrawlController.initialize(CrawlController.java:377)
        at org.archive.crawler.admin.CrawlJob.setupForCrawlStart(CrawlJob.java:846)
        at org.archive.crawler.admin.CrawlJobHandler.startNextJobInternal(CrawlJobHandler.java:1142)
        at org.archive.crawler.admin.CrawlJobHandler$3.run(CrawlJobHandler.java:1125)
        at java.lang.Thread.run(Unknown Source)
org.archive.crawler.framework.exceptions.InitializationException: On crawl: sohu_news Unable to setup crawl modules
        at org.archive.crawler.framework.CrawlController.initialize(CrawlController.java:383)
        at org.archive.crawler.admin.CrawlJob.setupForCrawlStart(CrawlJob.java:846)
        at org.archive.crawler.admin.CrawlJobHandler.startNextJobInternal(CrawlJobHandler.java:1142)
        at org.archive.crawler.admin.CrawlJobHandler$3.run(CrawlJobHandler.java:1125)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassCastException: org.archive.crawler.settings.ModuleType cannot be cast to org.archive.crawler.framework.Frontier
        at org.archive.crawler.framework.CrawlController.setupCrawlModules(CrawlController.java:654)
        at org.archive.crawler.framework.CrawlController.initialize(CrawlController.java:377)
        ... 4 more

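This error comes in many variants, but in general, whenever you run into a java.lang.ClassCastException at crawl startup, the processor chain has not been configured correctly. In the trace above, the module configured as the frontier is not a Frontier implementation.

For example, putting a Writer where a Prefetcher belongs causes this kind of error.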
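To see why a misplaced module surfaces as a ClassCastException, consider the following sketch. All types here (`Frontier`, `ModuleType`, `SampleFrontier`) are stand-ins written for illustration and are not the Heritrix classes; the assumption is simply that the controller casts whatever module is configured in a slot to the interface that slot requires.

```java
// Stand-in types for illustration only; these are not the Heritrix classes.
final class CastDemo {

    interface Frontier {
        String next();
    }

    // Generic configured module, playing the role of a settings placeholder.
    static class ModuleType {
        final String name;

        ModuleType(String name) {
            this.name = name;
        }
    }

    // A module that really implements the interface the slot requires.
    static class SampleFrontier extends ModuleType implements Frontier {
        SampleFrontier(String name) {
            super(name);
        }

        @Override
        public String next() {
            return "next-uri";
        }
    }

    // Mimics a controller that blindly casts the configured module.
    static Frontier asFrontier(Object configured) {
        return (Frontier) configured; // ClassCastException if the wrong class is configured
    }

    // Checks before casting and reports what was actually configured.
    static String describe(Object configured) {
        String type = configured.getClass().getSimpleName();
        if (configured instanceof Frontier) {
            return "ok: " + type;
        }
        return "misconfigured: " + type + " is not a Frontier";
    }

    public static void main(String[] args) {
        System.out.println(describe(new SampleFrontier("frontier")));
        System.out.println(describe(new ModuleType("frontier")));
        try {
            asFrontier(new ModuleType("frontier"));
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getClass().getName());
        }
    }
}
```

The cast only succeeds when the configured class implements the expected interface, which is why the fix is in the configuration and not in the code.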

Configure the modules strictly as follows:

1. frontier

org.archive.crawler.frontier.BdbFrontier

2. scope

org.archive.crawler.scope.BroadScope

3. Prefetcher

org.archive.crawler.prefetch.Preselector
org.archive.crawler.prefetch.PreconditionEnforcer

4. Fetcher

org.archive.crawler.fetcher.FetchDNS
org.archive.crawler.fetcher.FetchHTTP

5. Extractor

org.archive.crawler.extractor.ExtractorHTTP
org.archive.crawler.extractor.ExtractorHTML
(Add more here as needed, such as ExtractorSWF or ExtractorJS, but the first two are required.)

6. Writer

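Either MirrorWriter or ARCWriter (the processor classes org.archive.crawler.writer.MirrorWriterProcessor and org.archive.crawler.writer.ARCWriterProcessor); MirrorWriter is generally recommended.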

7. PostProcessor

org.archive.crawler.postprocessor.CrawlStateUpdater
org.archive.crawler.postprocessor.LinksScoper
org.archive.crawler.postprocessor.FrontierScheduler
(FrontierScheduler can be extended with your own subclass, following the method in the book.)
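The layout above can be sanity-checked before starting a crawl. The sketch below is a hypothetical helper, not a Heritrix API: it assumes that each chain's processors live in the package shown in the list (plus org.archive.crawler.writer for the Writer chain), and flags any class name that sits in a chain whose package it does not match. Custom processors in your own packages would need extra entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical checker for illustration; it is not part of Heritrix.
final class ChainCheck {

    // Assumed mapping from chain name to the package its processors live in.
    static final Map<String, String> EXPECTED_PREFIX = new LinkedHashMap<>();
    static {
        EXPECTED_PREFIX.put("Prefetcher", "org.archive.crawler.prefetch.");
        EXPECTED_PREFIX.put("Fetcher", "org.archive.crawler.fetcher.");
        EXPECTED_PREFIX.put("Extractor", "org.archive.crawler.extractor.");
        EXPECTED_PREFIX.put("Writer", "org.archive.crawler.writer.");
        EXPECTED_PREFIX.put("PostProcessor", "org.archive.crawler.postprocessor.");
    }

    // Returns one message per class that sits in a chain it does not belong to.
    static List<String> findMisplaced(Map<String, List<String>> chains) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : chains.entrySet()) {
            String chain = entry.getKey();
            String prefix = EXPECTED_PREFIX.get(chain);
            if (prefix == null) {
                problems.add("unknown chain: " + chain);
                continue;
            }
            for (String className : entry.getValue()) {
                if (!className.startsWith(prefix)) {
                    problems.add(className + " does not belong in " + chain);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, List<String>> chains = new LinkedHashMap<>();
        List<String> prefetchers = new ArrayList<>();
        prefetchers.add("org.archive.crawler.prefetch.Preselector");
        // Deliberate mistake: a writer placed in the Prefetcher chain.
        prefetchers.add("org.archive.crawler.writer.MirrorWriterProcessor");
        chains.put("Prefetcher", prefetchers);
        for (String problem : findMisplaced(chains)) {
            System.out.println(problem);
        }
    }
}
```

A package-prefix check is deliberately crude: it catches the "Writer in the Prefetcher slot" mistake described above without needing the Heritrix jar on the classpath.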
