Nutch-Crawl: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http

I got the following error when running Nutch:

08/07/07 04:05:41 INFO conf.Configuration: found resource crawl-urlfilter.txt at file:/home/hut/installfiles/nutch-0.9/out/production/nutch-0.9/crawl-urlfilter.txt
08/07/07 04:05:41 INFO conf.Configuration: found resource parse-plugins.xml at file:/home/hut/installfiles/nutch-0.9/out/production/nutch-0.9/parse-plugins.xml
08/07/07 04:05:41 INFO fetcher.Fetcher: fetching http://www.yale.edu/
08/07/07 04:05:41 INFO fetcher.Fetcher: fetching http://www.harvard.edu/
08/07/07 04:05:41 INFO fetcher.Fetcher: fetch of http://www.harvard.edu/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
08/07/07 04:05:41 INFO fetcher.Fetcher: fetch of http://www.yale.edu/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http

Solution: set the following in nutch-site.xml
    <property>
        <name>plugin.includes</name>
        <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        <description>Regular expression naming plugin directory names to
            include. Any plugin not matching this expression is excluded.
            In any case you need at least include the nutch-extensionpoints plugin. By
            default Nutch includes crawling just HTML and plain text via HTTP,
            and basic indexing and search plugins. In order to use HTTPS please enable
            protocol-httpclient, but be aware of possible intermittent problems with the
            underlying commons-httpclient library.
        </description>
    </property>

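I had mistakenly deleted the leading "nutch-extensionpoints|" from the value; after restoring it, everything worked normally. Out of the box, the Nutch 0.9 nutch-site.xml does not contain a plugin.includes property, so Nutch loads every plugin named by the plugin.includes property in nutch-default.xml. The reason to edit or add a plugin.includes property in nutch-site.xml is to include our own plugins, which overrides the definition in nutch-default.xml.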
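
The effect of that deletion can be checked outside Nutch. The sketch below is only an approximation in Python, not Nutch's own code: it assumes each plugin directory name must match the whole plugin.includes expression, and the list of directory names is a hypothetical sample rather than the real contents of the plugins folder.

```python
import re

# Value of plugin.includes from the nutch-site.xml shown above.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js)|"
    "index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)

# Hypothetical sample of plugin directory names (assumption, for illustration).
PLUGIN_DIRS = [
    "nutch-extensionpoints",
    "protocol-http",
    "protocol-httpclient",
    "parse-html",
    "parse-pdf",
]


def included_plugins(includes, plugin_dirs):
    """Return the directory names that match the whole expression.

    Assumption: a name is included only on a full match, so
    "protocol-httpclient" is not pulled in by "protocol-http".
    """
    pattern = re.compile(includes)
    return [name for name in plugin_dirs if pattern.fullmatch(name)]


# With the full value, the extension-points plugin is included.
good = included_plugins(PLUGIN_INCLUDES, PLUGIN_DIRS)

# Simulate the mistake: drop the leading "nutch-extensionpoints|".
broken_value = PLUGIN_INCLUDES.replace("nutch-extensionpoints|", "", 1)
broken = included_plugins(broken_value, PLUGIN_DIRS)

print("included with full value:   ", good)
print("included after the deletion:", broken)
```

In this sketch protocol-http still matches after the deletion and only nutch-extensionpoints drops out. That is consistent with the description's warning that nutch-extensionpoints must always be included, and it suggests (as an inference, not something the log proves) that the fetcher failed because the extension points were never loaded, not because protocol-http was excluded.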
