Crawling JS with Nutch

1. Edit regex-urlfilter.txt and remove js|JS from the suffix skip rule, so that it reads:
    # skip image and other suffixes we can't yet parse    
    # for a more extensive coverage use the urlfilter-suffix plugin    
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP)$    
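
To check the effect of the rule above, here is a minimal Java sketch. It assumes the filter rejects a URL when the "-" pattern is found anywhere in it; the class SuffixFilterDemo and the method isSkipped are hypothetical names for illustration, not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class SuffixFilterDemo {
    // The skip rule from step 1, with js|JS already removed.
    static final Pattern SKIP = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS"
        + "|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM"
        + "|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP)$");

    // Assumption: a URL is rejected when the "-" pattern is found in it.
    static boolean isSkipped(String url) {
        return SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isSkipped("http://example.com/app.js"));   // false
        System.out.println(isSkipped("http://example.com/logo.png")); // true
    }
}
```

With js|JS removed, a URL ending in .js no longer matches the skip rule, while image suffixes are still rejected.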
        
2. Edit nutch-site.xml and add parse-js to plugin.includes
    <property>    
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|parse-(html|js|tika)|index-(basic|anchor|self)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
        <description>Regular expression naming plugin directory names to
        include.  Any plugin not matching this expression is excluded.
        In any case you need at least include the nutch-extensionpoints plugin. By
        default Nutch includes crawling just HTML and plain text via HTTP,
        and basic indexing and search plugins. In order to use HTTPS please enable
        protocol-httpclient, but be aware of possible intermittent problems with the
        underlying commons-httpclient library.
        </description>
    </property>    
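
A quick way to confirm that the plugin.includes value above covers parse-js is to test it as a regular expression. The sketch below assumes the whole plugin directory name must match the expression; PluginIncludesDemo and isIncluded are hypothetical names, not Nutch API.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class PluginIncludesDemo {
    // The plugin.includes value from step 2.
    static final Pattern INCLUDES = Pattern.compile(
        "protocol-httpclient|urlfilter-regex|parse-(html|js|tika)"
        + "|index-(basic|anchor|self)|urlnormalizer-(pass|regex|basic)"
        + "|scoring-opic");

    // Assumption: the full plugin id has to match the expression.
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIncluded("parse-js"));  // true
        System.out.println(isIncluded("parse-swf")); // false
    }
}
```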
        
3. Modify the source code of the parse-js plugin
    3-1. Class to change: org.apache.nutch.parse.js.JSParseFilter
    Method to change: public Parse getParse(String url, WebPage page) {
    Original line: if (type != null && !type.trim().equals("") && !type.toLowerCase().startsWith("application/x-javascript"))
    Changed to: if (type != null && !type.trim().equals("") && !type.toLowerCase().startsWith("application/javascript"))
        
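    3-2. In parse-js/plugin.xml
    replace "application/x-javascript" with "application/javascript".
     (*) Explanation: three MIME types are in use for JavaScript: text/javascript, application/javascript, and application/x-javascript.
    The traditional MIME type for JavaScript programs is "text/javascript". "application/x-javascript" is also used (the x- prefix marks an experimental, non-standard type). RFC 4329 registered "text/javascript" because it was in widespread use. However, JavaScript programs are not really text files, so that type is considered obsolete and "application/javascript" (without the x- prefix) is recommended instead. In practice, "application/javascript" has not been well supported, which is probably why "application/x-javascript" still appears on so many web pages.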
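
Since all three MIME types show up in practice, one alternative to swapping a single string is a check that accepts any of them. This is a sketch of that idea only, not the change made in step 3 and not Nutch code; JsMimeTypeDemo and shouldSkip are hypothetical names. It keeps the behavior of the original condition that a missing or empty type is not skipped.

```java
import java.util.Locale;

// Hypothetical demo class, not part of Nutch.
class JsMimeTypeDemo {
    // The three JavaScript MIME types discussed above.
    static final String[] JS_TYPES = {
        "text/javascript",
        "application/javascript",
        "application/x-javascript"
    };

    // Returns true when the content should NOT be handed to the JS parser.
    // As in the original condition, a null or empty type is not skipped.
    static boolean shouldSkip(String type) {
        if (type == null || type.trim().isEmpty()) {
            return false;
        }
        String normalized = type.trim().toLowerCase(Locale.ROOT);
        for (String jsType : JS_TYPES) {
            // startsWith tolerates parameters such as "; charset=utf-8"
            if (normalized.startsWith(jsType)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip("application/x-javascript"));              // false
        System.out.println(shouldSkip("application/javascript; charset=utf-8")); // false
        System.out.println(shouldSkip("text/html"));                             // true
    }
}
```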

 
4. Edit nutch/conf/parse-plugins.xml
    Add:
    <mimeType name="application/javascript">
        <plugin id="parse-js" />
    </mimeType>
