A first look at Nutch 0.8 on Windows

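A while ago I tried Nutch 0.8 and could not get it to work, whereas nutch-0.7.x worked without trouble. At first I assumed 0.8 itself was broken, but several people told me it works fine, so I tried again and this time it worked. Here are the points to watch out for, so that newcomers can avoid the same detours.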

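I develop on Windows and did not want to download Cygwin, so I rewrote the shell script as an Ant build file. Running the Ant target has the same effect as the script. Here it is: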
<project name="nutch-crawl" default="crawl" basedir=".">

  <property name="lib.dir" location="lib" />
  <property name="conf.dir" location="conf" />
  <property name="urls.dir" location="urls" />

  <!-- Entry order matters: conf must come before nutch-*.jar -->
  <path id="project.classpath">
    <fileset dir="${lib.dir}" />
    <pathelement path="${conf.dir}" />
    <fileset dir="." includes="nutch-*.jar" />
  </path>

  <target name="crawl">
    <echo>crawling starting...</echo>
    <property name="JVM.extra.args" value="-Xmx1000m" />
    <java classname="org.apache.nutch.crawl.Crawl" classpathref="project.classpath" fork="true">
      <jvmarg line="${JVM.extra.args}" />
      <!-- directory holding the seed URL file -->
      <arg value="${urls.dir}" />
      <!-- output directory for the crawl -->
      <arg value="-dir" />
      <arg value="e:/xxcrawled20" />
      <arg value="-depth" />
      <arg value="2" />
      <arg value="-threads" />
      <arg value="10" />
    </java>
    <echo>crawling finished...</echo>
  </target>

</project>


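Two things to watch out for:
1) Create a directory named urls and put a file in it; the file lists the URLs you want to crawl.
2) Edit nutch-site.xml to override the http.agent.name property. It must be given a non-empty value.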
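
As a rough illustration of point 2, a nutch-site.xml override might look like the sketch below. The agent name my-test-crawler is a made-up placeholder, and the <configuration> root element is an assumption based on the conf files shipped with 0.8; check the nutch-default.xml in your own distribution and use the same root element. For point 1, the seed file is plain text with one URL per line, for example a hypothetical urls/seed.txt.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml; the root element is assumed, see nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- placeholder value: replace with a name for your own crawler -->
    <value>my-test-crawler</value>
  </property>
</configuration>
```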


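One more thing: if you use the Ant script above, pay attention to the order of the classpath entries. <pathelement path="${conf.dir}" /> must come before
<fileset dir="." includes="nutch-*.jar" />. Otherwise the empty nutch-site.xml packaged inside the jar takes precedence over the nutch-site.xml you edited in the conf directory.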
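
If you are unsure which order Ant actually resolves, one way to check is to print the classpath. The target below is a sketch and not part of the original script: the target name print-classpath and the property name resolved.classpath are made up for illustration. It would sit inside the same <project> element, next to the crawl target.

```xml
<!-- Hypothetical helper target: prints each classpath entry on its own line -->
<target name="print-classpath">
  <!-- pathconvert turns the path reference into a string property -->
  <pathconvert property="resolved.classpath"
               refid="project.classpath"
               pathsep="${line.separator}" />
  <echo message="${resolved.classpath}" />
</target>
```

Run it with ant print-classpath and confirm that the conf directory appears above the nutch-*.jar entry.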

There is not much to say about the search part.
Add the following to nutch-site.xml:
<property>
  <name>searcher.dir</name>
  <value>e:/xxcrawled20</value>
</property>

Set the value to the directory you specified with -dir when crawling.
