The Nutch crawl command

1. Create the urls directory and add a file named 163

[root@localhost nutch]# mkdir urls
[root@localhost nutch]# echo http://www.163.com/ >> urls/

2. Edit conf/crawl-urlfilter.txt to specify which sites to crawl

[root@localhost nutch]# vi conf/crawl-urlfilter.txt

Replace MY.DOMAIN.NAME so that the rule reads:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*163.com/
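To get a feel for which URLs this rule lets through, the pattern can be tried outside Nutch. This is only a rough sketch: it uses grep -E as a stand-in for the Java regular expressions that Nutch actually evaluates, and the helper name matches_filter and the sample URLs are made up for illustration.

```shell
#!/bin/sh
# The accept rule from conf/crawl-urlfilter.txt, without the leading "+".
pattern='^http://([a-z0-9]*\.)*163.com/'

# Hypothetical helper: succeeds when the URL matches the accept pattern.
matches_filter() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Sample URLs chosen for illustration only.
for url in http://www.163.com/ http://news.163.com/index.html http://www.example.com/; do
  if matches_filter "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The dot in 163.com is not escaped, so it matches any character; writing 163\.com would make the rule stricter.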

3. Edit conf/nutch-site.xml, add the agent properties, and set their values

<property>
  <name>http.agent.name</name>
  <value>bupt</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

Configure the other properties in the same way. Note that the value of http.robots.agents should also be bupt.

4. Configure the search directory for Tomcat

[root@localhost nutch]# cd ~/tomcat
[root@localhost tomcat]# vi webapps/ROOT/WEB-INF/classes/nutch-site.xml

Add four lines so that the file reads:

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/root/nutch/crawl.demo</value>
  </property>
</configuration>

The value must point to the directory where Nutch saves the crawled pages.
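A quick way to catch a mistyped path is to read searcher.dir back out of the file and check that the directory exists. The following is a minimal sketch under stated assumptions: prop_value is a hypothetical helper, not part of Nutch or Tomcat, and it expects the name and value elements to sit on separate lines. The demo runs against a temporary copy of the configuration rather than the real file.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that follows <name>NAME</name>.
# Assumes <name> and <value> are on separate lines, one element per line.
prop_value() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Demo on a throwaway copy of the configuration from this step.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/root/nutch/crawl.demo</value>
  </property>
</configuration>
EOF

dir=$(prop_value "$conf" searcher.dir)
if [ -d "$dir" ]; then
  echo "searcher.dir exists: $dir"
else
  echo "searcher.dir not found yet: $dir"
fi
rm -f "$conf"
```

Before the first crawl has run, the "not found yet" branch is the expected result, since the crawl in step 6 is what creates the directory.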

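5. Garbled Chinese characters
Nutch's support for Chinese is still incomplete, so the conf/server.xml file under the Tomcat directory needs to be modified.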

[root@localhost tomcat]# vi conf/server.xml

Add the last two attributes so that the Connector reads:

<Connector port="8080"
           maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
           enableLookups="false" redirectPort="8443" acceptCount="100"
           connectionTimeout="20000" disableUploadTimeout="true"
           URIEncoding="UTF-8" useBodyEncodingForURI="true" />
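The reason URIEncoding matters is that a Chinese search term reaches Tomcat as percent-encoded UTF-8 bytes in the query string, and Tomcat releases of that era decode the URI as ISO-8859-1 unless told otherwise. The sketch below shows what those bytes look like. urlencode_utf8 is a hypothetical helper; it assumes the script is saved and run in a UTF-8 environment, and for simplicity it escapes every byte, ASCII included.

```shell
#!/bin/sh
# Hypothetical helper: percent-encode every byte of the argument.
urlencode_utf8() {
  # od dumps the bytes as hex, tr strips spacing, sed prefixes each
  # pair of hex digits with "%", and the last tr upper-cases a-f.
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../%&/g' | tr 'a-f' 'A-F'
  echo
}

urlencode_utf8 '中文'
```

In a UTF-8 environment this should print %E4%B8%AD%E6%96%87, three bytes per character. Decoding those six bytes as ISO-8859-1 yields six unrelated Latin characters, which is the garbling this step addresses.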

6. Run the crawl command

[root@localhost tomcat]# cd ~/nutch
[root@localhost nutch]# bin/nutch crawl urls -dir crawl.demo -depth 2 -threads 4 -topN 50 >& crawl.log
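As a reading aid for the options: -dir names the directory the crawl is written to, -depth is the link depth to follow from the seed pages, -threads is the number of threads fetching in parallel, and -topN caps the number of pages retrieved at each level. The sketch below only prints the command and the resulting upper bound on fetched pages. It does not run Nutch, and the variable and function names are illustrative.

```shell
#!/bin/sh
# Values taken from the crawl command in this step.
seed_dir=urls
out_dir=crawl.demo
depth=2
threads=4
top_n=50

# Upper bound on fetched pages: at most top_n pages per level, depth levels.
max_pages() {
  echo $(( $1 * $2 ))
}

echo "bin/nutch crawl $seed_dir -dir $out_dir -depth $depth -threads $threads -topN $top_n"
echo "at most $(max_pages "$depth" "$top_n") pages will be fetched"
```

With depth 2 and topN 50 the bound is 100 pages, which keeps a first test crawl small. The trailing >& crawl.log in the original command is bash and csh syntax that sends both standard output and standard error to crawl.log.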
