Nutch Windows Installation Guide
--By Liming Liu
1 Install Cygwin
2 Install JDK
3 Install Tomcat
4 Pre-install Nutch
5 Configure and run Nutch
6 Begin search
7 Reference
Download and install the latest version of Cygwin; be sure to select GCC when choosing packages.
Download jdk-1_5_0_06-windows-i586-p.exe and install it (by default, to C:/Program Files/Java/jdk1.5.0_06).
Set the environment variables:
NUTCH_JAVA_HOME: C:/Program Files/Java/jdk1.5.0_06
JAVA_HOME: C:/Program Files/Java/jdk1.5.0_06
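Inside a Cygwin shell the same two variables can also be set for the current session. The lines below are a minimal sketch that assumes the default JDK path from this step; adjust the path if the JDK was installed elsewhere.

```shell
# Assumed JDK location (the default install path from the step above).
JDK_DIR="C:/Program Files/Java/jdk1.5.0_06"

# Export both variables so that bin/nutch can find the JDK.
# The quotes matter because the path contains a space.
export JAVA_HOME="$JDK_DIR"
export NUTCH_JAVA_HOME="$JDK_DIR"

# Print the values for a quick visual check.
echo "JAVA_HOME=$JAVA_HOME"
echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```

Variables set this way last only for the current shell session; the Windows system settings remain the place for a permanent value.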
Download apache-tomcat-6.0.13.exe and install it (by default, to C:/Program Files/Apache Software Foundation/Tomcat 6.0). Note the port, account name and password that you choose during installation.
Download nutch-0.9.tar.gz and unpack it to nutch-0.9 (for example, C:/dev/search/netch/nutch-0.9).
Start the Tomcat service and open http://localhost:8080/manager/html
Go to “WAR file to deploy” and upload the file C:/dev/search/netch/nutch-0.9/nutch-0.9.war.
Stop the Tomcat service. In “C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps”, rename the directory “ROOT” to “ROOT-backup”, then rename the directory “nutch-0.9” to “ROOT”. (This renaming is optional.)
Create a directory “urls” in “C:/dev/search/netch/nutch-0.9”.
Create a file “testurlfile” in the “urls” directory.
Add the line “http://www.bokee.com” to the file “testurlfile”.
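The three steps above can be done from the Cygwin shell. This is a minimal sketch that assumes the current directory is the Nutch directory (for example C:/dev/search/netch/nutch-0.9).

```shell
# Create the seed directory; -p makes this safe to repeat.
mkdir -p urls

# Write the single seed URL. The file holds one URL per line.
echo "http://www.bokee.com" > urls/testurlfile

# Show the result for a quick visual check.
cat urls/testurlfile
```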
Open the file “C:/dev/search/netch/nutch-0.9/conf/crawl-urlfilter.txt” and replace “MY.DOMAIN.NAME” with “bokee.com”.
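The replacement can also be scripted with sed. The helper name set_crawl_domain below is hypothetical, and the sketch assumes that the placeholder appears literally as MY.DOMAIN.NAME in the filter file, as it does in the accept rule shipped with Nutch 0.9.

```shell
# set_crawl_domain FILE DOMAIN
# Hypothetical helper: replaces the MY.DOMAIN.NAME placeholder in FILE
# with DOMAIN, keeping a backup copy as FILE.bak.
set_crawl_domain() {
    file=$1
    domain=$2
    cp "$file" "$file.bak"
    # The dots are escaped so that only the literal placeholder matches.
    sed "s/MY\.DOMAIN\.NAME/$domain/g" "$file.bak" > "$file"
}

# Usage, from the Nutch directory:
#   set_crawl_domain conf/crawl-urlfilter.txt bokee.com
```

Keeping the .bak copy makes it easy to restore the original filter if the crawl later needs a different domain.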
Open the file “C:/dev/search/netch/nutch-0.9/conf/nutch-site.xml” and set the following properties in it:
http.agent.name: The HTTP 'User-Agent' request header. It must not be empty; please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
http.agent.description: Further description of the crawler. This text is used in the User-Agent header. It appears in parentheses after the agent name.
http.agent.url: A URL to advertise in the User-Agent header. It will appear in parentheses after the agent name. Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler.
http.agent.email: An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming.
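As an illustration of the resulting file, the sketch below writes a nutch-site.xml with these four properties from the Cygwin shell. Every value in it (mycrawler, the example.com URL, and so on) is a placeholder assumed for this example, not the author's setting. It assumes the current directory is the Nutch directory and overwrites any existing conf/nutch-site.xml.

```shell
# All values below are placeholders (assumptions); replace them with your own.
# Warning: this overwrites an existing conf/nutch-site.xml.
mkdir -p conf
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mycrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>info at example dot com</value>
  </property>
</configuration>
EOF
```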
Find the configuration file under “C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/ROOT/WEB-INF/classes/” and edit it.
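This step does not name the file or show its contents. As an assumption (not taken from this guide), a common Nutch 0.9 arrangement is a nutch-site.xml in that classes directory whose searcher.dir property points at the crawl output directory, so that the web application can find the index. The sketch below writes such a file; the helper name write_searcher_dir and the crawl.demo path in the usage line are illustrative.

```shell
# write_searcher_dir CLASSES_DIR CRAWL_DIR
# Hypothetical helper: writes a nutch-site.xml into CLASSES_DIR that
# points the search web application at CRAWL_DIR (an assumed setup).
# Warning: this overwrites an existing nutch-site.xml in CLASSES_DIR.
write_searcher_dir() {
    classes_dir=$1
    crawl_dir=$2
    cat > "$classes_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>$crawl_dir</value>
  </property>
</configuration>
EOF
}

# Usage (paths assumed from the other steps of this guide):
#   write_searcher_dir \
#     "C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/ROOT/WEB-INF/classes" \
#     "C:/dev/search/netch/nutch-0.9/crawl.demo"
```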
Open the file “C:/Program Files/Apache Software Foundation/Tomcat 6.0/conf/server.xml” and change the element “<Connector port="8080" …/>” to this:
<Connector port="8080" maxThreads="150" minSpareThreads="25" maxSpareThreads="75" enableLookups="false" redirectPort="8443" acceptCount="100" debug="0" connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="UTF-8"/>
Start the Tomcat service.
Start Cygwin, cd to “C:/dev/search/netch/nutch-0.9”, and run: bin/nutch crawl urls -dir crawl.demo -depth 2 -topN 50
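After the crawl finishes, it can help to confirm that the output directory was fully written before opening the search page. The sketch below assumes that a completed Nutch 0.9 crawl leaves the subdirectories crawldb, linkdb, segments, indexes and index under crawl.demo; the helper name check_crawl_dir is hypothetical.

```shell
# check_crawl_dir DIR
# Hypothetical helper: succeeds only if DIR contains every subdirectory
# that a finished Nutch 0.9 crawl is assumed to produce.
check_crawl_dir() {
    dir=$1
    for sub in crawldb linkdb segments indexes index; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub"
            return 1
        fi
    done
    echo "crawl output looks complete: $dir"
}

# Usage, from the Nutch directory after the crawl finishes:
#   check_crawl_dir crawl.demo
```

A missing subdirectory usually means the crawl stopped early, for example because the URL filter rejected the seed URL or the agent name was left empty.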
Open http://localhost:8080 in Internet Explorer; you will see a working search engine.
(If you did not rename the directory “nutch-0.9” to “ROOT”, open http://localhost:8080/nutch-0.9 instead.)
http://www.javaeye.com/topic/81627 Nutch 0.8 in Practice (1), X.D.Hua
http://www.ideagrace.com/club/simple/index.php?t312.html Nutch on WinXP, Kevin
http://blog.csdn.net/pwlazy/archive/2006/08/23/1109868.aspx A First Look at Nutch 0.8 on Windows, pwlazy
Liming Liu (刘黎明): M.S. in Computer Science, University of Science and Technology Beijing, [email protected]