Setting Up a Nutch Environment

Environment: Ubuntu 11.10, 64-bit

Download the 64-bit JDK 1.7: http://www.oracle.com/technetwork/cn/java/javase/downloads/jdk7-downloads-1880260.html

paohaijiao@ubuntu:~$ uname -a
Linux ubuntu 3.0.0-12-generic #20-Ubuntu SMP Fri Oct 7 14:56:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

paohaijiao@ubuntu:~$ java -version
The program 'java' can be found in the following packages:
 * gcj-4.4-jre-headless
 * gcj-4.6-jre-headless
 * openjdk-6-jre-headless
 * gcj-4.5-jre-headless
 * openjdk-7-jre-headless
Try: sudo apt-get install <selected package>
paohaijiao@ubuntu:~$

paohaijiao@ubuntu:~$ sudo mkdir /usr/java
[sudo] password for paohaijiao:

2. Enable the SSH service

sudo apt-get install openssh-server

paohaijiao@ubuntu:~/Desktop$ ls
jdk-7u79-linux-x64.tar.gz
paohaijiao@ubuntu:~/Desktop$ sudo mv jdk-7u79-linux-x64.tar.gz /usr/java  
paohaijiao@ubuntu:~/Desktop$ cd /usr/java  
paohaijiao@ubuntu:/usr/java$ ls
jdk-7u79-linux-x64.tar.gz
paohaijiao@ubuntu:/usr/java$ sudo tar -zxvf jdk-7u79-linux-x64.tar.gz
paohaijiao@ubuntu:/usr/java$ sudo rm jdk-7u79-linux-x64.tar.gz
paohaijiao@ubuntu:/usr/java$ ls
jdk1.7.0_79
paohaijiao@ubuntu:/usr/java$ cd jdk1.7.0_79

sudo vim /etc/profile

export JAVA_HOME=/usr/java/jdk1.7.0_79
export JRE_HOME=/usr/java/jdk1.7.0_79/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin

source /etc/profile
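
Before moving on, it can help to confirm that the directory named in JAVA_HOME really contains a java binary. The following is a minimal sketch; check_java_home is a hypothetical helper written for this post, not something shipped with the JDK.

```shell
#!/bin/sh
# check_java_home: hypothetical helper (an assumption for illustration).
# Reports whether <dir>/bin/java exists and is executable.
check_java_home() {
    # $1: the directory that JAVA_HOME is supposed to point at
    if [ -x "$1/bin/java" ]; then
        echo "ok: $1/bin/java is executable"
    else
        echo "missing: $1/bin/java"
        return 1
    fi
}
```

After running source /etc/profile, a call such as check_java_home "$JAVA_HOME" && java -version would be one way to use it.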

paohaijiao@ubuntu:/usr/java/jdk1.7.0_79$ java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

Nutch 1.6 source: http://svn.apache.org/repos/asf/nutch/tags/release-1.6/
Switch the package mirror:
Open Software Update -> Ubuntu Software -> Server for United States -> Other -> China -> 163
paohaijiao@ubuntu:/usr/java/jdk1.7.0_79$ sudo apt-get install subversion
paohaijiao@ubuntu:~$ svn co http://svn.apache.org/repos/asf/nutch/tags/release-1.6/
A    release-1.6/LICENSE.txt
 U   release-1.6
Checked out revision 1721679.

paohaijiao@ubuntu:~$ cd release-1.6
paohaijiao@ubuntu:~/release-1.6$ ls
build.xml    conf                ivy   lib          NOTICE.txt  README.txt
CHANGES.txt  default.properties  KEYS  LICENSE.txt  pom.xml     src

paohaijiao@ubuntu:~/release-1.6$ ant
The program 'ant' can be found in the following packages:
 * ant
 * ant1.7
Try: sudo apt-get install <selected package>
paohaijiao@ubuntu:~/release-1.6$ sudo apt-get install ant
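paohaijiao@ubuntu:~/release-1.6$ cd /home/paohaijiao/release-1.6
Change into the directory that contains build.xml and run ant to compile.
1. Nutch gave rise to Hadoop, Tika, and Gora (a data storage layer abstraction).
2. Since version 1.2, Nutch has used Ivy for dependency management.
3. The Nutch source is managed with svn.
4. Lucene, Nutch, and Hadoop are all well known.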
job:
      [jar] Building jar: /home/paohaijiao/release-1.6/build/apache-nutch-1.6.job

runtime:
    [mkdir] Created dir: /home/paohaijiao/release-1.6/runtime
    [mkdir] Created dir: /home/paohaijiao/release-1.6/runtime/local
    [mkdir] Created dir: /home/paohaijiao/release-1.6/runtime/deploy
     [copy] Copying 1 file to /home/paohaijiao/release-1.6/runtime/deploy
     [copy] Copying 2 files to /home/paohaijiao/release-1.6/runtime/deploy/bin
     [copy] Copying 1 file to /home/paohaijiao/release-1.6/runtime/local/lib
     [copy] Copying 1 file to /home/paohaijiao/release-1.6/runtime/local/lib/native
     [copy] Copying 23 files to /home/paohaijiao/release-1.6/runtime/local/conf
     [copy] Copying 2 files to /home/paohaijiao/release-1.6/runtime/local/bin
     [copy] Copying 50 files to /home/paohaijiao/release-1.6/runtime/local/lib
     [copy] Copying 126 files to /home/paohaijiao/release-1.6/runtime/local/plugins
     [copy] Copied 2 empty directories to 2 empty directories under /home/paohaijiao/release-1.6/runtime/local/test
The build took two hours.
paohaijiao@ubuntu:~/release-1.6$ ls
build      CHANGES.txt  default.properties  KEYS  LICENSE.txt  pom.xml     runtime
build.xml  conf         ivy                 lib   NOTICE.txt   README.txt  src
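
The build step above can be wrapped in a small guard so that ant only runs where build.xml is present, followed by a check that the runtime directory was produced. This is only a sketch: build_nutch is a hypothetical function name, and the check for runtime/local/bin/nutch is based on the directory listing shown in this post.

```shell
#!/bin/sh
# build_nutch: hypothetical wrapper (an assumption for illustration).
# Refuses to run ant unless build.xml is present, then checks for the
# runtime/local/bin/nutch script that the build is expected to create.
build_nutch() {
    src_dir=$1
    if [ ! -f "$src_dir/build.xml" ]; then
        echo "no build.xml in $src_dir" >&2
        return 1
    fi
    # Run ant in a subshell so the caller's working directory is unchanged.
    (cd "$src_dir" && ant) || return 1
    [ -f "$src_dir/runtime/local/bin/nutch" ]
}
```

For the checkout used here, the call would look like build_nutch "$HOME/release-1.6".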

paohaijiao@ubuntu:~/release-1.6/runtime$ ls
deploy (used with Hadoop)  local
paohaijiao@ubuntu:~/release-1.6/runtime/local$ ls
bin  conf  lib  plugins  test
paohaijiao@ubuntu:~/release-1.6/runtime/local$ ls bin
crawl  nutch
How are Nutch and Hadoop tied together? Through the nutch script.
paohaijiao@ubuntu:~/release-1.6/runtime/local$ bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
paohaijiao@ubuntu:~/release-1.6/runtime/local$
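When getting started with Nutch, the key thing to study is how the nutch script is implemented: it uses the hadoop command to submit apache-nutch-1.6.job to the Hadoop JobTracker.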
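
To make the role of the script more concrete, here is a simplified sketch of the kind of decision it makes. It is not the actual bin/nutch source. The assumption is that a *nutch*.job file in the script's home directory signals that the work should go through hadoop, and that otherwise everything runs locally. nutch_mode is a hypothetical helper name.

```shell
#!/bin/sh
# nutch_mode: hypothetical helper (an assumption for illustration).
# Prints "deploy <job file>" if a *nutch*.job file exists in the given
# directory, otherwise prints "local".
nutch_mode() {
    nutch_home=$1
    for f in "$nutch_home"/*nutch*.job; do
        # If the glob matched nothing, $f is the literal pattern
        # and the -f test fails.
        if [ -f "$f" ]; then
            echo "deploy $f"
            return 0
        fi
    done
    echo "local"
}
```

Under this assumption, runtime/deploy (which holds the .job file) would report deploy, and runtime/local would report local.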
paohaijiao@ubuntu:~/release-1.6/runtime/local$ bin/nutch crawl
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
paohaijiao@ubuntu:~/release-1.6/runtime/local$ mkdir urls
paohaijiao@ubuntu:~/release-1.6/runtime/local$ vi urls/url.txt
http://blog.tianya.cn/
paohaijiao@ubuntu:~/release-1.6/runtime/local$ nohup bin/nutch crawl urls  -dir data -threads 100 -depth 5 &
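
The seed file can also be written without opening an editor, which is handy when the steps are scripted. A minimal sketch follows; write_seeds is a hypothetical helper, and the url.txt file name simply follows the one used above.

```shell
#!/bin/sh
# write_seeds: hypothetical helper (an assumption for illustration).
# Creates the seed directory and writes one URL per line to url.txt.
write_seeds() {
    seed_dir=$1
    shift
    mkdir -p "$seed_dir"
    printf '%s\n' "$@" > "$seed_dir/url.txt"
}
```

With it, write_seeds urls http://blog.tianya.cn/ produces the same urls/url.txt as the vi session. Since the crawl is started with nohup and no redirection, its output normally ends up in nohup.out in the current directory, and tail -f nohup.out can be used to follow progress.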










