Building apache-nutch-2.2.1 with Ant and using MySQL as the storage backend for crawling

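1 Install Ant (omitted)

The official 2.x releases are currently distributed as source only. Prebuilt binaries are no longer provided, so you have to build Nutch yourself.

2 Download Nutch

2.1 Download URL: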

http://www.apache.org/dyn/closer.lua/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz  
tar -zxvf apache-nutch-2.2.1-src.tar.gz

2.2 Download the Sonar jar and place it in the apache-nutch-2.2.1 directory
Edit build.xml so that it picks up the jar you just added:

 
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
    <classpath path="${ant.library.dir}" />
    <classpath path="${mysql.library.dir}" />
    <classpath><fileset dir="." includes="sonar*.jar" /></classpath>
</taskdef>

2.3 The Nutch build downloads jars from the Maven repository, so change the repository URL

vi apache-nutch-2.2.1/ivy/ivysettings.xml
<property name="repo.maven.org" value="http://repo1.maven.org/maven2/" override="false"/>
Change value="http://repo1.maven.org/maven2/" to value="http://repo2.maven.org/maven2/".
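The same edit can be scripted instead of made by hand in vi. This is only a sketch: it assumes the repo.maven.org property sits on a single line in the form shown above, and set_maven_repo is a made-up helper name.

```shell
# Hypothetical helper: rewrite the value of the repo.maven.org property
# in an Ivy settings file.
# Assumption: the whole <property .../> element is on one line.
# Usage: set_maven_repo <settings-file> <new-url>
set_maven_repo() {
  file=$1
  url=$2
  tmp=$(mktemp) || return 1
  # Only the line that defines repo.maven.org is changed; all others pass through.
  if sed "/name=\"repo\.maven\.org\"/ s|value=\"[^\"]*\"|value=\"$url\"|" "$file" > "$tmp"; then
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}
```

For example: set_maven_repo apache-nutch-2.2.1/ivy/ivysettings.xml http://repo2.maven.org/maven2/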

3 Use MySQL as the Nutch storage backend

Edit apache-nutch-2.2.1/ivy/ivy.xml and uncomment the following lines:

<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/> 
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" /> 

Change

<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>

to

<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

4 Database connection settings

Edit apache-nutch-2.2.1/conf/gora.properties: comment out the default database connection settings and add the following:

###############################
# Default SqlStore properties #
###############################

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://10.202.13.175:3306/nutch-test
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=sf123456
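
A driver class that does not match the JDBC URL is an easy mistake to make in this file. As a rough sketch, the two settings can be compared with a small shell function. check_gora_jdbc is a made-up helper name, and com.mysql.jdbc.Driver is the driver class shipped in mysql-connector-java 5.1.x.

```shell
# Hypothetical sanity check: does the JDBC driver class match the JDBC URL scheme?
# Usage: check_gora_jdbc <path-to-gora.properties>
check_gora_jdbc() {
  props=$1
  # Use the last uncommented definition of each key (later entries override earlier ones).
  driver=$(sed -n 's/^gora\.sqlstore\.jdbc\.driver=//p' "$props" | tail -n 1)
  url=$(sed -n 's/^gora\.sqlstore\.jdbc\.url=//p' "$props" | tail -n 1)
  case $url in
    jdbc:mysql:*)  expected=com.mysql.jdbc.Driver ;;
    jdbc:hsqldb:*) expected=org.hsqldb.jdbc.JDBCDriver ;;
    *) echo "unrecognized JDBC URL: $url"; return 2 ;;
  esac
  if [ "$driver" = "$expected" ]; then
    echo "ok: $driver"
  else
    echo "mismatch: url=$url expects $expected, found $driver"
    return 1
  fi
}
```

For example: check_gora_jdbc apache-nutch-2.2.1/conf/gora.properties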

5 Manually create the nutch database and the webpage table (optional, they can also be created automatically)

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
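
The table needs a database to live in, and the database name has to match the one in gora.sqlstore.jdbc.url. Below is a minimal sketch, assuming the nutch-test name, host, and user from the connection settings in section 4; create_db_sql is a made-up helper name. The backticks matter because a hyphenated name is not a valid unquoted identifier in MySQL.

```shell
# Hypothetical helper: print a CREATE DATABASE statement for the given name.
# The identifier is backtick-quoted so that names containing a hyphen work.
# Usage: create_db_sql <database-name>
create_db_sql() {
  printf 'CREATE DATABASE IF NOT EXISTS `%s` DEFAULT CHARACTER SET utf8;\n' "$1"
}
```

The statement can then be piped into the mysql client, which prompts for the password: create_db_sql nutch-test | mysql -h 10.202.13.175 -u root -p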

6 Edit the ${NUTCH_HOME}/conf/nutch-site.xml configuration file

Add the following inside the <configuration> element:
<property>
    <name>http.agent.name</name>
    <value>YourNutchSpider</value>
</property>

<property>
    <name>http.accept.language</name>
    <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
    <description>Value of the Accept-Language request header field.
        This allows selecting non-English language as default one to retrieve.
        It is a useful setting for search engines build for certain national group.
    </description>
</property>

<property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
    <description>The Gora DataStore class for storing and retrieving data.
        Currently the following stores are available: ...
    </description>
</property>

<property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
    <description>The character encoding to fall back to when no other information
    is available</description>
</property>

<property>
    <name>generate.batch.id</name>
    <value>*</value>
</property>
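
Before building, it can help to confirm that the properties the later steps depend on actually made it into the file. Nutch will not fetch without http.agent.name, and storage.data.store.class is what selects the SQL store. This is a plain-text check and not an XML parse, and check_site_props is a made-up helper name.

```shell
# Hypothetical check: report required properties that are missing
# from a nutch-site.xml file.
# Usage: check_site_props <path-to-nutch-site.xml>
check_site_props() {
  site=$1
  missing=0
  for name in http.agent.name storage.data.store.class; do
    # -F matches the tag literally, so the dots in the name are not wildcards.
    if ! grep -qF "<name>$name</name>" "$site"; then
      echo "missing property: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

For example: check_site_props "${NUTCH_HOME}/conf/nutch-site.xml"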

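7 Build with Ant

Change to the apache-nutch-2.2.1 top-level directory and run the ant command.

8 Add jars manually (optional)

Ivy may fail to download a few jars. These can be fetched directly, for example from http://repo2.maven.org/maven2/org/bouncycastle/bcprov-jdk15on/1.52/bcprov-jdk15on-1.52.jar
Copy the jar into the Ivy repository, then run the build again.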

9 Crawl configuration

9.1 Set the sites to crawl

cd apache-nutch-2.2.1/runtime/local 
mkdir -p urls 
echo 'http://www.sina.com' > urls/seed.txt

9.2 Run the crawl

cd apache-nutch-2.2.1/runtime/local
bin/nutch crawl urls -depth 3 -topN 5
