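Setting up a search engine environment: Nutch 2.2.1 + Solr 4.2 + MySQL 5.7 (with PHP Solr extension installation)

If the JDK and Ant are already installed, skip the corresponding steps below.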



Installing and configuring the JDK


1. Download JDK 8 (http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)

2. Extract the tarball into /usr/local/java (tar needs -C to select the target directory)

# mkdir -p /usr/local/java
# tar -zxvf jdk-8xxx-linux-xxx.tar.gz -C /usr/local/java/

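3. The .tar.gz archive needs no further installation once it is extracted. Only if you downloaded a self-extracting .bin installer instead, go to /usr/local/java and run it

# ./jdk-8xxx-linux-xxx.bin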

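4. Configure the JDK environment variables

  • Open the profile in vi
# vi /etc/profile
  • Append the following at the end (JAVA_HOME is the installation directory; adjust the version number to match yours)
export JAVA_HOME=/usr/local/java/jdk1.8.xxx
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
  • Save the file, then source it
# source /etc/profile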
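
As an alternative to editing /etc/profile by hand, the export lines can be appended from the shell. The sketch below is only an illustration: `append_once` is a hypothetical helper name, not part of any tool used here. It adds a line to a file only when an identical line is not already present, so running it again does not duplicate the entries.

```shell
# append_once FILE LINE
# Append LINE to FILE unless an identical line is already there.
append_once() {
  file=$1
  line=$2
  # -x: match the whole line, -F: treat LINE as a literal string, -q: no output
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example, `append_once /etc/profile 'export JAVA_HOME=/usr/local/java/jdk1.8.xxx'` (run as root, with the version number adjusted) would add the JAVA_HOME line once. The same helper works for the Ant variables.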

5. Verify the JDK installation (JDK 8 uses a single dash: -version)

# java -version
# javac -version

Installing and configuring Ant


6. Download Ant (http://ant.apache.org)

7. Extract the tarball into /usr/local/ant

# mkdir -p /usr/local/ant
# tar -zxvf apache-ant-1.9.7-bin.tar.gz -C /usr/local/ant/

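8. Configure the Ant environment variables

  • Open the profile in vi
# vi /etc/profile
  • Append the following at the end (ANT_HOME must point to the directory extracted in step 7)
export ANT_HOME=/usr/local/ant/apache-ant-1.9.7
export PATH=$PATH:$ANT_HOME/bin
  • Save the file, then source it
# source /etc/profile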

9. Verify the Ant installation

# ant -version

Building Nutch


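10. Download Nutch 2.2.1 and extract it (http://nutch.apache.org). With --strip-components=1, tar drops the archive's top-level directory, so the sources land directly in /usr/local/nutch

# mkdir -p /usr/local/nutch
# tar -zxvf apache-nutch-2.2.1-src.tar.gz -C /usr/local/nutch/ --strip-components=1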

11. Enable MySQL support in Nutch by editing ${APACHE_NUTCH_HOME}/ivy/ivy.xml

${APACHE_NUTCH_HOME} is the directory Nutch was extracted to, e.g. /usr/local/nutch; substitute your own path.

(1) Uncomment the following line
    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
(2) Change the following line
    <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
    <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
(3) Uncomment the following line
    <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />


12. Configure the database connection

# vi ${APACHE_NUTCH_HOME}/conf/gora.properties
  • Comment out the default store settings
###############################
# Default SqlStore properties #
###############################
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=
  • Uncomment the MySQL settings and adjust the database name, user name, and password to your setup (replace xxxx with your MySQL user name and password)
###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxx
gora.sqlstore.jdbc.password=xxxx
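
Because the file holds both a commented-out and an active copy of the same keys, it is easy to leave the wrong one enabled. The following sketch shows one way to check which value is actually in effect. `get_prop` is a hypothetical helper written for this illustration, and it assumes simple `key=value` lines without line continuations.

```shell
# get_prop KEY FILE
# Print the value of the last active (uncommented) KEY=... line in FILE.
get_prop() {
  awk -v k="$1" '
    /^[ \t]*#/ { next }                  # skip commented-out lines
    {
      eq = index($0, "=")                # split on the first "=" only
      if (eq == 0) next
      key = substr($0, 1, eq - 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim whitespace around the key
      if (key == k) val = substr($0, eq + 1)
    }
    END { if (val != "") print val }
  ' "$2"
}
```

Running `get_prop gora.sqlstore.jdbc.driver ${APACHE_NUTCH_HOME}/conf/gora.properties` should then print com.mysql.jdbc.Driver if the MySQL block is the active one.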


13. Add the following inside the <configuration> element of ${APACHE_NUTCH_HOME}/conf/nutch-site.xml:

<property>
    <name>http.agent.name</name>
    <value>Spider</value>
</property>

<property>
    <name>http.accept.language</name>
    <value>ja-jp,zh-cn,en-us,en-gb,en;q=0.7,*;q=0.3</value>
    <description>Value of the Accept-Language request header field.
        This allows selecting a non-English language as the default one to retrieve.
        It is a useful setting for search engines built for a certain national group.
    </description>
</property>

<property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
    <description>The Gora DataStore class for storing and retrieving data.
        Currently the following stores are available:.
    </description>
</property>

<property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
    <description>The character encoding to fall back to when no other information
        is available
    </description>
</property>

<property>
    <name>generate.batch.id</name>
    <value>*</value>
</property>


14. Return to ${APACHE_NUTCH_HOME} and build with Ant (this may take a while)

# cd /usr/local/nutch
# ant runtime

15. Create the table manually (run this in the nutch database named in the JDBC URL)

CREATE TABLE `webpage` (
    `id` varchar(767) NOT NULL,
    `headers` blob,
    `text` longtext DEFAULT NULL,
    `status` int(11) DEFAULT NULL,
    `markers` blob,
    `parseStatus` blob,
    `modifiedTime` bigint(20) DEFAULT NULL,
    `prevModifiedTime` bigint(20) DEFAULT NULL,
    `score` float DEFAULT NULL,
    `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
    `batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
    `baseUrl` varchar(767) DEFAULT NULL,
    `content` longblob,
    `title` varchar(2048) DEFAULT NULL,
    `reprUrl` varchar(767) DEFAULT NULL,
    `fetchInterval` int(11) DEFAULT NULL,
    `prevFetchTime` bigint(20) DEFAULT NULL,
    `inlinks` mediumblob,
    `prevSignature` blob,
    `outlinks` mediumblob,
    `fetchTime` bigint(20) DEFAULT NULL,
    `retriesSinceFetch` int(11) DEFAULT NULL,
    `protocolStatus` blob,
    `signature` blob,
    `metadata` blob,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

16. Prepare a seed list so that a Nutch crawl can be run manually to verify the setup

# cd ${APACHE_NUTCH_HOME}/runtime/local
# mkdir -p urls
# echo 'http://bbs.gxbs.net' > urls/seed.txt
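
The step above only creates the seed list; the crawl itself still has to be started. The sketch below chains the individual Nutch 2.x steps (inject, generate, fetch, parse, updatedb) into one round. The subcommand names and flags are given as I recall them for Nutch 2.2.1 and should be checked against the usage text printed by running `bin/nutch` without arguments. `crawl_round` and the `NUTCH` variable are hypothetical names introduced for this example.

```shell
# crawl_round SEED_DIR [TOPN]
# Run one inject/generate/fetch/parse/updatedb round.
# NUTCH selects the launcher; it defaults to bin/nutch, so call this
# from ${APACHE_NUTCH_HOME}/runtime/local.
crawl_round() {
  nutch=${NUTCH:-bin/nutch}
  seeds=$1
  topn=${2:-10}
  # Each step runs only if the previous one succeeded.
  $nutch inject "$seeds" &&
  $nutch generate -topN "$topn" &&
  $nutch fetch -all &&
  $nutch parse -all &&
  $nutch updatedb
}
```

Calling `crawl_round urls 10` from runtime/local runs one round against the seed list. If it works, `SELECT COUNT(*) FROM webpage;` in MySQL should report a non-zero number of rows. Setting `NUTCH=echo` first prints the commands instead of running them.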

Installing Solr


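17. Download Solr 4.2 and extract it (http://lucene.apache.org/solr/). With --strip-components=1, the example directory ends up at /usr/local/solr/example

# mkdir -p /usr/local/solr
# tar -zxvf solr-4.2.0.tgz -C /usr/local/solr/ --strip-components=1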

18. Start the Jetty server bundled with Solr

# cd /usr/local/solr/example
# java -jar start.jar

19. Open the following URL in a browser to check that Solr loaded successfully

http://localhost:8983/solr

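20. Manually run the Nutch command that indexes the crawled pages into Solr

# cd ${APACHE_NUTCH_HOME}/runtime/local
# bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

21. Open a browser, select the Solr collection, and run a query; if the pages stored in the database are returned, the setup works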
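
The same check can be done from the command line instead of the browser. The sketch below only builds a match-all query URL for Solr's select handler. It assumes the stock Solr 4.2 example setup, where requests to /solr/select are served by the default core, and `solr_select_url` is a hypothetical helper name.

```shell
# solr_select_url [BASE_URL] [ROWS]
# Print a match-all (q=*:*) query URL that returns ROWS documents as JSON.
solr_select_url() {
  base=${1:-http://localhost:8983/solr}
  rows=${2:-5}
  # ${base%/} removes one trailing slash so the path is not doubled.
  printf '%s/select?q=*:*&wt=json&rows=%s\n' "${base%/}" "$rows"
}

# With Solr running, fetch the first few indexed documents:
#   curl -s "$(solr_select_url)"
```

A response whose `numFound` value is greater than zero indicates that the pages crawled by Nutch reached the index.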


Installing the PHP Solr extension


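22. Download the PHP Solr extension (http://pecl.php.net/package/solr)

23. Extract and build the Solr extension

# mkdir -p /usr/local/php-solr
# tar -zxvf solr-2.4.0.tgz -C /usr/local/php-solr --strip-components=1
# cd /usr/local/php-solr
# phpize
# ./configure
# make && make install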

24. Add the following line to php.ini

extension=solr.so
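
To confirm that PHP picked up the extension, the module list printed by `php -m` can be searched for solr. The sketch below wraps that check in a small filter. `has_php_module` is a hypothetical helper name, and the check assumes that `php -m` prints one module name per line.

```shell
# has_php_module NAME
# Read a module list (one name per line) on stdin and succeed
# if NAME is in it, ignoring case.
has_php_module() {
  grep -qixF -- "$1"
}

# With PHP installed:
#   php -m | has_php_module solr && echo "solr extension loaded"
```

If the check fails for a web server setup, keep in mind that the command-line PHP and the web server may read different php.ini files.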
