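The previous post covered setting up a Nutch development environment on Windows 10 with Cygwin. This post walks through setting up a Nutch 2.3 development environment on Ubuntu.
Basic requirements: plenty of installation guides are available online, so install these yourself; leave a comment if you run into problems.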
useradd kandy
passwd kandy
vi /etc/sudoers
Add the following line:
kandy ALL=(ALL) ALL
vi /etc/hosts
Add the following entry:
127.0.0.1 localhost
Generate an SSH key pair with the following command:
ssh-keygen -t rsa
Authorize the public key for password-free login with the following command:
cp .ssh/id_rsa.pub .ssh/authorized_keys
Verify the login with the following command:
ssh localhost
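The cp command above overwrites any existing authorized_keys file. As a hedged alternative, the sketch below appends the public key only when it is missing and then sets the directory and file permissions that sshd normally expects. The function name install_pubkey and its two parameters are illustrative and not part of the original post.

```shell
#!/bin/sh
# Append a public key to authorized_keys only if it is not already there,
# then restrict permissions. Paths are arguments, so the function can be
# tried on a scratch directory before touching the real ~/.ssh.
install_pubkey() {
    pubkey_file=$1   # e.g. "$HOME/.ssh/id_rsa.pub"
    ssh_dir=$2       # e.g. "$HOME/.ssh"
    auth_file="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir"
    touch "$auth_file"
    # -x: whole-line match, -F: fixed strings, -f: read patterns from file
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"
}

# Usage (assumed paths):
#   install_pubkey "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh"
```

Running the function twice leaves a single copy of the key, so it is safe to re-run after regenerating or adding keys.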
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
sudo chmod 777 -R hadoop
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-data/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
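A typo in one of these XML files tends to surface only as a confusing startup failure, so it can help to read the values back before starting Hadoop. The sketch below is a line-oriented check, not a real XML parser: it assumes each name and value element sits on its own line and that the files are well-formed. The function name get_prop and the file path in the usage comment (Hadoop 1.x keeps these files under conf/) are assumptions.

```shell
#!/bin/sh
# Print the <value> that follows a given <name> in a Hadoop-style XML file.
get_prop() {
    file=$1
    name=$2
    awk -v want="$name" '
        index($0, "<name>" want "</name>") { found = 1; next }
        found && match($0, /<value>[^<]*<\/value>/) {
            # drop the 7-character "<value>" prefix and 8-character "</value>" suffix
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }
    ' "$file"
}

# Usage (assumed path):
#   get_prop /usr/local/hadoop/conf/core-site.xml fs.default.name
```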
export HADOOP_PREFIX=/usr/local/hadoop/
export PATH=${HADOOP_PREFIX}/bin/:${PATH}
hadoop namenode -format
start-all.sh
hadoop fs -ls /
http://localhost:50060/tasktracker.jsp
http://localhost:50030/jobtracker.jsp
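Besides the web pages above, the JDK's jps tool lists running Java processes, which gives a quick way to see whether start-all.sh brought up every daemon. The sketch below reads jps output on standard input and reports which of the five Hadoop 1.x daemons are absent. The function name missing_daemons is illustrative.

```shell
#!/bin/sh
# Read `jps` output on stdin and print each Hadoop 1.x daemon that is absent.
# Returns 0 when all five daemons are present, 1 otherwise.
missing_daemons() {
    jps_output=$(cat)
    missing=0
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # jps prints "<pid> <main class>"; compare the second field exactly,
        # so "NameNode" does not match "SecondaryNameNode"
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
            echo "not running: $daemon"
            missing=1
        fi
    done
    return $missing
}

# Usage:
#   jps | missing_daemons
```

If a daemon is reported as not running, its log file under the Hadoop logs directory is usually the first place to look.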
sudo chmod 777 -R hbase
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
Also enable HBase's managed ZooKeeper:
export HBASE_MANAGES_ZK=true
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/data/hbase/zookeeper</value>
  </property>
</configuration>
cp /usr/local/hadoop/hadoop-core-1.2.1.jar ./lib/
./bin/start-hbase.sh
Run ./bin/hbase shell to open the HBase shell, then run list. The output looks like this:
hbase(main):002:0> list
TABLE
0 row(s) in 0.0170 seconds
hbase(main):003:0>
http://localhost:60010/master-status
sudo chmod 777 -R nutch
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
</configuration>
Change the version of the hadoop-core and hadoop-test dependencies from 1.2.0 to 1.2.1.
Uncomment the gora-hbase dependency, as follows:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
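The version change for hadoop-core and hadoop-test can also be scripted rather than edited by hand. The sketch below is one possible approach under two assumptions: the dependencies are declared in Nutch's ivy/ivy.xml, and each dependency's name and rev attributes sit on the same line. The function name bump_hadoop_rev is illustrative.

```shell
#!/bin/sh
# Rewrite rev="1.2.0" to rev="1.2.1" on the hadoop-core and hadoop-test
# dependency lines only; every other line is passed through unchanged.
bump_hadoop_rev() {
    ivy_file=$1   # e.g. /usr/local/nutch/ivy/ivy.xml (assumed location)
    tmp_file=$(mktemp) || return 1
    # Write to a temp file first so a failed sed never truncates the original.
    if sed -e '/name="hadoop-core"/ s/rev="1\.2\.0"/rev="1.2.1"/' \
           -e '/name="hadoop-test"/ s/rev="1\.2\.0"/rev="1.2.1"/' \
           "$ivy_file" > "$tmp_file"; then
        # Copy the contents back so the original file keeps its permissions.
        cat "$tmp_file" > "$ivy_file"
    fi
    rm -f "$tmp_file"
}

# Usage (assumed path):
#   bump_hadoop_rev /usr/local/nutch/ivy/ivy.xml
```

Keeping a backup copy of the file, or checking the result with diff, is a sensible precaution before rebuilding.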
sudo chmod 777 -R solr
cp /usr/local/nutch/runtime/local/conf/schema.xml solr/collection1/conf/schema.xml
java -jar start.jar
Go to /usr/local/nutch/runtime/local, create a urls directory, and inside it create a url.txt file containing the seed URLs, for example:
http://www.cnbeta.com
./bin/crawl urls TestCrawl http://localhost:8983/solr 2
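The seed file can be created from the shell as well. The sketch below writes one URL per line into url.txt inside a seed directory; the function name make_seed_dir is illustrative. The comment on the crawl arguments reflects the usage message of the bin/crawl script as I understand it for Nutch 2.x, so treat it as an assumption and check it against your copy of the script.

```shell
#!/bin/sh
# Create a seed directory containing url.txt, one URL per line.
make_seed_dir() {
    seed_dir=$1
    shift
    # Refuse to write an empty seed list.
    [ "$#" -ge 1 ] || return 1
    mkdir -p "$seed_dir"
    printf '%s\n' "$@" > "$seed_dir/url.txt"
}

# Usage, from /usr/local/nutch/runtime/local:
#   make_seed_dir urls http://www.cnbeta.com
#   ./bin/crawl urls TestCrawl http://localhost:8983/solr 2
# Assumed meaning of the four crawl arguments: seed directory, crawl ID,
# Solr URL, and number of crawl rounds.
```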