After analyzing Nutch's configuration file nutch-default.xml and reading part of the source code, we now understand Nutch's plugin mechanism and how to filter the crawled data by modifying the files under conf. By default, URL filtering is implemented by the RegexURLFilter class, whose rules live in regex-urlfilter.txt. Without any changes to that file, Nutch filters out URLs ending in gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS, filters out URLs containing the characters ?*!@=, filters out URLs in which the same path segment /SameSomething/ repeats three times, and accepts everything else. Taking http://hadoop.apache.org as the seed URL, we look at two cases: the default crawl, and a crawl that fetches only URLs containing hadoop.
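For reference, the rules in the default regex-urlfilter.txt that implement the behaviour just described look roughly like the following (comments abridged; the exact wording differs slightly between Nutch releases, so verify against your own conf directory):

# skip URLs with image/archive/binary suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters, which are usually queries or session ids
-[?*!@=]

# skip URLs in which a slash-delimited segment repeats three or more times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.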
Consider the first case, in which regex-urlfilter.txt is left untouched. The command and results are shown below; the scan output shows that 38 records were crawled in total.
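As a reminder of what the command below does, its arguments follow the Nutch 2.x crawl script bundled in the deploy directory (argument names here are descriptive only; the script itself may differ slightly between releases):

# bin/crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
#   urls                        - directory containing the seed list (here http://hadoop.apache.org)
#   hadoop                      - crawl ID; with the HBase/Gora backend the data lands in table <crawlID>_webpage
#   http://localhost:8983/solr/ - Solr instance used for indexing
#   1                           - number of fetch rounds
bin/crawl urls hadoop http://localhost:8983/solr/ 1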
[hadoop@hadoop deploy]$ bin/crawl urls hadoop http://localhost:8983/solr/ 1

hbase(main):012:0> scan 'hadoop_webpage', {COLUMNS=>'f:ts'}
ROW                                               COLUMN+CELL
 com.apachecon.eu.www:http/c/aceu2009/            column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1D1
 com.apachecon.us:http/c/acus2008/                column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Ft
 com.cafepress.www:http/hadoop/                   column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fw
 com.yahoo.developer:http/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html  column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV\x1D2
 org.apache.avro:http/                            column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fw
 org.apache.cassandra:http/                       column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fx
 org.apache.forrest:http/                         column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV\x1F\xAC
 org.apache.hadoop:http/                          column=f:ts, timestamp=1388471264808, value=\x00\x00\x01C\xE1\xD1\xB2\xFC
 org.apache.hadoop:http/bylaws.html               column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fy
 org.apache.hadoop:http/docs/current/             column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x06
 org.apache.hadoop:http/docs/r0.23.10/            column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fy
 org.apache.hadoop:http/docs/r1.2.1/              column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1Fz
 org.apache.hadoop:http/docs/r2.1.1-beta/         column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x07
 org.apache.hadoop:http/docs/r2.2.0/              column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F{
 org.apache.hadoop:http/docs/stable/              column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x07
 org.apache.hadoop:http/index.html                column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F|
 org.apache.hadoop:http/index.pdf                 column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x08
 org.apache.hadoop:http/issue_tracking.html       column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x08
 org.apache.hadoop:http/mailing_lists.html        column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x08
 org.apache.hadoop:http/privacy_policy.html       column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x09
 org.apache.hadoop:http/releases.html             column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F|
 org.apache.hadoop:http/who.html                  column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F}
 org.apache.hbase:http/                           column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x09
 org.apache.hive:http/                            column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F~
 org.apache.incubator:http/ambari/                column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x7F
 org.apache.incubator:http/chukwa/                column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x09
 org.apache.incubator:http/hama/                  column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x0A
 org.apache.mahout:http/                          column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x7F
 org.apache.pig:http/                             column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x82
 org.apache.wiki:http/hadoop                      column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x0A
 org.apache.wiki:http/hadoop/PoweredBy            column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x0A
 org.apache.www:http/                             column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x12
 org.apache.www:http/foundation/sponsorship.html  column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x83
 org.apache.www:http/foundation/thanks.html       column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x12
 org.apache.www:http/licenses/                    column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x83
 org.apache.zookeeper:http/                       column=f:ts, timestamp=1388471264814, value=\x00\x00\x01CGV\x1F\x85
 org.sortbenchmark:http/                          column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x13
 uk.co.guardian.www:http/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop  column=f:ts, timestamp=1388471264808, value=\x00\x00\x01CGV%\x14
38 row(s) in 0.2590 seconds
In the second case, regex-urlfilter.txt is modified by changing its last line to +^http://.*hadoop.*, so that only URLs containing hadoop are fetched; a sketch of the modified rule is shown below, followed by the crawl output. The result contains only 20 rows, and every rowkey corresponds to a URL containing hadoop.
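The tail of the modified regex-urlfilter.txt would then look roughly like this (only the final accept rule changes; all the deny rules above it stay as they are):

# ...the default deny rules remain unchanged...

# accept only URLs that contain "hadoop" (replaces the default "+." rule)
+^http://.*hadoop.*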
[hadoop@hadoop deploy]$ bin/crawl urls hadoopWithFilter http://localhost:8983/solr/ 1

hbase(main):016:0> scan 'hadoopWithFilter_webpage', {COLUMNS=>'f:ts'}
ROW                                               COLUMN+CELL
 com.cafepress.www:http/hadoop/                   column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtGl
 com.yahoo.developer:http/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html  column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtG\x88
 org.apache.hadoop:http/                          column=f:ts, timestamp=1388473240778, value=\x00\x00\x01C\xE1\xF0\xEB\xCB
 org.apache.hadoop:http/bylaws.html               column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHv
 org.apache.hadoop:http/docs/current/             column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA7
 org.apache.hadoop:http/docs/r0.23.10/            column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHw
 org.apache.hadoop:http/docs/r1.2.1/              column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHw
 org.apache.hadoop:http/docs/r2.1.1-beta/         column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA8
 org.apache.hadoop:http/docs/r2.2.0/              column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHx
 org.apache.hadoop:http/docs/stable/              column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA8
 org.apache.hadoop:http/index.html                column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHx
 org.apache.hadoop:http/index.pdf                 column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA8
 org.apache.hadoop:http/issue_tracking.html       column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA9
 org.apache.hadoop:http/mailing_lists.html        column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA9
 org.apache.hadoop:http/privacy_policy.html       column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xA9
 org.apache.hadoop:http/releases.html             column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHy
 org.apache.hadoop:http/who.html                  column=f:ts, timestamp=1388473239724, value=\x00\x00\x01CGtHz
 org.apache.wiki:http/hadoop                      column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xAA
 org.apache.wiki:http/hadoop/PoweredBy            column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xAA
 uk.co.guardian.www:http/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop  column=f:ts, timestamp=1388473240778, value=\x00\x00\x01CGtL\xAB
20 row(s) in 0.3090 seconds
The results above show that by adjusting the regular expressions in regex-urlfilter.txt we can customize which URLs are crawled and fetch only the content we are interested in.
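Before launching a full crawl it can be worth checking the regular expressions interactively. Nutch ships a URLFilterChecker utility class for this purpose; a possible invocation is sketched below (the available options vary by Nutch version, so treat the flags as an assumption to verify against your own installation):

# feed candidate URLs on stdin; the checker prints +url for accepted and -url for rejected URLs
echo "http://hadoop.apache.org/docs/stable/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
echo "http://www.apache.org/licenses/"       | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined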