A Quick Guide to bddbot, a Well-Known Open-Source Java Search Engine: Test Report

<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style=""><span style="font-family: Calibri;">I.</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Compilation</span></p>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">1.<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Install the JDK (Java Development Kit). Setting the environment variables is the fiddly part of this step (I used JDK 1.6.0_13). Under System Properties -&gt; Advanced -&gt; Environment Variables, set the following three variables, creating any that do not already exist:</span></p>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">1)<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">JAVA_HOME</span><span style="" lang="EN-US">, value: D:\Program Files\Java\jdk1.6.0_13 <span style="color: #92d050;">// JAVA_HOME holds a single path, so no trailing ";" is needed; a trailing ";" would also break the %JAVA_HOME% expansions below.</span></span></p>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">2)<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">ClassPath</span><span style="" lang="EN-US">, value: .;%JAVA_HOME%\lib\tools.jar;</span></p>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">3)<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Path</span><span style="" lang="EN-US">, add the value: %JAVA_HOME%\bin;</span></p>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">2.<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Extract <span style="font-family: Calibri;">bddbot.zip</span> into a <span style="font-family: Calibri;">bddbot</span> directory (<span style="font-family: Calibri;">bddbot</span> is the root directory; if you put it on drive <span style="font-family: Calibri;">E</span>, that is <span style="font-family: Calibri;">E:\bddbot</span>). The <span style="font-family: Calibri;">bddbot</span> directory contains two subdirectories, <span style="font-family: Calibri;">bdd</span> and <span style="font-family: Calibri;">searchdb</span>.</span></p>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">3.<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Edit <span style="font-family: Calibri;">bdd/search/EnginePrefs.java</span>: <span style="font-family: Calibri;">String email_address = "[email protected]"; // </span>change this to your own e-mail address</span></p>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">4.<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Open Start <span style="font-family: Calibri;">-&gt;</span> Run, type <span style="font-family: Calibri;">cmd</span>, and press Enter. In the command prompt, change to the <span style="font-family: Calibri;">bddbot</span> directory, then run <span style="font-family: Calibri;">javac bdd\search\EnginePrefs.java</span>. (All the other classes ship already compiled; if any are not, run <span style="font-family: Calibri;">javac *.java</span> in the corresponding folder.)</span></p>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">5.<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">The two files in the <span style="font-family: Calibri;">searchdb</span> folder, <span style="font-family: Calibri;">rules.txt</span> and <span style="font-family: Calibri;">urls.txt</span>, do what their names suggest.<br><span style="font-family: Calibri;">rules.txt</span> constrains which URLs are crawled and supports two directives, <span style="font-family: Calibri;">include</span> and <span style="font-family: Calibri;">exclude</span>. For example, <span style="font-family: Calibri;">include <a href="http://grs.pku.edu.cn/zs/"><span style="">http://grs.pku.edu.cn/zs/</span></a></span> downloads every page whose URL begins with <a href="http://grs.pku.edu.cn/zs/"><span style=""><span style="font-family: Calibri;">http://grs.pku.edu.cn/zs/</span></span></a>. <span style="font-family: Calibri;">urls.txt</span> lists the seed URLs for the crawl, one per line. The system is weak in this area: it works best with pages whose URLs end in <span style="font-family: Calibri;">.html</span> or <span style="font-family: Calibri;">.htm</span> (that is, pages addressed by their full file name). A line starting with <span style="font-family: Calibri;">#</span> is a comment and has no effect.</span></p>
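<p class="MsoNormal" style="" align="left">To make the include/exclude idea concrete, here is a minimal sketch of a prefix-based URL filter. It is not bddbot's own code: the class name <span style="font-family: Calibri;">RuleFilter</span> is hypothetical, and the semantics it assumes (an exclude prefix overrides an include prefix, and a URL that matches no include rule is rejected) are a guess based on the description above.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical prefix filter illustrating rules.txt; not taken from bddbot. */
final class RuleFilter {
    private final List<String> includes = new ArrayList<String>();
    private final List<String> excludes = new ArrayList<String>();

    /** Parses lines such as "include http://grs.pku.edu.cn/zs/"; '#' lines are comments. */
    RuleFilter(List<String> ruleLines) {
        for (String raw : ruleLines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // directive without a URL prefix: ignore it
            }
            if (parts[0].equalsIgnoreCase("include")) {
                includes.add(parts[1]);
            } else if (parts[0].equalsIgnoreCase("exclude")) {
                excludes.add(parts[1]);
            }
        }
    }

    /** Assumed semantics: exclude wins; otherwise the URL must start with an include prefix. */
    boolean accepts(String url) {
        for (String prefix : excludes) {
            if (url.startsWith(prefix)) {
                return false;
            }
        }
        for (String prefix : includes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> rules = new ArrayList<String>();
        rules.add("# comment line, ignored");
        rules.add("include http://grs.pku.edu.cn/zs/");
        RuleFilter filter = new RuleFilter(rules);
        System.out.println(filter.accepts("http://grs.pku.edu.cn/zs/zs_news.html")); // true
        System.out.println(filter.accepts("http://www.pku.edu.cn/index.html"));      // false
    }
}
```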
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style=""><span style="font-family: Calibri;">II.</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Crawling</span></p>
<p class="MsoNormal" style=""><span style="" lang="EN-US"><span style=""><span style="font-size: small; font-family: Calibri;">1.</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="font-size: small;"><span lang="EN-US">Set the contents of <span style="font-family: Calibri;">rules.txt</span> to <span style="font-family: Calibri;">include </span></span></span><span lang="EN-US"><a href="http://grs.pku.edu.cn/zs/"><span style=""><span style="font-family: Calibri;">http://grs.pku.edu.cn/zs/</span></span></a></span></p>
<p class="MsoNormal" style=""><span style="" lang="EN-US"><span style=""><span style="font-size: small; font-family: Calibri;">2.</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Set the contents of <span style="font-family: Calibri;">urls.txt</span> to <span style="font-family: Calibri;">http://grs.pku.edu.cn/zs/zs_news.html</span></span></p>
<p class="MsoNormal" style=""><span style="" lang="EN-US"><span style=""><span style="font-size: small; font-family: Calibri;">3.</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">In the command prompt, run <span style="font-family: Calibri;">java bdd.search.Monitor</span> to open the graphical interface. (Note that the current directory must be <span style="font-family: Calibri;">bddbot</span>.)</span></p>
<p>
</p>
<table class="MsoNormalTable" style="margin: auto auto auto 21pt; border-collapse: collapse;" border="0" cellspacing="0" cellpadding="0"><tbody>
<tr style="height: 276.2pt;">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 252.8pt; padding-top: 0cm; height: 276.2pt; background-color: transparent; border: #ece9d8;" width="337" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="" lang="EN-US"><img src="http://p.blog.csdn.net/images/p_blog_csdn_net/felomeng/EntryImages/20090510/%E6%9C%AA%E5%91%BD%E5%90%8D.JPG" alt="" width="328" height="376"></span></p>
</td>
</tr>
<tr style="height: 15.3pt;">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 252.8pt; padding-top: 0cm; height: 15.3pt; background-color: transparent; border: #ece9d8;" width="337" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span style="font-size: small;"><span lang="EN-US">Figure <span style="font-family: Calibri;">1</span>: Main window</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span style="" lang="EN-US"><span style="font-size: small;">In this window:</span></span></p>
<p class="MsoNormal" style=""><span style="" lang="EN-US"><span style=""><span style="font-size: small; font-family: Calibri;">1)</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="font-size: small;"><span lang="EN-US"><span style="font-family: Calibri;">Queries</span> records the keywords used in searches; <span style="font-family: Calibri;">Current Url</span> is the page currently being processed;</span></span></p>
<p class="MsoNormal" style=""><span style="" lang="EN-US"><span style=""><span style="font-size: small; font-family: Calibri;">2)</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="font-size: small;"><span lang="EN-US"><span style="font-family: Calibri;">Total Bytes</span> is the amount of data downloaded so far;</span></span></p>
<p class="MsoNormal" style=""><span style="" lang="EN-US"><span style=""><span style="font-size: small; font-family: Calibri;">3)</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="font-size: small;"><span lang="EN-US"><span style="font-family: Calibri;">Processed</span> lists the URLs of pages that have already been processed;</span></span></p>
<p class="MsoNormal" style=""><span style="" lang="EN-US"><span style=""><span style="font-size: small; font-family: Calibri;">4)</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="font-size: small;"><span lang="EN-US"><span style="font-family: Calibri;">Errors</span> lists the URLs of pages that failed; the command window shows the detailed error log.</span></span></p>
</td>
</tr>
</tbody></table>
<p class="MsoNormal" style=""><span style="" lang="EN-US"><span style=""><span style="font-size: small; font-family: Calibri;">4.</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Click <span style="font-family: Calibri;">start crawler</span>. The button becomes disabled and crawling begins; when the crawl finishes, the button is enabled again.</span></p>
<p>
</p>
<table class="MsoNormalTable" style="margin: auto auto auto 21pt; border-collapse: collapse;" border="0" cellspacing="0" cellpadding="0"><tbody>
<tr style="height: 244.25pt;">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 227.8pt; padding-top: 0cm; height: 244.25pt; background-color: transparent; border: #ece9d8;" width="304" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="" lang="EN-US"><img src="http://p.blog.csdn.net/images/p_blog_csdn_net/felomeng/EntryImages/20090510/%E6%9C%AA%E5%91%BD%E5%90%8D2.JPG" alt="" width="329" height="378"></span></p>
</td>
</tr>
<tr style="height: 13.55pt;">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 227.8pt; padding-top: 0cm; height: 13.55pt; background-color: transparent; border: #ece9d8;" width="304" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span style="font-size: small;"><span lang="EN-US">Figure <span style="font-family: Calibri;">2</span>: Crawl in progress</span></span></p>
</td>
</tr>
</tbody></table>
<p>
</p>
<table class="MsoNormalTable" style="margin: auto auto auto 21pt; border-collapse: collapse;" border="0" cellspacing="0" cellpadding="0"><tbody>
<tr style="">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 254.7pt; padding-top: 0cm; background-color: transparent; border: #ece9d8;" width="340" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="" lang="EN-US"><span style="font-size: small; font-family: Calibri;"><img src="http://p.blog.csdn.net/images/p_blog_csdn_net/felomeng/EntryImages/20090510/%E6%9C%AA%E5%91%BD%E5%90%8D3.JPG" alt="" width="328" height="376"></span></span></p>
</td>
</tr>
<tr style="">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 254.7pt; padding-top: 0cm; background-color: transparent; border: #ece9d8;" width="340" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span style="font-size: small;"><span lang="EN-US">Figure <span style="font-family: Calibri;">3</span>: Crawl complete</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span style="font-size: small;"><span lang="EN-US">The entries recorded under <span style="font-family: Calibri;">Queries</span> were Chinese text but are displayed as garbled characters, so support for Chinese is incomplete. The error log in the command window reads:</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span lang="EN-US"><span style="font-size: small; font-family: Calibri;">java.net.MalformedURLException: unknown protocol: javascript</span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>at java.net.URL.&lt;init&gt;(URL.java:574)</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>at java.net.URL.&lt;init&gt;(URL.java:464)</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>at bdd.search.spider.HTMLLinkExtractor.analyzeAnchor(HTMLLinkExtractor.java:76)</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>at bdd.search.spider.HTMLLinkExtractor.analyze(HTMLLinkExtractor.java:63)</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>at bdd.search.spider.HTMLLinkExtractor.&lt;init&gt;(HTMLLinkExtractor.java:43)</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>at bdd.search.spider.URLStatus.getLinkExtractor(URLStatus.java:152)</span></span></span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span lang="EN-US"><span style="font-size: small;"><span style="font-family: Calibri;"><span style=""> </span>at bdd.search.spider.Indexer.run(Indexer.java:93)</span></span></span></p>
</td>
</tr>
</tbody></table>
<p class="MsoNormal" style="" align="left"><span style="font-size: 12pt;" lang="EN-US"><span style=""><span style="font-family: Calibri;">5.</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">After this crawl, the main index file (<span style="font-family: Calibri;">E:\bddbot\searchdb\main.db</span>) was <span style="font-family: Calibri;">1.34 MB</span>, and the whole run took about <span style="font-family: Calibri;">20</span> minutes.</span></p>
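<p class="MsoNormal" style="" align="left">The <span style="font-family: Calibri;">MalformedURLException</span> in the error log above is what <span style="font-family: Calibri;">java.net.URL</span> throws when a link such as <span style="font-family: Calibri;">href="javascript:..."</span> is passed to its constructor. One possible way to avoid it, sketched below, is to check the scheme of each link before building a URL from it. The class <span style="font-family: Calibri;">LinkFilter</span> and its method are hypothetical and are not part of bddbot; treating only http and https as crawlable is likewise an assumption.</p>

```java
import java.util.Locale;

/** Hypothetical pre-check for extracted links; not part of bddbot. */
final class LinkFilter {
    /**
     * Returns true for relative links and for http/https links,
     * false for other schemes such as javascript: or mailto:.
     */
    static boolean isCrawlable(String href) {
        String h = href.trim().toLowerCase(Locale.ROOT);
        int colon = h.indexOf(':');
        if (colon < 0) {
            return true; // no colon at all, so this is a relative link
        }
        String scheme = h.substring(0, colon);
        // A URI scheme is a letter followed by letters, digits, '+', '.' or '-'.
        if (!scheme.matches("[a-z][a-z0-9+.-]*")) {
            return true; // the colon belongs to a path or query, not a scheme
        }
        return scheme.equals("http") || scheme.equals("https");
    }

    public static void main(String[] args) {
        System.out.println(isCrawlable("zs_news.html"));                   // true
        System.out.println(isCrawlable("http://grs.pku.edu.cn/zs/a.htm")); // true
        System.out.println(isCrawlable("javascript:void(0)"));            // false
    }
}
```

<p class="MsoNormal" style="" align="left">A link extractor could call such a check first and skip any link for which it returns false, so that those links never reach the <span style="font-family: Calibri;">URL</span> constructor.</p>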
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style=""><span style="font-family: Calibri;">III.</span><span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Search test</span></p>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">1.<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Search form:<br><span style="font-family: Calibri;">&lt;form action="</span></span><span lang="EN-US"><a href="http://222.29.124.166:8001/query"><span style=""><span style="font-family: Calibri;">http://222.29.124.166:8001/query</span></span></a></span><span style="" lang="EN-US"><span style="font-family: Calibri;">" method="GET"&gt;<br>&lt;input type="text" name="words" value="" size="45"&gt;<br>&lt;input type="submit" value="Search"&gt;<br>&lt;/form&gt;<br></span></span><span style="" lang="EN-US">Save the markup above as an HTML file, replacing <span style="text-decoration: underline;"><span style="font-family: Calibri;">222.29.124.166</span></span> with the address of your own machine. For local testing on Windows you can usually just use <span style="font-family: Calibri;">localhost</span>. (You can of course add further elements to the page on top of this.) The result looks like this:</span></p>
<p>
</p>
<table class="MsoNormalTable" style="margin: auto auto auto 21pt; border-collapse: collapse;" border="0" cellspacing="0" cellpadding="0"><tbody>
<tr style="height: 7.25pt;">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 401.4pt; padding-top: 0cm; height: 7.25pt; background-color: transparent; border: #ece9d8;" width="535" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt;"><span style="" lang="EN-US"><img src="http://p.blog.csdn.net/images/p_blog_csdn_net/felomeng/EntryImages/20090510/%E6%9C%AA%E5%91%BD%E5%90%8D4.JPG" alt="" width="423" height="62"></span></p>
</td>
</tr>
<tr style="height: 7.25pt;">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 401.4pt; padding-top: 0cm; height: 7.25pt; background-color: transparent; border: #ece9d8;" width="535" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span style="font-size: small;"><span lang="EN-US">Figure <span style="font-family: Calibri;">4</span>: Search page</span></span></p>
</td>
</tr>
</tbody></table>
<p class="MsoNormal" style="" align="left"><span style="" lang="EN-US"><span style="">2.<span style="font: 7pt 'Times New Roman';"> </span></span></span><span style="" lang="EN-US">Open the file in a browser and, with <span style="font-family: Calibri;">Monitor</span> running, enter keywords to search. (Note that at least part of the corpus must already have been crawled successfully.)</span></p>
<p>
</p>
<table class="MsoNormalTable" style="margin: auto auto auto 21pt; border-collapse: collapse;" border="0" cellspacing="0" cellpadding="0"><tbody>
<tr style="height: 702.15pt;">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 401.4pt; padding-top: 0cm; height: 702.15pt; background-color: transparent; border: #ece9d8;" width="535" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span style="" lang="EN-US"><img src="http://p.blog.csdn.net/images/p_blog_csdn_net/felomeng/EntryImages/20090510/%E6%9C%AA%E5%91%BD%E5%90%8D5.JPG" alt="" width="423" height="902"></span></p>
</td>
</tr>
<tr style="height: 16.25pt;">
<td style="padding-right: 5.4pt; padding-left: 5.4pt; padding-bottom: 0cm; width: 401.4pt; padding-top: 0cm; height: 16.25pt; background-color: transparent; border: #ece9d8;" width="535" valign="top">
<p class="MsoNormal" style="margin: 0cm 0cm 0pt 21pt;"><span style="font-size: small;"><span lang="EN-US">Figure <span style="font-family: Calibri;">5</span>: A search result</span></span></p>
</td>
</tr>
</tbody></table>
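<p class="MsoNormal" style="" align="left">Because the form submits with GET, the same search can be expressed as a plain URL. The sketch below only builds that URL; it does not contact the server. The port <span style="font-family: Calibri;">8001</span>, the path <span style="font-family: Calibri;">/query</span> and the parameter name <span style="font-family: Calibri;">words</span> come from the form above, while the class name <span style="font-family: Calibri;">QueryUrl</span> and the choice of UTF-8 encoding are assumptions. Given the garbled Chinese text noted earlier, the encoding the server actually expects may well differ.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the GET URL the search form would submit. */
final class QueryUrl {
    /** Builds http://HOST:8001/query?words=... with the keywords URL-encoded. */
    static String build(String host, String words) {
        try {
            // UTF-8 is an assumption; the source does not say which encoding bddbot expects.
            return "http://" + host + ":8001/query?words=" + URLEncoder.encode(words, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8, so this should not happen.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        // prints http://localhost:8001/query?words=pku+news
        System.out.println(build("localhost", "pku news"));
    }
}
```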

<p class="MsoNormal" style="" align="left">Attachments: <a href="http://download.csdn.net/source/1293767">bddbot source code and documentation</a>, <a href="http://download.csdn.net/source/1293776">bddbot test report (usage guide)</a>, Word version.</p>