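[Project] A Site Search Engine Based on Boost

Contents

  • 1. Introduction
    • The big picture of a search engine
    • Tech stack and project environment
    • Forward index and inverted index
  • 2. Tag stripping and data cleaning module: Parser
    • Stripping tags: parser.cc
    • Code structure of parser.cc
      • EnumFile(): enumerate and filter HTML files
      • ParseHtml(): parse the HTML structure
      • SaveHtml(): save the tag-stripped documents
      • Testing
  • 3. Index building module: Index
    • Looking up the forward index
    • Looking up the inverted index
    • Building the indexes
      • Building the forward index
      • Building the inverted index
  • 4. Search module: Searcher
    • Initializing the search object: InitSearcher
    • The search function: Search
      • Installing the JSON library, with a usage example
      • Complete code of Search
    • Testing
  • 5. Setting up the server: the http_server module
    • Basic usage test of cpp-httplib
    • Writing the HttpServer module
  • 6. Front-end module
    • HTML page skeleton
    • CSS page styling
    • JavaScript for navigation
    • Overall result
  • 7. Back-end optimizations
    • Deduplicating search results
    • Removing stop words
  • Adding logging
  • Deploying the service
  • Possible extensions
  • Project code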



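1. Introduction

Well-known search engines include Baidu, Google, and Bing, and many everyday apps come with a search feature of their own.

Building a conventional, web-wide search engine single-handedly is clearly out of reach, but we can build a simple one that performs site search.

For example, the cplusplus website that we often use when learning C++ has a site search. Its results are more vertical (a narrower scope with stronger relevance), and the amount of data is much smaller.

The Boost documentation does not come with a site search, so we can build one ourselves.

For each result, the finished search engine will show the page title, an excerpt of the page content, and the URL.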

The big picture of a search engine

(Figure: overall architecture diagram of the search engine.)

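Tech stack and project environment

  • Tech stack:

    • Back end: C/C++, C++11, STL, Boost, Jsoncpp, cppjieba (word segmentation library), cpp-httplib (open-source library)
    • Front end: HTML5, CSS, JS, jQuery, Ajax
  • Project environment: CentOS 7 cloud server, vim/gcc (g++)/Makefile, VS2019/VS Code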

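Forward index and inverted index

Forward index: the process of looking up an entity by its key.

  • For example, finding a document's content from its name:

    Document name                        Document content
    XXX Company 2021 annual report       In 2021, XXX's total revenue…
    XXX Company 2021 product sales       In 2021, sales of product A…
  • For example, a user table:

    t_user(uid,name,passwd,age,gender)

    Looking up a whole row by uid is a forward index lookup.

  • For example, a web page store:

    t_web_page(url, page_content)

    Looking up a whole page by url is also a forward index lookup.

Word segmentation: once an entity's content has been segmented, it corresponds to a list of terms. A simple forward index can therefore be thought of as a Map from each key to its list of terms, Map<url, list<item>> (each key is unique).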

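  • As an example, suppose there are three pages:

    url1 -> "我爱北京" (I love Beijing)

    url2 -> "我爱宏伟的天安门" (I love the magnificent Tiananmen)

    url3 -> "长城真宏伟啊" (The Great Wall is truly magnificent)

    This is a forward index, Map<url, page_content>.

    After word segmentation:

    url1 -> {我, 爱, 北京}

    url2 -> {我, 爱, 宏伟, 天安门}

    url3 -> {长城, 宏伟}

    This is the forward index after segmentation, Map<url, list<item>>.

Stop words: words such as 了, 的, 吗, 啊, a, and the can usually be ignored during segmentation.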

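Inverted index: the process of looking up keys by an item that the entities contain.

  • For example, a web page store:

    Given a query term, quickly find the pages that contain it.

    The inverted index after segmentation:

    我 -> {url1, url2}

    爱 -> {url1, url2}

    北京 -> {url1}

    宏伟 -> {url2, url3}

    天安门 -> {url2}

    长城 -> {url3}

The Map<item, list<url>> that takes a query term (item) straight to the pages containing it is the inverted index.

Simulating one lookup

The user enters the keyword 宏伟 -> inverted index -> pages {url2, url3} -> forward index -> fetch the content of each page -> build a title + content + url result for each -> rank the results by weight when presenting them to the user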
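
To make the two structures concrete, here is a minimal C++ sketch that builds the inverted index from the segmented forward index of the three example pages. The type aliases and the function names (MakeExampleForwardIndex, BuildInvertedIndex, Lookup) are made up for this illustration and are not part of the project code. Segmentation is done by hand here rather than with cppjieba, and weights are left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Forward index after segmentation: url -> terms (stop words already dropped).
typedef std::map<std::string, std::vector<std::string> > ForwardIndex;
// Inverted index: term -> urls of the pages that contain the term.
typedef std::unordered_map<std::string, std::vector<std::string> > InvertedIndex;

// The three example pages from the text, segmented by hand (illustrative data).
ForwardIndex MakeExampleForwardIndex()
{
    ForwardIndex forward;
    forward["url1"] = {"我", "爱", "北京"};
    forward["url2"] = {"我", "爱", "宏伟", "天安门"};
    forward["url3"] = {"长城", "宏伟"};
    return forward;
}

// Invert the mapping: walk every (url, terms) pair and append the url
// to the list kept for each of its terms.
InvertedIndex BuildInvertedIndex(const ForwardIndex& forward)
{
    InvertedIndex inverted;
    for (const auto& page : forward)
    {
        for (const std::string& term : page.second)
        {
            inverted[term].push_back(page.first);
        }
    }
    return inverted;
}

// Return the urls whose page contains `term` (empty if the term is unknown).
std::vector<std::string> Lookup(const InvertedIndex& inverted, const std::string& term)
{
    InvertedIndex::const_iterator it = inverted.find(term);
    if (it == inverted.end()) return std::vector<std::string>();
    return it->second;
}
```

With this data, Lookup(BuildInvertedIndex(MakeExampleForwardIndex()), "宏伟") yields url2 and url3, which is the first step of the lookup walked through above. Each url can then be used as the key into the forward index to fetch the page itself.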

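2. Tag stripping and data cleaning module: Parser

Download the data source directly from the Boost website.

Log in to the cloud server, create a project directory, and upload the downloaded archive to the server with the rz command. Then extract it with tar.

For now we only need the HTML files under boost_1_79_0/doc/html, which are what we will index.

So create a data/input directory and copy everything under Boost's doc/html/ into it.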

[sjl@VM-16-6-centos boost_searcher]$ cp -rf boost_1_79_0/doc/html/* data/input/

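Stripping tags: parser.cc

Create the tag-stripping program:

[sjl@VM-16-6-centos boost_searcher]$ touch parser.cc
//raw data --> data with the tags stripped

In an HTML file, anything enclosed in <> is a tag. Tags have no value for searching, so they need to be removed.

<td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>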

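The tag-stripped HTML data will be stored in the raw_html directory.

[sjl@VM-16-6-centos data]$ mkdir raw_html
[sjl@VM-16-6-centos data]$ ll
total 16
drwxrwxr-x 58 sjl sjl 16384 Jul 19 16:37 input      //original HTML documents
drwxrwxr-x  2 sjl sjl  4096 Jul 19 20:37 raw_html   //HTML documents after tag stripping

We can check how many HTML files the data directory currently contains:

[sjl@VM-16-6-centos data]$ ls -Rl|grep -E *.html|wc -l
8172

grep: text search command; -E enables extended regular expressions

wc: prints counts for its input; -l counts lines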

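Goal

Strip the tags from every HTML file and write the results into a single file that is easy to read back. Each document goes on its own line: the fields within a document are separated by \3, and documents are separated by \n.

For example:

title\3content\3url \n title\3content\3url \n title\3content\3url \n

Since getline reads one line at a time, a single call retrieves one complete document in the form title\3content\3url.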
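
Reading the file back could look like the sketch below. SplitRecord and ReadRecords are hypothetical helper names chosen for this example. The only things taken from the text are the '\3' field separator and the one-document-per-line layout.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split one line into its fields on the separator character.
// Empty fields are kept, so every separator produces a boundary.
std::vector<std::string> SplitRecord(const std::string& line, char sep = '\3')
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true)
    {
        std::string::size_type pos = line.find(sep, start);
        if (pos == std::string::npos)
        {
            fields.push_back(line.substr(start));   // last field: rest of the line
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// Read every line of the stream and split it; one line is one document.
std::vector<std::vector<std::string> > ReadRecords(std::istream& in)
{
    std::vector<std::vector<std::string> > records;
    std::string line;
    while (std::getline(in, line))   // getline removes the '\n' that ends each document
    {
        records.push_back(SplitRecord(line));
    }
    return records;
}
```

A record with exactly three fields maps onto title, content, and url. A caller would probably want to skip lines that split into any other number of fields.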

Code structure of parser.cc

#include <iostream>
#include <string>
#include <vector>

const std::string src_path="data/input";
const std::string output="data/raw_html/raw.txt";

typedef struct DocInfo
{
    std::string title;    //document title
    std::string content;  //document content
    std::string url;      //URL of the document on the official site
}DocInfo_t;

//const & : input
//* : output
//& : input and output

bool EnumFile(const std::string& src_path,std::vector<std::string>* files_list);

bool ParseHtml(const std::vector<std::string>& files_list,std::vector<DocInfo_t>* results);

bool SaveHtml(const std::vector<DocInfo_t>& results,const std::string& output);

int main()
{
    //list of file names
    std::vector<std::string> files_list;

    //Step 1: recursively collect the name (with path) of every HTML file into files_list, so the files can be read later
    if(!EnumFile(src_path,&files_list))
    {
        std::cerr<<"enum file name error"<<std::endl;
        return 1;
    }

    //Step 2: read the content of each file named in files_list and parse it into title + content + url
    std::vector<DocInfo_t> results; //tag-stripped results for all files in files_list
    if(!ParseHtml(files_list,&results))
    {
        std::cerr<<"parse html error"<<std::endl;
        return 2;
    }

    //Step 3: write the parsed documents to output, one document per line, with \3 separating the fields within a document
    if(!SaveHtml(results,output))
    {
        std::cerr<<"Save html error"<<std::endl;
        return 3;
    }
    return 0;
}

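EnumFile(): enumerate and filter HTML files

The C++11 standard library has no filesystem facilities (std::filesystem only arrived in C++17), so we use Boost's filesystem module here.

  • Install the Boost development package
[sjl@VM-16-6-centos boost_searcher]$ sudo yum install -y boost-devel 

Then include the header in parser.cc:

#include <boost/filesystem.hpp>
  • The code is as follows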
bool EnumFile(const std::string& src_path,std::vector<std::string>* files_list)
{
    namespace fs=boost::filesystem;
    fs::path root_path(src_path);

    //check that the path exists; if it does not, there is no point in continuing
    if(!fs::exists(root_path))
    {
        std::cerr<<src_path<<" not exists"<<std::endl;
        return false;
    }

    //a default-constructed iterator marks the end of the recursive traversal
    fs::recursive_directory_iterator end;
    for(fs::recursive_directory_iterator iter(root_path);iter!=end;iter++)
    {
        //keep regular files only (skip directories); HTML files are regular files
        if(!fs::is_regular_file(*iter))
        {
            continue;
        }
        //skip files whose extension is not ".html"
        if(iter->path().extension()!=".html")
        {
            continue;
        }

        //debug output
        std::cout<<"debug: "<<iter->path().string()<<std::endl; 
  
        //at this point the path is a regular file with the ".html" extension
        files_list->push_back(iter->path().string());//store the path of the HTML file as a string in files_list

    }

    return true;
}

The Makefile is as follows (note that it links against boost_system and boost_filesystem):

cc=g++

parser:parser.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem


.PHONY:clean
clean:
		rm -rf parser

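After make, you can inspect the libraries that parser links against (for example with ldd).

Run the parser executable (with the other two functions stubbed out to return true for now) and check the output.

The HTML files have now been filtered out: there are 8171 of them.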

ParseHtml(): parse the HTML structure

After the filtering above, files_list holds the path of every HTML file.

The overall skeleton of ParseHtml() is as follows:

bool ParseHtml(const std::vector<std::string>& files_list,std::vector<DocInfo_t>* results)
{
    for(const std::string &file: files_list)
    {
        //1. read the file: ReadFile
        std::string result;
        if(!ns_tool::FileTool::ReadFile(file,&result))
        {
            continue;
        }
  
        DocInfo_t doc;
        //2. parse the file and extract the title
        if(!ParseTitle(result,&doc.title))
        {
            continue;
        }
  
        //3. parse the file and extract the content, i.e. strip the tags
        if(!ParseContent(result,&doc.content))
        {
            continue;
        }
        //4. build the official-site URL from the file path
        if(!ParseUrl(file,&doc.url))
        {
            continue;
        }
        //done: parsing succeeded and all results for this document are stored in doc
        //append doc to results
        results->push_back(std::move(doc));//move rather than copy, so the strings are not duplicated
    }
    return true;
}

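Explanation

The function does four things: it reads each file by its path, extracts the title, extracts the content, and builds the URL.

  1. Read the file

Iterate over the file names stored in files_list and read each file's content into result. The ReadFile() function does this.

It is defined in the FileTool class in the header tool.hpp.

//tool.hpp
#pragma once
#include <iostream>
#include <string>
#include <fstream>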
namespace ns_tool
{
    class FileTool
    {
     public:
  
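         //read the content of the file named file_path into out
         static bool ReadFile(const std::string& file_path,std::string *out)
         {
            std::ifstream in(file_path,std::ifstream::in);
  
            //check whether the file was opened successfully
            if(!in.is_open())
            {
                std::cerr<<"open file "<<file_path<<" error"<<std::endl;
                return false;
            }
  
            //read the file line by line (getline drops the trailing '\n' of each line)
            std::string line;
            while(getline(in,line))
            {
                *out+=line; 
            }//getline returns the istream, which converts to bool; at end of file eofbit (and failbit) is set and the condition becomes false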
   

            in.close();
            return true;
         }

    };
}
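
One detail of the getline loop is worth seeing in isolation: std::getline discards the line terminator, so appending the lines one after another joins the last word of a line directly to the first word of the next. The sketch below reproduces the loop on an in-memory stream. ReadAll and its keep_newlines flag are hypothetical names added only for this illustration and are not part of tool.hpp.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Same loop shape as ReadFile(), but reading from any std::istream so it can
// be tried without touching the disk. keep_newlines is an illustrative switch.
std::string ReadAll(std::istream& in, bool keep_newlines)
{
    std::string out;
    std::string line;
    while (std::getline(in, line))   // the loop ends when extraction fails at end of input
    {
        out += line;                 // the '\n' was already removed by getline
        if (keep_newlines) out += '\n';
    }
    return out;
}
```

For the two-line input "<p>foo\nbar</p>\n", ReadAll(in, false) returns "<p>foobar</p>", while ReadAll(in, true) returns the input unchanged. Whether the joined words matter depends on how the content is used afterwards. Note that keeping the newlines would require the later steps to replace them, since the output format reserves '\n' as the document separator.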
  2. Extract the title: ParseTitle()

Open any of the HTML files and you will see that the title we want is the text enclosed by the title tag.

This is handled by the helper function bool ParseTitle(const std::string& result, std::string* title), defined in parser.cc.

//parse the title
static bool ParseTitle(const std::string& result,std::string* title)
{
    std::size_t begin=result.find("<title>");

    if(begin==std::string::npos)
    {
        return false;
    }
    std::size_t end=result.find("</title>");

    if(end==std::string::npos)
    {
        return false;
    }

    begin+=std::string("<title>").size();

    if(begin>end)
    {
        return false;
    }

    *title = result.substr(begin,end-begin);
    return true;
}
  <ol start="3"> 
   <li><strong>Extract the content, which really means stripping the tags</strong>: <code>ParseContent()</code></li> 
  </ol> 
  <p>That is, remove every angle bracket together with everything it encloses.</p> 
  <p><a href="http://img.e-com-net.com/image/info8/08654d6595704324b0d06dad9f7e2056.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/08654d6595704324b0d06dad9f7e2056.jpg" alt="【项目】 基于BOOST的站内搜索引擎_第12张图片" width="650" height="179" style="border:1px solid black;"></a></p> 
  <p>While scanning, a <code>></code> means the current tag has ended, and a <code><</code> means a new tag has begun.</p> 
  <p>This is handled by the helper function <code>bool ParseContent(const std::string& result, std::string* content)</code>, defined in parser.cc.</p> 
  <pre><code class="prism language-cpp"><span class="token comment">//strip tags</span>
<span class="token keyword">static</span> <span class="token keyword">bool</span> <span class="token function">ParseContent</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> result<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">*</span> content<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">//a simple two-state state machine</span>
    <span class="token keyword">enum</span> <span class="token class-name">status</span> 
    <span class="token punctuation">{</span>
        LABLE<span class="token punctuation">,</span>
        CONTENT
    <span class="token punctuation">}</span><span class="token punctuation">;</span>
    <span class="token keyword">enum</span> <span class="token class-name">status</span> s<span class="token operator">=</span>LABLE<span class="token punctuation">;</span> <span class="token comment">//initialize the state: an HTML file starts with a tag</span>
    <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">char</span> c<span class="token operator">:</span>result<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token keyword">switch</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">case</span> LABLE<span class="token operator">:</span>
                <span class="token keyword">if</span><span class="token punctuation">(</span>c<span class="token operator">==</span><span class="token char">'>'</span><span class="token punctuation">)</span>s<span class="token operator">=</span>CONTENT<span class="token punctuation">;</span>
                <span class="token keyword">break</span><span class="token punctuation">;</span>
            <span class="token keyword">case</span> CONTENT<span class="token operator">:</span>
                <span class="token keyword">if</span><span class="token punctuation">(</span>c<span class="token operator">==</span><span class="token char">'<'</span><span class="token punctuation">)</span> s<span class="token operator">=</span>LABLE<span class="token punctuation">;</span>
                <span class="token keyword">else</span> 
                <span class="token punctuation">{</span>
                    <span class="token comment">//do not keep '\n': replace it with a space</span>
                    <span class="token keyword">if</span><span class="token punctuation">(</span>c<span class="token operator">==</span><span class="token char">'\n'</span><span class="token punctuation">)</span> c<span class="token operator">=</span><span class="token char">' '</span><span class="token punctuation">;</span>
                    content<span class="token operator">-></span><span class="token function">push_back</span><span class="token punctuation">(</span>c<span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token punctuation">}</span>
                <span class="token keyword">break</span><span class="token punctuation">;</span>
            <span class="token keyword">default</span><span class="token operator">:</span>
                <span class="token keyword">break</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
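  <p>To see the state machine at work on a small input, here is a standalone sketch. StripTags is a hypothetical name used only for this example. It follows the same two-state idea, returns the text instead of writing through a pointer, and assumes the input begins with a tag, as the Boost pages do.</p>

```cpp
#include <cassert>
#include <string>

// Two-state scanner: either inside a tag (LABEL) or in visible text (CONTENT).
std::string StripTags(const std::string& html)
{
    enum Status { LABEL, CONTENT };
    Status s = LABEL;                        // assumption: the input begins with a tag
    std::string text;
    for (char c : html)
    {
        switch (s)
        {
            case LABEL:
                if (c == '>') s = CONTENT;   // the tag is closed, visible text may follow
                break;
            case CONTENT:
                if (c == '<') s = LABEL;     // a new tag starts
                else text.push_back(c == '\n' ? ' ' : c);   // newlines become spaces
                break;
        }
    }
    return text;
}
```

  <p>For the input <code><p>Hello <b>world</b></p></code> this returns <code>Hello world</code>. A scanner this simple also treats any <code><</code> inside scripts or comments as the start of a tag, which is acceptable for a sketch but worth keeping in mind.</p>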
  <ol start="4"> 
   <li><strong>Build the official-site URL</strong></li> 
  </ol> 
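  <p>The URL of a page on the Boost website corresponds directly to the path of that page in the documentation we downloaded.</p> 
  <p>For example:</p> 
  <p>Looking up <code>Accumulators</code> on the official site gives the <strong>official URL</strong>:</p> 
  <p>https://www.boost.org/doc/libs/1_79_0/doc/html/accumulators.html</p> 
  <p>In the downloaded documentation, the same page is located at:</p> 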
  <p><a href="http://img.e-com-net.com/image/info8/5ddc3084d61d41259e609d1bbbc8aa34.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/5ddc3084d61d41259e609d1bbbc8aa34.jpg" alt="在这里插入图片描述" width="650" height="83"></a></p> 
  <p>All our data sources were copied into the <code>data/input</code> directory, so inside our project the path of this page is:</p> 
  <pre><code class="prism language-bash">data/input/accumulators.html
</code></pre> 
  <p>So the URL can be assembled from two parts:</p> 
  <p>url_head = https://www.boost.org/doc/libs/1_79_0/doc/html</p> 
  <p>url_tail = <s>data/input</s>/accumulators.html</p> 
  <pre><code class="prism language-bash"><span class="token assign-left variable">url</span><span class="token operator">=</span>url_head + url_tail //this yields the official-site link
</code></pre> 
  <p>This is handled by the helper function <code>bool ParseUrl(const std::string& file_path, std::string* url)</code>, defined in parser.cc.</p> 
  <pre><code class="prism language-cpp"><span class="token comment">//build the official-site URL: url_head + url_tail</span>
<span class="token keyword">static</span> <span class="token keyword">bool</span> <span class="token function">ParseUrl</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> file_path<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">*</span> url<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    std<span class="token double-colon punctuation">::</span>string url_head<span class="token operator">=</span><span class="token string">"https://www.boost.org/doc/libs/1_79_0/doc/html"</span><span class="token punctuation">;</span>
    std<span class="token double-colon punctuation">::</span>string url_tail<span class="token operator">=</span>file_path<span class="token punctuation">.</span><span class="token function">substr</span><span class="token punctuation">(</span>src_path<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token operator">*</span>url<span class="token operator">=</span>url_head<span class="token operator">+</span>url_tail<span class="token punctuation">;</span>
    <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span> 
<span class="token punctuation">}</span>
</code></pre> 
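  <p>The mapping can be checked on the Accumulators example with a self-contained sketch. BuildUrl is a hypothetical variant written for this illustration: it takes the local root and the URL head as parameters instead of using the globals, and it adds a prefix check that the original function does not have.</p>

```cpp
#include <cassert>
#include <string>

// Map a local file path under src_path to its URL under url_head.
// Returns false if file_path does not start with src_path (an added guard).
bool BuildUrl(const std::string& src_path, const std::string& url_head,
              const std::string& file_path, std::string* url)
{
    if (file_path.compare(0, src_path.size(), src_path) != 0)
    {
        return false;
    }
    // Everything after the local root, including the leading '/', is the URL tail.
    *url = url_head + file_path.substr(src_path.size());
    return true;
}
```

  <p>Calling it with <code>data/input</code>, the URL head above, and <code>data/input/accumulators.html</code> produces the official URL shown earlier.</p>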
  <h3>SaveHtml(): save the tag-stripped documents</h3> 
  <pre><code class="prism language-cpp"><span class="token keyword">bool</span> <span class="token function">SaveHtml</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>DocInfo_t<span class="token operator">></span><span class="token operator">&</span> results<span class="token punctuation">,</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> output<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">SEP</span> <span class="token char">'\3'</span></span>
    std<span class="token double-colon punctuation">::</span>ofstream <span class="token function">out</span><span class="token punctuation">(</span>output<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>out<span class="token operator">|</span>std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>binary<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>out<span class="token punctuation">.</span><span class="token function">is_open</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span><span class="token string">"open "</span><span class="token operator"><<</span>output<span class="token operator"><<</span><span class="token string">" error"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>

    <span class="token comment">//write each document to disk as one line</span>
    <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> item<span class="token operator">:</span>results<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        std<span class="token double-colon punctuation">::</span>string out_string<span class="token punctuation">;</span>
        out_string <span class="token operator">=</span> item<span class="token punctuation">.</span>title<span class="token punctuation">;</span>
        out_string <span class="token operator">+=</span> SEP<span class="token punctuation">;</span>
        out_string <span class="token operator">+=</span> item<span class="token punctuation">.</span>content<span class="token punctuation">;</span>
        out_string <span class="token operator">+=</span> SEP<span class="token punctuation">;</span>
        out_string <span class="token operator">+=</span> item<span class="token punctuation">.</span>url<span class="token punctuation">;</span>
        out_string <span class="token operator">+=</span> <span class="token char">'\n'</span><span class="token punctuation">;</span>
        out<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>out_string<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>out_string<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>

    out<span class="token punctuation">.</span><span class="token function">close</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>

</code></pre> 
  <h3>Testing</h3> 
  <p>Build with make and run the resulting parser executable. If it succeeds, <code>raw.txt</code> in the <code>data/raw_html</code> directory will contain all the processed HTML documents.</p> 
  <pre><code class="prism language-bash"><span class="token punctuation">[</span>sjl@VM-16-6-centos boost_searcher<span class="token punctuation">]</span>$ <span class="token function">make</span>
g++ -o parser parser.cc -std<span class="token operator">=</span>c++11 -lboost_system -lboost_filesystem
<span class="token punctuation">[</span>sjl@VM-16-6-centos boost_searcher<span class="token punctuation">]</span>$ ll
total <span class="token number">136</span>
drwxr-xr-x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Apr  <span class="token number">7</span> 05:33 boost_1_79_0
drwxrwxr-x <span class="token number">4</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">19</span> <span class="token number">20</span>:37 data
-rw-rw-r-- <span class="token number">1</span> sjl sjl    <span class="token number">124</span> Jul <span class="token number">20</span> <span class="token number">20</span>:03 Makefile
-rwxrwxr-x <span class="token number">1</span> sjl sjl <span class="token number">112408</span> Jul <span class="token number">22</span> <span class="token number">12</span>:36 parser
-rw-rw-r-- <span class="token number">1</span> sjl sjl   <span class="token number">6088</span> Jul <span class="token number">22</span> <span class="token number">12</span>:31 parser.cc
-rw-rw-r-- <span class="token number">1</span> sjl sjl    <span class="token number">889</span> Jul <span class="token number">21</span> <span class="token number">21</span>:27 tool.hpp
<span class="token punctuation">[</span>sjl@VM-16-6-centos boost_searcher<span class="token punctuation">]</span>$ <span class="token function">cat</span> data/raw_html/raw.txt <span class="token operator">|</span> <span class="token function">wc</span> -l
<span class="token number">8171</span>
</code></pre> 
  <p>Each HTML document occupies one line, and the line count (8171) matches the number of HTML files found before processing.</p> 
  <p><a href="http://img.e-com-net.com/image/info8/b8e788dbd9df4500837b4b2fdac3a819.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/b8e788dbd9df4500837b4b2fdac3a819.jpg" alt="在这里插入图片描述" width="650" height="87"></a></p> 
  <p>The ASCII control character for <code>'\3'</code> (ETX) is displayed as <code>^C</code>.</p> 
  <h1>3. Index building module: Index</h1> 
  <pre><code class="prism language-bash"><span class="token punctuation">[</span>sjl@VM-16-6-centos boost_searcher<span class="token punctuation">]</span>$ <span class="token function">touch</span> index.hpp
</code></pre> 
  <p>This header is responsible for three things: 1. building the indexes, 2. forward index lookup, 3. inverted index lookup.</p> 
  <p>Design diagram:</p> 
  <p><a href="http://img.e-com-net.com/image/info8/53da046eae114db3ba32370a763566ca.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/53da046eae114db3ba32370a763566ca.jpg" alt="【项目】 基于BOOST的站内搜索引擎_第13张图片" width="650" height="258" style="border:1px solid black;"></a></p> 
  <pre><code class="prism language-cpp"><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">pragma</span> <span class="token expression">once </span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><vector></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><unordered_map></span></span>
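<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><fstream></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><mutex></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><cstdint></span></span>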
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"tool.hpp"</span></span>

<span class="token keyword">namespace</span> ns_index
<span class="token punctuation">{</span>
    <span class="token keyword">struct</span> <span class="token class-name">DocInfo</span>
    <span class="token punctuation">{</span>
        std<span class="token double-colon punctuation">::</span>string title <span class="token punctuation">;</span>   <span class="token comment">//document title</span>
        std<span class="token double-colon punctuation">::</span>string content<span class="token punctuation">;</span>  <span class="token comment">//document content with tags stripped</span>
        std<span class="token double-colon punctuation">::</span>string url<span class="token punctuation">;</span>      <span class="token comment">//URL of the document on the official site</span>
        <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span>      <span class="token comment">//document ID</span>
    <span class="token punctuation">}</span><span class="token punctuation">;</span>


    <span class="token comment">//element of an inverted list</span>
    <span class="token keyword">struct</span> <span class="token class-name">InvertedElem</span>
    <span class="token punctuation">{</span>
        <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span>   <span class="token comment">// document ID</span>
        std<span class="token double-colon punctuation">::</span>string word<span class="token punctuation">;</span> <span class="token comment">// keyword associated with the document</span>
        <span class="token keyword">int</span> weight<span class="token punctuation">;</span>        <span class="token comment">// weight of the keyword in the document</span>
    <span class="token punctuation">}</span><span class="token punctuation">;</span>
  
    <span class="token comment">//inverted list (posting list)</span>
    <span class="token keyword">typedef</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElem<span class="token operator">></span> InvertedList<span class="token punctuation">;</span>
  

    <span class="token keyword">class</span> <span class="token class-name">Index</span>
    <span class="token punctuation">{</span>
        <span class="token keyword">private</span><span class="token operator">:</span>
            <span class="token comment">//the forward index is an array whose subscript is the document ID</span>
            std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>DocInfo<span class="token operator">></span> forward_index<span class="token punctuation">;</span> <span class="token comment">//forward index: document ID -> document content</span>


            <span class="token comment">//inverted index: each keyword maps to a group of InvertedElem (keyword -> inverted list)</span>
            std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span> std<span class="token double-colon punctuation">::</span>string <span class="token punctuation">,</span> InvertedList <span class="token operator">></span> inverted_index<span class="token punctuation">;</span>

        <span class="token keyword">private</span><span class="token operator">:</span>
            <span class="token comment">//Index is a singleton</span>
            <span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token punctuation">}</span>
            <span class="token function">Index</span><span class="token punctuation">(</span><span class="token keyword">const</span> Index<span class="token operator">&</span> <span class="token punctuation">)</span><span class="token operator">=</span><span class="token keyword">delete</span><span class="token punctuation">;</span>
            Index<span class="token operator">&</span> <span class="token keyword">operator</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token keyword">const</span> Index<span class="token operator">&</span> <span class="token punctuation">)</span><span class="token operator">=</span><span class="token keyword">delete</span><span class="token punctuation">;</span>
            <span class="token keyword">static</span> Index<span class="token operator">*</span> instance<span class="token punctuation">;</span>
            <span class="token keyword">static</span> std<span class="token double-colon punctuation">::</span>mutex mtx<span class="token punctuation">;</span>
        <span class="token keyword">public</span><span class="token operator">:</span>
            <span class="token comment">//get the singleton, creating it on first use</span>
            <span class="token keyword">static</span> Index<span class="token operator">*</span> <span class="token function">Getinstance</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
            <span class="token punctuation">{</span>
                <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>instance<span class="token punctuation">)</span>
                <span class="token punctuation">{</span>
                    <span class="token comment">//instance is a shared resource, so its creation is protected by the mutex</span>
                    mtx<span class="token punctuation">.</span><span class="token function">lock</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                    <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>instance<span class="token punctuation">)</span>
                    <span class="token punctuation">{</span>
                        instance<span class="token operator">=</span><span class="token keyword">new</span> <span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                    <span class="token punctuation">}</span>
                    mtx<span class="token punctuation">.</span><span class="token function">unlock</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token punctuation">}</span>
                <span class="token keyword">return</span> instance<span class="token punctuation">;</span>
            <span class="token punctuation">}</span>

            <span class="token operator">~</span><span class="token function">Index</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
            <span class="token punctuation">{</span><span class="token punctuation">}</span>
        <span class="token keyword">public</span><span class="token operator">:</span>
            <span class="token comment">//forward lookup: get the document for a given doc_id</span>
            DocInfo<span class="token operator">*</span> <span class="token function">GetForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">)</span> 
            <span class="token punctuation">{</span>
                <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>

            <span class="token comment">//inverted lookup: get the inverted list for the keyword word</span>
            InvertedList<span class="token operator">*</span> <span class="token function">GetInvertedList</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> word<span class="token punctuation">)</span>
            <span class="token punctuation">{</span>
                <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>

            <span class="token comment">//build the index</span>
            <span class="token comment">//the documents produced by the Parser are used to build the forward and inverted indexes</span>
            <span class="token comment">//the Parser output is stored at: data/raw_html/raw.txt</span>
            <span class="token keyword">bool</span> <span class="token function">BuildIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> parsed_path<span class="token punctuation">)</span>
            <span class="token punctuation">{</span>
                <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>

    <span class="token punctuation">}</span><span class="token punctuation">;</span>

<span class="token punctuation">}</span>
</code></pre> 
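  <p>One caveat about the skeleton above: <code>Getinstance()</code> reads the plain pointer <code>instance</code> outside the lock, which the C++11 memory model formally treats as a data race. A minimal sketch of an alternative, assuming a C++11 compiler, is a function-local static. The class name <code>IndexSketch</code> and its members are hypothetical stand-ins for illustration, not the article's actual <code>Index</code> class.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the article's Index class; only the singleton part is shown.
class IndexSketch
{
public:
    // Since C++11, a function-local static is initialized exactly once,
    // even when several threads call GetInstance() at the same time.
    static IndexSketch* GetInstance()
    {
        static IndexSketch instance;
        return &instance;
    }

    IndexSketch(const IndexSketch&) = delete;
    IndexSketch& operator=(const IndexSketch&) = delete;

    // Tiny placeholder state so that sharing can be observed.
    void AddTitle(const std::string& title) { titles_.push_back(title); }
    std::size_t Size() const { return titles_.size(); }

private:
    IndexSketch() {}
    std::vector<std::string> titles_;
};
```

  <p>Every call to <code>IndexSketch::GetInstance()</code> returns the same object, so neither a separate mutex nor a static pointer member is needed for construction.</p>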
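  <p>With the basic design in place, we can start writing the functions.</p> 
  <h2>Getting the forward index</h2> 
  <p>Assuming <code>forward_index</code> has already been built, the forward lookup function is easy to write.</p> 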
  <pre><code class="prism language-cpp"><span class="token comment">//get the document for a given doc_id</span>
DocInfo<span class="token operator">*</span> <span class="token function">GetForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">)</span> 
<span class="token punctuation">{</span>
    <span class="token keyword">if</span><span class="token punctuation">(</span>doc_id<span class="token operator">>=</span>forward_index<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span><span class="token string">"doc_id out of range!"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">return</span> <span class="token operator">&</span>forward_index<span class="token punctuation">[</span>doc_id<span class="token punctuation">]</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
  <h2>Getting the inverted index</h2> 
  <pre><code class="prism language-cpp"><span class="token comment">//get the inverted list for the keyword word</span>
InvertedList<span class="token operator">*</span> <span class="token function">GetInvertedList</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> word<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span>InvertedList<span class="token operator">></span><span class="token double-colon punctuation">::</span>iterator iter<span class="token operator">=</span>inverted_index<span class="token punctuation">.</span><span class="token function">find</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">if</span><span class="token punctuation">(</span>iter<span class="token operator">==</span>inverted_index<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">//no inverted list exists for this word</span>
        std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span>word<span class="token operator"><<</span><span class="token string">" has no InvertedList"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>

    <span class="token keyword">return</span> <span class="token operator">&</span><span class="token punctuation">(</span>iter<span class="token operator">-></span>second<span class="token punctuation">)</span><span class="token punctuation">;</span> 
<span class="token punctuation">}</span>
</code></pre> 
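  <h2>Building the index</h2> 
  <p>The hard part of this module is clearly how to build the index, and <strong>the order of index construction is exactly the reverse of the path a user's search takes</strong>: a search goes from keyword to inverted index to document, while construction goes from document to forward index to inverted index.</p> 
  <p>Approach: walk through the documents one by one and, for each, build its forward-index entry first and then its inverted-index entries.</p> 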
  <p><a href="http://img.e-com-net.com/image/info8/eedece91804341909e335852b796cbb1.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/eedece91804341909e335852b796cbb1.jpg" alt="【项目】 基于BOOST的站内搜索引擎_第14张图片" width="650" height="278" style="border:1px solid black;"></a></p> 
  <p>The code is as follows:</p> 
  <pre><code class="prism language-cpp"><span class="token comment">//build the forward and inverted indexes from the documents produced by the Parser</span>
<span class="token comment">//the Parser output is stored at: data/raw_html/raw.txt</span>
<span class="token keyword">bool</span> <span class="token function">BuildIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> parsed_path<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">//open the file produced by the Parser</span>
    std<span class="token double-colon punctuation">::</span>ifstream <span class="token function">in</span><span class="token punctuation">(</span>parsed_path<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>in<span class="token operator">|</span>std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>binary<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>in<span class="token punctuation">.</span><span class="token function">is_open</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span>parsed_path<span class="token operator"><<</span><span class="token string">" open failed"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
  
    std<span class="token double-colon punctuation">::</span>string line<span class="token punctuation">;</span>
    <span class="token keyword">int</span> count<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">;</span><span class="token comment">//number of documents indexed so far</span>
    <span class="token keyword">while</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">getline</span><span class="token punctuation">(</span>in<span class="token punctuation">,</span>line<span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span> 
        <span class="token comment">//build the forward index: load one parsed document into the forward index</span>
        DocInfo<span class="token operator">*</span> doc<span class="token operator">=</span><span class="token function">BuildForwardIndex</span><span class="token punctuation">(</span>line<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>doc<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            std<span class="token double-colon punctuation">::</span>cerr<span class="token operator"><<</span><span class="token string">"build "</span><span class="token operator"><<</span>line<span class="token operator"><<</span><span class="token string">" error"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span><span class="token comment">//for debug</span>
            <span class="token keyword">continue</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>

        <span class="token comment">//build the inverted index for this document</span>
        <span class="token function">BuildInvertedIndex</span><span class="token punctuation">(</span><span class="token operator">*</span>doc<span class="token punctuation">)</span><span class="token punctuation">;</span>

        <span class="token comment">//print the number of indexed documents as it grows: a simple progress indicator</span>
        count<span class="token operator">++</span><span class="token punctuation">;</span>
        <span class="token function">printf</span><span class="token punctuation">(</span><span class="token string">"indexed %d documents: %d%%\r"</span><span class="token punctuation">,</span>count<span class="token punctuation">,</span>count<span class="token operator">*</span><span class="token number">100</span><span class="token operator">/</span><span class="token number">8171</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token comment">//8171 is the number of parsed files, hard-coded here</span>
        <span class="token function">fflush</span><span class="token punctuation">(</span><span class="token constant">stdout</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
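  <p>The constant 8171 in the progress line only matches one particular parse run. A minimal sketch of one way to avoid it, assuming the parsed file holds one document per line (as the <code>getline</code> loop implies): count the records in a first pass, rewind the stream, and compute the percentage from that total. The helper names <code>CountRecords</code> and <code>ProgressPercent</code> are hypothetical and not part of the original project.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts the lines (one parsed document per line) and rewinds the stream
// so that the caller can run the real indexing pass afterwards.
std::size_t CountRecords(std::istream& in)
{
    std::size_t total = 0;
    std::string line;
    while (std::getline(in, line))
    {
        ++total;
    }
    in.clear();                 // reset the eof/fail bits left by the last getline
    in.seekg(0, std::ios::beg); // go back to the start of the stream
    return total;
}

// Integer percentage of finished work; returns 0 when there is nothing to do.
int ProgressPercent(std::size_t done, std::size_t total)
{
    if (total == 0)
    {
        return 0;
    }
    return static_cast<int>(done * 100 / total);
}
```

  <p>Inside <code>BuildIndex</code>, the progress line could then use <code>ProgressPercent(count, total)</code> instead of dividing by a fixed number. The price is one extra pass over the file.</p>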
  <h3>Building the forward index</h3> 
  <pre><code class="prism language-cpp"><span class="token keyword">private</span><span class="token operator">:</span>
    DocInfo<span class="token operator">*</span> <span class="token function">BuildForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> line<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">//1. parse line: split the string into fields</span>
        <span class="token comment">//line -> title+content+url </span>
        std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> results<span class="token punctuation">;</span>
        <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string sep<span class="token operator">=</span><span class="token string">"\3"</span><span class="token punctuation">;</span>
        ns_tool<span class="token double-colon punctuation">::</span><span class="token class-name">StringTool</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>line<span class="token punctuation">,</span><span class="token operator">&</span>results<span class="token punctuation">,</span>sep<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span>results<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">!=</span><span class="token number">3</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>

        <span class="token comment">//2. fill a DocInfo with the fields</span>
        DocInfo doc<span class="token punctuation">;</span>
        doc<span class="token punctuation">.</span>title<span class="token operator">=</span>results<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
        doc<span class="token punctuation">.</span>content<span class="token operator">=</span>results<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
        doc<span class="token punctuation">.</span>url<span class="token operator">=</span>results<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
        doc<span class="token punctuation">.</span>doc_id<span class="token operator">=</span>forward_index<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
  
        <span class="token comment">//3. append the DocInfo to the forward index, forward_index</span>
        forward_index<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>doc<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">return</span> <span class="token operator">&</span>forward_index<span class="token punctuation">.</span><span class="token function">back</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
</code></pre> 
  <p>The <code>CutString</code> function is defined in tool.hpp.</p> 
  <p>Boost's split function makes it easy to split the string; earlier, the Parser joined title/content/url with <code>\3</code> as the separator.</p> 
  <pre><code class="prism language-cpp"><span class="token comment">//tool.hpp</span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">pragma</span> <span class="token expression">once</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><vector></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><fstream></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><boost/algorithm/string.hpp></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cppjieba/Jieba.hpp"</span></span>
<span class="token keyword">namespace</span> ns_tool
<span class="token punctuation">{</span>
    <span class="token comment">//...</span>

    <span class="token keyword">class</span> <span class="token class-name">StringTool</span>
    <span class="token punctuation">{</span>
    <span class="token keyword">public</span><span class="token operator">:</span>
        <span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">CutString</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> src<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span><span class="token operator">*</span> dst<span class="token punctuation">,</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> sep <span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token comment">//boost split</span>
            boost<span class="token double-colon punctuation">::</span><span class="token function">split</span><span class="token punctuation">(</span><span class="token operator">*</span>dst<span class="token punctuation">,</span>src<span class="token punctuation">,</span>boost<span class="token double-colon punctuation">::</span><span class="token function">is_any_of</span><span class="token punctuation">(</span>sep<span class="token punctuation">)</span><span class="token punctuation">,</span>boost<span class="token double-colon punctuation">::</span>token_compress_on<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token comment">//token_compress_on: consecutive separators are treated as a single separator</span>
        <span class="token punctuation">}</span>
    <span class="token punctuation">}</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
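  <p>One consequence of <code>token_compress_on</code> is worth spelling out: if a record has an empty field in the middle, the two adjacent separators collapse into one, the split yields two fields instead of three, and <code>BuildForwardIndex</code> rejects the record through its <code>results.size()!=3</code> check. The sketch below imitates that behaviour with the standard library only, so it runs without Boost. <code>SplitFields</code> is a hypothetical stand-in written for illustration and is not Boost's implementation.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits src on a single separator character.
// With compress == true, a run of consecutive separators counts as one,
// which is the effect the article gets from boost::token_compress_on.
std::vector<std::string> SplitFields(const std::string& src, char sep, bool compress)
{
    std::vector<std::string> out;
    std::string cur;
    bool last_was_sep = false;
    for (char c : src)
    {
        if (c == sep)
        {
            if (compress && last_was_sep)
            {
                continue; // skip the repeated separator
            }
            out.push_back(cur);
            cur.clear();
            last_was_sep = true;
        }
        else
        {
            cur.push_back(c);
            last_was_sep = false;
        }
    }
    out.push_back(cur); // the last field has no trailing separator
    return out;
}
```

  <p>With compression on, <code>"title\3content\3url"</code> gives three fields, while <code>"title\3\3url"</code> (empty content) gives only two. With compression off, the second record keeps its empty middle field and still has three.</p>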
  <h3>Building the inverted index</h3> 
  <p>Building the inverted index is the hard part of index construction.</p> 
  <p><strong>How it works</strong>:</p> 
  <ol> 
   <li>Start from a DocInfo</li> 
  </ol> 
  <pre><code class="prism language-cpp"><span class="token keyword">struct</span> <span class="token class-name">DocInfo</span>
<span class="token punctuation">{</span>
    std<span class="token double-colon punctuation">::</span>string title <span class="token punctuation">;</span>   <span class="token comment">//document title</span>
    std<span class="token double-colon punctuation">::</span>string content<span class="token punctuation">;</span>  <span class="token comment">//document content with tags stripped</span>
    std<span class="token double-colon punctuation">::</span>string url<span class="token punctuation">;</span>      <span class="token comment">//URL of the document on the official site</span>
    <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span>      <span class="token comment">//document ID</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span>
</code></pre> 
  <p>For example:</p> 
  <pre><code class="prism language-txt">title: 吃葡萄
content:吃葡萄不吐葡萄皮
url:http://xxxx
doc_id:123
</code></pre> 
  <ol start="2"> 
   <li>From the document content held in the DocInfo, produce one or more InvertedElem entries.</li> 
  </ol> 
  <pre><code class="prism language-cpp"><span class="token comment">//element of the inverted index</span>
<span class="token keyword">struct</span> <span class="token class-name">InvertedElem</span>
<span class="token punctuation">{</span>
    <span class="token keyword">uint64_t</span> doc_id<span class="token punctuation">;</span>   <span class="token comment">// document ID</span>
    std<span class="token double-colon punctuation">::</span>string word<span class="token punctuation">;</span> <span class="token comment">// keyword found in the document</span>
    <span class="token keyword">int</span> weight<span class="token punctuation">;</span>        <span class="token comment">// weight of the keyword in this document</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span>

<span class="token comment">//inverted list</span>
<span class="token keyword">typedef</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElem<span class="token operator">></span> InvertedList<span class="token punctuation">;</span>
</code></pre> 
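  <p>Because documents are processed one at a time and a document contains many words, every word found in it maps to the current doc_id.</p> 
  <p><strong>2.1</strong> First, segment title and content into words, using <code>jieba</code> (a third-party library)</p> 
  <p>title: 吃/葡萄/吃葡萄 (<code>title_word</code>)</p> 
  <p>content: 吃/葡萄/不吐/葡萄皮 (<code>content_word</code>)</p> 
  <p><strong>2.2</strong> Word-frequency counting</p> 
  <p>This measures how relevant a word is to a document: a word that occurs more often, or that appears in the title, can be considered more relevant.</p> 
  <p>Pseudocode:</p> 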
  <pre><code class="prism language-cpp"><span class="token comment">//after segmentation, count how often each word occurs in the title and in the content</span>
<span class="token keyword">struct</span> <span class="token class-name">word_cnt</span>
<span class="token punctuation">{</span>
    <span class="token keyword">int</span> title_cnt<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">;</span>
    <span class="token keyword">int</span> content_cnt<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span>

<span class="token comment">//each word and its counts are stored in a hash map</span>
std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string <span class="token punctuation">,</span> word_cnt<span class="token operator">></span> word_stat<span class="token punctuation">;</span>

<span class="token comment">//walk the title_word array and count each word's occurrences in the title</span>
<span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> word<span class="token operator">:</span>title_word<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    word_stat<span class="token punctuation">[</span>word<span class="token punctuation">]</span><span class="token punctuation">.</span>title_cnt<span class="token operator">++</span><span class="token punctuation">;</span><span class="token comment">//吃(1)/葡萄(1)/吃葡萄(1)</span>
<span class="token punctuation">}</span>

<span class="token comment">//walk the content_word array and count each word's occurrences in the content</span>
<span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> word<span class="token operator">:</span>content_word<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    word_stat<span class="token punctuation">[</span>word<span class="token punctuation">]</span><span class="token punctuation">.</span>content_cnt<span class="token operator">++</span><span class="token punctuation">;</span><span class="token comment">//吃(1)/葡萄(1)/不吐(1)/葡萄皮(1)</span>
<span class="token punctuation">}</span>


</code></pre> 
  <p>At this point we know the frequency of every word in the document's title and content.</p> 
  <p><strong>2.3</strong> A custom relevance score</p> 
  <p>Pseudocode:</p> 
  <pre><code class="prism language-cpp"><span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> word<span class="token operator">:</span>word_stat<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">//the relation between one concrete word and the document (ID: 123)</span>
    <span class="token keyword">struct</span> <span class="token class-name">InvertedElem</span> elem<span class="token punctuation">;</span>
    elem<span class="token punctuation">.</span>doc_id<span class="token operator">=</span><span class="token number">123</span><span class="token punctuation">;</span>
    elem<span class="token punctuation">.</span>word<span class="token operator">=</span>word<span class="token punctuation">.</span>first<span class="token punctuation">;</span>  

    <span class="token comment">//when a word points to several doc IDs, relevance decides which one is shown first</span>
    elem<span class="token punctuation">.</span>weight<span class="token operator">=</span><span class="token number">10</span><span class="token operator">*</span>word<span class="token punctuation">.</span>second<span class="token punctuation">.</span>title_cnt <span class="token operator">+</span> word<span class="token punctuation">.</span>second<span class="token punctuation">.</span>content_cnt <span class="token punctuation">;</span>
    <span class="token comment">//choosing relevance weights is a hard problem in general; this is a deliberate simplification</span>
  
    <span class="token comment">//append to this word's inverted list: one word can map to many documents</span>
    inverted_index<span class="token punctuation">[</span>word<span class="token punctuation">.</span>first<span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
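  <p>Steps 2.2 and 2.3 can be combined into one small function. The following is a minimal sketch that takes already-segmented word lists as input, so it runs without jieba. The names <code>AddDocument</code> and <code>WordCnt</code> are hypothetical; <code>InvertedElem</code>, <code>InvertedList</code> and the weight formula (10 for each title hit, 1 for each content hit) are taken from the pseudocode above.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct InvertedElem
{
    uint64_t doc_id;  // document ID
    std::string word; // keyword found in the document
    int weight;       // weight of the keyword in this document
};
typedef std::vector<InvertedElem> InvertedList;

// Per-word counters for a single document (hypothetical name).
struct WordCnt
{
    int title_cnt = 0;
    int content_cnt = 0;
};

// Adds one document's words to the inverted index.
// title_words / content_words are assumed to be the output of a segmenter.
void AddDocument(uint64_t doc_id,
                 const std::vector<std::string>& title_words,
                 const std::vector<std::string>& content_words,
                 std::unordered_map<std::string, InvertedList>* inverted_index)
{
    // 2.2: count how often each word occurs in the title and in the content
    std::unordered_map<std::string, WordCnt> word_stat;
    for (const auto& w : title_words)
    {
        word_stat[w].title_cnt++;
    }
    for (const auto& w : content_words)
    {
        word_stat[w].content_cnt++;
    }

    // 2.3: turn the counts into one InvertedElem per distinct word
    for (const auto& kv : word_stat)
    {
        InvertedElem elem;
        elem.doc_id = doc_id;
        elem.word = kv.first;
        elem.weight = 10 * kv.second.title_cnt + kv.second.content_cnt;
        (*inverted_index)[kv.first].push_back(std::move(elem));
    }
}
```

  <p>For the walkthrough document above (ID 123), the word 葡萄 occurs once in the title and once in the content, so its weight under this formula is 10*1 + 1 = 11, while 葡萄皮 occurs only in the content and gets weight 1.</p>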
  <ol start="3"> 
   <li>Using jieba segmentation: cppjieba</li> 
  </ol> 
  <p>Download the cppjieba library.</p> 
  <p>It can be fetched with:</p> 
  <pre><code class="prism language-bash"><span class="token function">git</span> clone https://github.com/yanyiwu/cppjieba
</code></pre> 
  <p>After downloading cppjieba there is one more detail: manually copy the files under <code>cppjieba/deps/limonp/</code> to the <code>cppjieba/include/cppjieba/</code> directory, otherwise compilation fails.</p> 
  <p><a href="http://img.e-com-net.com/image/info8/3658f849d7d14c69a0c28999129d40d6.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/3658f849d7d14c69a0c28999129d40d6.jpg" alt="【项目】 基于BOOST的站内搜索引擎_第15张图片" width="650" height="230" style="border:1px solid black;"></a></p> 
  <p>We can try the library out; the main function we use is <code>CutForSearch()</code>.</p> 
  <pre><code class="prism language-cpp"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos test<span class="token punctuation">]</span>$ ll
total <span class="token number">372</span>
<span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">366424</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">02</span> a<span class="token punctuation">.</span>out
drwxrwxr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">23</span> <span class="token number">16</span><span class="token operator">:</span><span class="token number">11</span> cppjieba
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">857</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">07</span> demo<span class="token punctuation">.</span>cpp
lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">14</span> Jul <span class="token number">23</span> <span class="token number">16</span><span class="token operator">:</span><span class="token number">23</span> dict <span class="token operator">-></span> cppjieba<span class="token operator">/</span>dict<span class="token operator">/</span>
lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">17</span> Jul <span class="token number">23</span> <span class="token number">16</span><span class="token operator">:</span><span class="token number">26</span> inc <span class="token operator">-></span> cppjieba<span class="token operator">/</span>include<span class="token operator">/</span>
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">424</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">34</span> test<span class="token punctuation">.</span>cc
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos test<span class="token punctuation">]</span>$ cat demo<span class="token punctuation">.</span>cpp 
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"inc/cppjieba/Jieba.hpp"</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><vector></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>

<span class="token keyword">using</span> <span class="token keyword">namespace</span> std<span class="token punctuation">;</span>

<span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/jieba.dict.utf8"</span><span class="token punctuation">;</span>
<span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> HMM_PATH <span class="token operator">=</span> <span class="token string">"./dict/hmm_model.utf8"</span><span class="token punctuation">;</span>
<span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> USER_DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/user.dict.utf8"</span><span class="token punctuation">;</span>
<span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> IDF_PATH <span class="token operator">=</span> <span class="token string">"./dict/idf.utf8"</span><span class="token punctuation">;</span>
<span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> STOP_WORD_PATH <span class="token operator">=</span> <span class="token string">"./dict/stop_words.utf8"</span><span class="token punctuation">;</span>

<span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token keyword">int</span> argc<span class="token punctuation">,</span> <span class="token keyword">char</span><span class="token operator">*</span><span class="token operator">*</span> argv<span class="token punctuation">)</span> 
<span class="token punctuation">{</span>
    cppjieba<span class="token double-colon punctuation">::</span>Jieba <span class="token function">jieba</span><span class="token punctuation">(</span>DICT_PATH<span class="token punctuation">,</span>
            HMM_PATH<span class="token punctuation">,</span>
            USER_DICT_PATH<span class="token punctuation">,</span>
            IDF_PATH<span class="token punctuation">,</span>
            STOP_WORD_PATH<span class="token punctuation">)</span><span class="token punctuation">;</span>
    vector<span class="token operator"><</span>string<span class="token operator">></span> words<span class="token punctuation">;</span>
    string s<span class="token punctuation">;</span>

    s <span class="token operator">=</span> <span class="token string">"小明硕士毕业于中国科学院计算所,后在日本京都大学深造"</span><span class="token punctuation">;</span>
    cout <span class="token operator"><<</span> s <span class="token operator"><<</span> endl<span class="token punctuation">;</span>
    cout <span class="token operator"><<</span> <span class="token string">"[demo] CutForSearch"</span> <span class="token operator"><<</span> endl<span class="token punctuation">;</span>
    jieba<span class="token punctuation">.</span><span class="token function">CutForSearch</span><span class="token punctuation">(</span>s<span class="token punctuation">,</span> words<span class="token punctuation">)</span><span class="token punctuation">;</span>
    cout <span class="token operator"><<</span> limonp<span class="token double-colon punctuation">::</span><span class="token function">Join</span><span class="token punctuation">(</span>words<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> words<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"/"</span><span class="token punctuation">)</span> <span class="token operator"><<</span> endl<span class="token punctuation">;</span>

    <span class="token keyword">return</span> EXIT_SUCCESS<span class="token punctuation">;</span>
<span class="token punctuation">}</span>
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos test<span class="token punctuation">]</span>$ <span class="token punctuation">.</span><span class="token operator">/</span>a<span class="token punctuation">.</span>out 
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
<span class="token punctuation">[</span>demo<span class="token punctuation">]</span> CutForSearch
小明<span class="token operator">/</span>硕士<span class="token operator">/</span>毕业<span class="token operator">/</span>于<span class="token operator">/</span>中国<span class="token operator">/</span>科学<span class="token operator">/</span>学院<span class="token operator">/</span>科学院<span class="token operator">/</span>中国科学院<span class="token operator">/</span>计算<span class="token operator">/</span>计算所<span class="token operator">/</span>,<span class="token operator">/</span>后<span class="token operator">/</span>在<span class="token operator">/</span>日本<span class="token operator">/</span>京都<span class="token operator">/</span>大学<span class="token operator">/</span>日本京都大学<span class="token operator">/</span>深造
</code></pre> 
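  <p>As the output shows, the sentence is segmented cleanly into words.</p> 
  <p><strong>Next, we bring in the jieba library to write the inverted-index code.</strong></p> 
  <p>Place the cppjieba library in the third-party directory <code>thirdpart</code> under the home directory (<code>~/thirdpart</code>), then <strong>create symbolic links</strong> in this project directory to <strong>the library's header files and dictionaries</strong>:</p> 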
  <pre><code class="prism language-cpp"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ll
total <span class="token number">148</span>
drwxr<span class="token operator">-</span>xr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Apr  <span class="token number">7</span> <span class="token number">05</span><span class="token operator">:</span><span class="token number">33</span> boost_1_79_0
drwxrwxr<span class="token operator">-</span>x <span class="token number">4</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">19</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">37</span> data
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">4399</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">44</span> index<span class="token punctuation">.</span>hpp
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">124</span> Jul <span class="token number">20</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">03</span> Makefile
<span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">112408</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">36</span> parser
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">6088</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">31</span> parser<span class="token punctuation">.</span>cc
drwxrwxr<span class="token operator">-</span>x <span class="token number">3</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">02</span> test
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">1244</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">44</span> tool<span class="token punctuation">.</span>hpp
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ln <span class="token operator">-</span>s <span class="token operator">~</span><span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>include<span class="token operator">/</span>cppjieba<span class="token operator">/</span> cppjieba
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ln <span class="token operator">-</span>s <span class="token operator">~</span><span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>dict<span class="token operator">/</span> dict
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ll
total <span class="token number">148</span>
drwxr<span class="token operator">-</span>xr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Apr  <span class="token number">7</span> <span class="token number">05</span><span class="token operator">:</span><span class="token number">33</span> boost_1_79_0
lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">46</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">46</span> cppjieba <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>include<span class="token operator">/</span>cppjieba<span class="token operator">/</span>
drwxrwxr<span class="token operator">-</span>x <span class="token number">4</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">19</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">37</span> data
lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">34</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">47</span> dict <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>dict<span class="token operator">/</span>
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">4399</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">44</span> index<span class="token punctuation">.</span>hpp
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">124</span> Jul <span class="token number">20</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">03</span> Makefile
<span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">112408</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">36</span> parser
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">6088</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">31</span> parser<span class="token punctuation">.</span>cc
drwxrwxr<span class="token operator">-</span>x <span class="token number">3</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">02</span> test
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">1244</span> Jul <span class="token number">23</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">44</span> tool<span class="token punctuation">.</span>hpp
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ls cppjieba<span class="token operator">/</span>
DictTrie<span class="token punctuation">.</span>hpp     HMMModel<span class="token punctuation">.</span>hpp    Jieba<span class="token punctuation">.</span>hpp             limonp          MPSegment<span class="token punctuation">.</span>hpp  PreFilter<span class="token punctuation">.</span>hpp     SegmentBase<span class="token punctuation">.</span>hpp    TextRankExtractor<span class="token punctuation">.</span>hpp  Unicode<span class="token punctuation">.</span>hpp
FullSegment<span class="token punctuation">.</span>hpp  HMMSegment<span class="token punctuation">.</span>hpp  KeywordExtractor<span class="token punctuation">.</span>hpp  MixSegment<span class="token punctuation">.</span>hpp  PosTagger<span class="token punctuation">.</span>hpp  QuerySegment<span class="token punctuation">.</span>hpp  SegmentTagged<span class="token punctuation">.</span>hpp  Trie<span class="token punctuation">.</span>hpp
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ls dict<span class="token operator">/</span>
hmm_model<span class="token punctuation">.</span>utf8  idf<span class="token punctuation">.</span>utf8  jieba<span class="token punctuation">.</span>dict<span class="token punctuation">.</span>utf8  pos_dict  README<span class="token punctuation">.</span>md  stop_words<span class="token punctuation">.</span>utf8  user<span class="token punctuation">.</span>dict<span class="token punctuation">.</span>utf8
</code></pre> 
  <p>We keep the tokenization code in the header <code>tool.hpp</code> as a general-purpose utility. The tokenization function looks like this:</p> 
  <pre><code class="prism language-cpp"><span class="token comment">//tool.hpp</span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">pragma</span> <span class="token expression">once</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><vector></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><fstream></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><boost/algorithm/string.hpp></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cppjieba/Jieba.hpp"</span></span>
<span class="token keyword">namespace</span> ns_tool
<span class="token punctuation">{</span>
    <span class="token comment">//...</span>


    <span class="token comment">// Tokenization utility</span>
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/jieba.dict.utf8"</span><span class="token punctuation">;</span>
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> HMM_PATH <span class="token operator">=</span> <span class="token string">"./dict/hmm_model.utf8"</span><span class="token punctuation">;</span>
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> USER_DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/user.dict.utf8"</span><span class="token punctuation">;</span>
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> IDF_PATH <span class="token operator">=</span> <span class="token string">"./dict/idf.utf8"</span><span class="token punctuation">;</span>
    <span class="token keyword">const</span> <span class="token keyword">char</span><span class="token operator">*</span> <span class="token keyword">const</span> STOP_WORD_PATH <span class="token operator">=</span> <span class="token string">"./dict/stop_words.utf8"</span><span class="token punctuation">;</span>
    <span class="token keyword">class</span> <span class="token class-name">JiebaTool</span>
    <span class="token punctuation">{</span>
    <span class="token keyword">private</span><span class="token operator">:</span>
        <span class="token keyword">static</span> cppjieba<span class="token double-colon punctuation">::</span>Jieba jieba<span class="token punctuation">;</span>

    <span class="token keyword">public</span><span class="token operator">:</span>
        <span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">SplitToWord</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>src<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span><span class="token operator">*</span> out<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token comment">// Use the jieba library to tokenize src and store the words in out</span>
            jieba<span class="token punctuation">.</span><span class="token function">CutForSearch</span><span class="token punctuation">(</span>src<span class="token punctuation">,</span><span class="token operator">*</span>out<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    <span class="token punctuation">}</span><span class="token punctuation">;</span>

    cppjieba<span class="token double-colon punctuation">::</span>Jieba <span class="token class-name">JiebaTool</span><span class="token double-colon punctuation">::</span><span class="token function">jieba</span><span class="token punctuation">(</span>DICT_PATH<span class="token punctuation">,</span>
            HMM_PATH<span class="token punctuation">,</span>
            USER_DICT_PATH<span class="token punctuation">,</span>
            IDF_PATH<span class="token punctuation">,</span>
            STOP_WORD_PATH<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
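  <p>One thing to watch with the layout above: the out-of-class definition of <code>JiebaTool::jieba</code> lives in a header, so if <code>tool.hpp</code> ends up included from more than one <code>.cc</code> file of the same program, the linker may report a duplicate definition. A minimal sketch of one possible alternative is a function-local static, shown below. <code>FakeSegmenter</code>, <code>SegmenterTool</code> and <code>Instance()</code> are hypothetical names; <code>FakeSegmenter</code> only splits on whitespace and stands in for <code>cppjieba::Jieba</code> so the example compiles without the library.</p>

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for cppjieba::Jieba so this sketch compiles on its own.
// It splits on whitespace; the real library segments with its dictionaries.
class FakeSegmenter {
public:
    explicit FakeSegmenter(const std::string &dict_path) : dict_path_(dict_path) {}

    void CutForSearch(const std::string &src, std::vector<std::string> &out) const {
        out.clear();  // behavior of this stand-in only
        std::istringstream in(src);
        std::string w;
        while (in >> w) out.push_back(w);
    }

private:
    std::string dict_path_;  // kept only to mirror a dictionary-path constructor
};

class SegmenterTool {
public:
    // Same shape as SplitToWord in tool.hpp: input string in, words out.
    static void SplitToWord(const std::string &src, std::vector<std::string> *out) {
        assert(out != nullptr);
        Instance().CutForSearch(src, *out);
    }

private:
    // Constructed on first use. In-class member functions are implicitly inline,
    // so this header can be included from several .cc files.
    static FakeSegmenter &Instance() {
        static FakeSegmenter segmenter("./dict/jieba.dict.utf8");
        return segmenter;
    }
};
```

  <p>Callers use it exactly like the original: <code>SegmenterTool::SplitToWord(text, &amp;words)</code>. Since C++11 the initialization of a function-local static is thread-safe, and the segmenter is only constructed if something actually tokenizes.</p>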
  <p>The complete code for building the inverted index is then as follows:</p> 
  <pre><code class="prism language-cpp"><span class="token keyword">private</span><span class="token operator">:</span>
    <span class="token keyword">bool</span> <span class="token function">BuildInvertedIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> DocInfo <span class="token operator">&</span>doc<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// The forward index is already built, so DocInfo holds [title, content, url, doc_id]</span>
        <span class="token comment">// word -> posting list</span>
  
        <span class="token comment">// Per-document occurrence counts for each word</span>
        <span class="token keyword">struct</span> <span class="token class-name">word_cnt</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">int</span> title_cnt<span class="token punctuation">;</span>
            <span class="token keyword">int</span> content_cnt<span class="token punctuation">;</span>
            <span class="token function">word_cnt</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">:</span><span class="token function">title_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token function">content_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span>
            <span class="token punctuation">{</span><span class="token punctuation">}</span>
        <span class="token punctuation">}</span><span class="token punctuation">;</span>
        std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string <span class="token punctuation">,</span> word_cnt<span class="token operator">></span> word_stat<span class="token punctuation">;</span><span class="token comment">// temporary map from each keyword to its counts</span>
  

        <span class="token comment">// Tokenize the title</span>
        std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> title_word<span class="token punctuation">;</span>
        ns_tool<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaTool</span><span class="token double-colon punctuation">::</span><span class="token function">SplitToWord</span><span class="token punctuation">(</span>doc<span class="token punctuation">.</span>title<span class="token punctuation">,</span><span class="token operator">&</span>title_word<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token comment">// Count word frequencies in the title</span>
        <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span> s<span class="token operator">:</span>title_word<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token comment">// Lower-case each title keyword so counting is case-insensitive (s is a copy, so the original keyword is untouched)</span>
            boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span>
            word_stat<span class="token punctuation">[</span>s<span class="token punctuation">]</span><span class="token punctuation">.</span>title_cnt<span class="token operator">++</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
  
        <span class="token comment">// Tokenize the content</span>
        std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> content_word<span class="token punctuation">;</span>
        ns_tool<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaTool</span><span class="token double-colon punctuation">::</span><span class="token function">SplitToWord</span><span class="token punctuation">(</span>doc<span class="token punctuation">.</span>content<span class="token punctuation">,</span><span class="token operator">&</span>content_word<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token comment">// Count word frequencies in the content</span>
        <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span> s<span class="token operator">:</span>content_word<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token comment">// Lower-case each content keyword so counting is case-insensitive (s is a copy, so the original keyword is untouched)</span>
            boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span>
            word_stat<span class="token punctuation">[</span>s<span class="token punctuation">]</span><span class="token punctuation">.</span>content_cnt<span class="token operator">++</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
  
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">X</span> <span class="token expression"><span class="token number">10</span></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">Y</span> <span class="token expression"><span class="token number">1</span></span></span>
        <span class="token comment">// Build the posting entries for every keyword in this doc</span>
        <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span>word_pair<span class="token operator">:</span>word_stat<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            InvertedElem elem<span class="token punctuation">;</span>
            elem<span class="token punctuation">.</span>doc_id<span class="token operator">=</span>doc<span class="token punctuation">.</span>doc_id<span class="token punctuation">;</span>
            elem<span class="token punctuation">.</span>word<span class="token operator">=</span>word_pair<span class="token punctuation">.</span>first<span class="token punctuation">;</span>
            <span class="token comment">// Custom relevance score: a title hit counts X times, a content hit Y times</span>
            elem<span class="token punctuation">.</span>weight<span class="token operator">=</span>word_pair<span class="token punctuation">.</span>second<span class="token punctuation">.</span>title_cnt<span class="token operator">*</span>X<span class="token operator">+</span>word_pair<span class="token punctuation">.</span>second<span class="token punctuation">.</span>content_cnt<span class="token operator">*</span>Y<span class="token punctuation">;</span>
      
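            <span class="token comment">// Push the inverted-index element for this keyword onto its posting list in the inverted index</span>
            <span class="token comment">// (Note: keywords were lower-cased when counting, so at search time the user's query words must be lower-cased too)</span>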

            InvertedList <span class="token operator">&</span>inverted_list<span class="token operator">=</span>inverted_index<span class="token punctuation">[</span>word_pair<span class="token punctuation">.</span>first<span class="token punctuation">]</span><span class="token punctuation">;</span>
            inverted_list<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
</code></pre> 
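  <p>To make the weighting concrete, here is a small self-contained sketch of the same counting scheme. It is not the project's code: <code>Posting</code>, <code>PostingIndex</code>, <code>SplitLower</code> and <code>AddDocument</code> are hypothetical names, and a whitespace split with ASCII lower-casing stands in for cppjieba so the example has no external dependency. Only the formula <code>title_cnt * 10 + content_cnt * 1</code> is taken from the code above.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// One posting: which document contains the word, and how relevant it is.
struct Posting {
    std::uint64_t doc_id;
    int weight;
};
using PostingIndex = std::unordered_map<std::string, std::vector<Posting>>;

// Stand-in tokenizer (assumption): whitespace split plus ASCII lower-casing,
// used only so this sketch does not depend on cppjieba.
std::vector<std::string> SplitLower(const std::string &text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) {
        std::transform(w.begin(), w.end(), w.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        words.push_back(w);
    }
    return words;
}

// Counts each word in the title and in the content, then appends one posting
// per distinct word with weight = title_cnt * 10 + content_cnt * 1.
void AddDocument(std::uint64_t doc_id, const std::string &title,
                 const std::string &content, PostingIndex *index) {
    assert(index != nullptr);
    struct Count {
        int in_title = 0;
        int in_content = 0;
    };
    std::unordered_map<std::string, Count> stat;
    for (const auto &w : SplitLower(title)) stat[w].in_title++;
    for (const auto &w : SplitLower(content)) stat[w].in_content++;
    for (const auto &kv : stat) {
        int weight = kv.second.in_title * 10 + kv.second.in_content * 1;
        (*index)[kv.first].push_back({doc_id, weight});
    }
}
```

  <p>For example, adding document 0 with the title "Boost Filesystem" and the content "the filesystem library in boost uses boost paths" gives the word "boost" a weight of 1 * 10 + 2 * 1 = 12 and "filesystem" a weight of 1 * 10 + 1 * 1 = 11. Adding a second document that mentions "boost" appends a second entry to the same posting list.</p>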
  <h1>4. Search Engine Module: Searcher</h1> 
  <p>Basic approach:</p> 
  <pre><code class="prism language-cpp"><span class="token comment">//searcher.hpp</span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"index.hpp"</span></span>

<span class="token keyword">namespace</span> ns_searcher
<span class="token punctuation">{</span>
    <span class="token keyword">class</span> <span class="token class-name">Searcher</span>
    <span class="token punctuation">{</span>
    <span class="token keyword">private</span><span class="token operator">:</span>
        ns_index<span class="token double-colon punctuation">::</span>Index <span class="token operator">*</span>index<span class="token punctuation">;</span>
    <span class="token keyword">public</span><span class="token operator">:</span>
        <span class="token keyword">void</span> <span class="token function">InitSearcher</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>input<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token comment">//1. Obtain the Index object (singleton)</span>
            <span class="token comment">//2. Build the index through the Index object</span>
        <span class="token punctuation">}</span>

        <span class="token comment">//Search</span>
        <span class="token comment">//json_string: the search result returned to the user's browser</span>
        <span class="token keyword">void</span> <span class="token function">Search</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> query<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">*</span> json_string<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
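            <span class="token comment">//1. [Tokenize]: the server also has to split the query into words before looking up the index</span>
            <span class="token comment">//2. [Trigger]: look up the index for each word</span>
            <span class="token comment">//3. [Merge and sort]: gather the results and sort them by relevance (weight) in descending order</span>
            <span class="token comment">//4. [Build]: serialize the sorted results into a JSON string with jsoncpp</span>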
        <span class="token punctuation">}</span>
    <span class="token punctuation">}</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
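  <h2>Initializing the searcher: InitSearcher</h2> 
  <p>This function does two things: it obtains the index object and builds the index.</p> 
  <p>Index is a singleton, so the object is obtained through its static accessor (<code>Getinstance()</code> in the code below):</p> 
  <p>Then <code>BuildIndex</code> is called to build the index.</p> 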
  <pre><code class="prism language-cpp"><span class="token keyword">void</span> <span class="token function">InitSearcher</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>input<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">//1. Obtain the Index object (singleton)</span>
    index<span class="token operator">=</span>ns_index<span class="token double-colon punctuation">::</span><span class="token class-name">Index</span><span class="token double-colon punctuation">::</span><span class="token function">Getinstance</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span><span class="token string">"Index singleton created..."</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
    <span class="token comment">//2. Build the index (input is the path of the file holding the tag-stripped, cleaned documents)</span>
    index<span class="token operator">-></span><span class="token function">BuildIndex</span><span class="token punctuation">(</span>input<span class="token punctuation">)</span><span class="token punctuation">;</span>
    std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span><span class="token string">"Index built..."</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
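  <p>For readers who have not seen the pattern, here is a minimal sketch of a singleton accessor followed by a build step. <code>DemoIndex</code> and its members are hypothetical and are not the project's <code>Index</code> class; the sketch uses a function-local static, which is only one of several ways to write a singleton.</p>

```cpp
#include <string>

// Hypothetical miniature of a singleton index; NOT the project's Index class.
class DemoIndex
{
public:
    // Function-local static: constructed once on first use (thread-safe since C++11).
    static DemoIndex *GetInstance()
    {
        static DemoIndex instance;
        return &instance;
    }

    // Pretend to build the index from a file; this sketch only remembers the path.
    bool BuildIndex(const std::string &input)
    {
        source_ = input;
        built_ = true;
        return true;
    }

    bool Built() const { return built_; }
    const std::string &Source() const { return source_; }

    // A singleton must not be copied.
    DemoIndex(const DemoIndex &) = delete;
    DemoIndex &operator=(const DemoIndex &) = delete;

private:
    DemoIndex() = default; // only GetInstance() can construct the object
    std::string source_;
    bool built_ = false;
};
```

  <p>Every caller of <code>DemoIndex::GetInstance()</code> receives the same pointer, so an index built once during initialization is visible to every later search.</p>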
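  <h2>Search function: Search</h2> 
  <ul> 
   <li> <p>[Tokenize]</p> <p>Reuse <code>SplitToWord</code>, the helper built on the jieba tokenizer, to split the user's query into words.</p> </li> 
   <li> <p>[Trigger]</p> <p>Call <code>GetInvertedList()</code> for each word to get that keyword's posting list (inverted list).</p> </li> 
   <li> <p>[Merge and sort]</p> <p>Gather the inverted elements from all posting lists and sort them by weight in descending order. Elements with the same document ID should be deduplicated; the first version of Search below does not do this yet, and deduplication is covered in the back-end optimization section.</p> </li> 
   <li> <p>[Build]<br> Use each inverted element's document ID to fetch the document from the forward index, then extract a snippet from its content. Once all documents are collected, serialize them into a string with the JSON library so the result is easy to send over the network.</p> <p>How much of the content to extract is a rule we define ourselves: find the position pos where the keyword first appears in the content, then take up to 50 bytes before it (starting from the beginning if fewer than 50 are available) and up to 100 bytes after it (stopping at the end if fewer are available).</p> </li> 
  </ul> 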
  <p><a href="http://img.e-com-net.com/image/info8/9cb44b4a25754d2bac4f237e443f6f80.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/9cb44b4a25754d2bac4f237e443f6f80.jpg" alt="[Project] A Boost-based site search engine, figure 16" width="650" height="280" style="border:1px solid black;"></a></p> 
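  <p>The merge-and-sort step described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: <code>Elem</code>, <code>MergedElem</code> and <code>MergeAndSort</code> are hypothetical names, and summing the weights of elements that share a document ID is one possible deduplication rule among several.</p>

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for an inverted element (doc_id, word, weight).
struct Elem
{
    uint64_t doc_id;
    std::string word;
    int weight;
};

// One merged result per document: the summed weight plus every query word that hit it.
struct MergedElem
{
    uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Merge several posting lists by doc_id, then sort by descending weight.
std::vector<MergedElem> MergeAndSort(const std::vector<std::vector<Elem>> &posting_lists)
{
    std::unordered_map<uint64_t, MergedElem> by_doc;
    for (const auto &list : posting_lists)
    {
        for (const Elem &e : list)
        {
            MergedElem &m = by_doc[e.doc_id]; // created the first time this doc_id is seen
            m.doc_id = e.doc_id;
            m.weight += e.weight;             // assumption: weights of the same document add up
            m.words.push_back(e.word);
        }
    }

    std::vector<MergedElem> result;
    result.reserve(by_doc.size());
    for (auto &kv : by_doc) result.push_back(std::move(kv.second));

    // Descending weight; ties are broken by doc_id so the order is deterministic.
    std::sort(result.begin(), result.end(), [](const MergedElem &a, const MergedElem &b) {
        if (a.weight != b.weight) return a.weight > b.weight;
        return a.doc_id < b.doc_id;
    });
    return result;
}
```

  <p>With this rule a document that matches two query words appears once, ranked by the combined weight, instead of appearing twice in the result list.</p>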
  <h3>Installing the JSON library and a usage example</h3> 
  <pre><code class="prism language-bash"><span class="token function">sudo</span> yum <span class="token function">install</span> -y jsoncpp-devel
</code></pre> 
  <p>Using jsoncpp:</p> 
  <pre><code class="prism language-cpp"><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><jsoncpp/json/json.h></span></span>

<span class="token comment">//Value, Reader (deserialization), Writer (serialization)</span>
<span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    Json<span class="token double-colon punctuation">::</span>Value root<span class="token punctuation">;</span>
    Json<span class="token double-colon punctuation">::</span>Value item1<span class="token punctuation">;</span>
    item1<span class="token punctuation">[</span><span class="token string">"key1"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token string">"value11"</span><span class="token punctuation">;</span>
    item1<span class="token punctuation">[</span><span class="token string">"key2"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token string">"value12"</span><span class="token punctuation">;</span>

    Json<span class="token double-colon punctuation">::</span>Value item2<span class="token punctuation">;</span>
    item2<span class="token punctuation">[</span><span class="token string">"key1"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token string">"value21"</span><span class="token punctuation">;</span>
    item2<span class="token punctuation">[</span><span class="token string">"key2"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token string">"value22"</span><span class="token punctuation">;</span>

    root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>item1<span class="token punctuation">)</span><span class="token punctuation">;</span>
    root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>item2<span class="token punctuation">)</span><span class="token punctuation">;</span>

    Json<span class="token double-colon punctuation">::</span>StyledWriter writer<span class="token punctuation">;</span>
    <span class="token comment">//Json::FastWriter writer;</span>
    std<span class="token double-colon punctuation">::</span>string s<span class="token operator">=</span>writer<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>root<span class="token punctuation">)</span><span class="token punctuation">;</span>
    std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span>s<span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
    <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
  <p><a href="http://img.e-com-net.com/image/info8/6aad665da6fd4c7fa584459166d72e70.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/6aad665da6fd4c7fa584459166d72e70.jpg" alt="[Project] A Boost-based site search engine, figure 17" width="582" height="226" style="border:1px solid black;"></a></p> 
  <h3>Complete Search code</h3> 
  <pre><code class="prism language-cpp"><span class="token keyword">public</span><span class="token operator">:</span>
    <span class="token comment">//Search</span>
    <span class="token comment">//json_string: the search result returned to the user's browser</span>
    <span class="token keyword">void</span> <span class="token function">Search</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> query<span class="token punctuation">,</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">*</span> json_string<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">//1. [Tokenize]: the query also has to be split into words on the server before the index lookup</span>
        std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> words<span class="token punctuation">;</span>
        ns_tool<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaTool</span><span class="token double-colon punctuation">::</span><span class="token function">SplitToWord</span><span class="token punctuation">(</span>query<span class="token punctuation">,</span><span class="token operator">&</span>words<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token comment">//2. [Trigger]: look up the index for each word; matching is case-insensitive, so each word is lowercased first</span>
        ns_index<span class="token double-colon punctuation">::</span>InvertedList inverted_list_all<span class="token punctuation">;</span>
        <span class="token keyword">for</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span>string word<span class="token operator">:</span>words<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token comment">//Fetch the posting list for this word</span>
            ns_index<span class="token double-colon punctuation">::</span>InvertedList <span class="token operator">*</span>inverted_list<span class="token operator">=</span>index<span class="token operator">-></span><span class="token function">GetInvertedList</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token comment">//Skip words that have no posting list</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>inverted_list<span class="token punctuation">)</span>
            <span class="token punctuation">{</span>
                <span class="token keyword">continue</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
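            <span class="token comment">//Append this keyword's inverted elements to the combined list</span>
            <span class="token comment">//Known limitation: when several keywords occur in the same document, the same document ID appears more than once in the combined list</span>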
            inverted_list_all<span class="token punctuation">.</span><span class="token function">insert</span><span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>inverted_list<span class="token operator">-></span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>inverted_list<span class="token operator">-></span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    
        <span class="token comment">//3. [Merge and sort]: sort the gathered results by relevance (weight) in descending order</span>
        std<span class="token double-colon punctuation">::</span><span class="token function">sort</span><span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> ns_index<span class="token double-colon punctuation">::</span>InvertedElem<span class="token operator">&</span> e1<span class="token punctuation">,</span><span class="token keyword">const</span> ns_index<span class="token double-colon punctuation">::</span>InvertedElem<span class="token operator">&</span> e2<span class="token punctuation">)</span><span class="token operator">-></span><span class="token keyword">bool</span><span class="token punctuation">{</span>
            <span class="token keyword">return</span> e1<span class="token punctuation">.</span>weight<span class="token operator">></span>e2<span class="token punctuation">.</span>weight<span class="token punctuation">;</span>
            <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

        <span class="token comment">//4. [Build]: serialize the results into a JSON string with jsoncpp</span>
        Json<span class="token double-colon punctuation">::</span>Value root<span class="token punctuation">;</span>
        <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> item<span class="token operator">:</span>inverted_list_all<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token comment">//Fetch the document from the forward index</span>
            ns_index<span class="token double-colon punctuation">::</span>DocInfo<span class="token operator">*</span> doc<span class="token operator">=</span>index<span class="token operator">-></span><span class="token function">GetForwardIndex</span><span class="token punctuation">(</span>item<span class="token punctuation">.</span>doc_id<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token keyword">nullptr</span><span class="token operator">==</span>doc<span class="token punctuation">)</span>
            <span class="token punctuation">{</span>
                <span class="token keyword">continue</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
            Json<span class="token double-colon punctuation">::</span>Value elem<span class="token punctuation">;</span>
            elem<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span><span class="token operator">=</span>doc<span class="token operator">-></span>title<span class="token punctuation">;</span>
            <span class="token comment">//content is the tag-stripped document text, which is too long to return whole, so GetAbstract extracts a snippet</span>
            elem<span class="token punctuation">[</span><span class="token string">"abstract"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token function">GetAbstract</span><span class="token punctuation">(</span>doc<span class="token operator">-></span>content<span class="token punctuation">,</span>item<span class="token punctuation">.</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span>
            elem<span class="token punctuation">[</span><span class="token string">"url"</span><span class="token punctuation">]</span><span class="token operator">=</span>doc<span class="token operator">-></span>url<span class="token punctuation">;</span>

            <span class="token comment">//for debug: check that results are sorted by descending weight</span>
            elem<span class="token punctuation">[</span><span class="token string">"doc_id"</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token keyword">int</span><span class="token punctuation">)</span>item<span class="token punctuation">.</span>doc_id<span class="token punctuation">;</span>
            elem<span class="token punctuation">[</span><span class="token string">"weight"</span><span class="token punctuation">]</span><span class="token operator">=</span>item<span class="token punctuation">.</span>weight<span class="token punctuation">;</span>

            root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>

        Json<span class="token double-colon punctuation">::</span>StyledWriter writer<span class="token punctuation">;</span>
        <span class="token operator">*</span>json_string<span class="token operator">=</span>writer<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>root<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
</code></pre> 
  <p>Extracting the snippet:</p> 
  <pre><code class="prism language-cpp"><span class="token keyword">public</span><span class="token operator">:</span>
    std<span class="token double-colon punctuation">::</span>string <span class="token function">GetAbstract</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> html_content<span class="token punctuation">,</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> word<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">//Find where word first appears in html_content,</span>
        <span class="token comment">//then take up to 50 bytes before it (from the beginning if fewer exist) and up to 100 bytes after it (to the end if fewer exist)</span>
    
        <span class="token keyword">const</span> <span class="token keyword">int</span> prev_step<span class="token operator">=</span><span class="token number">50</span><span class="token punctuation">;</span>
        <span class="token keyword">const</span> <span class="token keyword">int</span> post_step<span class="token operator">=</span><span class="token number">100</span><span class="token punctuation">;</span>
        <span class="token comment">//1. Find the first occurrence with std::search, comparing bytes case-insensitively</span>
        <span class="token comment">//   (the parameters are unsigned char: passing a negative char, e.g. a UTF-8 byte, to std::tolower is undefined behavior)</span>
        <span class="token keyword">auto</span> iter<span class="token operator">=</span>std<span class="token double-colon punctuation">::</span><span class="token function">search</span><span class="token punctuation">(</span>html_content<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>html_content<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>word<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>word<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">unsigned</span> <span class="token keyword">char</span> a<span class="token punctuation">,</span><span class="token keyword">unsigned</span> <span class="token keyword">char</span> b<span class="token punctuation">)</span><span class="token punctuation">{</span>
            <span class="token keyword">return</span> <span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">tolower</span><span class="token punctuation">(</span>a<span class="token punctuation">)</span><span class="token operator">==</span>std<span class="token double-colon punctuation">::</span><span class="token function">tolower</span><span class="token punctuation">(</span>b<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span>iter<span class="token operator">==</span>html_content<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">return</span> <span class="token string">"Not Found"</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token keyword">int</span> pos<span class="token operator">=</span>std<span class="token double-colon punctuation">::</span><span class="token function">distance</span><span class="token punctuation">(</span>html_content<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span>iter<span class="token punctuation">)</span><span class="token punctuation">;</span>

        <span class="token comment">//2. Compute start and last (last is one past the final byte, so the final byte is not dropped)</span>
        <span class="token keyword">int</span> start<span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">;</span>
        <span class="token keyword">int</span> last<span class="token operator">=</span>html_content<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token comment">//If more than 50 bytes precede pos, move start forward</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span>pos<span class="token operator">></span>start<span class="token operator">+</span>prev_step<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            start<span class="token operator">=</span>pos<span class="token operator">-</span>prev_step<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">//If more than 100 bytes follow pos, move last back</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span>pos<span class="token operator">+</span>post_step<span class="token operator"><</span>last<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            last<span class="token operator">=</span>pos<span class="token operator">+</span>post_step<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">//3. Cut out the substring and return it</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span>start<span class="token operator">>=</span>last<span class="token punctuation">)</span> <span class="token keyword">return</span> <span class="token string">"None"</span><span class="token punctuation">;</span> 
        <span class="token keyword">return</span> html_content<span class="token punctuation">.</span><span class="token function">substr</span><span class="token punctuation">(</span>start<span class="token punctuation">,</span>last<span class="token operator">-</span>start<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span> 
</code></pre> 
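  <p>Because the window above is measured in bytes, it can cut a multi-byte UTF-8 character in half at either edge. The following is a minimal sketch of the same snippet rule with the edges nudged onto character boundaries. <code>MakeSnippet</code> and <code>IsUtf8Continuation</code> are hypothetical names, the content is assumed to be valid UTF-8, and the function returns an empty string rather than a placeholder when nothing matches.</p>

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool IsUtf8Continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

// Up to `before` bytes ahead of the first case-insensitive match and up to
// `after` bytes from the match onward, without splitting a UTF-8 character.
std::string MakeSnippet(const std::string &content, const std::string &word,
                        std::size_t before = 50, std::size_t after = 100)
{
    auto it = std::search(content.begin(), content.end(), word.begin(), word.end(),
                          [](unsigned char a, unsigned char b) {
                              return std::tolower(a) == std::tolower(b);
                          });
    if (word.empty() || it == content.end()) return "";
    const std::size_t pos = static_cast<std::size_t>(it - content.begin());

    std::size_t start = pos > before ? pos - before : 0;
    std::size_t end = std::min(content.size(), pos + after); // one past the last byte

    // If start landed inside a character, move forward to the next lead byte.
    while (start < pos && IsUtf8Continuation(static_cast<unsigned char>(content[start]))) ++start;
    // If end landed inside a character, extend it to include the whole character.
    while (end < content.size() && IsUtf8Continuation(static_cast<unsigned char>(content[end]))) ++end;

    return content.substr(start, end - start);
}
```

  <p>Unsigned sizes are used throughout so that the subtraction <code>pos - before</code> is only performed when it cannot wrap around.</p>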
  <h2>Testing</h2> 
  <p>Before the network module is finished, we can test locally to check whether searching for a keyword returns the results we expect:</p> 
  <pre><code class="prism language-cpp"><span class="token comment">//debug.cc</span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"searcher.hpp"</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><iostream></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><cstdio></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><string></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string"><cstring></span></span>
<span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string input<span class="token operator">=</span><span class="token string">"data/raw_html/raw.txt"</span><span class="token punctuation">;</span>
<span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">//for test</span>
    ns_searcher<span class="token double-colon punctuation">::</span>Searcher <span class="token operator">*</span>search<span class="token operator">=</span><span class="token keyword">new</span> ns_searcher<span class="token double-colon punctuation">::</span>Searcher<span class="token punctuation">;</span>
    search<span class="token operator">-></span><span class="token function">InitSearcher</span><span class="token punctuation">(</span>input<span class="token punctuation">)</span><span class="token punctuation">;</span>

    std<span class="token double-colon punctuation">::</span>string query<span class="token punctuation">;</span>
    <span class="token keyword">char</span> buffer<span class="token punctuation">[</span><span class="token number">1024</span><span class="token punctuation">]</span><span class="token punctuation">;</span>
    <span class="token keyword">while</span><span class="token punctuation">(</span><span class="token boolean">true</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span><span class="token string">"please enter the query"</span><span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token function">fgets</span><span class="token punctuation">(</span>buffer<span class="token punctuation">,</span><span class="token keyword">sizeof</span><span class="token punctuation">(</span>buffer<span class="token punctuation">)</span><span class="token punctuation">,</span><span class="token constant">stdin</span><span class="token punctuation">)</span><span class="token operator">==</span><span class="token keyword">nullptr</span><span class="token punctuation">)</span> <span class="token keyword">break</span><span class="token punctuation">;</span><span class="token comment">//stop on EOF or a read error instead of looping forever</span>
        buffer<span class="token punctuation">[</span><span class="token function">strcspn</span><span class="token punctuation">(</span>buffer<span class="token punctuation">,</span><span class="token string">"\n"</span><span class="token punctuation">)</span><span class="token punctuation">]</span><span class="token operator">=</span><span class="token number">0</span><span class="token punctuation">;</span><span class="token comment">//strip the trailing newline, if present</span>
        query<span class="token operator">=</span>buffer<span class="token punctuation">;</span>
      
        std<span class="token double-colon punctuation">::</span>string ans<span class="token punctuation">;</span>
        search<span class="token operator">-></span><span class="token function">Search</span><span class="token punctuation">(</span>query<span class="token punctuation">,</span><span class="token operator">&</span>ans<span class="token punctuation">)</span><span class="token punctuation">;</span>
        std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span>ans<span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
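  <p>The interactive loop above can only be exercised by typing. One way to make the same loop testable without a terminal is to read from any input stream and pass the search step in as a callback. This is a sketch under assumptions: <code>RunQueries</code> is a hypothetical helper, and the callback merely stands in for <code>Searcher::Search</code>.</p>

```cpp
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Read one query per line from `in`, pass it to `search`, and print the answer to `out`.
// `search` is a hypothetical callback with the same shape as Searcher::Search.
// Returns the number of queries handled; stops cleanly at end of input.
int RunQueries(std::istream &in, std::ostream &out,
               const std::function<void(const std::string &, std::string *)> &search)
{
    int handled = 0;
    std::string query;
    while (std::getline(in, query)) // getline already drops the trailing '\n'
    {
        if (!query.empty() && query.back() == '\r') query.pop_back(); // tolerate CRLF input
        if (query.empty()) continue;                                  // skip blank lines
        std::string ans;
        search(query, &ans);
        out << ans << '\n';
        ++handled;
    }
    return handled;
}
```

  <p>In a real run the streams would be <code>std::cin</code> and <code>std::cout</code>; in a test, a <code>std::istringstream</code> holding a few queries and a <code>std::ostringstream</code> for the output are enough.</p>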
  <h1>5. Server setup: the http_server module</h1> 
  <p>cpp-httplib library: https://gitee.com/sumert/cpp-httplib/tree/v0.7.15</p> 
  <p>(If the link no longer works, search for <code>cpp-httplib</code> on gitee.)</p> 
  <p>Note: cpp-httplib needs a fairly recent gcc; with an old one, compilation fails.</p> 
  <p>The default gcc on our cloud server is gcc 4.8.5:</p> 
  <pre><code class="prism language-bash"><span class="token punctuation">[</span>sjl@VM-16-6-centos ~<span class="token punctuation">]</span>$ gcc -v
Using built-in specs.
<span class="token assign-left variable">COLLECT_GCC</span><span class="token operator">=</span>gcc
<span class="token assign-left variable">COLLECT_LTO_WRAPPER</span><span class="token operator">=</span>/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: <span class="token punctuation">..</span>/configure --prefix<span class="token operator">=</span>/usr --mandir<span class="token operator">=</span>/usr/share/man --infodir<span class="token operator">=</span>/usr/share/info --with-bugurl<span class="token operator">=</span>http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads<span class="token operator">=</span>posix --enable-checking<span class="token operator">=</span>release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style<span class="token operator">=</span>gnu --enable-languages<span class="token operator">=</span>c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl<span class="token operator">=</span>/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog<span class="token operator">=</span>/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune<span class="token operator">=</span>generic --with-arch_32<span class="token operator">=</span>x86-64 --build<span class="token operator">=</span>x86_64-redhat-linux
Thread model: posix
gcc version <span class="token number">4.8</span>.5 <span class="token number">20150623</span> <span class="token punctuation">(</span>Red Hat <span class="token number">4.8</span>.5-44<span class="token punctuation">)</span> <span class="token punctuation">(</span>GCC<span class="token punctuation">)</span> 
</code></pre> 
  <p><strong>So we need to upgrade gcc:</strong></p> 
  <p>Upgrading/installing gcc on CentOS 7</p> 
  <pre><code class="prism language-cpp"><span class="token comment"># Install scl</span>
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ sudo yum install centos<span class="token operator">-</span>release<span class="token operator">-</span>scl scl<span class="token operator">-</span>utils<span class="token operator">-</span>build

<span class="token comment"># Install the newer gcc (the C++ compiler package is devtoolset-7-gcc-c++)</span>
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ sudo yum install <span class="token operator">-</span>y devtoolset<span class="token operator">-</span><span class="token number">7</span><span class="token operator">-</span>gcc devtoolset<span class="token operator">-</span><span class="token number">7</span><span class="token operator">-</span>gcc<span class="token operator">-</span>c<span class="token operator">++</span>

<span class="token comment"># List the installed toolsets</span>
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ ls <span class="token operator">/</span>opt<span class="token operator">/</span>rh
devtoolset<span class="token operator">-</span><span class="token number">7</span>
</code></pre> 
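  <p>The toolset does not replace the system default gcc, so it has to be enabled manually.</p> 
  <p>Enabling it from the command line only lasts for the current session.</p> 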
  <pre><code class="prism language-cpp"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ scl enable devtoolset<span class="token operator">-</span><span class="token number">7</span> bash
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ gcc <span class="token operator">-</span>v
</code></pre> 
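  <p>To confirm which compiler actually built a binary after enabling the toolset, a small check can read the version macros that gcc predefines. The sketch below is illustrative only: the function names <code>CompilerVersion</code> and <code>BuiltWithGccAtLeast</code> are made up for this example, and other gcc-compatible compilers also define these macros, so the result describes the reported version rather than proving that gcc itself was used.</p>

```cpp
#include <cassert>
#include <string>

// Describes the compiler that built this translation unit.
// __GNUC__, __GNUC_MINOR__ and __GNUC_PATCHLEVEL__ are predefined by gcc/g++.
std::string CompilerVersion()
{
#if defined(__GNUC__)
    return "gcc-compatible " + std::to_string(__GNUC__) + "." +
           std::to_string(__GNUC_MINOR__) + "." +
           std::to_string(__GNUC_PATCHLEVEL__);
#else
    return "not a gcc-compatible compiler";
#endif
}

// True when the compiler reports a gcc major version of at least `major`.
bool BuiltWithGccAtLeast(int major)
{
    assert(major >= 0); // a negative threshold is a caller bug
#if defined(__GNUC__)
    return __GNUC__ >= major;
#else
    return false;
#endif
}
```

  <p>A test program that prints <code>CompilerVersion()</code> should report a 7.x version when compiled inside the <code>scl enable devtoolset-7 bash</code> session, and 4.8.5 when compiled with the system gcc.</p>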
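  <p>To make it permanent, the command has to run automatically at login. Add the following line to <code>~/.bash_profile</code>:</p> 
  <p><code>scl enable devtoolset-7 bash</code></p> 
  <p>Note that this starts a nested bash at every login. A common alternative is <code>source /opt/rh/devtoolset-7/enable</code>, which enables the toolset in the current shell instead.</p> 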
  <pre><code class="prism language-cpp"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ vim <span class="token operator">~</span><span class="token operator">/</span><span class="token punctuation">.</span>bash_profile 
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos <span class="token operator">~</span><span class="token punctuation">]</span>$ cat <span class="token operator">~</span><span class="token operator">/</span><span class="token punctuation">.</span>bash_profile 
# <span class="token punctuation">.</span>bash_profile

<span class="token macro property"><span class="token directive-hash">#</span> <span class="token expression">Get the aliases <span class="token operator">and</span> functions</span></span>
<span class="token keyword">if</span> <span class="token punctuation">[</span> <span class="token operator">-</span>f <span class="token operator">~</span><span class="token operator">/</span><span class="token punctuation">.</span>bashrc <span class="token punctuation">]</span><span class="token punctuation">;</span> then
	<span class="token punctuation">.</span> <span class="token operator">~</span><span class="token operator">/</span><span class="token punctuation">.</span>bashrc
fi

<span class="token macro property"><span class="token directive-hash">#</span> <span class="token expression">User specific environment <span class="token operator">and</span> startup programs</span></span>

PATH<span class="token operator">=</span>$PATH<span class="token operator">:</span>$HOME<span class="token operator">/</span><span class="token punctuation">.</span>local<span class="token operator">/</span>bin<span class="token operator">:</span>$HOME<span class="token operator">/</span>bin

<span class="token keyword">export</span> PATH


# Run this scl command at every login
scl enable devtoolset<span class="token operator">-</span><span class="token number">7</span> bash
</code></pre> 
  <p><strong>Installing <code>cpp-httplib</code></strong></p> 
  <p>If gcc is not particularly recent, cpp-httplib may run into runtime errors.</p> 
  <p>We therefore recommend cpp-httplib 0.7.15.</p> 
  <p>Download it from the link above:</p> 
  <p><a href="http://img.e-com-net.com/image/info8/a1ec25c48fa24d1c861d3a0341f6bd9b.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/a1ec25c48fa24d1c861d3a0341f6bd9b.jpg" alt="Boost-based site search engine, figure 18" width="582" height="226" style="border:1px solid black;"></a></p> 
  <p>Put the archive in the <code>thirdpart</code> directory and extract it (unzip):</p> 
  <pre><code class="prism language-cpp"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos thirdpart<span class="token punctuation">]</span>$ ll
total <span class="token number">8</span>
drwxrwxr<span class="token operator">-</span>x <span class="token number">6</span> sjl sjl <span class="token number">4096</span> Jul <span class="token number">28</span> <span class="token number">15</span><span class="token operator">:</span><span class="token number">50</span> cpp<span class="token operator">-</span>httplib<span class="token operator">-</span>v0<span class="token punctuation">.</span><span class="token number">7.15</span>
drwxrwxr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl <span class="token number">4096</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">45</span> cppjieba
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos thirdpart<span class="token punctuation">]</span>$ 
</code></pre> 
  <p>Create a symbolic link in the project directory:</p> 
  <pre><code class="prism language-cpp"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ln <span class="token operator">-</span>s <span class="token operator">~</span><span class="token operator">/</span>thirdpart<span class="token operator">/</span>cpp<span class="token operator">-</span>httplib<span class="token operator">-</span>v0<span class="token punctuation">.</span><span class="token number">7.15</span><span class="token operator">/</span> cpp<span class="token operator">-</span>httplib
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ ll
total <span class="token number">1532</span>
drwxr<span class="token operator">-</span>xr<span class="token operator">-</span>x <span class="token number">8</span> sjl sjl   <span class="token number">4096</span> Apr  <span class="token number">7</span> <span class="token number">05</span><span class="token operator">:</span><span class="token number">33</span> boost_1_79_0
lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">40</span> Jul <span class="token number">28</span> <span class="token number">15</span><span class="token operator">:</span><span class="token number">54</span> cpp<span class="token operator">-</span>httplib <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cpp<span class="token operator">-</span>httplib<span class="token operator">-</span>v0<span class="token punctuation">.</span><span class="token number">7.15</span><span class="token operator">/</span>
lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">46</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">46</span> cppjieba <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>include<span class="token operator">/</span>cppjieba<span class="token operator">/</span>
drwxrwxr<span class="token operator">-</span>x <span class="token number">4</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">19</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">37</span> data
<span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">608144</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> debug
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">640</span> Jul <span class="token number">28</span> <span class="token number">01</span><span class="token operator">:</span><span class="token number">05</span> debug<span class="token punctuation">.</span>cc
lrwxrwxrwx <span class="token number">1</span> sjl sjl     <span class="token number">34</span> Jul <span class="token number">23</span> <span class="token number">20</span><span class="token operator">:</span><span class="token number">47</span> dict <span class="token operator">-></span> <span class="token operator">/</span>home<span class="token operator">/</span>sjl<span class="token operator">/</span>thirdpart<span class="token operator">/</span>cppjieba<span class="token operator">/</span>dict<span class="token operator">/</span>
<span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">409408</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> http_server
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl     <span class="token number">58</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> http_server<span class="token punctuation">.</span>cc
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">7489</span> Jul <span class="token number">27</span> <span class="token number">16</span><span class="token operator">:</span><span class="token number">08</span> index<span class="token punctuation">.</span>hpp
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl    <span class="token number">360</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> Makefile
<span class="token operator">-</span>rwxrwxr<span class="token operator">-</span>x <span class="token number">1</span> sjl sjl <span class="token number">492840</span> Jul <span class="token number">28</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">44</span> parser
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">6088</span> Jul <span class="token number">22</span> <span class="token number">12</span><span class="token operator">:</span><span class="token number">31</span> parser<span class="token punctuation">.</span>cc
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">4654</span> Jul <span class="token number">28</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">17</span> searcher<span class="token punctuation">.</span>hpp
drwxrwxr<span class="token operator">-</span>x <span class="token number">3</span> sjl sjl   <span class="token number">4096</span> Jul <span class="token number">28</span> <span class="token number">15</span><span class="token operator">:</span><span class="token number">47</span> test
<span class="token operator">-</span>rw<span class="token operator">-</span>rw<span class="token operator">-</span>r<span class="token operator">--</span> <span class="token number">1</span> sjl sjl   <span class="token number">2047</span> Jul <span class="token number">27</span> <span class="token number">00</span><span class="token operator">:</span><span class="token number">43</span> tool<span class="token punctuation">.</span>hpp
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ 

</code></pre> 
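  <p>Create the web root directory (it will later hold the home page and other resources), then create an HTML file inside <code>WWWROOT</code>:</p> 
  <pre><code class="prism language-cpp"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ mkdir WWWROOT
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ cd WWWROOT
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos WWWROOT<span class="token punctuation">]</span>$ touch index<span class="token punctuation">.</span>html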
</code></pre> 
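  <p>The web root is the directory that request paths are resolved against. As a rough model of that idea (not cpp-httplib's actual code), the hypothetical helper <code>ResolveStaticPath</code> below maps a request path to a file under the root, falls back to <code>index.html</code> for a directory request, and rejects any path containing <code>..</code>. The helper name and its rejection rule are assumptions made for illustration.</p>

```cpp
#include <cassert>
#include <string>

// Maps a request path such as "/" or "/css/site.css" to a file path under
// `root`. Returns an empty string when the request path is rejected.
// Hypothetical helper for illustration; the real library has its own logic.
std::string ResolveStaticPath(const std::string &root, const std::string &request_path)
{
    assert(!root.empty()); // the caller must supply a web root

    // Only absolute request paths are meaningful.
    if (request_path.empty() || request_path[0] != '/')
        return "";
    // Refuse anything that could climb out of the web root.
    if (request_path.find("..") != std::string::npos)
        return "";

    std::string path = root + request_path;
    // A directory request falls back to that directory's index.html.
    if (path.back() == '/')
        path += "index.html";
    return path;
}
```

  <p>Under this model a request for <code>/</code> with the root <code>./WWWROOT</code> resolves to <code>./WWWROOT/index.html</code>, which is why the file created above can act as the home page.</p>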
  <h2>Basic cpp-httplib Usage Test</h2> 
  <pre><code class="prism language-cpp"><span class="token comment">//http_server.cc</span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"searcher.hpp"</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cpp-httplib/httplib.h"</span></span>

<span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string root_path<span class="token operator">=</span><span class="token string">"./WWWROOT"</span><span class="token punctuation">;</span>
<span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    httplib<span class="token double-colon punctuation">::</span>Server svr<span class="token punctuation">;</span>

    <span class="token comment">// Serve static files from the web root; index.html acts as the home page</span>
    svr<span class="token punctuation">.</span><span class="token function">set_base_dir</span><span class="token punctuation">(</span>root_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

    svr<span class="token punctuation">.</span><span class="token function">Get</span><span class="token punctuation">(</span><span class="token string">"/hi"</span><span class="token punctuation">,</span><span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> httplib<span class="token double-colon punctuation">::</span>Request <span class="token operator">&</span>req<span class="token punctuation">,</span>httplib<span class="token double-colon punctuation">::</span>Response <span class="token operator">&</span>rsp<span class="token punctuation">)</span><span class="token punctuation">{</span>
        rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span><span class="token string">"gogogogogo"</span><span class="token punctuation">,</span><span class="token string">"text/plain; charset=utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    svr<span class="token punctuation">.</span><span class="token function">listen</span><span class="token punctuation">(</span><span class="token string">"0.0.0.0"</span><span class="token punctuation">,</span><span class="token number">8081</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
  <pre><code class="prism language-html"><span class="token comment"><!-- index.html --></span>

<span class="token doctype"><span class="token punctuation"><!</span><span class="token doctype-tag">DOCTYPE</span> <span class="token name">html</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>html</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>head</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">charset</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>UTF-8<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>title</span><span class="token punctuation">></span></span> for test <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>title</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>head</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>body</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>h1</span><span class="token punctuation">></span></span>Hello World!<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>h1</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>This is an httplib test<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>body</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>html</span><span class="token punctuation">></span></span>

</code></pre> 
  <p>Compile and run:</p> 
  <pre><code class="prism language-cpp"><span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ g<span class="token operator">++</span> <span class="token operator">-</span>o http_server http_server<span class="token punctuation">.</span>cc <span class="token operator">-</span>std<span class="token operator">=</span>c<span class="token operator">++</span><span class="token number">11</span> <span class="token operator">-</span>ljsoncpp <span class="token operator">-</span>lpthread
<span class="token punctuation">[</span>sjl@VM<span class="token operator">-</span><span class="token number">16</span><span class="token operator">-</span><span class="token number">6</span><span class="token operator">-</span>centos boost_searcher<span class="token punctuation">]</span>$ <span class="token punctuation">.</span><span class="token operator">/</span>http_server
</code></pre> 
  <p><a href="http://img.e-com-net.com/image/info8/ebc0f511ac294dc2a89f3af368cd7298.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/ebc0f511ac294dc2a89f3af368cd7298.jpg" alt="Boost-based site search engine, figure 19" width="650" height="259" style="border:1px solid black;"></a></p> 
  <p><a href="http://img.e-com-net.com/image/info8/5fdf0374ac3b4197800d7fbe6c302780.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/5fdf0374ac3b4197800d7fbe6c302780.jpg" alt="Boost-based site search engine, figure 20" width="650" height="231" style="border:1px solid black;"></a></p> 
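  <p>Conceptually, <code>svr.Get("/hi", ...)</code> registers a handler that the server later looks up by request path. The sketch below models only that idea with an exact-match table; <code>MiniRouter</code> and <code>MiniResponse</code> are hypothetical names invented here, and the real library does considerably more (sockets, HTTP parsing, threading) that this sketch leaves out.</p>

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical miniature of a GET route table; not cpp-httplib's real code.
struct MiniResponse
{
    int status = 404;          // stays 404 unless a handler is found
    std::string body;
    std::string content_type;
};

class MiniRouter
{
public:
    using Handler = std::function<void(MiniResponse &)>;

    // Remember which handler serves which exact path.
    void Get(const std::string &path, Handler handler)
    {
        assert(handler != nullptr); // an empty handler is a caller bug
        handlers_[path] = std::move(handler);
    }

    // Look the path up and run its handler; unknown paths stay 404.
    MiniResponse Dispatch(const std::string &path) const
    {
        MiniResponse rsp;
        auto it = handlers_.find(path);
        if (it != handlers_.end())
        {
            rsp.status = 200;
            it->second(rsp);
        }
        return rsp;
    }

private:
    std::map<std::string, Handler> handlers_;
};
```

  <p>Registering a <code>/hi</code> handler and dispatching <code>/hi</code> runs the lambda, while any unregistered path keeps the default 404 status. This matches the behaviour seen in the browser test: only the routes you register (plus files under the web root) produce a response body.</p>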
  <h2>Writing the HttpServer Module</h2> 
  <pre><code class="prism language-cpp"><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"searcher.hpp"</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cpp-httplib/httplib.h"</span></span>

<span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string root_path<span class="token operator">=</span><span class="token string">"./WWWROOT"</span><span class="token punctuation">;</span>
<span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string input<span class="token operator">=</span><span class="token string">"data/raw_html/raw.txt"</span><span class="token punctuation">;</span>

<span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">// Create the searcher and initialize it</span>
    ns_searcher<span class="token double-colon punctuation">::</span>Searcher search<span class="token punctuation">;</span>
    search<span class="token punctuation">.</span><span class="token function">InitSearcher</span><span class="token punctuation">(</span>input<span class="token punctuation">)</span><span class="token punctuation">;</span>

    httplib<span class="token double-colon punctuation">::</span>Server svr<span class="token punctuation">;</span>
    <span class="token comment">// Serve static files from the web root; index.html acts as the home page</span>
    svr<span class="token punctuation">.</span><span class="token function">set_base_dir</span><span class="token punctuation">(</span>root_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>

    svr<span class="token punctuation">.</span><span class="token function">Get</span><span class="token punctuation">(</span><span class="token string">"/s"</span><span class="token punctuation">,</span><span class="token punctuation">[</span><span class="token operator">&</span>search<span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> httplib<span class="token double-colon punctuation">::</span>Request <span class="token operator">&</span>req<span class="token punctuation">,</span>httplib<span class="token double-colon punctuation">::</span>Response <span class="token operator">&</span>rsp<span class="token punctuation">)</span><span class="token punctuation">{</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>req<span class="token punctuation">.</span><span class="token function">has_param</span><span class="token punctuation">(</span><span class="token string">"word"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token comment">// the request has no "word" parameter</span>
        <span class="token punctuation">{</span>
            rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span><span class="token string">"Please enter a search term!"</span><span class="token punctuation">,</span><span class="token string">"text/plain; charset=utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token comment">// respond with a plain-text Content-Type</span>
            <span class="token keyword">return</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        std<span class="token double-colon punctuation">::</span>string word<span class="token operator">=</span>req<span class="token punctuation">.</span><span class="token function">get_param_value</span><span class="token punctuation">(</span><span class="token string">"word"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        std<span class="token double-colon punctuation">::</span>cout<span class="token operator"><<</span><span class="token string">"User search term: "</span><span class="token operator"><<</span>word<span class="token operator"><<</span>std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token comment">// Run the search and return the JSON result</span>
        std<span class="token double-colon punctuation">::</span>string json_string<span class="token punctuation">;</span>
        search<span class="token punctuation">.</span><span class="token function">Search</span><span class="token punctuation">(</span>word<span class="token punctuation">,</span><span class="token operator">&</span>json_string<span class="token punctuation">)</span><span class="token punctuation">;</span>
        rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span>json_string<span class="token punctuation">,</span><span class="token string">"application/json"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    svr<span class="token punctuation">.</span><span class="token function">listen</span><span class="token punctuation">(</span><span class="token string">"0.0.0.0"</span><span class="token punctuation">,</span><span class="token number">8081</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span>
</code></pre> 
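  <p>The <code>/s</code> handler makes two decisions: reply with plain text when the <code>word</code> parameter is missing, otherwise run the search and reply with JSON. One way to exercise that logic without a network is to factor it into a plain function, as in the sketch below. <code>HandleSearch</code> and <code>SearchReply</code> are hypothetical names for this illustration; <code>params</code> stands in for the parsed query string and <code>search</code> stands in for <code>Searcher::Search(word, &amp;json_string)</code>.</p>

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical, network-free model of the "/s" handler's decisions.
struct SearchReply
{
    std::string body;
    std::string content_type;
};

// Callback shape assumed to mirror Searcher::Search(word, &json_string).
using SearchFn = std::function<void(const std::string &, std::string *)>;

SearchReply HandleSearch(const std::map<std::string, std::string> &params,
                         const SearchFn &search)
{
    assert(search != nullptr); // a search backend must be supplied

    // No "word" parameter: answer with a plain-text hint.
    auto it = params.find("word");
    if (it == params.end())
        return {"Please enter a search term!", "text/plain; charset=utf-8"};

    // Otherwise run the search and hand back its JSON output unchanged.
    std::string json_string;
    search(it->second, &json_string);
    return {json_string, "application/json"};
}
```

  <p>In the real server the lambda passed to <code>svr.Get</code> would do the same thing using <code>req.has_param</code>, <code>req.get_param_value</code> and <code>rsp.set_content</code>; the sketch only separates the branching so it can be checked with a stub search function.</p>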
  <p>OK, the back end is now largely complete; next comes the front end.</p> 
  <h1>6. Front-End Module</h1> 
  <h2>HTML Page Skeleton</h2> 
  <pre><code class="prism language-html"><span class="token doctype"><span class="token punctuation"><!</span><span class="token doctype-tag">DOCTYPE</span> <span class="token name">html</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>html</span> <span class="token attr-name">lang</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>en<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>head</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">charset</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>UTF-8<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">http-equiv</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>X-UA-Compatible<span class="token punctuation">"</span></span> <span class="token attr-name">content</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>IE=edge<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">name</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>viewport<span class="token punctuation">"</span></span> <span class="token attr-name">content</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>width=device-width, initial-scale=1.0<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>title</span><span class="token punctuation">></span></span>BOOST搜索引擎<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>title</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>head</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>body</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>container<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>search<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>input</span> <span class="token attr-name">type</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>text<span class="token punctuation">"</span></span> <span class="token attr-name">value</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>输入搜索关键字<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>button</span><span class="token punctuation">></span></span>Search<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>button</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>

        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>result<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>这是标题<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>这是标题<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>这是标题<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>这是标题<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>这是标题<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>item<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>a</span> <span class="token attr-name">href</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>#<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>这是标题<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>a</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>p</span><span class="token punctuation">></span></span>这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要这是摘要<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>p</span><span class="token punctuation">></span></span>
                <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>i</span><span class="token punctuation">></span></span>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>i</span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>body</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>html</span><span class="token punctuation">></span></span>
</code></pre> 
  <p><a href="http://img.e-com-net.com/image/info8/49590c462d3a4466a408a42c196bf752.jpg" target="_blank"><img src="http://img.e-com-net.com/image/info8/49590c462d3a4466a408a42c196bf752.jpg" alt="【项目】 基于BOOST的站内搜索引擎_第21张图片" width="650" height="479" style="border:1px solid black;"></a></p> 
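  <h2>CSS: Styling the Page</h2> 
  <p>Styling a page comes down to selecting tags and setting their properties. Here the rules are written directly in the HTML file, right after the title element.</p> 
  <ol> 
   <li>Select the target tags: class selectors, tag selectors, and compound selectors</li> 
   <li>Set properties on the selected tags</li> 
  </ol> 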
  <pre><code class="prism language-css"><!DOCTYPE html>
<html lang=<span class="token string">"en"</span>>
<head>
    <meta charset=<span class="token string">"UTF-8"</span>>
    <meta http-equiv=<span class="token string">"X-UA-Compatible"</span> content=<span class="token string">"IE=edge"</span>>
    <meta name=<span class="token string">"viewport"</span> content=<span class="token string">"width=device-width, initial-scale=1.0"</span>>
    <title>BOOST搜索引擎</title>
    <style>
        /* CSS rules */
        /* ... */
    </style>
</code></pre>

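Implementing Navigation with JavaScript

Using native JS (XMLHttpRequest) takes more work, so we use jQuery here.

Add an external script reference to the HTML to load the jQuery library:

<script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>

Insert the following code into the HTML file: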


    </div>
    <script>
        function Search(){
            // alert() shows a browser pop-up; handy for a quick check
            // alert("hello js!");

            // 1. Read the query. $ is an alias for jQuery
            let query = $(".container .search input").val();
            // console.log writes to the browser's developer console, useful for inspecting JS data
            console.log("query = " + query);

            // 2. Send an HTTP request that passes the keyword to the server.
            //    $.ajax is jQuery's function for exchanging data with a server
            $.ajax({
                type:"GET",
                // Encode the query so spaces, '&' and non-ASCII characters survive in the URL
                url:"/s?word="+encodeURIComponent(query),
                // On success, log the data returned by the server (the server keeps running in the background)
                success:function(data){
                    console.log(data);
                    // Build the result list from the response
                    BuildHtml(data);
                }
            });
        }

        function BuildHtml(data)
        {
            if(data=="" || data==null)
            {
                document.write("搜索内容不存在");
                return ;
            }
            // Get the result container
            let result_label = $(".container .result");
            // Clear the results of the previous search
            result_label.empty();

            for(let elem of data)
            {
                console.log(elem.title);
                console.log(elem.url);

                let a_label=$("<a>",{
                    text: elem.title,
                    // Link target
                    href: elem.url,
                    // Open the link in a new tab
                    target: "_blank"
                });
                let p_label=$("<p>",{
                    text: elem.abstract
                });
                let i_label=$("<i>",{
                    text: elem.url
                });
                let div_label=$("<div>",{
                    class:"item"
                });
                a_label.appendTo(div_label);
                p_label.appendTo(div_label);
                i_label.appendTo(div_label);
                div_label.appendTo(result_label);
            }
        }
    </script>
</body>
</html>

This completes the front-end code.

Overall Result

All of the project's files are shown below:

[Image: project file listing]

The makefile is as follows:

PARSER=parser
DUG=debug
HTTP_SERVER=http_server
cc=g++

.PHONY:all
all:$(PARSER) $(DUG) $(HTTP_SERVER)

$(PARSER):parser.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem

$(DUG):debug.cc
	$(cc) -o $@ $^ -std=c++11 -ljsoncpp

$(HTTP_SERVER):http_server.cc
	$(cc) -o $@ $^ -std=c++11 -ljsoncpp -lpthread

.PHONY:clean
clean:
		rm -rf $(PARSER) $(DUG) $(HTTP_SERVER)

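After running make, run ./parser; it writes all of the processed HTML documents to raw.txt.

Then start the server program: ./http_server

Finally, open a browser and enter your server's IP address: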

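7. Back-End Optimization

Deduplicating Search Results

As discussed in the searcher module, the inverted lists retrieved for a query can overlap: different keywords may come from the same document, so the same document can appear more than once in the results.

To test this, we create our own test.html file and search for its contents.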

  • test.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  
    <title>测试用例</title>
    <meta http-equiv="refresh" content="0; URL=http://www.boost.org/doc/libs/master/doc/html/hash.html">
  </head>
  <body>
    今天是一个晴天
    <a href="http://www.boost.org/doc/libs/master/doc/html/hash.html">http://www.boost.org/doc/libs/master/doc/html/hash.html</a>
  </body>
</html>

We place test.html in the input directory, then rebuild and run:

[sjl@VM-16-6-centos boost_searcher]$ make
g++ -o parser parser.cc -std=c++11 -lboost_system -lboost_filesystem
g++ -o debug debug.cc -std=c++11 -ljsoncpp
g++ -o http_server http_server.cc -std=c++11 -ljsoncpp -lpthread
[sjl@VM-16-6-centos boost_searcher]$ ./parser 
[sjl@VM-16-6-centos boost_searcher]$ ./http_server 
创建index单例完成...
构建索引完成....: 100%

[Image: search results for the test page]

The results are duplicated!

We need to prevent this.

The fix is a change to searcher.hpp; see the project code link at the end of the article for details.
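
The project's actual change is in the linked repository. As a rough sketch of the idea only, one way to deduplicate is to merge the inverted-list hits by document ID before ranking and sum their weights. The `Hit` and `MergedHit` structs and the `MergeHits` function below are hypothetical names chosen for illustration, not the project's real types.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical inverted-list entry: one keyword matched one document.
struct Hit {
    uint64_t doc_id;
    std::string word;
    int weight;
};

// Hypothetical merged entry: every keyword that matched the same document.
struct MergedHit {
    uint64_t doc_id = 0;
    int weight = 0;                  // sum of the per-keyword weights
    std::vector<std::string> words;  // all keywords that hit this document
};

// Collapse hits that share a doc_id, then sort by total weight, descending.
std::vector<MergedHit> MergeHits(const std::vector<Hit>& hits) {
    std::unordered_map<uint64_t, MergedHit> by_doc;
    for (const Hit& h : hits) {
        MergedHit& m = by_doc[h.doc_id];  // default-constructed on first use
        m.doc_id = h.doc_id;
        m.weight += h.weight;
        m.words.push_back(h.word);
    }

    std::vector<MergedHit> merged;
    merged.reserve(by_doc.size());
    for (auto& kv : by_doc) {
        merged.push_back(std::move(kv.second));
    }

    std::sort(merged.begin(), merged.end(),
              [](const MergedHit& a, const MergedHit& b) {
                  if (a.weight != b.weight) return a.weight > b.weight;
                  return a.doc_id < b.doc_id;  // deterministic order for ties
              });
    return merged;
}
```

With this approach each document appears once in the result list no matter how many query keywords it contains, and a document matched by several keywords ranks higher because its weights add up.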

After the change:

[Image: search results after the fix]

Removing Stop Words

The jieba tokenizer library includes a stop-word dictionary:

[Image: the stop-word dictionary in the jieba library]

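Modifying tool.hpp

Load the stop-word dictionary into memory. After jieba finishes tokenizing, filter the resulting keywords against the dictionary to remove the stop words.

See tool.hpp in the project code at the end of the article for details.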

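Result:

Searching for a stop word now returns no results.

[Image: search page showing no results for a stop word]

Building the index is slower up front, because every token has to be checked against the stop-word list. Once the index is built, though, search time drops substantially, since stop words are no longer indexed or looked up.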

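Adding Logging

//log.hpp
#pragma once

#include <iostream>
#include <string>
#include <ctime>

// Severity levels
#define NORMAL  1
#define WARNING 2
#define DEBUG   3
#define FATAL   4
 
// #LEVEL stringifies the level name, so LOG(NORMAL, "...") prints "[NORMAL]"
#define LOG(LEVEL,MESSAGE) log(#LEVEL,MESSAGE,__FILE__,__LINE__)

// Prints one line: [level][Unix timestamp][message][file : line]
// inline, so the header can be included from more than one source file
inline void log(std::string level ,std::string message,std::string file,int line)
{
    std::cout<<"["<<level<<"]"<<"["<<time(nullptr)<<"]"<<"["<<message<<"]"<<"["<<file<<" : "<<line<<"]"<<std::endl;

}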

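Use the LOG macro at every error-handling site and wherever a status message is printed, giving each call an appropriate severity level and message.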
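
As an illustration of what such a call site might look like, here is a minimal sketch. The `ReadFile` helper is hypothetical and not taken from the project; the macro and `log` function repeat the layout of log.hpp (with const references) only so that the sketch compiles on its own.

```cpp
#include <ctime>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Repeated from log.hpp so this sketch is self-contained.
#define LOG(LEVEL, MESSAGE) log(#LEVEL, MESSAGE, __FILE__, __LINE__)

inline void log(const std::string& level, const std::string& message,
                const std::string& file, int line) {
    std::cout << "[" << level << "][" << std::time(nullptr) << "][" << message
              << "][" << file << " : " << line << "]" << std::endl;
}

// Hypothetical helper: read a whole file into *out.
// On failure it logs a WARNING and returns false, leaving *out untouched.
bool ReadFile(const std::string& path, std::string* out) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in.is_open()) {
        LOG(WARNING, "failed to open " + path);
        return false;
    }
    out->assign(std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>());
    return true;
}
```

Because the macro captures `__FILE__` and `__LINE__` at the call site, each log line points at the exact place where the failure was detected.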

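Deploying the Service

Run the server in the background and write its log output to log.txt (2>&1 redirects standard error to the same file):

[sjl@VM-16-6-centos boost_searcher]$ nohup ./http_server > log.txt 2>&1 &

After entering a few search terms: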

[sjl@VM-16-6-centos boost_searcher]$ cat log.txt 
nohup: ignoring input
创建index单例完成...
[NORMAL][1659167339][创建index单例完成...][searcher.hpp : 24]
构建索引完成....: 100%
[NORMAL][1659167389][构建索引完成...][searcher.hpp : 28]
用户搜索词: vector
[NORMAL][1659168113][用户搜索词: vector][http_server.cc : 25]
用户搜索词: split
[NORMAL][1659168141][用户搜索词: split][http_server.cc : 25]
用户搜索词: filestream
[NORMAL][1659168148][用户搜索词: filestream][http_server.cc : 25]

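Possible Extensions:

  1. The data source currently indexes only the HTML files under boost_1_79_0/doc/html/, so the index could be extended to cover the whole site.
  2. The data source could be refreshed by a crawler that runs periodically, or by a signal sent when the site is updated to trigger a re-crawl. This calls for an online-update design (multithreaded or multiprocess).
  3. Replace the third-party components with your own implementations.
  4. Add bid-based ranking.
  5. Hot-word statistics and search suggestions (trie, priority queue).
  6. Add user registration and login.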
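
For the hot-word idea in item 5, a trie combined with a priority queue could look roughly like the sketch below. Nothing here exists in the project: the `HotWords` class and its `Record` and `Suggest` methods are hypothetical, and the trie works byte by byte, which is enough to show the structure.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical hot-word tracker: a byte-wise trie whose terminal nodes
// count how many times each query was recorded.
class HotWords {
public:
    // Record one occurrence of a search query.
    void Record(const std::string& query) {
        Node* cur = &root_;
        for (char c : query) {
            std::unique_ptr<Node>& next = cur->children[c];
            if (!next) next.reset(new Node());
            cur = next.get();
        }
        ++cur->count;
    }

    // Return up to k recorded queries that start with prefix,
    // most frequent first.
    std::vector<std::string> Suggest(const std::string& prefix,
                                     std::size_t k) const {
        const Node* cur = &root_;
        for (char c : prefix) {
            auto it = cur->children.find(c);
            if (it == cur->children.end()) return {};
            cur = it->second.get();
        }
        Heap heap;
        std::string buf = prefix;
        Collect(cur, &buf, &heap);

        std::vector<std::string> out;
        while (!heap.empty() && out.size() < k) {
            out.push_back(heap.top().second);
            heap.pop();
        }
        return out;
    }

private:
    struct Node {
        int count = 0;  // times the query ending at this node was recorded
        std::map<char, std::unique_ptr<Node>> children;
    };
    typedef std::pair<int, std::string> Entry;  // (count, query)
    // Orders the heap by count; on equal counts the smaller string comes first.
    struct Less {
        bool operator()(const Entry& a, const Entry& b) const {
            if (a.first != b.first) return a.first < b.first;
            return a.second > b.second;
        }
    };
    typedef std::priority_queue<Entry, std::vector<Entry>, Less> Heap;

    // Depth-first walk that pushes every recorded query under node.
    static void Collect(const Node* node, std::string* buf, Heap* heap) {
        if (node->count > 0) heap->push(Entry(node->count, *buf));
        for (const auto& kv : node->children) {
            buf->push_back(kv.first);
            Collect(kv.second.get(), buf, heap);
            buf->pop_back();
        }
    }

    Node root_;
};
```

The server would call `Record` for each incoming query and `Suggest` while the user types. A production version would also need locking for concurrent requests and some way to age out old counts.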

Project Code

Uploaded to Gitee: project code link
