Project repository: https://gitee.com/Vertas/boost-searcher-project

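1. Project background

  • We use many search engines in daily life, such as Baidu, Sogou, and 360 Search. Are we going to build a search engine like Baidu today? No: an engine like Baidu searches data from the entire web, and that volume of data is far more than we could handle.
  • The Boost search engine we build here is a site-search engine, that is, it searches only within the Boost website https://www.boost.org/. Site search works on a more vertical data set, which simply means a much smaller one.
  • There is one more reason for doing this project: the Boost website does not offer a site-search feature.

We can search for a keyword on Baidu to see what the results look like:

Every result has the three parts marked in the figure above: a title, a summary of the page content, and the page url.

Likewise, our Boost search engine must show these three pieces of information when a keyword is searched.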

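2. How a search engine works at a high level

  • We download the html files of all pages in the Boost library to serve as the backend data.
  • Once they are downloaded, we write code that strips the tags from all html files, cleans the data, and builds an index.
  • When we access the server through a browser, we are sending it an HTTP request, and that request carries the keyword we are searching for.
  • The server looks up the user's keyword in the index built in advance and returns the relevant data. After the browser parses the response, the user sees the search results.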
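
The flow above can be sketched as a toy in-memory example. This is only an illustration under simplifying assumptions: the names ToyEngine, AddDocument and Search are hypothetical and do not come from the project, documents are plain strings rather than parsed html files, and words are split on whitespace instead of with cppjieba.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy engine, for illustration only (not the project's code).
class ToyEngine
{
public:
    // "Offline" step: store the document and index each of its words.
    void AddDocument(const std::string& text)
    {
        const std::size_t doc_id = docs_.size();
        docs_.push_back(text);

        std::istringstream words(text);
        std::string word;
        while (words >> word)
        {
            std::vector<std::size_t>& ids = index_[word];
            // Record each document at most once per word.
            if (ids.empty() || ids.back() != doc_id)
            {
                ids.push_back(doc_id);
            }
        }
    }

    // "Online" step: look the keyword up in the index built in advance
    // and return the matching documents.
    std::vector<std::string> Search(const std::string& keyword) const
    {
        std::vector<std::string> hits;
        auto iter = index_.find(keyword);
        if (iter == index_.end())
        {
            return hits; // keyword was never indexed: no results
        }
        for (std::size_t doc_id : iter->second)
        {
            assert(doc_id < docs_.size());
            hits.push_back(docs_[doc_id]);
        }
        return hits;
    }

private:
    std::vector<std::string> docs_;                                   // document id -> document
    std::unordered_map<std::string, std::vector<std::size_t>> index_; // word -> document ids
};
```

In the real project the "offline" step runs once over all downloaded html files, and the "online" step runs inside the HTTP server for every request.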

3. Technology stack and environment

  • Technology stack: C/C++, C++11, STL, the quasi-standard library Boost, Jsoncpp, cppjieba, cpp-httplib, html5, css, js, jQuery, Ajax.
  • Environment: CentOS 7 cloud server, vim/gcc(g++)/Makefile, vscode.

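4. Writing the tag-stripping and data-cleaning module

4.1 Downloading all html pages of the Boost library

  • Download link: Boost download

  • Use the rz command to upload the downloaded file to the CentOS server.

  • Use tar -zxvf to extract the downloaded archive.

  • We only want the html files; the other files are not needed. Use the find command to see how many html files the download contains:

There are 23987 html files in total.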

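4.2 Parsing the html files

Let's look at what an html file looks like and what a tag is:

  • A paired tag consists of an opening tag and a closing tag, as in the paired tag marked in the figure.
  • A single tag consists of just one tag, like the one marked in the figure above.
  • Our job is to remove all of these tags and keep only the content of the page.

Before stripping tags we obviously have to read the html files into memory, but the Boost download contains more than html files. So there is a preparatory step: collect every html file in Boost. That means walking the whole directory tree, which the C++11 standard library does not support conveniently (std::filesystem only arrived in C++17), so we use functions from the Boost library instead.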

Install the Boost development library:

sudo yum install -y boost-devel # "devel" means the development package

We will use the classes in Boost's filesystem.hpp to filter out the html files.

So we design a function, FilterFile:

  • Parameter 1: input parameter, the directory to traverse, i.e. the downloaded Boost library.
  • Parameter 2: output parameter, which receives the html files we filter out.

#include <boost/filesystem.hpp>
#include <iostream>
#include <string>
#include <vector>

bool FilterFile(const std::string& src_dir, std::vector<std::string>* file_list)
{
    namespace fs = boost::filesystem;
    // Build a path object from the directory that was passed in
    fs::path root_path(src_dir);
    // Check that src_dir exists
    if (!fs::exists(root_path))
    {
        std::cerr << src_dir << " is an invalid source path." << std::endl;
        return false;
    }
    // Iterator that recursively walks every file under src_dir
    fs::recursive_directory_iterator end;
    for (fs::recursive_directory_iterator iter(root_path); iter != end; ++iter)
    {
        // Skip anything that is not a regular file
        if (!fs::is_regular_file(*iter))
        {
            continue;
        }
        // Skip files whose extension is not .html
        if (iter->path().extension() != ".html")
        {
            continue;
        }
        // For debugging: check that every html file is being picked up
        std::cout << iter->path().string() << std::endl;
        // Store the matching file in the vector
        file_list->push_back(iter->path().string());
    }
    return true;
}

When you use Boost, you have to link the Boost libraries you use at compile time.

g++ -o test debug.cc -std=c++11 -lboost_system -lboost_filesystem

Calling the function shows that it behaves as expected: it finds the same files as the find command we ran at the beginning.


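Extracting the title

Inspecting the html pages shows that the title of a page always sits between the <title> and </title> tags, and that an html file contains exactly one <title> tag. So we can read each html file into memory and call find to locate these two tags, which gives us the page title.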

bool ParseTitle(const std::string& content, std::string* title){//查找开始标签的下标size_t start_label = content.find(""</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token keyword">if</span><span class="token punctuation">(</span>start_label <span class="token operator">==</span> std<span class="token double-colon punctuation">::</span>string<span class="token double-colon punctuation">::</span>npos<span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span><span class="token punctuation">}</span><span class="token comment">//查找结束标签的下标</span>size_t end_label <span class="token operator">=</span> content<span class="token punctuation">.</span><span class="token function">find</span><span class="token punctuation">(</span><span class="token string">"");if(end_label == std::string::npos){return false;}//截取内容的开始下标size_t begin_pos = start_label + std::string(""</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>size_t end_pos <span class="token operator">=</span> end_label<span class="token punctuation">;</span><span class="token comment">//开始下标不可能大于结束下标</span><span class="token keyword">if</span><span class="token punctuation">(</span>begin_pos <span class="token operator">></span> end_pos<span class="token punctuation">)</span> <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span><span class="token operator">*</span>title <span class="token operator">=</span> content<span class="token punctuation">.</span><span class="token function">substr</span><span class="token punctuation">(</span>begin_pos<span class="token 
punctuation">,</span> end_pos <span class="token operator">-</span> begin_pos<span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span><span class="token punctuation">}</span></code></pre><ul><li>参数一:输入型参数,一个 html 文件的全部内容。</li><li>参数二:输出型参数,我们提取到一个 html 文件的标题。</li></ul><h4>提取内容</h4><p>除了要提取一个 html 文件的标题,我们还要提取 html 文件的内容。这个内容当然不是 html 文件里面的那些标签,而是指浏览器解析 html 文件之后,网页上能看到的内容。也就是两个标签之间的文字。</p><p>想要提取我们想要的内容,需要使用一个简易的状态机。</p><ul><li>整个 html 文件中的字符可以分为两类:一类是标签,一类是我们想要的内容。我们就可以一个字符一个字符的遍历 html 文件,根据当前的状态来确定当前字符是不是我们需要的。</li><li>如果遍历到的字符是我们需要的话,将其添加到结果中就行啦!</li></ul><p>如果你还是不太明白下面的图片可能会帮到你:</p><p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/8269b42929604ce98e1d78b3cd7cd9f1.gif" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/8269b42929604ce98e1d78b3cd7cd9f1.gif" /></p><p>于是我们可以定义一个函数:<code>ParseContent</code> 来获取 html 文件中的内容。</p><ul><li>参数一:输入型参数,一个 html 文件的全部内容。</li><li>参数二:输出型参数,我们提取到一个 html 文件的内容。</li></ul><pre><code class="prism language-cpp"><span class="token keyword">bool</span> <span class="token function">ParseContent</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span> file<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">*</span> content<span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token comment">//定义状态机,确定遍历到某个字符时是否是我们需要的字符</span><span class="token 
keyword">enum</span><span class="token punctuation">{</span>LABEL<span class="token punctuation">,</span>CONTENT<span class="token punctuation">}</span> cur_stat<span class="token punctuation">;</span><span class="token comment">// html 文件一开始一定是标签</span>cur_stat <span class="token operator">=</span> LABEL<span class="token punctuation">;</span><span class="token comment">//遍历文件的内容,根据状态来确定是不是我们要的字符</span><span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span> ch <span class="token operator">:</span> file<span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token keyword">switch</span><span class="token punctuation">(</span>cur_stat<span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token keyword">case</span> LABEL<span class="token operator">:</span><span class="token comment">//状态切换</span><span class="token keyword">if</span><span class="token punctuation">(</span>ch <span class="token operator">==</span> <span class="token char">'>'</span><span class="token punctuation">)</span>cur_stat <span class="token operator">=</span> CONTENT<span class="token punctuation">;</span><span class="token keyword">break</span><span class="token punctuation">;</span><span class="token keyword">case</span> CONTENT<span class="token operator">:</span><span class="token comment">// 状态切换</span><span class="token keyword">if</span><span class="token punctuation">(</span>ch <span class="token operator">==</span> <span class="token char">'<'</span><span class="token punctuation">)</span>cur_stat <span class="token operator">=</span> LABEL<span class="token punctuation">;</span><span class="token keyword">else</span><span class="token punctuation">{</span><span class="token comment">// 我们将 html 文件中的 \n 全部置换成为空格,因为我们在将 html 文件</span><span class="token comment">// 保存到本地的时候需要让 \n 作为每一个文件的分隔符</span><span class="token keyword">if</span><span class="token 
punctuation">(</span>ch <span class="token operator">==</span> <span class="token char">'\n'</span><span class="token punctuation">)</span> ch <span class="token operator">=</span> <span class="token char">' '</span><span class="token punctuation">;</span>content<span class="token operator">-></span><span class="token function">push_back</span><span class="token punctuation">(</span>ch<span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token punctuation">}</span><span class="token keyword">break</span><span class="token punctuation">;</span><span class="token keyword">default</span><span class="token operator">:</span><span class="token punctuation">}</span><span class="token punctuation">}</span><span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span><span class="token punctuation">}</span></code></pre><h4>提取 url</h4><p>用户在搜索某个关键字之后,是能够跳转到 Boost 官网的。因此我们还需要根据过滤出来的 html 页面将对应 html 页面的官网地址提取出来。</p><p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/c36f3846bcdc4e5e887c2c027f459412.png" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/c36f3846bcdc4e5e887c2c027f459412.png" /></p><p>对比过滤出来的 html 页面在服务器的位置与官网对应的地址,不难发现:我们只要将服务器本地的 html 页面存放的位置拼接上 Boost 官网前半部分的固定字符串就能正确提取出跳转官网的 url 链接啦!</p><p>我们可以定义一个函数:<code>ParseUrl</code> 来实现提取 url</p><ul><li>参数一:输入型参数,我们过滤出来的 html 文件在服务器的相对路径。</li><li>参数二:输出型参数,跳转官网的那个 url。</li></ul><pre><code class="prism language-cpp"><span class="token keyword">bool</span> <span class="token function">ParseUrl</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon 
punctuation">::</span>string<span class="token operator">&</span> src_path<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">*</span> url<span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token comment">// Boost 官网固定前缀</span>std<span class="token double-colon punctuation">::</span>string url_head <span class="token operator">=</span> <span class="token string">"https://www.boost.org/doc/libs/1_84_0"</span><span class="token punctuation">;</span><span class="token comment">// boost_1_84_0/doc/html/container/main_features.html</span><span class="token comment">// 服务器上的文件截取掉 boost_1_84_0 再拼街上固定前缀即是官网地址</span>std<span class="token double-colon punctuation">::</span>string url_tail <span class="token operator">=</span> src_path<span class="token punctuation">.</span><span class="token function">substr</span><span class="token punctuation">(</span>src_path<span class="token punctuation">.</span><span class="token function">find</span><span class="token punctuation">(</span><span class="token string">"/"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token operator">*</span>url <span class="token operator">=</span> url_head <span class="token operator">+</span> url_tail<span class="token punctuation">;</span><span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span><span class="token punctuation">}</span></code></pre><h3>4.3 保存 html 文件</h3><p>当我们将提取标题,提取内容,提取 url 的工作做完了之后,我们就可以将解析出来的数据通过一个结构体封装起来,然后再将结果保存到服务器,方便进行后续建立索引的工作。</p><p>为了方便在建立索引的时候读取一个解析之后的 html 文件内容,我们将解析出来的结果统一保存在一个文件中。每一个 html 文件解析出来的结果用换行符进行分割,一个 html 文件中的标题,内容,url 之间使用 <code>\3</code> 进行分割。这里为什么用 <code>\3</code> 呢,是因为在 html 文档中不可能出现 <code>\3</code>,因此使用 <code>\3</code> 能够正确分割标题,内容,url 这三个部分。当然你用其他不可能在 html 文件中出现的字符也行。</p><pre><code 
class="prism language-cpp"><span class="token keyword">bool</span> <span class="token function">SaveHtml</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>DocInfo_t<span class="token operator">></span><span class="token operator">&</span> results<span class="token punctuation">,</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> output_path<span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token keyword">int</span> cnt <span class="token operator">=</span> <span class="token number">1</span><span class="token punctuation">;</span><span class="token comment">//打开想要保存的文件 不存在就是创建啦</span>std<span class="token double-colon punctuation">::</span>ofstream <span class="token function">out_file</span><span class="token punctuation">(</span>output_path<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>binary <span class="token operator">|</span> std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>out<span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>out_file<span class="token punctuation">.</span><span class="token function">is_open</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token comment">//文件打开失败结束保存</span>std<span class="token double-colon punctuation">::</span>cerr <span class="token operator"><<</span> <span class="token string">"file "</span> <span class="token 
operator"><<</span> output_path <span class="token operator"><<</span> <span class="token string">"open failed"</span> <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span><span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span><span class="token punctuation">}</span><span class="token comment">// 遍历文件将提取出来的 html 文件保存在服务器</span><span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">auto</span><span class="token operator">&</span> result <span class="token operator">:</span> results<span class="token punctuation">)</span><span class="token punctuation">{</span> std<span class="token double-colon punctuation">::</span>cout <span class="token operator"><<</span> <span class="token string">"正在保存第 "</span> <span class="token operator"><<</span> cnt<span class="token operator">++</span> <span class="token operator"><<</span> <span class="token string">" 个 html 文件"</span> <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span> std<span class="token double-colon punctuation">::</span>string out_string<span class="token punctuation">;</span>out_string <span class="token operator">+=</span> result<span class="token punctuation">.</span>title<span class="token punctuation">;</span>out_string <span class="token operator">+=</span> DATA_BLOCK_SEP<span class="token punctuation">;</span> <span class="token comment">//数据块之间使用 /3 作为分割符,方便构建索引的时候区分</span>out_string <span class="token operator">+=</span> result<span class="token punctuation">.</span>content<span class="token punctuation">;</span>out_string <span class="token operator">+=</span> DATA_BLOCK_SEP<span class="token punctuation">;</span>out_string <span class="token operator">+=</span> result<span 
class="token punctuation">.</span>url<span class="token punctuation">;</span>out_string <span class="token operator">+=</span> <span class="token string">"\n"</span><span class="token punctuation">;</span> <span class="token comment">//每一个 html 文件之间使用 \n 作为分割符,方便构建索引的时候读取文件</span>out_file<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>out_string<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> out_string<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token punctuation">}</span>out_file<span class="token punctuation">.</span><span class="token function">close</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span><span class="token punctuation">}</span></code></pre><p>在解析文件的时候,我们可以顺便将解析的结果打印出来看看:</p><p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/4b1f83af801a4f50ac976478875b76b1.png" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/4b1f83af801a4f50ac976478875b76b1.png" /></p><p>可以看到,我们解析出来的 content 中已经不含任何的标签啦!</p><p>我们可以直接访问解析到的 
url:https://www.boost.org/doc/libs/1_84_0/libs/type_traits/doc/html/boost_typetraits/reference/has_trivial_constructor.html</p><p>可以看到能够正确跳转官网。</p><p>我们查看网页的源代码,可以看到标题也是被正确地提取出来了!</p><p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/dd72380267f34a36a19adfed03e67887.png" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/dd72380267f34a36a19adfed03e67887.png" /></p><h2>5. 编写建立索引的模块</h2><h3>5.1 获取正排索引</h3><p>什么是正排索引呢?其实很简单,我们不是提取到了很多很多的 html 文件嘛,正排索引就是给所有文件编一个号,能够根据编号找到对应的文档就行啦!</p><p>比如有两个文档:</p><ol><li>我喜欢中国。</li><li>中国是我最喜欢的国家。</li></ol><p>就可以建立这样的正排索引:</p><table><thead><tr><th>文档编号</th><th align="left">文档内容</th></tr></thead><tbody><tr><td>1</td><td align="left">我喜欢中国。</td></tr><tr><td>2</td><td align="left">中国是我最喜欢的国家。</td></tr></tbody></table><ul><li>我们能根据文档编号 1 找到,“我爱中国。” 的文档内容。</li><li>我们能根据文档编号 2 找到,“中国是我最喜欢的国家。” 的文档内容。</li></ul><p>根据编号找文档内容,我们自然就想到了使用数组来存储所有的正排索引。</p><p>于是我们很轻松写出了获取正排的函数:</p><pre><code class="prism language-cpp"><span class="token keyword">struct</span> <span class="token class-name">DocInfo</span><span class="token punctuation">{</span>std<span class="token double-colon punctuation">::</span>string _title<span class="token punctuation">;</span> <span class="token comment">//文档标题</span>std<span class="token double-colon punctuation">::</span>string _content<span class="token punctuation">;</span> <span class="token comment">//文档内容</span>std<span class="token double-colon punctuation">::</span>string _url<span class="token punctuation">;</span> <span class="token comment">//对应官网链接</span><span class="token keyword">uint32_t</span> _doc_id<span class="token punctuation">;</span> <span class="token comment">//文档的编号</span><span class="token 
punctuation">}</span><span class="token punctuation">;</span>std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>DocInfo<span class="token operator">></span> _forward_index<span class="token punctuation">;</span> <span class="token comment">//正排索引</span>DocInfo<span class="token operator">*</span> <span class="token function">GetForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">uint32_t</span> doc_id<span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token keyword">if</span><span class="token punctuation">(</span>doc_id <span class="token operator">>=</span> _forward_index<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token comment">// 文档 id 不能越界</span><span class="token punctuation">{</span>std<span class="token double-colon punctuation">::</span>cerr <span class="token operator"><<</span> <span class="token string">"doc_id out of range"</span> <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span><span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span><span class="token punctuation">}</span><span class="token keyword">return</span> <span class="token operator">&</span>_forward_index<span class="token punctuation">[</span>doc_id<span class="token punctuation">]</span><span class="token punctuation">;</span> <span class="token comment">//根据文档 id 返回整个文档</span><span class="token punctuation">}</span></code></pre><h3>5.2 获取倒排索引</h3><p>那什么又是倒排索引呢?不急哈,我们先来看看我们平时的搜索场景:</p><p><noscript><img decoding="async" class="aligncenter" 
src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/b354ed13468a45daaefcfceef61e820c.png" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/b354ed13468a45daaefcfceef61e820c.png" /></p><p>可以看到我在百度搜索:“清华大学是中国最好的大学之一”,百度返回的条目中,有 “中国”,“清华大学”,“大学”,“最好的大学”,“清华” 这样的词语匹配成功百度搜索引擎就给我返回了对应网页!</p><p>如此可见,我们在搜索引擎进行搜索的时候,会将搜索的字符串进行拆分,得到很多关键字,然后百度服务器根据这些关键字查找服务器上包含这些关键字的文章,最后以一定的顺序返回给用户。</p><p>同理我们实现的 Boost 搜索引擎也要做词语拆分的工作!</p><p>那么到底什么是倒排索引呢?倒排索引就是根据关键字,找到该关键字对应的文档编号。还是这个例子:</p><p>我有两个文档:</p><ol><li>我喜欢中国。</li><li>中国是你我都喜欢的国家。</li></ol><p>假如我搜索的是:你喜欢中国吗?</p><ul><li>将这个字符串进行拆分:得到:“喜欢”,“中国”,“你”。</li><li>于是,就可以建立倒排索引:</li></ul><table><thead><tr><th>关键字</th><th>文档编号</th></tr></thead><tbody><tr><td>喜欢</td><td>文档 1,文档 2</td></tr><tr><td>中国</td><td>文档 1,文档 2</td></tr><tr><td>你</td><td>文档 2</td></tr></tbody></table><p>通过关键字得到了文档编号,即根据倒排索引得到了文档编号。然后再根据正排索引就能获得该文档编号下的所有内容。就能将数据发送给客户端啦!</p><p>可以看到 “吗” 这种词并不会参与建立倒排索引,因为像这类语气助词太常见了!这种词我们一般称为暂停词,搜索引擎应该能够去掉这些暂停词,不然会很影响服务器返回用户条目的顺序排列!这类暂停词在英语中就有:“a”,“the”,“an” 等等哈!</p><p>通过在百度搜索 “清华大学是中国最好的大学之一” 可以看到 百度服务器返回的条目是按照一个顺序罗列出来的,因此我们还需要确定一个关键字在一个文档中的权重,这样就可以根据用户搜索的关键字,按照权重降序排列返回给客户端啦!</p><p>我们要根据关键字也就是 <code>string</code> 找到文档编号等内容,可见比较理想的保存倒排索引的数据结构就是哈希表啦!</p><p>于是我们很轻松就写出了获取倒排索引的函数:</p><pre><code class="prism language-cpp"><span class="token keyword">struct</span> <span class="token class-name">InvertedElement</span><span class="token punctuation">{</span><span class="token keyword">uint32_t</span> _doc_id<span class="token punctuation">;</span> <span class="token comment">//文档编号</span><span class="token keyword">uint32_t</span> _weight<span class="token punctuation">;</span> <span class="token comment">//关键字对应在该文档中的权重</span>std<span class="token double-colon punctuation">::</span>string _word<span class="token 
punctuation">;</span> <span class="token comment">//关键字</span><span class="token punctuation">}</span><span class="token punctuation">;</span><span class="token keyword">typedef</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElement<span class="token operator">></span> InvertedList<span class="token punctuation">;</span> <span class="token comment">//倒排拉链</span>std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span> InvertedList<span class="token operator">></span> _inverted_index<span class="token punctuation">;</span> <span class="token comment">//倒排索引</span>InvertedList<span class="token operator">*</span> <span class="token function">GetInvertedList</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> word<span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token keyword">auto</span> iter <span class="token operator">=</span> _inverted_index<span class="token punctuation">.</span><span class="token function">find</span><span class="token punctuation">(</span>word<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// 根据关键字在倒排索引中查找</span><span class="token keyword">if</span><span class="token punctuation">(</span>iter <span class="token operator">==</span> _inverted_index<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>std<span class="token double-colon punctuation">::</span>cerr <span class="token operator"><<</span> word <span 
class="token operator"><<</span> <span class="token string">"have not InvertedList"</span> <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span> <span class="token keyword">return</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span><span class="token punctuation">}</span><span class="token keyword">return</span> <span class="token operator">&</span><span class="token punctuation">(</span>iter<span class="token operator">-></span>second<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">//找到了就返回倒排拉链</span><span class="token punctuation">}</span></code></pre><p>显然,一个关键字可能出现在多个文档之中,因此一个 <code>string</code> 对应的应该是一个 <code>vector</code> 我们一般将这个 <code>veector</code> 叫做倒排拉链,是不是非常的形象。</p><h3>5.3 建立正排索引</h3><p>我们已经成功将 html 文件解析成功保存到服务器中了,下一步要做的就是将这个文件读取出来,建立正排索引和倒排索引。</p><p>解析成功的一个 html 我们在保存的时候是当作一行的!标题,内容,url 之间使用 <code>\3</code> 作为分隔符。因此我们只需要以 <code>\3</code> 作为分隔符将读取到的一行字符串进行切割,建立正排索引之后保存在之前定义好的数据结构中就行啦!</p><pre><code class="prism language-cpp"><span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">Split</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>target<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> <span class="token operator">*</span>out<span class="token punctuation">,</span> <span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>sep<span class="token punctuation">)</span><span class="token punctuation">{</span><span 
class="token comment">// 参数一是 vector 哈用来存放切割之后的字符串,参数二就是要切割的字符串,参数三是什么作为分隔符,</span><span class="token comment">// 参数四表示多个连续出现的分隔符会进行合并</span>boost<span class="token double-colon punctuation">::</span><span class="token function">split</span><span class="token punctuation">(</span><span class="token operator">*</span>out<span class="token punctuation">,</span> target<span class="token punctuation">,</span> boost<span class="token double-colon punctuation">::</span><span class="token function">is_any_of</span><span class="token punctuation">(</span>sep<span class="token punctuation">)</span><span class="token punctuation">,</span> boost<span class="token double-colon punctuation">::</span>token_compress_on<span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token punctuation">}</span>DocInfo <span class="token operator">*</span><span class="token function">BuildForwardIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>line<span class="token punctuation">)</span><span class="token punctuation">{</span>DocInfo doc<span class="token punctuation">;</span><span class="token comment">// 存储分割出来的结果</span>std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> results<span class="token punctuation">;</span><span class="token comment">// 分隔符</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string sep <span class="token operator">=</span> <span class="token string">"\3"</span><span class="token punctuation">;</span><span class="token comment">// 调用分割函数</span>Util<span class="token double-colon punctuation">::</span><span class="token class-name">StringUtil</span><span class="token double-colon 
punctuation">::</span><span class="token function">Split</span><span class="token punctuation">(</span>line<span class="token punctuation">,</span> <span class="token operator">&</span>results<span class="token punctuation">,</span> sep<span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token comment">// 根据分割结果构建 DocInfo 对象</span>doc<span class="token punctuation">.</span>_title <span class="token operator">=</span> results<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">;</span>doc<span class="token punctuation">.</span>_content <span class="token operator">=</span> results<span class="token punctuation">[</span><span class="token number">1</span><span class="token punctuation">]</span><span class="token punctuation">;</span>doc<span class="token punctuation">.</span>_url <span class="token operator">=</span> results<span class="token punctuation">[</span><span class="token number">2</span><span class="token punctuation">]</span><span class="token punctuation">;</span>doc<span class="token punctuation">.</span>_doc_id <span class="token operator">=</span> _forward_index<span class="token punctuation">.</span><span class="token function">size</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token comment">// 将建立好的正排插入 vector</span>_forward_index<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>doc<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token comment">// 返回新建立的正排的地址</span><span class="token keyword">return</span> <span class="token 
operator">&</span><span class="token punctuation">(</span>_forward_index<span class="token punctuation">.</span><span class="token function">back</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre><p>We could of course split the string by hand with <code>find</code> and <code>substr</code>, but that is tedious. We use the <code>split</code> function from the Boost library instead. Its usage is described in the code comments, and you can also look it up yourself.</p><h3>5.4 Building the inverted index</h3><p>As mentioned when we discussed looking up the inverted index, the user's query string has to be segmented into words. That looks complicated, and it is, so we use a third-party library.</p><blockquote><p>cppjieba project page: https://github.com/yanyiwu/cppjieba.git</p></blockquote><p>How do we use it?</p><ul><li><p>We use <code>ln -s</code> to create two symbolic links that point to the directories we need.</p><p><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/8baba1a9d8bc442aa7c83259b0f64a3c.png" /></p></li></ul><p>The first link points to the directory that contains the <code>Jieba.hpp</code> header we include; the second points to the dictionaries used for segmentation.</p><ul><li><p>Before the library can be used, the <code>limonp</code> directory has to be copied into the <code>cppjieba</code> include directory. You can skip the copy at first and read the compiler error to see why it is needed, or simply run the command below.</p><pre><code class="prism language-bash"><span class="token function">cp</span> <span class="token parameter variable">-rf</span> deps/limonp include/cppjieba/</code></pre></li></ul><p>The project ships with a <code>demo</code> that you can run directly. We only need one function from it:</p><pre><code class="prism language-cpp"><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;iostream&gt;</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token 
directive keyword">include</span> <span class="token string">&lt;string&gt;</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;vector&gt;</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cppjieba/Jieba.hpp"</span></span>

<span class="token keyword">using</span> <span class="token keyword">namespace</span> std<span class="token punctuation">;</span>

<span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/jieba.dict.utf8"</span><span class="token punctuation">;</span>
<span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> HMM_PATH <span class="token operator">=</span> <span class="token string">"./dict/hmm_model.utf8"</span><span class="token punctuation">;</span>
<span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> USER_DICT_PATH <span class="token operator">=</span> <span class="token string">"./dict/user.dict.utf8"</span><span class="token punctuation">;</span>
<span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> IDF_PATH <span class="token operator">=</span> <span class="token string">"./dict/idf.utf8"</span><span class="token punctuation">;</span>
<span class="token keyword">const</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token keyword">const</span> STOP_WORD_PATH <span class="token operator">=</span> <span 
class="token string">"./dict/stop_words.utf8"</span><span class="token punctuation">;</span>

<span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token keyword">int</span> argc<span class="token punctuation">,</span> <span class="token keyword">char</span> <span class="token operator">*</span><span class="token operator">*</span>argv<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">// Create a Jieba object; the arguments are the dictionary files it should load</span>
    cppjieba<span class="token double-colon punctuation">::</span>Jieba <span class="token function">jieba</span><span class="token punctuation">(</span>DICT_PATH<span class="token punctuation">,</span> HMM_PATH<span class="token punctuation">,</span> USER_DICT_PATH<span class="token punctuation">,</span> IDF_PATH<span class="token punctuation">,</span> STOP_WORD_PATH<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token comment">// The segmented words are stored in this vector</span>
    vector<span class="token operator"><</span>string<span class="token operator">></span> words<span class="token punctuation">;</span>
    <span class="token comment">// The string to segment</span>
    string s<span class="token punctuation">;</span>
    s <span class="token operator">=</span> <span class="token string">"小明硕士毕业于中国科学院计算所,后在日本京都大学深造"</span><span class="token punctuation">;</span>
    cout <span class="token operator"><<</span> s <span class="token operator"><<</span> endl<span class="token punctuation">;</span>
    cout <span class="token operator"><<</span> <span class="token string">"[demo] CutForSearch"</span> <span class="token operator"><<</span> endl<span class="token punctuation">;</span>
    jieba<span class="token punctuation">.</span><span class="token function">CutForSearch</span><span class="token punctuation">(</span>s<span class="token punctuation">,</span> words<span class="token punctuation">)</span><span class="token 
punctuation">;</span>
    cout <span class="token operator"><<</span> limonp<span class="token double-colon punctuation">::</span><span class="token function">Join</span><span class="token punctuation">(</span>words<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> words<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">"/"</span><span class="token punctuation">)</span> <span class="token operator"><<</span> endl<span class="token punctuation">;</span>
    <span class="token keyword">return</span> EXIT_SUCCESS<span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre><p>The function we need is <code>CutForSearch</code>.</p><p><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/df0c94311c084ebda4b3d0e6fa665103.png" /></p><p>The output above shows the segmentation result, which is close to what we need.</p><p>Now let us write the segmentation module:</p><pre><code class="prism language-cpp"><span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">CutString</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> src<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>vector<span class="token 
operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span><span class="token operator">*</span> out<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    jieba<span class="token punctuation">.</span><span class="token function">CutForSearch</span><span class="token punctuation">(</span>src<span class="token punctuation">,</span> <span class="token operator">*</span>out<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre><p>We simply wrap the call to <code>CutForSearch</code> in a function of our own.</p><hr><p>Now let us look at the function that builds the inverted index:</p><ul><li>Building the forward index gave us a <code>DocInfo</code>. We pass that <code>DocInfo</code> to the function that builds the inverted index, which first segments the title and the content into words.</li><li>After segmentation we need a weight for each keyword in the document. We use a simple made-up rule: every occurrence of the keyword in the title adds 10 to its weight, and every occurrence in the content adds 1. You can of course define your own weighting scheme.</li></ul><pre><code class="prism language-cpp"><span class="token keyword">bool</span> <span class="token function">BuildInvertedIndex</span><span class="token punctuation">(</span><span class="token keyword">const</span> DocInfo <span class="token operator">&</span>doc<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">// Counts how often a word occurs in the title and in the content</span>
    <span class="token keyword">struct</span> <span class="token class-name">word_cnt</span>
    <span class="token punctuation">{</span>
        <span class="token keyword">int</span> _title_cnt<span class="token punctuation">;</span>
        <span class="token keyword">int</span> _content_cnt<span class="token punctuation">;</span>
        <span class="token function">word_cnt</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">:</span> <span class="token function">_title_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span 
class="token function">_content_cnt</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">{</span><span class="token punctuation">}</span>
    <span class="token punctuation">}</span><span class="token punctuation">;</span>
    <span class="token comment">// Per-document tally: word -> occurrences in the title and in the content</span>
    std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span> word_cnt<span class="token operator">></span> word_map<span class="token punctuation">;</span>
    <span class="token comment">// Segment the title first</span>
    std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> title_word<span class="token punctuation">;</span>
    Util<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaUtil</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>doc<span class="token punctuation">.</span>_title<span class="token punctuation">,</span> <span class="token operator">&</span>title_word<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token comment">// Count each title word (s is a copy, so lowercasing it leaves title_word untouched)</span>
    <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span> s <span class="token operator">:</span> title_word<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token 
punctuation">;</span>
        word_map<span class="token punctuation">[</span>s<span class="token punctuation">]</span><span class="token punctuation">.</span>_title_cnt<span class="token operator">++</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token comment">// Then segment the content</span>
    std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> content_word<span class="token punctuation">;</span>
    Util<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaUtil</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>doc<span class="token punctuation">.</span>_content<span class="token punctuation">,</span> <span class="token operator">&</span>content_word<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token comment">// Likewise, count each content word</span>
    <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span> s <span class="token operator">:</span> content_word<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span>
        word_map<span class="token punctuation">[</span>s<span class="token punctuation">]</span><span class="token punctuation">.</span>_content_cnt<span class="token operator">++</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token 
macro-name">TITLE_WEIGHT</span> <span class="token expression"><span class="token number">10</span></span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">define</span> <span class="token macro-name">CONTENT_WEIGHT</span> <span class="token expression"><span class="token number">1</span></span></span>
    <span class="token comment">// Walk word_map, build one InvertedElement per word, and add it to the inverted index</span>
    <span class="token comment">// Iterator over the hash table</span>
    <span class="token keyword">auto</span> iter <span class="token operator">=</span> word_map<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">while</span><span class="token punctuation">(</span>iter <span class="token operator">!=</span> word_map<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// Build the element from the data collected above</span>
        InvertedElement ie<span class="token punctuation">;</span>
        <span class="token comment">// ID of the document that contains the keyword</span>
        ie<span class="token punctuation">.</span>_doc_id <span class="token operator">=</span> doc<span class="token punctuation">.</span>_doc_id<span class="token punctuation">;</span>
        <span class="token comment">// The keyword itself</span>
        ie<span class="token punctuation">.</span>_word <span class="token operator">=</span> iter<span class="token operator">-></span>first<span class="token punctuation">;</span>
        <span class="token comment">// Weight of the keyword in this document</span>
        ie<span class="token punctuation">.</span>_weight <span class="token operator">=</span> <span class="token punctuation">(</span>iter<span class="token operator">-></span>second<span class="token 
punctuation">)</span><span class="token punctuation">.</span>_title_cnt <span class="token operator">*</span> TITLE_WEIGHT <span class="token operator">+</span> <span class="token punctuation">(</span>iter<span class="token operator">-></span>second<span class="token punctuation">)</span><span class="token punctuation">.</span>_content_cnt <span class="token operator">*</span> CONTENT_WEIGHT<span class="token punctuation">;</span>
        <span class="token comment">// Append the element to the keyword's posting list; searches look it up in this hash table later</span>
        _inverted_index<span class="token punctuation">[</span>iter<span class="token operator">-></span>first<span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>ie<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token comment">// Advance to the next word</span>
        <span class="token operator">++</span>iter<span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre><p>Details:</p><ol><li>After segmentation we convert every word to lowercase, so that a query matches whether the user types an English word in upper or lower case. Normalizing everything to lowercase keeps the handling simple.</li><li>It helps to understand how insertion works in the hash table and the red-black tree containers: <code>word_map[s]</code> and <code>_inverted_index[...]</code> create a default-constructed entry the first time a key is used.</li></ol><h3>5.5 Completing the index-building module</h3><p>With the earlier pieces in place, all that is left here is to call the functions we have already written.</p><p>We write the following function: <code>bool BuildIndex(const std::string& file)</code></p><ul><li>Parameter 1: <code>file</code> is the file that our call to <code>SaveHtml</code> saved on the server.</li></ul><p>We read this file line by line and, for each line, call the <code>BuildForwardIndex</code> and <code>BuildInvertedIndex</code> functions written earlier.</p><pre><code class="prism language-cpp"><span class="token comment">// file is the file that the SaveHtml function from earlier saved on the server</span>
<span class="token keyword">bool</span> <span class="token function">BuildIndex</span><span class="token punctuation">(</span><span 
class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> file<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">// Open the file that SaveHtml wrote</span>
    std<span class="token double-colon punctuation">::</span>ifstream <span class="token function">in_file</span><span class="token punctuation">(</span>file<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>in <span class="token operator">|</span> std<span class="token double-colon punctuation">::</span>ios<span class="token double-colon punctuation">::</span>binary<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>in_file<span class="token punctuation">.</span><span class="token function">is_open</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        std<span class="token double-colon punctuation">::</span>cerr <span class="token operator"><<</span> <span class="token string">"file "</span> <span class="token operator"><<</span> file <span class="token operator"><<</span> <span class="token string">" open failed"</span> <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
        <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token comment">// Each line is one parsed html file in the form: title\3content\3url\n</span>
    std<span class="token double-colon punctuation">::</span>string line<span class="token punctuation">;</span>
    <span class="token 
keyword">while</span><span class="token punctuation">(</span><span class="token function">getline</span><span class="token punctuation">(</span>in_file<span class="token punctuation">,</span> line<span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// Build the forward index entry</span>
        DocInfo<span class="token operator">*</span> doc <span class="token operator">=</span> <span class="token function">BuildForwardIndex</span><span class="token punctuation">(</span>line<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span>doc <span class="token operator">==</span> <span class="token keyword">nullptr</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">continue</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">// Build the inverted index entries</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span><span class="token function">BuildInvertedIndex</span><span class="token punctuation">(</span><span class="token operator">*</span>doc<span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">continue</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre><h2>6. 
Writing the search engine's searcher module</h2><p>Preparation:</p><ul><li>The <code>Index</code> module we wrote earlier should be a singleton, because the whole project needs only one <code>Index</code> object.</li><li>As a singleton, the functions of the <code>Index</code> module are also easy to call from the <code>searcher</code> module.</li></ul><p>The singleton code is not reproduced here; see the project source. We use the lazy-initialization form of the singleton pattern, protected by a lock.</p><hr><p>The <code>Searcher</code> module returns the entries that match the user's query string to the client, therefore:</p><ul><li>The user's query string also has to be segmented into words.</li><li>The server returns the data to the client in <code>json</code> format.</li></ul><p>Now let us install <code>jsoncpp</code>:</p><pre><code class="prism language-bash"><span class="token function">sudo</span> yum <span class="token function">install</span> <span class="token parameter variable">-y</span> jsoncpp-devel <span class="token comment"># as before, -devel denotes the development package</span></code></pre><p>As before, we create a symbolic link:</p><pre><code class="prism language-bash"><span class="token function">ln</span> <span class="token parameter variable">-s</span> /usr/include/jsoncpp jsoncpp</code></pre><p>To use <code>jsoncpp</code>, the library also has to be linked when compiling the source files (<code>-ljsoncpp</code>).</p><p>Here is a small program that exercises <code>jsoncpp</code>:</p><pre><code class="prism language-cpp"><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;iostream&gt;</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;string&gt;</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">&lt;jsoncpp/json/json.h&gt;</span></span>

<span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    Json<span class="token double-colon punctuation">::</span>Value root<span class="token punctuation">;</span> <span class="token comment">// a Json::Value can hold any JSON type</span>
    Json<span class="token double-colon punctuation">::</span>Value ele1<span 
class="token punctuation">;</span>
    ele1<span class="token punctuation">[</span><span class="token string">"title1"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"Title 1"</span><span class="token punctuation">;</span>
    ele1<span class="token punctuation">[</span><span class="token string">"content1"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"Content 1"</span><span class="token punctuation">;</span>
    ele1<span class="token punctuation">[</span><span class="token string">"url1"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"URL 1"</span><span class="token punctuation">;</span>
    root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>ele1<span class="token punctuation">)</span><span class="token punctuation">;</span>
    Json<span class="token double-colon punctuation">::</span>Value ele2<span class="token punctuation">;</span>
    ele2<span class="token punctuation">[</span><span class="token string">"title2"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"Title 2"</span><span class="token punctuation">;</span>
    ele2<span class="token punctuation">[</span><span class="token string">"content2"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"Content 2"</span><span class="token punctuation">;</span>
    ele2<span class="token punctuation">[</span><span class="token string">"url2"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token string">"URL 2"</span><span class="token punctuation">;</span>
    root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>ele2<span class="token punctuation">)</span><span 
class="token punctuation">;</span>
    Json<span class="token double-colon punctuation">::</span>StyledWriter w<span class="token punctuation">;</span>
    std<span class="token double-colon punctuation">::</span>string s <span class="token operator">=</span> w<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>root<span class="token punctuation">)</span><span class="token punctuation">;</span>
    std<span class="token double-colon punctuation">::</span>cout <span class="token operator"><<</span> <span class="token string">"Serialized result:"</span> <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
    std<span class="token double-colon punctuation">::</span>cout <span class="token operator"><<</span> s <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
    Json<span class="token double-colon punctuation">::</span>Value ret<span class="token punctuation">;</span>
    Json<span class="token double-colon punctuation">::</span>Reader r<span class="token punctuation">;</span>
    r<span class="token punctuation">.</span><span class="token function">parse</span><span class="token punctuation">(</span>s<span class="token punctuation">,</span> ret<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// deserialize</span>
    std<span class="token double-colon punctuation">::</span>cout <span class="token operator"><<</span> ret<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">[</span><span class="token string">"title1"</span><span class="token punctuation">]</span><span class="token punctuation">.</span><span class="token function">asString</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span 
class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
    <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre><p>As the screenshot shows, the <code>jsoncpp</code> test program runs successfully.</p><p><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/819dcc8dbbb74d0aa5cbd3e95de3f742.png" /></p><p>Now let us write the <code>Searcher</code> module.</p><p>We define the following function: <code>void Search(const std::string &query, std::string *json_string)</code></p><ul><li>Parameter 1: input parameter, the string the user typed into the search box.</li><li>Parameter 2: output parameter. We find the pages related to the user's input, pack the results as json, and return them through this parameter. This is the data that is sent to the client.</li></ul><ol><li>The first step in this function is to segment the user's query string into words.</li><li>For each word, look up the inverted index to get its posting list, and merge all the posting lists into a single <code>vector</code>.</li><li>Sort the elements of the <code>vector</code> by weight in descending order.</li><li>Pack the results in <code>json</code> format and output them.</li></ol><pre><code class="prism language-cpp"><span class="token keyword">void</span> <span class="token function">Search</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>query<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">*</span>json_string<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">// Segment the user's query string into words</span>
    std<span class="token double-colon punctuation">::</span>vector<span class="token 
operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> query_word<span class="token punctuation">;</span>
    Util<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaUtil</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>query<span class="token punctuation">,</span> <span class="token operator">&</span>query_word<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token comment">// Each keyword has its own posting list; merge them all into one list</span>
    ns_index<span class="token double-colon punctuation">::</span>InvertedList inverted_list_all<span class="token punctuation">;</span>
    <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> s <span class="token operator">:</span> query_word<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// Lowercase the word so that it matches the index</span>
        boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token comment">// Look the word up in the inverted index</span>
        ns_index<span class="token double-colon punctuation">::</span>InvertedList <span class="token operator">*</span>il <span class="token operator">=</span> index<span class="token operator">-></span><span class="token function">GetInvertedList</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token comment">// Skip words without a posting list (assumes GetInvertedList returns nullptr in that case)</span>
        <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token keyword">nullptr</span> <span class="token operator">==</span> il<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">continue</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">// Append this posting list to the merged list</span>
        inverted_list_all<span class="token punctuation">.</span><span class="token function">insert</span><span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token 
function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token operator">*</span>il<span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">(</span><span class="token operator">*</span>il<span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token comment">// Sort by weight in descending order</span>
    std<span class="token double-colon punctuation">::</span><span class="token function">sort</span><span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> inverted_list_all<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> ns_index<span class="token double-colon punctuation">::</span>InvertedElement <span class="token operator">&</span>e1<span class="token punctuation">,</span> <span class="token keyword">const</span> ns_index<span class="token double-colon punctuation">::</span>InvertedElement <span class="token operator">&</span>e2<span class="token 
punctuation">)</span><span class="token punctuation">{</span> <span class="token keyword">return</span> e1<span class="token punctuation">.</span>_weight <span class="token operator">></span> e2<span class="token punctuation">.</span>_weight<span class="token punctuation">;</span> <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token comment">// Serialize the results as JSON</span>
    Json<span class="token double-colon punctuation">::</span>Value root<span class="token punctuation">;</span>
    <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> <span class="token operator">&</span>item <span class="token operator">:</span> inverted_list_all<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        ns_index<span class="token double-colon punctuation">::</span>DocInfo <span class="token operator">*</span>doc <span class="token operator">=</span> index<span class="token operator">-></span><span class="token function">GetForwardIndex</span><span class="token punctuation">(</span>item<span class="token punctuation">.</span>_doc_id<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token keyword">nullptr</span> <span class="token operator">==</span> doc<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">continue</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        Json<span class="token double-colon punctuation">::</span>Value elem<span class="token punctuation">;</span>
        elem<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span> <span class="token operator">=</span> doc<span class="token operator">-></span>_title<span class="token 
punctuation">;</span>
        elem<span class="token punctuation">[</span><span class="token string">"desc"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token function">GetDesc</span><span class="token punctuation">(</span>doc<span class="token operator">-></span>_content<span class="token punctuation">,</span> item<span class="token punctuation">.</span>_word<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// content is the whole tag-stripped document; we only want a short excerpt of it</span>
        elem<span class="token punctuation">[</span><span class="token string">"url"</span><span class="token punctuation">]</span> <span class="token operator">=</span> doc<span class="token operator">-></span>_url<span class="token punctuation">;</span>
        root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    Json<span class="token double-colon punctuation">::</span>StyledWriter r<span class="token punctuation">;</span>
    <span class="token operator">*</span>json_string <span class="token operator">=</span> r<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>root<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre><p>This code calls a <code>GetDesc</code> function. The content of an html file can be very long, and we do not need all of it. A short description is enough, so this function builds a brief description from the content of the html file.</p><p>Our strategy is:</p><ul><li>Find the index at which the keyword first appears in content.</li><li>Take 50 characters before that index and 100 characters after it as the description of this html file's content.</li></ul><p>The function is fairly simple, so you can read it in the project's source code.</p><blockquote><p>When writing this function, watch out for the usual size_t pitfall: size_t is unsigned, so subtracting 50 from a small index wraps around to a huge value instead of going negative!</p></blockquote><h2>7. 
Writing the http_server module</h2><p>Following the principle of not hand-rolling what a library already provides, we use an existing library for the <code>http_server</code> module in this project too! If you want to experience writing one by hand, we will do that in the next project, the <code>high-concurrency server</code>!</p><p>Installing <code>cpp-httplib</code>:</p><pre><code class="prism language-bash"><span class="token function">git</span> clone https://gitee.com/welldonexing/cpp-httplib.git</code></pre><p>One problem is that <code>httplib</code> needs a fairly recent <code>gcc</code> compiler. The default <code>gcc</code> on <code>centos7</code> is <code>4.8.5</code>, so we need to upgrade to <code>gcc 7</code> or later!</p><pre><code class="prism language-bash"><span class="token comment"># Install scl</span>
<span class="token function">sudo</span> yum <span class="token function">install</span> centos-release-scl scl-utils-build</code></pre><pre><code class="prism language-bash"><span class="token comment"># Install the newer gcc</span>
<span class="token function">sudo</span> yum <span class="token function">install</span> <span class="token parameter variable">-y</span> devtoolset-7-gcc devtoolset-7-gcc-c++</code></pre><pre><code class="prism language-bash"><span class="token comment"># Use gcc 7</span>
scl <span class="token builtin class-name">enable</span> devtoolset-7 <span class="token function">bash</span></code></pre><p>The command that switches to <code>gcc 7</code> only takes effect in the current session, so we need to put it into a configuration file:</p><pre><code class="prism language-bash"><span class="token function">vim</span> ~/.bash_profile</code></pre><p>We open the <code>.bash_profile</code> file in the home directory with vim to configure this for the current user: just add the command above to the file. It then runs automatically every time we log in, so our <code>gcc</code> version is always <code>gcc 7</code>.</p><p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/f26a0c3a8d4f4c23ae3938b83ea0d9ec.png" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/f26a0c3a8d4f4c23ae3938b83ea0d9ec.png" /></p><p>As before, we use the <code>ln -s</code> command to create a symbolic link instead of cloning the whole library into the Boost search engine project:</p><pre><code class="prism language-bash"><span class="token function">ln</span> <span class="token parameter variable">-s</span> ~/ThirdPartLibs/cpp-httplib cpp-httplib</code></pre><pre><code class="prism language-cpp"><span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"cpp-httplib/httplib.h"</span></span>
<span class="token macro property"><span class="token directive-hash">#</span><span class="token directive keyword">include</span> <span class="token string">"searcher.hpp"</span></span>
<span class="token comment">// The file written by SaveHtml</span>
<span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string file <span class="token operator">=</span> <span class="token string">"Parse.txt"</span><span class="token punctuation">;</span>
<span class="token comment">// The web root directory</span>
<span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string web_root_path <span class="token operator">=</span> <span class="token string">"./wwwroot"</span><span class="token punctuation">;</span>
<span class="token keyword">int</span> <span class="token function">main</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    ns_searcher<span class="token double-colon punctuation">::</span>Searcher search<span class="token punctuation">;</span> <span class="token comment">// Initialize the Searcher module</span>
    search<span class="token punctuation">.</span><span class="token function">InitSearcher</span><span class="token punctuation">(</span>file<span class="token punctuation">)</span><span class="token punctuation">;</span>
    httplib<span class="token double-colon punctuation">::</span>Server svr<span class="token punctuation">;</span>
    svr<span class="token punctuation">.</span><span class="token 
function">set_base_dir</span><span class="token punctuation">(</span>web_root_path<span class="token punctuation">.</span><span class="token function">c_str</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    svr<span class="token punctuation">.</span><span class="token function">Get</span><span class="token punctuation">(</span><span class="token string">"/s"</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token operator">&</span>search<span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> httplib<span class="token double-colon punctuation">::</span>Request <span class="token operator">&</span>req<span class="token punctuation">,</span> httplib<span class="token double-colon punctuation">::</span>Response <span class="token operator">&</span>rsp<span class="token punctuation">)</span><span class="token punctuation">{</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>req<span class="token punctuation">.</span><span class="token function">has_param</span><span class="token punctuation">(</span><span class="token string">"word"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">{</span>
            rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span><span class="token string">"A search keyword is required!"</span><span class="token punctuation">,</span> <span class="token string">"text/plain; charset=utf-8"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token keyword">return</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        std<span class="token double-colon 
punctuation">::</span>string word <span class="token operator">=</span> req<span class="token punctuation">.</span><span class="token function">get_param_value</span><span class="token punctuation">(</span><span class="token string">"word"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// The word parameter of the GET request</span>
        std<span class="token double-colon punctuation">::</span>string json_string<span class="token punctuation">;</span> <span class="token comment">// The JSON data returned to the browser</span>
        search<span class="token punctuation">.</span><span class="token function">Search</span><span class="token punctuation">(</span>word<span class="token punctuation">,</span> <span class="token operator">&</span>json_string<span class="token punctuation">)</span><span class="token punctuation">;</span>
        rsp<span class="token punctuation">.</span><span class="token function">set_content</span><span class="token punctuation">(</span>json_string<span class="token punctuation">,</span> <span class="token string">"application/json"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    svr<span class="token punctuation">.</span><span class="token function">listen</span><span class="token punctuation">(</span><span class="token string">"0.0.0.0"</span><span class="token punctuation">,</span> <span class="token number">9999</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// Bind the IP address and port number</span>
    <span class="token keyword">return</span> <span class="token number">0</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre><p>Now we can visit our server and see what it does:</p><ul><li>Run the executable compiled from <code>Http_server.cc</code>.</li><li>Wait for the forward index and the inverted index to finish building.</li><li>Suppose we want to search for the keyword filesystem: we request <code>47.180.251.0:9999/s</code> and pass the keyword in the <code>word</code> query parameter.</li></ul><p>The server successfully returns the data to the client! The next step is the front-end module. If you know front-end development you can write your own; here I will simply paste the code, since I am not very good at writing front-end code!</p><h2>8. Writing the front-end code</h2><pre><code class="prism language-html"><span class="token doctype"><span class="token punctuation"><!</span><span class="token doctype-tag">DOCTYPE</span> <span class="token name">html</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>html</span> <span class="token attr-name">lang</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>en<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>head</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">charset</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>UTF-8<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">http-equiv</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>X-UA-Compatible<span class="token punctuation">"</span></span> <span class="token attr-name">content</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>IE=edge<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>meta</span> <span class="token attr-name">name</span><span class="token 
attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>viewport<span class="token punctuation">"</span></span> <span class="token attr-name">content</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>width=device-width, initial-scale=1.0<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>script</span> <span class="token attr-name">src</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>http://code.jquery.com/jquery-2.1.1.min.js<span class="token punctuation">"</span></span><span class="token punctuation">></span></span><span class="token script"></span><span class="token tag"><span class="token tag"><span class="token punctuation"></</span>script</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>title</span><span class="token punctuation">></span></span>Boost search engine<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>title</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>style</span><span class="token punctuation">></span></span><span class="token style"><span class="token language-css">
        <span class="token comment">/* Remove all default margins and padding on the page (HTML box model) */</span>
        <span class="token selector">*</span> <span class="token punctuation">{</span>
            <span class="token comment">/* Set the margin */</span>
            <span class="token property">margin</span><span class="token punctuation">:</span> 0<span class="token punctuation">;</span>
            <span class="token comment">/* Set the padding */</span>
            <span class="token property">padding</span><span class="token punctuation">:</span> 
0<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">/* Make the body fill 100% of the html element */</span>
        <span class="token selector">html,body</span> <span class="token punctuation">{</span>
            <span class="token property">height</span><span class="token punctuation">:</span> 100%<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">/* Class selector: .container */</span>
        <span class="token selector">.container</span> <span class="token punctuation">{</span>
            <span class="token comment">/* Set the width of the div */</span>
            <span class="token property">width</span><span class="token punctuation">:</span> 800px<span class="token punctuation">;</span>
            <span class="token comment">/* Center the div horizontally by setting the margin */</span>
            <span class="token property">margin</span><span class="token punctuation">:</span> 0px auto<span class="token punctuation">;</span>
            <span class="token comment">/* Set the top margin to keep some distance from the top of the page */</span>
            <span class="token property">margin-top</span><span class="token punctuation">:</span> 15px<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">/* Compound selector: selects search inside container */</span>
        <span class="token selector">.container .search</span> <span class="token punctuation">{</span>
            <span class="token comment">/* Same width as the parent element */</span>
            <span class="token property">width</span><span class="token punctuation">:</span> 100%<span class="token punctuation">;</span>
            <span class="token comment">/* Set the height to 52px */</span>
            <span class="token property">height</span><span class="token punctuation">:</span> 52px<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">/* Select the input element first, then set its properties; input is a tag selector */</span>
        <span class="token comment">/* The height set on input does not include the border */</span>
        <span class="token selector">.container .search input</span> <span class="token punctuation">{</span>
            <span class="token comment">/* Float left */</span>
            <span 
class="token property">float</span><span class="token punctuation">:</span> left<span class="token punctuation">;</span>
            <span class="token property">width</span><span class="token punctuation">:</span> 600px<span class="token punctuation">;</span>
            <span class="token property">height</span><span class="token punctuation">:</span> 50px<span class="token punctuation">;</span>
            <span class="token comment">/* Set the border: width, style, color */</span>
            <span class="token property">border</span><span class="token punctuation">:</span> 1px solid black<span class="token punctuation">;</span>
            <span class="token comment">/* Remove the right border of the input box */</span>
            <span class="token property">border-right</span><span class="token punctuation">:</span> none<span class="token punctuation">;</span>
            <span class="token comment">/* Set the padding so the text does not touch the left border */</span>
            <span class="token property">padding-left</span><span class="token punctuation">:</span> 10px<span class="token punctuation">;</span>
            <span class="token comment">/* Set the color and size of the text inside the input */</span>
            <span class="token property">color</span><span class="token punctuation">:</span> #CCC<span class="token punctuation">;</span>
            <span class="token property">font-size</span><span class="token punctuation">:</span> 14px<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">/* Select the button element first, then set its properties; button is a tag selector */</span>
        <span class="token selector">.container .search button</span> <span class="token punctuation">{</span>
            <span class="token comment">/* Float left */</span>
            <span class="token property">float</span><span class="token punctuation">:</span> left<span class="token punctuation">;</span>
            <span class="token property">width</span><span class="token punctuation">:</span> 150px<span class="token punctuation">;</span>
            <span class="token property">height</span><span class="token punctuation">:</span> 52px<span class="token punctuation">;</span>
            <span class="token comment">/* Set the button background color, #4e6ef2 */</span>
            <span 
class="token property">background-color</span><span class="token punctuation">:</span> #4e6ef2<span class="token punctuation">;</span>
            <span class="token comment">/* Set the text color of the button */</span>
            <span class="token property">color</span><span class="token punctuation">:</span> #FFF<span class="token punctuation">;</span>
            <span class="token comment">/* Set the font size */</span>
            <span class="token property">font-size</span><span class="token punctuation">:</span> 19px<span class="token punctuation">;</span>
            <span class="token property">font-family</span><span class="token punctuation">:</span>Georgia<span class="token punctuation">,</span> <span class="token string">'Times New Roman'</span><span class="token punctuation">,</span> Times<span class="token punctuation">,</span> serif<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token selector">.container .result</span> <span class="token punctuation">{</span>
            <span class="token property">width</span><span class="token punctuation">:</span> 100%<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token selector">.container .result .item</span> <span class="token punctuation">{</span>
            <span class="token property">margin-top</span><span class="token punctuation">:</span> 15px<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token selector">.container .result .item a</span> <span class="token punctuation">{</span>
            <span class="token comment">/* Make it a block-level element so it takes its own line */</span>
            <span class="token property">display</span><span class="token punctuation">:</span> block<span class="token punctuation">;</span>
            <span class="token comment">/* Remove the underline of the a element */</span>
            <span class="token property">text-decoration</span><span class="token punctuation">:</span> none<span class="token punctuation">;</span>
            <span class="token comment">/* Set the font size of the text in the a element */</span>
            <span class="token property">font-size</span><span class="token punctuation">:</span> 
20px<span class="token punctuation">;</span>
            <span class="token comment">/* Set the text color */</span>
            <span class="token property">color</span><span class="token punctuation">:</span> #4e6ef2<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token selector">.container .result .item a:hover</span> <span class="token punctuation">{</span>
            <span class="token property">text-decoration</span><span class="token punctuation">:</span> underline<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token selector">.container .result .item p</span> <span class="token punctuation">{</span>
            <span class="token property">margin-top</span><span class="token punctuation">:</span> 5px<span class="token punctuation">;</span>
            <span class="token property">font-size</span><span class="token punctuation">:</span> 16px<span class="token punctuation">;</span>
            <span class="token property">font-family</span><span class="token punctuation">:</span><span class="token string">'Lucida Sans'</span><span class="token punctuation">,</span> <span class="token string">'Lucida Sans Regular'</span><span class="token punctuation">,</span> <span class="token string">'Lucida Grande'</span><span class="token punctuation">,</span> <span class="token string">'Lucida Sans Unicode'</span><span class="token punctuation">,</span> Geneva<span class="token punctuation">,</span> Verdana<span class="token punctuation">,</span> sans-serif<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token selector">.container .result .item i</span><span class="token punctuation">{</span>
            <span class="token comment">/* Make it a block-level element so it takes its own line */</span>
            <span class="token property">display</span><span class="token punctuation">:</span> block<span class="token punctuation">;</span>
            <span class="token comment">/* Disable the italic style */</span>
            <span class="token property">font-style</span><span class="token punctuation">:</span> normal<span class="token 
punctuation">;</span>
            <span class="token property">color</span><span class="token punctuation">:</span> green<span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    </span></span><span class="token tag"><span class="token tag"><span class="token punctuation"></</span>style</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>head</span><span class="token punctuation">></span></span>
<span class="token tag"><span class="token tag"><span class="token punctuation"><</span>body</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>container<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>search<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
            <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>input</span> <span class="token attr-name">type</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>text<span class="token punctuation">"</span></span> <span class="token attr-name">value</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>Enter a search keyword<span class="token punctuation">"</span></span><span class="token punctuation">></span></span>
            <span class="token tag"><span 
class="token tag"><span class="token punctuation"><</span>button</span> <span class="token special-attr"><span class="token attr-name">onclick</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span><span class="token value javascript language-javascript"><span class="token function">Search</span><span class="token punctuation">(</span><span class="token punctuation">)</span></span><span class="token punctuation">"</span></span></span><span class="token punctuation">></span></span>Search<span class="token tag"><span class="token tag"><span class="token punctuation"></</span>button</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
        <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>div</span> <span class="token attr-name">class</span><span class="token attr-value"><span class="token punctuation attr-equals">=</span><span class="token punctuation">"</span>result<span class="token punctuation">"</span></span><span class="token punctuation">></span></span><span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"></</span>div</span><span class="token punctuation">></span></span>
    <span class="token tag"><span class="token tag"><span class="token punctuation"><</span>script</span><span class="token punctuation">></span></span><span class="token script"><span class="token language-javascript">
        <span class="token keyword">function</span> <span class="token function">Search</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">{</span>
            <span class="token comment">// 1. 
Read the input; $ can be thought of as an alias for jQuery</span>
            <span class="token keyword">let</span> query <span class="token operator">=</span> <span class="token function">$</span><span class="token punctuation">(</span><span class="token string">".container .search input"</span><span class="token punctuation">)</span><span class="token punctuation">.</span><span class="token function">val</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            console<span class="token punctuation">.</span><span class="token function">log</span><span class="token punctuation">(</span><span class="token string">"query = "</span> <span class="token operator">+</span> query<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// console is the browser's developer console, handy for inspecting JS data</span>
            <span class="token comment">// 2. Send the HTTP request; ajax is the jQuery function for exchanging data with the back end</span>
            $<span class="token punctuation">.</span><span class="token function">ajax</span><span class="token punctuation">(</span><span class="token punctuation">{</span>
                <span class="token literal-property property">type</span><span class="token operator">:</span> <span class="token string">"GET"</span><span class="token punctuation">,</span>
                <span class="token comment">// Encode the query so that spaces, &amp; and non-ASCII keywords survive in the URL</span>
                <span class="token literal-property property">url</span><span class="token operator">:</span> <span class="token string">"/s?word="</span> <span class="token operator">+</span> <span class="token function">encodeURIComponent</span><span class="token punctuation">(</span>query<span class="token punctuation">)</span><span class="token punctuation">,</span>
                <span class="token function-variable function">success</span><span class="token operator">:</span> <span class="token keyword">function</span><span class="token punctuation">(</span><span class="token parameter">data</span><span class="token punctuation">)</span><span class="token punctuation">{</span>
                    console<span class="token punctuation">.</span><span class="token function">log</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token 
punctuation">;</span>
                    <span class="token function">BuildHtml</span><span class="token punctuation">(</span>data<span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token punctuation">}</span>
            <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token keyword">function</span> <span class="token function">BuildHtml</span><span class="token punctuation">(</span><span class="token parameter">data</span><span class="token punctuation">)</span><span class="token punctuation">{</span>
            <span class="token comment">// Get the result element in the html</span>
            <span class="token keyword">let</span> result_lable <span class="token operator">=</span> <span class="token function">$</span><span class="token punctuation">(</span><span class="token string">".container .result"</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token comment">// Clear the previous search results</span>
            result_lable<span class="token punctuation">.</span><span class="token function">empty</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token keyword">for</span><span class="token punctuation">(</span> <span class="token keyword">let</span> elem <span class="token keyword">of</span> data<span class="token punctuation">)</span><span class="token punctuation">{</span>
                <span class="token comment">// console.log(elem.title);</span>
                <span class="token comment">// console.log(elem.url);</span>
                <span class="token keyword">let</span> a_lable <span class="token operator">=</span> <span class="token function">$</span><span class="token punctuation">(</span><span class="token string">"&lt;a&gt;"</span><span class="token punctuation">,</span> <span class="token punctuation">{</span>
                    <span class="token literal-property property">text</span><span class="token operator">:</span> 
elem<span class="token punctuation">.</span>title<span class="token punctuation">,</span>
                    <span class="token literal-property property">href</span><span class="token operator">:</span> elem<span class="token punctuation">.</span>url<span class="token punctuation">,</span>
                    <span class="token comment">// Open the link in a new page</span>
                    <span class="token literal-property property">target</span><span class="token operator">:</span> <span class="token string">"_blank"</span>
                <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token keyword">let</span> p_lable <span class="token operator">=</span> <span class="token function">$</span><span class="token punctuation">(</span><span class="token string">"&lt;p&gt;"</span><span class="token punctuation">,</span> <span class="token punctuation">{</span>
                    <span class="token literal-property property">text</span><span class="token operator">:</span> elem<span class="token punctuation">.</span>desc
                <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token keyword">let</span> i_lable <span class="token operator">=</span> <span class="token function">$</span><span class="token punctuation">(</span><span class="token string">"&lt;i&gt;"</span><span class="token punctuation">,</span> <span class="token punctuation">{</span>
                    <span class="token literal-property property">text</span><span class="token operator">:</span> elem<span class="token punctuation">.</span>url
                <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
                <span class="token keyword">let</span> div_lable <span class="token operator">=</span> <span class="token function">$</span><span class="token punctuation">(</span><span class="token string">"&lt;div&gt;"</span><span class="token punctuation">,</span> <span class="token punctuation">{</span>
                    <span class="token keyword">class</span><span 
class="token operator">:</span> <span class="token string">"item"</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>a_lable<span class="token punctuation">.</span><span class="token function">appendTo</span><span class="token punctuation">(</span>div_lable<span class="token punctuation">)</span><span class="token punctuation">;</span>p_lable<span class="token punctuation">.</span><span class="token function">appendTo</span><span class="token punctuation">(</span>div_lable<span class="token punctuation">)</span><span class="token punctuation">;</span>i_lable<span class="token punctuation">.</span><span class="token function">appendTo</span><span class="token punctuation">(</span>div_lable<span class="token punctuation">)</span><span class="token punctuation">;</span>div_lable<span class="token punctuation">.</span><span class="token function">appendTo</span><span class="token punctuation">(</span>result_lable<span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token punctuation">}</span><span class="token punctuation">}</span></span></span><span class="token tag"><span class="token tag"><span class="token punctuation"></</span>script</span><span class="token punctuation">></span></span><span class="token tag"><span class="token tag"><span class="token punctuation"></</span>body</span><span class="token punctuation">></span></span><span class="token tag"><span class="token tag"><span class="token punctuation"></</span>html</span><span class="token punctuation">></span></span></code></pre><p>把前端代码粘贴过去之后,我们就能直接用 ip 地址加端口号访问啦。</p><p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/c91514c59f2e4d1f8e96c5b041604091.png" /></noscript><img decoding="async" class="lazyload aligncenter" 
src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/c91514c59f2e4d1f8e96c5b041604091.png" /></p>
<h2>9. Handling the Details</h2>
<h3>9.1 Duplicate documents in search results</h3>
<p>You may have noticed a problem while writing the code: when the user's query is segmented into several keywords and two or more of them occur in the same document, that document shows up more than once in the results returned to the user.</p>
<p>A quick experiment confirms it:</p>
<p>Add one extra html file to the set of html files to be filtered and put a Chinese string in it: “你是一个好人” ("you are a good person"). How you add it does not matter.</p>
<p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/684d16063bf24d2e81012d2d11dc096c.png" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/684d16063bf24d2e81012d2d11dc096c.png" /></p>
<p>Then re-parse the html files and start the server.</p>
<p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/750874a377db456ebf0c339acd056ec2.png" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/750874a377db456ebf0c339acd056ec2.png" /></p>
<p>Searching for “你是一个好人” makes the server respond with four entries, and all four are the same document, because each carries the same <code>"id" : 11944</code>. That is clearly not what we want: the server should return a single entry whose weight is 4. In other words, duplicate documents are merged and their weights are added together.</p>
<p>So how do we do that?</p>
<ul><li><p>Until now, the matches were recorded in a <code>vector</code> of <code>InvertedElement</code>:</p>
<pre><code class="prism language-cpp"><span class="token keyword">struct</span> <span class="token class-name">InvertedElement</span>
<span class="token punctuation">{</span>
    <span class="token keyword">uint32_t</span> _doc_id<span class="token punctuation">;</span> <span class="token 
comment">// document ID</span>
    <span class="token keyword">uint32_t</span> _weight<span class="token punctuation">;</span> <span class="token comment">// weight of the keyword in this document</span>
    std<span class="token double-colon punctuation">::</span>string _word<span class="token punctuation">;</span> <span class="token comment">// the keyword</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span></code></pre>
<p>To deduplicate, we can no longer use <code>InvertedElement</code> as it is: several keywords may map to the same document, so each record has to hold an array of keywords rather than a single keyword.</p>
<p>So we define a new struct:</p>
<pre><code class="prism language-cpp"><span class="token keyword">struct</span> <span class="token class-name">InvertedElementNode</span>
<span class="token punctuation">{</span>
    <span class="token keyword">uint32_t</span> _doc_id<span class="token punctuation">;</span> <span class="token comment">// document ID</span>
    <span class="token keyword">uint32_t</span> _weight<span class="token punctuation">;</span> <span class="token comment">// accumulated weight of all matched keywords in this document</span>
    std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> _words<span class="token punctuation">;</span> <span class="token comment">// the matched keywords</span>
    <span class="token function">InvertedElementNode</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">:</span> <span class="token function">_doc_id</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token function">_weight</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">)</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span></code></pre>
<ul><li>Next we build a hash table that maps <code>doc_id</code> to <code>InvertedElementNode</code>. After looking up each segmented keyword in the inverted index and obtaining its inverted list (posting list), we walk that list and insert the entries one by one into the <code>unordered_map</code>. Pay attention to how the code deduplicates: <code>operator[]</code> creates a zero-initialized node the first time a <code>doc_id</code> is seen and returns the existing node afterwards, so the weights for one document accumulate in one place.</li>
<li>After deduplication all the data lives in the <code>unordered_map</code>. We then walk the hash table and put the nodes into our <code>vector</code>, so that they can be sorted by weight in descending order afterwards.</li></ul>
<p>Here is the optimized code:</p>
<pre><code class="prism language-cpp"><span class="token keyword">bool</span> <span class="token function">Search</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>query<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">*</span>json_string<span class="token punctuation">)</span>
<span class="token punctuation">{</span>
    <span class="token comment">// Segment the user's query string into keywords</span>
    std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span> query_word<span class="token punctuation">;</span>
    Util<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaUtil</span><span class="token double-colon punctuation">::</span><span class="token function">CutString</span><span class="token punctuation">(</span>query<span class="token punctuation">,</span> <span class="token operator">&</span>query_word<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token comment">// Each keyword has its own inverted list, so the lists have to be merged</span>
    <span class="token comment">// ns_index::InvertedList inverted_list_all;</span>
    <span class="token comment">// Every document related to the user's query ends up here</span>
    std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>InvertedElementNode<span class="token operator">></span> inverted_list_all<span class="token punctuation">;</span>
    std<span 
class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span><span class="token keyword">uint32_t</span><span class="token punctuation">,</span> InvertedElementNode<span class="token operator">></span> unique_hash<span class="token punctuation">;</span>
    <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> s <span class="token operator">:</span> query_word<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// Convert to lower case to simplify the lookup</span>
        boost<span class="token double-colon punctuation">::</span><span class="token function">to_lower</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token comment">// Look the keyword up in the inverted index</span>
        ns_index<span class="token double-colon punctuation">::</span>InvertedList <span class="token operator">*</span>il <span class="token operator">=</span> index<span class="token operator">-></span><span class="token function">GetInvertedList</span><span class="token punctuation">(</span>s<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token comment">// Some keywords may have no inverted list at all</span>
        <span class="token keyword">if</span> <span class="token punctuation">(</span>il <span class="token operator">==</span> <span class="token keyword">nullptr</span><span class="token punctuation">)</span>
            <span class="token keyword">continue</span><span class="token punctuation">;</span>
        <span class="token comment">// Walk the inverted list of this keyword</span>
        <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">const</span> <span class="token keyword">auto</span><span class="token operator">&</span> ele <span class="token operator">:</span> <span class="token punctuation">(</span><span class="token operator">*</span>il<span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token comment">// unordered_map's operator[] inserts a default-constructed node on first access</span>
            <span class="token keyword">auto</span><span class="token operator">&</span> IEN <span class="token operator">=</span> unique_hash<span class="token punctuation">[</span>ele<span class="token punctuation">.</span>_doc_id<span class="token punctuation">]</span><span class="token punctuation">;</span>
            <span class="token comment">// Redundant once the node exists, but required for the first insertion; the cost is small</span>
            IEN<span class="token punctuation">.</span>_doc_id <span class="token operator">=</span> ele<span class="token punctuation">.</span>_doc_id<span class="token punctuation">;</span>
            <span class="token comment">// Our rule: add the weights together</span>
            IEN<span class="token punctuation">.</span>_weight <span class="token operator">+=</span> ele<span class="token punctuation">.</span>_weight<span class="token punctuation">;</span>
            <span class="token comment">// Record the keyword in the node's vector</span>
            IEN<span class="token punctuation">.</span>_words<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>ele<span class="token punctuation">.</span>_word<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token comment">// Previous version: merge the lists one after another</span>
        <span class="token comment">// inverted_list_all.insert(inverted_list_all.end(), (*il).begin(), (*il).end());</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span><span class="token operator">&</span> node <span class="token operator">:</span> unique_hash<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// Walk the deduplicated data in the hash table and move each node into the vector,</span>
        <span class="token comment">// so it can be sorted by weight afterwards (node must be non-const for the move to take effect)</span>
        inverted_list_all<span class="token punctuation">.</span><span class="token function">push_back</span><span class="token punctuation">(</span>std<span class="token double-colon punctuation">::</span><span class="token function">move</span><span class="token punctuation">(</span>node<span class="token punctuation">.</span>second<span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">if</span> <span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">empty</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token keyword">return</span> <span class="token boolean">false</span><span class="token punctuation">;</span>
    <span class="token comment">// Sort by weight in descending order</span>
    <span class="token comment">// std::sort(inverted_list_all.begin(), inverted_list_all.end(), [](const ns_index::InvertedElement &e1, const ns_index::InvertedElement &e2)</span>
    <span class="token comment">// { return e1._weight > e2._weight; });</span>
    std<span class="token double-colon punctuation">::</span><span class="token function">sort</span><span class="token punctuation">(</span>inverted_list_all<span class="token punctuation">.</span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> inverted_list_all<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token punctuation">[</span><span class="token punctuation">]</span><span class="token punctuation">(</span><span class="token keyword">const</span> InvertedElementNode <span class="token operator">&</span>e1<span class="token 
punctuation">,</span> <span class="token keyword">const</span> InvertedElementNode <span class="token operator">&</span>e2<span class="token punctuation">)</span><span class="token punctuation">{</span> <span class="token keyword">return</span> e1<span class="token punctuation">.</span>_weight <span class="token operator">></span> e2<span class="token punctuation">.</span>_weight<span class="token punctuation">;</span> <span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token comment">// Serialize the results as JSON</span>
    Json<span class="token double-colon punctuation">::</span>Value root<span class="token punctuation">;</span>
    <span class="token keyword">for</span> <span class="token punctuation">(</span><span class="token keyword">auto</span> <span class="token operator">&</span>item <span class="token operator">:</span> inverted_list_all<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        ns_index<span class="token double-colon punctuation">::</span>DocInfo <span class="token operator">*</span>doc <span class="token operator">=</span> index<span class="token operator">-></span><span class="token function">GetForwardIndex</span><span class="token punctuation">(</span>item<span class="token punctuation">.</span>_doc_id<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">if</span> <span class="token punctuation">(</span><span class="token keyword">nullptr</span> <span class="token operator">==</span> doc<span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">continue</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        Json<span class="token double-colon punctuation">::</span>Value elem<span class="token punctuation">;</span>
        elem<span class="token punctuation">[</span><span class="token string">"title"</span><span class="token punctuation">]</span> <span class="token operator">=</span> doc<span class="token operator">-></span>_title<span class="token punctuation">;</span>
        <span class="token comment">// After deduplication, the first keyword serves as the anchor for the description</span>
        <span class="token comment">// _content is the whole tag-stripped document; only an excerpt of it is wanted here</span>
        elem<span class="token punctuation">[</span><span class="token string">"desc"</span><span class="token punctuation">]</span> <span class="token operator">=</span> <span class="token function">GetDesc</span><span class="token punctuation">(</span>doc<span class="token operator">-></span>_content<span class="token punctuation">,</span> item<span class="token punctuation">.</span>_words<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        elem<span class="token punctuation">[</span><span class="token string">"url"</span><span class="token punctuation">]</span> <span class="token operator">=</span> doc<span class="token operator">-></span>_url<span class="token punctuation">;</span>
        <span class="token comment">// for debug</span>
        <span class="token comment">// elem["id"] = (int)item._doc_id;</span>
        <span class="token comment">// elem["weight"] = item._weight; // int->string</span>
        root<span class="token punctuation">.</span><span class="token function">append</span><span class="token punctuation">(</span>elem<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    Json<span class="token double-colon punctuation">::</span>StyledWriter r<span class="token punctuation">;</span>
    <span class="token operator">*</span>json_string <span class="token operator">=</span> r<span class="token punctuation">.</span><span class="token function">write</span><span class="token punctuation">(</span>root<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span 
class="token keyword">return</span> <span class="token boolean">true</span><span class="token punctuation">;</span>
<span class="token punctuation">}</span></code></pre>
<p>As you can see, searching again now makes the server return a single entry whose weight is 4, so the deduplication feature is done!</p></li></ul>
<p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/f6a7e1db57804a36ac901263aebc4cba.png" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/f6a7e1db57804a36ac901263aebc4cba.png" /></p>
<h3>9.2 Removing stop words</h3>
<p>Stop words were introduced earlier:</p>
<blockquote><p>Stop words are common words that are filtered out in natural language processing, usually words that contribute little to the meaning of a text, such as “的”, “是” and “在”. They are typically ignored during text processing and analysis because in most cases they do not affect what the text means.</p></blockquote>
<p>Stop words are handled at segmentation time: check whether the segmentation result contains any stop words and drop the ones that appear. Does that mean we have to enumerate every stop word ourselves? No: the <code>cppjieba</code> library already ships a stop-word file that lists them.</p>
<p><noscript><img decoding="async" class="aligncenter" src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/1a1c624fefdc4bca9596e15fe88174b5.png" /></noscript><img decoding="async" class="lazyload aligncenter" src='data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E' data-src="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/1a1c624fefdc4bca9596e15fe88174b5.png" /></p>
<ul><li>Add stop-word removal to the <code>JiebaUtil</code> class without affecting the callers of its interface, i.e. the upper-layer code must not need any changes.</li>
<li>First read the stop-word file and load every stop word into memory. Because we need to check quickly whether a segmented word is a stop word, we again use an <code>unordered_map</code> to store them.</li>
<li>Walk the result of <code>Jieba</code> segmentation and check each word; if it is a stop word, remove it from the result. Watch out for <code>vector</code> iterator invalidation here!</li>
<li>Finally, the <code>JiebaUtil</code> class can be turned into a singleton.</li>
<li>With stop-word removal enabled, building the index becomes painfully slow, so decide for yourself whether it is worth adding!</li></ul>
<pre><code class="prism language-cpp"><span class="token keyword">class</span> <span class="token 
class-name">JiebaUtil</span>
<span class="token punctuation">{</span>
<span class="token keyword">private</span><span class="token operator">:</span>
    <span class="token function">JiebaUtil</span><span class="token punctuation">(</span><span class="token punctuation">)</span> <span class="token operator">:</span> <span class="token function">jieba</span><span class="token punctuation">(</span>DICT_PATH<span class="token punctuation">,</span> HMM_PATH<span class="token punctuation">,</span> USER_DICT_PATH<span class="token punctuation">,</span> IDF_PATH<span class="token punctuation">,</span> STOP_WORD_PATH<span class="token punctuation">)</span> <span class="token punctuation">{</span><span class="token punctuation">}</span>
    <span class="token function">JiebaUtil</span><span class="token punctuation">(</span><span class="token keyword">const</span> JiebaUtil<span class="token operator">&</span><span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token keyword">delete</span><span class="token punctuation">;</span>
    JiebaUtil<span class="token operator">&</span> <span class="token keyword">operator</span><span class="token operator">=</span><span class="token punctuation">(</span><span class="token keyword">const</span> JiebaUtil<span class="token operator">&</span><span class="token punctuation">)</span> <span class="token operator">=</span> <span class="token keyword">delete</span><span class="token punctuation">;</span>

<span class="token keyword">public</span><span class="token operator">:</span>
    <span class="token keyword">static</span> JiebaUtil<span class="token operator">*</span> <span class="token function">GetInstance</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// Lazily initialized singleton</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span>_instance <span class="token operator">==</span> <span class="token keyword">nullptr</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token function">pthread_mutex_lock</span><span class="token punctuation">(</span><span class="token operator">&</span>_mutex<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span>_instance <span class="token operator">==</span> <span class="token keyword">nullptr</span><span class="token punctuation">)</span>
            <span class="token punctuation">{</span>
                _instance <span class="token operator">=</span> <span class="token keyword">new</span> JiebaUtil<span class="token punctuation">;</span>
                <span class="token comment">// Load the stop words once; without this call _stop_words stays empty</span>
                _instance<span class="token operator">-></span><span class="token function">InitJiebaUtil</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
            <span class="token function">pthread_mutex_unlock</span><span class="token punctuation">(</span><span class="token operator">&</span>_mutex<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        <span class="token keyword">return</span> _instance<span class="token punctuation">;</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">void</span> <span class="token function">InitJiebaUtil</span><span class="token punctuation">(</span><span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// Read the file that holds the stop words</span>
        std<span class="token double-colon punctuation">::</span>ifstream <span class="token function">in_file</span><span class="token punctuation">(</span>STOP_WORD_PATH<span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token keyword">if</span><span class="token punctuation">(</span><span class="token operator">!</span>in_file<span class="token punctuation">.</span><span class="token function">is_open</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            std<span class="token double-colon punctuation">::</span>cerr <span class="token operator"><<</span> <span class="token string">"file "</span> <span class="token operator"><<</span> STOP_WORD_PATH <span class="token operator"><<</span> <span class="token string">" open failed"</span> <span class="token operator"><<</span> std<span class="token double-colon punctuation">::</span>endl<span class="token punctuation">;</span>
            <span class="token keyword">return</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
        std<span class="token double-colon punctuation">::</span>string stop_word<span class="token punctuation">;</span>
        <span class="token keyword">while</span><span class="token punctuation">(</span><span class="token function">getline</span><span class="token punctuation">(</span>in_file<span class="token punctuation">,</span> stop_word<span class="token punctuation">)</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token comment">// Insert the stop word into the hash table</span>
            _stop_words<span class="token punctuation">.</span><span class="token function">insert</span><span class="token punctuation">(</span><span class="token punctuation">{</span>stop_word<span class="token punctuation">,</span> <span class="token boolean">true</span><span class="token punctuation">}</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">void</span> <span class="token function">CutStringHelper</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string <span class="token operator">&</span>src<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon 
punctuation">::</span>string<span class="token operator">></span> <span class="token operator">*</span>out<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        jieba<span class="token punctuation">.</span><span class="token function">CutForSearch</span><span class="token punctuation">(</span>src<span class="token punctuation">,</span> <span class="token operator">*</span>out<span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token comment">// segment the string</span>
        <span class="token keyword">for</span><span class="token punctuation">(</span><span class="token keyword">auto</span> iter <span class="token operator">=</span> out<span class="token operator">-></span><span class="token function">begin</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> iter <span class="token operator">!=</span> out<span class="token operator">-></span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span><span class="token punctuation">)</span>
        <span class="token punctuation">{</span>
            <span class="token keyword">if</span><span class="token punctuation">(</span>_stop_words<span class="token punctuation">.</span><span class="token function">find</span><span class="token punctuation">(</span><span class="token operator">*</span>iter<span class="token punctuation">)</span> <span class="token operator">!=</span> _stop_words<span class="token punctuation">.</span><span class="token function">end</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
            <span class="token punctuation">{</span>
                <span class="token comment">// Found in the table, so this word is a stop word; erase returns the next valid iterator</span>
                iter <span class="token operator">=</span> out<span class="token operator">-></span><span class="token function">erase</span><span class="token punctuation">(</span>iter<span class="token punctuation">)</span><span class="token punctuation">;</span>
            <span class="token punctuation">}</span>
            <span class="token keyword">else</span>
                iter<span class="token operator">++</span><span class="token punctuation">;</span>
        <span class="token punctuation">}</span>
    <span class="token punctuation">}</span>
    <span class="token keyword">static</span> <span class="token keyword">void</span> <span class="token function">CutString</span><span class="token punctuation">(</span><span class="token keyword">const</span> std<span class="token double-colon punctuation">::</span>string<span class="token operator">&</span> src<span class="token punctuation">,</span> std<span class="token double-colon punctuation">::</span>vector<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token operator">></span><span class="token operator">*</span> out<span class="token punctuation">)</span>
    <span class="token punctuation">{</span>
        <span class="token comment">// jieba.CutForSearch(src, *out);</span>
        Util<span class="token double-colon punctuation">::</span><span class="token class-name">JiebaUtil</span><span class="token double-colon punctuation">::</span><span class="token function">GetInstance</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token operator">-></span><span class="token function">CutStringHelper</span><span class="token punctuation">(</span>src<span class="token punctuation">,</span> out<span class="token punctuation">)</span><span class="token punctuation">;</span>
    <span class="token punctuation">}</span>

<span class="token keyword">private</span><span class="token operator">:</span>
    cppjieba<span class="token double-colon punctuation">::</span>Jieba jieba<span class="token punctuation">;</span> <span class="token comment">// segmentation object</span>
    <span class="token keyword">static</span> JiebaUtil<span class="token operator">*</span> _instance<span class="token punctuation">;</span> <span class="token comment">// the singleton instance</span>
    <span class="token keyword">static</span> pthread_mutex_t _mutex<span class="token punctuation">;</span> <span class="token comment">// mutex</span>
    std<span class="token double-colon punctuation">::</span>unordered_map<span class="token operator"><</span>std<span class="token double-colon punctuation">::</span>string<span class="token punctuation">,</span> <span class="token keyword">bool</span><span class="token operator">></span> _stop_words<span class="token punctuation">;</span> <span class="token comment">// stop words live in a hash table for fast lookup</span>
<span class="token punctuation">}</span><span class="token punctuation">;</span>
<span class="token comment">// The lock keeps multiple threads from touching the critical resource concurrently; with PTHREAD_MUTEX_INITIALIZER no destroy call is needed</span>
pthread_mutex_t JiebaUtil<span class="token double-colon punctuation">::</span>_mutex <span class="token operator">=</span> PTHREAD_MUTEX_INITIALIZER<span class="token punctuation">;</span>
JiebaUtil<span class="token operator">*</span> JiebaUtil<span class="token double-colon punctuation">::</span>_instance <span class="token operator">=</span> <span class="token keyword">nullptr</span><span class="token punctuation">;</span> <span class="token comment">// the single instance of the singleton</span></code></pre>
<p>Finally, just run the server program:</p>
<pre><code class="prism language-bash">./Server <span class="token operator"><span class="token file-descriptor important">2</span>></span> err.txt <span class="token comment"># redirect standard error to err.txt</span></code></pre>
<p>You can also turn it into a daemon so that the service keeps running. My server is rather underpowered, so I do not keep it running all the time!</p>
<h2>10. 
Summary</h2>
<ol><li>When a bug shows up, the cause is in your own code, not in some external factor.</li>
<li>Out of habit I initialized the <code>DocInfo</code> struct as <code>DocInfo doc = {0}</code>, and it took me a long time to track down the resulting error.</li>
<li>When walking a container with an iterator in a <code>while</code> loop, I keep forgetting to increment the iterator, and it has happened more than once. Next time I will use a <code>for</code> loop, or skip iterators altogether: the <code>C++11</code> range-based <code>for</code> is convenient.</li>
<li>I fell into the <code>vector</code> iterator invalidation trap again this time; hopefully that was the last time!</li></ol></article></div></div><div class="sidebar-wrapper col-md-12 col-lg-3 h-100" data-sticky><div class="sidebar"><div id="recent-posts-4" class="widget widget_recent_entries"><ul><li> <a 
href="https://www.maxssl.com/article/57854/">戏精摩根大通:从唱空比特币到牵手贝莱德</a></li></ul></div><div id="ri_sidebar_posts_widget-2" class="widget sidebar-posts-list"><h5 class="widget-title">热文推荐</h5><div class="row g-3 row-cols-1"><div class="col"><article class="post-item item-list"><div class="entry-media ratio ratio-3x2 col-auto"> <a target="" class="media-img lazy" href="https://www.maxssl.com/article/11909/" title="故障注入的方法与工具" data-bg="https://img.maxssl.com/uploads/?url=https://img2023.cnblogs.com/blog/3144817/202304/3144817-20230414163912262-1546218135.png"></a></div><div class="entry-wrapper"><div class="entry-body"><h2 class="entry-title"> <a target="" href="https://www.maxssl.com/article/11909/" title="故障注入的方法与工具">故障注入的方法与工具</a></h2></div></div></article></div><div class="col"><article class="post-item item-list"><div class="entry-media ratio ratio-3x2 col-auto"> <a target="" class="media-img lazy" href="https://www.maxssl.com/article/5004/" title="【k哥爬虫普法】爬取数据是否一定构成不正当竞争?" data-bg="https://img.maxssl.com/uploads/?url=https://img2023.cnblogs.com/other/2501174/202212/2501174-20221202142816913-977058465.gif"></a></div><div class="entry-wrapper"><div class="entry-body"><h2 class="entry-title"> <a target="" href="https://www.maxssl.com/article/5004/" title="【k哥爬虫普法】爬取数据是否一定构成不正当竞争?">【k哥爬虫普法】爬取数据是否一定构成不正当竞争?</a></h2></div></div></article></div><div class="col"><article class="post-item item-list"><div class="entry-media ratio ratio-3x2 col-auto"> <a target="" class="media-img lazy" href="https://www.maxssl.com/article/43158/" title="免费的GPT4来了,你还不知道吗?" 
data-bg="https://img.maxssl.com/uploads/?url=https://img.maxssl.com/uploads/?url=https://csdnimg.cn/release/blog_editor_html/release2.3.6/ckeditor/plugins/CsdnLink/icons/icon-default.png"></a></div><div class="entry-wrapper"><div class="entry-body"><h2 class="entry-title"> <a target="" href="https://www.maxssl.com/article/43158/" title="免费的GPT4来了,你还不知道吗?">免费的GPT4来了,你还不知道吗?</a></h2></div></div></article></div><div class="col"><article class="post-item item-list"><div class="entry-media ratio ratio-3x2 col-auto"> <a target="" class="media-img lazy" href="https://www.maxssl.com/article/47830/" title="微信公众号配置 Token 认证以及消息推送功能" data-bg="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/be3e44a68b4241adbf2ae10bb536dbd0.png"></a></div><div class="entry-wrapper"><div class="entry-body"><h2 class="entry-title"> <a target="" href="https://www.maxssl.com/article/47830/" title="微信公众号配置 Token 认证以及消息推送功能">微信公众号配置 Token 认证以及消息推送功能</a></h2></div></div></article></div><div class="col"><article class="post-item item-list"><div class="entry-media ratio ratio-3x2 col-auto"> <a target="" class="media-img lazy" href="https://www.maxssl.com/article/45890/" title="Visual Studio中,每次新建文件都会自动出现提前设置好的头文件配置方法" data-bg="https://img.maxssl.com/uploads/?url=https://img-blog.csdnimg.cn/direct/9de7e55041164fa49484b2062216499f.png"></a></div><div class="entry-wrapper"><div class="entry-body"><h2 class="entry-title"> <a target="" href="https://www.maxssl.com/article/45890/" title="Visual Studio中,每次新建文件都会自动出现提前设置好的头文件配置方法">Visual Studio中,每次新建文件都会自动出现提前设置好的头文件配置方法</a></h2></div></div></article></div><div class="col"><article class="post-item item-list"><div class="entry-media ratio ratio-3x2 col-auto"> <a target="" class="media-img lazy" href="https://www.maxssl.com/article/45935/" title="基于OpenCV-Python的图像位置校正和版面分析" data-bg="https://img.maxssl.com/uploads/?url=https://img2023.cnblogs.com/blog/3039442/202312/3039442-20231223113733041-883757296.png"></a></div><div 
class="entry-wrapper"><div class="entry-body"><h2 class="entry-title"> <a target="" href="https://www.maxssl.com/article/45935/" title="基于OpenCV-Python的图像位置校正和版面分析">基于OpenCV-Python的图像位置校正和版面分析</a></h2></div></div></article></div></div></div></div></div></div></div></main><footer class="site-footer py-md-4 py-2 mt-2 mt-md-4"><div class="container"><div class="text-center small w-100"><div>Copyright © <script>today=new Date();document.write(today.getFullYear());</script> maxssl.com 版权所有 <a href="https://beian.miit.gov.cn/" target="_blank" rel="nofollow noopener">浙ICP备2022011180号</a></div><div class=""><script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-7656930379472324"
     crossorigin="anonymous"></script></div></div></div></footer><div class="rollbar"><ul class="actions"><li><a target="" href="https://www.maxssl.com/" rel="nofollow noopener noreferrer"><i class="fas fa-home"></i><span></span></a></li><li><a target="" href="http://wpa.qq.com/msgrd?v=3&uin=6666666&site=qq&menu=yes" rel="nofollow noopener noreferrer"><i class="fab fa-qq"></i><span></span></a></li></ul></div><div class="back-top"><i class="fas fa-caret-up"></i></div><div class="dimmer"></div><div class="off-canvas"><div class="canvas-close"><i class="fas fa-times"></i></div><div class="logo-wrapper"> <a class="logo text" href="https://www.maxssl.com/">MaxSSL</a></div><div class="mobile-menu d-block d-lg-none"></div></div> <script></script><noscript><style>.lazyload{display:none}</style></noscript><script data-noptimize="1">window.lazySizesConfig=window.lazySizesConfig||{};window.lazySizesConfig.loadMode=1;</script><script async data-noptimize="1" src='https://www.maxssl.com/wp-content/plugins/autoptimize/classes/external/js/lazysizes.min.js'></script><script src='//cdn.bootcdn.net/ajax/libs/jquery/3.6.0/jquery.min.js' id='jquery-js'></script> <script src='//cdn.bootcdn.net/ajax/libs/highlight.js/11.7.0/highlight.min.js' id='highlight-js'></script> <script src='https://www.maxssl.com/wp-content/themes/ripro-v5/assets/js/vendor.min.js' id='vendor-js'></script> <script id='main-js-extra'>var 
zb={"home_url":"https:\/\/www.maxssl.com","ajax_url":"https:\/\/www.maxssl.com\/wp-admin\/admin-ajax.php","theme_url":"https:\/\/www.maxssl.com\/wp-content\/themes\/ripro-v5","singular_id":"56340","post_content_nav":"0","site_notify_auto":"0","current_user_id":"0","ajax_nonce":"022d319fcd","gettext":{"__copypwd":"\u5bc6\u7801\u5df2\u590d\u5236\u526a\u8d34\u677f","__copybtn":"\u590d\u5236","__copy_succes":"\u590d\u5236\u6210\u529f","__comment_be":"\u63d0\u4ea4\u4e2d...","__comment_succes":"\u8bc4\u8bba\u6210\u529f","__comment_succes_n":"\u8bc4\u8bba\u6210\u529f\uff0c\u5373\u5c06\u5237\u65b0\u9875\u9762","__buy_be_n":"\u8bf7\u6c42\u652f\u4ed8\u4e2d\u00b7\u00b7\u00b7","__buy_no_n":"\u652f\u4ed8\u5df2\u53d6\u6d88","__is_delete_n":"\u786e\u5b9a\u5220\u9664\u6b64\u8bb0\u5f55\uff1f"}};</script> <script src='https://www.maxssl.com/wp-content/themes/ripro-v5/assets/js/main.min.js' id='main-js'></script> </body></html>