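Technical overview:
Sphinx is a SQL-based full-text search engine that works alongside MySQL or PostgreSQL. It offers more specialized search features than the database itself, which makes professional full-text search easier for applications to implement. Sphinx provides search API bindings for scripting languages such as PHP, Python, Perl and Ruby, and also ships a storage engine plugin for MySQL.
A single Sphinx index can hold up to 100 million records, and queries against 10 million records return in 0.x seconds (millisecond range). Indexing is fast as well: building an index over 1 million records takes only 3 to 4 minutes, 10 million records can be indexed within 50 minutes, and an incremental index containing only the latest 100,000 records can be rebuilt in a few dozen seconds.
Sphinx does not support Chinese indexing and search by default. coreseek is a full-text search server built on sphinx; it bundles libmmseg, a Chinese word segmentation package (containing the mmseg segmenter) designed for sphinx, and is currently the most widely used Chinese search solution for sphinx.
Before sphinx, full-text search over a large volume of articles in a mysql database typically used a statement such as SELECT *** WHERE *** LIKE '%word%';. A LIKE query with a leading % wildcard cannot use mysql's own indexes, so it needs a full table scan and is extremely slow.
With sphinx, full-text indexing is handed off to sphinx: sphinx returns the IDs of the documents containing the word, and those IDs are then used to look up exactly the matching rows in the database. (The original post illustrated this flow with a figure.)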
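The two-step lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `articles` table, the sample rows and the toy in-memory inverted index (standing in for sphinx) are all hypothetical, and SQLite replaces MySQL so the example runs on its own.

```python
import sqlite3
from collections import defaultdict


def build_toy_index(rows):
    """Map each whitespace-separated token to the set of row IDs containing it.

    Hypothetical stand-in for the full-text index that sphinx would build.
    """
    index = defaultdict(set)
    for row_id, text in rows:
        for token in text.lower().split():
            index[token].add(row_id)
    return index


def search_ids(index, word):
    """Step 1: the search engine returns only the matching document IDs."""
    return sorted(index.get(word.lower(), set()))


def fetch_rows(conn, ids):
    """Step 2: fetch the full rows by primary key, which the database can index."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    sql = f"SELECT id, body FROM articles WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(sql, ids).fetchall()


# Hypothetical sample data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
rows = [(1, "network search engine"), (2, "database index"), (3, "full text search")]
conn.executemany("INSERT INTO articles VALUES (?, ?)", rows)

index = build_toy_index(rows)
ids = search_ids(index, "search")
matched = fetch_rows(conn, ids)
print(ids, matched)
```

The point of the split is that the database only ever sees a primary-key lookup, never a LIKE '%word%' scan.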
[Step 1] Install mmseg3
cd /data/program
tar zxvf coreseek-4.1-beta.tar.gz
cd coreseek-4.1-beta
cd mmseg-3.2.14
./bootstrap
./configure --prefix=/usr/local/mmseg3
make && make install

Problem encountered:
error: cannot find input file: src/Makefile.in
(or other similar errors)

Fix: run the following commands in order.
yum -y install libtool

aclocal
libtoolize --force
automake --add-missing
autoconf
autoheader
make clean
Once libtool is installed, run the sequence above starting from aclocal, then rerun the installation steps from the beginning.
[Step 2] Install coreseek
## Install coreseek (run from the coreseek-4.1-beta directory)
$ cd csft-3.2.14    # or cd csft-4.0.1, or cd csft-4.1, depending on the version
$ sh buildconf.sh    # warnings in the output can be ignored; errors must be resolved
If it fails to build:
1. In csft-4.1/buildconf.sh, find the line
&& aclocal \
and add after it:
&& automake --add-missing \
2. In csft-4.1/configure.ac, find:
AM_INIT_AUTOMAKE([-Wall -Werror foreign])
and change it to:
AM_INIT_AUTOMAKE([-Wall foreign])
Then find:
AC_PROG_RANLIB
and add after it:
AM_PROG_AR
3. Finally, in csft-4.1/src/sphinxexpr.cpp, replace every occurrence of:
T val = ExprEval ( this->m_pArg, tMatch );
with:
T val = this->ExprEval ( this->m_pArg, tMatch );
Once it builds without errors, continue with the following commands:
$ ./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql
## If configure reports a MySQL problem, see the MySQL data source installation notes: http://www.coreseek.cn/product_install/install_on_bsd_linux/#mysql
$ make && make install
$ cd ..


## Command-line test of mmseg segmentation and coreseek search (set the locale to zh_CN.UTF-8 beforehand so that Chinese displays correctly)
$ cd testpack
$ cat var/test/test.xml    # Chinese text should display correctly at this point
$ /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml
$ /usr/local/coreseek/bin/indexer -c etc/csft.conf --all
$ /usr/local/coreseek/bin/search -c etc/csft.conf 网络搜索
If you see the error "xmlpipe2 support NOT compiled in. To use xmlpipe2, install missing XML libra", run the following command:
yum -y install expat expat-devel
After installing these packages, recompile coreseek and regenerate the index; the test should then pass.
The test output looks like this:
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.com)

using config file 'etc/csft.conf'...
index 'xml': query '网络搜索 ': returned 1 matches of 1 total in 0.000 sec

displaying matches:
1. document=1, weight=1590, published=Thu Apr 1 07:20:07 2010, author_id=1

words:
1. '网络': 1 documents, 1 hits
2. '搜索': 2 documents, 5 hits
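The words list in this output shows that the query 网络搜索 was split into 网络 and 搜索 before matching. As a rough illustration of dictionary-based segmentation, the sketch below uses forward maximum matching. It is a simplified stand-in and not necessarily what libmmseg does internally; the tiny dictionary, the `max_len` limit and the function name are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedily take the longest dictionary word at each position.

    Falls back to a single character when no dictionary word matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real one holds many thousands of entries.
toy_dict = {"网络", "搜索", "全文", "检索"}

tokens = forward_max_match("网络搜索", toy_dict)
print(tokens)
```

Without a segmenter, a language written without spaces gives the indexer no word boundaries to work with, which is why coreseek adds this step in front of sphinx.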
[Step 3] Configure sphinx with mysql
Create the configuration file that connects sphinx to mysql:
# vi /usr/local/coreseek/etc/csft_mysql.conf
# MySQL data source configuration; for details see: http://www.coreseek.cn/docs/coreseek_4.1-sphinx_2.0.1-beta.html#conf-reference

# source definition
source threads
{
    type = mysql

    sql_host = localhost
    sql_user = root
    sql_pass = root
    sql_db = ultrax
    sql_port = 3306
    # match this to the database character set (utf8 if the database is utf8); the same applies to every SET NAMES below
    sql_query_pre = SET NAMES utf8
    # sql_query_pre = SET SESSION query_cache_type=OFF
    sql_query_pre = CREATE TABLE IF NOT EXISTS pre_common_sphinxcounter ( indexid INTEGER PRIMARY KEY NOT NULL,maxid INTEGER NOT NULL)
    sql_query_pre = REPLACE INTO pre_common_sphinxcounter SELECT 1, MAX(tid)-10 FROM pre_forum_thread

    sql_query = SELECT t.tid AS id,t.tid,t.subject,t.digest,t.displayorder,t.authorid,t.lastpost,t.special \
        FROM pre_forum_thread AS t \
        WHERE t.tid>=$start AND t.tid<=$end

    sql_query_range = SELECT (SELECT MIN(tid) FROM pre_forum_thread),maxid FROM pre_common_sphinxcounter WHERE indexid=1
    sql_range_step = 4096

    sql_attr_uint = tid
    sql_attr_uint = digest
    sql_attr_uint = displayorder
    sql_attr_uint = authorid
    sql_attr_uint = special

    sql_attr_timestamp = lastpost

    sql_query_info = SELECT * FROM pre_forum_thread WHERE tid=$id
}

# threads
index threads
{
    source = threads
    path = /var/data/threads    # on Windows, prefer a full path
    docinfo = extern
    mlock = 0
    morphology = none
    min_word_len = 1
    html_strip = 0
    charset_dictpath = /usr/local/mmseg3/etc/    # setting for BSD/Linux; must end with /
    charset_type = zh_cn.utf-8
    #charset_debug = 0
    ngram_len = 0
}

# threads_minute
source threads_minute : threads
{
    # overriding sql_query_pre keeps the inherited REPLACE from updating the counter when the delta is indexed
    sql_query_pre =
    sql_query_pre = SET NAMES utf8
    # sql_query_pre = SET SESSION query_cache_type=OFF
    sql_query_range = SELECT maxid+1,(SELECT MAX(tid) FROM pre_forum_thread) FROM pre_common_sphinxcounter WHERE indexid=1
}

# threads_minute
index threads_minute : threads
{
    source = threads_minute
    path = /var/data/threads_minute    # on Windows, prefer a full path
}

##################################################

# posts
source posts
{
    type = mysql

    sql_host = localhost
    sql_user = root
    sql_pass = root
    sql_db = ultrax
    sql_port = 3306
    sql_query_pre =
    sql_query_pre = SET NAMES utf8
    # sql_query_pre = SET SESSION query_cache_type=OFF
    sql_query_pre = REPLACE INTO pre_common_sphinxcounter SELECT 2, MAX(pid)-2 FROM pre_forum_post
    sql_query = SELECT p.pid AS id,p.tid,p.subject,p.message,t.digest,t.displayorder,t.authorid,t.lastpost,t.special \
        FROM pre_forum_post AS p LEFT JOIN pre_forum_thread AS t USING(tid) \
        WHERE p.pid>=$start AND p.pid<=$end

    sql_query_range = SELECT (SELECT MIN(pid) FROM pre_forum_post),maxid FROM pre_common_sphinxcounter WHERE indexid=2
    sql_range_step = 4096

    sql_attr_uint = tid
    sql_attr_uint = digest
    sql_attr_uint = displayorder
    sql_attr_uint = authorid
    sql_attr_uint = special

    sql_attr_timestamp = lastpost

    sql_query_info = SELECT * FROM pre_forum_post WHERE pid=$id
}

# posts
index posts
{
    source = posts
    path = /var/data/posts    # on Windows, prefer a full path
    docinfo = extern
    mlock = 0
    morphology = none
    min_word_len = 1
    html_strip = 0
    charset_dictpath = /usr/local/mmseg3/etc/    # setting for BSD/Linux; must end with /
    charset_type = zh_cn.utf-8
    #charset_debug = 0
    ngram_len = 0
}

# posts_minute
source posts_minute : posts
{
    # overriding sql_query_pre keeps the inherited REPLACE from updating the counter when the delta is indexed
    sql_query_pre =
    sql_query_pre = SET NAMES utf8
    # sql_query_pre = SET SESSION query_cache_type=OFF
    sql_query_range = SELECT maxid+1,(SELECT MAX(pid) FROM pre_forum_post) FROM pre_common_sphinxcounter WHERE indexid=2
}

# posts_minute
index posts_minute : posts
{
    source = posts_minute
    path = /var/data/posts_minute    # on Windows, prefer a full path
}

# global indexer settings
indexer
{
    mem_limit = 256M
}

# searchd service settings
searchd
{
    listen = 3312
    read_timeout = 5
    max_children = 30
    max_matches = 10000
    seamless_rotate = 0
    preopen_indexes = 0
    unlink_old = 1
    pid_file = /usr/local/coreseek/var/log/searchd_discuzx.pid    # on Windows, prefer a full path
    log = /usr/local/coreseek/var/log/searchd_discuzx.log    # on Windows, prefer a full path
    query_log = /usr/local/coreseek/var/log/query_discuzx.log    # on Windows, prefer a full path
}
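The threads and threads_minute sources in this configuration split the ID space at the maxid stored in pre_common_sphinxcounter, and each range is then fetched in chunks of sql_range_step. The sketch below replays that arithmetic. It assumes SQLite in place of MySQL and a made-up table of 10,000 threads, and `range_chunks` is a hypothetical helper that mimics how $start and $end advance; it is not part of sphinx.

```python
import sqlite3


def range_chunks(min_id, max_id, step):
    """Yield inclusive (start, end) pairs covering min_id..max_id in step-sized chunks."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


# Hypothetical data: 10,000 threads with tid 1..10000.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pre_forum_thread (tid INTEGER PRIMARY KEY, subject TEXT)")
conn.executemany(
    "INSERT INTO pre_forum_thread VALUES (?, ?)",
    ((i, f"thread {i}") for i in range(1, 10001)),
)

# The two sql_query_pre statements of the main source: record the split point.
conn.execute(
    "CREATE TABLE IF NOT EXISTS pre_common_sphinxcounter "
    "(indexid INTEGER PRIMARY KEY NOT NULL, maxid INTEGER NOT NULL)"
)
conn.execute(
    "REPLACE INTO pre_common_sphinxcounter SELECT 1, MAX(tid)-10 FROM pre_forum_thread"
)

# sql_query_range of the main source: from the smallest tid up to maxid.
main_range = conn.execute(
    "SELECT (SELECT MIN(tid) FROM pre_forum_thread), maxid "
    "FROM pre_common_sphinxcounter WHERE indexid=1"
).fetchone()

# sql_query_range of the delta source: everything after maxid.
delta_range = conn.execute(
    "SELECT maxid+1, (SELECT MAX(tid) FROM pre_forum_thread) "
    "FROM pre_common_sphinxcounter WHERE indexid=1"
).fetchone()

main_chunks = list(range_chunks(*main_range, 4096))
print(main_range, delta_range, main_chunks)
```

Because the delta source does not rerun the REPLACE statement, the split point only moves when the main index is rebuilt, and the delta range grows as new threads arrive.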
[Step 4] Start the service and build the indexes
Start the background service (it must be running):
# /usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/csft_mysql.conf
Build the indexes (this must be run once before querying or testing):
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft_mysql.conf --all --rotate
Build an incremental index (replace delta with the actual index name):
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft_mysql.conf delta --rotate
Merge indexes (replace main and delta with the actual index names; only two indexes can be merged at a time):
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft_mysql.conf --merge main delta --rotate --merge-dst-range deleted 0 0
(--merge-dst-range deleted 0 0 is added so that the same document is not matched more than once after the merge; it keeps only documents whose deleted attribute is 0 in the destination index.)
Test the service:
# /usr/local/coreseek/bin/search -c /usr/local/coreseek/etc/csft_mysql.conf aaa
Stop the background service:
# /usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/csft_mysql.conf --stop
Automation (run the incremental index every minute, merge the indexes every five minutes, and rebuild everything every day at 1:30):
crontab -e
*/1 * * * * /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft_mysql.conf delta --rotate
*/5 * * * * /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft_mysql.conf --merge main delta --rotate --merge-dst-range deleted 0 0
30 1 * * * /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft_mysql.conf --all --rotate
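Since delta and main in these entries are placeholders, the crontab has to be written out once per index pair. A small sketch of generating the lines follows, assuming the pairs threads/threads_minute and posts/posts_minute from the configuration in step 3; the helper name `crontab_lines` is hypothetical, and the indexer options are copied from the commands above rather than verified against a live install.

```python
INDEXER = "/usr/local/coreseek/bin/indexer"
CONFIG = "/usr/local/coreseek/etc/csft_mysql.conf"

# (main index, delta index) pairs, assumed from the step 3 configuration.
INDEX_PAIRS = [("threads", "threads_minute"), ("posts", "posts_minute")]


def crontab_lines(pairs=INDEX_PAIRS):
    """Return crontab entries following the schedule described in the post."""
    base = f"{INDEXER} -c {CONFIG}"
    lines = []
    for main, delta in pairs:
        # Every minute: rebuild the small delta index.
        lines.append(f"*/1 * * * * {base} {delta} --rotate")
        # Every five minutes: fold the delta into its main index.
        lines.append(
            f"*/5 * * * * {base} --merge {main} {delta} --rotate "
            "--merge-dst-range deleted 0 0"
        )
    # Daily at 01:30: rebuild every index from scratch.
    lines.append(f"30 1 * * * {base} --all --rotate")
    return lines


for line in crontab_lines():
    print(line)
```

The printed lines can be pasted into the editor opened by crontab -e.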
This completes the configuration; all that remains is to enable sphinx in the discuz admin panel.