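The general idea is as follows:
1 Sphinx can build an index over the data in a MySQL database, and that index can be searched directly.
2 Sphinx can build a specified index over the data in a MySQL database, and SQL queries run inside MySQL can then use that Sphinx index to produce the correct results.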
$wget http://www.coreseek.com/uploads/sources/mmseg-0.7.3.tar.gz
$tar zxvf mmseg-0.7.3.tar.gz
$cd mmseg-0.7.3/
$./configure --prefix=/usr/local/mmseg (installs into the directory we specify)
$make
$make install
$cd ../
1 Download and unpack the MySQL source package:
$wget http://blog.s135.com/soft/linux/nginx_php/mysql/mysql-5.1.26-rc.tar.gz
$tar zxvf mysql-5.1.26-rc.tar.gz
2 Download the two Chinese-support patches for Sphinx, then download and unpack the Sphinx source package.
$wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8-rc2.tar.gz
$wget http://www.coreseek.com/uploads/sources/sphinx-0.98rc2.zhcn-support.patch
$wget http://www.coreseek.com/uploads/sources/fix-crash-in-excerpts.patch
$tar zxvf sphinx-0.9.8-rc2.tar.gz
3 Apply the Chinese patches to Sphinx
$cd sphinx-0.9.8-rc2/
$patch -p1 < ../sphinx-0.98rc2.zhcn-support.patch
$patch -p1 < ../fix-crash-in-excerpts.patch
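4 Copy the Sphinx plugin into the MySQL source tree. Note that the contents of mysqlse go into the corresponding location in the MySQL source tree; do not copy the mysqlse directory itself.
$cp -rf mysqlse ../mysql-5.1.26-rc/storage/sphinx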
$cd ../
5 Compile and install the MySQL source package
$cd mysql-5.1.26-rc/
$sh BUILD/autorun.sh
$./configure --with-plugins=sphinx --prefix=/usr/local/mysql/ --enable-assembler --with-extra-charsets=complex --enable-thread-safe-client --with-big-tables --with-readline --with-ssl --with-embedded-server --enable-local-infile (it is best to specify your own install directory with the prefix option, which makes the later steps easier to control)
$make
$make install
$cd ../
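At this point MySQL and its Sphinx plugin are compiled and installed. If the installation succeeded, the new storage engine is visible from inside mysql. The steps are:
$cd /usr/local/mysql
$mysql -uroot -h127.0.0.1 -P3307 -p
Enter password: (just press Enter here)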
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 32
Server version: 5.1.26-rc-log Source distribution
Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
root@(none) 09:44:58>show engines;
+------------+---------+-----------------------------------------------------------+--------------+----+------------+
| Engine     | Support | Comment                                                   | Transactions | XA | Savepoints |
+------------+---------+-----------------------------------------------------------+--------------+----+------------+
| CSV        | YES     | CSV storage engine                                        | NO           | NO | NO         |
| SPHINX     | YES     | Sphinx storage engine 0.9.8                               | NO           | NO | NO         |
| MEMORY     | YES     | Hash based, stored in memory, useful for temporary tables | NO           | NO | NO         |
| MyISAM     | DEFAULT | Default engine as of MySQL 3.23 with great performance    | NO           | NO | NO         |
| MRG_MYISAM | YES     | Collection of identical MyISAM tables                     | NO           | NO | NO         |
+------------+---------+-----------------------------------------------------------+--------------+----+------------+
5 rows in set (0.00 sec)
If the SPHINX storage engine appears in the list, the installation succeeded.
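6 Compile and install the Sphinx source package (run the commands below from the sphinx-0.9.8-rc2/ source directory)
Check the installed Python version:
$whereis python
The result is: python: /usr/bin/python2.4 /usr/bin/python /usr/lib/python2.4 /usr/include/python2.4
Set the corresponding parameters according to that result, passing them on the same command line as configure:
$CPPFLAGS=-I/usr/include/python2.4 LDFLAGS=-lpython2.4 ./configure --prefix=/usr/local/sphinx --with-mysql=/usr/local/mysql
(This command may fail, complaining that the mmseg include files or lib files cannot be found. In that case specify the mmseg paths by appending --with-mmseg-includes=/usr/local/mmseg/include --with-mmseg-libs=/usr/local/mmseg/lib --with-mmseg to the command above)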
$make
$make install
Sphinx is now installed.
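1 Enter mysql and create the test database
mysql> create database test;
2 Set up the Sphinx configuration file
$ cd /usr/local/sphinx/etc
$ sudo cp sphinx.conf.dist sphinx.conf
$ vi sphinx.conf (edit the conf file with vi; use the same parameters you use to connect to your MySQL database, see ~/local/mysql/my.cnf for reference; the modified sphinx.conf looks like this)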
source src1
{
# data source type. mandatory, no default value
# known types are 'mysql', 'pgsql', 'xmlpipe', 'xmlpipe2'
type = mysql
#####################################################################
## SQL settings (for 'mysql' and 'pgsql' types)
#####################################################################
# some straightforward parameters for SQL source types
sql_host = 127.0.0.1
sql_user = root
sql_pass =
sql_db = test
sql_port = 3307 # optional, default is 3306
# UNIX socket name
3 Create the tables and test data needed by the example that ships with Sphinx
$mysql -uroot -h127.0.0.1 -P3307 -p test < /usr/local/sphinx/etc/example.sql
4 Build the index with Sphinx
$ sudo /usr/local/sphinx/bin/indexer --all (This step may fail with: error while loading shared libraries: libmysqlclient.so.16: cannot open shared object file: No such file or directory. The fix is to add /u01/mysql/lib, or the lib directory under your own MySQL install path, to /etc/ld.so.conf: open ld.so.conf with vi, append the directory /u01/mysql/lib as a new line at the end of the file, and then run /sbin/ldconfig -v)
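The ld.so.conf fix in step 4 can be made repeatable with a small helper. The sketch below is only an illustration: add_lib_dir is a hypothetical function name, and it assumes the target file lists one library directory per line. It appends the directory only when that exact line is not already there, so running it twice does no harm.

```shell
# Hypothetical helper: append a library directory to an ld.so.conf-style
# file, but only if that exact line is not already present.
add_lib_dir() {
    conf_file="$1"
    lib_dir="$2"
    # -x matches whole lines only, -F treats the path as a fixed string.
    if ! grep -qxF "$lib_dir" "$conf_file" 2>/dev/null; then
        printf '%s\n' "$lib_dir" >> "$conf_file"
    fi
}
```

As root, it could be called as add_lib_dir /etc/ld.so.conf /u01/mysql/lib (substitute the lib directory of your own MySQL installation), followed by /sbin/ldconfig -v to refresh the linker cache.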
5 Search for a keyword using the index Sphinx built
$~/local/sphinx/bin/search test
The output is as follows:
Sphinx 0.9.8-rc2 (r1234)
Copyright (c) 2001-2008, Andrew Aksyonoff
using config file '/home/zhangjie_z.pt/local/sphinx/etc/sphinx.conf'...
index 'test1': query 'test': returned 3 matches of 3 total in 0.000 sec
displaying matches:
1. document=1, weight=2, group_id=1, date_added=Wed Jun 29 16:30:25 2011
    id=1
    group_id=1
    group_id2=5
    date_added=2011-06-29 16:30:25
    title=test one
    content=this is my test document number one. also checking search within phrases.
2. document=2, weight=2, group_id=1, date_added=Wed Jun 29 16:30:25 2011
    id=2
    group_id=1
    group_id2=6
    date_added=2011-06-29 16:30:25
    title=test two
    content=this is my test document number two
3. document=4, weight=1, group_id=2, date_added=Wed Jun 29 16:30:25 2011
    id=4
    group_id=2
    group_id2=8
    date_added=2011-06-29 16:30:25
    title=doc number four
    content=this is to test groups
words:
1. 'test': 3 documents, 5 hits
index 'test1stemmed': query 'test ': returned 3 matches of 3 total in 0.000 sec
displaying matches:
1. document=1, weight=2, group_id=1, date_added=Wed Jun 29 16:30:25 2011
    id=1
    group_id=1
    group_id2=5
    date_added=2011-06-29 16:30:25
    title=test one
    content=this is my test document number one. also checking search within phrases.
2. document=2, weight=2, group_id=1, date_added=Wed Jun 29 16:30:25 2011
    id=2
    group_id=1
    group_id2=6
    date_added=2011-06-29 16:30:25
    title=test two
    content=this is my test document number two
3. document=4, weight=1, group_id=2, date_added=Wed Jun 29 16:30:25 2011
    id=4
    group_id=2
    group_id2=8
    date_added=2011-06-29 16:30:25
    title=doc number four
    content=this is to test groups
words:
1. 'test': 3 documents, 5 hits
This shows that Sphinx is installed correctly and passes the test.
1 Download the SQL script and import the data:
$wget http://blog.51yip.com/wp-content/uploads/2010/02/test.sql_.zip
Enter mysql and import the data:
mysql> source /path/where/you/unpacked/it/test.sql;
2 Modify sphinx.conf as follows:
$vi /usr/local/sphinx/etc/sphinx.conf and change:
sql_host = 127.0.0.1
sql_user = root
sql_pass =
sql_db = test
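sql_port = 3307 # optional, default is 3306 (use your own MySQL port setting)
sql_query_pre = SET NAMES utf8
sql_query_pre = SET SESSION query_cache_type=OFF
sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
# the line above is for the delta (incremental) index; it records the ID of the last inserted row
# note: do not lose the trailing \ on the lines below, otherwise you get an error
sql_query = \
SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
FROM documents \
WHERE id<=(select max_doc_id from sph_counter where counter_id=1)
# the condition in parentheses serves the delta index. (Many parameters in sphinx.conf are commented out with '#'; the lines above do not need to be typed in by hand, just find them and remove the comment marker)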
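The sph_counter bookkeeping above is one half of a main + delta setup. The fragment below is a sketch of how the two sources could fit together, not the configuration actually used here: the source names main and delta are assumptions, the sph_counter table (counter_id, max_doc_id) is assumed to exist already, and the attribute declarations and index sections are left out.

```
# Sketch only: "main" and "delta" are hypothetical source names.
source main
{
    type     = mysql
    sql_host = 127.0.0.1
    sql_user = root
    sql_pass =
    sql_db   = test
    sql_port = 3307

    sql_query_pre = SET NAMES utf8
    # remember the highest id that the full index will cover
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query = \
        SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
        FROM documents \
        WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
}

# delta inherits the connection settings from main
source delta : main
{
    # redefining sql_query_pre here keeps the REPLACE from running again
    sql_query_pre = SET NAMES utf8
    sql_query = \
        SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
        FROM documents \
        WHERE id>(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
}
```

With this split, the delta source only picks up rows inserted after the last full rebuild, which is what makes frequent delta rebuilds cheap.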
3 Start Sphinx
$/usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --all
$/usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf (This may fail with: FATAL: failed to lock pid file '/home/zhangjie_z.pt/local/sphinx/var/log/searchd.pid': Resource temporarily unavailable (searchd already running?). It means a searchd process is already running. Find its process ID with ps -e | grep searchd, kill the old process with kill PID, and then start the new one)
$ps -e|grep searchd
21040 pts/3 00:00:00 searchd
This shows that searchd is running.
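Before killing anything, it can help to check whether the PID recorded in searchd's pid file really belongs to a live process. The helper below is a minimal sketch: pid_file_alive is a hypothetical name, and the pid file path has to come from the pid_file setting in your own sphinx.conf.

```shell
# Hypothetical helper: succeed if the process recorded in a PID file
# is still running, fail otherwise.
pid_file_alive() {
    pid_file="$1"
    # No readable PID file means nothing is known to be running.
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    [ -n "$pid" ] || return 1
    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    kill -0 "$pid" 2>/dev/null
}
```

For example, pid_file_alive /home/zhangjie_z.pt/local/sphinx/var/log/searchd.pid && echo "searchd is still running" reports a running daemon without touching it.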
4 Enter mysql and query data through the Sphinx index:
mysql> SELECT doc.*
-> FROM documents doc
-> JOIN sphinx ON ( doc.id = sphinx.id )
-> WHERE query = 'test;mode=any';
Output; the query succeeds:
+----+----------+-----------+---------------------+-----------------+---------------------------------------------------------------------------+
| id | group_id | group_id2 | date_added          | title           | content                                                                   |
+----+----------+-----------+---------------------+-----------------+---------------------------------------------------------------------------+
|  1 |        1 |         5 | 2011-06-29 16:30:25 | test one        | this is my test document number one. also checking search within phrases. |
|  2 |        1 |         6 | 2011-06-29 16:30:25 | test two        | this is my test document number two                                       |
|  4 |        2 |         8 | 2011-06-29 16:30:25 | doc number four | this is to test groups                                                    |
+----+----------+-----------+---------------------+-----------------+---------------------------------------------------------------------------+
3 rows in set (0.00 sec)
This shows that searching through the index works.
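1 Building on the mmseg compiled and installed earlier, generate and use the word-segmentation dictionary:
$~/local/mmseg/bin/mmseg -u ~/tmp/mmseg-0.7.3/data/unigram.txt (the two paths are the mmseg install path and the mmseg source path)
After this runs, a unigram.txt.uni file is generated in the data directory. The file is fairly large; it is mmseg's segmentation dictionary. Copy it into the Sphinx install path, then edit sphinx.conf to point at the dictionary location.
$cp unigram.txt.uni ~/local/sphinx/dict/uni.lib (copy it and rename it to uni.lib)
2 Modify sphinx.conf, rebuild the index, and search
Sphinx uses sphinx.conf as its configuration file, and both indexing and searching are driven by it. To do full-text search, first configure sphinx.conf properly, telling Sphinx which fields to index and which fields will be used in where, order by, and group by.
Modify sphinx.conf:
Add to the index section:
charset_type = utf-8
charset_dictpath = ~/local/sphinx/dict (adds the dictionary location to the configuration file)
Add the character table for Chinese: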
charset_table= U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z,U+FF21..U+FF3A->a..z,A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF,U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,U+00E0..U+00F6,U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101,U+0101,U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107,U+0107, U+0108->U+0109,U+0109, U+010A->U+010B, U+010B, U+010C->U+010D,U+010D, U+010E->U+010F, U+010F,U+0110->U+0111, U+0111, U+0112->U+0113,U+0113, U+0114->U+0115, U+0115, U+0116->U+0117,U+0117, U+0118->U+0119,U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, U+011D,U+011E->U+011F,U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133,U+0134->U+0135,U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A,U+013B->U+013C, U+013C,U+013D->U+013E, U+013E, U+013F->U+0140, U+0140,U+0141->U+0142, U+0142, U+0143->U+0144,U+0144, U+0145->U+0146, U+0146,U+0147->U+0148, U+0148, U+014A->U+014B, U+014B,U+014C->U+014D, U+014D,U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, U+0152->U+0153,U+0153,U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159,U+0159,U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F,U+015F, U+0160->U+0161,U+0161, U+0162->U+0163, U+0163, U+0164->U+0165,U+0165, U+0166->U+0167, U+0167,U+0168->U+0169, U+0169, U+016A->U+016B,U+016B, U+016C->U+016D, U+016D, U+016E->U+016F,U+016F, U+0170->U+0171,U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, U+0175,U+0176->U+0177,U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A,U+017B->U+017C,U+017C, U+017D->U+017E, U+017E,U+0410..U+042F->U+0430..U+044F, U+0430..U+044F,U+05D0..U+05EA,U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, U+0621..U+063A,U+01B9,U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06D3,U+06F0..U+06FF,U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, U+0966..U+096F,U+097B..U+097F,U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF,U+0A05..U+0A39, U+0A59..U+0A5E,U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3,U+0AE6..U+0AEF, U+0B05..U+0B39,U+0B5C..U+0B61, 
U+0B66..U+0B6F, U+0B71,U+0B85..U+0BB9, U+0BE6..U+0BF2, U+0C05..U+0C39,U+0C66..U+0C6F, U+0C85..U+0CB9,U+0CDE..U+0CE3, U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,U+0D61, U+0D66..U+0D6F,U+0D85..U+0DC6, U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,U+A807..U+A822,U+0386->U+03B1, U+03AC->U+03B1, U+0388->U+03B5,U+03AD->U+03B5,U+0389->U+03B7, U+03AE->U+03B7, U+038A->U+03B9,U+0390->U+03B9, U+03AA->U+03B9,U+03AF->U+03B9, U+03CA->U+03B9,U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,U+03AB->U+03C5,U+03B0->U+03C5, U+03CB->U+03C5, U+03CD->U+03C5,U+038F->U+03C9,U+03CE->U+03C9, U+03C2->U+03C3,U+0391..U+03A1->U+03B1..U+03C1,U+03A3..U+03A9->U+03C3..U+03C9,U+03B1..U+03C1, U+03C3..U+03C9, U+0E01..U+0E2E,U+0E30..U+0E3A, U+0E40..U+0E45,U+0E47, U+0E50..U+0E59, U+A000..U+A48F, U+4E00..U+9FBF,U+3400..U+4DBF,U+20000..U+2A6DF, U+F900..U+FAFF, U+2F800..U+2FA1F,U+2E80..U+2EFF,U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, U+3040..U+309F,U+30A0..U+30FF,U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, U+3130..U+318F,U+A000..U+A48F,U+A490..U+A4CF
ngram_len=1
ngram_chars=U+4E00..U+9FBF,U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,U+2F800..U+2FA1F,U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,U+3040..U+309F,U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,U+3130..U+318F,U+A000..U+A48F, U+A490..U+A4CF
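(Note that the ngram_chars and charset_table values above must each be a single line. In the vi editor, use :set nu to check that each of the two occupies exactly one line; if a line break is needed, end the line with \ )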
3 Insert Chinese text and test segmented search:
Enter the mysql database and insert Chinese text rows into the documents table. To avoid encoding conflicts, remember to run set names utf8;
root@test 02:56:19>set names utf8;
Query OK, 0 rows affected (0.00 sec)
root@test 03:06:09>select * from documents;
+----+----------+-----------+---------------------+--------------------------------+---------------------------------------------------------------------------+
| id | group_id | group_id2 | date_added          | title                          | content                                                                   |
+----+----------+-----------+---------------------+--------------------------------+---------------------------------------------------------------------------+
|  1 |        1 |         5 | 2011-06-29 16:30:25 | test one                       | this is my test document number one. also checking search within phrases. |
|  2 |        1 |         6 | 2011-06-29 16:30:25 | test two                       | this is my test document number two                                       |
|  3 |        2 |         7 | 2011-06-29 16:30:25 | another doc                    | this is another group                                                     |
|  4 |        2 |         8 | 2011-06-29 16:30:25 | doc number four                | this is to test groups                                                    |
|  9 |        9 |         9 | 2011-07-05 14:10:04 | 张杰很无语testtest             | adfasfasdfafasdfdsaf                                                      |
| 10 |        6 |         6 | 2011-07-05 14:45:28 | 张杰醒醒 fightting , test      | test groups haha                                                          |
+----+----------+-----------+---------------------+--------------------------------+---------------------------------------------------------------------------+
6 rows in set (0.00 sec)
4 Search Chinese text directly with Sphinx:
$~/local/sphinx/bin/search 张
Output:
Sphinx 0.9.8-rc2 (r1234)
Copyright (c) 2001-2008, Andrew Aksyonoff
using config file '/home/zhangjie_z.pt/local/sphinx/etc/sphinx.conf'...
index 'test1': query '张 ': returned 2 matches of 2 total in 0.000 sec
displaying matches:
1. document=9, weight=1, group_id=9, date_added=Tue Jul 5 14:10:04 2011
    id=9
    group_id=9
    group_id2=9
    date_added=2011-07-05 14:10:04
    title=?????testtest
    content=adfasfasdfafasdfdsaf
2. document=10, weight=1, group_id=6, date_added=Tue Jul 5 14:45:28 2011
    id=10
    group_id=6
    group_id2=6
    date_added=2011-07-05 14:45:28
    title=?? ?? fightting , test
    content=test groups haha
words:
1. '张': 2 documents, 2 hits
index 'test1stemmed': query '张 ': returned 2 matches of 2 total in 0.000 sec
displaying matches:
1. document=9, weight=1, group_id=9, date_added=Tue Jul 5 14:10:04 2011
    id=9
    group_id=9
    group_id2=9
    date_added=2011-07-05 14:10:04
    title=?????testtest
    content=adfasfasdfafasdfdsaf
2. document=10, weight=1, group_id=6, date_added=Tue Jul 5 14:45:28 2011
    id=10
    group_id=6
    group_id2=6
    date_added=2011-07-05 14:45:28
    title=?? ?? fightting , test
    content=test groups haha
words:
1. '张': 2 documents, 2 hits
(Segmented Chinese search works and the documents are matched, but the Chinese text is still printed incorrectly; see the question marks in the title lines of the output.)
5 Enter mysql and run a select query using the generated index:
root@test 03:05:20>select doc.* from documents doc join sphinx on (doc.id=sphinx.id) where query='张;mode=any';
+----+----------+-----------+---------------------+--------------------------------+----------------------+
| id | group_id | group_id2 | date_added          | title                          | content              |
+----+----------+-----------+---------------------+--------------------------------+----------------------+
|  9 |        9 |         9 | 2011-07-05 14:10:04 | 张杰很无语testtest             | adfasfasdfafasdfdsaf |
| 10 |        6 |         6 | 2011-07-05 14:45:28 | 张杰醒醒 fightting , test      | test groups haha     |
+----+----------+-----------+---------------------+--------------------------------+----------------------+
2 rows in set (0.00 sec)
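This completes select queries in MySQL through a Sphinx index with Chinese support.
PS: Chinese could not be typed when logging in to the SQA server remotely with SecureCRT; setting SecureCRT's character set to utf-8 fixes that. For Chinese input inside mysql, the fix is the command set names utf8, but it only lasts for the current session and has to be run again after mysql is restarted.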
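Since set names utf8 has to be repeated in every session, one commonly used alternative is to set the client character set in my.cnf. This is offered as an assumption rather than something tested in this setup, and it only affects the mysql command-line client, not the server or other programs.

```
# my.cnf fragment (sketch): read by the mysql command-line client only
[mysql]
default-character-set = utf8
```

A new mysql session started after this change should behave as if set names utf8 had already been run.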
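1 The configurations for the delta index and the full index are backed up in ~/bakup.
2 The rebuilds of the delta index and the full index are written into scripts: the delta index is rebuilt every 3 minutes and the full index is rebuilt every day in the early morning. After inserting data into MySQL, the automatically rebuilt delta index makes the new rows searchable within 3 minutes.
For a related article, see http://www.wodianer.net/article_8.html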
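The rebuild scripts themselves are not shown here, so the following is only a sketch of what such a wrapper might look like. The index names delta and main, the script path in the crontab comments, and the 03:00 time for the nightly rebuild are all assumptions; the indexer and sphinx.conf paths follow the install prefix used earlier.

```shell
# Paths follow the install prefix used earlier; override via environment.
INDEXER="${INDEXER:-/usr/local/sphinx/bin/indexer}"
SPHINX_CONF="${SPHINX_CONF:-/usr/local/sphinx/etc/sphinx.conf}"

# Rebuild one index by name. --rotate makes indexer write a new copy of
# the index and signal the running searchd to switch over to it.
rebuild_index() {
    "$INDEXER" --config "$SPHINX_CONF" --rotate "$1"
}

# When saved as a script (hypothetical path /path/to/rebuild_index.sh),
# end the file with:  rebuild_index "$1"
#
# Hypothetical crontab entries matching the schedule described above:
#   */3 * * * *  /path/to/rebuild_index.sh delta   # every 3 minutes
#   0 3 * * *    /path/to/rebuild_index.sh main    # daily, early morning
```

Keeping the index name as an argument lets one script serve both cron entries.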
3 Used sysbench to insert a large volume of data into MySQL to test Sphinx's stability; stability was good.
Command used to insert the data:
~/local/sysbench/bin/sysbench --test=oltp --num-threads=32 --max-time=60 --mysql-host=127.0.0.1 --mysql-db=test --mysql-port=3309 --mysql-user=root --oltp-test-mode=nontrx --oltp-table-name=document1 --oltp-table-size=40000 --oltp-nontrx-mode=select prepare
Test data recorded:
$./search qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt -c ../etc/sphinx.conf
index 'test1': query 'qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt ': returned 1000 matches of 10610000 total in 0.516 sec
index 'test1stemmed': query 'qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt ': returned 1000 matches of 23610000 total in 1.188 sec
**********************************************************************************************************************
index 'test1': query 'qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt ': returned 1000 matches of 23650000 total in 1.143 sec
index 'test1stemmed': query 'qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt ': returned 1000 matches of 23610000 total in 1.148 sec
**********************************************************************************************************************
index 'test1': query 'qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt ': returned 1000 matches of 23650000 total in 1.820 sec
index 'test1stemmed': query 'qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt ': returned 1000 matches of 24950000 total in 1.935 sec
index 'test1': query 'qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt ': returned 1000 matches of 24990000 total in 1.203 sec
index 'test1stemmed': query 'qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt ': returned 1000 matches of 24950000 total in 1.199 sec
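To compare runs like the ones above, each summary line can be reduced to three numbers: matches returned, total matches, and elapsed seconds. The filter below is a small sketch; parse_search_line is a hypothetical name, and it assumes the "returned N matches of M total in T sec" wording shown in the output above.

```shell
# Hypothetical filter: read search output on stdin and print
# "returned total seconds" for every summary line; other lines are dropped.
parse_search_line() {
    sed -n 's/.*returned *\([0-9][0-9]*\) matches *of *\([0-9][0-9]*\) total in \([0-9][0-9.]*\) sec.*/\1 \2 \3/p'
}
```

It can be placed at the end of a pipeline, for example ./search qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt -c ../etc/sphinx.conf | parse_search_line, so that repeated runs are easy to collect into one file.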
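My understanding of how all this works is still fuzzy, and parts of the setup were done by imitation: during configuration I did not always know what a parameter meant or what a given step was for. The underlying principles are not yet clear to me, and I need more time to get familiar with them.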