Getting Started With MySQL's Full-Text Search Capabilities
Need a rock-solid, powerful search solution for your PHP/MySQL driven web site? In this article Mitchell introduces us to MySQL's full-text and Boolean search capabilities.
In February 2002 I wrote an article called "Developing A Site Search Engine With PHP And MySQL". In that article I showed you how to build a "search engine" using a keyword based approach. Notice how the words search and engine are in quotes in my last sentence? That's because it wasn't actually a search engine in the true sense of the term. It was based on the concept of keyword searching.
Today I'm going to show you how to use MySQL's extremely powerful full-text search capabilities, which you can use to build your own simple search engine. In this article I will assume that you have an intermediate knowledge of MySQL.
I will also assume that you have at least MySQL 3.23.23 installed (which is when support for full-text searching was added). You will need MySQL version 4.0.1 or above to test the Boolean search methods described later in this article.
What is Full-Text Searching?
Imagine having a database that contained 10,000 tables. In each of these tables there are 1,000 rows with 100 fields. How would you effectively search this sort of information structure without killing your web server? The answer is MySQL's full-text search capabilities.
A full-text search makes use of indexes, which you can define against a table either when it is created or by using MySQL's ALTER TABLE command. These indexes are set up on specific fields of a table, and MySQL maintains them alongside the records for that particular table.
For example, let's say that you have a table to store the title, price, description, availability and picture of some computer products. When users perform a search, they will most commonly enter part of the product's title or description.
If we set up an index on the title and description fields in our MySQL database, then MySQL keeps an index of the words in those fields. When a user performs a search, MySQL can retrieve related records more quickly, because it has already indexed the words it needs to look up.
Full-text searches are typically faster than other search methods such as wildcard or character based searches, which are commonly performed using MySQL's LIKE operator.
Besides being significantly quicker than normal character based searches, why would you want to use full-text searching? Here's a quick list of points to whet your appetite:
- Full-text searching is ideal for extremely large databases that contain thousands or even millions of rows. Computations are performed faster and rows can be ranked based on search relevance, which is returned as a decimal number by MySQL.
- Noise words such as "the" and "and", together with any words that are 3 characters or less in length, are removed from the search query. This means that more accurate results are returned. If you searched for "as the people", then the noise words "the" and "as" will automatically be removed from your query.
- In addition to simple searches, full-text searches can also be performed in Boolean mode. Boolean mode allows searches based on and/or criteria, such as "+person +Mitchell", which would return only records that contain both the word person AND the word Mitchell. We will look at Boolean searches later in this article.
- The query is case-insensitive, meaning that "cat" is ranked the same as "Cat", "CAT" and "cAT".
As you can see, full-text searching is fast, powerful and smart. It eliminates the need for us to write complicated search and Boolean algorithms, and you can be up and running with MySQL's full-text searching and indexing capabilities in under 5 minutes.
Your First Full-Text Search
Let's jump right in and see full-text searching in action. First off we need to create our database, so fire up the MySQL console window and create a new database called testDB, like this:
create database testDB;
use testDB;
Next, we need to create a table using MySQL's FULLTEXT command, specifying which fields we want to index for searching:
create table testTable
(
pk_tId int auto_increment not null,
firstName varchar(20),
lastName varchar(20),
age int,
details text,
primary key(pk_tId),
unique id(pk_tId),
fulltext(firstName, lastName, details)
);
What we've done here is create a basic table that will store the details of some people. The first field is a uniquely identified primary key and the rest are simple varchar, int and text fields.
If you already have a table set up and want to index existing fields, use the ALTER TABLE command like this: ALTER TABLE myTable ADD FULLTEXT(field1, field2);
Take a look at the fulltext line in our CREATE TABLE command, shown above:
fulltext(firstName, lastName, details)
This line tells MySQL to set a full-text index on the firstName, lastName and details fields of our table. Full-text indexes can only be created on fields of type VARCHAR and TEXT. Because these fields are indexed, we can now use the power of a MySQL full-text search to find records in this table based on the values in these 3 fields.
Before we can perform a full-text search however, we need to add some records to our table with the following MySQL commands:
insert into testTable values(0, 'Mitchell', 'Harper', 20, 'Mitchell is the founder and manager of devArticles and various other sites across the SiteCubed network');
insert into testTable values(0, 'Ben', 'Rowe', 19, 'Ben is our Flash guru. He is currently writing a series of Flash articles that show beginners how to use Flash to create dynamic content with PHP');
insert into testTable values(0, 'Havard', 'Lindset', 19, 'Havard is a PHP and MySQL power user');
insert into testTable values(0, 'Michael', 'Manatissian', 25, 'Michael is the senior devArticles editor and is a capable Linux/Apache administrator');
insert into testTable values(0, 'Sandra', 'Lee', 23, 'Sandra is our finance and accounting wizard, having 4 years of experience under her belt');
Because full-text searching was designed for larger databases, it can return unexpected results when it's used on tables that contain only a small number of records. That's why I've created 5 "chunky" records for our test table.
Now that we have records in our table, how do we search them? Simple: we create a query to invoke MySQL's full-text search feature, like this:
select firstName from testTable where match(firstName, lastName, details) against('guru');
This query returns a single record: Ben, the only person whose details field contains the word "guru".
Let's analyze our query. First up, we have the SELECT and FROM parts of our query:
select firstName from testTable
There's nothing unusual about this part of the query: it simply tells MySQL to retrieve the firstName field from the table called testTable. Next up, we have the WHERE clause:
where match(firstName, lastName, details) against('guru');
This is where the power of full-text searching starts. In the first part of the query we call the MATCH command, which tells MySQL to match against the values of the firstName, lastName and details fields when it performs a full-text, natural language search.
When the MATCH command is used as part of the SELECT clause it returns a relevance ranking, which is a positive decimal number. The closer to 0 this number is, the less relevant the record is. This relevance value is determined based on the search expression, the number of words in the indexed fields, as well as the total number of records being searched.
Lastly, we have the AGAINST command. AGAINST is simple enough and accepts just one parameter, which is the string that we're searching for. Later in this article we will see how to perform boolean searches using the AGAINST command in combination with the IN BOOLEAN MODE keywords.
So far so good, right? But how does a full-text search differ from using the LIKE command in a query like this:
select firstName from testTable where details like '%guru%';
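... they both return the same result, don't they? Well, yes and no: LIKE can only tell us whether a record matched, whereas a full-text search can also tell us how well it matched. Let's now see how we can determine a relevance ranking for each of the records returned from a MySQL full-text search.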
Determining A Relevance Ranking
Still working with our test table, take the following query into consideration:
select concat(firstName, ' ', lastName) as name, match(firstName, lastName, details) against('devArticles') as relevance from testTable where match(firstName, lastName, details) against('devArticles');
This query produces one row for each matching person, made up of a name field and a relevance field.
In this query I've included the MATCH command in the SELECT clause to return a relevance ranking for each record. The query performs a full-text search against the firstName, lastName and details fields for the string "devArticles":
where match(firstName, lastName, details) against('devArticles');
The query has returned two records, which contained the string "devArticles" in either their firstName, lastName or details fields. Looking back at our records, we can see that the records for Mitchell and Michael contained the string "devArticles" in their details fields:
'Mitchell is the founder and manager of devArticles and various other sites across the SiteCubed network'
'Michael is the senior devArticles editor and is a capable Linux/Apache administrator'
The relevance field returned from our query was generated with the following expression:
match(firstName, lastName, details) against('devArticles') as relevance
Looking carefully at the query, we can see that this expression has been used twice: once in the SELECT clause and once in the WHERE clause. MySQL picks up on this and only performs one full-text search on the table and not two, meaning that there is no additional overhead produced for a query like this. That is a huge bonus and time saver if a similar query were performed on a table with 5 million rows.
When the MATCH command is used in the WHERE clause, MySQL automatically sorts the rows from highest to lowest relevance. In our previous example query we only returned records that actually had a match against the string "devArticles" when a full-text search was conducted. Here's a query that returns the relevance ranking for every record in our table:
select concat(firstName, ' ', lastName) as name, match(firstName, lastName, details) against('devArticles') as relevance from testTable;
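We've left out the WHERE clause, so the resultant records are unordered and every record in the table is returned. The records that don't mention "devArticles" simply come back with a relevance of 0.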
Full-Text Rules And The MATCH Command
Remember earlier when I listed a number of bullet points, one of which stated that MySQL removes noise words and those of 3 characters or less? Let's test that theory with 2 basic full-text search queries:
select firstName, match(firstName, lastName, details) against('devArticles is on the www') as relevance from testTable;
This query returns a relevance value for each of the records in our table.
Notice how our last query only had one word that was longer than 3 characters in length, "devArticles". If we remove all words of 3 characters or less from the search string then the relevance ranking will remain the same:
select firstName, match(firstName, lastName, details) against('devArticles') as relevance from testTable;
Comparing the records returned by the two queries, we can see that the relevance ranking remains the same: MySQL does indeed remove noise words and those words with 3 characters or less.
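To make the filtering rule concrete, here is a minimal Python sketch of which words from a search string would still take part in a search. It is only a model of the behaviour described above, not MySQL's own code: the NOISE_WORDS set is a tiny illustrative subset made up for this example (MySQL's built-in list is much longer), and the function name effective_terms and the regular expression used to split words are assumptions as well.

```python
import re

# Illustrative subset only (an assumption); MySQL's built-in list is much longer.
NOISE_WORDS = {"the", "and", "as", "is", "on", "of", "for"}

# Words of 3 characters or less are dropped, as described in the article.
MIN_WORD_LEN = 4


def effective_terms(query):
    """Return the lower-cased words that would still take part in the search."""
    words = re.findall(r"[A-Za-z0-9_']+", query)
    return [
        word.lower()
        for word in words
        if len(word) >= MIN_WORD_LEN and word.lower() not in NOISE_WORDS
    ]


print(effective_terms("devArticles is on the www"))  # ['devarticles']
print(effective_terms("as the people"))              # ['people']
```

Both of the article's search strings collapse to a single word, which is why the two queries above produce the same relevance values.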
MySQL's full-text search ranks words based on their semantic values: common words rank lower than uncommon words. This makes sense, as a word that exists in many records is less relevant than a word that only appears in 1 or 2 records. Semantic word rankings are used in most popular full-text searching algorithms. Popular search engines and directories also employ this method.
The 50% Threshold
MySQL removes noise words and short words, but it also ignores any word that is present in more than 50% of the records being searched, so a search for such a word on its own returns no records. MySQL calls this the "50% threshold". In a way this makes sense, as a word that appears in most records does little to separate relevant records from irrelevant ones.
Here's one of the comments from a MySQL user on their site:
"... you should add at least 3 rows to the table before you try to match anything, and what you're searching for should only be contained in one of the three rows. This is because of the 50% threshold. If you insert only one row, then no matter what you search for, it is in 50% or more of the rows in the table, and therefore disregarded."
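The arithmetic behind the threshold can be sketched in a few lines of Python. This is a simplified model rather than MySQL's implementation: it only looks at the details values of our five test records, the helper names are hypothetical, and it follows the "more than 50%" wording used above (the quoted comment says "50% or more", so treat the exact boundary as an assumption).

```python
import re

# The details values of the five test records inserted earlier in the article.
DETAILS = [
    "Mitchell is the founder and manager of devArticles and various other "
    "sites across the SiteCubed network",
    "Ben is our Flash guru. He is currently writing a series of Flash articles "
    "that show beginners how to use Flash to create dynamic content with PHP",
    "Havard is a PHP and MySQL power user",
    "Michael is the senior devArticles editor and is a capable Linux/Apache "
    "administrator",
    "Sandra is our finance and accounting wizard, having 4 years of experience "
    "under her belt",
]


def rows_containing(word, rows):
    """Count the rows in which the word appears as a whole word, ignoring case."""
    word = word.lower()
    return sum(
        1 for row in rows if word in re.findall(r"[a-z0-9_']+", row.lower())
    )


def over_threshold(word, rows):
    """True when the word appears in more than half of the rows (assumed boundary)."""
    return rows_containing(word, rows) * 2 > len(rows)


print(over_threshold("devArticles", DETAILS))      # False: 2 of 5 rows
print(over_threshold("devArticles", DETAILS[:1]))  # True: 1 of 1 rows
```

The second call mirrors the situation in the quoted comment: with a single row in the table, any word that row contains is over the threshold.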
Performing A Boolean Search
MySQL version 4.0.1 and above can perform complex full-text Boolean searches. I am using MySQL version 4.0.1, which is still in the Alpha stage. If you want to perform Boolean searches then you will need to download MySQL version 4.0.1 or later.
So what exactly is a Boolean search? Put simply, it is a powerful way to include or exclude words and phrases from your search criteria. If you've ever searched for something like "+download +games", then you've used a Boolean search engine.
By combining various operators within your search string, you can filter in/out other words, change a word's contribution to the relevance value and more. Here's the list of Boolean operators, as listed at MySQL.com:
- + A leading plus sign indicates that this word must be present in every row returned.
- - A leading minus sign indicates that this word must not be present in any row returned.
- (no operator) By default (when neither plus nor minus is specified) the word is optional, but the rows that contain it will be rated higher. This mimics the behaviour of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier.
- < > These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The < operator decreases the contribution and the > operator increases it.
- ( ) Parentheses are used to group words into sub expressions.
- ~ A leading tilde acts as a negation operator, causing the word's contribution to the row relevance to be negative. It's useful for marking noise words. A row that contains such a word will be rated lower than others, but will not be excluded altogether, as it would be with the - operator.
- * An asterisk is the truncation operator. Unlike the other operators, it should be appended to the word, not prepended.
- " " A phrase that is enclosed in double quotes matches only rows that contain this phrase literally, as it was typed.
A Boolean search is performed in much the same way as a normal full-text search, however it includes the IN BOOLEAN MODE keywords, as shown in the example below:
select * from testTable where match(firstName, lastName, details) against ('+DevArticles -flash' in boolean mode);
In the example query above I'm specifying that every row returned must contain the word "DevArticles" but must NOT contain the word "flash". Still working with our test table, which record do you think will have the highest relevance ranking if we executed this query?
select firstName, match(firstName, lastName, details) against('+mysql -devArticles >flash' in boolean mode) as relevance from testTable where match(firstName, lastName, details) against('+mysql -devArticles >flash' in boolean mode);
Looking at the bulleted list above, a record is returned by this query only if:
- It contains the word "mysql"
- It doesn't contain the word "devArticles"
Among the records that qualify, those that contain the word "flash" are ranked higher.
Only Havard's record contains the required word "mysql", so it is the only record this query returns.
How about this query:
select firstName, match(firstName, lastName, details) against('+devArticles -sitecubed' in boolean mode) as relevance from testTable where match(firstName, lastName, details) against('+devArticles -sitecubed' in boolean mode);
In this query, records that contain the word devArticles and NOT the word siteCubed will rank. Only Michael's record qualifies, because Mitchell's details also mention SiteCubed.
All of the other records returned a 0 relevance ranking, and are therefore not shown in the results. The best way to become familiar with Boolean full-text searches is to experiment with the various operators, including where they are positioned in the search string (including the use of brackets) and how often they are used.
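If you want to experiment with the operators from a script, a small helper can assemble the search string for you. The Python sketch below is hypothetical (the function name and its keyword arguments are my own invention) and only covers the +, - and > operators plus plain optional words. It produces the text you would place inside AGAINST(... IN BOOLEAN MODE).

```python
import re

# Accept single plain words only, so that user input cannot smuggle in operators.
_WORD = re.compile(r"\w+")


def boolean_search_string(required=(), excluded=(), boosted=(), optional=()):
    """Compose a search string for AGAINST(... IN BOOLEAN MODE)."""
    parts = []
    groups = (("+", required), ("-", excluded), (">", boosted), ("", optional))
    for prefix, words in groups:
        for word in words:
            if not _WORD.fullmatch(word):
                raise ValueError(f"not a single plain word: {word!r}")
            parts.append(prefix + word)
    return " ".join(parts)


print(boolean_search_string(
    required=["mysql"], excluded=["devArticles"], boosted=["flash"]
))
# +mysql -devArticles >flash
```

The printed string is the same one used in the first relevance query of this section, so you can compare what the helper builds with what you would type by hand.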
Conclusion
In this article we've seen how to set up and use MySQL full-text and Boolean searches. Just to reiterate, here are the steps you need to take to get full-text working on your database:
1. Create/alter a table, making sure a call to the FULLTEXT command is issued when creating/altering the table.
2. To perform a full-text normal search, use the MATCH and AGAINST commands in your query for fields that have been marked as full text, like this:
SELECT *, MATCH(field1, field3) AGAINST('my_query') as relevance FROM myTable WHERE MATCH(field1, field3) AGAINST('my_query')
3. To perform a full-text Boolean search, use the MATCH and AGAINST commands in combination with the IN BOOLEAN MODE keywords in your query for fields that have been marked as full text, like this:
SELECT *, MATCH(field1, field3) AGAINST('my_query' IN BOOLEAN MODE) as relevance FROM myTable WHERE MATCH(field1, field3) AGAINST('my_query' IN BOOLEAN MODE)
Using the examples and information provided in this article you should now be able to create a basic full-text search against your database. Simply use a scripting language such as PHP to gather a search query from your users. Next, pass this string to your MySQL query as the parameter to the AGAINST command. Lastly, iterate through the results, displaying the relevance ranking if desired.
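As a rough illustration of those last steps, here is a Python sketch (the article's own site would use PHP) that builds the query text and its parameters without connecting to a database. The function name build_fulltext_query is hypothetical, and the %s placeholder assumes a driver that uses that parameter style, as common Python MySQL drivers do. With such a driver you would pass the two return values to cursor.execute(sql, params).

```python
import re

# Table and column names cannot be sent as parameters, so validate them instead.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_fulltext_query(table, columns, search, boolean=False):
    """Return (sql, params) for a full-text search with a relevance column."""
    for name in [table, *columns]:
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    mode = " IN BOOLEAN MODE" if boolean else ""
    match = f"MATCH({', '.join(columns)}) AGAINST(%s{mode})"
    sql = f"SELECT *, {match} AS relevance FROM {table} WHERE {match}"
    # The search string appears twice: once in SELECT and once in WHERE.
    return sql, (search, search)


sql, params = build_fulltext_query(
    "testTable", ["firstName", "lastName", "details"], "guru"
)
print(sql)
print(params)
```

Keeping the user's search string in params, rather than pasting it into the SQL text, means quotes typed by the user cannot break out of the AGAINST() call.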
Original article: http://www.devarticles.com/c/a/MySQL/Getting-Started-With-MySQLs-Full-Text-Search-Capabilities/