Since Version 3.23.23, MySQL has support for full-text indexing
and searching. Full-text indexes in MySQL are an index of type
FULLTEXT
. FULLTEXT
indexes can be created from VARCHAR
and TEXT
columns at CREATE TABLE
time or added later with
ALTER TABLE
or CREATE INDEX
. For large datasets, adding
FULLTEXT
index with ALTER TABLE
(or CREATE INDEX
) would
be much faster than inserting rows into the empty table with a FULLTEXT
index.
Full-text search is performed with the MATCH
function.
mysql> CREATE TABLE articles ( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(200), -> body TEXT, -> FULLTEXT (title,body) -> ); Query OK, 0 rows affected (0.00 sec) mysql> INSERT INTO articles VALUES -> (0,'MySQL Tutorial', 'DBMS stands for DataBase Management ...'), -> (0,'How To Use MySQL Efficiently', 'After you went through a ...'), -> (0,'Optimizing MySQL','In this tutorial we will show how to ...'), -> (0,'1001 MySQL Trick','1. Never run mysqld as root. 2. Normalize ...'), -> (0,'MySQL vs. YourSQL', 'In the following database comparison we ...'), -> (0,'MySQL Security', 'When configured properly, MySQL could be ...'); Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0 mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database'); +----+-------------------+---------------------------------------------+ | id | title | body | +----+-------------------+---------------------------------------------+ | 5 | MySQL vs. YourSQL | In the following database comparison we ... | | 1 | MySQL Tutorial | DBMS stands for DataBase Management ... | +----+-------------------+---------------------------------------------+ 2 rows in set (0.00 sec)
The function MATCH
matches a natural language query AGAINST
a text collection (which is simply the set of columns covered by a
FULLTEXT
index). For every row in a table it returns relevance -
a similarity measure between the text in that row (in the columns that are
part of the collection) and the query. When it is used in a WHERE
clause (see example above) the rows returned are automatically sorted with
relevance decreasing. Relevance is a non-negative floating-point number.
Zero relevance means no similarity. Relevance is computed based on the
number of words in the row, the number of unique words in that row, the
total number of words in the collection, and the number of documents (rows)
that contain a particular word.
The above is a basic example of using MATCH
function. Rows are
returned with relevance decreasing.
mysql> SELECT id,MATCH (title,body) AGAINST ('Tutorial') FROM articles; +----+-----------------------------------------+ | id | MATCH (title,body) AGAINST ('Tutorial') | +----+-----------------------------------------+ | 1 | 0.64840710366884 | | 2 | 0 | | 3 | 0.66266459031789 | | 4 | 0 | | 5 | 0 | | 6 | 0 | +----+-----------------------------------------+ 5 rows in set (0.00 sec)
This example shows how to retrieve the relevances. As neither WHERE
nor ORDER BY
clauses are present, returned rows are not ordered.
mysql> SELECT id, body, MATCH (title,body) AGAINST ( -> 'Security implications of running MySQL as root') AS score -> FROM articles WHERE MATCH (title,body) AGAINST -> ('Security implications of running MySQL as root'); +----+-----------------------------------------------+-----------------+ | id | body | score | +----+-----------------------------------------------+-----------------+ | 4 | 1. Never run mysqld as root. 2. Normalize ... | 1.5055546709332 | | 6 | When configured properly, MySQL could be ... | 1.31140957288 | +----+-----------------------------------------------+-----------------+ 2 rows in set (0.00 sec)
This is more complex example - the query returns the relevance and still
sorts the rows with relevance decreasing. To achieve it one should specify
MATCH
twice. Note, that this will cause no additional overhead, as
MySQL optimizer will notice that these two MATCH
calls are
identical and will call full-text search code only once.
MySQL uses a very simple parser to split text into words. A ``word'' is any sequence of letters, numbers, `'', and `_'. Any ``word'' that is present in the stopword list or just too short (3 characters or less) is ignored.
Every correct word in the collection and in the query is weighted, according to its significance in the query or collection. This way, a word that is present in many documents will have lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Otherwise, if the word is rare, it will receive a higher weight. The weights of the words are then combined to compute the relevance of the row.
Such a technique works best with large collections (in fact, it was carefully tuned this way). For very small tables, word distribution does not reflect adequately their semantical value, and this model may sometimes produce bizarre results.
mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL'); Empty set (0.00 sec)
Search for the word MySQL
produces no results in the above example.
Word MySQL
is present in more than half of rows, and as such, is
effectively treated as a stopword (that is, with semantical value zero).
It is, really, the desired behavior - a natural language query should not
return every second row in 1GB table.
A word that matches half of rows in a table is less likely to locate relevant documents. In fact, it will most likely find plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that such rows have been assigned a low semantical value in this particular dataset.
Unfortunately, full-text search has no user-tunable parameters yet, although adding some is very high on the TODO. However, if you have a MySQL source distribution (See section 4.7 Installing a MySQL Source Distribution.), you can somewhat alter the full-text search behavior.
Note that full-text search was carefully tuned for the best searching effectiveness. Modifying the default behavior will, in most cases, only make the search results worse. Do not alter the MySQL sources unless you know what you are doing!
myisam/ftdefs.h
file by the line
#define MIN_WORD_LEN 4Change it to the value you prefer, recompile MySQL, and rebuild your
FULLTEXT
indexes.
myisam/ft_static.c
Modify it to your taste, recompile MySQL and rebuild
your FULLTEXT
indexes.
myisam/ftdefs.h
:
#define GWS_IN_USE GWS_PROBto
#define GWS_IN_USE GWS_FREQand recompile MySQL. There is no need to rebuild the indexes in this case.
This section includes a list of the fulltext features that are already implemented in the 4.0 tree. It explains More functions for full-text search entry of section H.1 Things that should be in 4.0.
REPAIR TABLE
with FULLTEXT
indexes,
ALTER TABLE
with FULLTEXT
indexes, and
OPTIMIZE TABLE
with FULLTEXT
indexes are now
up to 100 times faster.
MATCH ... AGAINST
now supports the following
boolean operators:
+
word means the that word must be present in every
row returned.
-
word means the that word must not be present in every
row returned.
<
and >
can be used to decrease and increase word
weight in the query.
~
can be used to assign a negative weight to a noise
word.
*
is a truncation operator.
ft_dump
added for low-level FULLTEXT
index operations (querying/dumping/statistics).
FULLTEXT
index faster.
()
in boolean full-text search.
FULLTEXT
index
(yes, very slow).
MERGE
tables.
FULLTEXT
in CREATE/ALTER TABLE
).
Go to the first, previous, next, last section, table of contents.