According to Wikipedia, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. This distance can be used to find a row in a SQL database where the keyword does not match exactly the fields. By computing the distance between the keywords and the candidates, it becomes possible to find the closest (in term of Levenshtein distance) row in the database. It can be usefull for example for finding a nickname, a username or a city even if the keywords does not match exactly.
Let's take an example. In the following, we'll assume the database contains a row containing the field "New-York". The user was asked for a city and input the following:
|New-York||New-York||0||The words match perfectly, the distance is 0.|
|New York||New-York||1||One substitution.|
|NewYorke||New-York||2||One insertion and one deletion.|
|Paris||New-York||8||Five substitutions and three deletions.|
The following implementation has been tested with these versions:
The first step is to implement the
levenshtein function in mySQL. Select the
table where the distance between words is required and copy the
following SQL code (from http://www.artfulsoftware.com):
DELIMITER $$ CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) RETURNS INT DETERMINISTIC BEGIN DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; DECLARE s1_char CHAR; -- max strlen=255 DECLARE cv0, cv1 VARBINARY(256); SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; IF s1 = s2 THEN RETURN 0; ELSEIF s1_len = 0 THEN RETURN s2_len; ELSEIF s2_len = 0 THEN RETURN s1_len; ELSE WHILE j <= s2_len DO SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; END WHILE; WHILE i <= s1_len DO SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; WHILE j <= s2_len DO SET c = c + 1; IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; IF c > c_temp THEN SET c = c_temp; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; IF c > c_temp THEN SET c = c_temp; END IF; SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; END WHILE; SET cv1 = cv0, i = i + 1; END WHILE; END IF; RETURN c; END$$ DELIMITER ;
Run the query on your server.
Result must be OK before continuing:
The function must appear in your current table. It may need few minutes to be refreshed, depending on your version. Refresh the whole page if needed.
Before performing complex queries, let's check that the function works properly. Run the following query:
SELECT levenshtein('New-York', 'Paris')
It should return a distance of 8:
The Levenshtein function can now be used in a SQL query:
SELECT * FROM `cities` WHERE levenshtein('$keyword', `name`) BETWEEN 0 AND 4
The previous query returns the rows of table
cities where the distance between
keyword and the field
name is less than 5. Let's try with
the word newyorke:
That's all folks !