UTL_MATCH - String Similarity in Oracle

Thursday, December 07, 2006

UTL_MATCH - String Similarity in Oracle

You can compare string using equality operator (=), or using similarity operator (LIKE). But there are cases where it is not enough. You Can use UTL_MATCH package to calculate string similarity index. This packages offers 4 functions that take two strings as parameters:

  • edit_distance - algorithm by Levenshtein - returns number of edits that must be done to change one string into second,
  • edit_distance_similarity - normalized results of edit_distance in percents - integer values,
  • jaro_winkler - returns similarity based on Jaro-Winkler distance algorithm,
  • jaro_winkler_similarity - same as above but presented as integer in range 0-100.
Lets teke a look how it works. First we can create sample table and insert some data:
create table countries(name varchar2(15) not null);
insert into countries values ('Poland');
insert into countries values ('Germany');
insert into countries values ('United States');
insert into countries values ('Portugal');
insert into countries values ('Czech Republic');
insert into countries values ('China');
insert into countries values ('Slovakia');
insert into countries values ('Slovenia');
commit;
Now we can take a look at results. Lets compare to miss-spelled country name: ' Slovnia'
select name
 ,to_char(utl_match.edit_distance(name, 'Slovnia'),'999') edit_dist
 ,to_char(utl_match.edit_distance_similarity(name, 'Slovnia'),'999') edit_dist_sim
 ,to_char(utl_match.jaro_winkler(name, 'Slovnia'),'999d9999') jaro_winkler
 ,to_char(utl_match.jaro_winkler_similarity(name, 'Slovnia'),'999') jaro_winkler_sim
from countries
order by jaro_winkler_sim desc;

NAME                 EDIT EDIT JARO_WINK JARO
-------------------- ---- ---- --------- ----
Slovenia                1   88     .9750   97
Slovakia                2   75     .8881   88
China                   5   29     .5619   56
United States          12    8     .5531   55
Poland                  6   15     .5317   53
Portugal                7   13     .5119   51
Germany                 7    0     .3571   35
Czech Republic         13    8     .0000    0
Above we can observe differences in algorithms.
Lets test it on NULLs:
SQL> select to_char(utl_match.edit_distance('test', NULL),'999')
  2      edit_dist
  3   ,to_char(utl_match.edit_distance_similarity('test', NULL),'999')
  4      edit_dist_sim
  5   ,to_char(utl_match.jaro_winkler('test', NULL),'999d9999')
  6      jaro_winkler
  7   ,to_char(utl_match.jaro_winkler_similarity('test', NULL),'999')
  8      jaro_winkler_sim
  9  from dual;

EDIT EDIT JARO_WINK JARO
---- ---- --------- ----
  -1  125     .0000    0
We can see that using edit_distance on NULLs migth be dengerous.
All samples were run on Oracle 10g XE

你可能感兴趣的:(oracle,sql)