“Fuzzy logic” search in SQL Server
I was looking hard for a good, robust, yet possibly simple and fast to implement method of “fuzzy” logic search for an application based on SQL Server 2005. “Fuzzy” meaning that a search would return not only exact (or wildcard) matches but also phonetically similar words. This is especially useful for any database applications with operators using last names, component names or any name actually heard over the phone for example.
SQL offers two methods: DIFFERENCE and SOUNDEX. Unfortunately both are quite inefficient unfortunately.
Books online give some examples how they work, but the fact is that the algorithm is a well-known and simple one: https://en.wikipedia.org/wiki/Soundex.
The first obstacle is that it retains the first letter of string, so “goat” and “boat” are according to SOUNDEX not phonetically similar…
What we did is take SOUNDEX of two compared words and then run them through DIFFERENCE. A difference of 3 would then mean similar words. This makes gets rid of the first letter “obstacle”, but then produces a too far-off comparison for words beginning with the same letter.
Unfortunately SQL does not have anything more built-in. The only way is to improve the algorithm yourself which can be quite complex – increased performance hit, and time for development. An example is a 600+ line stored procedure:
https://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=519&lngWId=5
The most well-known improvements are DoubleMetaphone and adding distance algorithms, e.g. "Levenshtein Distance".
It might be interesting to compare how those features are implemented in other systems like Oracle, SAP, etc.
Comments
Anonymous
April 08, 2007
Don't forget the rich data mining algorithms in Analysis Services, and the Fuzzy Lookup function in Integration Services.Anonymous
July 22, 2008
Aloha, Any new findings on this topic? I am doing my very first SQL Server project and I need this kind of functionality as well. On top of it I need to find a way to rank the results by weighing matches in columns differently. I would rather not have to do that kind of stuff in userland (and/or by de-normalizing). Another approach that might work to some degree are synonym lists, but I guess its hard to keep those dynamic.Anonymous
July 22, 2008
Hi, You might have a look at these posts: http://blogs.conchango.com/jamiethomson/archive/2008/05/21/fuzzyness.aspx. It talks about 2008 sql improvements in that area, but has some references to how you can use SSIS packages for that (I know its far from native querying).