MSSQL Levenshtein
Creating Stored Procedures and User-Defined Functions with Managed Code
They actually saved my life after we found out that the new Fuzzy Matching engine that our development team had to implement, crashed on us as soon as the tens of thousands daily transactions started bombarding it!
Anyway, if you had a look at the reference blogs you will have a fair idea of the secret I'm about to share...
C# and Dynamic-Link Libraries (DLL)
(Oh no... C# isn't that .net? MS Sql is MS and MS is .net so FACE IT!)
Open a notepad or if you have it an IDE and write the following code:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class StoredFunctions
{
[Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
public static SqlDouble Levenshtein(SqlString S1, SqlString S2)
{
if(S1.IsNull)
S1 = new SqlString("");
if(S2.IsNull)
S2 = new SqlString("");
String SC1 = S1.Value.ToUpper();
String SC2 = S2.Value.ToUpper();
int n = SC1.Length;
int m = SC2.Length;
int[,] d = new int[n + 1, m + 1];
int cost = 0;
if (n + m == 0) {
return 100;
} else if (n == 0) {
return 0;
} else if (m == 0) {
return 0;
}
for (int i = 0; i <= n; i++)
d[i, 0] = i;
for (int j = 0; j <= m; j++)
d[0, j] = j;
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
if (SC1[i - 1] == SC2[j - 1])
cost = 0;
else
cost = 1;
d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
}
}
double percentage = System.Math.Round((1.0 - ((double)d[n, m]/(double)System.Math.Max(n,m))) * 100.0,2);
return percentage;
}
};
Save this file at any location on your hard drive... Call it for instance 'levenshtein.cs'.
Next step of the process is to compile this little class into a dll file.
Windows comes installed with a C# compiler so this should be no problem, just open a dos command window, point it to the location you saved this file and type:
E:\projects\C#\DBObjects>csc.exe /t:library /out:UserFunctions.dll levenshtein.cs
You will get a response that looks something like this:
Microsoft (R) Visual C# 2008 Compiler version 3.5.30729.5420
for Microsoft (R) .NET Framework version 3.5
Copyright (C) Microsoft Corporation. All rights reserved.
E:\projects\C#\DBObjects>
If the dos box comes back with something like:
'csc.exe' is not recognized as an internal or external command, operable program or batch file.
That just means you need to add the path to your c# folder to your windows environment PATH variable. Just search your computer for 'csc.exe' and you should find multiple hits. If a compilation fails, just try one of the others or make sure that the folder containing the csc.exe file also contains the System.Data.dll file. Sooner or later you will find one that works to compile the levenshtein.cs file.
In the same folder as where you created the levenshtein.cs file you will now find a file called 'UserFunctions.dll' (That is if you named it that way in the dos command).
That dll file we can now use to create an assembly in MS Sql.
To do that, open your Microsoft SQL Server Management Server Studio and open the database you want to bind the dll to.
Open the programmability folder, right click on Assemblies and click New Assembly...
Next click the browse button in the window that opens up and select your newly created dll file.
Now you will see that your Assemblies contains an extra package called UserFunctions (if that is what you called the dll file)
In order for SQL to start using the functions contained in your Assembly we will create a wrapper function in native SQL-talk.
I open a new query in my Microsoft SQL Server Manager and type the following and run it...
CREATE Function fn_Levenshtein(@S1 nvarchar(4000), @S2 nvarchar(4000))
RETURNS float as EXTERNAL NAME UserFunctions.StoredFunctions.Levenshtein
GO
UserFunctions.StoredFunctions.Levenshtein or in other words: AssemblyName.ClassName.FunctionName.
Now you can use the fn_levenshtein function to your leisure. Testing
SELECT dbo.fn_Levenshtein(
replicate('abcdeghij',299),
replicate('abcdfghij',299)
)
Should give you a result of 88.89% match.
It is possible that your SQL server is set up to not allow clr functions. This you can fix easily running the query:
sp_configure 'clr enabled', 1
GO
reconfigure
GO
There, now you can have a ball and boost your fuzzy matching queries with Levenshtein logic at a very low cost as you will soon find out!
I hope this may be of help to anyone!