Thursday, April 21, 2011

How it is done...

First of all I would like to thank the people who were kind enough to help the world by posting these two references on the web:

MSSQL Levenshtein
Creating Stored Procedures and User-Defined Functions with Managed Code

They actually saved my life after we found out that the new Fuzzy Matching engine that our development team had to implement, crashed on us as soon as the tens of thousands daily transactions started bombarding it!

Anyway, if you had a look at the reference blogs you will have a fair idea of the secret I'm about to share...

C# and Dynamic-Link Libraries (DLL)
(Oh no... C# isn't that .net? MS Sql is MS and MS is .net so FACE IT!)

Open a notepad or if you have it an IDE and write the following code:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class StoredFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
    public static SqlDouble Levenshtein(SqlString S1, SqlString S2)
    {
        if(S1.IsNull)
          S1 = new SqlString("");

        if(S2.IsNull)
          S2 = new SqlString("");

        String SC1 = S1.Value.ToUpper();
        String SC2 = S2.Value.ToUpper();
        
        int n = SC1.Length;
        int m = SC2.Length;

        int[,] d = new int[n + 1, m + 1];
        int cost = 0;
        
        if (n + m == 0) {
            return 100;
        } else if (n == 0) {
            return 0;
        } else if (m == 0) {
            return 0;
        }

        for (int i = 0; i <= n; i++)
            d[i, 0] = i;

        for (int j = 0; j <= m; j++)
            d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                if (SC1[i - 1] == SC2[j - 1])
                    cost = 0;
                else
                    cost = 1;

                d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
            }
        }
          
        double percentage = System.Math.Round((1.0 - ((double)d[n, m]/(double)System.Math.Max(n,m))) * 100.0,2);
        return percentage;
    }
};

Save this file at any location on your hard drive... Call it for instance 'levenshtein.cs'.

Next step of the process is to compile this little class into a dll file.

Windows comes installed with a C# compiler so this should be no problem, just open a dos command window, point it to the location you saved this file and type:
E:\projects\C#\DBObjects>csc.exe /t:library /out:UserFunctions.dll levenshtein.cs

You will get a response that looks something like this:
Microsoft (R) Visual C# 2008 Compiler version 3.5.30729.5420
for Microsoft (R) .NET Framework version 3.5
Copyright (C) Microsoft Corporation. All rights reserved.


E:\projects\C#\DBObjects>
You will be notified with compilation errors if you made a mistake in your script.

If the dos box comes back with something like:

'csc.exe' is not recognized as an internal or external command, operable program or batch file.

That just means you need to add the path to your c# folder to your windows environment PATH variable. Just search your computer for 'csc.exe' and you should find multiple hits. If a compilation fails, just try one of the others or make sure that the folder containing the csc.exe file also contains the System.Data.dll file. Sooner or later you will find one that works to compile the levenshtein.cs file.

In the same folder as where you created the levenshtein.cs file you will now find a file called 'UserFunctions.dll' (That is if you named it that way in the dos command).

That dll file we can now use to create an assembly in MS Sql.

To do that, open your Microsoft SQL Server Management Server Studio and open the database you want to bind the dll to.

Open the programmability folder, right click on Assemblies and click New Assembly...




















Next click the browse button in the window that opens up and select your newly created dll file.

























Now you will see that your Assemblies contains an extra package called UserFunctions (if that is what you called the dll file)





















In order for SQL to start using the functions contained in your Assembly we will create a wrapper function in native SQL-talk.
I open a new query in my Microsoft SQL Server Manager and type the following and run it...
CREATE Function fn_Levenshtein(@S1 nvarchar(4000), @S2 nvarchar(4000))
    RETURNS float as EXTERNAL NAME UserFunctions.StoredFunctions.Levenshtein
GO

UserFunctions.StoredFunctions.Levenshtein or in other words: AssemblyName.ClassName.FunctionName.

Now you can use the fn_levenshtein function to your leisure. Testing

SELECT dbo.fn_Levenshtein(
 replicate('abcdeghij',299),
 replicate('abcdfghij',299)
)

Should give you a result of 88.89% match.

It is possible that your SQL server is set up to not allow clr functions. This you can fix easily running the query:

sp_configure 'clr enabled', 1
GO
reconfigure
GO

There, now you can have a ball and boost your fuzzy matching queries with Levenshtein logic at a very low cost as you will soon find out!

I hope this may be of help to anyone!

9 comments:

  1. I appreciate this post, Tom, if for no other reason than to remind us all to leverage the power of .NET in our SQL server instances wherever possible (instead of trying to deal with the limitations of SQL and it's often wacky overhead).

    Good work!

    ReplyDelete
  2. Thanks Brent, your response is much appreciated.
    And a fast comment too :) the post isn't even finished yet.
    I have obligations today to take my loving wife out so I'll probably finish up the post tomorrow.

    Cheers!

    ReplyDelete
  3. Hi Tom

    I found this post when i was googling levenshtein, and it helps a lot.

    Thanks.

    xj

    ReplyDelete
  4. Hi can you please give an example of how to use this is SQL or .Net to return text/strings which are a fuzzy match of the input string

    ReplyDelete
    Replies
    1. Hi there,

      Well for example:

      DECLARE @inputString VARCHAR(50)

      SET @inputString = 'My Search'

      SELECT aColumn
      FROM aTable
      WHERE dbo.fn_Levenshtein(aColumn, @inputString) > 80

      Would return all fuzzy matches that resemble for 80 or more percent according to Levenshtein distance.

      Cheers!

      Delete
  5. Excellent post :)

    Need to Enabling CLR Integration before we are able to run the custom function but excellent. Nicely described.

    ReplyDelete
  6. Great stuff. Exactly what I needed when I moved the backend of an MS Access application over SQL Server and broke a couple of my UDFs (Levenshtein was one). Had forgotten all about the ability to use custom assemblies.

    ReplyDelete
  7. Excellent Tom,

    How could you improve it with a "Max Distance Parameter" as third argument?
    It could save a lot of time in some cases.

    ReplyDelete
  8. Any ideas on how to get this to work on SQL Server 2008 R2 Standard Edition? I've tried the above but get the following errors:

    CREATE ASSEMBLY for assembly 'UserFunctions' failed because the assembly is built for an unsupported version of the Common Language Runtime. (.Net SqlClient Data Provider)

    ReplyDelete