Compare Two (Hungarian) Names using Permutations and LEVENSTEIN

Let’s say we want to check for name existence in a database. However the user input might not be sanitized or in the same  order as the one in the database. The name might have spelling mistakes. Also the female name is sometimes adjusted to match the male name after mariage. This might pose an issue in identifying the man or woman.

There is a wide range of selection of a married name. Until about the 18th century noblewomen kept their names at marriage and their children received their father’s name; it became compulsory only under the reign of Joseph II). When Hungary was under Habsburg rule and became influenced by Western European traditions, women became known by their husbands’ names. So for example Szendrey Júlia, marrying Petőfi Sándor, became Petőfi Sándorné (the -né suffix approximately means “wife of”, and this is the Hungarian equivalent of “Mrs.” as in “Mrs. John Smith”). This was both the law and the tradition until the 1950s. During the Communist rule of Hungary, great emphasis was put upon the equality of women and men, and from that time, women could either choose to keep their maiden name or take that of their husband. Most women did the latter except for artists.

Now, the alternatives for a woman when she marries are as shown below (using the examples of Szendrey Júlia and Petőfi Sándor – Júlia and Sándor are their given names):

  • Júlia can keep her maiden name, as Szendrey Júlia (now very popular, especially among more-educated women).

  • Júlia gives up her name, adds the suffix -né to her husband’s full name, and will be called Petőfi Sándorné.

  • Júlia adds the suffix -né to her husband’s family name, adds her full name and will be called Petőfiné Szendrey Júlia.

  • Júlia adds the suffix -né to her husband’s full name, adds her full name and will be called Petőfi Sándorné Szendrey Júlia (less popular these days, because it is long to write).

  • Júlia takes her husband’s family name, keeps her given name “Júlia” and will be called Petőfi Júlia.

You can see thy this might be a problem.

 

This is my solution:

It uses code from here: Getting percentage of character match of 2 strings – MSSQL – SQL Server / Alternative to SOUNDEX

 

CREATE FUNCTION [dbo].[CompareTwoNames]
(
	@string1 NVARCHAR(2000),
	@string2 NVARCHAR(2000),
	@percentMatchThreshold INT  = 80,
	@withWordPermutation BIT = 0
)
RETURNS BIT
AS
BEGIN

	SET @string1 = REPLACE(@string1, 'DR.','')
	SET @string2 = REPLACE(@string2, 'DR.','')
	SET @string1 = LTRIM(RTRIM(@string1))
    SET @string2 = LTRIM(RTRIM(@string2))
    
    
	DECLARE @AreStringsEqual BIT
    
    SET @AreStringsEqual = 0
    
    SET @string1 = REPLACE(@string1,N'Ăś', 'Ö')    
    SET @string1 = REPLACE(@string1,N'Ăź', 'Ü')
    SET @string1 = REPLACE(@string1,N'Ĺą', 'Ü')
    SET @string1 = REPLACE(@string1,N'ĂĄ', 'Á')
    SET @string1 = REPLACE(@string1,N'ĂŠ', 'É')
    SET @string1 = REPLACE(@string1,N'Ăł','Ó')
    SET @string1 = REPLACE(@string1,N'Ă','Á')
    SET @string1 = REPLACE(@string1,N'Ĺ','O')
    SET @string1 = REPLACE(@string1,N'é','É' )
    SET @string1 = REPLACE(@string1, N'ó','Ó'   )    
    SET @string1 = REPLACE(@string1,N'Ăł','Ó')
    SET @string1 = REPLACE(@string1,N'Ăś','O' )
    SET @string1 = REPLACE(@string1, N'É','É'   )    
    SET @string1 = REPLACE(@string1,N'ù','Ü' )
    SET @string1 = REPLACE(@string1, N'è','É'  )
    
    
    
    
    
    SET @string2 = REPLACE(@string2,N'Ăś', 'Ö')    
    SET @string2 = REPLACE(@string2,N'Ăź', 'Ü')
    SET @string2 = REPLACE(@string2,N'Ĺą', 'Ü')
    SET @string2 = REPLACE(@string2,N'ĂĄ', 'Á')
    SET @string2 = REPLACE(@string2,N'ĂŠ', 'É')
    SET @string2 = REPLACE(@string2,N'Ăł','Ó')
     SET @string2 = REPLACE(@string2,N'Ă','Á')
    SET @string2 = REPLACE(@string2,N'Ĺ','O')
    SET @string2 = REPLACE(@string2,N'é','É' )
    SET @string2 = REPLACE(@string2, N'ó','Ó')    
     SET @string2 = REPLACE(@string2,N'Ăł','Ó')
    SET @string2 = REPLACE(@string2,N'Ăś','O' )
    SET @string2 = REPLACE(@string2, N'É','É')
     SET @string2 = REPLACE(@string2,N'ù','Ü' )
    SET @string2 = REPLACE(@string2, N'è','É'  )
    
    
    
    
    
    IF @string1 IS NULL AND @string2 IS NULL
		BEGIN
			SET @AreStringsEqual = 1			
			RETURN @AreStringsEqual
		END
		
	IF LEN( Ltrim(Rtrim(@string1))) = 0 AND LEN(Ltrim(Rtrim(@string2))) = 0
	BEGIN
		SET @AreStringsEqual = 1
		RETURN @AreStringsEqual 
	END
	
	IF  @string1 IS NULL AND LEN(Ltrim(Rtrim(@string2))) = 0
	BEGIN
		SET @AreStringsEqual = 1
		RETURN @AreStringsEqual 
	END
	
	IF LEN( Ltrim(Rtrim(@string1))) = 0 AND @string2 IS NULL
	BEGIN
		SET @AreStringsEqual = 1
		RETURN @AreStringsEqual 
	END
    
    --SET @string1= @string1 COLLATE SQL_Latin1_General_CP1_CI_AS
    -- SET @string2= @string2 COLLATE SQL_Latin1_General_CP1_CI_AS
     
     SET @string1 = dbo.RemoveAccents(@string1)
     SET @string2 = dbo.RemoveAccents(@string2)
     
     SET @string1 = UPPER(@string1)
     SET @string2 = UPPER(@string2)
    
    IF ( Ltrim(Rtrim(@string1 COLLATE Latin1_General_CI_AI)) = Ltrim(Rtrim(@string2 COLLATE Latin1_General_CI_AI)) ) 
      BEGIN
          SET @AreStringsEqual = 1
          GOTO JustBeforeReturn--RETURN @AreStringsEqual 
      END
      
    IF SOUNDEX(@string1) = SOUNDEX(@string2)
      BEGIN
          SET @AreStringsEqual = 1
          GOTO JustBeforeReturn--RETURN @AreStringsEqual 
      END
      
    DECLARE @percentageMatch INT
    SELECT @percentageMatch = [AstraHu].[dbo].[GetPercentageOfTwoStringMatching] (  @string1, @string2)   
    
    IF @percentageMatch >= @percentMatchThreshold
	  BEGIN
          SET @AreStringsEqual = 1
          GOTO JustBeforeReturn--RETURN @AreStringsEqual 
      END
      
    IF(@string1 LIKE '%' + @string2 +'%') OR (@string2 LIKE '%' + @string1 +'%')
	  BEGIN
          SET @AreStringsEqual = 1
          GOTO JustBeforeReturn--RETURN @AreStringsEqual 
      END
      

    IF @withWordPermutation = 1
		BEGIN
			DECLARE @namesWithWordPermutationTable TABLE
			(
				id INT NOT NULL IDENTITY(1,1),
				name NVARCHAR(500)				
			)
			INSERT INTO @namesWithWordPermutationTable
			SELECT name FROM [dbo].[GenerateNamePermutation](@string1)
			DECLARE @Index INT = 1, @MaxIndex INT
			SELECT @MaxIndex = COUNT(1) FROM @namesWithWordPermutationTable
			WHILE @Index <= @MaxIndex
				BEGIN
					SET @AreStringsEqual = [dbo].[CompareTwoNames]((SELECT TOP(1) name FROM @namesWithWordPermutationTable WHERE id = @Index) ,	@string2 ,	@percentMatchThreshold,	 0)
					IF @AreStringsEqual = 1
						BREAK
					SET @Index= @Index + 1
				END
		END
     
    JustBeforeReturn:
    DECLARE @string1ContainsFemaleName BIT = 0, @string2ContainsFemaleName BIT = 0
    IF (@string1 LIKE '%NÉ %') OR (@string1 LIKE '%NÉ') OR (@string1 LIKE '%NE %') OR (@string1 LIKE '%NE')
		SET @string1ContainsFemaleName = 1
	IF (@string2 LIKE '%NÉ %') OR (@string2 LIKE '%NÉ') OR (@string2 LIKE '%NE %') OR (@string2 LIKE '%NE')
		SET @string2ContainsFemaleName = 1
		
	IF(@string1ContainsFemaleName <> @string2ContainsFemaleName)
		SET @AreStringsEqual = 0
		
	IF @string1ContainsFemaleName = @string2ContainsFemaleName AND @string2ContainsFemaleName = 1
		BEGIN	
			DECLARE @string1WithoutFemaleForm VARCHAR(2000), @string2WithoutFemaleForm VARCHAR(2000)
			SET @string1WithoutFemaleForm =		REPLACE(
													REPLACE(
														REPLACE(@string1,'NÉ ',' ')
														,'NE '
														,' ')
														,'NE'
														,'');
			SET @string2WithoutFemaleForm =		REPLACE(
													REPLACE(
														REPLACE(@string2,'NÉ ',' ')
														,'NE '
														,' ')
														,'NE'
														,'');
			SET @AreStringsEqual = [dbo].[CompareTwoNames](@string1WithoutFemaleForm,	@string2WithoutFemaleForm ,	@percentMatchThreshold,	 @withWordPermutation)
		END
		
    RETURN @AreStringsEqual 

END

CREATE FUNCTION [dbo].[GenerateNamePermutation] 
(
	@words NVARCHAR(500)
)
RETURNS 
@returnTable TABLE 
(
	name NVARCHAR(500)
)
AS
BEGIN
	-- Fill the table variable with the rows for your result set
	DECLARE @separator CHAR(1)
    DECLARE @NameParts TABLE (
      PartId INT IDENTITY,
      part   NVARCHAR(50))
   -- DECLARE @words NVARCHAR(50)
    
    SET @separator = ' '
    
    ;WITH Parts(pn, start, finish)
         AS (SELECT 1,
                    1,
                    CHARINDEX(@separator, @words)
             UNION ALL
             SELECT pn + 1,
                    finish + 1,
                    CHARINDEX(@separator, @words, finish + 1)
             FROM   Parts
             WHERE  finish > 0)
    INSERT INTO @NameParts
                (part)
    SELECT TOP(5) SUBSTRING(@words, start, CASE
                                     WHEN finish > 0 THEN finish - start
                                     ELSE 50
                                   END)
    FROM   Parts
    
    DECLARE @tokencount INT
    
    SELECT @tokencount = COUNT(*)
    FROM   @NameParts;
    
    WITH Subsets
         AS (SELECT CAST(' ' + part AS NVARCHAR(MAX)) Permutation,
                    CAST(1 AS INT)                   AS Iteration
             FROM   @NameParts
             UNION ALL
             SELECT Permutation + ' ' + part AS Permutation,
                    Iteration + 1            AS Iteration
             FROM   Subsets s
                    JOIN @NameParts n
                      ON s.Permutation NOT LIKE '%' + n.part + '%')
    INSERT INTO @returnTable
    SELECT STUFF(Permutation, 1, 1, '') AS Perm
    FROM   SUBSETS
    WHERE  Iteration = @tokencount 
    
	RETURN 
END

Getting percentage of character match of 2 strings – MSSQL – SQL Server / Alternative to SOUNDEX

Let’s say I have a set of 2 words:

Alexander and Alecsander OR Alexander and Alegzander

Alexander and Aleaxnder, or any other combination. In general we are talking about human error in typing of a word or a set of words.

What I want to achieve is to get the percentage of matching of the characters of the 2 strings.

Ok, here is my solution so far:

SELECT  [dbo].[GetPercentageOfTwoStringMatching]('valentin123456'  ,'valnetin123456')

returns 86%

CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching]
(
    @string1 NVARCHAR(100)
    ,@string2 NVARCHAR(100)
)
RETURNS INT
AS
BEGIN

    DECLARE @levenShteinNumber INT

    DECLARE @string1Length INT = LEN(@string1)
    , @string2Length INT = LEN(@string2)
    DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END

    SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] (   @string1  ,@string2)

    DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber

    DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters

    -- Return the result of the function
    RETURN @percentageOfGoodCharacters

END




-- =============================================     
-- Create date: 2011.12.14
-- Description: http://blog.sendreallybigfiles.com/2009/06/improved-t-sql-levenshtein-distance.html
-- =============================================

CREATE FUNCTION [dbo].[LEVENSHTEIN](@left  VARCHAR(100),
                                    @right VARCHAR(100))
returns INT
AS
  BEGIN
      DECLARE @difference    INT,
              @lenRight      INT,
              @lenLeft       INT,
              @leftIndex     INT,
              @rightIndex    INT,
              @left_char     CHAR(1),
              @right_char    CHAR(1),
              @compareLength INT

      SET @lenLeft = LEN(@left)
      SET @lenRight = LEN(@right)
      SET @difference = 0

      IF @lenLeft = 0
        BEGIN
            SET @difference = @lenRight

            GOTO done
        END

      IF @lenRight = 0
        BEGIN
            SET @difference = @lenLeft

            GOTO done
        END

      GOTO comparison

      COMPARISON:

      IF ( @lenLeft >= @lenRight )
        SET @compareLength = @lenLeft
      ELSE
        SET @compareLength = @lenRight

      SET @rightIndex = 1
      SET @leftIndex = 1

      WHILE @leftIndex <= @compareLength
        BEGIN
            SET @left_char = substring(@left, @leftIndex, 1)
            SET @right_char = substring(@right, @rightIndex, 1)

            IF @left_char <> @right_char
              BEGIN -- Would an insertion make them re-align?
                  IF( @left_char = substring(@right, @rightIndex + 1, 1) )
                    SET @rightIndex = @rightIndex + 1
                  -- Would an deletion make them re-align?
                  ELSE IF( substring(@left, @leftIndex + 1, 1) = @right_char )
                    SET @leftIndex = @leftIndex + 1

                  SET @difference = @difference + 1
              END

            SET @leftIndex = @leftIndex + 1
            SET @rightIndex = @rightIndex + 1
        END

      GOTO done

      DONE:

      RETURN @difference
  END 

 

The solution uses the Levenstein algorithm:

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.