2
\$\begingroup\$

Logic based of this site which had a VB example I used as a base.

function Get-LevenshteinDistance{ [cmdletbinding()] param( [string]$CompareString, [string]$DifferenceString ) # Collect the string lengths $compareStringLength = $CompareString.Length $differenceStringLength = $DifferenceString.Length Write-Verbose "Comparision String: '$CompareString' with length '$compareStringLength'" Write-Verbose "Difference String: '$DifferenceString' with length '$differenceStringLength'" # If either of the strings are empty the LD is known and we can stop if($compareStringLength -eq 0){return $differenceStringLength} if($differenceStringLength -eq 0){return $compareStringLength} # Build the comparison matrix and populate the first column and first row. $comparisonMatrix = New-Object 'object[,]' ($differenceStringLength+1),($compareStringLength+1) # Populate the first row for($index=0; $index -le $compareStringLength; $index++){ $comparisonMatrix[0,$index]=$index } # Populate the first column for($index=0; $index -le $differenceStringLength; $index++){ $comparisonMatrix[$index,0]=$index } # Calculate the Levenshtein distance by working through each position in the matrix. # Working the columns for($columnIndex=1; $columnIndex -le $compareStringLength; $columnIndex++){ # Cycle each character in the list $compareCharacter = $compareString[$columnIndex-1] # Working the rows for($rowIndex=1; $rowIndex -le $differenceStringLength; $rowIndex++){ # Cycle each character in the list $differenceCharacter = $differenceString[$rowIndex-1] Write-Verbose "Matrix location: [$rowIndex, $columnIndex]" Write-Verbose "Compare character: $compareCharacter - Difference character: $differenceCharacter" # Calculate the cost $cost=if($compareCharacter -eq $differenceCharacter){0}else{1} Write-Verbose "Cost: $cost" # The cell immediately above plus 1 $cellAbove = $comparisonMatrix[($rowIndex-1), $columnIndex] + 1 Write-Verbose "Cell Above: [$($rowIndex-1), $columnIndex] + 1 = $cellAbove" # The cell immediately to the left plus 1 $cellLeft = $comparisonMatrix[$rowIndex,($columnIndex-1)] + 1 Write-Verbose "Cell Left: [$rowIndex,$($columnIndex-1)] + 1 = $cellLeft" # The cell diagonally above and to the left plus the cost $cellUpperLeft = $comparisonMatrix[($rowIndex-1),($columnIndex-1)] + $cost Write-Verbose "Cell Upper Left: [$($rowIndex-1),$($columnIndex-1)] + cost($cost) = $cellUpperLeft" # Select minumum of the of the last 3 cells calculations and assign it to the current matrix position. $comparisonMatrix[$rowIndex,$columnIndex] = $cellAbove,$cellLeft,$cellUpperLeft | Measure-Object -Minimum | select -ExpandProperty Minimum } } # The cacluated LD will now be in the bottom right of the matrix. return $comparisonMatrix[$differenceStringLength,$compareStringLength] } 

Was working on a PPCG question where I had to roll my own LD calculator. Since I saw future value in this I made a more professional looking version. Since it is iterating over every character in a string for every other character in the other string it really starts to show its performance issues when they start to get over 50 chars in size.

The model it uses supports deletions, insertions and substitutions. If I wanted to change the weight of substitutions I suppose I would need to edit this line and support a parameter for it?

$cost=if($compareCharacter -eq $differenceCharacter){0}else{1} 

Never really played with multidimensional arrays before and I am not sure how I could improve on this. I know there are a few things I could do to make the code more brief and still functional however I opted for this solution as I liked the improved readability.

\$\endgroup\$
2
  • \$\begingroup\$ itterating should probably be iterating \$\endgroup\$ Commented May 30, 2017 at 12:08
  • \$\begingroup\$ @yuri probably :) \$\endgroup\$ Commented May 30, 2017 at 12:08

1 Answer 1

2
\$\begingroup\$

Here is my development series (in parentheses: approx fast ratio compared with the previous, tested using strings cca 100 characters in size, see the comparison table below):

  • Get-LevenshteinDistanceOrig  (= , for comparison): the original function body (Copy&Paste);
  • Get-LevenshteinDistanceOrig+ (> , still insufficient): removed all time-expensive Write-Verbose output;
  • Get-LevenshteinDistanceTry1  (> 50×, a substantial improvement): time-expensive pipeline to Measure-Object replaced by [math]::Min(n,m) static method (see .NET system.math class). The following are only minor advancement steps:
  • Get-LevenshteinDistanceTry2  (~ ..): slight algorithm renovation:
    • instead calculating the cost ($cost variable): if compared characters are equal then computing and comparing values $cellAbove, $cellLeft and $cellUpperLeft is useless as we already know the necessary value;
    • added: if strings are equal the LD is known (zero) and we can stop;
  • Get-LevenshteinDistance      (~ ): PoSH array implementation: instead of two-dimensional comparison matrix used a rectangular jagged array.

Comparison table: .\CR\164518test.ps1 | Format-Table -AutoSize

cmdlet similarity runtime (ms) lengths LD ------ ---------- ------------ ------- -- Get-LevenshteinDistanceOrig equal 6829.1053 106 106 0 Get-LevenshteinDistanceOrig+ equal 2010.4538 106 106 0 Get-LevenshteinDistanceTry1 equal 35.8835 106 106 0 Get-LevenshteinDistanceTry2 equal .1795 106 106 0 Get-LevenshteinDistance equal .1539 106 106 0 Get-LevenshteinDistanceOrig stochastic 6556.4117 106 102 79 Get-LevenshteinDistanceOrig+ stochastic 1932.6051 106 102 79 Get-LevenshteinDistanceTry1 stochastic 33.9023 106 102 79 Get-LevenshteinDistanceTry2 stochastic 28.3165 106 102 79 Get-LevenshteinDistance stochastic 13.4852 106 102 79 Get-LevenshteinDistanceOrig similar 6640.5884 106 102 4 Get-LevenshteinDistanceOrig+ similar 2023.8179 106 102 4 Get-LevenshteinDistanceTry1 similar 31.5843 106 102 4 Get-LevenshteinDistanceTry2 similar 14.9307 106 102 4 Get-LevenshteinDistance similar 8.4234 106 102 4 Get-LevenshteinDistanceOrig different 6613.5267 106 102 106 Get-LevenshteinDistanceOrig+ different 1943.7630 106 102 106 Get-LevenshteinDistanceTry1 different 35.1824 106 102 106 Get-LevenshteinDistanceTry2 different 27.1371 106 102 106 Get-LevenshteinDistance different 13.8924 106 102 106 

Column explanation:

  • cmdlet       - function name
  • similarity   - in brief "like" similarity of input strings
  • runtime (ms) - TotalMilliseconds
  • lengths      - lengths of input strings, space delimited
  • LD           - the Levenshtein Distance of input strings

Comparison script 164518test.ps1:

Function TestLD { param ([string]$Similarity = '') $aux = 0 Write-Progress -Activity "LD info ($Similarity strings)" ` -CurrentOperation 'Start' -PercentComplete $aux $cmdletTails = 'Orig','Orig+','Try1','Try2','' foreach ( $cmdletTail in $cmdletTails ) { $cmdlet = "Get-LevenshteinDistance$cmdletTail" $scriptBlock = { $LevenshteinDistance = & $cmdlet ` -CompareString $strC -DifferenceString $strD } $TimeSpan = Measure-Command -Expression $scriptBlock $aux += [int](100 / $cmdletTails.Count) Write-Progress -Activity "LD info ($Similarity strings)" ` -CurrentOperation $cmdlet -PercentComplete $aux [PSCustomObject]@{ 'cmdlet' = $cmdlet 'similarity' = $Similarity 'runtime (ms)' = $($TimeSpan.TotalMilliseconds. ToString('#####.###0',$cultureinfo). PadLeft(10)) 'lengths' = "$($strC.Length) $($strD.Length)" 'LD' = $LevenshteinDistance } } } . D:\PShell\CR\164518.ps1 # reload the functions $cultureinfo = [cultureinfo]::InvariantCulture # my one is different $strC='Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum placerat leo ut turpis viverra lacinia' $strD=$strC TestLD -Similarity 'equal' $strD='Cras efficitur nec orci et posuere. Suspendisse potenti. Quisque blandit auctor purus id facilisis est' TestLD -Similarity 'stochastic' $strC='a' * $strC.Length $strD='a' * $strD.Length TestLD -Similarity 'similar' $strD='x' * $strD.Length TestLD -Similarity 'different' 

The main script 164518.ps1 (functions are defined here in the reverse order):

function Get-LevenshteinDistance{ param( [string]$CompareString, [string]$DifferenceString ) # Collect the string lengths $compareStringLength = $CompareString.Length $differenceStringLength = $DifferenceString.Length # If either of the strings are empty the LD is known and we can stop if($compareStringLength -eq 0){return $differenceStringLength} if($differenceStringLength -eq 0){return $compareStringLength} # If strings are equal the LD is known and we can stop if ($CompareString -ceq $DifferenceString) { return 0 } # Build the comparison matrix as a (rectangular) jagged array # $comparisonMatrix = New-Object 'object[]' ($differenceStringLength+1) $comparisonMatrix = [System.Array]::CreateInstance([System.Object], ($differenceStringLength+1)) for($index=0; $index -le $differenceStringLength; $index++){ # Create row # $comparisonMatrix[$index]=New-Object 'object[]' ($compareStringLength+1) $comparisonMatrix[$index]=[System.Array]::CreateInstance([System.Object], ($compareStringLength+1)) # Populate the first column $comparisonMatrix[$index][0]=$index } # Populate the first row for($index=0; $index -le $compareStringLength; $index++){ $comparisonMatrix[0][$index]=$index } # Calculate the Levenshtein distance by working through each position in the matrix. # Working the columns for($columnIndex=1; $columnIndex -le $compareStringLength; $columnIndex++){ # Cycle each character in the list $compareCharacter = $compareString[$columnIndex-1] # multiple use => to a variable # Working the rows for($rowIndex=1; $rowIndex -le $differenceStringLength; $rowIndex++){ # Cycle each character in the list (a variable for the only use) # $differenceCharacter = $differenceString[$rowIndex-1] # the only use # Calculate the cost if($compareCharacter -ceq $differenceString[$rowIndex-1]){ $comparisonMatrix[$rowIndex][$columnIndex] = $comparisonMatrix[($rowIndex -1)][($columnIndex -1)] } else { # $cost = 1 # unneeded variable for the only use # The cell immediately above plus 1 $cellAbove = $comparisonMatrix[($rowIndex-1)][$columnIndex] + 1 # The cell immediately to the left plus 1 $cellLeft = $comparisonMatrix[$rowIndex][($columnIndex-1)] + 1 # # The cell diagonally above and to the left plus the cost ↓ $cellUpperLeft = $comparisonMatrix[($rowIndex-1)][($columnIndex-1)] + 1 # Select minumum of the of the last 3 cells calculations and assign it to the current matrix position. $comparisonMatrix[$rowIndex][$columnIndex] = [math]::Min([math]::Min($cellAbove,$cellLeft),$cellUpperLeft) } } } # The cacluated LD will now be in the bottom right of the matrix. return $comparisonMatrix[$differenceStringLength][$compareStringLength] } function Get-LevenshteinDistanceTry2{ param( [string]$CompareString, [string]$DifferenceString ) # Collect the string lengths $compareStringLength = $CompareString.Length $differenceStringLength = $DifferenceString.Length # If either of the strings are empty the LD is known and we can stop if($compareStringLength -eq 0){return $differenceStringLength} if($differenceStringLength -eq 0){return $compareStringLength} # If strings are equal the LD is known and we can stop if ($CompareString -ceq $DifferenceString) { return 0 } # Build the comparison matrix and populate the first column and first row. $comparisonMatrix = New-Object 'object[,]' ($differenceStringLength+1),($compareStringLength+1) # $comparisonMatrix = [System.Array]::CreateInstance([System.Object], ($differenceStringLength+1),($compareStringLength+1)) # Populate the first row for($index=0; $index -le $compareStringLength; $index++){ $comparisonMatrix[0,$index]=$index } # Populate the first column for($index=0; $index -le $differenceStringLength; $index++){ $comparisonMatrix[$index,0]=$index } # Calculate the Levenshtein distance by working through each position in the matrix. # Working the columns for($columnIndex=1; $columnIndex -le $compareStringLength; $columnIndex++){ # Cycle each character in the list $compareCharacter = $compareString[$columnIndex-1] # Working the rows for($rowIndex=1; $rowIndex -le $differenceStringLength; $rowIndex++){ # Cycle each character in the list $differenceCharacter = $differenceString[$rowIndex-1] # Calculate the cost if($compareCharacter -ceq $differenceCharacter){ $comparisonMatrix[$rowIndex,$columnIndex] = $comparisonMatrix[($rowIndex -1),($columnIndex -1)] } else { $cost = 1 # a variable for the only use? Useless! # The cell immediately above plus 1 $cellAbove = $comparisonMatrix[($rowIndex-1), $columnIndex] + 1 # The cell immediately to the left plus 1 $cellLeft = $comparisonMatrix[$rowIndex,($columnIndex-1)] + 1 # The cell diagonally above and to the left plus the cost $cellUpperLeft = $comparisonMatrix[($rowIndex-1),($columnIndex-1)] + $cost # Select minumum of the of the last 3 cells calculations and assign it to the current matrix position. # $comparisonMatrix[$rowIndex,$columnIndex] = $cellAbove,$cellLeft,$cellUpperLeft | Measure-Object -Minimum | select -ExpandProperty Minimum # $comparisonMatrix[$rowIndex,$columnIndex] = ($cellAbove,$cellLeft,$cellUpperLeft | Measure-Object -Minimum).Minimum $comparisonMatrix[$rowIndex,$columnIndex] = [math]::Min([math]::Min($cellAbove,$cellLeft),$cellUpperLeft) } } } # The cacluated LD will now be in the bottom right of the matrix. return $comparisonMatrix[$differenceStringLength,$compareStringLength] } function Get-LevenshteinDistanceTry1{ #[cmdletbinding()] param( [string]$CompareString, [string]$DifferenceString ) # Collect the string lengths $compareStringLength = $CompareString.Length $differenceStringLength = $DifferenceString.Length ##Write-Verbose "Comparision String: '$CompareString' with length '$compareStringLength'" ##Write-Verbose "Difference String: '$DifferenceString' with length '$differenceStringLength'" # If either of the strings are empty the LD is known and we can stop if($compareStringLength -eq 0){return $differenceStringLength} if($differenceStringLength -eq 0){return $compareStringLength} # Build the comparison matrix and populate the first column and first row. $comparisonMatrix = New-Object 'object[,]' ($differenceStringLength+1),($compareStringLength+1) # Populate the first row for($index=0; $index -le $compareStringLength; $index++){ $comparisonMatrix[0,$index]=$index } # Populate the first column for($index=0; $index -le $differenceStringLength; $index++){ $comparisonMatrix[$index,0]=$index } # Calculate the Levenshtein distance by working through each position in the matrix. # Working the columns for($columnIndex=1; $columnIndex -le $compareStringLength; $columnIndex++){ # Cycle each character in the list $compareCharacter = $compareString[$columnIndex-1] # Working the rows for($rowIndex=1; $rowIndex -le $differenceStringLength; $rowIndex++){ # Cycle each character in the list $differenceCharacter = $differenceString[$rowIndex-1] ##Write-Verbose "Matrix location: [$rowIndex, $columnIndex]" ##Write-Verbose "Compare character: $compareCharacter - Difference character: $differenceCharacter" # Calculate the cost $cost=if($compareCharacter -ceq $differenceCharacter){0}else{1} ##Write-Verbose "Cost: $cost" # The cell immediately above plus 1 $cellAbove = $comparisonMatrix[($rowIndex-1), $columnIndex] + 1 ##Write-Verbose "Cell Above: [$($rowIndex-1), $columnIndex] + 1 = $cellAbove" # The cell immediately to the left plus 1 $cellLeft = $comparisonMatrix[$rowIndex,($columnIndex-1)] + 1 ##Write-Verbose "Cell Left: [$rowIndex,$($columnIndex-1)] + 1 = $cellLeft" # The cell diagonally above and to the left plus the cost $cellUpperLeft = $comparisonMatrix[($rowIndex-1),($columnIndex-1)] + $cost ##Write-Verbose "Cell Upper Left: [$($rowIndex-1),$($columnIndex-1)] + cost($cost) = $cellUpperLeft" # Select minumum of the of the last 3 cells calculations and assign it to the current matrix position. # $comparisonMatrix[$rowIndex,$columnIndex] = $cellAbove,$cellLeft,$cellUpperLeft | Measure-Object -Minimum | select -ExpandProperty Minimum $comparisonMatrix[$rowIndex,$columnIndex] = [math]::Min([math]::Min($cellAbove,$cellLeft),$cellUpperLeft) } } # The cacluated LD will now be in the bottom right of the matrix. return $comparisonMatrix[$differenceStringLength,$compareStringLength] } function Get-LevenshteinDistanceOrig+{ [cmdletbinding()] param( [string]$CompareString, [string]$DifferenceString ) # Collect the string lengths $compareStringLength = $CompareString.Length $differenceStringLength = $DifferenceString.Length ##Write-Verbose "Comparision String: '$CompareString' with length '$compareStringLength'" ##Write-Verbose "Difference String: '$DifferenceString' with length '$differenceStringLength'" # If either of the strings are empty the LD is known and we can stop if($compareStringLength -eq 0){return $differenceStringLength} if($differenceStringLength -eq 0){return $compareStringLength} # Build the comparison matrix and populate the first column and first row. $comparisonMatrix = New-Object 'object[,]' ($differenceStringLength+1),($compareStringLength+1) # Populate the first row for($index=0; $index -le $compareStringLength; $index++){ $comparisonMatrix[0,$index]=$index } # Populate the first column for($index=0; $index -le $differenceStringLength; $index++){ $comparisonMatrix[$index,0]=$index } # Calculate the Levenshtein distance by working through each position in the matrix. # Working the columns for($columnIndex=1; $columnIndex -le $compareStringLength; $columnIndex++){ # Cycle each character in the list $compareCharacter = $compareString[$columnIndex-1] # Working the rows for($rowIndex=1; $rowIndex -le $differenceStringLength; $rowIndex++){ # Cycle each character in the list $differenceCharacter = $differenceString[$rowIndex-1] ##Write-Verbose "Matrix location: [$rowIndex, $columnIndex]" ##Write-Verbose "Compare character: $compareCharacter - Difference character: $differenceCharacter" # Calculate the cost $cost=if($compareCharacter -ceq $differenceCharacter){0}else{1} ##Write-Verbose "Cost: $cost" # The cell immediately above plus 1 $cellAbove = $comparisonMatrix[($rowIndex-1), $columnIndex] + 1 ##Write-Verbose "Cell Above: [$($rowIndex-1), $columnIndex] + 1 = $cellAbove" # The cell immediately to the left plus 1 $cellLeft = $comparisonMatrix[$rowIndex,($columnIndex-1)] + 1 ##Write-Verbose "Cell Left: [$rowIndex,$($columnIndex-1)] + 1 = $cellLeft" # The cell diagonally above and to the left plus the cost $cellUpperLeft = $comparisonMatrix[($rowIndex-1),($columnIndex-1)] + $cost ##Write-Verbose "Cell Upper Left: [$($rowIndex-1),$($columnIndex-1)] + cost($cost) = $cellUpperLeft" # Select minumum of the of the last 3 cells calculations and assign it to the current matrix position. $comparisonMatrix[$rowIndex,$columnIndex] = $cellAbove,$cellLeft,$cellUpperLeft | Measure-Object -Minimum | select -ExpandProperty Minimum } } # The cacluated LD will now be in the bottom right of the matrix. return $comparisonMatrix[$differenceStringLength,$compareStringLength] } function Get-LevenshteinDistanceOrig{ [cmdletbinding()] param( [string]$CompareString, [string]$DifferenceString ) # Collect the string lengths $compareStringLength = $CompareString.Length $differenceStringLength = $DifferenceString.Length Write-Verbose "Comparision String: '$CompareString' with length '$compareStringLength'" Write-Verbose "Difference String: '$DifferenceString' with length '$differenceStringLength'" # If either of the strings are empty the LD is known and we can stop if($compareStringLength -eq 0){return $differenceStringLength} if($differenceStringLength -eq 0){return $compareStringLength} # Build the comparison matrix and populate the first column and first row. $comparisonMatrix = New-Object 'object[,]' ($differenceStringLength+1),($compareStringLength+1) # Populate the first row for($index=0; $index -le $compareStringLength; $index++){ $comparisonMatrix[0,$index]=$index } # Populate the first column for($index=0; $index -le $differenceStringLength; $index++){ $comparisonMatrix[$index,0]=$index } # Calculate the Levenshtein distance by working through each position in the matrix. # Working the columns for($columnIndex=1; $columnIndex -le $compareStringLength; $columnIndex++){ # Cycle each character in the list $compareCharacter = $compareString[$columnIndex-1] # Working the rows for($rowIndex=1; $rowIndex -le $differenceStringLength; $rowIndex++){ # Cycle each character in the list $differenceCharacter = $differenceString[$rowIndex-1] Write-Verbose "Matrix location: [$rowIndex, $columnIndex]" Write-Verbose "Compare character: $compareCharacter - Difference character: $differenceCharacter" # Calculate the cost $cost=if($compareCharacter -ceq $differenceCharacter){0}else{1} Write-Verbose "Cost: $cost" # The cell immediately above plus 1 $cellAbove = $comparisonMatrix[($rowIndex-1), $columnIndex] + 1 Write-Verbose "Cell Above: [$($rowIndex-1), $columnIndex] + 1 = $cellAbove" # The cell immediately to the left plus 1 $cellLeft = $comparisonMatrix[$rowIndex,($columnIndex-1)] + 1 Write-Verbose "Cell Left: [$rowIndex,$($columnIndex-1)] + 1 = $cellLeft" # The cell diagonally above and to the left plus the cost $cellUpperLeft = $comparisonMatrix[($rowIndex-1),($columnIndex-1)] + $cost Write-Verbose "Cell Upper Left: [$($rowIndex-1),$($columnIndex-1)] + cost($cost) = $cellUpperLeft" # Select minumum of the of the last 3 cells calculations and assign it to the current matrix position. $comparisonMatrix[$rowIndex,$columnIndex] = $cellAbove,$cellLeft,$cellUpperLeft | Measure-Object -Minimum | select -ExpandProperty Minimum } } # The cacluated LD will now be in the bottom right of the matrix. return $comparisonMatrix[$differenceStringLength,$compareStringLength] } 

Please note case sensitivity of the above functions. For case insensitive Levenshtein Distance:

  • permanently: change -ceq to -eq in their two occurrences, or
  • ad hoc, in a particular case: use .ToUpper() or .ToLower() functions, e.g. as
Get-LevenshteinDistance -CompareString $strC.ToUpper() -DifferenceString $strD.ToUpper() # or Get-LevenshteinDistance -CompareString $strC.ToLower() -DifferenceString $strD.ToLower() 
\$\endgroup\$
0

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.