[TEXT-131] JaroWinklerDistance: Calculation deviates from definition
The calculation in JaroWinklerDistance deviates from the definition of the Jaro-Winkler similarity. By definition, the common prefix length is determined only over the first 4 characters. Further, the Jaro-Winkler similarity is defined as JaroSimilarity + ScalingFactor * CommonPrefixLength * (1 - JaroSimilarity).
Therefore, I recommend the following changes:
- Update Jaro-Winkler Similarity calculation
  final double jw = j < 0.7D ? j : j + Math.min(defaultScalingFactor, 1D / mtp[3]) * mtp[2] * (1D - j);
  to
  final double jw = j < 0.7D ? j : j + defaultScalingFactor * mtp[2] * (1D - j);
- Update calculation of Common Prefix Length
  for (int mi = 0; mi < min.length(); mi++) {
  to
  for (int mi = 0; mi < Math.min(4, min.length()); mi++) {
- Remove unnecessary return value
  return new int[] {matches, transpositions, prefix, max.length()};
  to
  return new int[] {matches, transpositions, prefix};
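To illustrate the proposed behavior, here is a minimal, self-contained sketch of the calculation as defined above. It is not the library's code: the class name JaroWinklerSketch, its method names, and the constants SCALING_FACTOR (0.1) and MAX_PREFIX (4) are assumptions made for this example, and it keeps the 0.7 threshold from the existing line. Returning 0 when there are no matching characters (including for two empty strings) is also an assumption.

```java
// Hypothetical sketch, not the actual JaroWinklerDistance implementation.
final class JaroWinklerSketch {
    private static final double SCALING_FACTOR = 0.1D; // assumed default scaling factor
    private static final int MAX_PREFIX = 4;           // prefix is capped at 4 characters

    /** Plain Jaro similarity; returns 0 when no characters match. */
    static double jaro(String a, String b) {
        // Characters match only if they are equal and at most 'window' positions apart.
        int window = Math.max(0, Math.max(a.length(), b.length()) / 2 - 1);
        boolean[] aMatched = new boolean[a.length()];
        boolean[] bMatched = new boolean[b.length()];
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int k = lo; k <= hi; k++) {
                if (!bMatched[k] && a.charAt(i) == b.charAt(k)) {
                    aMatched[i] = true;
                    bMatched[k] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0D;
        }
        // Walk the matched characters of both strings in order and count mismatches.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!aMatched[i]) {
                continue;
            }
            while (!bMatched[k]) {
                k++;
            }
            if (a.charAt(i) != b.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2; // transpositions: half the out-of-order matches
        return (m / a.length() + m / b.length() + (m - t) / m) / 3D;
    }

    /** Common prefix length, looking at no more than the first MAX_PREFIX characters. */
    static int commonPrefixLength(String a, String b) {
        int limit = Math.min(MAX_PREFIX, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return prefix;
    }

    /** Jaro-Winkler: j + scalingFactor * prefixLength * (1 - j), applied only when j >= 0.7. */
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        return j < 0.7D ? j : j + SCALING_FACTOR * commonPrefixLength(a, b) * (1D - j);
    }
}
```

With the prefix capped at 4 and a scaling factor of 0.1, the boost is at most 0.4 * (1 - j), so the result cannot exceed 1. This is why the Math.min(defaultScalingFactor, 1D / mtp[3]) guard, and with it the fourth return value, is no longer needed under these assumptions.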