[TEXT-131] JaroWinklerDistance: Calculation deviates from definition

The calculation in JaroWinklerDistance deviates from the definition of the Jaro-Winkler similarity. By definition, the common prefix length is only determined over the first 4 characters. Further, the Jaro-Winkler similarity is defined as JaroSimilarity + ScalingFactor * CommonPrefixLength * (1 - JaroSimilarity).
Therefore, I recommend the following changes:

  1. Update Jaro-Winkler Similarity calculation
    final double jw = j < 0.7D ? j : j + Math.min(defaultScalingFactor, 1D / mtp[3]) * mtp[2] * (1D - j);
    to
    final double jw = j < 0.7D ? j : j + defaultScalingFactor * mtp[2] * (1D - j);
  2. Update calculation of Common Prefix Length
    for (int mi = 0; mi < min.length(); mi++) {
    to
    for (int mi = 0; mi < Math.min(4, min.length()); mi++) {
  3. Remove unnecessary return value
    return new int[] {matches, transpositions, prefix, max.length()};
    to
    return new int[] {matches, transpositions, prefix};
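
For reference, here is a minimal standalone sketch of the calculation as defined above. It is not the Commons Text implementation. The class name JaroWinklerSketch, the constant names, the scaling factor of 0.1 and the handling of empty input are assumptions made for illustration. The 0.7 threshold is taken from the existing code.

```java
public final class JaroWinklerSketch {

    // Assumed values for this sketch; 0.7 mirrors the threshold in the existing code.
    private static final double SCALING_FACTOR = 0.1;
    private static final int MAX_PREFIX = 4;
    private static final double BOOST_THRESHOLD = 0.7;

    /** Plain Jaro similarity in [0, 1]. */
    static double jaro(CharSequence a, CharSequence b) {
        int la = a.length();
        int lb = b.length();
        if (la == 0 || lb == 0) {
            // Convention chosen for this sketch: two empty inputs count as identical.
            return la == lb ? 1.0 : 0.0;
        }
        // Characters match only if they are equal and at most 'range' positions apart.
        int range = Math.max(0, Math.max(la, lb) / 2 - 1);
        boolean[] matchedA = new boolean[la];
        boolean[] matchedB = new boolean[lb];
        int matches = 0;
        for (int i = 0; i < la; i++) {
            int lo = Math.max(0, i - range);
            int hi = Math.min(lb - 1, i + range);
            for (int j = lo; j <= hi; j++) {
                if (!matchedB[j] && a.charAt(i) == b.charAt(j)) {
                    matchedA[i] = true;
                    matchedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order (half transpositions).
        int halfTranspositions = 0;
        int k = 0;
        for (int i = 0; i < la; i++) {
            if (!matchedA[i]) {
                continue;
            }
            while (!matchedB[k]) {
                k++;
            }
            if (a.charAt(i) != b.charAt(k)) {
                halfTranspositions++;
            }
            k++;
        }
        double m = matches;
        double t = halfTranspositions / 2;
        return (m / la + m / lb + (m - t) / m) / 3.0;
    }

    /** Jaro-Winkler: j + p * l * (1 - j), with the prefix length l capped at 4. */
    static double jaroWinkler(CharSequence a, CharSequence b) {
        double j = jaro(a, b);
        if (j < BOOST_THRESHOLD) {
            return j;
        }
        int limit = Math.min(MAX_PREFIX, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return j + SCALING_FACTOR * prefix * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA"));
        System.out.println(jaroWinkler("ABCDEFGH", "ABCDEFGX"));
    }
}
```

With the prefix capped at 4 and a scaling factor of 0.1, the boost term is at most 0.4 * (1 - j), so the result stays within [0, 1] without the additional Math.min guard on the scaling factor. For "ABCDEFGH" and "ABCDEFGX", the capped prefix gives 0.95, whereas counting the full 7-character prefix would give 0.975.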