[TEXT-158] Incorrect values for Jaccard similarity with empty strings

In the discussion on TEXT-126, it was pointed out that for two empty strings our Jaccard similarity returns 0.0 and the distance returns 1.0, while other libraries return the opposite for each.

package br.eti.kinoshita.tests.text;

import java.util.Collections;

public class EditDistances {

    public static void main(String[] args) {
        System.out.println("Testing jaccard sim/dis with empty strings");
        System.out.println("---");
        org.simmetrics.metrics.Jaccard<String> j1 = new org.simmetrics.metrics.Jaccard<>();
        float s1 = j1.compare(Collections.emptySet(), Collections.emptySet());
        System.out.println("Simmetrics Jaccard similarity: " + s1);
        float d1 = j1.distance(Collections.emptySet(), Collections.emptySet());
        System.out.println("Simmetrics Jaccard distance: " + d1);

        System.out.println("---");

        info.debatty.java.stringsimilarity.Jaccard j2 = new info.debatty.java.stringsimilarity.Jaccard();
        double s2 = j2.similarity("", "");
        System.out.println("javastringsimilarity Jaccard similarity: " + s2);
        double d2 = j2.distance("", "");
        System.out.println("javastringsimilarity Jaccard distance: " + d2);

        System.out.println("---");

        org.apache.commons.text.similarity.JaccardSimilarity j3_1 = new org.apache.commons.text.similarity.JaccardSimilarity();
        double s3 = j3_1.apply("", "");
        System.out.println("commons-text Jaccard similarity: " + s3);
        org.apache.commons.text.similarity.JaccardDistance j3_2 = new org.apache.commons.text.similarity.JaccardDistance();
        double d3 = j3_2.apply("", "");
        System.out.println("commons-text Jaccard distance: " + d3);
    }

}

Produces:

Testing jaccard sim/dis with empty strings
---
Simmetrics Jaccard similarity: 1.0
Simmetrics Jaccard distance: 0.0
---
javastringsimilarity Jaccard similarity: 1.0
javastringsimilarity Jaccard distance: 0.0
---
commons-text Jaccard similarity: 0.0
commons-text Jaccard distance: 1.0

We need to confirm what the correct output is for similarity and distance with empty strings, and then either document why we return what we return or fix it as a bug for the next release.
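
For reference, here is a minimal sketch of one possible convention, not the commons-text implementation. It assumes the similarity is computed over the sets of distinct characters in each string, as |intersection| / |union|. With two empty inputs that ratio is 0/0, so a value has to be chosen; this sketch picks 1.0 (two empty sets are identical), which matches what the other two libraries print above. The class name JaccardEmptySketch and the helper charSet are hypothetical names used only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardEmptySketch {

    // Hypothetical helper: the set of distinct characters in s.
    public static Set<Character> charSet(CharSequence s) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            set.add(s.charAt(i));
        }
        return set;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    // Assumption: two empty inputs are treated as identical and yield 1.0,
    // because the plain formula would evaluate 0/0 in that case.
    public static double similarity(CharSequence left, CharSequence right) {
        Set<Character> a = charSet(left);
        Set<Character> b = charSet(right);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // Jaccard distance is the complement of the similarity.
    public static double distance(CharSequence left, CharSequence right) {
        return 1.0 - similarity(left, right);
    }

    public static void main(String[] args) {
        System.out.println("both empty, similarity: " + similarity("", ""));
        System.out.println("both empty, distance: " + distance("", ""));
        System.out.println("one empty, similarity: " + similarity("", "abc"));
        System.out.println("one empty, distance: " + distance("", "abc"));
    }
}
```

Under this convention only the both-empty case is special: if exactly one input is empty, the union is non-empty and the formula gives similarity 0.0 and distance 1.0 without any extra handling. Whether 1.0 is the right choice for the both-empty case is exactly the question this issue needs to settle.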

Is related to: TEXT-126 Dice's Coefficient Algorithm in String similarity (Improvement)
