[CODEC-330] org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String) does not remove special characters like punctuation

Method: org.apache.commons.codec.language.DaitchMokotoffSoundex.cleanup(String input)

Problem

The private method cleanup(final String input) in DaitchMokotoffSoundex is responsible for sanitizing the input string before the phonetic encoding is applied. While it correctly removes whitespace and performs ASCII folding, it does not remove other non-letter characters, such as the symbols $, @, #, and !, or digits. These characters remain in the cleaned string.

As a result, special characters may interfere with phonetic rule matching in downstream methods like "soundex" and "encode", potentially leading to incorrect or inconsistent results.

For example, cleanup("Hello$World") currently returns "hello$world".

The dollar sign ($) should have been removed, but it remains in the result.

The expected result is "helloworld".

Suggested Fix

Modify the cleanup() method to include a check for non-letter characters:

if (!Character.isLetter(ch)) {
    continue; // Ignore non-letter characters like $, @, -, etc.
}

This small change will make the method more robust when processing real-world input strings that may contain unexpected non-letter characters.
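To show where the check would sit, here is a minimal standalone sketch of the loop with the proposed fix applied. It is not the library source: the class name CleanupSketch is hypothetical, the method is a simplified stand-in for the private cleanup(String), and ASCII folding is left out for brevity.

```java
// Hypothetical stand-in for DaitchMokotoffSoundex.cleanup(String).
// Assumption: ASCII folding is omitted; only whitespace removal,
// the proposed letter check, and lower-casing are shown.
class CleanupSketch {

    static String cleanup(final String input) {
        final StringBuilder sb = new StringBuilder();
        for (final char ch : input.toCharArray()) {
            if (Character.isWhitespace(ch)) {
                continue; // existing behavior: drop whitespace
            }
            if (!Character.isLetter(ch)) {
                continue; // proposed check: drop punctuation, symbols, digits
            }
            sb.append(Character.toLowerCase(ch));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(cleanup("Hello$World"));      // helloworld
        System.out.println(cleanup("O'Brien-Smith 42")); // obriensmith
    }
}
```

Because Character.isLetter also accepts accented letters, placing the check before the folding step would still let those characters reach the folding logic rather than being discarded.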

Additional Context

This issue was identified during unit testing using JUnit 5. After applying the above fix, all test cases involving inputs with special characters pass successfully. Without this fix, the current implementation leaves those characters in the cleaned string, and the same test cases fail.