fix: support accented characters in word segmentation for return_word… by Ghazi-raad · Pull Request #17201 · PaddlePaddle/PaddleOCR

@Ghazi-raad

…_box

Fixes PaddlePaddle#17156

The word segmentation in get_word_info() used the regex [a-zA-Z0-9], which matches only ASCII letters and digits. As a result, words containing accented characters (ä, ö, ü, é, à, etc.) were incorrectly split into separate segments.

Changed to \w with the re.UNICODE flag, which matches Unicode letters (including accented ones), digits, and the underscore.

This fix enables proper word grouping for German, French, Polish, and other languages with accented characters while maintaining backward compatibility with existing ASCII text processing.

Example: 'Grüßen' now stays as one word instead of ['Gr', 'üß', 'en']
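
The difference between the two patterns can be reproduced with a minimal sketch. It is not the actual get_word_info() code: the segment() helper and the two pattern names are hypothetical, and the helper only groups consecutive characters by whether they match the given pattern.

```python
import re
from itertools import groupby

# Hypothetical names for illustration; not taken from the PaddleOCR source.
ASCII_WORD = re.compile(r"[a-zA-Z0-9]")      # old behaviour: ASCII letters and digits only
UNICODE_WORD = re.compile(r"\w", re.UNICODE)  # new behaviour: Unicode word characters


def segment(text, pattern):
    """Group consecutive characters by whether each one matches `pattern`."""
    return [
        "".join(chars)
        for _, chars in groupby(text, key=lambda ch: bool(pattern.match(ch)))
    ]


word = "Gr\u00fc\u00dfen"  # 'Grüßen', written with precomposed code points

print(segment(word, ASCII_WORD))    # ['Gr', 'üß', 'en']
print(segment(word, UNICODE_WORD))  # ['Grüßen']
```

Two side effects are worth keeping in mind. \w also matches the underscore, which [a-zA-Z0-9] did not, so a string such as 'a_b' is now treated as one run. In Python 3, str patterns are Unicode-aware by default, so re.UNICODE is redundant there but harmless.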