fix: support accented characters in word segmentation for return_word_box by Ghazi-raad · Pull Request #17201 · PaddlePaddle/PaddleOCR
Fixes PaddlePaddle#17156
The word segmentation in get_word_info() used the regex [a-zA-Z0-9], which matches only ASCII letters and digits. As a result, words containing accented characters (ä, ö, ü, é, à, etc.) were incorrectly split into separate segments.
Changed to use \w with the re.UNICODE flag, with the underscore excluded. The new check:
- Matches all Unicode letter characters (including accented/diacritic characters)
- Matches digits from all scripts
- Excludes the underscore (which \w includes, but which we want to treat as a splitter)
This fix enables proper word grouping for German, French, Polish, and other languages with accented characters while maintaining backward compatibility with existing ASCII text processing.
Example: 'Grüßen' now stays as one word instead of ['Gr', 'üß', 'en']
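The difference between the two character classes can be reproduced with a minimal standalone sketch. This is not the code from get_word_info(); the `segment` helper and the pattern names are hypothetical, and `[^\W_]` is just one way to express "\w without the underscore" (the PR may implement the exclusion differently).

```python
import re

# Old behaviour: only ASCII letters and digits count as word characters.
ASCII_WORD = re.compile(r"[a-zA-Z0-9]")

# New behaviour (illustrative): any Unicode word character except "_".
# re.UNICODE is already the default for str patterns in Python 3;
# it is spelled out here only to mirror the PR description.
UNICODE_WORD = re.compile(r"[^\W_]", re.UNICODE)


def segment(text, word_char):
    """Hypothetical helper: group consecutive characters that fall in
    the same class (word character vs. splitter)."""
    segments = []
    current = ""
    current_is_word = None
    for ch in text:
        is_word = bool(word_char.match(ch))
        # Start a new segment whenever the class changes.
        if current and is_word != current_is_word:
            segments.append(current)
            current = ""
        current += ch
        current_is_word = is_word
    if current:
        segments.append(current)
    return segments


old = segment("Grüßen", ASCII_WORD)        # ['Gr', 'üß', 'en']
new = segment("Grüßen", UNICODE_WORD)      # ['Grüßen']
under = segment("foo_bar", UNICODE_WORD)   # ['foo', '_', 'bar']
```

With the ASCII-only class, "ü" and "ß" fall into the splitter class and break the word in two places; with the Unicode-aware class the whole word stays in one segment, while the underscore still acts as a splitter.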