Issue32285
Created on 2017-12-12 01:16 by Maxime Belanger, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Pull Requests |
URL | Status | Linked |
PR 4806 | merged | maxbelanger, 2017-12-12 01:19 |
|
Messages (4) |
|
|
msg308085 - (view) |
Author: Maxime Belanger (Maxime Belanger) |
Date: 2017-12-12 01:16 |
In our deployment of Python 2.7, we've patched `unicodedata` to introduce a new function, `is_normalized`, which checks whether a unistr is in a given normal form. Currently this has to be done by creating a normalized copy and then checking whether it is equal to the source string. The new function uses the internal helper (also called `is_normalized`) that can "quick check" normalization, and falls back on creating a normalized copy and comparing only when necessary. We're contributing this change in case it can be helpful to others. Feedback is welcome! |
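For readers comparing the two approaches described above, here is a minimal sketch. It assumes Python 3.8 or later for `unicodedata.is_normalized` (the version this issue was eventually targeted at); the helper names `is_normalized_by_copy` and `check` are hypothetical and exist only for illustration.

```python
import unicodedata


def is_normalized_by_copy(form, text):
    """Older approach: build a normalized copy and compare it to the source."""
    return text == unicodedata.normalize(form, text)


def check(form, text):
    """Use unicodedata.is_normalized when it exists, else fall back to the copy."""
    native = getattr(unicodedata, "is_normalized", None)
    if native is not None:
        return native(form, text)
    return is_normalized_by_copy(form, text)


composed = "\u00e9"     # e with acute accent as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

for form in ("NFC", "NFD"):
    for label, text in (("composed", composed), ("decomposed", decomposed)):
        print(form, label, check(form, text))
```

Both helpers should agree on every input; the difference is only in how much work is done to reach the answer.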
|
|
msg308122 - (view) |
Author: Steven D'Aprano (steven.daprano) *  |
Date: 2017-12-12 12:25 |
Python 2.7 is in feature freeze, so this can only go into 3.7. I would find this useful, and would like this feature. However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing. That could be expensive, and shouldn't be needed. According to here: http://unicode.org/reports/tr15/#Detecting_Normalization_Forms in the worst case, you can incrementally check only the code points in doubt (around the "MAYBE" code points). |
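The scan described in that report can be sketched in pure Python. This is only an illustration, not the C implementation in the patch: Python's `unicodedata` module does not expose the Quick_Check properties, so the sketch approximates the NFD property (which only ever yields YES or NO) from `unicodedata.decomposition`. The names `nfd_allowed` and `quick_check` are hypothetical. For NFC and NFKC, a real property lookup could also return MAYBE, which the loop already propagates.

```python
import unicodedata

YES, NO, MAYBE = "YES", "NO", "MAYBE"


def nfd_allowed(ch):
    """Hypothetical stand-in for the NFD_Quick_Check property (YES or NO only)."""
    cp = ord(ch)
    # Precomposed Hangul syllables decompose algorithmically in NFD.
    if 0xAC00 <= cp <= 0xD7A3:
        return NO
    mapping = unicodedata.decomposition(ch)
    # A mapping without a "<tag>" prefix is a canonical decomposition.
    if mapping and not mapping.startswith("<"):
        return NO
    return YES


def quick_check(text, allowed=nfd_allowed):
    """Single pass over the text; returns YES, NO or MAYBE."""
    last_ccc = 0
    result = YES
    for ch in text:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order mean "not normalized".
        if ccc != 0 and last_ccc > ccc:
            return NO
        verdict = allowed(ch)
        if verdict == NO:
            return NO
        if verdict == MAYBE:
            result = MAYBE
        last_ccc = ccc
    return result
```

A MAYBE result is the case under discussion: the caller can either normalize the whole string and compare, or examine only the stretch of text around the doubtful code points.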
|
|
msg308127 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-12-12 13:10 |
> However, I'm concerned by your comment that you fall back on creating a normalized copy and comparing.
The purpose of the function is to be faster than str == unicodedata.normalize(form, str). So yeah, any optimization is welcome. But I don't worry about the suboptimal MAYBE case, which is implemented with str == unicodedata.normalize(form, str). It can be optimized later, if needed. If someone cares about performance, I will require a benchmark, since I only trust numbers :-) |
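A benchmark of the kind requested here could be sketched with `timeit`. This assumes Python 3.8 or later, where `unicodedata.is_normalized` is available; the sample strings, the helper names (`by_copy`, `by_builtin`, `run_benchmark`) and the iteration count are arbitrary choices for illustration, and no timing results are claimed.

```python
import timeit
import unicodedata


def by_copy(form, text):
    # Baseline: normalize, then compare against the source string.
    return text == unicodedata.normalize(form, text)


def by_builtin(form, text):
    # Function added by this issue (assumes Python 3.8+).
    return unicodedata.is_normalized(form, text)


SAMPLES = {
    "ascii": "hello world " * 100,
    "composed": "\u00e9" * 1000,
    "decomposed": "e\u0301" * 1000,
}


def run_benchmark(form="NFC", number=200):
    """Return {(sample, approach): seconds} for each sample and approach."""
    timings = {}
    for name, text in SAMPLES.items():
        for label, func in (("copy", by_copy), ("builtin", by_builtin)):
            elapsed = timeit.timeit(lambda: func(form, text), number=number)
            timings[(name, label)] = elapsed
    return timings


if __name__ == "__main__":
    for key, seconds in sorted(run_benchmark().items()):
        print(key, f"{seconds:.6f}s")
```

Numbers will vary by machine and input; the decomposed sample is the interesting one for NFC, since it is where a quick check cannot simply answer YES.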
|
|
msg329276 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2018-11-04 23:58 |
New changeset 2810dd7be9876236f74ac80716d113572c9098dd by Benjamin Peterson (Max Bélanger) in branch 'master': closes bpo-32285: Add unicodedata.is_normalized. (GH-4806) https://github.com/python/cpython/commit/2810dd7be9876236f74ac80716d113572c9098dd |
|
|
History |
Date | User | Action | Args |
2022-04-11 14:58:55 | admin | set | github: 76466 |
2018-11-04 23:58:27 | benjamin.peterson | set | status: open -> closed; nosy: + benjamin.peterson; messages: +; resolution: fixed; stage: patch review -> resolved |
2018-10-25 00:06:09 | Maxime Belanger | set | versions: + Python 3.8, - Python 3.7 |
2017-12-12 13:10:46 | vstinner | set | messages: + |
2017-12-12 12:25:08 | steven.daprano | set | versions: - Python 2.7; nosy: + steven.daprano; messages: +; type: enhancement |
2017-12-12 01:19:59 | maxbelanger | set | keywords: + patch; stage: patch review; pull_requests: + pull_request4703 |
2017-12-12 01:16:10 | Maxime Belanger | create | |