Issue 28450: Misleading/inaccurate documentation about unknown escape sequences in regular expressions
Created on 2016-10-15 11:00 by lelit, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (14)
Author: Lele Gaifax (lelit) *
Date: 2016-10-15 11:00
Python 3.6+ is stricter about escape sequences in string literals.
The documentation needs some improvement to clarify the change: for example, https://docs.python.org/3.6/library/re.html#re.sub first says that “Unknown escapes such as \& are left alone”, then, in the “Changed in” section below, states that “[in Py3.6] Unknown escapes consisting of '\' and an ASCII letter now are errors”.
When such changes are made, the documentation usually reports the “new”/“current” behaviour, and the history section mentions when and how some detail changed.
See this thread for details: https://mail.python.org/pipermail/python-list/2016-October/715462.html
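The two statements quoted from the documentation can be checked directly. The following is a minimal sketch, assuming Python 3.7 or later (where a backslash followed by an ASCII letter in a replacement template is an error rather than a warning); the helper name `try_sub` is made up for this illustration.

```python
import re

def try_sub(repl):
    """Apply repl as a replacement template; return the result or the re.error raised."""
    try:
        return re.sub(r"a", repl, "a")
    except re.error as exc:
        return exc

# Backslash followed by a non-letter: the unknown escape is left alone.
print(repr(try_sub(r"\&")))

# Backslash followed by an ASCII letter with no defined meaning: rejected.
print(repr(try_sub(r"\s")))
```

On Python 3.6 the second call would be expected to emit a DeprecationWarning instead of raising.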
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-10-16 08:04
Thank you for your report Lele. Agreed, the documentation looks misleading.
Do you want to provide clearer wording?
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-22 19:08
Maybe just remove the phrase "Unknown escapes such as \& are left alone"?
Author: Barry A. Warsaw (barry) *
Date: 2016-11-22 19:10
I disagree that the documentation is at fault. This is known to break existing code, e.g. http://bugs.python.org/msg281496
I think it's not correct to change the documentation but leave the error-raising behavior for 3.6 because the deprecation was never documented in 3.5 so this will look like a gratuitous regression. for reference.
I also question whether it makes sense for such escapes to be illegal in the repl argument of re.sub(). I could understand this limitation for the pattern argument, but that's not what's causing the error.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-22 19:16
The deprecation was documented in 3.5.
https://docs.python.org/3.5/library/re.html#re.sub
Deprecated since version 3.5, will be removed in version 3.6: Unknown escapes consist of '\' and ASCII letter now raise a deprecation warning and will be forbidden in Python 3.6.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-22 19:28
The reason for disallowing some undefined escapes is the same as in pattern strings: this would allow us to introduce new special escape sequences. For example:
- \N{...} for named character escape.
- Perl and extended PCRE use \L and \U for lower and upper casing of the replacement. \U is already used for another purpose, but you get the idea.
Of course the need for new special escape sequences in the template string is much less than in the pattern string.
Author: Matthew Barnett (mrabarnett) *
Date: 2016-11-22 19:42
@Barry: repl already supports some escapes, e.g. \g<name> for named groups, although not \xXX et al, so deprecating unknown escapes like in the pattern makes sense to me.
BTW, the regex module already supports \xXX, \N{XXX}, etc.
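To make the point about repl concrete, here is a short sketch of escapes that the standard re module does process in a replacement template, next to \xXX, which it does not. It assumes Python 3.7 or later; the variable names are illustrative only.

```python
import re

# \g<name> refers to a named group.
named = re.sub(r"(?P<word>\w+)", r"<\g<word>>", "hi there")

# \1 refers to a numbered group.
numbered = re.sub(r"(\w+)", r"[\1]", "hi")

# \n in the template is converted to a newline character.
newline = re.sub(r"-", r"\n", "a-b")

# \xXX is not a template escape in re, so it is rejected as an unknown escape.
try:
    re.sub(r"-", r"\x41", "a-b")
    hex_supported = True
except re.error:
    hex_supported = False

print(named, numbered, repr(newline), hex_supported)
```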
Author: Barry A. Warsaw (barry) *
Date: 2016-11-22 20:28
On Nov 22, 2016, at 07:28 PM, Serhiy Storchaka wrote:
> The reason for disallowing some undefined escapes is the same as in pattern strings: this would allow us to introduce new special escape sequences.
I'll note that technically speaking, you can still introduce new escapes for repl without breaking the documented contract. All the docs say is that "unknown escapes such as \& are left alone", but they don't list which escapes are unknown. So if new escapes are added in Python 3.7, and they are transformed in repl, that would be allowed.
I'll also note that not all unknown sequences are rejected now, only backslashes followed by an ASCII letter. So \& is still probably left alone, while \s is now rejected. That does add to the confusion, although the deprecation note in the re.sub() documentation does document the new behavior correctly.
On Nov 22, 2016, at 07:55 PM, R. David Murray wrote:
> There is still the argument that we shouldn't break 2.7 compatibility unnecessarily until 2.7 is out of maintenance. That is: warnings are good, removals are bad. (I haven't read through this issue, so I may be off base.)
This is also a reasonable argument, but not one I've thought about since I'm using Python 2 only rarely these days.
On Nov 22, 2016, at 07:34 PM, Serhiy Storchaka wrote:
> If you insist I could revert converting warnings to errors (only in replacement string or all?) in 3.6.
pattern is a regular expression string so it already follows the syntax as described in §6.2.1 Regular Expression Syntax. But I think a reading of that section (and the "special sequences" bit that follows) could also argue that unknown escapes shouldn't throw an error.
> But I think they should be left as errors in 3.7. The earlier we make undefined escapes errors, the earlier we can define new special escape sequences without confusing users. It is bad if the escape sequence is valid in two Python versions but has different meaning.
Perhaps so, but I do think this is a tricky question from a compatibility point of view. One possible option, although it's late in the cycle, would be to introduce a new flag so the user could tell re exactly what behavior they want. The default would have to be backward compatible (i.e. leave unknown sequences alone), but there could be, say, an re.STRICTESCAPES flag that would cause the error to be thrown.
Author: Ned Deily (ned.deily) *
Date: 2016-11-29 03:55
Where do we stand on this issue? At the moment, 3.6.0 is on track to be released as is.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-29 05:02
I think we should discuss this on Python-Dev.
Author: Ned Deily (ned.deily) *
Date: 2016-12-06 22:30
Note that 1b162d6e3d01 in Issue27030 (for 3.6.0rc1) has changed the behavior for re.sub replacement templates to produce a deprecation warning in 3.6 while still being treated as an error in 3.7.
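For code that passes arbitrary text as the replacement and is affected by this change, one way to sidestep template parsing altogether is to pass a callable as repl, since its return value is inserted without escape processing. This is only a sketch of that workaround, not something proposed in this issue; the helper name `sub_literal` and the sample strings are made up.

```python
import re

def sub_literal(pattern, text, string):
    """Replace matches of pattern with text taken literally (no template escapes)."""
    # A function used as repl is called once per match and its return
    # value is used as-is, so backslashes in text are not interpreted.
    return re.sub(pattern, lambda match: text, string)

literal = sub_literal(r"PATH", r"C:\section\new", "dir=PATH")

# Equivalent alternative: double every backslash before using text as a template.
escaped = re.sub(r"PATH", r"C:\section\new".replace("\\", "\\\\"), "dir=PATH")

print(literal)
print(escaped)
```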
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2017-11-16 15:01
Barry, could you please improve the documentation about unknown escape sequences in regular expressions? My skills are not enough for this.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2019-02-25 15:58
New changeset a180b007d96fe68b32f11dec720fbd0cd5b6758a by Serhiy Storchaka in branch 'master': bpo-28450: Fix and improve the documentation for unknown escapes in RE. (GH-11920) https://github.com/python/cpython/commit/a180b007d96fe68b32f11dec720fbd0cd5b6758a
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2019-02-25 16:28
New changeset 95fc8e687c487ecf97f4b1b98dfc0c05e3c9cbff by Serhiy Storchaka in branch '3.7': [3.7] bpo-28450: Fix and improve the documentation for unknown escapes in RE. (GH-11920). (GH-12029) https://github.com/python/cpython/commit/95fc8e687c487ecf97f4b1b98dfc0c05e3c9cbff
History
Date | User | Action | Args
2022-04-11 14:58:38 | admin | set | github: 72636
2019-02-25 16:30:13 | serhiy.storchaka | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved
2019-02-25 16:28:55 | serhiy.storchaka | set | messages: +
2019-02-25 16:18:04 | serhiy.storchaka | set | pull_requests: + pull_request12060
2019-02-25 15:58:33 | serhiy.storchaka | set | messages: +
2019-02-18 15:17:11 | serhiy.storchaka | set | keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request11945
2019-02-14 16:47:40 | serhiy.storchaka | link |
2017-11-16 15:01:37 | serhiy.storchaka | set | messages: +
2016-12-06 22:30:30 | ned.deily | set | messages: +
2016-11-29 05:02:15 | serhiy.storchaka | set | messages: +
2016-11-29 03:55:31 | ned.deily | set | nosy: + ned.deily; messages: +
2016-11-22 21:01:50 | abarry | set | nosy: + abarry
2016-11-22 20:28:47 | barry | set | messages: +
2016-11-22 19:42:41 | mrabarnett | set | messages: +
2016-11-22 19:28:56 | serhiy.storchaka | set | messages: +
2016-11-22 19:16:45 | serhiy.storchaka | set | messages: +
2016-11-22 19:10:59 | barry | set | nosy: + barry; messages: +
2016-11-22 19:08:06 | serhiy.storchaka | set | messages: +
2016-10-16 08:12:09 | serhiy.storchaka | set | nosy: + ezio.melotti; components: + Regular Expressions
2016-10-16 08:04:14 | serhiy.storchaka | set | versions: + Python 3.5, Python 3.7; type: enhancement; nosy: + nedbat, serhiy.storchaka, Rosuav, mrabarnett; title: Misleading/inaccurate documentation about unknown escape sequences -> Misleading/inaccurate documentation about unknown escape sequences in regular expressions; messages: +; stage: needs patch
2016-10-15 11:00:13 | lelit | create |