Issue 1466065: base64 module ignores non-alphabet characters (original) (raw)
Created on 2006-04-07 01:19 by sanxiyn, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (33)
Author: Seo Sanghyeon (sanxiyn)
Date: 2006-04-07 01:19
See: http://mail.python.org/pipermail/python-dev/2006-April/063504.html
base64.b64decode function ignores non-alphabet characters, which contradicts its documentation: see http://docs.python.org/lib/module-base64.html
Also, the exception must be ValueError instead of TypeError.
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-28 22:38
Attached a patch that modifies Modules/binascii.c to raise a TypeError on any unrecognized base64 input, rather than failing silently. Also includes unit tests.
Patch built against Python 3.2, r74246.
TypeError was preserved over ValueError, despite the lack of related logic, for legacy purposes.
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-28 22:39
Attached a patch that modifies Lib/base64.py to raise a TypeError on any unrecognized base64 input, rather than failing silently. Also includes unit tests.
Patch built against Python 2.7, r74246.
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-28 22:40
Attached a patch that modifies Lib/base64.py to raise a TypeError on any unrecognized base64 input, rather than continuing silently. Also includes unit tests.
Patch built against Python 3.2, r74246.
Correction to previous text: base64 continues silently, discarding unrecognized characters, rather than failing.
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *
Date: 2009-07-29 13:44
The new unit tests pass without modifying the library. Could you include a case that fails with the current version?
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-29 16:31
I can't add a test for that without changing unrelated behaviour in the library that is already known to work fine or checking the string value of the raised exception, which seems like a bad idea, even though it would work.
If a character is ignored and this leads to a padding-length issue, TypeError is raised in both 2.7 and 3.2: 2.7 because everything is a TypeError, and 3.2 because binascii.Error is converted to TypeError for legacy purposes.
If a character is ignored and the string's length is still acceptable, then no error is reported because this was a silent problem.
Post-library-modification, both of these cases will uniformly produce the proper error, although it is, through type-checking alone, indistinguishable from the errors that would have existed before -- the value is in the fact that it will tell the user the nature of the failure, and it will be noisy when it may have been silent before.
Author: R. David Murray (r.david.murray) *
Date: 2009-07-29 16:40
If "it may be noisy where it was silent before", then add one of those cases and make sure the noise doesn't happen before your fix, and does happen after.
If you have to check the value of the exception string for other tests, then do so. There are plenty of examples of this in the existing tests, (see the pydoc tests, for example). If you can limit what you test for so that the test will be resitent to changes in the exact text, so much the better. You can use assertRaisesRegexp for this in 2.7.
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *
Date: 2009-07-29 16:44
What is the correct behavior for something like this? base64.b64decode('!') 2.6 silently returns ''.
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-29 16:58
According to the documentation cited by Seo Sanghyeon in the first post, "A TypeError is raised if s were incorrectly padded or if there are non-alphabet characters present in the string."
The valid range of characters is [A-Za-z0-9], and one or two '='s may appear at the end of the input to signify dimension-padding. Therefore, '!' should fail with a TypeError alerting the user to the presence of unrecognized data, rather than being discarded.
(Additionally, it looks like newline characters in the input aren't unheard of, and those are probably the reason behind the silent ignores)
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-29 16:58
R. David Murray, should I update the patches for both the pure-Python solution and the C solution, or is one domain preferable here? The Python-based solution keeps all of the invalid-character TypeErrors collected in the same module, but the C-based solution allows this problem to be caught more efficiently.
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *
Date: 2009-07-29 17:00
Therefore, '!' should fail with a TypeError Here is your test case! "Errors should never pass silently."
Author: R. David Murray (r.david.murray) *
Date: 2009-07-29 17:05
Amaury is probably better qualified to answer that question, but I would think the C code version is preferable.
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *
Date: 2009-07-29 17:23
I've dig into the history of the file and found this change: http://svn.python.org/view?view=rev&revision=13939 "- Illegal padding is now ignored. (Recommendation by GvR.)"
The motivation at the time was based on "the general Internet philosophy": http://mail.python.org/pipermail/python-bugs-list/1999-October/000234.html
I don't know if this is still valid 10 years later, though.
Author: Antoine Pitrou (pitrou) *
Date: 2009-07-29 17:35
Perhaps Guido remembers why the decision was made.
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-29 17:40
I'll hold off on uploading new patches until someone makes a decision, then.
It seems like, perhaps, simply amending the documentation would be sufficient, since this behaviour shouldn't break any valid messages that might reach this module. At worst, it'll just treat gibberish as valid, and that's what it's been doing for a decade. (Although the other decode routines are all strict by comparison)
Author: Guido van Rossum (gvanrossum) *
Date: 2009-07-29 17:55
It strikes me as simply a documentation bug. Maybe whoever wrote the docs just assumed it would work that way. I expect that changing it to insist on valid input would break some use cases. If you want a validating API, perhaps a new API could be added that validates as well as converts.
Author: R. David Murray (r.david.murray) *
Date: 2009-07-29 18:04
It turns out that the RFC is unambiguous on this point. Quoting the base64 section of RFC 2045:
The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances.
Since "some circumstances" is not something the base64 decoder can decide, that has to be left to a higher level ap. So if unexpected characters are to generate an error, it would need to be enabled via a flag that defaults to not raising the error, IMO. Unless someone has a use case for rejecting an improperly encoded message, we should probably just fix the docs (perhaps noting that this behavior is in accordance with the RFC).
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-29 18:14
RFC 3548, referenced by the base64 module's docs, has a rather different statement on how invalid characters should be treated.
From 2.3 Interpretation of non-alphabet characters in encoded data: Implementations MUST reject the encoding if it contains characters outside the base alphabet when interpreting base encoded data, unless the specification referring to this document explicitly states otherwise. Such specifications may, as MIME does, instead state that characters outside the base encoding alphabet should simply be ignored when interpreting data ("be liberal in what you accept").
So it looks like we can safely just say that invalid characters are ignored in the docs, as long as it's explicit, but that's probably not what people will expect.
I'll add doc patches in a moment, and someone who's actually a developer (i.e., not me) can decide whether they're good enough.
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-29 18:23
Attached a documentation patch indicating that the ignore-invalid-characters behaviour is intentional, citing relevant RFCs for support.
Patch built against Python 3.2, r74261.
Author: R. David Murray (r.david.murray) *
Date: 2009-07-29 18:39
Hmm. But if the module is used outside of MIME (which it can be, and in fact is in the stdlib itself), then an error must be raised in order to comply with that RFC. So it sounds like we really ought to have that flag. And I was even wrong about the appropriate default.
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-29 18:53
It isn't written that only MIME may ignore such content. The key terms there are 'may' and 'explicitly states otherwise'.
If the documentation is clear, then all future application developers will know to check for validity using a regular expression like '^[A-Za-z0-9+/\r\n]+={0,2}$'. Any existing applications in which validity matters should already have a similar workaround.
While I do agree that standards are always good and that workarounds are bad, Guido does have a very valid point: "changing it to insist on valid input would break some use cases", and I think we already missed the 2.x -> 3.x window where that would have been acceptable.
Author: Antoine Pitrou (pitrou) *
Date: 2009-07-29 19:05
If the documentation is clear, then all future application developers will know to check for validity using a regular expression like '^[A-Za-z0-9+/\r\n]+={0,2}$'. Any existing applications in which validity matters should already have a similar workaround.
But having to validate input manually kinds of defeats the point of having a decoder in the stdlib, therefore I agree with MRAB that a validation flag would be useful.
Author: R. David Murray (r.david.murray) *
Date: 2009-07-29 19:07
And if the flag defaults to the current behavior that should satisfy the backward compatiblity issues.
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-29 19:36
Attached a documentation/unit-test/solution patch that adds a validate=False parameter to the b64decode function of Lib/base64.py, which may be set to True to have invalid base64 content be rejected with a TypeError.
Patch built against Python 3.2, r74261.
Author: Neil Tallim (Red HamsterX)
Date: 2009-07-29 19:38
Attached a documentation/unit-test/solution patch that adds a validate=False parameter to the b64decode function of Lib/base64.py, which may be set to True to have invalid base64 content be rejected with a TypeError.
Patch built against Python 2.7, r74261.
Note: Sorry this went on for so long. However, I think I understand the patch-submission process a lot better now.
Author: Mark Lawrence (BreamoreBoy) *
Date: 2010-08-22 09:05
Attached patches against 2.7 and 3.2 contain doc and unit test changes so can someone please review and commit if acceptable.
Author: R. David Murray (r.david.murray) *
Date: 2010-09-06 02:29
Comments on patch:
if I'm reading the RFC correctly, to be validating strictly in compliance with the RFC \r and \n should also raise an error. Do you agree?
We've pretty much dropped the convention of adding history notes to the file itself.
The original code documented a TypeError on incorrect padding but in py3k in fact raises a binascii.Error, as you noted in the patch. I wonder if it would be better to raise a binascii.Error on invalid data as well, since it isn't, strictly speaking, a TypeError. That would also make it easier to move the code into the C module later if that is deemed desirable.
The negative in the doc language ("If validate is not set to True...") is awkward and confusing. Better would be "If validate is False (the default)..."
Since the patch does add an API change (AKA a feature) I think this can only go into 3.2.
I have some additional concerns when considering this in the context of email6. email6 will have a 'strict' mode where it would be sensible for invalid base64 to raise an error. But in non-strict mode, it would be ideal to have a way to (a) know if there is invalid input, but still decode it, and (b) decode it even if the padding is off after ignoring the invalid data. I'm not saying that this patch should try to address those issues, I just want to put them on record in case I want to do something about them later.
Author: Neil Tallim (Red HamsterX)
Date: 2010-09-10 22:08
I agree that it does look like RFC 3548 implicitly denies the use of \r and \n. A few very simple tests also makes it look like ommitting them breaks nothing, but we'd need to add a test that includes them before applying the patch.
Other than that, I agree with everything and find the email6 notes intruiging.
Would you like me to update the patch, or do you want to do it?
Author: R. David Murray (r.david.murray) *
Date: 2010-09-11 01:35
If you could update it that would be great.
Author: R. David Murray (r.david.murray) *
Date: 2010-11-08 21:29
Here is an updated patch that addresses the concerns I noted. I modified the tests: given that I've changed the code to raise binascii.Error as discussed, we don't really care from an API point of view what the error text is, just that the error is raised in the cases given. Nor do we care (again, from an API point of view) if the first error detected is the invalid character or the invalid padding.
I made the new test test the validate=False case, so all of the invalid character tests are now correctly padded if you ignore the invalid characters.
Author: R. David Murray (r.david.murray) *
Date: 2010-11-11 20:12
Committed in r86414.
Author: Lukasz Taczuk (ltaczuk)
Date: 2012-11-26 12:30
Could someone update the docs for python 2.7.3?
This ticket is marked as closed and b64decode still silently ignores non base64 chars, but the documentation (for 2.7.3) still states that while calling b64decode "A TypeError is raised if (...) or if there are non-alphabet characters present in the string.". http://docs.python.org/2.7/library/base64.html
Author: Julian Berman (Julian) *
Date: 2014-07-26 22:44
Created to address not having fixed Py2 here.
History
Date
User
Action
Args
2022-04-11 14:56:16
admin
set
github: 43175
2014-07-26 22:44:38
Julian
set
nosy: + Julian
messages: +
2012-11-26 12:30:22
ltaczuk
set
nosy: + ltaczuk
messages: +
2010-11-11 20:12:52
r.david.murray
set
status: open -> closed
nosy: - BreamoreBoy
messages: +
resolution: accepted
stage: patch review -> resolved
2010-11-08 21:29:50
r.david.murray
set
files: + binascii-validate.patch
messages: +
2010-09-11 01:35:02
r.david.murray
set
messages: +
2010-09-10 22:08:01
Red HamsterX
set
messages: +
2010-09-06 02:29:49
r.david.murray
set
assignee: r.david.murray
messages: +
versions: - Python 3.1, Python 2.7
2010-08-22 16:57:23
gvanrossum
set
nosy: - gvanrossum
2010-08-22 09:05:54
BreamoreBoy
set
versions: + Python 3.1
nosy: + BreamoreBoy
messages: +
stage: test needed -> patch review
2010-02-16 05:50:30
eric.araujo
set
nosy: + eric.araujo
2009-07-29 22:25:28
rhettinger
set
messages: -
2009-07-29 19:38:23
Red HamsterX
set
files: + 1466065[2.7-complete].diff
messages: +
2009-07-29 19:36:54
Red HamsterX
set
files: + 1466065[3.2-complete].diff
messages: +
2009-07-29 19:33:05
Red HamsterX
set
files: - 1466065[3.2].diff
2009-07-29 19:33:01
Red HamsterX
set
files: - 1466065[2.7].diff
2009-07-29 19:07:54
r.david.murray
set
messages: +
2009-07-29 19:05:03
pitrou
set
messages: +
2009-07-29 18:53:24
Red HamsterX
set
messages: +
2009-07-29 18:39:26
r.david.murray
set
messages: +
2009-07-29 18:28:05
rhettinger
set
messages: -
2009-07-29 18:23:55
Red HamsterX
set
files: + 1466065[3.2].diff
messages: +
2009-07-29 18:23:26
Red HamsterX
set
files: + 1466065[2.7].diff
messages: +
2009-07-29 18:14:04
Red HamsterX
set
messages: +
2009-07-29 18:04:57
r.david.murray
set
messages: +
2009-07-29 17:55:46
gvanrossum
set
messages: +
2009-07-29 17:40:17
Red HamsterX
set
messages: +
2009-07-29 17:35:27
pitrou
set
nosy: + gvanrossum, pitrou
messages: +
versions: + Python 2.7, Python 3.2, - Python 2.6, Python 3.0
2009-07-29 17:23:20
amaury.forgeotdarc
set
messages: +
2009-07-29 17:14:34
Red HamsterX
set
files: - 1466065[3.2-pure-python].diff
2009-07-29 17:14:30
Red HamsterX
set
files: - 1466065[2.7-pure-python].diff
2009-07-29 17:14:26
Red HamsterX
set
files: - 1466065[3.2].diff
2009-07-29 17:14:21
Red HamsterX
set
files: - 1466065[2.7].diff
2009-07-29 17:05:10
r.david.murray
set
messages: +
2009-07-29 17:00:08
amaury.forgeotdarc
set
messages: +
2009-07-29 16:58:32
Red HamsterX
set
messages: +
2009-07-29 16:58:04
Red HamsterX
set
messages: +
2009-07-29 16:44:24
amaury.forgeotdarc
set
messages: +
2009-07-29 16:40:21
r.david.murray
set
nosy: + r.david.murray
messages: +
2009-07-29 16:31:00
Red HamsterX
set
messages: +
2009-07-29 13:44:01
amaury.forgeotdarc
set
nosy: + amaury.forgeotdarc
messages: +
2009-07-28 22:40:57
Red HamsterX
set
files: + 1466065[3.2-pure-python].diff
messages: +
2009-07-28 22:39:11
Red HamsterX
set
files: + 1466065[2.7-pure-python].diff
messages: +
2009-07-28 22:38:07
Red HamsterX
set
files: + 1466065[3.2].diff
messages: +
2009-07-28 22:36:56
Red HamsterX
set
files: + 1466065[2.7].diff
nosy: + ajaksu2, Red HamsterX
messages: +
keywords: + patch
2009-03-21 02:03:38
ajaksu2
set
stage: test needed
type: behavior
versions: + Python 2.6, Python 3.0
2006-04-07 01:19:53
sanxiyn
create