Issue 1466065: base64 module ignores non-alphabet characters (original) (raw)

Created on 2006-04-07 01:19 by sanxiyn, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (33)

msg60903 - (view)

Author: Seo Sanghyeon (sanxiyn)

Date: 2006-04-07 01:19

See: http://mail.python.org/pipermail/python-dev/2006-April/063504.html

base64.b64decode function ignores non-alphabet characters, which contradicts its documentation: see http://docs.python.org/lib/module-base64.html

Also, the exception must be ValueError instead of TypeError.

msg91023 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-28 22:38

Attached a patch that modifies Modules/binascii.c to raise a TypeError on any unrecognized base64 input, rather than failing silently. Also includes unit tests.

Patch built against Python 3.2, r74246.

TypeError was preserved over ValueError, despite the lack of related logic, for legacy purposes.

msg91024 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-28 22:39

Attached a patch that modifies Lib/base64.py to raise a TypeError on any unrecognized base64 input, rather than failing silently. Also includes unit tests.

Patch built against Python 2.7, r74246.

msg91025 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-28 22:40

Attached a patch that modifies Lib/base64.py to raise a TypeError on any unrecognized base64 input, rather than continuing silently. Also includes unit tests.

Patch built against Python 3.2, r74246.

Correction to previous text: base64 continues silently, discarding unrecognized characters, rather than failing.

msg91039 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2009-07-29 13:44

The new unit tests pass without modifying the library. Could you include a case that fails with the current version?

msg91043 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-29 16:31

I can't add a test for that without changing unrelated behaviour in the library that is already known to work fine or checking the string value of the raised exception, which seems like a bad idea, even though it would work.

If a character is ignored and this leads to a padding-length issue, TypeError is raised in both 2.7 and 3.2: 2.7 because everything is a TypeError, and 3.2 because binascii.Error is converted to TypeError for legacy purposes.

If a character is ignored and the string's length is still acceptable, then no error is reported because this was a silent problem.

Post-library-modification, both of these cases will uniformly produce the proper error, although it is, through type-checking alone, indistinguishable from the errors that would have existed before -- the value is in the fact that it will tell the user the nature of the failure, and it will be noisy when it may have been silent before.

msg91045 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2009-07-29 16:40

If "it may be noisy where it was silent before", then add one of those cases and make sure the noise doesn't happen before your fix, and does happen after.

If you have to check the value of the exception string for other tests, then do so. There are plenty of examples of this in the existing tests, (see the pydoc tests, for example). If you can limit what you test for so that the test will be resitent to changes in the exact text, so much the better. You can use assertRaisesRegexp for this in 2.7.

msg91046 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2009-07-29 16:44

What is the correct behavior for something like this? base64.b64decode('!') 2.6 silently returns ''.

msg91047 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-29 16:58

According to the documentation cited by Seo Sanghyeon in the first post, "A TypeError is raised if s were incorrectly padded or if there are non-alphabet characters present in the string."

The valid range of characters is [A-Za-z0-9], and one or two '='s may appear at the end of the input to signify dimension-padding. Therefore, '!' should fail with a TypeError alerting the user to the presence of unrecognized data, rather than being discarded.

(Additionally, it looks like newline characters in the input aren't unheard of, and those are probably the reason behind the silent ignores)

msg91048 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-29 16:58

R. David Murray, should I update the patches for both the pure-Python solution and the C solution, or is one domain preferable here? The Python-based solution keeps all of the invalid-character TypeErrors collected in the same module, but the C-based solution allows this problem to be caught more efficiently.

msg91049 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2009-07-29 17:00

Therefore, '!' should fail with a TypeError Here is your test case! "Errors should never pass silently."

msg91050 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2009-07-29 17:05

Amaury is probably better qualified to answer that question, but I would think the C code version is preferable.

msg91052 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2009-07-29 17:23

I've dig into the history of the file and found this change: http://svn.python.org/view?view=rev&revision=13939 "- Illegal padding is now ignored. (Recommendation by GvR.)"

The motivation at the time was based on "the general Internet philosophy": http://mail.python.org/pipermail/python-bugs-list/1999-October/000234.html

I don't know if this is still valid 10 years later, though.

msg91053 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-07-29 17:35

Perhaps Guido remembers why the decision was made.

msg91054 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-29 17:40

I'll hold off on uploading new patches until someone makes a decision, then.

It seems like, perhaps, simply amending the documentation would be sufficient, since this behaviour shouldn't break any valid messages that might reach this module. At worst, it'll just treat gibberish as valid, and that's what it's been doing for a decade. (Although the other decode routines are all strict by comparison)

msg91055 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2009-07-29 17:55

It strikes me as simply a documentation bug. Maybe whoever wrote the docs just assumed it would work that way. I expect that changing it to insist on valid input would break some use cases. If you want a validating API, perhaps a new API could be added that validates as well as converts.

msg91056 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2009-07-29 18:04

It turns out that the RFC is unambiguous on this point. Quoting the base64 section of RFC 2045:

The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances.

Since "some circumstances" is not something the base64 decoder can decide, that has to be left to a higher level ap. So if unexpected characters are to generate an error, it would need to be enabled via a flag that defaults to not raising the error, IMO. Unless someone has a use case for rejecting an improperly encoded message, we should probably just fix the docs (perhaps noting that this behavior is in accordance with the RFC).

msg91057 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-29 18:14

RFC 3548, referenced by the base64 module's docs, has a rather different statement on how invalid characters should be treated.

From 2.3 Interpretation of non-alphabet characters in encoded data: Implementations MUST reject the encoding if it contains characters outside the base alphabet when interpreting base encoded data, unless the specification referring to this document explicitly states otherwise. Such specifications may, as MIME does, instead state that characters outside the base encoding alphabet should simply be ignored when interpreting data ("be liberal in what you accept").

So it looks like we can safely just say that invalid characters are ignored in the docs, as long as it's explicit, but that's probably not what people will expect.

I'll add doc patches in a moment, and someone who's actually a developer (i.e., not me) can decide whether they're good enough.

msg91060 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-29 18:23

Attached a documentation patch indicating that the ignore-invalid-characters behaviour is intentional, citing relevant RFCs for support.

Patch built against Python 3.2, r74261.

msg91061 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2009-07-29 18:39

Hmm. But if the module is used outside of MIME (which it can be, and in fact is in the stdlib itself), then an error must be raised in order to comply with that RFC. So it sounds like we really ought to have that flag. And I was even wrong about the appropriate default.

msg91062 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-29 18:53

It isn't written that only MIME may ignore such content. The key terms there are 'may' and 'explicitly states otherwise'.

If the documentation is clear, then all future application developers will know to check for validity using a regular expression like '^[A-Za-z0-9+/\r\n]+={0,2}$'. Any existing applications in which validity matters should already have a similar workaround.

While I do agree that standards are always good and that workarounds are bad, Guido does have a very valid point: "changing it to insist on valid input would break some use cases", and I think we already missed the 2.x -> 3.x window where that would have been acceptable.

msg91064 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2009-07-29 19:05

If the documentation is clear, then all future application developers will know to check for validity using a regular expression like '^[A-Za-z0-9+/\r\n]+={0,2}$'. Any existing applications in which validity matters should already have a similar workaround.

But having to validate input manually kinds of defeats the point of having a decoder in the stdlib, therefore I agree with MRAB that a validation flag would be useful.

msg91065 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2009-07-29 19:07

And if the flag defaults to the current behavior that should satisfy the backward compatiblity issues.

msg91067 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-29 19:36

Attached a documentation/unit-test/solution patch that adds a validate=False parameter to the b64decode function of Lib/base64.py, which may be set to True to have invalid base64 content be rejected with a TypeError.

Patch built against Python 3.2, r74261.

msg91068 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2009-07-29 19:38

Patch built against Python 2.7, r74261.

Note: Sorry this went on for so long. However, I think I understand the patch-submission process a lot better now.

msg114650 - (view)

Author: Mark Lawrence (BreamoreBoy) *

Date: 2010-08-22 09:05

Attached patches against 2.7 and 3.2 contain doc and unit test changes so can someone please review and commit if acceptable.

msg115687 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2010-09-06 02:29

Comments on patch:

if I'm reading the RFC correctly, to be validating strictly in compliance with the RFC \r and \n should also raise an error. Do you agree?
We've pretty much dropped the convention of adding history notes to the file itself.
The original code documented a TypeError on incorrect padding but in py3k in fact raises a binascii.Error, as you noted in the patch. I wonder if it would be better to raise a binascii.Error on invalid data as well, since it isn't, strictly speaking, a TypeError. That would also make it easier to move the code into the C module later if that is deemed desirable.
The negative in the doc language ("If validate is not set to True...") is awkward and confusing. Better would be "If validate is False (the default)..."

Since the patch does add an API change (AKA a feature) I think this can only go into 3.2.

I have some additional concerns when considering this in the context of email6. email6 will have a 'strict' mode where it would be sensible for invalid base64 to raise an error. But in non-strict mode, it would be ideal to have a way to (a) know if there is invalid input, but still decode it, and (b) decode it even if the padding is off after ignoring the invalid data. I'm not saying that this patch should try to address those issues, I just want to put them on record in case I want to do something about them later.

msg116049 - (view)

Author: Neil Tallim (Red HamsterX)

Date: 2010-09-10 22:08

I agree that it does look like RFC 3548 implicitly denies the use of \r and \n. A few very simple tests also makes it look like ommitting them breaks nothing, but we'd need to add a test that includes them before applying the patch.

Other than that, I agree with everything and find the email6 notes intruiging.

Would you like me to update the patch, or do you want to do it?

msg116072 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2010-09-11 01:35

If you could update it that would be great.

msg120803 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2010-11-08 21:29

Here is an updated patch that addresses the concerns I noted. I modified the tests: given that I've changed the code to raise binascii.Error as discussed, we don't really care from an API point of view what the error text is, just that the error is raised in the cases given. Nor do we care (again, from an API point of view) if the first error detected is the invalid character or the invalid padding.

I made the new test test the validate=False case, so all of the invalid character tests are now correctly padded if you ignore the invalid characters.

msg120966 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2010-11-11 20:12

Committed in r86414.

msg176409 - (view)

Author: Lukasz Taczuk (ltaczuk)

Date: 2012-11-26 12:30

Could someone update the docs for python 2.7.3?

This ticket is marked as closed and b64decode still silently ignores non base64 chars, but the documentation (for 2.7.3) still states that while calling b64decode "A TypeError is raised if (...) or if there are non-alphabet characters present in the string.". http://docs.python.org/2.7/library/base64.html

msg224090 - (view)

Author: Julian Berman (Julian) *

Date: 2014-07-26 22:44

Created to address not having fixed Py2 here.

History

Date

User

Action

Args

2022-04-11 14:56:16

admin

set

github: 43175

2014-07-26 22:44:38

Julian

set

nosy: + Julian
messages: +

2012-11-26 12:30:22

ltaczuk

set

nosy: + ltaczuk
messages: +

2010-11-11 20:12:52

r.david.murray

set

status: open -> closed

nosy: - BreamoreBoy
messages: +

resolution: accepted
stage: patch review -> resolved

2010-11-08 21:29:50

r.david.murray

set

files: + binascii-validate.patch

messages: +

2010-09-11 01:35:02

r.david.murray

set

messages: +

2010-09-10 22:08:01

Red HamsterX

set

messages: +

2010-09-06 02:29:49

r.david.murray

set

assignee: r.david.murray
messages: +
versions: - Python 3.1, Python 2.7

2010-08-22 16:57:23

gvanrossum

set

nosy: - gvanrossum

2010-08-22 09:05:54

BreamoreBoy

set

versions: + Python 3.1
nosy: + BreamoreBoy

messages: +

stage: test needed -> patch review

2010-02-16 05:50:30

eric.araujo

set

nosy: + eric.araujo

2009-07-29 22:25:28

rhettinger

set

messages: -

2009-07-29 19:38:23

Red HamsterX

set

files: + 1466065[2.7-complete].diff

messages: +

2009-07-29 19:36:54

Red HamsterX

set

files: + 1466065[3.2-complete].diff

messages: +

2009-07-29 19:33:05

Red HamsterX

set

files: - 1466065[3.2].diff

2009-07-29 19:33:01

Red HamsterX

set

files: - 1466065[2.7].diff

2009-07-29 19:07:54

r.david.murray

set

messages: +

2009-07-29 19:05:03

pitrou

set

messages: +

2009-07-29 18:53:24

Red HamsterX

set

messages: +

2009-07-29 18:39:26

r.david.murray

set

messages: +

2009-07-29 18:28:05

rhettinger

set

messages: -

2009-07-29 18:23:55

Red HamsterX

set

files: + 1466065[3.2].diff

messages: +

2009-07-29 18:23:26

Red HamsterX

set

files: + 1466065[2.7].diff

messages: +

2009-07-29 18:14:04

Red HamsterX

set

messages: +

2009-07-29 18:04:57

r.david.murray

set

messages: +

2009-07-29 17:55:46

gvanrossum

set

messages: +

2009-07-29 17:40:17

Red HamsterX

set

messages: +

2009-07-29 17:35:27

pitrou

set

nosy: + gvanrossum, pitrou

messages: +
versions: + Python 2.7, Python 3.2, - Python 2.6, Python 3.0

2009-07-29 17:23:20

amaury.forgeotdarc

set

messages: +

2009-07-29 17:14:34

Red HamsterX

set

files: - 1466065[3.2-pure-python].diff

2009-07-29 17:14:30

Red HamsterX

set

files: - 1466065[2.7-pure-python].diff

2009-07-29 17:14:26

Red HamsterX

set

files: - 1466065[3.2].diff

2009-07-29 17:14:21

Red HamsterX

set

files: - 1466065[2.7].diff

2009-07-29 17:05:10

r.david.murray

set

messages: +

2009-07-29 17:00:08

amaury.forgeotdarc

set

messages: +

2009-07-29 16:58:32

Red HamsterX

set

messages: +

2009-07-29 16:58:04

Red HamsterX

set

messages: +

2009-07-29 16:44:24

amaury.forgeotdarc

set

messages: +

2009-07-29 16:40:21

r.david.murray

set

nosy: + r.david.murray
messages: +

2009-07-29 16:31:00

Red HamsterX

set

messages: +

2009-07-29 13:44:01

amaury.forgeotdarc

set

nosy: + amaury.forgeotdarc
messages: +

2009-07-28 22:40:57

Red HamsterX

set

files: + 1466065[3.2-pure-python].diff

messages: +

2009-07-28 22:39:11

Red HamsterX

set

files: + 1466065[2.7-pure-python].diff

messages: +

2009-07-28 22:38:07

Red HamsterX

set

files: + 1466065[3.2].diff

messages: +

2009-07-28 22:36:56

Red HamsterX

set

files: + 1466065[2.7].diff

nosy: + ajaksu2, Red HamsterX
messages: +

keywords: + patch

2009-03-21 02:03:38

ajaksu2

set

stage: test needed
type: behavior
versions: + Python 2.6, Python 3.0

2006-04-07 01:19:53

sanxiyn

create