msg235688 - (view) |
Author: Myroslav Opyr (Myroslav.Opyr) * |
Date: 2015-02-10 14:54 |
cgi.FieldStorage has problems parsing multipart/form-data requests that contain file fields with non-Latin filenames. It drops the filename parameter when it is formatted according to RFC 6266 [1] (most modern browsers format it that way). There is already a Python implementation of that RFC in the rfc6266 module [2]. Ref: [1] https://tools.ietf.org/html/rfc6266 [2] https://pypi.python.org/pypi/rfc6266 |
|
|
msg235732 - (view) |
Author: Myroslav Opyr (Myroslav.Opyr) * |
Date: 2015-02-11 08:53 |
In test_cgi.py-v2.7.5-rfc6266_filename.patch there is a patch to test_cgi.py (Python 2.7.5) that reveals the issue. |
|
|
msg235734 - (view) |
Author: Myroslav Opyr (Myroslav.Opyr) * |
Date: 2015-02-11 10:32 |
As a proof of concept, here is a fix for the issue powered by the rfc6266 library [1]. See cgi.py-v2.7.5-rfc6266_filename.patch. References: [1] https://pypi.python.org/pypi/rfc6266 |
|
|
msg235909 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2015-02-13 18:39 |
Since that library is not part of the stdlib, this is not an appropriate patch for CPython. Note that this issue is also relevant to the email library, which intends to support RFC2616 header parsing/generation, and therefore should also be enhanced to support RFC 6266. |
|
|
msg236101 - (view) |
Author: Myroslav Opyr (Myroslav.Opyr) * |
Date: 2015-02-16 12:52 |
Hi David, According to the "Test Cases for HTTP Content-Disposition header field" overview [1], this is not about email headers, but only about HTTP headers. It looks like the email standards and the HTTP standards differ in this area. I do know that my patch is poor. It is just a proof of concept, to show that there is an issue in the stdlib and to show one possible quick patch that provides the needed functionality. Regards, Myroslav Ref: [1] http://greenbytes.de/tech/tc2231/ |
|
|
msg236365 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2015-02-21 14:47 |
I know it is called the 'email' package, but the intent is to support http header parsing as well (cf email.policy.HTTP). |
|
|
msg314265 - (view) |
Author: Paweł (pawciobiel) * |
Date: 2018-03-22 15:22 |
I didn't find this issue and created a duplicate: https://bugs.python.org/issue33027. I've added similar, updated changes in https://github.com/python/cpython/pull/6027. @r.david.murray wouldn't it be wise to do one step at a time rather than implementing full support for RFC 6266? Please tell me exactly what your expectations are so I can fix the patch if it needs to be fixed. This is also related to RFC 5987: https://tools.ietf.org/html/rfc5987 and https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition |
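For context, the `filename*` form discussed here uses the RFC 5987 extended value syntax: a charset, an optional language tag, and a percent-encoded value, separated by single quotes. A minimal sketch of decoding such a value with the standard library follows. `decode_ext_value` is a hypothetical helper name, not something taken from the PR or the stdlib, and it skips the validation a real parser would need.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value):
    """Decode an RFC 5987 ext-value such as UTF-8''%e2%82%ac%20rates.

    Hypothetical helper for illustration only: the charset and the
    allowed characters are not validated.
    """
    # ext-value = charset "'" [ language ] "'" value-chars
    charset, _language, encoded = ext_value.split("'", 2)
    # Percent-decoding and charset decoding happen in one step.
    return unquote(encoded, encoding=charset, errors="strict")


# The two example values given in RFC 5987.
euro_rates = decode_ext_value("UTF-8''%e2%82%ac%20rates")
pound_rates = decode_ext_value("iso-8859-1'en'%A3%20rates")
```

This is roughly the "split on the single quote" approach; it does not cover quoting, multiple parameters, or malformed input.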
|
|
msg314297 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2018-03-23 02:26 |
I haven't read the http rfcs, but my understanding is that they follow the MIME standards, and the email library already has code to do proper parsing and decoding of encoded filenames in Content-Disposition headers. It should be possible to call that code for this use case (the http libraries already depend on the email libraries, although I'm not sure if cgi itself does currently). There may be additional considerations involved in fully supporting the http RFCs, but to determine that someone will need to read both and understand them, which is not a small undertaking :) In the meantime, I'm pretty sure that using the existing mime header parsing code in the email library (see email.headerregistry) will provide better parsing than the only-handles-simple-cases heuristic in your PR. Granted, I don't think you have to deal with multi-part headers in http, but I vaguely remember that there are other subtleties not handled by a simple split on '. |
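As a rough illustration of the suggestion above, the sketch below feeds the headers of one form-data part to the email parser with `email.policy.HTTP` and reads the decoded parameters from the resulting header object. The header text is made-up example data, and whether this decoding is appropriate for multipart/form-data is exactly what the thread is debating, so treat it as a demonstration of what the email package does, not as a proposed fix.

```python
from email import message_from_string
from email.policy import HTTP

# Headers of one hypothetical multipart/form-data part (example data).
raw_part_headers = (
    "Content-Disposition: form-data; name=\"upload\"; "
    "filename*=UTF-8''%E2%82%AC%20rates.txt\r\n"
    "\r\n"
)

# With the HTTP policy, headers are parsed by email.headerregistry.
msg = message_from_string(raw_part_headers, policy=HTTP)
disposition = msg["Content-Disposition"]

disposition_type = disposition.content_disposition  # the bare disposition token
field_name = disposition.params["name"]
# The RFC 2231 style "filename*" parameter is exposed under the plain
# "filename" key, already percent-decoded and charset-decoded.
filename = disposition.params["filename"]
```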
|
|
msg359224 - (view) |
Author: And Clover (aclover) * |
Date: 2020-01-02 23:23 |
HTTP generally isn't an RFC 822-family standard. Its headers look a lot like it, but they have their own defined syntax that differs in niggling little details. Using mail parsing code for HTTP isn't usually the right thing.

HTTP has always used its own syntax definitions for the headers on the main request/response entities, but it has traditionally partially deferred to RFC 822-family specs for the definitions of structured entity bodies. This is moot, however, as the reality of what browsers support has rarely coincided with those specs.

Nowadays HTML5.2 explicitly defers to RFC 7578 for definition of multipart/form-data headers. (This RFC is a replacement for the vague and broken RFC 2388.) As is to be expected for an HTML5-related spec, RFC 7578 shrugs and documents existing browser behaviour [section 4.2]:

- some browsers do UTF-8
- some browsers do data mangling (IE's %-encoding sadness)
- some browsers might do something else

but it explicitly rules out the solution proposed here: "The encoding method described in [RFC5987], which would add a 'filename*' parameter to the Content-Disposition header field, MUST NOT be used."

The introductions of both RFC 5987 and RFC 6266 explicitly exclude multipart/form-data headers from their remit.

So in summary:

- we shouldn't do anything
- the situation with submitted filenames will continue to be broken for everyone indefinitely |
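The first behaviour listed in the message above (raw UTF-8 inside the ordinary `filename` parameter) involves no `filename*` parameter at all. A rough sketch of reading such a header, assuming the part headers arrive as bytes and the form was submitted as UTF-8: `form_data_filename` and its regular expression are hypothetical and deliberately simplistic (no handling of backslash escapes or browser %-encoding).

```python
import re

# Matches only the plain filename="..." parameter, never filename*=...
_FILENAME_RE = re.compile(r';\s*filename="([^"]*)"')


def form_data_filename(header_bytes, charset="utf-8"):
    """Return the filename from a form-data Content-Disposition line.

    Hypothetical helper: decodes the raw header bytes with the assumed
    form charset and extracts the quoted filename, or returns None.
    """
    text = header_bytes.decode(charset, errors="replace")
    match = _FILENAME_RE.search(text)
    return match.group(1) if match else None


# Example data: a header as a UTF-8 browser might send it.
header = 'Content-Disposition: form-data; name="upload"; filename="€ rates.txt"'
upload_name = form_data_filename(header.encode("utf-8"))
```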
|
|
msg359533 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2020-01-07 18:40 |
Are you saying there is no (http) RFC compliant way to fix this, or no way to fix it with the email library parsers? If the latter, the library is pretty flexible and for internal stdlib use it would probably be permissible to directly call methods in the internal parsing module, if those would be useful. I haven't re-read the issue to reload my brain, so this question may be off point (except for the first clause of the question). |
|
|
msg359577 - (view) |
Author: And Clover (aclover) * |
Date: 2020-01-08 10:45 |
> Are you saying there is no (http) RFC compliant way to fix this

Sadly, yes. And though RFCs aren't always a fair representation of real-world use, RFC 7578 is informative as well as normative: at present nothing produces "filename*=" in multipart/form-data. |
|
|