Issue 17915: Encoding error with sax and codecs

Created on 2013-05-06 11:14 by sconseil, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name | Uploaded | Description
report.txt | sconseil, 2013-05-06 11:14 | Minimal example to reproduce the issue
test_codecs.py | vstinner, 2013-05-06 21:51 |
XMLGenerator_codecs_stream.patch | serhiy.storchaka, 2013-05-07 13:43 |
Messages (12)
msg188508 - (view) Author: Simon Conseil (sconseil) * Date: 2013-05-06 11:14
There is an encoding issue between codecs.open and sax (see attached file). The issue is reproducible on Python 3.3.1; it works fine on Python 3.3.0.
msg188587 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-06 20:31
Since this is a regression, setting (temporarily perhaps) as release blocker.
msg188599 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-05-06 21:48
It looks like a regression introduced by the fix for issue #1470548, changeset 66f92f76b2ce.
msg188600 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-05-06 21:51
Extracted test from report.txt. Test with Python 3.4:

$ ./python test_codecs.py
Traceback (most recent call last):
  File "test_codecs.py", line 7, in <module>
    xml.startDocument()
  File "/home/haypo/prog/python/default/Lib/xml/sax/saxutils.py", line 148, in startDocument
    self._encoding)
  File "/home/haypo/prog/python/default/Lib/codecs.py", line 699, in write
    return self.writer.write(data)
  File "/home/haypo/prog/python/default/Lib/codecs.py", line 355, in write
    data, consumed = self.encode(object, self.errors)
TypeError: Can't convert 'bytes' object to str implicitly

_gettextwriter() of xml.sax.saxutils does not recognize codecs classes. (See also the PEP 400 :-)).
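The point about codecs classes can be illustrated with a small sketch. It assumes a codecs.StreamWriter built with codecs.getwriter() as a stand-in for the object returned by codecs.open() (which is a codecs.StreamReaderWriter); the variable names are illustrative only. It shows that such a stream is not part of the io text hierarchy, accepts str, and rejects bytes.

```python
import codecs
import io

raw = io.BytesIO()
# codecs.getwriter() returns a StreamWriter class; calling it wraps a binary stream.
writer = codecs.getwriter("utf-8")(raw)

# A codecs stream is not an io.TextIOBase, so an isinstance() check
# against the io hierarchy alone does not identify it as a text stream.
is_text_io = isinstance(writer, io.TextIOBase)
is_codecs_writer = isinstance(writer, codecs.StreamWriter)

writer.write("text")  # str is accepted and encoded by the stream
try:
    writer.write(b"bytes")  # bytes cannot be encoded a second time
    rejected_bytes = False
except TypeError:
    rejected_bytes = True

print(is_text_io, is_codecs_writer, rejected_bytes)
```

Code that treats every non-TextIOBase object as a binary stream would therefore hand bytes to this writer, which is consistent with the TypeError in the traceback above.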
msg188640 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-07 10:50
It is not working fine on Python 3.3.0.

>>> with codecs.open('/tmp/test.txt', 'w', encoding='iso-8859-1') as f:
...     xml = XMLGenerator(f, encoding='iso-8859-1')
...     xml.startDocument()
...     xml.startElement('root', {'attr': u'\u20ac'})
...     xml.endElement('root')
...     xml.endDocument()
...
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/home/serhiy/py/cpython-3.Lib/xml/sax/saxutils.py", line 141, in startElement
    self._write(' %s=%s' % (name, quoteattr(value)))
  File "/home/serhiy/py/cpython-3.Lib/xml/sax/saxutils.py", line 96, in _write
    self._out.write(text)
  File "/home/serhiy/py/cpython-3.Lib/codecs.py", line 699, in write
    return self.writer.write(data)
  File "/home/serhiy/py/cpython-3.Lib/codecs.py", line 355, in write
    data, consumed = self.encode(object, self.errors)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 7: ordinal not in range(256)

And it shouldn't. On Python 2, XMLGenerator works only with binary files, and "works" with text files only due to the implicit str->unicode conversion. On Python 3, working with binary files was broken. Issue1470548 restores working with binary files (the only kind for which XMLGenerator can work correctly), but accepting text files was left in for backward compatibility. The problem is that there is no trustworthy method to determine whether a file-like object is binary or text. Accepting of text streams in XMLGenerator should be deprecated in future versions.
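For comparison, here is a minimal sketch of the binary-stream case described above. It assumes an interpreter that includes the Issue1470548 change and uses an in-memory io.BytesIO in place of a file on disk; the variable names are illustrative. With a binary stream, XMLGenerator does the encoding itself, so a character outside ISO-8859-1 can be emitted as a character reference instead of raising UnicodeEncodeError.

```python
import io
from xml.sax.saxutils import XMLGenerator

buf = io.BytesIO()  # binary stream: XMLGenerator is responsible for encoding
xml = XMLGenerator(buf, encoding="iso-8859-1")
xml.startDocument()
# U+20AC (euro sign) has no ISO-8859-1 representation.
xml.startElement("root", {"attr": "\u20ac"})
xml.endElement("root")
xml.endDocument()

data = buf.getvalue()
print(data)
```

On the CPython versions this was checked against mentally, the euro sign comes out as the character reference &#8364; rather than as an error, which is what makes the binary path the one that can work correctly for arbitrary encodings.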
msg188642 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-05-07 12:06
> Accepting of text streams in XMLGenerator should be deprecated in future versions.

I agree that the following pattern is strange:

with codecs.open('/tmp/test.txt', 'w', encoding='iso-8859-1') as f:
    xml = XMLGenerator(f, encoding='iso-8859-1')

Why would I specify a codec twice? What happens if I specify two different codecs?

with codecs.open('/tmp/test.txt', 'w', encoding='utf-8') as f:
    xml = XMLGenerator(f, encoding='iso-8859-1')

It may be simpler (and safer?) to reject text files. If you cannot detect that f is a text file, just make it explicit in the documentation that f must be a binary file.
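The "two different codecs" question can be explored with a short sketch. It assumes an interpreter that includes the fix for this issue, and it substitutes a codecs.StreamWriter over io.BytesIO for the codecs.open() call so that no file is needed; the names are illustrative. In that setup the stream's codec decides the bytes, while the encoding argument only decides what the XML declaration says.

```python
import codecs
import io
from xml.sax.saxutils import XMLGenerator

raw = io.BytesIO()
stream = codecs.getwriter("utf-8")(raw)            # the stream encodes as UTF-8
xml = XMLGenerator(stream, encoding="iso-8859-1")  # the declaration claims ISO-8859-1
xml.startDocument()
xml.startElement("root", {})
xml.characters("\u00e9")  # e with acute accent
xml.endElement("root")
xml.endDocument()

data = raw.getvalue()
print(data)
```

The result is a document whose declaration and actual byte encoding disagree, which supports the argument that passing a binary stream and a single encoding is the safer pattern.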
msg188650 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-07 13:43
Here is a patch which adds explicit checks for codecs stream writers and adds tests for these cases. The tests are not entirely honest: they only check that XMLGenerator works with some specially prepared streams. XMLGenerator doesn't work with a stream that has an arbitrary encoding and error handler.
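The idea of an explicit check can be sketched roughly as follows. This is not the attached patch: pick_text_writer is a hypothetical helper written for illustration, and the choice of the xmlcharrefreplace error handler for wrapped binary streams is an assumption of this sketch.

```python
import codecs
import io

def pick_text_writer(out, encoding):
    """Hypothetical helper: return an object whose write() accepts str."""
    if isinstance(out, io.TextIOBase):
        return out  # io text stream: use as is
    if isinstance(out, (codecs.StreamWriter, codecs.StreamReaderWriter)):
        return out  # codecs stream: also expects str, use as is
    # Anything else is assumed to be a binary stream and gets wrapped.
    return io.TextIOWrapper(out, encoding=encoding,
                            errors="xmlcharrefreplace",
                            newline="\n", write_through=True)

# A codecs stream is passed through unchanged.
codecs_stream = codecs.getwriter("utf-8")(io.BytesIO())
same = pick_text_writer(codecs_stream, "utf-8") is codecs_stream

# A binary stream is wrapped, and unencodable characters become references.
binary = io.BytesIO()
wrapped = pick_text_writer(binary, "iso-8859-1")
wrapped.write("\u20ac")
print(same, binary.getvalue())
```

As the message notes, such a check only recognizes known stream classes; an arbitrary file-like object with only a write() method still cannot be classified reliably as text or binary.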
msg188654 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-07 13:48
Of course, if this patch is committed, it may be worth applying it to 3.2 as well, which has the same regression.
msg188657 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-05-07 13:57
Perhaps we should add a deprecation warning for codecs streams right in this patch?
msg189003 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-05-12 10:32
New changeset 1c01571ce0f4 by Georg Brandl in branch '3.2': Issue #17915: Fix interoperability of xml.sax with file objects returned by http://hg.python.org/cpython/rev/1c01571ce0f4
msg189009 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-05-12 10:45
Fixed in 3.2, 3.3 and default.
msg189063 - (view) Author: Simon Conseil (sconseil) * Date: 2013-05-12 21:19
thanks everybody !
History
Date User Action Args
2022-04-11 14:57:45 admin set github: 62115
2013-05-12 21:19:48 sconseil set messages: +
2013-05-12 10:45:59 georg.brandl set status: open -> closed; resolution: fixed; messages: +
2013-05-12 10:32:42 python-dev set nosy: + python-dev; messages: +
2013-05-07 13:57:03 serhiy.storchaka set messages: +
2013-05-07 13:48:21 serhiy.storchaka set stage: needs patch -> patch review; messages: +; components: + XML; versions: + Python 3.2
2013-05-07 13:43:48 serhiy.storchaka set files: + XMLGenerator_codecs_stream.patch; keywords: + patch; messages: +
2013-05-07 12:06:06 vstinner set messages: +
2013-05-07 10:50:38 serhiy.storchaka set messages: +
2013-05-06 21:51:08 vstinner set files: + test_codecs.py; messages: +
2013-05-06 21:48:19 vstinner set messages: +
2013-05-06 20:31:35 pitrou set priority: normal -> release blocker; nosy: + larry, pitrou, georg.brandl; messages: +; stage: needs patch
2013-05-06 20:30:39 pitrou set nosy: + vstinner, serhiy.storchaka; type: behavior; versions: + Python 3.4
2013-05-06 11:14:06 sconseil create