Issue 14251: HTMLParser decode issue

Created on 2012-03-11 02:23 by rednaks, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (7)

msg155366

Author: rednaks (rednaks)

Date: 2012-03-11 02:23

Hello! While parsing some HTML code I got a decode error:

  File "/usr/lib/python2.7/HTMLParser.py", line 111, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 155, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 260, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 410, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1: ordinal not in range(128)

This issue can be fixed by replacing the last string argument with s.decode(), as in the attached diff file. I also tried running my script under Python 3.2, and it does not parse anything.
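The traceback ends in an implicit ASCII decode of a byte string. As a rough illustration in Python 3 terms, where the decode has to be written out explicitly, the sketch below reproduces the same kind of failure. The sample bytes and the cp1252 encoding are assumptions, chosen only so that byte 0x97 sits at position 1; the original page and its encoding are not given in this report.

```python
# Hypothetical attribute value: an ASCII letter followed by byte 0x97.
# In Windows-1252 (an assumed encoding), 0x97 is an em dash.
raw_value = b"a\x97b"


def try_ascii_decode(data):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode("ascii"), None
    except UnicodeDecodeError as exc:
        return None, exc


text, error = try_ascii_decode(raw_value)
if error is not None:
    # error.start is the index of the first byte that could not be decoded.
    print("failed at position", error.start, "on byte", hex(raw_value[error.start]))

# Decoding with the encoding the bytes were written in succeeds.
decoded = raw_value.decode("cp1252")
print(repr(decoded))
```

The point of the sketch is only that the ASCII codec rejects any byte above 0x7F, while a decode with the matching encoding works.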

msg155367

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2012-03-11 02:32

Can you provide a minimal example to reproduce this error?

On Python 2 it's always better to decode the HTML first and then pass unicode to the parser. Even though the Python 2 parser accepts byte strings too, there are a few corner cases where it fails.

On Python 3 the parser only accepts unicode, and it should work fine with it (especially if you have an updated clone of cpython). Can you show what failure you get with Python 3? Also, can you reproduce the error if you use strict=False?
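A minimal sketch of the "decode first, then parse" advice, written for Python 3's html.parser. The AttrCollector class, the sample markup, and the choice of UTF-8 are all assumptions made for illustration; a real page should be decoded with whatever encoding it actually declares.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record every start tag with its attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; entities in the values
        # have already been unescaped by the parser.
        self.seen.append((tag, attrs))


# Stand-in for bytes read from a file or socket; UTF-8 is an assumption here.
raw_page = '<a title="caf\u00e9 &amp; bar" href="/x">link</a>'.encode("utf-8")

# Decode first, then hand text (not bytes) to the parser.
page = raw_page.decode("utf-8")

parser = AttrCollector()
parser.feed(page)
parser.close()
print(parser.seen)
```

Because the parser only ever sees text, the non-ASCII character and the entity in the title attribute are handled without any implicit decoding.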

msg155368

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2012-03-11 02:35

See also #3932.

msg155400

Author: rednaks (rednaks)

Date: 2012-03-11 18:12

So we can't decode by default?! Concerning Python 3, it seems that it's not reading tags and attributes. I didn't get any error, but I don't get any result either.

The example I used is here: http://docs.python.org/library/htmlparser.html#module-HTMLParser

Of course, I replaced HTMLParser with html.parser.

msg155403

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2012-03-11 18:32

I don't think the patch can be applied as is: for it to work, s would have to be an ASCII-only str. I will look at this again as soon as I have some time and see if something can be done.

FTR the Python 3 doc for html.parser can be found here: http://docs.python.org/py3k/library/html.parser.html#example-html-parser-application
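To illustrate why s has to be ASCII-only for a bare decode to work, here is a small sketch in Python 3 terms. In Python 2, str.decode() with no arguments uses the default ASCII codec; Python 3's bytes.decode() defaults to UTF-8 instead, so the sketch passes "ascii" explicitly. The helper names and sample values are hypothetical and are not taken from the attached patch.

```python
import html


def naive_unescape(attr_bytes):
    """Mimic a bare decode: ASCII decode, then unescape entities."""
    return html.unescape(attr_bytes.decode("ascii"))


def decode_then_unescape(attr_bytes, encoding):
    """Decode with the data's real encoding first, then unescape entities."""
    return html.unescape(attr_bytes.decode(encoding))


ascii_only = b"a &amp; b"
non_ascii = "caf\u00e9 &amp; bar".encode("utf-8")  # UTF-8 is an assumption

# An ASCII-only value goes through the naive path without trouble.
print(naive_unescape(ascii_only))

# A value with non-ASCII bytes makes the naive path raise.
try:
    naive_unescape(non_ascii)
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True
print("naive decode failed:", naive_failed)

# Decoding with the right encoding handles both cases.
print(decode_then_unescape(non_ascii, "utf-8"))
```

The naive helper only moves the failure from re.sub to the decode call, which is why decoding the whole page up front with its real encoding is the more robust route.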

msg155412

Author: rednaks (rednaks)

Date: 2012-03-11 21:53

Thank you for giving me a little of your time!

Yes, that's what I tested: I used the html.parser module and I got no result!

msg155533

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2012-03-13 00:11

I tested this again, and indeed a bare s.decode() is not enough to fix the problem. The attribute might contain non-ASCII characters, and that will result in an error (see, for example, the "test.py" script attached to #3932). The correct solution is to decode the page before passing it to the parser.

History

Date                 User          Action  Args
2022-04-11 14:57:27  admin         set     github: 58459
2012-03-13 00:11:30  ezio.melotti  set     status: open -> closed
                                           versions: - Python 3.2
                                           superseder: HTMLParser cannot handle '&' and non-ascii characters in attribute names
                                           messages: +
                                           resolution: duplicate
                                           stage: resolved
2012-03-11 21:53:43  rednaks       set     messages: +
2012-03-11 18:32:53  ezio.melotti  set     messages: +
2012-03-11 18:12:26  rednaks       set     messages: +
2012-03-11 10:21:55  eric.araujo   set     nosy: + eric.araujo
                                           title: [PATCH]HTMLParser decode issue -> HTMLParser decode issue
2012-03-11 02:35:16  ezio.melotti  set     messages: +
2012-03-11 02:32:20  ezio.melotti  set     nosy: + ezio.melotti
                                           messages: +
                                           assignee: ezio.melotti
                                           type: crash -> behavior
2012-03-11 02:23:14  rednaks       create