msg31175 - (view) |
Author: John Nagle (nagle) |
Date: 2007-02-04 22:34 |
I'm running a website page through BeautifulSoup. It parses OK with Python 2.4, but Python 2.5 fails with an exception:

    Traceback (most recent call last):
      File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
        self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
      File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
      File "./sitetruth/BeautifulSoup.py", line 973, in __init__
        self._feed()
      File "./sitetruth/BeautifulSoup.py", line 998, in _feed
        SGMLParser.feed(self, markup or "")
      File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
        self.goahead(0)
      File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
        self.finish_starttag(tag, attrs)
      File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
        self.handle_starttag(tag, method, attrs)
      File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
        method(attrs)
      File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
        self._feed(self.declaredHTMLEncoding)
      File "./sitetruth/BeautifulSoup.py", line 998, in _feed
        SGMLParser.feed(self, markup or "")
      File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
        self.goahead(0)
      File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
        k = self.parse_starttag(i)
      File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
        self._convert_ref, attrvalue)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)

The code that's failing is in "_convert_ref", which is new in Python 2.5. That function wasn't present in 2.4. I think the code is trying to handle single quotes inside of double quotes in HTML attributes, or something like that.

To replicate, run http://www.bankofamerica.com or http://www.gm.com through BeautifulSoup. Something about this code doesn't like big companies. Web sites of smaller companies are going through OK. |
|
|
msg31176 - (view) |
Author: wrstl prmpft (wrstlprmpft) |
Date: 2007-02-05 07:16 |
I had a similar problem recently and did not have time to file a bug report. Thanks for doing that.

The problem is the code that handles entity and character references in SGMLParser.parse_starttag. It seems that it is not careful about unicode/str issues. (But maybe BeautifulSoup needs to tell it to?)

My quick'n'dirty workaround was to remove the offending char-entity from the website before feeding it to BeautifulSoup::

    text = text.replace('&#174;', '')  # remove rights reserved sign entity

cheers, stefan |
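A slightly more general form of that workaround can be sketched as follows. This is only an illustration, not code from BeautifulSoup or sgmllib: the helper name `strip_high_charrefs` and the sample markup are made up for the example, and it assumes that only decimal character references in the range 128 to 255 trigger the failure.

```python
import re

# Decimal character references such as "&#174;"; the trailing ";" is
# treated as optional because real-world HTML often omits it.
_CHARREF = re.compile(r"&#(\d+);?")


def strip_high_charrefs(text):
    """Remove decimal character references whose value is in 128..255.

    Hypothetical helper: references outside that range are left untouched.
    """
    def _repl(match):
        n = int(match.group(1))
        return "" if 128 <= n <= 255 else match.group(0)

    return _CHARREF.sub(_repl, text)


# Made-up sample markup containing two high references and one ASCII one.
sample = '<a title="Acme&#174; &#167;1 &#38; more">x</a>'
cleaned = strip_high_charrefs(sample)
print(cleaned)
```

The cleaned text would then be passed to the parser in place of the original page. Like the one-line version, this discards the characters rather than preserving them, so it is a stopgap only.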
|
|
msg31177 - (view) |
Author: John Nagle (nagle) |
Date: 2007-02-07 07:57 |
Found the problem. In sgmllib.py for Python 2.5, in convert_charref, the code for handling character escapes assumes that ASCII characters have values up to 255. But the correct limit is 127, of course.

If a Unicode string is run through SGMLParser, and that string has a character reference in an attribute with a value between 128 and 255, which is valid in Unicode, the value is passed through as a character with "chr", creating a one-character invalid ASCII string. Then, when the bad string is later converted to Unicode as the output is assembled, the UnicodeDecodeError exception is raised.

So the fix is to change 255 to 127 in convert_charref in sgmllib.py, as shown below. This forces characters above 127 to be expressed with escape sequences. Please patch accordingly. Thanks.

    def convert_charref(self, name):
        """Convert character reference, may be overridden."""
        try:
            n = int(name)
        except ValueError:
            return
        if not 0 <= n <= 127:  # ASCII ends at 127, not 255
            return
        return self.convert_codepoint(n) |
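The mechanism described above can be imitated in Python 3, where sgmllib no longer exists. The sketch below is an emulation under stated assumptions, not the library code: `py2_style_chr`, `join_with_unicode`, and `convert_charref_patched` are hypothetical stand-ins, and the explicit `.decode("ascii")` plays the role of the implicit str-to-unicode coercion that Python 2 performs when the two types are mixed.

```python
def py2_style_chr(n):
    """Stand-in for Python 2's chr(): returns a one-byte string."""
    return bytes([n])


def join_with_unicode(prefix, byte_str):
    """Emulate Python 2 mixing unicode and str.

    Python 2 decodes the byte string as ASCII implicitly; here the
    decode is explicit so the same failure is visible in Python 3.
    """
    return prefix + byte_str.decode("ascii")


def convert_charref_patched(name):
    """Mirror of the proposed fix: convert only references in 0..127."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= 127:  # ASCII ends at 127, not 255
        return None
    return py2_style_chr(n)


# Unpatched behaviour: byte 0xa7 cannot be decoded as ASCII.
try:
    join_with_unicode("section ", py2_style_chr(0xA7))
except UnicodeDecodeError as exc:
    print("failed:", exc)

# Patched behaviour: the high reference is declined, ASCII still converts.
print(convert_charref_patched("167"))
print(join_with_unicode("a ", convert_charref_patched("38")))
```

With the limit at 127, the converter declines the reference instead of producing an undecodable byte, which matches the statement that such characters stay expressed as escape sequences.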
|
|
msg31178 - (view) |
Author: John Nagle (nagle) |
Date: 2007-04-27 21:41 |
We've been running this fix for several months now, and it seems to work. Would someone please check it and put it into the trunk? Thanks. |
|
|
msg31179 - (view) |
Author: Olivier Dormond (odormond) |
Date: 2007-06-06 16:38 |
Hello,

I've been able to fix this entity conversion bug with the following patch.

Cheers,
Odie

    --- /usr/lib/python2.5/sgmllib.py 2007-05-27 17:55:15.000000000 +0200
    +++ modules/sgmllib.py 2007-06-06 18:29:13.000000000 +0200
    @@ -396,7 +396,7 @@
             return self.convert_codepoint(n)
     
         def convert_codepoint(self, codepoint):
    -        return chr(codepoint)
    +        return unichr(codepoint)
     
         def handle_charref(self, name):
             """Handle character reference, no need to override.""" |
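For comparison with the 255 -> 127 change, here is a rough Python 3 illustration of what this patch aims for. It is a sketch, not the sgmllib method: `convert_codepoint_unicode` is a hypothetical free function, and it relies on the fact that Python 3's `chr()` returns a Unicode string, as Python 2's `unichr()` does.

```python
def convert_codepoint_unicode(codepoint):
    """Python 3 analogue of the patched method.

    chr() here returns a Unicode string, like unichr() in Python 2,
    so the result can be joined with other text without an ASCII decode.
    """
    return chr(codepoint)


# 0xA7 is the byte reported in the original traceback (the section sign).
attr = "Terms " + convert_codepoint_unicode(0xA7) + "1"
print(attr)
```

The difference in approach is that this variant keeps the character in the output, while the 127 limit leaves the reference unconverted. The thread does not settle how the unichr variant behaves when the parser is fed byte strings rather than Unicode.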
|
|
msg57014 - (view) |
Author: Georg Brandl (georg.brandl) *  |
Date: 2007-11-01 17:15 |
Restore bug title. |
|
|
msg57022 - (view) |
Author: Simon (bind) |
Date: 2007-11-01 17:55 |
The 255 -> 127 change works for me. Let me know if I can help with unit tests or whatever to get this patched. |
|
|
msg84648 - (view) |
Author: Daniel Diniz (ajaksu2) *  |
Date: 2009-03-30 21:06 |
A patch against SVN trunk including a unittest would be great. |
|
|
msg84899 - (view) |
Author: Daniel Darabos (cyhawk) |
Date: 2009-03-31 21:00 |
Attached patch against SVN trunk including unittest. The test is not great, because it practically only checks if the patch was applied and not the real-life situation where the exception occurs, but I'm not too handy with sgmllib (I only encountered this problem through BeautifulSoup). |
|
|
msg84934 - (view) |
Author: Georg Brandl (georg.brandl) *  |
Date: 2009-03-31 22:12 |
Committed in r70906. |
|
|