So long lxml, and thanks for all the fish by FiloSottile · Pull Request #329 · ytdl-org/youtube-dl
This code will fail on descriptions that contain non-ASCII characters, such as this video whose description is mostly in Japanese:
rmunn@localhost:~/test$ ./youtube-dl -l -c --skip-download --write-info-json http://www.youtube.com/watch?v=9wMwplzBK-Y
[youtube] Setting language
[youtube] 9wMwplzBK-Y: Downloading video webpage
[youtube] 9wMwplzBK-Y: Downloading video info webpage
[youtube] 9wMwplzBK-Y: Extracting video information
Traceback (most recent call last):
  File "./youtube-dl", line 4760, in <module>
    main()
  File "./youtube-dl", line 4751, in main
    _real_main()
  File "./youtube-dl", line 4735, in _real_main
    retcode = fd.download(all_urls)
  File "./youtube-dl", line 964, in download
    ie.extract(url)
  File "./youtube-dl", line 1230, in extract
    return self._real_extract(url)
  File "./youtube-dl", line 1496, in _real_extract
    video_description = get_element_by_id("eow-description", video_webpage)
  File "./youtube-dl", line 256, in get_element_by_id
    parser.loads(html)
  File "./youtube-dl", line 210, in loads
    self.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 393, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
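The traceback suggests the page reaches `feed()` as a byte string: when an attribute value contains both an entity and raw non-ASCII bytes, Python 2's `unescape` mixes unicode replacements into the byte string and triggers an implicit ASCII decode. One possible way around it, sketched below, is to decode the page to text before parsing. The names `AttrCollector`, `to_text`, and `collect_attrs` are hypothetical and not part of this PR, and the UTF-8 default is an assumption about the page encoding.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class AttrCollector(HTMLParser):
    """Hypothetical stand-in for the PR's parser: records (tag, attrs) pairs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs)))


def to_text(html, encoding="utf-8"):
    """Decode a byte string to text; pass text through unchanged.

    The UTF-8 default is an assumption, not something the traceback confirms.
    """
    if isinstance(html, bytes):
        return html.decode(encoding, "replace")
    return html


def collect_attrs(html):
    """Feed the parser text only, so entity unescaping never mixes bytes and unicode."""
    parser = AttrCollector()
    parser.feed(to_text(html))
    parser.close()
    return parser.seen


if __name__ == "__main__":
    # Attribute value with an entity plus non-ASCII characters, as raw UTF-8 bytes.
    raw = '<p id="eow-description" title="\u65e5\u672c\u8a9e &amp; more">x</p>'.encode("utf-8")
    print(collect_attrs(raw))
```

Decoding at the boundary keeps the parser itself unchanged; the same idea would apply wherever `video_webpage` is read, provided the real encoding is known there.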