So long lxml, and thanks for all the fish by FiloSottile · Pull Request #329 · ytdl-org/youtube-dl

This code will fail on descriptions that contain non-ASCII characters, such as this video, whose description is mostly in Japanese:

rmunn@localhost:~/test$ ./youtube-dl -l -c --skip-download --write-info-json http://www.youtube.com/watch?v=9wMwplzBK-Y
[youtube] Setting language
[youtube] 9wMwplzBK-Y: Downloading video webpage
[youtube] 9wMwplzBK-Y: Downloading video info webpage
[youtube] 9wMwplzBK-Y: Extracting video information
Traceback (most recent call last):
  File "./youtube-dl", line 4760, in <module>
    main()
  File "./youtube-dl", line 4751, in main
    _real_main()
  File "./youtube-dl", line 4735, in _real_main
    retcode = fd.download(all_urls)
  File "./youtube-dl", line 964, in download
    ie.extract(url)
  File "./youtube-dl", line 1230, in extract
    return self._real_extract(url)
  File "./youtube-dl", line 1496, in _real_extract
    video_description = get_element_by_id("eow-description", video_webpage)
  File "./youtube-dl", line 256, in get_element_by_id
    parser.loads(html)
  File "./youtube-dl", line 210, in loads
    self.feed(html)
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "/usr/lib/python2.7/HTMLParser.py", line 393, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
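
The traceback suggests that the page reaches the parser as a byte string: `unescape` substitutes a unicode character for an entity inside an attribute value, and joining that with the surrounding non-ASCII bytes triggers an implicit ASCII decode. One possible way around it is to decode the page to text before feeding it to the parser. The following is only a minimal sketch of that idea, not the fix adopted in this pull request. It is written for Python 3's `html.parser` (on Python 2.7 the module is `HTMLParser` and the type check would be against `str`), and `to_text`, `AttrCollector`, the sample markup and the UTF-8 default are all hypothetical names and assumptions made for illustration.

```python
from html.parser import HTMLParser


def to_text(page, encoding="utf-8"):
    """Return the page as text, decoding it first if it is a byte string.

    The encoding is an assumption; a real caller would take it from the
    HTTP response headers or the page's meta tags.
    """
    if isinstance(page, bytes):
        return page.decode(encoding, "replace")
    return page


class AttrCollector(HTMLParser):
    """Hypothetical stand-in parser that records every attribute value."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with entities already unescaped.
        self.values.extend(value for _, value in attrs)


# A byte string whose attribute value mixes non-ASCII text with an entity,
# which is the combination the traceback above points at.
raw = '<p id="eow-description" title="\u3042 &amp; \u3044">text</p>'.encode("utf-8")

parser = AttrCollector()
parser.feed(to_text(raw))  # decode first, then parse
parser.close()
```

Decoding once at the boundary keeps the parser working on text only, so entity replacement never has to combine bytes and unicode.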