Issue 37071: HTMLParser mistakenly inventing new tags while parsing
I have been working with some 'difficult' HTML files generated by Sphinx from RST. The following block of text is the original RST content:
Animation Playback Options

-a <options> <file(s)>
    Playback <file(s)>, only operates this way when not running in background.
-p <sx> <sy>
    Open with lower left corner at <sx>, <sy>.
-m
    Read from disk (Do not buffer).
-f <fps> <fps-base>
    Specify FPS to start with.
-j <frame>
    Set frame step to <frame>.
-s <frame>
    Play from <frame>.
-e <frame>
    Play until <frame>.
This is the HTML block that is generated by Sphinx:
I then use BeautifulSoup, which uses HTMLParser, to prettify and parse the HTML document. I've noticed that every occurrence of data that starts with "<" and ends with ">", for example:
...is misinterpreted by the HTMLParser library as a tag, and it then invents a closing tag for it,
i.e.
<literal>
<options>
</options>
</literal>
and
<literal>
<file(s)>
</file(s)>
</literal>
When the process is reversed, i.e. when the HTML is turned back into plain text, the original data is dropped, which leads to truncation and loss of data.
Here is the prettified output; the problem lines are marked with '#**************************' to make them easier to identify.
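The loss described above can be reproduced with html.parser alone. The following is a minimal sketch, not the reporter's actual pipeline: the class name TextExtractor, the helper extract_text, and the two sample strings are made up for illustration. It does not claim to show what Sphinx emitted. It only contrasts a literal "<sx>" in the markup with the entity-escaped form "&lt;sx&gt;".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of a document."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        # Anything the parser classifies as a tag never reaches this method.
        self.chunks.append(data)


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


# Sample strings are assumptions, modelled on the "-p <sx> <sy>" entry.
raw = "<p>Open with lower left corner at <sx>, <sy>.</p>"
escaped = "<p>Open with lower left corner at &lt;sx&gt;, &lt;sy&gt;.</p>"

print(repr(extract_text(raw)))
print(repr(extract_text(escaped)))
```

In the first case the placeholders are parsed as start tags and vanish from the extracted text. In the second case they arrive in handle_data as ordinary characters.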
Here is the URL of the HTML file:
https://docs.blender.org/manual/en/dev/advanced/command_line/arguments.html
Kind Regards, Hoang Tran
Please verify with 3.7.3+ and the latest version of Sphinx. Even if there is a problem, Sphinx is not a stdlib package. The problem would only be relevant to this tracker, rather than the Sphinx tracker, if it were due to our customizations or use of Sphinx.
Thank you for the report.
Looking at the BeautifulSoup source, there is a comment about this scenario:
# Unlike other parsers, html.parser doesn't send separate end tag
# events for empty-element tags. (It's handled in
# handle_startendtag, but only if the original markup looked like
# <tag/>.)
#
# So we need to call handle_endtag() ourselves. Since we
# know the start event is identical to the end event, we
# don't want handle_endtag() to cross off any previous end
# events for tags of this name.
HTMLParser itself produces output such as:
>>> from html.parser import HTMLParser
>>> class MyParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print(f'start: {tag}')
...     def handle_endtag(self, tag):
...         print(f'end: {tag}')
...     def handle_data(self, data):
...         print(f'data: {data}')
...
>>> parser = MyParser()
>>> parser.feed('<p><test></p>')
start: p
start: test
end: p
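For comparison with the BeautifulSoup comment, here is a small sketch of how the event stream differs between "<test>" and the self-closing form "<test/>". The names EventRecorder and record are hypothetical, and the sketch relies on the default handle_startendtag, which calls handle_starttag followed by handle_endtag.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records parser callbacks as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def record(markup):
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# No end event is produced for <test>; the parser never invents one.
print(record("<p><test></p>"))
# The self-closing form goes through handle_startendtag, which by default
# reports a start event immediately followed by an end event.
print(record("<p><test/></p>"))
```

This suggests that the closing tag in the prettified output is added on the BeautifulSoup side, when it builds its tree, and not by HTMLParser itself.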
My suggestion would be to try a different parser in BeautifulSoup [1] to handle this. Even if we wanted to modify HTMLParser, any such change would probably be backwards incompatible.
[1] https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser