", "lxml"))' while the old libxml2 version caused CDATA to be stripped: $ python -c 'import bs4; print(bs4.BeautifulSoup("", "l...">

Bug #1930164 “lxml backend built against libxml2 2.9.11+ does no...” : Bugs : Beautiful Soup

When lxml is built against libxml2 2.9.11+, the parser's behavior seems to change, causing bs4 output to be inconsistent with other parsers. I'm not sure whether this should be considered a feature or a bug.

For example:

$ python -c 'import bs4; print(bs4.BeautifulSoup("<![CDATA[that]]>", "lxml"))'

<![CDATA[that]]>

while the old libxml2 version caused CDATA to be stripped:

$ python -c 'import bs4; print(bs4.BeautifulSoup("<![CDATA[that]]>", "lxml"))'

This causes soupsieve's tests to fail; see https://github.com/facelessuser/soupsieve/issues/220. I am not sure whether this is something that can or should be fixed in bs4, lxml, or libxml2 itself. The parser is a bit beyond my comprehension, so I figured I'd ask here first.

The little debugging I did suggests that the CDATA section was previously not reported by the parser at all, while now it is reported as two calls to the data method: first with the content '<', and then with '![CDATA[that]]>'.
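For anyone who wants to repeat that check, here is a minimal sketch that records the callbacks lxml's HTML parser sends to a parser target. The RecordingTarget class and the trace() helper are names made up for this illustration; the feed/close pattern is assumed to mirror how bs4 drives lxml, and the events you see will depend on the libxml2 version lxml was built against.

```python
class RecordingTarget:
    """Parser target that records every callback it receives.

    Hypothetical helper for this illustration, not part of bs4 or lxml.
    """

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        self.events.append(("data", text))

    def comment(self, text):
        self.events.append(("comment", text))

    def close(self):
        # lxml returns whatever the target's close() returns.
        return self.events


def trace(markup):
    """Feed markup to lxml's HTML parser and return the recorded events."""
    from lxml import etree  # imported lazily so the class works without lxml

    parser = etree.HTMLParser(target=RecordingTarget())
    parser.feed(markup)
    return parser.close()


if __name__ == "__main__":
    try:
        for event in trace("<![CDATA[that]]>"):
            print(event)
    except ImportError:
        print("lxml is not installed")
```

Running this under both libxml2 versions and comparing the ("data", ...) entries should show whether the CDATA text reaches the target at all.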

The relevant libxml2 commit is:

commit 173a0830dcec769a5f12c5c55ef4ab424b388efb
Author: Nick Wellnhofer
Date: 2020-07-22 23:15:35 +0200

Fix quadratic runtime when push parsing HTML start tags

Make sure that htmlParseStartTag doesn't terminate on characters for
which IS_CHAR_CH is false like control chars.

In htmlParseTryOrFinish, only switch to START_TAG if the next character
starts a valid name. Otherwise, htmlParseStartTag might return without
consuming all characters up to the final '>'.

Found by OSS-Fuzz.

Note that in order to reproduce this you need to build lxml from source, as binary wheels are statically linked to libxml2 2.9.10.
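To confirm which libxml2 a given lxml install actually uses, something like the following sketch may help. It reads lxml's LIBXML_VERSION and LIBXML_COMPILED_VERSION tuples; the cdata_change_expected() helper is a made-up name, and its 2.9.11 threshold is simply the version named in this report, not something verified against every libxml2 release.

```python
def cdata_change_expected(libxml_version):
    """Return True if the libxml2 version tuple is 2.9.11 or newer.

    Hypothetical helper; the threshold comes from this bug report.
    """
    return tuple(libxml_version) >= (2, 9, 11)


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed")
    else:
        # Version of libxml2 loaded at runtime.
        print("running against: ", etree.LIBXML_VERSION)
        # Version of libxml2 that lxml was compiled against.
        print("compiled against:", etree.LIBXML_COMPILED_VERSION)
        print("new behavior expected:",
              cdata_change_expected(etree.LIBXML_VERSION))
```

With a binary wheel this should report 2.9.10 and False; a source build against a newer system libxml2 should report True.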