Bug #2072424 “Misparsing of XML files with very long attributes” : Bugs : Beautiful Soup

BeautifulSoup misparses XML files that contain multiple very long attributes (reproducer below). The trigger seems to be that the file has at least two elements that each have an attribute almost 10 MB long. The symptom is that bs4 silently drops subsequent elements in the file, without raising any exception or printing any warning, even at tree levels above the elements with the long attributes.

I ticked the "security vulnerability" box because this looks like it might be caused by a buffer overflowing somewhere. If it were due to an intentional length check, I'd expect it to happen (a) at a more "round" number of bytes and (b) with some visible error or warning.

Reproducer:
```
#!/usr/bin/env python

from xml.etree import ElementTree
from bs4 import BeautifulSoup

# The number below is the exact point where this issue shows up in my testing.
points = 'A'*9999825
input_svg = f'''

'''

print(f'Length of file is {len(input_svg)} bytes, length of each points attr is {len(points)} bytes')

soup = BeautifulSoup(input_svg, features='lxml-xml')
print('Beautifulsoup:', [e.get('id') for e in soup.find_all('g', recursive=True)])

root = ElementTree.fromstring(input_svg)
print('Python ElementTree:', [e.get('id') for e in root.iterfind('svg:g', {'svg': 'http://www.w3.org/2000/svg'})])
```

Output on my machine with Python v3.12.4 and bs4 v4.12.3 from the Arch repos:
```
Length of file is 19999904 bytes, length of each points attr is 9999839 bytes
Beautifulsoup: ['one', 'two', 'three', 'four']
Python ElementTree: ['one', 'two', 'three', 'four', 'five']
```

Note that the entry "five" is missing from the bs4 output, although it is present in both the input XML and the output of Python's ElementTree.
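
To narrow down where the element is lost, a comparison harness along the following lines may help. It is only a sketch under stated assumptions. The helper names (`build_svg`, `ids_elementtree`, `ids_lxml_huge`) are hypothetical, and the document layout is an assumed stand-in for the reproducer's markup, not a copy of it, so it may not trigger the issue as written. One hypothesis worth testing (unconfirmed in this report) is that the underlying lxml/libxml2 parser applies a size limit and that the resulting error is swallowed. The last helper therefore parses with lxml's `huge_tree=True` option, which disables such limits, so the result can be compared with the bs4 output.

```python
from xml.etree import ElementTree

SVG_NS = 'http://www.w3.org/2000/svg'


def build_svg(attr_len, ids=('one', 'two', 'three', 'four', 'five')):
    """Build a hypothetical SVG document with sibling <g> elements.

    Assumed layout: only the first two groups contain a child with a
    long `points` attribute of `attr_len` characters.
    """
    points = 'A' * attr_len
    groups = []
    for index, group_id in enumerate(ids):
        child = f'<polygon points="{points}"/>' if index < 2 else ''
        groups.append(f'<g id="{group_id}">{child}</g>')
    return f'<svg xmlns="{SVG_NS}">' + ''.join(groups) + '</svg>'


def ids_elementtree(doc):
    """Return the ids of top-level <g> elements using the stdlib parser."""
    root = ElementTree.fromstring(doc)
    return [e.get('id') for e in root.iterfind('svg:g', {'svg': SVG_NS})]


def ids_lxml_huge(doc):
    """Return the same ids using lxml with huge_tree=True.

    Returns None when lxml is not installed.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    parser = etree.XMLParser(huge_tree=True)
    root = etree.fromstring(doc.encode('utf-8'), parser)
    return [e.get('id') for e in root.iterfind('{%s}g' % SVG_NS)]


if __name__ == '__main__':
    doc = build_svg(9999825)  # attribute length taken from the reproducer
    print('ElementTree:', ids_elementtree(doc))
    print('lxml (huge_tree=True):', ids_lxml_huge(doc))
```

If lxml with `huge_tree=True` returns all five ids while bs4 with `features='lxml-xml'` does not, that would point toward a parser-side limit rather than memory corruption, though this sketch alone does not establish that.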