bpo-6686: Fix Lib.xml.sax.expatreader.GetProperty to return a string object · Pull Request #9715 · python/cpython (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation36 Commits8 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jonathan and others added 4 commits

October 3, 2018 20:56

(xml.sax.handler.property_xml_string) returns bytes

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately our records indicate you have not signed the CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

taleinat changed the title~~Issue6686~~ bpo-6686: Fix Lib.xml.sax.expatreader.GetProperty to return a string object

Oct 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start!

In addition to my inline comments, IMO checking that the type of the .getProperty() output is not enough. We should check that the output is precisely what we expect given the input. One way to implement this is to have the handler keep all of the outputs, e.g. in a list, and then inspect the output after the parsing is done.

Finally, while I think the name SaxPropertyTest is good, it currently only tests one type of property. I feel that we should include a comment mentioning that it currently doesn't test other property types.

self.test_harness = test_harness
self.reader = reader

def startElement(self, name, attr): # @UnusedVariable

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the # @UnusedVariable comment, we don't use these in this codebase.

@@ -177,7 +177,7 @@ def getProperty(self, name):
elif name == property_xml_string:
if self._parser:
if hasattr(self._parser, "GetInputContext"):
return self._parser.GetInputContext()
return self._parser.GetInputContext().decode()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.decode() uses the UTF-8 encoding by default. Shouldn't this be using the encoding used by the parser?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this one. Will have a new commit in a little while.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turns out to be much more complicated than appears on the surface. Here is a description of what happens:

The encoding is stored in a xmlreader.InputSource instance.
This instance is not saved in any Python class instance,
It is passed unmodified right through to the Expat parser.
It is permitted for the encoding to not be specified by a calling application.
If the encoding is not specified, Expat uses the encoding specified in the XML prolog statement.
Standard Expat only supports US-ASCII, UTF-8, UTF-16 and ISO 8859 (Latin 1)
If the XML prolog does not specify an encoding it uses UTF-8.
The final encoding used is not passed back to Python and Expat converts the input to UTF-8 no matter what the external coding was, losing the information about the original encoding.
This is according to the documentation for Expat and the Python ans CPython code.

This is what will happen in various situations:

If no encoding is specified anywhere and the document is in ISO-8859 or UTF-16, Expat will usually encounter indigestion early and will tell us. It is possible that short pathological documents might slip through.
If no encoding is specified anywhere and the document is in US-ASCII it will be accepted correctly by Expat. If the document is in UTF-8 it will obviously be accepted.
If the encoding is contained in the prolog of the XML document the situation is the same since Expat will reject any unsupported encodings and if the document is mis-coded it will fail in Expat as described above.
I have not yet found where the property string which is actually the raw source line gets converted back to bytes within Python or Expat or whether Python is able to get a copy of the line before it is converted to UTF-8 by Expat for its' internal processing. Do you think I should continue researching this?.

My recommendation is to leave our code as it is as we have no way to determine what the original encoding was. Expat will handle most of the error situations itself and this feature seems to be little used by Python users based on the lack of any followup in the last 9 years to this issue.

What do you think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to fix this, we should fix it properly.

If expat converts the data internally to UTF-8, then the existing code here should just work, and all that will be needed is to add a comment explaining this.

If expat only supports the four encodings mentioned, then IMO we don't need to address what happens with other encodings.

I suggest adding a tests with non-ASCII characters in the XML, checking that the "XML string" property is exactly as expected, i.e. expat properly decoded it from its original encoding and then encoded into UTF-8. Do this for all encodings supported by expat (though for ASCII include only ASCII characters). This will ensure that everything works as expected.

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

I have made the requested changes; please review again.

Thanks for making the requested changes!

@taleinat: please review the changes made to this pull request.

def test_xml_str_str(self):
reader = create_parser()
reader.setContentHandler(PropertyContentHandler(self, reader))
input_\

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please format as so (here and below):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the style used in the rest of this codebase (I just did a search to make sure).

ISTM that PEP8 refers to non-assignment operators in the section that you quote. Perhaps it could use some clarification.

input_\
= '\nHi there'
reader.parse(StringIO(input_))
self.assertEqual(input_, self.result.getvalue())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test the result of the .getProperty(property_xml_string) calls? That's what I'd like to see tested for more than its return type.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using XMLGenerator in the HandleContent hierarchy, the oiginal source is reconstructed. This gives an indirect verification.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to line breaks - I am following the recommendation in Pep8. It is recommended as the new approach.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am waiting for your reaction to my comments before I push the latest changes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using XMLGenerator in the HandleContent hierarchy, the oiginal source is reconstructed. This gives an indirect verification.

Yes, but the original source is not reconstructed via the .getProperty() method that we are testing (unless I'm missing something).

@@ -1311,6 +1312,50 @@ def test_nsattrs_wattr(self):
self.assertEqual(attrs.getQNameByName((ns_uri, "attr")), "ns:attr")


# ===========================================================================

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

Could you remove these comments, they don't need to be in the source code.
Thank you

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the rest of this test module, you'll see that it is in keeping with its existing style. IMO this is fine.

def test_xml_str_str(self):
reader = create_parser()
reader.setContentHandler(PropertyContentHandler(self, reader))
input_\

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reader = create_parser()
reader.setContentHandler(PropertyContentHandler(self, reader))
input_\
= b'\nHi there'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here with the max length

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reading of Pep8 was that external teams could decide to allow 100 colums but that for Python development itself ony 80 columns should be used. The comments you are talking about were there before I created modifications to test_sax.py and I was asked by the reviewer to only touch my additions.

I did sign the CLA yesterday but it needs to be reviewed by the PSF first before they accept it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this codebase, code lines should be at most 79 characters long.

And there is an other point, you didn't sign the CLA. it's a requirement for the review.

Thank you

#
# Sax parser property tests
#
# Currently only tests the condition reported in issue bpo-6686

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This note (the second line) shouldn't be in this comment block though, it should be in the class itself.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me at something that discusses this idiom? I don't understand the rationale.

I have finally got to the bottom of this issue. The situation is as follows:

It is necessary, not just desirable, to provide the encoding when decoding the InputContext property. I
wrote test cases that provided input in all of the supported encodings and it failed on all except Ascii
and UTF-8.
The InputContext buffer is always in the original encoding of the source that is passed to the parser.
Expat documents that the use of the InputContext data is problematic. The reason is that this is that Expat returns the contents of the current raw buffer that are associated with the current element being processed. This will be incomplete in the following cases:
- The data for the element crosses buffer boundaries. In this case only the end of the element will be
  available.
- The data for the element needs several buffers. Again, only the portion of the element in the current
  buffer will be available.
Python does not document this limitation.
The standard implementation of the SAX parser in expatreader.py does not support retrieving the information used for determining the encoding used by Expat.
Prior to release 1.95 of Expat, it did not have the ability to expose this information either. The current
release is 2.01. Since this, Expat allows the user to provide a handler that will receive the version,
encoding and standalone information from the XML declaration. This permits the user to determine
the encoding used by Expat.
Pyexpat.c supports allowing a user to provide this handler to Expat.
The Python parser expat.py supports this handler and thus allows the handler to be specified to
Expat, but the expatreader.py SAX parser does not support this capability. This means that it is
impossible for the SAX parser to know what encoding was used by the Expat parser.
This means that it is not possible for the Python Sax parser to decode the buffer provided in
InputContext.

Given all this this is my assessment of what should be done:

The current implementation of expatreader.py correctly handles the InputContext.
The documentation is wrong and needs to be corrected. It is correct to return a bytes object, rather
than a string object.
The documentation should also be updated to contain the warning about the problematic aspects of
this data that is included in the Expat documentation.
I ran the coverage tool against test_sax.py and it showed coverage ranging from 10% to 80%. To me
this implies that it should be upgraded to improve coverage and to test the setting and retrieval of
properties and features which are completely ignored by the current implementation. This work should
be done as a separate Python issue.

What do you think?

@jfgossage

It is necessary, not just desirable, to provide the encoding when decoding the InputContext property. I
wrote test cases that provided input in all of the supported encodings and it failed on all except Ascii
and UTF-8.

This is great work! Please make sure to add such tests to this PR at some point.

What do you think?

I think that XML is inherently a string-based format rather than a bytes-based format. Therefore, returning the "XML string" property as bytes seems wrong to me, especially considering that users have no good way of retrieving the appropriate encoding with which to decode it.

The Python parser expat.py supports this handler and thus allows the handler to be specified to
Expat, but the expatreader.py SAX parser does not support this capability.

Can we rectify this situation? It seems like adding such support should be relatively simple.

Sorry to take so long but I needed to do some more research. Let me summarize the streaming XML support situation in Python. There are two streaming XML parsers provided by Python, the SAX parser and the Expat parser parser.expat. The Sax parser is a faithful implementation of the SAX Api. The Expat parser provides a super-set of SAX with additional Api's that make it easier to use. Since it's availability I would expect that most new development was don using the Expat parser and the Sax parser remains basically to support old software that was written using it.

If you are going to write new software in Python using only standard Python support you are almost certainly going to use the Expat parser. If you have a free choice you will probably use Lxml which is even better and more performant. The SAX parser exists basically to support legacy applications and the documentation for the InputContext property clearly states that you get the data in the same encoding as you originally supplied it.

Also, even with the Expat parser, it would be up to the user to implement the code needed to determine the encoding that was actually used. The parser defines an interface for the needed handler, but it is up to the user to provide an implementation for the handler.

So if we are to implement your suggestion we would need to do the following:

In the Python implementation of Sax the additional handlers are are referenced via a table that is passed to the lexical_handler property. Currently this table does not have a slot defined for the Expat XML decl handler which will give us the encoding specified in the XML declaration.
The next problem is that even if this extension is implemented, the handler only returns data to the user, not to the Python implementation. So we need to find a way to allow the Python implementation to snoop on the handler calls and this capability does not exist.
If we were to implement these two capabilities it would then be easy to implement your suggestion.

My personal feeling is that this is far too much work to support what is ultimately a flawed construct. I think that if we are to proceed along this path we need a wider consensus than we two can provide and that the content of this PR should be examined by one or more senior members of the Python community. The basic position is that we are now looking at an enhancement, not simply a bug fix.

@jfgossage

Sorry to take so long but I needed to do some more research.

No need to apologize, this is great work.

My personal feeling is that this is far too much work to support what is ultimately a flawed construct. I think that if we are to proceed along this path we need a wider consensus than we two can provide and that the content of this PR should be examined by one or more senior members of the Python community. The basic position is that we are now looking at an enhancement, not simply a bug fix.

I agree. How about a different path, then?

What do you think about changing pyexpat.xmlparser.GetInputContext() to return a string rather than a bytes object? The pyexpat.xmlparser objects know what the relevant encoding is.

This too would require review from a more senior core dev, but that's fine IMO to get a good solution in place.

I think we have to accept that the code is working as the developer if not the documenter envisaged it. This makes your proposal an enhancement and I have one big problem with it. Changing the return type from bytes to a string is a breaking change. Anyone who is actually using this is accustomed to receiving a bytes object and their code will probably break when they try to use the new string object.

What I propose is the following:

Treat bpo-6686 as a documentation bug and update the documentation accordingly, leaving the code unchanged.
Create a new enhancement issue suggesting the following changes:
- Modify reset_lex_handler_prop to handle the Expat xml_decl_handler.
- Add a xml_decl_handler that will retrieve the encoding information from the XML declaration and
  store it in the parser. This means the parser will know what encooding, if any, that the document
  specified.
- Add an API to 'xmlreader.py' and expatreader.py to return the encoding actually used given the
  following algorithm:
  * If an encoding is specified in the InputSource return it.
  * If there is an encoding from the XML declaration return it.
  * Return UTF-8 as the default encoding if nothing specified.
This will allow the user to determine what encoding was assumed for the source by Expat and allow
correctly decoding the InputContext to a string.

Doing things this way ensures that no breaking changes are introduced, that changes are held to a minimum and gives the community the opportunity for normal discussion of the proposal.

I was also bothered by the clunkiness of the solution I proposed. The biggest problem I had was that handlers in Expat are designed to allow users to receive data from Expat, not for internal use within Expat. Expat already supports a handler that makes this information available to the user, the XML_decl_handler and it is supported in Python by the Expat parser but not by the SAX parser.

Using the XML_decl_handler internally was also the solution of the originator of this issue.

The SAX parser does provide a mechanism, property_lexical_handler, that exposes all the additional handlers supported by Expat with the exception of the XML_decl_handler. If we were to extend expatreader.py and the XMLReader interface to support to support this handler, then users can access the encoding information needed to decode the property_xml_string in the correct place, which is in the users' code.

I don't think that the Python Expat support should decode the property_xml_string internally since it is clearly documented in Expat that the raw XML is returned and in Python terms this means bytes. Also the user may want to get this in bytes.

In summary, the XML-decl_handler is Expat's way of exposing this information to the user and I consider it either a deficiency or an oversight that it was not supported in the SAX parser.

If you accept this solution than I feel we should treat the original issue as a documentation issue and bring the documentation in harmony with the implementation. We can then create a new issue to cover the exposure of the XML_decl_handler to the user in a SAX compatible manner.

One more reason why property_xml_string should be returned in the original encoding. The most likely place where this property might be used is in an error handler that wants to look at the original XML. In this case, especially if the error is associated with decoding, it is essential that the text be in the original encoding.

@jfgossage, after taking some time to think this over and get a fresh perspective, I completely agree with your latest comments. Let's close this PR and fix the documentation.

I'm not even sure an enhancement proposal is necessary: This hasn't been requested so far, so I'm not sure it is worth the effort to implement for purely theoretical use cases.

I'm going to close this as it is from an unknown repository. Based on the last comment from Tal, it probably needs to be changed to a documentation change so the current PR should not be replaced as it currently is.