bpo-6686: Fix Lib.xml.sax.expatreader.GetProperty to return a string object · Pull Request #9715 · python/cpython (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation36 Commits8 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

ghost

jonathan and others added 4 commits

October 3, 2018 20:56

(xml.sax.handler.property_xml_string) returns bytes

@JonathanGossage

@JonathanGossage

@JonathanGossage

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately our records indicate you have not signed the CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@taleinat taleinat changed the titleIssue6686 bpo-6686: Fix Lib.xml.sax.expatreader.GetProperty to return a string object

Oct 5, 2018

taleinat

Contributor

@taleinat taleinat left a comment • Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start!

In addition to my inline comments, IMO checking that the type of the .getProperty() output is not enough. We should check that the output is precisely what we expect given the input. One way to implement this is to have the handler keep all of the outputs, e.g. in a list, and then inspect the output after the parsing is done.

Finally, while I think the name SaxPropertyTest is good, it currently only tests one type of property. I feel that we should include a comment mentioning that it currently doesn't test other property types.

self.test_harness = test_harness
self.reader = reader
def startElement(self, name, attr): # @UnusedVariable

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the # @UnusedVariable comment, we don't use these in this codebase.

@@ -177,7 +177,7 @@ def getProperty(self, name):
elif name == property_xml_string:
if self._parser:
if hasattr(self._parser, "GetInputContext"):
return self._parser.GetInputContext()
return self._parser.GetInputContext().decode()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.decode() uses the UTF-8 encoding by default. Shouldn't this be using the encoding used by the parser?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this one. Will have a new commit in a little while.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turns out to be much more complicated than appears on the surface. Here is a description of what happens:

This is what will happen in various situations:

My recommendation is to leave our code as it is as we have no way to determine what the original encoding was. Expat will handle most of the error situations itself and this feature seems to be little used by Python users based on the lack of any followup in the last 9 years to this issue.

What do you think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to fix this, we should fix it properly.

If expat converts the data internally to UTF-8, then the existing code here should just work, and all that will be needed is to add a comment explaining this.

If expat only supports the four encodings mentioned, then IMO we don't need to address what happens with other encodings.

I suggest adding a tests with non-ASCII characters in the XML, checking that the "XML string" property is exactly as expected, i.e. expat properly decoded it from its original encoding and then encoded into UTF-8. Do this for all encodings supported by expat (though for ASCII include only ASCII characters). This will ensure that everything works as expected.

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@JonathanGossage

@ghost

I have made the requested changes; please review again.

@bedevere-bot

Thanks for making the requested changes!

@taleinat: please review the changes made to this pull request.

taleinat

def test_xml_str_str(self):
reader = create_parser()
reader.setContentHandler(PropertyContentHandler(self, reader))
input_\

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please format as so (here and below):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the style used in the rest of this codebase (I just did a search to make sure).

ISTM that PEP8 refers to non-assignment operators in the section that you quote. Perhaps it could use some clarification.

input_\
= '\nHi there'
reader.parse(StringIO(input_))
self.assertEqual(input_, self.result.getvalue())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test the result of the .getProperty(property_xml_string) calls? That's what I'd like to see tested for more than its return type.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using XMLGenerator in the HandleContent hierarchy, the oiginal source is reconstructed. This gives an indirect verification.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to line breaks - I am following the recommendation in Pep8. It is recommended as the new approach.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am waiting for your reaction to my comments before I push the latest changes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using XMLGenerator in the HandleContent hierarchy, the oiginal source is reconstructed. This gives an indirect verification.

Yes, but the original source is not reconstructed via the .getProperty() method that we are testing (unless I'm missing something).

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

matrixise

@@ -1311,6 +1312,50 @@ def test_nsattrs_wattr(self):
self.assertEqual(attrs.getQNameByName((ns_uri, "attr")), "ns:attr")
# ===========================================================================

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

Could you remove these comments, they don't need to be in the source code.
Thank you

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the rest of this test module, you'll see that it is in keeping with its existing style. IMO this is fine.

def test_xml_str_str(self):
reader = create_parser()
reader.setContentHandler(PropertyContentHandler(self, reader))
input_\

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reader = create_parser()
reader.setContentHandler(PropertyContentHandler(self, reader))
input_\
= b'\nHi there'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here with the max length

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reading of Pep8 was that external teams could decide to allow 100 colums but that for Python development itself ony 80 columns should be used. The comments you are talking about were there before I created modifications to test_sax.py and I was asked by the reviewer to only touch my additions.

I did sign the CLA yesterday but it needs to be reviewed by the PSF first before they accept it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this codebase, code lines should be at most 79 characters long.

@matrixise

And there is an other point, you didn't sign the CLA. it's a requirement for the review.

Thank you

taleinat

#
# Sax parser property tests
#
# Currently only tests the condition reported in issue bpo-6686

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This note (the second line) shouldn't be in this comment block though, it should be in the class itself.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me at something that discusses this idiom? I don't understand the rationale.

@ghost

I have finally got to the bottom of this issue. The situation is as follows:

Given all this this is my assessment of what should be done:

What do you think?

@taleinat

@jfgossage

It is necessary, not just desirable, to provide the encoding when decoding the InputContext property. I
wrote test cases that provided input in all of the supported encodings and it failed on all except Ascii
and UTF-8.

This is great work! Please make sure to add such tests to this PR at some point.

What do you think?

I think that XML is inherently a string-based format rather than a bytes-based format. Therefore, returning the "XML string" property as bytes seems wrong to me, especially considering that users have no good way of retrieving the appropriate encoding with which to decode it.

The Python parser expat.py supports this handler and thus allows the handler to be specified to
Expat, but the expatreader.py SAX parser does not support this capability.

Can we rectify this situation? It seems like adding such support should be relatively simple.

@ghost

Sorry to take so long but I needed to do some more research. Let me summarize the streaming XML support situation in Python. There are two streaming XML parsers provided by Python, the SAX parser and the Expat parser parser.expat. The Sax parser is a faithful implementation of the SAX Api. The Expat parser provides a super-set of SAX with additional Api's that make it easier to use. Since it's availability I would expect that most new development was don using the Expat parser and the Sax parser remains basically to support old software that was written using it.

If you are going to write new software in Python using only standard Python support you are almost certainly going to use the Expat parser. If you have a free choice you will probably use Lxml which is even better and more performant. The SAX parser exists basically to support legacy applications and the documentation for the InputContext property clearly states that you get the data in the same encoding as you originally supplied it.

Also, even with the Expat parser, it would be up to the user to implement the code needed to determine the encoding that was actually used. The parser defines an interface for the needed handler, but it is up to the user to provide an implementation for the handler.

So if we are to implement your suggestion we would need to do the following:

My personal feeling is that this is far too much work to support what is ultimately a flawed construct. I think that if we are to proceed along this path we need a wider consensus than we two can provide and that the content of this PR should be examined by one or more senior members of the Python community. The basic position is that we are now looking at an enhancement, not simply a bug fix.

@taleinat

@jfgossage

Sorry to take so long but I needed to do some more research.

No need to apologize, this is great work.

My personal feeling is that this is far too much work to support what is ultimately a flawed construct. I think that if we are to proceed along this path we need a wider consensus than we two can provide and that the content of this PR should be examined by one or more senior members of the Python community. The basic position is that we are now looking at an enhancement, not simply a bug fix.

I agree. How about a different path, then?

What do you think about changing pyexpat.xmlparser.GetInputContext() to return a string rather than a bytes object? The pyexpat.xmlparser objects know what the relevant encoding is.

This too would require review from a more senior core dev, but that's fine IMO to get a good solution in place.

@ghost

I think we have to accept that the code is working as the developer if not the documenter envisaged it. This makes your proposal an enhancement and I have one big problem with it. Changing the return type from bytes to a string is a breaking change. Anyone who is actually using this is accustomed to receiving a bytes object and their code will probably break when they try to use the new string object.

What I propose is the following:

Doing things this way ensures that no breaking changes are introduced, that changes are held to a minimum and gives the community the opportunity for normal discussion of the proposal.

@ghost

I was also bothered by the clunkiness of the solution I proposed. The biggest problem I had was that handlers in Expat are designed to allow users to receive data from Expat, not for internal use within Expat. Expat already supports a handler that makes this information available to the user, the XML_decl_handler and it is supported in Python by the Expat parser but not by the SAX parser.

Using the XML_decl_handler internally was also the solution of the originator of this issue.

The SAX parser does provide a mechanism, property_lexical_handler, that exposes all the additional handlers supported by Expat with the exception of the XML_decl_handler. If we were to extend expatreader.py and the XMLReader interface to support to support this handler, then users can access the encoding information needed to decode the property_xml_string in the correct place, which is in the users' code.

I don't think that the Python Expat support should decode the property_xml_string internally since it is clearly documented in Expat that the raw XML is returned and in Python terms this means bytes. Also the user may want to get this in bytes.

In summary, the XML-decl_handler is Expat's way of exposing this information to the user and I consider it either a deficiency or an oversight that it was not supported in the SAX parser.

If you accept this solution than I feel we should treat the original issue as a documentation issue and bring the documentation in harmony with the implementation. We can then create a new issue to cover the exposure of the XML_decl_handler to the user in a SAX compatible manner.

@JonathanGossage

@JonathanGossage

@ghost

One more reason why property_xml_string should be returned in the original encoding. The most likely place where this property might be used is in an error handler that wants to look at the original XML. In this case, especially if the error is associated with decoding, it is essential that the text be in the original encoding.

@JonathanGossage

@taleinat

@jfgossage, after taking some time to think this over and get a fresh perspective, I completely agree with your latest comments. Let's close this PR and fix the documentation.

I'm not even sure an enhancement proposal is necessary: This hasn't been requested so far, so I'm not sure it is worth the effort to implement for purely theoretical use cases.

@csabella

I'm going to close this as it is from an unknown repository. Based on the last comment from Tal, it probably needs to be changed to a documentation change so the current PR should not be replaced as it currently is.