bpo-22454: Add shlex.join() (the opposite of shlex.split()) by bbayles · Pull Request #7605 · python/cpython (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation30 Commits7 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

This PR adds shlex.join(), using the implementation suggested in the issue.

This is the opposite of shlex.split(), and is intended to prevent the naive use of ' '.join(list_of_shell_tokens), which can be unsafe.

https://bugs.python.org/issue22454

I do like having the inverse of shlex.split in the standard library however do be careful with the documentation. The returned value is not safe to execute when built up from user input just as using shlex.split(user_input)[0] and feeding it to a shell is not safe :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment on the docs, otherwise this implementation looks fine to me. No comment on whether or not it should be added to the module, since I don't know the shlex use cases enough to weigh in.

With regards to the test strategy, I think it's probably fine as is, but I've given a suggestion anyway because I've been going back and forth on it. You are only testing the fact that join is an inverse of split, not testing the behavior of join itself (which would be useful even if split did not exist), but honestly it's probably fine to define join in terms of split and then say that it's sufficient to test split and the fact that join is the inverse of split. Still, I don't think a few explicit join tests would hurt.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a note to Doc/whatsnew/3.8.rst?

Thank you!

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

I think this is a great idea! Always nice to have symmetry in an API. Speaking of which, one thing I noticed about this implementation is that it doesn't mirror the new posix flag in the split API. So you end up with slightly overzealous quoting:

from shlex import join, split
join(split('test ê test'))
"test 'ê' test"

Not sure if you're up to adding another flag, but figured I'd mention.

@berkerpeksag, I think I've made the changes you requested.

@mahmoud, do you have an implementation suggestion? The quote function doesn't have the posix flag, and this is just ' '.join(quote(arg) for arg in split_command).

I guess the problem is more going the other way, right?

split("test 'ê' test", posix=True) ['test', 'ê', 'test']

split("test 'ê' test", posix=False) ['test', "'ê'", 'test']

split("test 'ê' test", posix=False) ['test', "'ê'", 'test'] join(split("test 'ê' test", posix=False)) 'test ''"'"'ê'"'"'' test'

I'm not sure how much of a bug this is, but it does mean that split(join(split(x, posix=False)), posix=False) is not an idempotent operation. I'm not sure how much it needs to be, though.

@pganssle Bingo! Agreed that full roundtrippability is nice to have, but not necessarily a blocker. (when I've needed this, I've done exactly what join() does now, but that was before this new posix flag).

@bbayles This may be outside the scope of this change, but I think it makes sense to add the posix flag for symmetry.

If there's an appetite for that, then I think we could make the string constants inside the shlex class into global constants, shared between shlex and quote.

I haven't done the venn diagram, but there may be other corners lurking in quote's "exclusive" regex of unsafe chars (re.compile(r'[^\w@%+=:,./-]', re.ASCII)) and shlex's "inclusive" specification of safe chars:

    self.wordchars = ('abcdfeghijklmnopqrstuvwxyz'
                      'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_')
    if self.posix:
        self.wordchars += ('ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
                           'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ')

You can see how, instead of the regex, you could check if any of the word characters aren't those characters. Similar thing could happen for punctuation_chars.

I'll defer to @berkerpeksag on whether to increase the scope of this PR.

I think we want join to default to using the POSIX behavior, right? So if quote defaults to the correct behavior with posix=True (which it seems to), then I think posix=False can be added to join in a separate PR.

If quote defaults to posix=False, then I think it needs to be added into this PR (since we want the default for join to be posix=True, and it would be best if this were not a backwards-incompatible change).

I think we need to add a posix parameter because split()'s posix defaults to True and users would expect join(split(cmd)) == cmd.

@berkerpeksag I'm not sure that join(split(cmd)) == cmd will ever be possible for all cmd when posix=True, because split is lossy in POSIX mode:

shlex.split("one 'two' three") ['one', 'two', 'three']

shlex.split("one two three") ['one', 'two', 'three']

If you turn off POSIX mode, the splitting is no longer lossy (at least in this example):

shlex.split("one 'two' three", posix=False) ['one', "'two'", 'three']

shlex.split("one two three", posix=False) ['one', 'two', 'three']

So right now, you have the situation where:

split(join(split(cmd, posix=True)), posix=True) == split(cmd, posix=True)

Which is I think as good as we can hope for with a default behavior. I think we can go ahead with this and add a posix parameter (defaulting to True) to both join and quote as a separate feature.

Is this accurate @mahmoud?

@pganssle Just caught this, and yeah I agree. The sooner we get the functionality, the fewer ' '.join(args) will be written. Corner cases in a new ticket sounds good to me, too!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My other comments notwithstanding, this looks like a solid shlex addition I wish would've existed long ago. Thanks for adding it!

@berkerpeksag The feature freeze is coming up pretty soon, have you had a chance to review Mahmoud and my comments about the feature? It would be nice to get this merged for Python 3.8.

I agree that shlex.split() is lossy in POSIX mode and I'm fine with it. I'd merge this PR now if the default mode of shlex.split() wasn't POSIX (it has been the default mode for more than a decade)

"Let's merge this and we'll fix other cases later" never works in practice and I'm not comfortable adding an unfinished API to the standard library.

Let's see what @vsajip thinks as he has been working on maintaining the shlex module lately.

I agree that shlex.split() is lossy in POSIX mode and I'm fine with it. I'd merge this PR now if the default mode of shlex.split() wasn't POSIX (it has been the default mode for more than a decade)

If I understand you correctly, it sounds like you have the opposite understanding from me as to what needs to be done. When I said that shlex.split is lossy in POSIX mode, what I mean is that this PR already does as well as it possibly can do with the default (POSIX) mode.

The new feature that we're blocking on is support for lossless round trips when posix=False is specified, and since the default is posix=True, I think it's a sufficiently clean separation that adding support for it can be considered a separate feature (especially considering that quote also has no way to specify posix=False).

I'm not available to increase the scope of this PR currently. So if shlex.join needs to be more than it is here, then I'll close this and allow someone else to contribute.

DinoV pushed a commit to DinoV/cpython that referenced this pull request

Jan 14, 2020