bpo-29659: Expose copyfileobj() length arg for public use by goodboy · Pull Request #328 · python/cpython (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation48 Commits3 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

goodboy

Corresponds to issue bpo-29659

When copying large files the copy time can be sufficiently
shortened by increasing the memory buffer used in the copyfileobj()
routine. Expose the length argument from copyfileobj() upwards
for use throughout the module.

https://bugs.python.org/issue29659

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately our records indicate you have not signed the CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow these steps to rectify the issue:

  1. Sign the PSF contributor agreement. The "bugs.python.org username" requested by the form is the "Login name" field in "Your Details" at b.p.o
  2. Wait at least one US business day and then check the "Contributor form received entry under "Your Details" on bugs.python.org to see if your account has been marked as having signed the CLA (the delay is due to a person having to manually check your signed CLA)
  3. Reply here saying you have completed the above steps

Thanks again to your contribution and we look forward to looking at it!

@Mariatta Mariatta changed the titleExpose copyfileobj() length arg for public use bpo-29659: Expose copyfileobj() length arg for public use

Feb 27, 2017

@goodboy

vstinner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You must update Doc/library/shutil.rst: document the new parameter, but also add a ".. versionchanged:: 3.7" section to document the change.

An in-memory buffer size can be set with `length`.
"""
length = length or 16*1024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest:

if not length:
length = 16 * 1024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

def copyfileobj(fsrc, fdst, length=None):
"""Copy data from file-like object `fsrc` to file-like object `fdst`.
An in-memory buffer size can be set with `length`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should document the default size: 16 kB.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do 👍

"""Copy data from src to dst.
If follow_symlinks is not set and src is a symbolic link, a new
symlink will be created instead of copying the file it points to.
An in memory buffer size can be set with `length`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same: document default size.

@goodboy

@Haypo I believe I've addressed everything!
If there's anything I'm missing please let me know :)

vstinner

@@ -74,6 +74,9 @@ Directory and files operations
Raise :exc:`SameFileError` instead of :exc:`Error`. Since the former is
a subclass of the latter, this change is backward compatible.
.. versionchanged:: 3.7
Expose the `length` arg from `copyfileobj`; default is 16 KB.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn'd say "expose ..." but just "Added length parameter", and versionadded is not the right place to document "default is ...", put it in the documentation body.

You didn't update copyfile() signature. Same remark for the 2 other functions.

Moreover, you should also document the length in the function document, not only in versionadded. The "documentation" can just ask to refer to copyfileobj().

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoot yeah not sure how I missed the signatures...

def copyfileobj(fsrc, fdst, length=None):
"""Copy data from file-like object `fsrc` to file-like object `fdst`.
An in-memory buffer size can be set with `length`; the default is 16 KB.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please mention the unit, "size in bytes"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Haypo to clarify, do you prefer:
'An in-memory buffer size can be set with length; the default size in bytes is 16*1024',
or
"An in-memory buffer size in bytes can be set with length; the default is 16 KB.
?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"An in-memory buffer size in bytes can be set withlength; the default is 16 KB."

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I encourage you to use KiB, which other parts of the documentation already use. If you write 16 KB it could easily be interpreted as 16,000 B (lower-case k means 1000, as in km for kilometre).

@goodboy

@Haypo hopefully 3rd times the charm ;)

If you'd like me to squash the history to one commit let me know!

terryjreedy

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: leave API of copyfileobj should alone as 'length=16*1024' to document default. For other functions, say something like 'If length not passed, copyfileobj uses its default.' We might change it someday.

Required: new or revised tests that use the expanded api (and fail without patch).

@goodboy

@terryjreedy the only problem with that is that we'll need to have the if length checks in all consuming functions as opposed to only once in copyfileobj no? Maybe you have a better approach?

Yes I totally agree about the test. I guess it goes here?

@vstinner

berkerpeksag

@@ -74,6 +80,9 @@ Directory and files operations
Raise :exc:`SameFileError` instead of :exc:`Error`. Since the former is
a subclass of the latter, this change is backward compatible.
.. versionchanged:: 3.7
Added `length` parameter

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation should use three spaces here:

.. versionchanged:: 3.7 Added the length parameter.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`length` -> *length*

(and please finish the sentence with a full stop.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

When copying large files the copy time can be sufficiently shortened by increasing the memory buffer used in the copyfileobj() routine. Expose the length argument from copyfileobj() upwards for use throughout the module.

vadmium

Note that if the current file position of the *fsrc* object is not 0,
only the contents from the current file position to the end of the file
will be copied.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO “Note that” is redundant. In fact the whole first line could be dropped, and “Only the contents from the current file position . . .” could be put back in the first paragraph.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

vstinner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please address pending comments before I can approve the change ;-)

"""Copy data from file-like object `fsrc` to file-like object `fdst`.
An in-memory buffer size in bytes can be set with `length`; the default is
16 KB.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: KiB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

16 KB.
"""
if not length:
length = 16*1024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP 8: add spaces around operators, "16 * 1024"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -251,10 +262,13 @@ def copy2(src, dst, *, follow_symlinks=True):
If follow_symlinks is false, symlinks won't be followed. This
resembles GNU's "cp -P src dst".
An in-memory buffer size in bytes can be set with `length`; the default is
16 KB.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this empty line.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@vstinner

Note: I created PR #4293 to replace KB with KiB in the Python code base ;-)

@goodboy

I think everything has been addressed except for @terryjreedy's:

Suggestion: leave API of copyfileobj should alone as 'length=16*1024' to document default. For other functions, say something like 'If length not passed, copyfileobj uses its default.' We might change it someday.

I don't know how to keep the default value in copyfileobj but also support it as a default argument in all client functions without length = 16 * 1024 if length is None else length in each case. Personally I think that's a bit uglier. I can also stick the default value in each client function to be length=16 * 1024 and always pass it through transparently if that is preferred; it just means changing the value in multiple places if it is changed in the future. I had originally opted for one change in one place if it was to be modified later.

Whatever you all think is best I'll do 👍

vstinner

src = os.path.join(src_dir, 'foo')
write_file(src, 'foo' * 10 ** 6)
# buffer with 20 Kib
fn(src, dst_dir, length=20 * 1024)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to use a smaller file: write('x' * 100) and then length=20.

By the way, I suggest to use tempfile.NamedTemporaryFile rather than temporary files:

            with tempfile.NamedTemporaryFile() as src:
                with tempfile.NamedTemporaryFile() as dst:
                    write_file(src.name, b'x' * 100, binary=True)
                    fn(src.name, dst.name, length=20)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe thanks @vstinner! You did the work for me 👍

AraHaan

@@ -73,8 +73,14 @@ class RegistryError(Exception):
and unpacking registries fails"""
def copyfileobj(fsrc, fdst, length=16*1024):
"""copy data from file-like object fsrc to file-like object fdst"""
def copyfileobj(fsrc, fdst, length=None):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def copyfileobj(fsrc, fdst, length=16*1024):

I personally prefer this way in my code to concentrate 2 lines worth of code into one.

if not length:
    length = 16 * 1024

It looks a little nicer and not to mention saves lines of code that basically do the same thing.
And yes I am also guilty of doing it this way and then people started hating me for it and calling my a bad programmer for it.

@goodboy

Just to clear this up since it seems my original reasoning isn't showing well and the drive-by comments are piling up.

Afaict there are 3 ways to deal with the default value of length:

If one of these solutions is preferred over the others after acknowledging the trade-offs I'm fine to do it, but just simply saying length=16*1024 is less code doesn't deal with how to handle copy and copy2 and the extra (or less clear) code they may contain.

@goodboy

@vstinner looks like the appveyor build failed due to the with open(path, 'wb' if binary else 'w') as fp: line?

@vstinner

@vstinner looks like the appveyor build failed due to the with open(path, 'wb' if binary else 'w') as fp: line?

ERROR: test_copy_w_different_length (test.test_shutil.TestShutil)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\cpython\lib\test\test_shutil.py", line 1417, in test_copy_w_different_length
    write_file(src.name, b'x' * 100, binary=True)
  File "C:\projects\cpython\lib\test\test_shutil.py", line 60, in write_file
    with open(path, 'wb' if binary else 'w') as fp:
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\tmpj1vhtx1b'

Sorry, I don't know why the test failed.

@AraHaan

Looks to me like an normal permission error as python seems to not have permissions to write to the temp folder? Maybe somehow have python write to somewhere that does not require administrator permissions? It would be nice if python had an nice function that you provide the path and it checks if you have permissions to write to there to avoid having to try / except permission errors. 🤔 That function would be the perfect thing to fix this test on Windows.

@goodboy

Ok so besides the outstanding:

Is there anything else in the patch that needs to be addressed?

@AraHaan

vstinner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike length=-1 "feature".

16 KiB.
"""
if not length:
length = 16 * 1024
while 1:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that it's ok to use a loop if the length is negative. I suggest to have a special case for negative value calling read() (no parameter) only once.

The integer *length*, if given, is the buffer size; the default value
in bytes is 16 KiB. A negative *length* value means to copy the data without
looping over the source data in chunks; by default the data is read in
chunks to avoid uncontrolled memory consumption.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"A negative length value means to copy the data without looping over the source data in chunks"

I dislike this definition. In practice, negative means "unlimited" buffer size: the whole input file is loaded into memory.

I'm not sure that it's a good practice to try to load files of unknown size into memory.

I suggest to remove this feature which seems more like a side effect than a carefully designed API.

If you want to get fast copy, pass a very large length like 1 GB. But if Python starts to load 1 TB into memory, it's likely to crash the system... At least, to slow down the system, a lot.

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@vstinner

klange pushed a commit to toaruos/cpython that referenced this pull request

Sep 15, 2021

@gvanrossum

jaraco pushed a commit that referenced this pull request

Dec 2, 2022

@dependabot-preview @Mariatta