bpo-29659: Expose copyfileobj() length
arg for public use by goodboy · Pull Request #328 · python/cpython (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation48 Commits3 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Corresponds to issue bpo-29659
When copying large files the copy time can be sufficiently
shortened by increasing the memory buffer used in the copyfileobj()
routine. Expose the length
argument from copyfileobj()
upwards
for use throughout the module.
https://bugs.python.org/issue29659
Hello, and thanks for your contribution!
I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).
Unfortunately our records indicate you have not signed the CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow these steps to rectify the issue:
- Sign the PSF contributor agreement. The "bugs.python.org username" requested by the form is the "Login name" field in "Your Details" at b.p.o
- Wait at least one US business day and then check the "Contributor form received entry under "Your Details" on bugs.python.org to see if your account has been marked as having signed the CLA (the delay is due to a person having to manually check your signed CLA)
- Reply here saying you have completed the above steps
Thanks again to your contribution and we look forward to looking at it!
Mariatta changed the title
Expose copyfileobj() bpo-29659: Expose copyfileobj() length
arg for public uselength
arg for public use
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must update Doc/library/shutil.rst: document the new parameter, but also add a ".. versionchanged:: 3.7" section to document the change.
An in-memory buffer size can be set with `length`. |
""" |
length = length or 16*1024 |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest:
if not length:
length = 16 * 1024
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
def copyfileobj(fsrc, fdst, length=None): |
---|
"""Copy data from file-like object `fsrc` to file-like object `fdst`. |
An in-memory buffer size can be set with `length`. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should document the default size: 16 kB.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do 👍
"""Copy data from src to dst. |
---|
If follow_symlinks is not set and src is a symbolic link, a new |
symlink will be created instead of copying the file it points to. |
An in memory buffer size can be set with `length`. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same: document default size.
@Haypo I believe I've addressed everything!
If there's anything I'm missing please let me know :)
@@ -74,6 +74,9 @@ Directory and files operations |
---|
Raise :exc:`SameFileError` instead of :exc:`Error`. Since the former is |
a subclass of the latter, this change is backward compatible. |
.. versionchanged:: 3.7 |
Expose the `length` arg from `copyfileobj`; default is 16 KB. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn'd say "expose ..." but just "Added length parameter", and versionadded is not the right place to document "default is ...", put it in the documentation body.
You didn't update copyfile() signature. Same remark for the 2 other functions.
Moreover, you should also document the length in the function document, not only in versionadded. The "documentation" can just ask to refer to copyfileobj().
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shoot yeah not sure how I missed the signatures...
def copyfileobj(fsrc, fdst, length=None): |
---|
"""Copy data from file-like object `fsrc` to file-like object `fdst`. |
An in-memory buffer size can be set with `length`; the default is 16 KB. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please mention the unit, "size in bytes"
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Haypo to clarify, do you prefer:'An in-memory buffer size can be set with
length; the default size in bytes is 16*1024'
,
or"An in-memory buffer size in bytes can be set with
length; the default is 16 KB.
?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"An in-memory buffer size in bytes can be set withlength; the default is 16 KB."
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I encourage you to use KiB, which other parts of the documentation already use. If you write 16 KB it could easily be interpreted as 16,000 B (lower-case k means 1000, as in km for kilometre).
@Haypo hopefully 3rd times the charm ;)
If you'd like me to squash the history to one commit let me know!
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: leave API of copyfileobj should alone as 'length=16*1024' to document default. For other functions, say something like 'If length not passed, copyfileobj uses its default.' We might change it someday.
Required: new or revised tests that use the expanded api (and fail without patch).
@terryjreedy the only problem with that is that we'll need to have the if length
checks in all consuming functions as opposed to only once in copyfileobj
no? Maybe you have a better approach?
Yes I totally agree about the test. I guess it goes here?
@@ -74,6 +80,9 @@ Directory and files operations |
---|
Raise :exc:`SameFileError` instead of :exc:`Error`. Since the former is |
a subclass of the latter, this change is backward compatible. |
.. versionchanged:: 3.7 |
Added `length` parameter |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation should use three spaces here:
.. versionchanged:: 3.7 Added the length parameter.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`length` -> *length*
(and please finish the sentence with a full stop.)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
When copying large files the copy time can be sufficiently
shortened by increasing the memory buffer used in the copyfileobj()
routine. Expose the length
argument from copyfileobj()
upwards
for use throughout the module.
Note that if the current file position of the *fsrc* object is not 0, |
only the contents from the current file position to the end of the file |
will be copied. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO “Note that” is redundant. In fact the whole first line could be dropped, and “Only the contents from the current file position . . .” could be put back in the first paragraph.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but please address pending comments before I can approve the change ;-)
"""Copy data from file-like object `fsrc` to file-like object `fdst`. |
---|
An in-memory buffer size in bytes can be set with `length`; the default is |
16 KB. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto: KiB
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
16 KB. |
---|
""" |
if not length: |
length = 16*1024 |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP 8: add spaces around operators, "16 * 1024"
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -251,10 +262,13 @@ def copy2(src, dst, *, follow_symlinks=True): |
---|
If follow_symlinks is false, symlinks won't be followed. This |
resembles GNU's "cp -P src dst". |
An in-memory buffer size in bytes can be set with `length`; the default is |
16 KB. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this empty line.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Note: I created PR #4293 to replace KB with KiB in the Python code base ;-)
I think everything has been addressed except for @terryjreedy's:
Suggestion: leave API of copyfileobj should alone as 'length=16*1024' to document default. For other functions, say something like 'If length not passed, copyfileobj uses its default.' We might change it someday.
I don't know how to keep the default value in copyfileobj
but also support it as a default argument in all client functions without length = 16 * 1024 if length is None else length
in each case. Personally I think that's a bit uglier. I can also stick the default value in each client function to be length=16 * 1024
and always pass it through transparently if that is preferred; it just means changing the value in multiple places if it is changed in the future. I had originally opted for one change in one place if it was to be modified later.
Whatever you all think is best I'll do 👍
src = os.path.join(src_dir, 'foo') |
---|
write_file(src, 'foo' * 10 ** 6) |
# buffer with 20 Kib |
fn(src, dst_dir, length=20 * 1024) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to use a smaller file: write('x' * 100) and then length=20.
By the way, I suggest to use tempfile.NamedTemporaryFile rather than temporary files:
with tempfile.NamedTemporaryFile() as src:
with tempfile.NamedTemporaryFile() as dst:
write_file(src.name, b'x' * 100, binary=True)
fn(src.name, dst.name, length=20)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe thanks @vstinner! You did the work for me 👍
@@ -73,8 +73,14 @@ class RegistryError(Exception): |
---|
and unpacking registries fails""" |
def copyfileobj(fsrc, fdst, length=16*1024): |
"""copy data from file-like object fsrc to file-like object fdst""" |
def copyfileobj(fsrc, fdst, length=None): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def copyfileobj(fsrc, fdst, length=16*1024):
I personally prefer this way in my code to concentrate 2 lines worth of code into one.
if not length:
length = 16 * 1024
It looks a little nicer and not to mention saves lines of code that basically do the same thing.
And yes I am also guilty of doing it this way and then people started hating me for it and calling my a bad programmer for it.
Just to clear this up since it seems my original reasoning isn't showing well and the drive-by comments are piling up.
Afaict there are 3 ways to deal with the default value of length
:
- stick
length=16*1024
in all three function signatures forcopyfileobj
,copy
, andcopy2
- this means any time the default needs to be changed it will be done in 3 places
- this is the least amount of lines of code
- keep it the way it is (i.e.
copy
andcopy2
definelength=None
in their signatures and one function,copyfileobj
, has the currentif length is None:
check)- one more line of code then the previous solution
- have to change the
length
in one place in the future length
is an explicit default argument incopy
andcopy2
- make
copy
andcopy2
take**kwargs
and pass them through to the internal call tocopyfileobj
- same lines of code as the first solution
- doesn't explicitly document
length
in each of the thecopy
,copy2
signatures - one place to change the default value in the future
If one of these solutions is preferred over the others after acknowledging the trade-offs I'm fine to do it, but just simply saying length=16*1024
is less code doesn't deal with how to handle copy
and copy2
and the extra (or less clear) code they may contain.
@vstinner looks like the appveyor build failed due to the with open(path, 'wb' if binary else 'w') as fp:
line?
@vstinner looks like the appveyor build failed due to the with open(path, 'wb' if binary else 'w') as fp: line?
ERROR: test_copy_w_different_length (test.test_shutil.TestShutil)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\projects\cpython\lib\test\test_shutil.py", line 1417, in test_copy_w_different_length
write_file(src.name, b'x' * 100, binary=True)
File "C:\projects\cpython\lib\test\test_shutil.py", line 60, in write_file
with open(path, 'wb' if binary else 'w') as fp:
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\appveyor\\AppData\\Local\\Temp\\1\\tmpj1vhtx1b'
Sorry, I don't know why the test failed.
Looks to me like an normal permission error as python seems to not have permissions to write to the temp folder? Maybe somehow have python write to somewhere that does not require administrator permissions? It would be nice if python had an nice function that you provide the path and it checks if you have permissions to write to there to avoid having to try / except permission errors. 🤔 That function would be the perfect thing to fix this test on Windows.
Ok so besides the outstanding:
- missing news entry
- failing test
Is there anything else in the patch that needs to be addressed?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike length=-1 "feature".
16 KiB. |
---|
""" |
if not length: |
length = 16 * 1024 |
while 1: |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that it's ok to use a loop if the length is negative. I suggest to have a special case for negative value calling read() (no parameter) only once.
The integer *length*, if given, is the buffer size; the default value |
---|
in bytes is 16 KiB. A negative *length* value means to copy the data without |
looping over the source data in chunks; by default the data is read in |
chunks to avoid uncontrolled memory consumption. |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"A negative length value means to copy the data without looping over the source data in chunks"
I dislike this definition. In practice, negative means "unlimited" buffer size: the whole input file is loaded into memory.
I'm not sure that it's a good practice to try to load files of unknown size into memory.
I suggest to remove this feature which seems more like a side effect than a carefully designed API.
If you want to get fast copy, pass a very large length like 1 GB. But if Python starts to load 1 TB into memory, it's likely to crash the system... At least, to slow down the system, a lot.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.
Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again
. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
klange pushed a commit to toaruos/cpython that referenced this pull request
jaraco pushed a commit that referenced this pull request