CLN: memory-mapping code by twoertwein · Pull Request #44766 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation9 Commits3 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Rebased on #44761
No need for codecs.getincrementaldecoder
as io.TextWrapperIO
will do that (and we can use io.TextWrapperIO
because mmap is wrapped inside _IOWrapper
). io.TextWrapperIO
also provides __next__
for us :)
Probably will need some benchmarking with utf-8/non-utf8 files.
df = tm.makeDataFrame() |
---|
df.to_csv(path, mode="w+b") |
tm.assert_frame_equal(df, pd.read_csv(path, index_col=0)) |
def test_binary_mode(): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test and test_warning_missing_utf_bom
had nothing to do with mmap but were in its test class.
# add one entry with a sepcial character |
encoding_ = encoding or "utf-8" |
leonardo = "Léonardo".encode(encoding_, errors="ignore").decode(encoding_) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This more strict test version of the test would have failed on master with the python engine.
@@ -699,15 +699,15 @@ def test_encoding_mmap(memory_map): |
---|
GH 23254. |
""" |
encoding = "iso8859_1" |
data = BytesIO(" 1 A Ä 2\n".encode(encoding)) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test wasn't using memory_map because it silently failed.
Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻
Comment last updated at 2021-12-05 23:25:04 UTC
This PR should speed up all read_csv(..., engine="c", encoding="utf-8")
calls when a binary handle is used (this should now be faster than using text handles), by applying the same shortcut as done for memory_map
in #43787.
Using TextIOWrapper seems to be slower than the previous solution :( Will revert most of the changes in this PR.
Labels
IO issues that don't fit into a more specific label
2 participants