CLN: memory-mapping code by twoertwein · Pull Request #44766 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation9 Commits3 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

twoertwein

Rebased on #44761

No need for codecs.getincrementaldecoder as io.TextWrapperIO will do that (and we can use io.TextWrapperIO because mmap is wrapped inside _IOWrapper). io.TextWrapperIO also provides __next__ for us :)

Probably will need some benchmarking with utf-8/non-utf8 files.

twoertwein

twoertwein

df = tm.makeDataFrame()
df.to_csv(path, mode="w+b")
tm.assert_frame_equal(df, pd.read_csv(path, index_col=0))
def test_binary_mode():

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test and test_warning_missing_utf_bom had nothing to do with mmap but were in its test class.

twoertwein

twoertwein

# add one entry with a sepcial character
encoding_ = encoding or "utf-8"
leonardo = "Léonardo".encode(encoding_, errors="ignore").decode(encoding_)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This more strict test version of the test would have failed on master with the python engine.

@twoertwein

twoertwein

@@ -699,15 +699,15 @@ def test_encoding_mmap(memory_map):
GH 23254.
"""
encoding = "iso8859_1"
data = BytesIO(" 1 A Ä 2\n".encode(encoding))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test wasn't using memory_map because it silently failed.

@twoertwein

@pep8speaks

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-05 23:25:04 UTC

@twoertwein

@twoertwein

This PR should speed up all read_csv(..., engine="c", encoding="utf-8") calls when a binary handle is used (this should now be faster than using text handles), by applying the same shortcut as done for memory_map in #43787.

Using TextIOWrapper seems to be slower than the previous solution :( Will revert most of the changes in this PR.

Labels

Clean IO Data

IO issues that don't fit into a more specific label

2 participants

@twoertwein @pep8speaks