DOC, ENH: Support memory_map for Python engine by gfyoung · Pull Request #13381 · pandas-dev/pandas (original) (raw)
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Title is self-explanatory.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is doc-string updated?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes? I just added it.
Current coverage is 84.24%
@@ master #13381 diff @@
Files 138 138
Lines 50763 50775 +12
Methods 0 0
Messages 0 0
Branches 0 0
- Hits 42756 42773 +17
- Misses 8007 8002 -5
Partials 0 0
Powered by Codecov. Last updated by 158ae5b...1f6aae9
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this more specific? Which exceptions do you need to catch here?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really whatever can be thrown by the mmap constructor, and while I could certainly guess what some of them might be, not really sure how safe that is. This is also a private method, so I wouldn't foresee much abuse of this catch-all statement.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @shoyer here, I think you shouldn't be catching things. Let the exception bubble up; this is a low-level handler and shouldn't hide things.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests would help narrow this down! e.g. first test should be a non-existant file, 2nd a file-like which cannot be memory mapped, e.g. StringIO (I think).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's inconsistent with what's done on the C engine side, where if the lower-level function calls fail (for whatever reason), we just return NULL for the source as you can see here. Note that you can pass in StringIO to read_csv with memory_map=True and nothing blows up on the C engine side.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're referring to the file object, I think I am already (I wrote a parameters section for that reason). Or are you referring to the catch-all try-except block?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung If you scroll down in parser.pyx a little further, you notice if the pointer is NULL we raise IOError.
I don't think we should ignore arbitrary errors, certainly not if this is disabled by default. Even then, never use blanket except: clauses -- except Exception: is much safer.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer : that's not the right NULL - this is the one I'm referring to here. The one you're pointing at is when all other attempts to create a source have failed.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I guess you're right. Still, let's only catch Exception and make a note about why we do this.
I'm not sure that we need to consider the current "catch anything" behavior as API but I guess it's better safe than sorry.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I have no issues with adding except Exception, so I'll do that.
Do you have any benchmarks showing positive performance gains?
If this is an unambiguous win, we should enable it by default rather than adding yet another option.
I can run benchmarks but remember this is AFAICT a performance vs memory tradeoff given what memory_map does. I am inclined to leave this still as an option, though how many people use this option I am not sure.
Ran basic %timeit with CSV file sizes of 10000 rows and 1000000 rows, and the performance flip-flops between the two. I think the I/O overhead starts to count as the file sizes get larger and larger, but I don't know if there is a specific cutoff. I think this would indicate leaving it as an option until there are more conclusive results.
@shoyer this is an already existing option, @gfyoung is just documenting. Though point taken that it doesn't exist for the python engine, but I think the consistency outweight having to special case options and think about which engine I am using.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are going to add this, then pls setup some tests for it in io.common/tests/test_common.py
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests added.
I would also add that this PR does provide support for memory_map in the Python engine via Python's mmap library.
Just to clarify, this is already supported by the C engine via **kwargs, but is currently undocumented?
I'm OK with documenting it and adding it as another argument then -- we already have quite a few arguments for read_csv, adding another one won't make things much worse.
@shoyer : If you look at the signature here, you will see that it is already in the signature explicitly. However, no explanation of it is given. Also, read @jreback comment above.
Travis was having build issues, so I'm rerunning tests. @jreback could you cancel this old build here?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blank line is missing before 'Returns'
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@jreback : Added the except Exception block as @shoyer requested, and Travis is giving the green light. Ready to merge if there are no other concerns.
gfyoung deleted the memory-map-python-engine branch