[Python-Dev] PEP 428: stat caching undesirable?
Pieter Nagel pieter at nagel.co.za
Wed May 1 13:22:20 CEST 2013
Antoine and Nick have convinced me that stat() calls can be a performance issue.
I am still concerned about the best way to balance that against the need to keep the API simple, though.
I'm still worried about the current behaviour, where a path can answer True to is_file() in a long-running process just because it was a file last week.
In my experience there are use cases where most stat() calls one makes (including indirectly via is_file() and friends) want up-to-date data. There is also the risk of obtaining a Path object that already had its stat() value cached some time ago without your knowledge (e.g. if the Path was created for you by a walkdir-type function that in turn also called is_file() before returning the result).
And needing to precede each is_file() etc. call with a restat() call, whose return value is not even used, introduces undesirable temporal coupling between the restat() and is_file() calls.
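To make the concern concrete, here is a toy model of the behaviour I mean. CachingPath is a made-up class for illustration, not the PEP 428 implementation; it only assumes that stat() caches its result indefinitely and that restat() refreshes it.

```python
import os
import stat
import tempfile


class CachingPath:
    """Toy model (hypothetical) of a path that caches stat() indefinitely."""

    def __init__(self, path):
        self._path = path
        self._stat = None  # filled in on the first stat() call

    def stat(self):
        if self._stat is None:
            self._stat = os.stat(self._path)
        return self._stat

    def restat(self):
        # Discard the cached value and query the file system again.
        self._stat = None
        return self.stat()

    def is_file(self):
        try:
            return stat.S_ISREG(self.stat().st_mode)
        except FileNotFoundError:
            return False


fd, name = tempfile.mkstemp()
os.close(fd)
p = CachingPath(name)

first = p.is_file()   # hits the file system and caches the result
os.remove(name)
stale = p.is_file()   # answered from the cache, although the file is gone

try:
    p.restat()        # called only for its side effect
except FileNotFoundError:
    pass
fresh = p.is_file()   # only now does the answer reflect reality
```

The is_file() call is only correct if somebody remembered to call restat() first, and nothing in the code ties the two together.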
I see a few alternative solutions, not mutually exclusive:
- Change the signature of stat(), and everything that indirectly uses stat(), to take an optional 'fresh' keyword argument (or some synonym). Then stat(fresh=True) becomes synonymous with the current restat(), and the latter can be removed. Queries like is_file(fresh=True) will be implemented by forwarding fresh to the underlying stat() call they are implemented on.
What the default for 'fresh' should be can be debated, but I'd argue that, for the sake of naive code, fresh should default to True, and code that is aware of stat() caching can then use fresh=False as required.
- The root of the issue is keeping the cached stat() value indefinitely.
Therefore, limit the duration for which the cached value is valid. The challenge is to find a way to express how long the value should be cached without needing to call time.monotonic() or the like, which presumably is also an OS call that will release the GIL.
One way would be to compute the number of virtual machine instructions executed since the stat() call was cached, and set the limit there. Is that still possible, now that sys.setcheckinterval() has been gutted?
- Leave it up to performance-critical code, such as the import machinery, or walkdirs that Nick mentioned, to do their own caching, and simplify the filepath API for the simple case.
But one can still make life easier for code like that, by adding is_file() and friends on the stat result object as I suggested.
But this almost sounds like a PEP of its own, because although pathlib will benefit from it, it is actually an orthogonal issue.
It raises all kinds of issues: should the signature be statresult.isfile() to match os.path, or statresult.is_file() to match PEP 428?
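A rough sketch of how the first and third alternatives might look together. Everything here is hypothetical: FreshPath, StatResult and the 'fresh' keyword are names I made up for illustration, and the is_file() spelling on the stat result is just one of the two options mentioned above.

```python
import os
import stat
import tempfile


class StatResult:
    """Hypothetical wrapper that puts is_file()/is_dir() on a stat result."""

    def __init__(self, st):
        self.st = st

    def is_file(self):
        return stat.S_ISREG(self.st.st_mode)

    def is_dir(self):
        return stat.S_ISDIR(self.st.st_mode)


class FreshPath:
    """Hypothetical path whose queries take a 'fresh' keyword."""

    def __init__(self, path):
        self._path = path
        self._cached = None

    def stat(self, fresh=True):
        # fresh=True (the default) always goes back to the file system;
        # fresh=False reuses the cached value if there is one.
        if fresh:
            self._cached = None
        if self._cached is None:
            self._cached = StatResult(os.stat(self._path))
        return self._cached

    def is_file(self, fresh=True):
        # Simply forwards 'fresh' to the underlying stat() call.
        try:
            return self.stat(fresh=fresh).is_file()
        except FileNotFoundError:
            return False


fd, name = tempfile.mkstemp()
os.close(fd)
p = FreshPath(name)

before = p.is_file()                    # naive code: always up to date
os.remove(name)
cached_answer = p.is_file(fresh=False)  # cache-aware code opts in to stale data
fresh_answer = p.is_file()              # default notices the file is gone
```

With this shape there is no separate restat() to remember, and code that wants to do its own caching can hold on to the stat result object and query that directly.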
-- Pieter Nagel