[Python-Dev] PEP 428: stat caching undesirable?

Pieter Nagel pieter at nagel.co.za
Wed May 1 09:32:28 CEST 2013


Hi all,

I write as a Python lover of over 13 years who has always wanted something like PEP 428 in Python.

I am concerned about the caching of stat() results as currently defined in the PEP. This means that all behaviour built on top of stat(), such as p.is_dir(), p.is_file(), p.st_size and the like can indefinitely hold on to stale data until restat() is called, and I consider this confusing.
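To make the staleness concrete, here is a minimal sketch. CachingPath is a hypothetical stand-in written for illustration, not the PEP's implementation; it only assumes the behaviour described above, namely that stat() results are kept until restat() is called.

```python
import os
import tempfile


class CachingPath:
    """Hypothetical path object that caches its stat() result (illustration only)."""

    def __init__(self, path):
        self._path = path
        self._stat = None  # cached os.stat_result, filled on first use

    def stat(self):
        # Only hit the OS the first time; afterwards return the cached result.
        if self._stat is None:
            self._stat = os.stat(self._path)
        return self._stat

    def restat(self):
        # Drop the cache and query the OS again.
        self._stat = None
        return self.stat()

    @property
    def st_size(self):
        return self.stat().st_size


with tempfile.TemporaryDirectory() as tmp:
    name = os.path.join(tmp, "data.txt")
    with open(name, "w") as f:
        f.write("abc")

    p = CachingPath(name)
    size_before = p.st_size           # first access populates the cache

    with open(name, "a") as f:
        f.write("defgh")              # the file grows on disk

    stale_size = p.st_size            # served from the cache, not the disk
    fresh_size = p.restat().st_size   # explicit restat() sees the new size
```

In this sketch stale_size still reports the old size even though the file has changed, which is the surprise I am worried about.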

Perhaps in recognition of this, p.exists() is implemented differently, and it does restat() internally (although the PEP does not document this).

If this behaviour is maintained, then at the very least this makes the API more complicated to document: some calls cache as a side effect, others update the cache as a side effect, and others, such as lstat(), don't cache at all.

This also introduces a divergence in behaviour between os.path.isfile() and p.is_file(), which is confusing and will also need to be documented.

I'm concerned about scenarios like users of the library polling, for example, for some file to appear, and being confused about why the arguably more sloppy poll for p.exists() works while a poll for p.is_file(), which expresses intent better, never terminates.
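For comparison, this is roughly the polling loop such a user would expect to work, sketched with the uncached os.path.isfile(). The helper name wait_for_file and its timeout and interval parameters are my own assumptions for the example, not part of any proposed API.

```python
import os
import time


def wait_for_file(path, timeout=5.0, interval=0.05):
    """Poll until `path` is a regular file or `timeout` seconds have passed.

    Hypothetical helper: every iteration asks the OS again, so the loop
    notices the file as soon as it appears.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(path):   # fresh stat() on every call
            return True
        time.sleep(interval)
    # One last check so a file created right at the deadline is not missed.
    return os.path.isfile(path)
```

If the check inside the loop were answered from a cached stat() result instead, the loop could only end by timing out.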

In theory the caching mechanism could be further refined to only hold onto cached results for a limited amount of time, but I would argue this is unnecessary complexity, and caching should just be removed, along with restat().

Isn't the whole notion that stat() needs to be cached for performance reasons somewhat of a historical relic of older OSes and filesystems? AFAIK Linux already caches stat() results as a side effect of the filesystem layer's metadata caching. How do Windows and Mac OS fare here? Are there benchmarks proving that the cost is serious enough to justify complicating the API?
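The kind of measurement I have in mind could start from something as simple as the sketch below. The function name time_stat is made up for the example, and it only measures repeated stat() calls on one local path that the OS has most likely already cached, so it says nothing about network filesystems or cold caches.

```python
import os
import timeit


def time_stat(path, number=10000):
    """Return the average wall-clock seconds per os.stat() call on `path`."""
    total = timeit.timeit(lambda: os.stat(path), number=number)
    return total / number


# Average cost of one uncached-at-the-Python-level stat() on the current directory.
per_call = time_stat(os.getcwd(), number=1000)
print("os.stat() took about %.2f microseconds per call" % (per_call * 1e6))
```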

If the ability to cache stat() calls is deemed important enough, how about a different API where is_file(), is_dir() and the like are added as methods on the result object that stat() returns? Then one can hold onto a stat() result as a temporary object and ask it multiple questions without doing another OS call, and is_file() etc. on the Path object can be documented as being forwarders to the stat() result just as p.st_size is currently - except that I believe they should forward to a fresh, uncached stat() call every time.
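A rough sketch of that alternative follows. StatInfo and SketchPath are hypothetical names used only for illustration, and returning False when stat() fails is my assumption, mirroring what os.path.isfile() does.

```python
import os
import stat


class StatInfo:
    """Hypothetical result object: one stat() call, many questions."""

    def __init__(self, st):
        self._st = st  # an os.stat_result captured at a single point in time

    def is_file(self):
        return stat.S_ISREG(self._st.st_mode)

    def is_dir(self):
        return stat.S_ISDIR(self._st.st_mode)

    @property
    def st_size(self):
        return self._st.st_size


class SketchPath:
    """Hypothetical path object whose queries always use a fresh stat()."""

    def __init__(self, path):
        self._path = path

    def stat(self):
        # No caching here: callers who want a snapshot keep the returned object.
        return StatInfo(os.stat(self._path))

    def is_file(self):
        try:
            return self.stat().is_file()
        except OSError:
            return False  # assumption: a failing stat() means "not a file"

    def is_dir(self):
        try:
            return self.stat().is_dir()
        except OSError:
            return False
```

A caller who wants several answers from one OS call writes info = p.stat() and then asks info.is_file(), info.is_dir() and info.st_size, so the lifetime of the cached data is visible in the caller's own code.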

I write directly to this list instead of raising it with Antoine Pitrou in private just because I don't want to make extra work for him to first receive my feedback and then re-raise it on this list. If this is wrong or disrespectful, I apologize.

-- Pieter Nagel


