Make pathlib extensible (original) (raw)
February 24, 2020, 12:15am 1
Hi all,
I saw that Python 3.8 adds zipfile.Path
, which provides a partial pathlib-compatible interface to zip files. This implementation doesn’t actually use any pathlib
machinery, rather just mimics its interface. Previously I’ve noted packages like artifactory that do subclass from pathlib
classes, but the implementation is necessarily messy due to pathlib.Path
not being intended to be subclassed in user code.
But what if we changed that?
The pathlib
source code already contains notions of flavours (implementing posix or windows path semantics) and accessors, but pathlib
contains only a single accessor implementation, and in the last few years pathlib
has accumulated code that bypasses the accessor.
I propose that we change instances where pathlib.Path
directly calls methods like os.close()
and os.getcwd()
to use the accessor instead, drop the underscores from _Accessor
and _NormalAccessor
, and document the accessor interface.
I’ve played around with these ideas and published a package called pathlab that shows what might be possible. It turns out implementing the equivalent of os.stat()
and os.listdir()
in an accessor gets you pretty far!
Some questions arise:
- Is this something we want users to do? For me it seems obvious that a common interface to the local filesystem, zip and tar files, FTP, WebDAV etc would be desirable, but I’m keen to hear feedback!
- Should we provide a
pathlib.StatResult
class for use in users’Accessor.stat()
implementations?os.stat_result
is a little tricky to instantiate as-is. - What does this mean for
os.DirEntry
, if anything? Its interface is a subset ofpathlib.Path
, with the exception of the addition of apath
attribute. - How do we suggest users bind an accessor instance (which, unlike a local filesystem accessor, requires some context like a URL or an open file pointer) to a path type? In my implementation I’m doing some things in
Accessor.__new__()
to create a newPath
subclass with the accessor instance bound in. But perhaps there’s a better way… - In Accessors, do we want to stick to the
os
API as much as possible, or perform some sort of simplification (i.e.unlink()
andrmdir()
, or justdelete()
?) - How do we support transfers between different kinds of filesystems?
with path1.open('rb'), path2.open('wb') ...
would work but would not preserve permissions/metadata/etc, so would there by scope to add aPath.transfer()
method?
Penny for your thoughts! Hope this isn’t too crazy.
The changes needed to pathlib.Path
can be seen here, albeit implemented in a subclass for now.
Barney
pitrou (Antoine Pitrou) February 24, 2020, 9:14am 2
@barneygale That’s… interesting. The original intent for the “accessor” thing was to have a variant that did all accesses under a filesystem tree in a race condition-free way using openat
and friends. It turned out to be much too hairy to actually implement, so was entirely abandoned, but the accessor abstraction was left there.
It would be nice if you could directly modify pathlib
and post the diff (or PR) somewhere. Some parts of pathlib
use shortcurts for speed (especially Path
construction), and I’m curious if you can retain the shortcuts while respecting the accessor abstraction.
encukou (Petr Viktorin) February 24, 2020, 11:14am 3
I would advise making it easier to subclass PurePath
, with the subclass supplying the filesystem access methods just like pathlib.Path
does.
Why? Subclasses aren’t likely to really support the full Path API; for example home()
or expanduser()
doesn’t quite make sense in a zip file, or a filesystem that’s always read-only (like ISO) might not implement any of the mutating methods. The subclass could customize __new__
quite trivially. Also, Path
is a path on the local filesystem; if making subclass of it would represent something else, I’d call it a violation of the Liskov substitution principle.
For methods like glob
and {read,write}_{text,_bytes}
, pathlib
could expose some common helper functions for subclasses to use.
ISTM that some non-trivial thought (or just checking?) would also need to go into what happens when calling joinpath
on/with disparate Path
objects.
(FWIW, I’ve played around with similar ideas working on githpathlib, a partially pathlib-compatible library for accessing Git repos – i.e. you can read data from a commit directly from Git’s database, without having to checkout all of the commit’s files on disk.)
barneygale (Barney Gale) February 24, 2020, 3:33pm 4
Thank you both for the thoughtful and detailed feedback!
It would be nice if you could directly modify
pathlib
and post the diff (or PR) somewhere. Some parts ofpathlib
use shortcurts for speed (especiallyPath
construction), and I’m curious if you can retain the shortcuts while respecting the accessor abstraction.
I will give this a go! Should have a branch within a week or two.
I would advise making it easier to subclass
PurePath
, with the subclass supplying the filesystem access methods just likepathlib.Path
does.
For me, a lot of the appeal of pathlib is that you get a lot of functionality out of few underlying os
calls. Yes, you can create your own PurePath
subclass with methods is_file
, is_dir
, is_symlink
, is_char_device
, is_block_device
, is_mount
etc, but it’s much simpler to implement Accessor.stat()
instead.
Why? Subclasses aren’t likely to really support the full Path API; for example
home()
orexpanduser()
doesn’t quite make sense in a zip file, or a filesystem that’s always read-only (like ISO) might not implement any of the mutating methods.
Similarly, Path.owner()
and Path.group()
raise NotImplementedError
on Windows. ISO files aren’t always read-only AFAIK - a quick test with file-roller
shows I can edit ISO files. I don’t think there’s a clear division between a full path API and a partial one, unless we’re treating POSIX as gospel.
Also,
Path
is a path on the local filesystem; if making subclass of it would represent something else, I’d call it a violation of the Liskov substitution principle.
True. Suggest we provide AbstractPath
or UserPath
or something, with Path
as a subclass.
ISTM that some non-trivial thought (or just checking?) would also need to go into what happens when calling
joinpath
on/with disparatePath
objects.
Indeed! This isn’t something I’ve given much consideration yet. Will have a think.
Thanks for the gitpathlib link, cool stuff! I’d also draw folks attention to the s3path package.
pf_moore (Paul Moore) February 24, 2020, 4:44pm 5
FWIW, pyfilesystem is a much more comprehensive “extensible filesystem abstraction” API. It’s been round for some time (longer than pathlib, I believe) and probably has some ideas worth investigating. But making pathlib more extensible is definitely an idea worth exploring in its own right.
barneygale (Barney Gale) February 25, 2020, 1:44am 6
I wasn’t previously aware of pyfilesystem, looks really handy with lots of fine-grained control. The essential methods to implement are:
getinfo()
Get info regarding a file or directory.listdir()
Get a list of resources in a directory.makedir()
Make a directory.openbin()
Open a binary file.remove()
Remove a file.removedir()
Remove a directory.setinfo()
Set resource information.
There are also non-essential methods which “have a default implementation in the base class, but may be overridden if you can supply a more optimal version.”
The author(s) have designed a nice, fine-grained API for FS operations, and so avoid some of the kinks like emulating os.stat()
. That said, I’d suggest that modeling the accessor API on os
has certain advantages:
- The
os
module (and perhaps Unix APIs more generally) are already somewhat familiar to users - I worry that attempting to design a filesystem API that properly defines all reasonable capabilities and variations in support while remaining accessible is a particularly difficult task.
- It’s what
pathlib
does already, so we don’t need to do much refactoring
My proposal is that we keep the (more-or-less) os
-like API for accessors, and perhaps provide some default implementations (e.g. Accessor.listdir()
calls Accessor.scandir()
). That said, this is my first attempt and contributing to cpython and I’m not sure how big we’re allowed to dream!
encukou (Petr Viktorin) February 25, 2020, 12:58pm 7
I suggest you copy pathlib and its tests, and publish it on PyPI with a new name (maybe pathlab.pathlib
?). Then, add the new features, tests, documentation, test it in the real world, and after that propose changes into pathlib.
This way, you’re not blocked by anyone until the very end, it can be tested easily, and the improvements will be available even for older versions of Python.
(This is not meant to push you away! It’s a comfortable branch-merge workflow for big changes. Core devs also use similar workflows, see for example importlib-metadata, asyncio/tulip. Or compileall improvements from my colleagues.)
barneygale (Barney Gale) February 25, 2020, 2:18pm 8
Great, thank you! I suppose my main aim with this thread was to gauge support for the idea and get help identifying possible implementation roadblocks. Thanks for the help, all. I think I’m right in saying there’s some cautious interest in the idea. I’ll now set up a branch of pathlib and write an initial implementation. This will probably lack some of the possible niceties mentioned in my original post - like a user-oriented version of os.stat_result
.
barneygale (Barney Gale) February 25, 2020, 9:33pm 9
For anyone interested in following my progress, my first attempt is available in a branch here: https://github.com/barneygale/cpython/tree/pathlib-abc. Currently lacking: tests, docs, and niceties like a one-liner to generate a UserPath
subclass with a particular accessor instance bound.
barneygale (Barney Gale) February 29, 2020, 6:02am 10
Hi all,
I think I’ve now implemented as much as I can without further design input. You can see my changes here.
Noddy example of how this can be used:
import pathlib
import stat
import os
class MyPath(pathlib.UserPosixPath):
pass
class MyAccessor(pathlib.AbstractAccessor):
path_type = MyPath
def __init__(self, children):
self.children = children
def stat(self, path, *, follow_symlinks=True):
return os.stat_result((stat.S_IFDIR, 0, 0, 0, 0, 0, 0, 0, 0, 0))
def scandir(self, path):
for child in self.children:
yield self.path_type(path, child)
def main():
accessor = MyAccessor(('foo', 'bar'))
MyPath = accessor.path_type
for p in (MyPath('/'), MyPath('/mlem')):
print(p)
print(repr(p))
print(list(p.iterdir()))
print(p.exists())
print(p.is_file())
print(p.is_dir())
main()
Changes:
- Added
UserPath
,UserPosixPath
andUserWindowsPath
classes. These are intended to be subclassed by users. - Renamed
_Accessor
toAbstractAccessor
- Renamed
_NormalAccessor
toNormalAccessor
- Renamed
_normal_accessor
tonormal_accessor
- Added
AbstractAccessor.path_type
class attribute. In user subclasses ofAbstractAccessor
, this can be set to (a subclass of)UserPosixPath
orUserWindowsPath
. After an accessor is instantiated,accessor.path_type
points to a subclass associated with the accessor instance. I’m not 100% sold on this tbh. - Added
accessor.getcwd()
- Added
accessor.expanduser()
(replacesflavour.gethomedir()
) - Added
accessor.fsencode()
- Added
accessor.fspath()
, which is called to implement__fspath__
and might perform a download/extraction in subclasses. - Added
accessor.owner()
andaccessor.group()
- Added
accessor.touch()
- Changed
accessor.open()
such that it is now expected to function likeio.open()
and notos.open()
- Renamed
accessor.link_to()
toaccessor.link()
(for consistency withos
) - Removed
accessor.listdir()
(now usesscandir()
) - Removed
accessor.lstat()
(now usesstat()
) - Removed
accessor.lchmod()
(now useschmod()
) - Removed support for using a
Path
object as a context manager (bpo-39682)
I’d appreciate some feedback on the work so far - does it look generally OK? Am I breaking things I shouldn’t be?
barneygale (Barney Gale) March 10, 2020, 3:36pm 11
Hi all,
I’ve put in a number of pull requests that lay some groundwork. There’s a small spreadsheet summarizing the changes here. Most of the changes are geared towards moving impure functionality into the _NormalAccessor
class.
Question: how should __fspath__()
work in Path
subclasses? Is it reasonable to delegate to the accessor (with a default implementation of str(path)
), or should __fspath__()
stay as it is?
My initial feeling was that __fspath__()
should return a path that obeys local filesystem semantics, and therefore it’s reasonable to delegate to the accessor in case the accessor-writer wants to download/extract the file and return a local filesystem path. But re-reading PEP 519, and considering that PurePosixPath and PureWindowsPath are path-like irrespective of local system, I’m now leaning towards __fspath__()
always being implemented as return str(self)
, as it is currently.
Does this sound right? And does the same argument apply to Path.__bytes__()
?
Thanks
pitrou (Antoine Pitrou) March 10, 2020, 6:48pm 12
Well… that’s ultimately the implementer’s responsibility, but I view __fspath__
as a light-weight, side-effect free inquiry. If it’s necessary to do a background copy to return a local filesystem path, then perhaps you shouldn’t implement __fspath__
at all, and instead let the user open the file (or a middle layer) explicitly for reading.
pitrou (Antoine Pitrou) March 10, 2020, 6:50pm 13
It does look generally ok, though validation will come through actually running the test suite against those changes.
One question about your example: are you expecting users to have knowledge of the Accessor
class or subclass? For me, that was purely an implementation detail.
barneygale (Barney Gale) March 10, 2020, 8:38pm 14
Thank you!
On __fspath__()
: I think I agree, thanks. I’ll withdraw the relevant bug report + PR.
On exposing Accessor
, I had two ideas:
- Expose and document
Accessor
, and make this the principal class to subclass for implementing a customPath
subclass. It’s a kinda handy to keep all the filesystem-accessing stuff separate, and keepPath
objects as an immutable view. It also means users don’t have to deal with any__new__
magic, because accessors are pretty boring :-). But it does mean users would probably need to subclass bothAccessor
andPath
(or to call some other factory function, likeMyPath = accessor.path_type
in the example above), which might be asking too much. - Move all
Accessor
methods intoPath
, but this might make it awkward to keep state around, and I’m not currently a fan of this idea.
I appreciate this wasn’t the original intention for accessors, but they seem to fit this problem pretty nicely, and provide a path forward without any big rewrites.
pitrou (Antoine Pitrou) March 10, 2020, 8:51pm 15
I agree with making Accessor
a hook for subclass implementers. Your example seemed to show it being invoked by the end user, though, which doesn’t sound necessary.
barneygale (Barney Gale) March 10, 2020, 8:55pm 16
I’m imagining that a single Accessor
instance is created by the user for a particular filesystem, e.g.:
accessor = S3Accessor(
username='aws_user',
token='aws_token',
bucket='aws_bucket')
S3Path = accessor.path_type # Evaluates to a `Path` subclass with this particular accessor instance bound
root = S3Path('/')
assert root.exists()
pitrou (Antoine Pitrou) March 10, 2020, 9:15pm 17
I’m a bit skeptical about this. My first hunch is that implementers can create another public-facing abstraction, for example a FileSystem
object. If we expose Accessor
to the public, people will inevitably be using it directly and we may end up more constrained that it we make it an implementer-only API (where we can probably be less strict when it comes to e.g. API stability).
barneygale (Barney Gale) March 10, 2020, 10:10pm 18
Very fair. If it helps, I was planning to only expose AbstractAccessor
, and keep _NormalAccessor
private.
Authors of libraries that extend pathlib may still choose not to expose their custom accessor to user code, so a (theoretical) s3lib
library could be used like this:
S3Path = s3lib.make_path_type(bucket='...')
root = S3Path('/')
and under-the-hood:
def make_path_type(bucket):
class S3Path(pathlib.UserPath):
_accessor = _S3Accessor(bucket)
return S3Path
Marco_Sulla (Marco Sulla) March 10, 2020, 11:28pm 19
What about an ABC class?
barneygale (Barney Gale) March 29, 2020, 8:12pm 20
Some renames that might make the API clearer:
_Accessor
→AbstractFileSystem
_NormalAccessor
→LocalFileSystem
I’d suggest both AbstractFileSystem
and LocalFileSystem
gets a __new__
method that prevents direct instantiation (subclass + instantiate would be fine) to guard against direct usage, per @pitrou
The key question for me is:
I have this
FileSystem
object stored in amyfs
variable, how can I get aPath
type that uses it?
I personally haven’t come up with anything good here. Options include:
- Leave it entirely up to the user or library to construct a
Path
type with theirFileSystem
instance attached myfs.Path(...)
-Path
is an attribute that stores a subclass ofpathlib.Path
with our FS instance bound.pathlib.make_path_type(myfs)
pathlib.Path(..., fs=myfs)
- Do away with
FileSystem
(i.e. accessors) altogether, and merge their methods intoPath
? - Something else?
Thanks!