Symlink (and other) handling of archives · Issue #5919 · pypa/pip

Following on from #5848 I've been looking into pip's handling of symlinks in zips (and tars). I've got some experiments to PR, but I think it might be more useful to have a quick conversation about what pip actually wants to do with symlinks in archives it handles, if anything.

The current state

At present, pip handles symlinks in tar archives (largely because Python's TarFile handles them). By contrast, pip doesn't handle symlinks in zips properly (again, largely because Python's ZipFile doesn't); it extracts each one as a regular file containing the name of the target. However, the file attributes in a zip carry enough information to reconstruct symlinks accurately (demonstrably, as the Info-ZIP tools do this).
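To illustrate the point about zip file attributes, here is a minimal sketch of how a symlink entry could be recognised. It assumes the archive was written by a Unix-style tool that stores the POSIX mode in the upper 16 bits of `ZipInfo.external_attr`; the helper names `is_zip_symlink` and `read_symlink_target` are hypothetical and not part of pip or the standard library.

```python
import io
import stat
import zipfile


def is_zip_symlink(info):
    """Return True if a ZipInfo entry looks like a symlink.

    Assumption: the writer stored the Unix mode in the high 16 bits
    of external_attr, as Unix zip tools conventionally do.
    """
    mode = info.external_attr >> 16
    return stat.S_ISLNK(mode)


def read_symlink_target(zf, info):
    """The entry's data is the link target path (assumed UTF-8 here)."""
    return zf.read(info).decode("utf-8")


# Build a tiny archive in memory containing one symlink entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    link = zipfile.ZipInfo("link")
    link.create_system = 3  # 3 means "Unix" in the zip format
    link.external_attr = (stat.S_IFLNK | 0o777) << 16
    zf.writestr(link, "target.txt")
    zf.writestr("target.txt", "hello")

with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        if is_zip_symlink(info):
            print(info.filename, "->", read_symlink_target(zf, info))
```

A stricter check might also require `info.create_system == 3` before trusting the mode bits, since archives written on other systems may use those bits differently.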

Then there are the OS differences. On UNIX-like platforms, symlinks (in tars) will be reconstructed normally. However, Windows presents certain challenges:

  1. Symlink support only appeared in Vista (is XP still a concern?)
  2. More importantly, it's a privileged operation (there's a capability for it, but it's only granted to administrators by default)

Hence, if pip is running as Administrator on a recent (Vista onwards) version of Windows, symlinks (in tars) will be reconstructed as symlinks. Otherwise, each one will be extracted as a regular file containing the target's contents (i.e. the target file will be duplicated under the symlink's name).
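The fallback described above can be sketched roughly as follows. This is only an illustration of the "try a symlink, otherwise duplicate the target" behaviour, not the code TarFile or pip actually uses; `link_or_copy` is a hypothetical helper, and it assumes the target has already been extracted.

```python
import os
import shutil


def link_or_copy(target, link_path):
    """Create link_path as a symlink to target, or copy the target's
    contents if the platform or privileges do not allow symlinks.

    Returns "symlink" or "copy" so the caller can report what happened.
    """
    try:
        os.symlink(target, link_path)
        return "symlink"
    except (OSError, NotImplementedError):
        # A relative target is resolved against the link's directory,
        # which is how a real symlink would be interpreted.
        source = os.path.join(os.path.dirname(link_path), target)
        shutil.copyfile(source, link_path)
        return "copy"
```

Either way, reading `link_path` afterwards yields the target's contents, which is why the difference often goes unnoticed until the target is modified.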

Questions

  1. Do we want pip to treat symlinks in all archives equally? I'm assuming that the current disparity between tars and zips should be corrected. Which leads on to...
  2. Do we want pip to handle symlinks at all? Given that symlink creation on Windows is unlikely to work without administrative privileges (and I don't think it's normal to run pip with administrative privileges on Windows?), there are several options:

Other stuff

Basically, I think unzip_file and untar_file need a bit of a re-work to make them consistent with each other. While I'm at it, there are a few other things that I'd like to fix:

  1. I'm not particularly happy that untar_file handles symlinks by calling an "internal" method of TarFile (_extract_member); ideally I'd like to re-work that to avoid such calls. I admit it's unlikely to change (it's been there almost unchanged since 2.7) but still, I don't like relying on undocumented methods.
  2. ZipFile.extract fixes illegal characters in filenames when extracting on Windows (oddly, TarFile.extract doesn't). It might be useful to add this functionality too, especially if the "always work" option is selected (as presumably the expectation there is that all archives should extract successfully on all platforms).
  3. What about hard links, FIFOs, and devices (all perfectly valid in a tar, and in some cases, zips)? My gut instinct is: throw an error, or possibly just a warning?
  4. Finally, I'd like to throw in some protection against absolute paths in archives (like ZipFile.extract). Currently pip doesn't guard against this, and given it's occasionally run as root on UNIX-like systems, that's a bit dangerous.
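As a rough sketch of items 3 and 4, the checks could look something like the following. Both helpers (`special_member_kind` and `is_within_directory`) are hypothetical names invented for illustration; the sketch uses only documented TarInfo predicates and `os.path` functions, and leaves the error-versus-warning decision to the caller.

```python
import os
import tarfile


def special_member_kind(member):
    """Name the kind of tar member we might refuse to extract,
    or return None for files, directories and symlinks."""
    if member.islnk():
        return "hard link"
    if member.isfifo():
        return "FIFO"
    if member.ischr() or member.isblk():
        return "device"
    return None


def is_within_directory(base_dir, member_name):
    """Return True if extracting member_name under base_dir would
    stay inside base_dir (rejects absolute paths and '..' escapes)."""
    base = os.path.abspath(base_dir)
    # os.path.join discards base when member_name is absolute, so an
    # absolute member resolves outside base and fails the check below.
    dest = os.path.abspath(os.path.join(base, member_name))
    try:
        return os.path.commonpath([base, dest]) == base
    except ValueError:
        # Raised on Windows when the paths are on different drives.
        return False


def check_member(member, base_dir):
    """Raise ValueError for members this sketch considers unsafe."""
    kind = special_member_kind(member)
    if kind is not None:
        raise ValueError(f"unsupported {kind} in archive: {member.name}")
    if not is_within_directory(base_dir, member.name):
        raise ValueError(f"member escapes target directory: {member.name}")
```

Note that this only checks the member's own path; a symlink whose target points outside the extraction directory would need a separate check on `member.linkname`.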