Use tarfile filters (PEP 706) when unpacking? · Issue #12111 · pypa/pip (original) (raw)
Description
Hello,
This is a continuation of #11464, which unfortunately turned sour.
New security releases of Python add a filter
argument to tarfile.extract
, which allows filtering common security issues.
Python 3.12 will raise a DeprecationWarning
if filter is not specified. I assume pip
will want to avoid the warning.
Python 3.14 will change the default to tarfile.data_filter.
What are your thoughts on how to best handle this? Happy to send a PR after a discussion.
Complicating factors:
pip
has its own elaborate code that's similar todata_filter
, with different details that might need to be preserved- for symlinks,
pip
reaches into CPython internals (tar._extract_member
), which may avoid both the filtering and the warning.
Possible solutions I see:
- Use
data_filter
on Pythons that support it. Accept behaviour differences -- AFAIK the biggest one is inmode
handling:data_filter
setsrwx?-??-?
if executable orrw-?--?--
if not (?
=keep info from tarball)pip
uses the OS default, then and adds all 3 executable bits if any are set in the tarball
- Use
data_filter
if available to raise exceptions (files/links outside target, device files), use existing mechanism if the filter doesn't fail - Use the
fully_trusted
filter to silence warnings
In the previous issue, @pradyunsg said:
Our security model is that sdists involve arbitrary code execution, so they can do basically anything in the build process.
My point was that I think we should (and do) protect against certain classes of issues though, such as dumping files in arbitrary locations.
The latter is currently not true: pip's unpacking is vulnerable to symlink attacks. The attached reproducer makes pip download evil.tar.gz
dump a file in /tmp/poc
, without running arbitrary code.
As said before, given pip's security model this isn't a security issue -- AFAIK it could only become one if the function is copied/used as documented elsewhere.
Expected behavior
No response
pip version
main
Python version
3.12
OS
*nix, any
How to Reproduce
mkevil.py
:
import tarfile
def mkinfo(name, **kwargs): tarinfo = tarfile.TarInfo(name=name) for name, value in kwargs.items(): setattr(tarinfo, name, value) return tarinfo
with tarfile.open('evil.tar.gz', 'w:gz') as tf: tf.addfile(mkinfo('./pyproject.toml')) tf.addfile(mkinfo('./tmp', type=tarfile.SYMTYPE, linkname='../../../../../../../../tmp')) tf.addfile(mkinfo('./tmp/poc'))
python mkevil.py
python -m pip download evil.tar.gz
ls ls /tmp/poc
Output
$ python mkevil.py
$ python -m pip download evil.tar.gz Looking in indexes: http://127.0.0.1:3141/root/pypi/+simple Processing ./evil.tar.gz File was already downloaded /home/pviktori/dev/one-offs/pocs/tarfile-pip/evil.tar.gz Installing build dependencies ... done Getting requirements to build wheel ... done Preparing metadata (pyproject.toml) ... done Successfully downloaded UNKNOWN
$ ls /tmp/poc # created by pip download
/tmp/poc
Code of Conduct
- I agree to follow the PSF Code of Conduct.
+1 on ensuring that when unpacking tarfiles, pip never either creates files outside of the target directory, or creates links to files outside of the target directory. I don't believe there is any valid form of sdist that needs the capability to do this (and if there is, I'd support a PEP modifying the sdist spec to make it illegal 🙂)
I'd be slightly uncomfortable if we had to make such protection only available in some Python versions, but I could live with it.
I have no opinion regarding differences in mode handling. As a Windows user I don't think it affects me, and in general, I don't think we should be trying to preserve mode bits anyway. I suspect someone, somewhere, will have a sdist that expects a particular mode bit to be preserved from source tree, through sdist and wheel, to final install (unless some other part of the chain already breaks this!) Again, I'd be happy to support a PEP that made such expectations illegal, though 😉
This latter is my biggest concern. The sdist format is hugely under-specified, and people could easily be using it in ways we don't know about and wouldn't necessarily sanction. There's a risk that we could end up needing to write a PEP to tighten up the standards (pip's policy is that we implement standard-defined behaviour, so we wouldn't implement significant restrictions unilaterally).
I'd be slightly uncomfortable if we had to make such protection only available in some Python versions, but I could live with it.
It's now in the latest security releases of all supported Python versions.
unless some other part of the chain already breaks this!
It does: Python's zipfile
ignores mode on extraction. (So do non-Windows filesystems and git
.)
So I guess the worry would be that the wheel-generating code of a Unix-only project writes file contents differently depending on file modes.
Guess I should mention another potential incompatibility: historically tarfile
honors owner user/group if run as the superuser, pip
and data_filter
ignore it.
This was referenced
Jul 3, 2023
The sdist format is hugely under-specified, and people could easily be using it in ways we don't know about and wouldn't necessarily sanction. There's a risk that we could end up needing to write a PEP to tighten up the standards (pip's policy is that we implement standard-defined behaviour, so we wouldn't implement significant restrictions unilaterally).
Here's the PEP proposal: https://discuss.python.org/t/using-tarfile-data-filter-for-source-distribution-extraction/28928
github-actions bot locked as resolved and limited conversation to collaborators