Policies for tarfile.extractall, a.k.a. fixing CVE-2007-4559 (original) (raw)

Hello,
You might have heard of CVE-2007-4559, the tarfile directory traversal vulnerability. Discussion about it has been going on and off since 2007, but no conclusion was reached.

tarfile’s previous maintainer, Lars Gustäbel, wrote a post on the issue before stepping down. I agree with the conclusion:

I understand the community’s desire to fix this issue no matter how. Unfortunately, fixing these kinds of issues is not trivial and requires a lot of effort.

Anyway, I was asked to fix this on our systems. I need to fix the issue, rather soon. And I’d rather fix it in CPython, and fix it well, so everyone can benefit.
I spent the last few days (literally) reading up on this, and here’s what I came up with. It’s long (sorry) and roughly PEP-shaped – probably appropriate after 15 years of discussions.

And it starts from a different angle than fixing a CVE.

Motivation

The tar format is user for several use cases, many of which have different needs. For example:

A backup of a UNIX workstation should faithfully preserve all kinds of details like file permissions, symlinks to system configuration, and various kinds of special files.
When unpacking a data bundle, it’s much more important that the unpacking will not have unintended consequences – like exposing password file by symlinking it to a public place.

To support all the use cases, tar has very many features. But in many cases, it’s best to ignore or disallow some of them when extracting an archive.

Python allows extracting tar archives using tarfile.TarFile.extractall(), whose docs warn to never extract archives from untrusted sources without prior inspection. However, it’s not clear what kind inspection should be done. Indeed, it’s quite tricky to do such an inspection correctly.
And so many people don’t bother, or do the check incorrectly, resulting in security issues such as CVE-2007-4559.

Since tarfile was first written, it’s become more accepted that warnings in documentation are not enough. Whenever possible, an unsafe operation should be explicitly requested; potentially dangerous operations should look dangerous. But, TarFile.extractall looks benign in a code review.

Tarfile extraction is also exposed via shutil.unpack_archive, which allows the user to not care about the kind of archive they’re dealing with. The API is very inviting to extracting archives without prior inspection, even though the docs again warn against it.

It could be argued that Python is not wrong – it behaves exactly as documented – but that’s beside the point. Let’s improve the situation rather than assign/avoid blame. Python and its docs are the best place to improve things.

Rationale

How do we improve things? Unfortunately, we will need to change the defaults, which implies breaking backwards compatibility. TarFile.extractall() is what people reach for when they need to extract a tarball. Its default behaviour needs to change.

What would be the best behaviour? That depends on the use case. So, I propose adding several general “policies” to control extraction. They are based on use cases, and ideally they should have straightforward security implications:

Current behavior: trusting the archive. Suitable e.g. as a building block for libraries that do the check themselves, or extracting an archive you just made yourself.
Unpacking a UNIX archive: roughly following GNU tar (e.g. stripping leading / from filenames).
Unpacking a general data archive: the shutil.unpack_archive use case.

Eventually, after a deprecation period, the last option – the most limited but most secure one – would become the default, and the others would be easy to specify.

Even with better general defaults, users should still verify the archives they extract, and perhaps modify some of the metadata. Superficially, the following looks like a reasonable way to do this today:

Call TarFile.getmembers()
Verify or modify each member’s TarInfo
Pass the result to extractall’s members

However, there are some issues with this approach:

TarInfo does not officially allow modification. You can modify it, but you can get surprising results since TarInfo objects are cached/shared.
Calling getmembers can be expensive and it requires a seekable archive.
When verifying members in advance, it may be necessary to track how each member would have changed the disk, e.g. how symlinks are being set up. This is hard. We can’t expect users to do it.

To solve these issues we’ll:

Provide a supported way to modify TarInfo when needed, without unnecessary work when not. A replace method, similar to what dataclass or namedtuple have, should do the trick.
Provide a way to hook into extractall’s loop to fitler/modify members as they are processed.
Require that this hook is called just before extracting each member, so it can scan the current state of the disk. This will greatly simplify the implementation of policies (both in stdlib and user code), at the cost of not being able to do a precise “dry run”.

The default policies described above will be implemented using the API for user-defined policies, so they can be used as building blocks or examples.

Full disclosure & redistributor info

I work for Red Hat, a redistributor of Python with different security needs and support periods than CPython in general. We’re planning to carry vendor patches to:

Allow configuring the defaults system-wide, and
Change the default as soon as possible, even in older Python versions.

The proposal should make this easy to do, and it allows users to query the settings.

Specification

Modifying and forgetting member metadata

The TarInfo class will gain a new method, replace(), which will work similarly to dataclasses.replace. It will return a deep copy of the TarInfo object with attributes replaced as specified by keyword arguments:

name
mtime
mode
linkname
uid
gid
uname
gname

Any of these, except name and linkname, will be allowed to be set to None.
When extract or extractall encounters such a None, it will not set that piece of metadata, leaving it as default (current user/group, mode based on umask, current time).
When addfile encounters such a None, it will raise an error. (It could also not store the attribute, if the format allows it, but that’s a possible future enhancement.)

The documentation will mention why the method is there: TarInfo objects retreived from TarFile.getmembers() are “live”, and modifiying them directly will affect subsequent unrelated operations.

Policies

TarFile.extract and TarFile.extractall methods will grow a policy parameter, which take a function with the signature:

policy(member: TarInfo) -> TarInfo|None

(XXX would filter be a better name than policy?)

When used it will be called on each member as it is extracted, and extraction will work with the result. On None the member will be skipped.

The function can also raise an exception. This can, depending on Tarfile.errorlevel, abort the extraction or cause the member to be skipped.

We will also provide a set of defaults for common use cases. In addition to a function, the policy argument will take one of the following strings: (XXX better names welcome)

'fully_trusted': Current behavior: honor the metadata as is. Should be used if the user trusts the archive completely, or implements their own complex verification.
'tar': Roughly follow defaults of the GNU tar command (when run as a normal user without --absolute-names):
- Strip leading / from filenames
- Refuse to extract files with a .. component in the filename
- Refuse to extract files whose absolute path (after following symlinks) would end up outside the destination. (Note that GNU tar instead delays creating some links.)
- Clear high mode bits (setuid, setgid, sticky) and group/other write bits (S_IWGRP|S_IWOTH). This is an approximation of tar’s default, which limits the mode by the current umask setting.
'data': Extract a “data” archive, disallowing common attack vectors but limiting functionality. In addition to tar:
- Refuse to extract links (hard or soft) which end up linking to a path outside of the destination.
- Refuse to extract device files (incl. pipes)
- For regular files and hard links:
  * Set the owner read and write permissions (S_IRUSR|S_IWUSR).
  (By now only the executable bits depend on information in the archive.)
  * Remove the group & other executable permission (S_IXGRP|S_IXOTH) if the user doesn’t have it (S_IXUSR).
- For other files (directories), ignore mode entirely (set it to None).
- Ignore user and group info (set uid, gid, uname, gname to None).
'warn': Like 'fully-trusted', but emit a DeprecationWarning for each member that would be changed or removed under the 'data' policy.

The default will be 'warn' in Python 3.12-3.13, and change to 'data' in Python 3.14.

The corresponding functions will be available as tarfile.fully_trusted_policy(), tarfile.tar_policy(), etc, so they can be easily used in custom policies.

A new exception, PolicyError, will be added. It’ll have several new subclasses: one for each of the refusal reasons above. PolicyError’s member attribute will contain the relevant TarInfo.

In the lists above, “refusing" to extract a file means that a PolicyError will be raised. If the TarFile.errorlevel is 1 or more, this will abort the extraction. With errorlevel=0, the error will be logged and the member will be ignored, but extraction will continue.
Note that extractall() may leave the archive partially extracted; it is the user’s responsibility to clean up.

(The default errorlevel is currently 1, not 0 as the documentation says. The docs will be fixed.)

A new method, TarInfo.apply_policy(self, policy=...), will raise PolicyError or return a (possibly modified) TarInfo to be extracted.
The docs will warn that calling this for all members does not do a “dry run” extraction (previously extracted files might make changes on disk that affect later members, e.g. set up symlinks).

Introspection

The module will get new constants DEFAULT_POLICY (name of the default policy, i.e. 'warn' for now) and POLICIES (a dict of available policy values).
Users are encouraged to look for these, rather than the Python version, to check whether policy is supported.
Applications and system integrators may wish to change DEFAULT_POLICY to suit their requirements.

Hints for further verification

Even with the proposed changes, tarfile will not be suited for extracting untrusted file without prior inspection. Among other issues, the proposed policies don’t prevent denial-of-service attacks. Users should do additional checks.

New docs will tell users to consider:

extracting to a new empty directory,
checking filenames against an allow-list of characters (to filter out control characters, confusables, etc.),
checking that filenames have expected extensions (discouraging files that execute when you “click on them”, and extension-less files like Windows special device names),
limiting the total size of extracted data, size of individual files, and number of files,
checking for files that would be shadowed on case-insensitive filesystems.

Also, the docs will note that:

tar files commonly contain multiple versions of the same file: later ones are expected to overwrite earlier ones on extraction.
tarfile does not protect against issues with “live” data, e.g. an attacker tinkering with the destination directory while extraction (or adding) is going on (see GNU tar manual for more info)

This list is not comprehensive, but the documentation is a good place to collect such general tips. It can be moved into a separate document if grows too long or if it needs to be consolidated with zipfile or shutil (which is out of scope for this proposal).

Shutil

shutil.unpack_archive will use tarfile’s default policy: 'warn' for a deprecation period, then 'data'.

Rejected ideas

SafeTarFile

An initial idea from Lars Gustäbel was to provide a separate class that implements security checks (tarfile: Traversal attack vulnerability · Issue #65308 · python/cpython · GitHub). There are two major issues with this approach:

The name is misleading. General achive operations can never be made “safe” from all kinds of unwanted behavior, without impacting legitimate use cases.
It does not solve the problem of unsafe defaults.

However, many of the ideas behind SafeTarFile can be used.

Add absolute_path option to tarfile

A minimal change to check the “CVE resolved” box doesn’t go far enough to protect the unaware, nor to empower the dilligent and curious.

Thanks

This proposal is based on prior work and discussions by many people, in particular Lars Gustäbel, Larry Hastings, Joachim Wagner, Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, and many others.