Policies for tarfile.extractall, a.k.a. fixing CVE-2007-4559 (original) (raw)

Hello,
You might have heard of CVE-2007-4559, the tarfile directory traversal vulnerability. Discussion about it has been going on and off since 2007, but no conclusion was reached.

tarfile’s previous maintainer, Lars Gustäbel, wrote a post on the issue before stepping down. I agree with the conclusion:

I understand the community’s desire to fix this issue no matter how. Unfortunately, fixing these kinds of issues is not trivial and requires a lot of effort.

Anyway, I was asked to fix this on our systems. I need to fix the issue, rather soon. And I’d rather fix it in CPython, and fix it well, so everyone can benefit.
I spent the last few days (literally) reading up on this, and here’s what I came up with. It’s long (sorry) and roughly PEP-shaped – probably appropriate after 15 years of discussions.

And it starts from a different angle than fixing a CVE.

Motivation

The tar format is user for several use cases, many of which have different needs. For example:

To support all the use cases, tar has very many features. But in many cases, it’s best to ignore or disallow some of them when extracting an archive.

Python allows extracting tar archives using tarfile.TarFile.extractall(), whose docs warn to never extract archives from untrusted sources without prior inspection. However, it’s not clear what kind inspection should be done. Indeed, it’s quite tricky to do such an inspection correctly.
And so many people don’t bother, or do the check incorrectly, resulting in security issues such as CVE-2007-4559.

Since tarfile was first written, it’s become more accepted that warnings in documentation are not enough. Whenever possible, an unsafe operation should be explicitly requested; potentially dangerous operations should look dangerous. But, TarFile.extractall looks benign in a code review.

Tarfile extraction is also exposed via shutil.unpack_archive, which allows the user to not care about the kind of archive they’re dealing with. The API is very inviting to extracting archives without prior inspection, even though the docs again warn against it.

It could be argued that Python is not wrong – it behaves exactly as documented – but that’s beside the point. Let’s improve the situation rather than assign/avoid blame. Python and its docs are the best place to improve things.

Rationale

How do we improve things? Unfortunately, we will need to change the defaults, which implies breaking backwards compatibility. TarFile.extractall() is what people reach for when they need to extract a tarball. Its default behaviour needs to change.

What would be the best behaviour? That depends on the use case. So, I propose adding several general “policies” to control extraction. They are based on use cases, and ideally they should have straightforward security implications:

Eventually, after a deprecation period, the last option – the most limited but most secure one – would become the default, and the others would be easy to specify.

Even with better general defaults, users should still verify the archives they extract, and perhaps modify some of the metadata. Superficially, the following looks like a reasonable way to do this today:

However, there are some issues with this approach:

To solve these issues we’ll:

The default policies described above will be implemented using the API for user-defined policies, so they can be used as building blocks or examples.

Full disclosure & redistributor info

I work for Red Hat, a redistributor of Python with different security needs and support periods than CPython in general. We’re planning to carry vendor patches to:

The proposal should make this easy to do, and it allows users to query the settings.

Specification

Modifying and forgetting member metadata

The TarInfo class will gain a new method, replace(), which will work similarly to dataclasses.replace. It will return a deep copy of the TarInfo object with attributes replaced as specified by keyword arguments:

Any of these, except name and linkname, will be allowed to be set to None.
When extract or extractall encounters such a None, it will not set that piece of metadata, leaving it as default (current user/group, mode based on umask, current time).
When addfile encounters such a None, it will raise an error. (It could also not store the attribute, if the format allows it, but that’s a possible future enhancement.)

The documentation will mention why the method is there: TarInfo objects retreived from TarFile.getmembers() are “live”, and modifiying them directly will affect subsequent unrelated operations.

Policies

TarFile.extract and TarFile.extractall methods will grow a policy parameter, which take a function with the signature:

policy(member: TarInfo) -> TarInfo|None

(XXX would filter be a better name than policy?)

When used it will be called on each member as it is extracted, and extraction will work with the result. On None the member will be skipped.

The function can also raise an exception. This can, depending on Tarfile.errorlevel, abort the extraction or cause the member to be skipped.

We will also provide a set of defaults for common use cases. In addition to a function, the policy argument will take one of the following strings: (XXX better names welcome)

The default will be 'warn' in Python 3.12-3.13, and change to 'data' in Python 3.14.

The corresponding functions will be available as tarfile.fully_trusted_policy(), tarfile.tar_policy(), etc, so they can be easily used in custom policies.

A new exception, PolicyError, will be added. It’ll have several new subclasses: one for each of the refusal reasons above. PolicyError’s member attribute will contain the relevant TarInfo.

In the lists above, “refusing" to extract a file means that a PolicyError will be raised. If the TarFile.errorlevel is 1 or more, this will abort the extraction. With errorlevel=0, the error will be logged and the member will be ignored, but extraction will continue.
Note that extractall() may leave the archive partially extracted; it is the user’s responsibility to clean up.

(The default errorlevel is currently 1, not 0 as the documentation says. The docs will be fixed.)

A new method, TarInfo.apply_policy(self, policy=...), will raise PolicyError or return a (possibly modified) TarInfo to be extracted.
The docs will warn that calling this for all members does not do a “dry run” extraction (previously extracted files might make changes on disk that affect later members, e.g. set up symlinks).

Introspection

The module will get new constants DEFAULT_POLICY (name of the default policy, i.e. 'warn' for now) and POLICIES (a dict of available policy values).
Users are encouraged to look for these, rather than the Python version, to check whether policy is supported.
Applications and system integrators may wish to change DEFAULT_POLICY to suit their requirements.

Hints for further verification

Even with the proposed changes, tarfile will not be suited for extracting untrusted file without prior inspection. Among other issues, the proposed policies don’t prevent denial-of-service attacks. Users should do additional checks.

New docs will tell users to consider:

Also, the docs will note that:

This list is not comprehensive, but the documentation is a good place to collect such general tips. It can be moved into a separate document if grows too long or if it needs to be consolidated with zipfile or shutil (which is out of scope for this proposal).

Shutil

shutil.unpack_archive will use tarfile’s default policy: 'warn' for a deprecation period, then 'data'.

Rejected ideas

SafeTarFile

An initial idea from Lars Gustäbel was to provide a separate class that implements security checks (tarfile: Traversal attack vulnerability · Issue #65308 · python/cpython · GitHub). There are two major issues with this approach:

However, many of the ideas behind SafeTarFile can be used.

Add absolute_path option to tarfile

A minimal change to check the “CVE resolved” box doesn’t go far enough to protect the unaware, nor to empower the dilligent and curious.

Thanks

This proposal is based on prior work and discussions by many people, in particular Lars Gustäbel, Larry Hastings, Joachim Wagner, Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, and many others.