Issue 27946: elementtree calls PyDict_GetItem without owning a reference to the dict (original) (raw)

I would like to describe an issue in the _elementtree module, and then propose a fix which would prevent this type of bug everywhere in the codebase.

The issue exists in _elementtree_Element_get_impl in /Modules/_elementtree.c. Here is the code:

static PyObject *
_elementtree_Element_get_impl(ElementObject *self, PyObject *key,
                              PyObject *default_value)
{
    PyObject* value;

    if (...)
        value = default_value;
    else {
        value = PyDict_GetItem(self->extra->attrib, key);
        ...
    }
    ...
}

We look up "key" in the dictionary, that is, in self->extra->attrib. This is done with the call to PyDict_GetItem(self->extra->attrib, key).

We need to hash the "key" object in order to find it in the dictionary. This sometimes requires calling the key's hash function, i.e., it might call on to python code.

What happens if the python code gets the dictionary (self->extra->attrib) freed? Then PyDict_GetItem will use it after it has been freed.

I've attached a proof-of-concept script which causes a use-after-free through _elementtree_Element_get_impl due to this issue.

We could fix this by calling Py_INCREF on self->extra->attrib before calling PyDict_GetItem. But grepping for "PyDict_GetItem.*->" shows that there are many places in the codebase where this situation occurs. Some of these may not have much impact, but some of them likely will.

All these bugs could be fixed at once by changing PyDict_GetItem to call Py_INCREF on the dictionary before it hashes the key.

Here's a PoC for the _elementtree module.

--- begin script ---

import _elementtree as et

state = { "tag": "tag", "_children": [], "attrib": "attr", "tail": "tail", "text": "text", }

class X: def hash(self): e.setstate(state) # this frees e->extra->attrib return 13

e = et.Element("elem", {1: 2}) e.get(X())

--- end script ---

Running it (64-bits Python 3.5.2, --with-pydebug) causes a use-after-free with control of the program counter:

(gdb) r ./poc13.py Starting program: /home/xx/Python-3.5.2/python ./poc13.py

Program received signal SIGSEGV, Segmentation fault. 0x00000000004939af in PyDict_GetItem (op=0x7ffff6d5b1a8, key=0x7ffff6d528e0) at Objects/dictobject.c:1079 1079 ep = (mp->ma_keys->dk_lookup)(mp, key, hash, &value_addr); (gdb) p mp->ma_keys->dk_lookup $8 = (dict_lookup_func) 0x7b7b7b7b7b7b7b7b

The issue is no longer reproduced by the original test because of the cache for dict key tables. But it is not gone, and can be reproduced with modified test.

There may be many similar bugs in the Python core end extensions. Adding incref/decref in PyDict_GetItem and similar C API functions could fix many of them, but perhaps not all, and it would hit performance. I suppose modt of uses of PyDict_GetItem are safe.