Containers, pseudo TTYs, and backward compatibility [LWN.net]

There is no doubt that the addition of container technologies to Linux has created a lot of value, allowing workloads to be effectively and efficiently isolated from each other. Implementing these technologies presents a number of challenges, particularly as much of Linux and Unix was designed to use singletons: objects of which there could never ever be more than one, such as host names, network routing tables, or process-ID namespaces. Containers require this design approach to be revised as they need multiple instances of these objects. A singleton that has been causing problems recently is the set of pseudo terminals (TTYs).

A pseudo TTY (or "PTY") is a pair of devices (a slave and a master) that provide a special sort of communication channel. The slave device behaves much like the device representing the VT100 or ADM-3A "dumb terminal" that we all have on our desks ... or that we might have had a few decades ago. It can read and write text as though it were a physical terminal, it can enable or disable echo of typed characters, etc. The master acts more like the person sitting in front of that dumb terminal. Writing to the master is exactly like typing on a terminal. If echo is enabled, then everything written can immediately be read back, and writing a backspace effectively causes the previous character typed to be forgotten. Modern computers typically have very few, if any, physical terminals, but potentially lots of PTYs to support text-based interfaces as provided by terminal emulators (such as xterm or gnome-terminal) and remote access interfaces like SSH.
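The master/slave relationship can be observed from user space. What follows is a minimal Python sketch, assuming a Linux-like system on which a PTY can be allocated. `type_on_master` is a hypothetical helper name, and the erase character is set to DEL (0x7f) explicitly so that the example does not depend on the system default.

```python
import os
import termios


def type_on_master(keystrokes: bytes) -> bytes:
    """Write keystrokes to a PTY master and return what the slave reads.

    keystrokes should end with a newline; in line-editing mode the slave
    only sees input once a complete line has been "typed".
    """
    master, slave = os.openpty()  # allocate a fresh master/slave pair
    try:
        attrs = termios.tcgetattr(slave)
        # attrs[3] is the local-mode flag word: ask for line editing and echo.
        attrs[3] |= termios.ICANON | termios.ECHO
        # attrs[6] holds the control characters: make DEL the erase key.
        attrs[6][termios.VERASE] = b"\x7f"
        termios.tcsetattr(slave, termios.TCSANOW, attrs)

        os.write(master, keystrokes)  # "typing" on the terminal
        return os.read(slave, 1024)   # what a program on the slave would see
    finally:
        os.close(master)
        os.close(slave)
```

Calling `type_on_master(b"cat\x7fr\n")` should hand the slave the line `car`: the erase character makes the terminal layer forget the `t` before any reader sees it.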

Opening a pseudo TTY

The history of pseudo TTYs contains the sort of mix of clever ideas and unfortunate choices that we've come to expect in fast-moving technology. The original implementation provided a fixed number of master/slave pairs which, like all other devices, had permanent device nodes in /dev. /dev/ptyp9 would be a master device, for example, and /dev/ttyp9 would be the matching slave device. An application or service that needed a PTY would try to open each master device in turn until it succeeded with one; it would then have exclusive access to that PTY. The application would change the ownership of the slave to match the user it was providing access to and hand the slave to whatever command shell or similar program was appropriate. If a non-privileged application needed to allocate a PTY it would need a setuid helper program to update the ownership of the slave device node in /dev.
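The search described above can be sketched in a few lines of Python. The text only names /dev/ptyp9 and /dev/ttyp9; the full naming scheme used here (a letter from [p-za-e] followed by a hex digit) is taken from the Linux pty(7) manual page, and the `opener` parameter is a hypothetical hook that lets the loop be exercised without real device nodes.

```python
import os

# Naming convention documented in Linux pty(7): /dev/ptyXY and /dev/ttyXY,
# with X in [p-za-e] and Y in [0-9a-f].
FIRST = "pqrstuvwxyzabcde"
SECOND = "0123456789abcdef"


def legacy_pty_names():
    """Yield (master, slave) device paths in search order."""
    for x in FIRST:
        for y in SECOND:
            yield f"/dev/pty{x}{y}", f"/dev/tty{x}{y}"


def open_legacy_pty(opener=os.open):
    """Try each master in turn; the first successful open wins the pair."""
    for master, slave in legacy_pty_names():
        try:
            fd = opener(master, os.O_RDWR | os.O_NOCTTY)
        except OSError:
            continue  # busy or missing: move on to the next candidate
        return fd, slave
    raise OSError("no free legacy PTY found")
```

The linear scan is the cost the next section complains about: with hundreds of pairs configured and most of them busy, every allocation pays for a long run of failed opens.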

While this worked, it was far from elegant, particularly as the number of PTYs configured on systems headed into the hundreds and the search for a free master device became a greater waste of time. So, with the "Unix98" Single Unix Specification (SUS), a new approach was adopted. An abstract interface was defined to allow the writing of portable code without imposing a single mechanism on all implementations, as the committee was not able to agree on any one universal mechanism. In February 1998, Linux 2.1.87 brought support for the /dev/ptmx multiplexing master device. Opening this device provides access to an otherwise unused pseudo TTY master and allows the matching slave to be identified using an ioctl(). This makes implementations of posix_openpt() and ptsname() quite straightforward.
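To illustrate how little is needed on top of /dev/ptmx, here is a hedged Python sketch of the open-then-ioctl() sequence. It is not glibc's code: the function names are hypothetical, and the fallback value for `TIOCGPTN` is the x86/ARM Linux encoding, which is an assumption that does not hold on every architecture.

```python
import fcntl
import os
import struct
import termios

# TIOCGPTN is _IOR('T', 0x30, unsigned int); 0x80045430 is its value on x86
# and ARM Linux. Other architectures may encode it differently (assumption).
TIOCGPTN = getattr(termios, "TIOCGPTN", 0x80045430)


def pty_index(master_fd: int) -> int:
    """Ask the kernel for the index of the slave paired with master_fd."""
    raw = fcntl.ioctl(master_fd, TIOCGPTN, struct.pack("I", 0))
    return struct.unpack("I", raw)[0]


def slave_path(index: int) -> str:
    """Build the slave name the way a simple ptsname() would."""
    return f"/dev/pts/{index}"


def open_ptmx():
    """Open the multiplexer and return (master_fd, slave_path)."""
    master_fd = os.open("/dev/ptmx", os.O_RDWR | os.O_NOCTTY)
    try:
        return master_fd, slave_path(pty_index(master_fd))
    except OSError:
        os.close(master_fd)
        raise
```

A real caller would still need grantpt() and unlockpt() before the slave can be opened; the sketch stops at naming it.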

In April of that year, Linux 2.1.93 added a new virtual filesystem called devpts that is normally mounted at /dev/pts. Whenever a new master/slave pair is created, a device node for the slave is created in that virtual filesystem. This device node is given an owner and group matching the owner and group of the process that opened /dev/ptmx, though either can be overridden by mount options. With this there is no need for a setuid helper program. At least there shouldn't be.

There is just one little problem: SUS requires that the group ID of the slave device should not be that of the creating process, but rather some definite, though unspecified, value. The GNU C Library (glibc) takes responsibility for implementing this requirement; it quite reasonably chooses the ID of the group named "tty" (often 5) to fill this role. If the devpts filesystem is mounted with options gid=5,mode=620, this group ID and the required access mode will be used and glibc will be happy. If not, glibc will (if so configured) run a setuid helper found at /usr/libexec/pt_chown.
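Whether a given system has the preferred options can be read off its mount table. The Python sketch below parses one line in /proc/mounts format; the function names are hypothetical, and this is only a way to inspect the mount, not a description of how glibc itself decides whether to call the helper.

```python
def devpts_options(mount_line: str) -> dict:
    """Parse the option field of a /proc/mounts-style line for devpts."""
    fields = mount_line.split()
    if len(fields) < 4 or fields[2] != "devpts":
        raise ValueError("not a devpts mount line")
    options = {}
    for item in fields[3].split(","):
        key, _, value = item.partition("=")
        options[key] = value  # flags such as "rw" map to an empty string
    return options


def has_preferred_options(mount_line: str, tty_gid: int = 5) -> bool:
    """True if the mount hands out slaves with group tty and mode 620."""
    options = devpts_options(mount_line)
    try:
        gid_ok = int(options.get("gid", "-1")) == tty_gid
        mode_ok = int(options.get("mode", "0"), 8) == 0o620
    except ValueError:
        return False  # unparsable value: treat as not matching
    return gid_ok and mode_ok
```

The default `tty_gid=5` mirrors the "often 5" in the text; on a real system the number should be looked up from the group database rather than assumed.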

As Eric Biederman discovered, xen-create-image mounts devpts in a chroot while creating a new root filesystem, and does so without these options. Just why this is interesting will become clear a little later.

Seeing the singletons

This design for PTYs created two related singletons: the master multiplexer /dev/ptmx and the slave virtual filesystem /dev/pts. Abstracting a singleton object to be different in different containers has been done multiple times and the process is well understood. When there are two distinct but related singletons as we have here, there is more complexity that must be carefully managed. These details were thought to have been addressed back in 2009 when container support was added to the pseudo TTY subsystem.

With this change, it became possible to mount distinct instances of the devpts filesystem, each with its own set of pseudo TTYs. A new "ptmx" file was created inside the mounted devpts filesystem instance; opening this pts/ptmx file would always create slave nodes in the same filesystem instance. It was expected that /dev/ptmx would be changed to be a symbolic link to pts/ptmx; containers could then just mount their own devpts filesystem and, as it was now a self-contained entity, everything would be happy. Unfortunately not everyone got the memo. While some container libraries configure /dev/ptmx like this, the practice isn't universal.

The last piece of the puzzle is that a device node for ptmx that is created explicitly with mknod, rather than created implicitly in a devpts instance, is still a singleton, so there must be a unique, global devpts filesystem where slave nodes are created when the singleton ptmx node is opened. To ensure backward compatibility, an attempt to mount a devpts filesystem will normally mount this single instance unless the newinstance mount option is provided. This way, old installations get what they expect, while new code has control and can get what it needs. It seems like a reasonably clean, if slightly inelegant, solution.

Unfortunately there is a problem, and here at last we find out why that setuid helper program is relevant. Setuid programs are always a little bit risky — it is important that they cannot be tricked into doing the wrong thing, so they must be provided with complete information in ways that cannot easily be forged. The setuid pt_chown tool is given the master side of the new PTY as an open file descriptor and the user ID to change its ownership to as the process's real UID. It then needs to find the slave node, which can be done using ptsname(). In a system with multiple devpts instances mounted, the information pt_chown gives to ptsname() is no longer complete as it does not identify which devpts instance to use; that can lead to unfortunate consequences.

What is your real name?

In Linux 3.9, it became possible for an unprivileged user to mount a new devpts instance in a private user namespace. If that user ensured this mount was still visible in the global namespace, a program running there would be able to open the new ptmx device and get a file descriptor of a PTY master with any arbitrary index number. This would be quite distinct from any PTY in the global /dev/pts/ but, crucially, ptsname() doesn't know that. ptsname() simply calls the ioctl() to find the index number for the PTY, and constructs a path name in /dev/pts/. So if the master file descriptor could be passed to a setuid pt_chown, it would change the ownership of a PTY that was, in all probability, owned by someone else. The ability to take ownership of a PTY connected to a root shell, for example, has obvious value to somebody wishing to compromise the system.
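The confusion is easy to reproduce in a toy model. The Python sketch below involves no real devices: `DevptsInstance`, `naive_ptsname`, and `pt_chown_model` are hypothetical stand-ins for the kernel's per-instance index space, a ptsname() that only knows the index, and the setuid helper.

```python
class DevptsInstance:
    """Toy model of one mounted devpts instance with its own index space."""

    def __init__(self, name):
        self.name = name
        self.owners = {}  # slave index -> owner of that slave node

    def open_ptmx(self, owner):
        """Allocate the lowest free index; the tuple models a master fd."""
        index = 0
        while index in self.owners:
            index += 1
        self.owners[index] = owner
        return (self, index)


def naive_ptsname(master):
    """Name the slave from the index alone, ignoring the instance."""
    _instance, index = master
    return f"/dev/pts/{index}"


def pt_chown_model(master, new_owner, system_instance):
    """Change the owner of whatever naive_ptsname() names in /dev/pts."""
    path = naive_ptsname(master)
    index = int(path.rsplit("/", 1)[1])
    if index in system_instance.owners:
        system_instance.owners[index] = new_owner
    return path


system = DevptsInstance("system")
root_master = system.open_ptmx("root")         # a root shell's PTY
private = DevptsInstance("private")
evil_master = private.open_ptmx("mallory")     # same index, other instance
pt_chown_model(evil_master, "mallory", system)
print(system.owners[0])  # the system's slave 0 now belongs to mallory
```

Both masters resolve to the same path because the index is the only thing the name is built from, which is exactly the missing information the text describes.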

The obvious response to this would be to get rid of pt_chown, simply because it is a setuid program that isn't needed, provided devpts is mounted with the proper options. Unfortunately this isn't obvious to everyone. Pulling other threads of this story together, if you run xen-create-image, it will mount the devpts filesystem within its chroot environment with no options. This mounts the singleton instance, and imposes the default options on it (which are not the preferred options). If, as is normally the case, the singleton instance is already mounted outside the chroot at /dev/pts with options like gid=5,mode=620, the default options will be implicitly imposed there too, overwriting the previous options. Without pt_chown this will result in slave PTYs getting the wrong group owner, which breaks a number of applications.
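The way a second mount clobbers the first can be modeled in a few lines. This is a deliberately simplified Python sketch, not kernel code: `SingletonDevpts` is a hypothetical class, the chroot path is made up, and the default values shown (no gid override, mode 600) are an assumption about the kernel's defaults.

```python
# Assumed defaults: no group override and mode 0600.
DEFAULT_OPTIONS = {"gid": None, "mode": 0o600}


class SingletonDevpts:
    """One shared instance: every mount point sees the same options."""

    def __init__(self):
        self.options = dict(DEFAULT_OPTIONS)
        self.mount_points = []

    def mount(self, where, **options):
        # Options are re-derived from the defaults on each mount, so a
        # mount with no options resets whatever was set before.
        self.options = {**DEFAULT_OPTIONS, **options}
        self.mount_points.append(where)


devpts = SingletonDevpts()
devpts.mount("/dev/pts", gid=5, mode=0o620)  # what the distribution set up
devpts.mount("/srv/chroot/dev/pts")          # option-less mount in a chroot
print(devpts.options)  # the system /dev/pts has lost gid=5,mode=620
```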

The right solution would be to fix xen-create-image. Biederman reported that, rather than taking that approach, "some distributions have been working around this problem by continuing to install a setuid root pt_chown binary that will be called by glibc to fix the permissions." In the absence of multiple-instance devpts, this may be clumsy but it works and has no obvious security problems. With the introduction of multiple-instance devpts and allowing unprivileged mounts in user namespaces, a potential security issue has arisen. This problem came up as a result of changes in the kernel, so it is up to the kernel developers to address the issue, even though it is only there because of questionable practices in user-space code.

Just to be clear, this problem only affects installations with a setuid pt_chown, with a v3.9 or later Linux kernel, and with the non-default CONFIG_USER_NS configuration option enabled.

Search for a solution

The best solution would address the problem without any change to user space. This would require a kernel change so that the current pt_chown program either failed to recognize the given file descriptor as representing a PTY master, or failed when it tried to change ownership. Neither of these is possible with any sort of elegance. pt_chown identifies the master by using the TIOCGPTN ioctl which just returns a number used to identify the slave. Any change to this would be likely to break some other program too.

As no kernel change could be sufficient, some glibc change is required. It would probably be possible to change ptsname() to perform more checks before performing the ownership change, such as ensuring that the inode and device numbers of the passed file descriptor match those reported by lstat("/dev/ptmx"). It may even be possible to make this work reliably on all kernels and all Linux distributions. But this is far from certain and the approach only serves to perpetuate an undesirable setuid program: nobody really wants that.
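The kind of check being discussed might look like the following Python sketch. `is_same_node` is a hypothetical name, and this only illustrates the comparison itself; whether such a test would be reliable everywhere is precisely what the text doubts.

```python
import os


def is_same_node(fd: int, path: str = "/dev/ptmx") -> bool:
    """Compare the device and inode numbers of fd with those of path."""
    opened = os.fstat(fd)
    named = os.lstat(path)  # lstat() as in the text: symlinks are not followed
    return (opened.st_dev, opened.st_ino) == (named.st_dev, named.st_ino)
```

A helper could refuse to act on any descriptor for which this returns False, on the theory that a master opened through some other ptmx node belongs to some other devpts instance.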

So the preferred approach is a glibc configuration change rather than a code change: deprecate pt_chown and convince all distributions to remove it. This suggests that a way needs to be found to support the somewhat strange usage in xen-create-image without the need for pt_chown.

Biederman's plan for this is to discard the "singleton" instance of devpts completely. Once implemented, every mount of devpts will be a new instance, so when xen-create-image performs the mount with default options, that won't affect the mount options on the system /dev/pts. If all distributions had already changed /dev/ptmx to be a symlink to pts/ptmx in all cases, this would be a trivial change and nobody would notice. But that change has not been universally made.

Since the separate ptmx device node (created via mknod() in /dev) is still widely in use, it must be changed so that, instead of using the singleton devpts (which will no longer exist), it uses the right one for the particular context in which it is accessed. When Biederman's new version of ptmx is opened, it will look for the name "pts" in the same directory in which ptmx was found and see if that is a mount point of the devpts filesystem. If it is, a new PTY will be allocated in that filesystem. If it isn't, an error will be returned.
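The lookup rule just described can be written down as a small Python model. It works on path strings only and does not touch the kernel: `devpts_for` is a hypothetical name, the set of mount points is supplied by the caller, and the exception type is an assumption, since the text says only that an error is returned.

```python
import posixpath


def devpts_for(ptmx_path: str, devpts_mounts: set) -> str:
    """Return the devpts mount that an open of ptmx_path would use."""
    # Look for a sibling named "pts" in the directory holding ptmx.
    sibling = posixpath.join(posixpath.dirname(ptmx_path), "pts")
    if sibling in devpts_mounts:
        return sibling
    # The text only says "an error will be returned"; the exception type
    # used here is an assumption.
    raise FileNotFoundError(f"no devpts mounted at {sibling}")
```

With devpts mounted at both /dev/pts and a hypothetical /srv/chroot/dev/pts, opening /dev/ptmx resolves to the former and /srv/chroot/dev/ptmx to the latter, so the same device node behaves differently depending on where it is found.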

This is undoubtedly an odd behavior for a character-special device to have. We are used to symbolic links behaving differently when found in different directories, but not character devices. There was a suggestion that the ptmx device could make itself look like a symbolic link, but this turned out to be much more easily said than done, and it is not clear that this would be a more elegant solution, just a different one.

To summarize the intent behind these changes: enabling a regular ptmx device to find and use a nearby mount of devpts makes it possible to get rid of the singleton version of the devpts filesystem and have every mount create a new instance. Once this is done, the unusual usage in xen-create-image (or any other unusual usage that might be out there) will only have local effect and cannot impose non-standard options on the "system" /dev/pts. Then there will no longer be any excuse to install pt_chown, so we can strongly encourage distributions and users to remove it. At that point, the security problems that arise when enabling both CONFIG_USER_NS and pt_chown will no longer be an issue.

Might this break something else?

This change, first proposed by Biederman in December, has not had an easy path to the kernel. Once it became clear that some sort of semantic change was needed, the question arose as to which changes might be safe and which changes might break things. Linus Torvalds's dictum that we mustn't break user space does not mean that we cannot change the behavior of the kernel, only that we cannot change a behavior that is reasonably being depended on. There was some disagreement as to exactly what could or could not be changed in this case.

To his great credit, Biederman has assembled a considerable array of different distributions to test his changes on. The most recent patch makes the claim that:

This has been verified to work properly on openwrt-15.05, centos5, centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3, ubuntu-15.10, fedora23, magia-5, mint-17.3, opensuse-42.1, slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01. With the caveat that on centos6 and on slackware-14.1 that there wind up being two instances of the devpts filesystem mounted on /dev/pts, the lower copy does not end up getting used.

which provides quite a high level of confidence that existing behaviors aren't broken.

This last patch is considerably smaller than some earlier attempts, in part because Torvalds committed some clean-up patches himself to make the code more approachable. It was a little too late for Greg Kroah-Hartman (as TTY maintainer) to accept it for the 4.7 cycle, but it seems likely that this new approach to devpts, where every mount is a new instance, will land for Linux 4.8. The next step will, presumably, be to actively encourage those distributions that currently ship a setuid pt_chown to stop doing so.

Index entries for this article
Kernel Containers
GuestArticles Brown, Neil