Avoiding sysfs surprises
One of the nice (and increasingly important) features of the 2.5 device model is sysfs. This virtual filesystem exports a view of the system's structure to user space; it also provides a nice control interface - and /proc replacement - by allowing attributes to be attached to sysfs entries. Sysfs is not without its traps, however, and many kernel developers are just now beginning to realize the sort of care that is necessary to avoid making mistakes.
The hardware supported by Linux is increasingly dynamic; devices can appear and disappear at any time. The sysfs filesystem adjusts itself in response to hardware events by creating and removing directories associated with devices, classes, and other objects. Kernel code typically implements this functionality by allocating (and registering) device structures and other objects when a device is plugged in, and deleting those structures when the device is removed. It tends to work quite well.
But consider the following possible sequence of events:
- A user plugs in a shiny new hotplug PCI frobnicator.
- The driver creates a device structure and registers it; as a result, the directory /sys/devices/pci0/00:11.0/ (or some such) gets created and filled with attributes.
- A user process moves into that directory, opens one of the attribute files, but doesn't get around to reading it yet.
- The user, having done enough frobnication for one day, unplugs the device.
- The driver unregisters and frees the device structures.
All seems well, except for the small problem of that user process. By sitting in the directory, it maintains a reference there. The open attribute file is yet another reference. If the driver has truly cleaned up and freed the devices, the user process will be holding structures with pointers into freed memory. An attempt to read the (already open) attribute file at this point is almost certain to crash the system.
The above scenario is not hypothetical; a fair number of such conditions exist in the 2.5 kernel now. That is why this issue (titled "kobject refcounting") appears in the 2.6 must-fix list. It truly must be fixed.
The infrastructure exists to handle these problems, but it must be used properly to be effective. The solution lies in the same place as the problem - the kobject structure. The 2.5.72 version of this structure looks like:
    struct kobject {
        char name[KOBJ_NAME_LEN];
        atomic_t refcount;
        struct list_head entry;
        struct kobject *parent;
        struct kset *kset;
        struct kobj_type *ktype;
        struct dentry *dentry;
    };
Entries in sysfs are closely tied to kobjects; there is a kobject associated with each directory in the filesystem. When a process moves into a sysfs directory or opens a sysfs file, the associated kobject has its refcount field incremented. As long as the reference count is above zero, the kobject cannot be deleted.
The same kobjects, of course, are embedded deeply within the structures used to represent devices and other system objects. So a nonzero reference count in a kobject means that the entire device structure (and, perhaps, the module infrastructure supporting it) is still in use. Safely putting things into sysfs is really just a matter of not deleting objects until their reference count hits zero.
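The rule can be illustrated with a small user-space sketch. Nothing here is kernel code: struct sketch_dev and the sketch_dev_*() functions are hypothetical names invented for the illustration, and a plain int stands in for the kernel's atomic_t. The point is only that the object survives until the last holder lets go.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device structure carrying a reference count. */
struct sketch_dev {
    int refcount;     /* the kernel uses atomic_t; a plain int suffices here */
    int *freed_flag;  /* set to 1 on release so the caller can observe it */
};

/* Allocate the object; the creator starts out holding one reference. */
struct sketch_dev *sketch_dev_create(int *freed_flag)
{
    struct sketch_dev *dev = malloc(sizeof(*dev));

    if (!dev)
        return NULL;
    dev->refcount = 1;
    dev->freed_flag = freed_flag;
    *freed_flag = 0;
    return dev;
}

/* Take an additional reference, as opening a sysfs file would. */
struct sketch_dev *sketch_dev_get(struct sketch_dev *dev)
{
    assert(dev->refcount > 0);
    dev->refcount++;
    return dev;
}

/* Drop a reference; only the final put actually frees the memory. */
void sketch_dev_put(struct sketch_dev *dev)
{
    assert(dev->refcount > 0);
    if (--dev->refcount == 0) {
        *dev->freed_flag = 1;
        free(dev);
    }
}
```

In terms of the scenario above: the driver's sketch_dev_create() call yields the first reference, the process opening the attribute file corresponds to sketch_dev_get(), and the driver's cleanup at unplug time corresponds to a sketch_dev_put() that leaves the count at one. The memory goes away only when the process closes the file and the final put runs.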
Of course, that is easily said, but the current mechanism for implementing such a policy is not entirely obvious. An example might help, so we'll look at the block subsystem, which does things right. Disks, within the kernel, are represented by the gendisk structure. The function used to create a gendisk is alloc_disk(), which, after allocating and initializing a gendisk structure (which contains a kobject), executes this mysterious line of code:
    kobj_set_kset_s(disk, block_subsys);
This line tweaks the kobject within disk (the gendisk structure) to make it a part of block_subsys. The block subsystem structure, in turn, contains a pointer to a kobj_type structure, which, in this case, looks like:
    static struct kobj_type ktype_block = {
        .release = disk_release,
        .sysfs_ops = &disk_sysfs_ops,
        .default_attrs = default_attrs,
    };
We'll come back to this structure in a moment. For now, suffice it to say that it identifies the kobject (and the gendisk structure that contains it) as something belonging to the block code, and provides some methods implementing the object's operations.
The function which puts a new disk into the system is add_disk(); it creates the associated sysfs structure, and increments the disk's reference count. The disk then goes through its lifecycle, with the reference count going up and down as it is mounted and unmounted, and as its sysfs files are accessed. Should the disk disappear, the driver will do some cleanup and call del_gendisk() to return the gendisk structure to the system.
del_gendisk() does not actually free the structures, however. It removes the sysfs entries and generally shuts things down; it then finishes by decrementing the reference count. That operation releases the reference which was first obtained in add_disk(). The driver also must release its own reference with put_disk(). These operations may drop the reference count to zero - if nobody else is holding a reference to the disk. But there is no way to know ahead of time.
Sooner or later, however, the last reference will go away. The function which actually decrements the count (kobject_put()) tests that count for zero. If no references remain, kobject_put() will go back to the kobj_type structure associated with the kobject (the ktype_block we saw above, in the case of a gendisk) and call the release() method found there. That method, knowing that nobody is referring to the object, can actually remove it from the system.
That is how sysfs objects must be managed. They must have a destructor associated with them, by way of the kobj_type structure, and that destructor must understand the higher-level objects that it is dealing with. With this mechanism in place, objects will continue to exist as long as references to them are held.
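What it means for the destructor to "understand the higher-level object" may be easier to see in a simplified user-space model. The sketch below is not the block layer's code: every sketch_* name is hypothetical, the structures are cut down to the fields needed for the illustration, and sketch_container_of() is a user-space imitation of the kernel's container_of() macro. It shows a release method receiving only the embedded object and working its way back to the structure that contains it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space imitation of the kernel's container_of() macro. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct sketch_kobj;

/* Cut-down stand-in for kobj_type: just the release method. */
struct sketch_ktype {
    void (*release)(struct sketch_kobj *kobj);
};

/* Cut-down stand-in for kobject. */
struct sketch_kobj {
    int refcount;
    const struct sketch_ktype *ktype;
};

/* Hypothetical higher-level object, playing the role of gendisk. */
struct sketch_disk {
    int *released;            /* lets the caller observe the release */
    struct sketch_kobj kobj;  /* embedded in the structure, not pointed to */
};

/* The destructor: handed the embedded kobj, it frees the whole disk. */
static void sketch_disk_release(struct sketch_kobj *kobj)
{
    struct sketch_disk *disk =
        sketch_container_of(kobj, struct sketch_disk, kobj);

    *disk->released = 1;
    free(disk);
}

static const struct sketch_ktype sketch_ktype_disk = {
    .release = sketch_disk_release,
};

/* Allocate a disk whose embedded kobj starts with one reference. */
struct sketch_disk *sketch_disk_alloc(int *released)
{
    struct sketch_disk *disk = malloc(sizeof(*disk));

    if (!disk)
        return NULL;
    *released = 0;
    disk->released = released;
    disk->kobj.refcount = 1;
    disk->kobj.ktype = &sketch_ktype_disk;
    return disk;
}

struct sketch_kobj *sketch_kobj_get(struct sketch_kobj *kobj)
{
    kobj->refcount++;
    return kobj;
}

/* Drop a reference; on the last one, call the type's release method. */
void sketch_kobj_put(struct sketch_kobj *kobj)
{
    assert(kobj->refcount > 0);
    if (--kobj->refcount == 0 && kobj->ktype && kobj->ktype->release)
        kobj->ktype->release(kobj);
}
```

The generic put routine knows nothing about disks; it simply calls whatever release method the type supplies. Only that method knows the embedded object lives inside a larger structure, which is why each kind of object needs its own destructor.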
Of course, things can get more complicated than that. If, for example, a module adds attributes to sysfs entries, that module cannot be removed until it is certain that all of the relevant references have gone away. It gets even worse if kernel code tries to attach attributes to objects which it does not own; in that case it can be very hard to get everything right. It may eventually prove necessary to rework some of the sysfs interfaces to make it easier to avoid mistakes, but that seems unlikely for 2.5 at this point. In the meantime, connecting the pieces together correctly can be an intimidating task the first time around, but the alternative is to put denial-of-service vulnerabilities into the kernel.