Kernel development [LWN.net]
Kernel release status
The current 2.6 prepatch is 2.6.24-rc8, released by Linus on January 15. It contains a fair number of fixes but not much else. "So I'm pretty sure this is the last -rc, and the final 2.6.24 will probably be out next weekend or so. But in the meantime, let's give this a final shakedown, and see if we can fix any last regressions still." See the long-format changelog for the details.
As of this writing, a very small number of fixes have been merged post-rc8.
There have been no -mm releases over the last week.
The current stable 2.6 kernel is 2.6.23.14, released (along with 2.6.22.16) on January 14. These releases contain a single patch: a fix for the filesystem security vulnerability discussed on this week's Security Page.
For older kernels: 2.6.16.58 was released on January 16 with several fixes.
Quote of the week
I wonder what a tiny, SANE register-based bytecode interface might look like. Have a single page shared between kernel and userland, for each thread. Userland fills that page with bytecode, for a virtual machine with 256 registers -- where instructions roughly equate to syscalls.
The common case -- a single syscall like open(2) -- would be a single byte bytecode, plus a couple VM register stores. The result is stored in another VM register.
But this format enables more complex cases, where userland programs can pass strings of syscalls into the kernel, and let them execute until some exceptional condition occurs. Results would be stored in VM registers (or userland addresses stored in VM registers...).
-- Jeff Garzik
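As a thought experiment, the idea can be modeled in a few lines. The sketch below is purely illustrative: the opcode encoding, the register file, and the stand-in "syscall" are all invented here, not taken from any actual proposal.

```python
# Toy model of the register-based syscall bytecode idea: a shared page
# would hold a program, and a small VM executes it instruction by
# instruction until an exceptional result stops the chain.  All opcodes
# and semantics here are invented for illustration.

OP_SYSCALL, OP_HALT = 0x01, 0x00

def run(program, registers, syscall_table):
    """Execute (opcode, *operands) tuples against the VM registers."""
    pc = 0
    while pc < len(program):
        op = program[pc]
        if op[0] == OP_HALT:
            break
        if op[0] == OP_SYSCALL:
            # op = (OP_SYSCALL, syscall_nr, arg_registers, result_register)
            _, nr, arg_regs, result_reg = op
            args = [registers[r] for r in arg_regs]
            result = syscall_table[nr](*args)
            registers[result_reg] = result
            if isinstance(result, int) and result < 0:
                break               # exceptional condition: stop the chain
        pc += 1
    return registers

# A stand-in "open" that succeeds for one path only.
table = {5: lambda path: 3 if path == "/etc/motd" else -2}
regs = [0] * 256                    # the 256 VM registers
regs[1] = "/etc/motd"               # "couple VM register stores"
run([(OP_SYSCALL, 5, [1], 0), (OP_HALT,)], regs, table)
print(regs[0])                      # the result lands in VM register 0
```

The common case really is tiny: one instruction plus a register store. The more interesting case is a longer program, where each instruction's result feeds the next until an error halts execution.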
A better btrfs
Chris Mason has recently released Btrfs v0.10, which contains a number of interesting new features. In general, Btrfs has come a long way since LWN first wrote about it last June. Btrfs may, in some years, be the filesystem most of us are using - at least, for those of us who will still be using rotating storage then. So it bears watching.
Btrfs, remember, is an entirely new filesystem being developed by Chris Mason. It is a copy-on-write system which is capable of quickly creating snapshots of the state of the filesystem at any time. The snapshotting is so fast, in fact, that it is used as the Btrfs transactional mechanism, eliminating the need for a separate journal. It supports subvolumes - essentially the existence of multiple, independent filesystems on the same device. Btrfs is designed for speed, and also provides checksumming for all stored data.
Some kernel patches show up and quickly find their way into production use. For example, one year ago, nobody (outside of the -ck list, perhaps) was talking about fair scheduling; but, as of this writing, the CFS scheduler has been shipping for a few months. KVM also went from initial posting to merged over the course of about two kernel release cycles. Filesystems do not work that way, though. Filesystem developers tend to be a cautious, conservative bunch; those who aren't that way tend not to survive their first few encounters with users who have lost data. This is all a way of saying that, even though Btrfs is advancing quickly, one should not plan on using it in any sort of production role for a while yet. As if to drive that point home, Btrfs still crashes the system when the filesystem runs out of space. The v0.10 patch, like its predecessors, also changes the on-disk format.
The on-disk format change is one of the key features in this version of the Btrfs patch. The format now includes back references on almost all objects in the filesystem. As a result, it is now easy to answer questions like "to which file does this block belong?" Back references have a few uses, not the least of which is the addition of some redundant information which can be used to check the integrity of the filesystem. If a file claims to own a set of blocks which, in turn, claim to belong to a different file, then something is clearly wrong. Back references can also be used to quickly determine which files are affected when disk blocks turn bad.
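The integrity check that back references enable is simple to demonstrate. In this sketch (the data layout is invented; real Btrfs metadata is far richer), every block a file claims to own should point back at that file, and any mismatch flags an inconsistency:

```python
# Toy consistency check of the kind back references make possible:
# cross-check the file -> blocks mapping against the block -> file
# back references.  Data structures here are invented for illustration.

forward = {"fileA": [1, 2], "fileB": [3]}        # file -> owned blocks
backrefs = {1: "fileA", 2: "fileA", 3: "fileC"}  # block -> owning file

def inconsistent_blocks(forward, backrefs):
    """Return blocks whose back reference disagrees with the file
    that claims them - a sign of filesystem corruption."""
    return [b for f, blocks in forward.items()
              for b in blocks if backrefs.get(b) != f]

print(inconsistent_blocks(forward, backrefs))  # [3]: claimed by fileB,
                                               # but owned by fileC
```

The same block-to-file mapping answers the bad-block question directly: given a failed block number, one lookup names the affected file.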
Most users, however, will be more interested in another new feature which has been enabled by the existence of back references: online resizing. It is now possible to change the size of a Btrfs filesystem while it is mounted and busy - this includes shrinking the filesystem. If the Btrfs code has to give up some space, it can now quickly find the affected files and move the necessary blocks out of the way. So Btrfs should work nicely with the device mapper code, growing or shrinking filesystems as conditions require.
Another interesting feature in v0.10 is the associated in-place ext3 converter. It is now possible to non-destructively convert an existing ext3 filesystem to Btrfs - and to go back if need be. The converter works by stashing a copy of the ext3 metadata found at the beginning of the disk, then creating a parallel directory tree in the free space on the filesystem. So the entire ext3 filesystem remains on the disk, taking up some space but preserving a fallback should Btrfs not work out. The actual file data is shared between the two filesystems; since Btrfs does copy-on-write, the original ext3 filesystem remains even after the Btrfs filesystem has been changed. Switching to Btrfs forevermore is a simple matter of deleting the ext3 subvolume, recovering the extra disk space in the process.
Finally, the copy-on-write mechanism can be turned off now with a mount option. For certain types of workloads, copy-on-write just slows things down without providing any real advantages. Since (1) one of those workloads is relational database management, and (2) Chris works for Oracle, the only surprise here is that this option took as long as it did to arrive. If multiple snapshots reference a given file, though, copy-on-write is still performed; otherwise it would not be possible to keep the snapshots independent of each other.
For those who are curious about where Btrfs will go from here, Chris has posted a timeline describing what he plans to accomplish over the coming year. Next on the list would appear to be "storage pools," allowing a Btrfs filesystem to span multiple devices. Once that's in place, striping and mirroring will be implemented within the filesystem. Longer-term projects include per-directory snapshots, fine-grained locking (the filesystem currently uses a single, global lock), built-in incremental backup support, and online filesystem checking. Fixing that pesky out-of-space problem isn't on the list, but one assumes Chris has it in the back of his mind somewhere.
Unprivileged mounts
By Jonathan Corbet
January 15, 2008
There are a number of filesystem-related patches aimed at the upcoming 2.6.25 merge window; one of those is the unprivileged mount patch by Miklos Szeredi. This patch enables an unprivileged user process to call the mount() system call and - in certain circumstances - have that call actually succeed. It could eventually lead to a situation where users have more flexibility to create their own environments and the setuid mount utility is no longer needed.
This patch adds a new field (uid) to the vfsmount structure, allowing the kernel to keep track of the owner of a specific filesystem mount. The system administrator can give ownership of a specific mount to a user with the new MNT_SETUSER flag. A common pattern might be to bind-mount a user's home directory on top of itself, giving the user the ownership of that mount. Once that has been done, the user is allowed to freely mount other filesystems below that mount point - with a couple of conditions:
- There is a system-wide limit on the number of allowed user mounts; once that limit is hit, no more unprivileged mounts will be allowed until somebody unmounts something. The current patch has no provision for per-user or per-group mount limits, but such a feature would not be particularly hard to add should the need arise.
- The filesystem type must be marked as being safe for unprivileged mounts. Miklos notes that a filesystem must go through "a thorough audit" before this flag can be set with any confidence. The patch, as posted, marks the fuse filesystem (which allows for the creation of filesystems implemented in user space) as being safe; fuse was designed for this mode of operation in the first place. Bind mounts are also allowed, with some additional conditions.
If the system allows the mount, the flags allowing for setuid and device files will be forcibly cleared - unless the user has the requisite capabilities anyway. Users are allowed to unmount filesystems they own, again without privilege, but cannot unmount any others. Another new mount flag (MNT_NOMNT) marks a specific filesystem as being the end of the line - no unprivileged submounts are allowed below it. The end result of all this should be a mechanism by which users can organize their filesystem hierarchies without any need for administrative privileges, and without the risk of compromising system security.
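The flag-forcing policy is easy to model. The sketch below uses the real MS_NOSUID and MS_NODEV values from the Linux mount flags, but the helper function itself is an illustration of the described behavior, not code from the patch:

```python
# Sketch of the described policy: for an unprivileged mount, the kernel
# forces "nosuid" and "nodev" on unless the caller holds the relevant
# capabilities.  Flag values match the Linux mount flags; the function
# is illustrative, not the patch's actual code.

MS_NOSUID = 2    # ignore setuid/setgid bits on this mount
MS_NODEV  = 4    # disallow access to device special files

def effective_mount_flags(requested, capable=False):
    """Return the flags the kernel would actually apply."""
    if capable:
        return requested                      # privileged: honor as-is
    return requested | MS_NOSUID | MS_NODEV   # force the safe bits on

print(effective_mount_flags(0))                 # 6: both bits forced on
print(effective_mount_flags(0, capable=True))   # 0: request honored
```

A user mounting a fuse filesystem under an owned mount point would thus always see it nosuid and nodev, closing the classic privilege-escalation routes.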
One might well wonder why this change to the mount() system call is called for, given that users have been able to do unprivileged mounts for years. The answer is that the current mechanism has a couple of shortcomings. Every potential unprivileged mount must be explicitly enabled via a line in /etc/fstab. That works well for simple situations, such as allowing a user to mount a CD or a USB storage device. When users start wanting to do more complicated things, like mounting their own special fuse filesystems, the /etc/fstab mechanism breaks down. There is a separate, setuid program which grants the right to make unprivileged fuse mounts, but it represents a workaround rather than a proper solution.
The current user mount mechanism also requires that the mount utility be installed setuid root. Every setuid binary is a potential security hole, so there is value in eliminating privileged programs when possible. The unprivileged mount patch offers the possibility of eliminating the setuid mount program while simultaneously leaving policy control in the hands of the system administrator. So, unless something surprising comes up, chances are good that this capability will appear in the 2.6.25 kernel.
ext3 metaclustering
By Jonathan Corbet
January 16, 2008
The ext3 filesystem uses the classic Unix block pointer method for keeping track of the blocks in each file. For a given file, the on-disk inode structure contains space for twelve block numbers; they point to the first twelve blocks in the file - the first 48KB of space. If the file is larger than that, a 13th pointer contains the address of the first indirect block; this block contains another 1024 (on a 4K block filesystem) block pointers. Should that not suffice, there's a 14th pointer for the double-indirect block - each entry in that block is the address of an indirect block. And if even that is not enough, there's a 15th entry pointing to a triple-indirect block full of pointers to double-indirect blocks.
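The reach of each level in this scheme is straightforward arithmetic; with 4KB blocks and 4-byte block numbers (so 1024 pointers per indirect block):

```python
# Maximum file size reachable through ext3's block-pointer scheme,
# assuming 4KB blocks and 4-byte block numbers.

block = 4096
ptrs  = block // 4          # 1024 pointers fit in one indirect block

direct   = 12 * block               # twelve direct pointers
indirect = ptrs * block             # one indirect block: 4MB more
double   = ptrs ** 2 * block        # double-indirect: 4GB more
triple   = ptrs ** 3 * block        # triple-indirect: 4TB more

total = direct + indirect + double + triple
print(direct)           # 49152 bytes - the first 48KB
print(total // 2**30)   # 4100 GB reachable in all
```

Note how lopsided the structure is: a multi-terabyte file needs on the order of a million indirect blocks, which is exactly the metadata-management overhead described below.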
This is a very efficient representation for small files - the kinds of files Unix systems typically held, once upon a time. In current times, when one can forget about that directory full of DVD images and never even notice the lost space, it does not work quite as well - there is a lot of overhead for all of those individual block pointers, and a large data structure to manage. That is why removing a large file on an ext3 filesystem can take a long time - the system has to chase down all of those indirect blocks, which, in turn, forces a lot of disk activity and head seeks. For this reason, contemporary filesystems tend to use extent-based mechanisms to associate blocks with files, but that is not really an option for ext3.
An additional problem with all those indirect blocks is that filesystem checkers must locate and verify them all. That, again, causes a lot of head seeking and makes fsck run slowly. Slow filesystem checking was the motivation behind this patch from Abhishek Rai which attempts to improve performance on filesystems with a lot of indirect blocks.
The approach taken is relatively simple: the patch just tries to group indirect block allocations together on the disk. The current ext3 code will allocate indirect blocks when they are needed to account for data blocks being added to the file; they are usually placed adjacent to those data blocks. One might think that this placement would speed subsequent accesses to the file, but that is not necessarily so; the reading or writing of the indirect block will tend to happen at a different time than operations on the data blocks. What this placement does accomplish, though, is the distribution of the indirect blocks all over the disk. So a process which must examine all of the indirect blocks associated with a file must cause the disk to do a lot of head seeks.
The "metaclustering" approach works by reserving a set of contiguous blocks at the end of each block group. Whenever an indirect block is needed, the filesystem tries to get one from this dedicated area first. The end result is that all of the indirect blocks are located next to each other. Should somebody need to read a number of those blocks without being interested in the contents of the data blocks, they can grab them all quickly with minimal seeking. Filesystem checkers, as it happens, need to do exactly that - as does the file removal process. The patch did not come with benchmarks, but the speedup that comes from the elimination of all those seeks should be significant.
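The difference between the two placement policies can be seen in a toy allocator. Everything below (group size, reserved-area size, the interleaving rule) is invented to illustrate the contrast, not taken from the patch:

```python
# Toy allocator contrasting the two indirect-block placement policies.
# A block group is modeled as a range of block numbers; indirect blocks
# either interleave with the data they describe (current ext3 behavior)
# or come from a reserved "metacluster" at the end of the group.
# Purely illustrative; all sizes are invented.

GROUP_SIZE  = 100
METACLUSTER = 10            # blocks reserved at the end of the group

def allocate(n_data, n_indirect, metacluster=True):
    data = list(range(n_data))
    if metacluster:
        # indirect blocks are packed together in the reserved area
        start = GROUP_SIZE - METACLUSTER
        indirect = list(range(start, start + n_indirect))
    else:
        # indirect blocks are placed adjacent to their data blocks,
        # which spreads them across the whole group
        stride = n_data // n_indirect
        indirect = [i * stride for i in range(n_indirect)]
    return data, indirect

_, scattered = allocate(80, 4, metacluster=False)
_, clustered = allocate(80, 4, metacluster=True)
print(scattered)   # [0, 20, 40, 60] - a head seek per indirect block
print(clustered)   # [90, 91, 92, 93] - one seek reads them all
```

An fsck or unlink that only needs the indirect blocks reads the clustered layout in a single sweep, which is the whole point of the patch.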
Even so, Andrew Morton questioned the need for this patch, worrying that its benefits do not justify the risks that come with modifying an established, heavily-used filesystem:
In any decent environment, people will fsck their ext3 filesystems during planned downtime, and the benefit of reducing that downtime from 6 hours/machine to 2 hours/machine is probably fairly small, given that there is no service interruption.
Others disagreed, though, noting that it's the unplanned filesystem checks which are often the most time-critical. That includes the delightful "maximal mount count" boot-time check which, in your editor's experience, always happens when one is trying to get set up to give a talk somewhere. So this patch might just find eventual acceptance - it should be relatively low-risk and does not require any on-disk format changes. This is a filesystem patch, though, so nobody will be in any hurry to get it into the mainline before a lot of testing and review has been done.
State of the unionfs
By Jonathan Corbet
January 15, 2008
LWN last looked at the unionfs filesystem almost exactly one year ago. Things have been relatively quiet on the unionfs front during much of that time, but unionfs has not gone away. Now the unionfs developers are back with an improved version and a determined push to get the code into 2.6.25. So another look seems indicated.
The core idea behind unionfs is to allow multiple, independent filesystems to be merged into a single, coherent whole. As an example, consider a user with a distribution install DVD full of packages, a small disk, and painfully slow bandwidth. It would be nice to keep the DVD-stored packages around for future installation. What is also nice, though, is to be able to keep a directory full of updates from the distributor and use those, when they exist, in favor of the read-only DVD version. Using unionfs, this user could mount the DVD read-only, then mount a writable filesystem (for the updates) on top of the DVD. Updated packages go into the writable filesystem, but all of the available packages are visible, together, in the unified view. To avoid confusion, the user could delete obsoleted packages, at which point they would no longer be visible in the unionfs filesystem, even though they cannot actually be deleted from the underlying DVD. Thus unionfs allows the creation of an apparently writable filesystem on a read-only base; many other applications are possible as well.
If a user rewrites a file which is stored on a read-only "branch" of a union filesystem, the response is relatively straightforward: the newly-written file is stored on a higher-priority, writable branch. If no such branch exists, the operation fails. Dealing with the deletion of a file from a read-only branch is trickier, though. In this case, unionfs will create a "whiteout" in the form of a special file (starting with .wh.) on a writable branch. Some reviewers have disliked this approach since it will clutter the upper branch with those special files over time. But it is hard to come up with another way to handle deletion, especially if (as is the case here) your goal is to keep core VFS changes to an absolute minimum.
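The whiteout lookup rule is simple to sketch. Branches are searched from highest priority to lowest, and a .wh. entry on a higher branch hides the name on every branch below it (the data layout here is an illustration, not unionfs's internal representation):

```python
# Sketch of unionfs-style name resolution with whiteouts.  Branches are
# modeled as sets of filenames, highest priority first; a ".wh.<name>"
# entry on a higher branch hides the file on all lower branches.
# Illustrative only.

WH = ".wh."

def union_lookup(name, branches):
    """Resolve a name against a stack of branches."""
    for branch in branches:
        if WH + name in branch:
            return None              # whiteout: the file appears deleted
        if name in branch:
            return name              # first (highest-priority) match wins
    return None

writable = {"updates.rpm", WH + "obsolete.rpm"}   # upper, writable branch
dvd      = {"obsolete.rpm", "base.rpm"}           # lower, read-only DVD

print(union_lookup("base.rpm", [writable, dvd]))      # visible from the DVD
print(union_lookup("obsolete.rpm", [writable, dvd]))  # None: whited out
```

Deleting obsolete.rpm in the DVD scenario above thus means nothing more than creating .wh.obsolete.rpm on the writable branch.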
That hasn't kept the unionfs developers from trying, though. Off to the side, they have a version of unionfs which maintains a small, special-purpose partition of its own (on writable storage). Metadata (whiteouts, in particular) is stored to this special unionfs partition and no longer clutters the component filesystems. There are other advantages to the dedicated partition scheme, including the ability to include one unionfs as a branch in a second union; see the unionfs ODF document for more information on this approach, which the developers hope to slowly migrate into the version they are currently proposing for the mainline.
Another persistent problem with unionfs has been coping with modifications made directly to the component branches without going through the union. The January, 2007 version of the patch came packaged with some dire warnings: direct modification of unionfs branches could lead to system crashes and data loss. Given that filesystems which have been bundled into a union still exist independently, they will always present a tempting target for modification, even when there is not a specific reason (wanting to put files onto a specific component filesystem, for example). So a unionfs implementation which cannot handle such modifications sets a trap for every user who uses it.
The developers claim to have solved this problem in the current version of the patch. Now, almost every entry into the unionfs code causes it to check the modification times for the relevant file in all layers of the union. If the file turns out to have been changed, unionfs will forget about the file and reload the information from scratch, causing the most current version of the file (or directory) to be visible to the user. This approach solves the problem in a relatively efficient manner, with one exception: unionfs cannot tell when a process modifies a file which it has mapped into its address space with mmap(). So, in that case, changes may not be visible to processes accessing the affected file through the unionfs.
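The revalidation rule amounts to a per-branch timestamp comparison. This hypothetical helper models the check described above; the real unionfs code operates on inodes and dentries, not lists of numbers:

```python
# Sketch of the cache-revalidation rule: remember the modification time
# seen on each branch at lookup time, and drop cached state whenever any
# branch's copy has changed.  Hypothetical helper, not the unionfs code.

def needs_reload(cached_mtimes, current_mtimes):
    """Compare per-branch mtimes recorded earlier with the mtimes on
    disk now; any difference forces unionfs to rebuild its view."""
    return any(cached != current
               for cached, current in zip(cached_mtimes, current_mtimes))

print(needs_reload([100, 250], [100, 250]))   # False: nothing changed
print(needs_reload([100, 250], [100, 300]))   # True: a branch was modified
```

The mmap() hole follows directly from this design: writes through a shared mapping need not update the file's mtime promptly, so the comparison has nothing to trip on.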
In both cases, the unionfs developers would really prefer to have better support from the VFS. Some operating systems have provided native support for whiteouts, but Linux lacks that support. There is also no way for a filesystem at the bottom of a stack of filesystems to notify the higher layers that something has been changed. Fixing either of these would require significant VFS modifications, though, and the changes might propagate down into the individual filesystem implementations as well. So nobody is expecting them to happen anytime soon.
Another significant change in unionfs is the elimination of the ioctl() interface for the management of branches. All changes to an existing unionfs are now done using the remount option of the mount command. This change eliminates the need for a separate utility for unionfs configuration and makes it possible to do complicated changes in an atomic manner.
The end result of all this is that the unionfs hackers think that the time has come to put the code into the mainline. There, it would become the second supported stacking filesystem (the first being eCryptfs), and would help toward the long-term goal of making the VFS layer work better with stacking. Some people speak as if the merging of unionfs into 2.6.25 is a done deal, but that is not yet guaranteed. Christoph Hellwig, whose opinion on such things carries considerable weight, is opposed to the unionfs idea:
I think we made it pretty clear that unionfs is not the way to go, and that we'll get the union mount patches clear once the per-mountpoint r/o and unprivileged mount patches series are in and stable.
Unionfs hacker Erez Zadok responds that unionfs is working - and used - now, while getting union support into the VFS is a distant prospect. So he recommends:
I think a better approach would be to start with Unionfs (a standalone file system that doesn't touch the rest of the kernel). And as Linux gradually starts supporting more and more features that help unioning/stacking in general, to change Unionfs to use those features (e.g., native whiteout support). Eventually there could be basic unioning support at the VFS level, and concurrently a file-system which offers the extra features (e.g., persistency).
When one looks at a recent posting of the union mount patches, it's hard to see them as a near-term solution. As described by their author (Bharata Rao), this work is in an early, exploratory state; there are a number of problems for which solutions are not really in sight. The union mount approach, which does the hard work in the VFS layer, may well be the right long-term approach, but it will not be in a state where it can be shipped to users anytime soon.
In the end, the problem is a hard one, and unionfs has a considerable lead toward being a real solution. That, alone, is not enough to guarantee that unionfs will make it into the 2.6.25 kernel, but it does help that cause considerably. Anybody opposing the merger of unionfs will have to explain why the union filesystem capability should not be available to Linux users in 2008.