sys_epoll - making poll fast (original) (raw)

The classic Unix way to wait for I/O events on multiple file descriptors is with the

select()

and

poll()

system calls. When a process invokes one of those calls, the kernel goes through the list of interesting file descriptors, checks to see if non-blocking I/O is available on any of them, and adds the calling process to a wait queue for each file descriptor that would block. This implementation works reasonably well when the number of file descriptors is small. But if a process is managing thousands of file descriptors, the

select()

and

poll()

calls must check every single one of them, and add the calling process to thousands of wait queues. For every single call. Needless to say, this approach does not scale very well.

Davide Libenzi and others have been working for some time on a new approach to polling that would work for thousands of files. It was originally implemented as a special device (/dev/epoll), but, on request from Linus, the new scheme was turned into a new set of epoll() system calls. These calls work in a very different way. Every call to select()or poll() is a separate event; the data structures must be set up and torn down every time. epoll, instead, requires the application to build a persistent (across calls) data structure in kernel space first. The application starts by creating a special epollfile descriptor:

epfd = epoll_create(int maxfds);

The maxfds parameter is the maximum number of file descriptors that the process expects to manage. The return value is a file descriptor to be used with the other epoll calls; it should be shut down withclose() when it is no longer needed.

Each file descriptor to be managed must be added to the specialepoll descriptor with:

int epoll_ctl(int epfd, int op, int fd, unsigned int events);

The op parameter specifies the operation to be performed (add, change, or remove the given file descriptor fd), andevents is a mask of events of interest to the process.

Once everything has been set up, the process can sit back and wait until there is something for it to do:

int epoll_wait(int epfd, struct pollfd const **events, int timeout);

The return value is the number of events (i.e. readable or writeable file descriptors) that epoll_wait() has found.

These system calls have been shown, through heavy benchmarking, to scale in constant time up to unbelievable numbers of file descriptors (some graphs can be found on this page). The persistent data structure built around the epollfile descriptor is one of the reasons for this scalability: there is no need to set it up and tear it down for every epoll_wait() call. The other half of the story is in how epoll_wait() finds the readable or writeable file descriptors. Rather than polling each file descriptor (and adding itself to wait queues), the epoll mechanism adds a callback structure onto the struct file associated with each file descriptor. When a file descriptor becomes readable or writeable, its callback(s) are called, and processes usingepoll_wait() can be notified directly. So anepoll_wait() call never needs to make a pass over the list of file descriptors it is watching.

The epoll patch is ready, and Linus has indicated that he wants to merge it. For now, epoll only works for pipes and sockets (its initial use is likely to be network services that manage large numbers of connections). Expanding its scope to other types of I/O should just be a matter of doing the work, however.