In fact, BSD kqueue is a mountain of technical debt

A side effect of the whole freenode kerfuffle is that I have been looking at IRCD again. IRC, of course, is a weird and interesting place, and the smaller community of people who run IRCDs is even weirder and more interesting.

However, in that community of IRCD administrators there are some incorrect opinions on systems programming that have been cultivated by cargo culting over the years. This post is about one of these bikesheds, namely the kqueue vs epoll debate.

You’ve probably heard it before. It goes something like this: “BSD is better for networking, because it has kqueue. Linux has nothing like kqueue, epoll doesn’t come close.” While I agree that epoll doesn’t come close, I think that’s actually a feature that leads to a more flexible and composable design.

In the beginning…

Originally, IRCD, like most daemons, used select for polling sockets for readiness, as this was the first polling API available on systems with BSD sockets. The select syscall works by taking a set of three bitmaps, with each bit describing a file descriptor number: bit 1 refers to file descriptor 1 and so on. The bitmaps are read_set, write_set and err_set, which map to sockets that can be read, written to or have errors accordingly. Due to design flaws in the select syscall, it can only support up to FD_SETSIZE file descriptors on most systems. This can be mitigated by making fd_set an arbitrarily large bitmap and depending on fdmax to be the upper bound, which is what WinSock has traditionally done on Windows.
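As a rough illustration of the three-set interface, here is a minimal sketch using Python's select module, which wraps the same syscall. The helper name readable_with_select and the socketpair standing in for client connections are assumptions made for the example, not anything an IRCD actually ships.

```python
import select
import socket


def readable_with_select(socks, timeout=1.0):
    """Return the sockets in `socks` that select() reports as readable."""
    # The three lists play the role of read_set, write_set and err_set.
    readable, _writable, _errored = select.select(socks, [], socks, timeout)
    return readable


if __name__ == "__main__":
    # A connected pair of sockets stands in for a client connection.
    a, b = socket.socketpair()
    b.sendall(b"PING :irc.example.net\r\n")
    print(readable_with_select([a, b]) == [a])
    a.close()
    b.close()
```

Note that the caller hands over the full set of sockets on every call, which is part of why this approach scales poorly.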

The select syscall clearly had some design deficits that negatively affected scalability, so AT&T introduced the poll syscall in System V UNIX. The poll syscall takes an array of struct pollfd of user-specified length, and updates a bitmap of flags in each struct pollfd entry with the current status of each socket. Then you iterate over the struct pollfd list. This is naturally more efficient than select, where you have to iterate over all file descriptors up to fdmax and test for membership in each of the three bitmaps to determine each socket's status.
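The same readiness check can be sketched with poll. This again uses Python's select module as a stand-in for the C API, and the helper name readable_with_poll is hypothetical. Python hides the struct pollfd array behind register(), but the shape is the same: one entry per descriptor, and one flags bitmap per entry in the result.

```python
import select
import socket


def readable_with_poll(socks, timeout_ms=1000):
    """Return the sockets in `socks` that poll() reports as readable."""
    poller = select.poll()
    by_fd = {}
    for s in socks:
        # One registration per descriptor, like one struct pollfd entry.
        poller.register(s.fileno(), select.POLLIN)
        by_fd[s.fileno()] = s
    ready = []
    # poll() returns (fd, revents) pairs only for descriptors with events.
    for fd, revents in poller.poll(timeout_ms):
        if revents & select.POLLIN:
            ready.append(by_fd[fd])
    return ready


if __name__ == "__main__":
    a, b = socket.socketpair()
    b.sendall(b"PING :irc.example.net\r\n")
    print(readable_with_poll([a, b]) == [a])
    a.close()
    b.close()
```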

It could be argued that select was limited by FD_SETSIZE (which is usually 1024 sockets), while poll started to have serious scalability issues at around 10240 sockets. These arbitrary benchmarks are called the C1K and C10K problems accordingly. Dan Kegel has a very long post on his website about his experiences mitigating the C10K problem in the context of running an FTP site.

Then there was kqueue…

In July 2000, Jonathan Lemon introduced kqueue to FreeBSD, which quickly spread to the other BSD forks as well. kqueue is a kernel-assisted event notification system using two syscalls: kqueue and kevent. The kqueue syscall creates a kernel handle represented as a file descriptor, which a developer uses with kevent to add and remove event filters. Event filters can match file descriptors, processes, filesystem paths, timers, and more.

This design allows a single-threaded server to process hundreds of thousands of connections at once, because it can register all the sockets it wants to monitor with the kernel and then lazily iterate over the sockets as they have events.
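For a feel of the register-then-wait pattern, here is a small sketch using Python's select.kqueue wrapper, which only exists on BSD and macOS builds of Python. The helper name first_kqueue_event is hypothetical, and the function simply returns None on systems without kqueue so that the sketch stays importable elsewhere.

```python
import select
import socket


def first_kqueue_event(sock, timeout=1.0):
    """Return (ident, filter) for the first read event on `sock`.

    Returns None if kqueue is unavailable or nothing happened in time.
    """
    if not hasattr(select, "kqueue"):
        return None  # not a BSD or macOS system
    kq = select.kqueue()  # the kqueue syscall: creates the kernel handle
    try:
        # An event filter: watch this descriptor for readability.
        change = select.kevent(
            sock.fileno(),
            filter=select.KQ_FILTER_READ,
            flags=select.KQ_EV_ADD,
        )
        # control() wraps kevent: it applies the change list and then
        # waits for up to one event, all in a single call.
        events = kq.control([change], 1, timeout)
        if not events:
            return None
        return (events[0].ident, events[0].filter)
    finally:
        kq.close()


if __name__ == "__main__":
    a, b = socket.socketpair()
    b.sendall(b"PING :irc.example.net\r\n")
    print(first_kqueue_event(a))
    a.close()
    b.close()
```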

Most IRCDs have supported kqueue for the past 15 to 20 years.

And then epoll…

In October 2002, Davide Libenzi got his epoll patch merged into Linux 2.5.44. As with kqueue, you use the epoll_create syscall to create a kernel handle representing the set of descriptors to monitor. You use the epoll_ctl syscall to add or remove descriptors from that set. And finally, you use epoll_wait to wait for kernel events.

In general, the scalability aspects are the same to the application programmer: you have your sockets, you use epoll_ctl to add them to the kernel's epoll handle, and then you wait for events, just as you would with kevent.
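The three steps map directly onto Python's select.epoll wrapper, which is only available on Linux. The following is a minimal sketch under that assumption; the helper name readable_with_epoll is made up for the example.

```python
import select
import socket


def readable_with_epoll(socks, timeout=1.0):
    """Return the sockets in `socks` that epoll reports as readable."""
    by_fd = {s.fileno(): s for s in socks}
    ep = select.epoll()  # epoll_create: the kernel handle for the set
    try:
        for fd in by_fd:
            # epoll_ctl with EPOLL_CTL_ADD: add a descriptor to the set.
            ep.register(fd, select.EPOLLIN)
        ready = []
        # epoll_wait: only descriptors with pending events come back.
        for fd, events in ep.poll(timeout):
            if events & select.EPOLLIN:
                ready.append(by_fd[fd])
        return ready
    finally:
        ep.close()


if __name__ == "__main__":
    a, b = socket.socketpair()
    b.sendall(b"PING :irc.example.net\r\n")
    print(readable_with_epoll([a, b]) == [a])
    a.close()
    b.close()
```

A real server would keep the epoll handle around for the life of the process instead of rebuilding it per call; it is created and closed here only to keep the sketch self-contained.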

As with kqueue, most IRCDs have supported epoll for the past 15 years.

What is a file descriptor, anyway?

To understand the argument I am about to make, we need to discuss file descriptors. UNIX uses the term file descriptor a lot, even when referring to things that are clearly not files, such as network sockets. Outside of the UNIX world, a file descriptor is usually called a kernel handle. In fact, in Windows, the resources managed by the kernel are given the HANDLE type, which makes this relationship more obvious. Essentially, a kernel handle is an opaque reference to an object in kernel space, and the astute reader may notice some similarities to the object-capability model as a result.

Now that we know that file descriptors are actually kernel handles, we can talk about kqueue and epoll, and why epoll is actually the right design.

The problem with event filters

The key difference between epoll and kqueue is that kqueue operates on the notion of event filters instead of kernel handles. This means that any time you want kqueue to do something new, you have to add a new type of event filter.

FreeBSD currently has 10 different event filter types: EVFILT_READ, EVFILT_WRITE, EVFILT_EMPTY, EVFILT_AIO, EVFILT_VNODE, EVFILT_PROC, EVFILT_PROCDESC, EVFILT_SIGNAL, EVFILT_TIMER and EVFILT_USER. Darwin has additional event filters for monitoring Mach ports.

Except for EVFILT_READ, EVFILT_WRITE and EVFILT_EMPTY, all of these different event filter types relate to entirely different kernel concerns: they don’t monitor kernel handles, but instead specific subsystems other than sockets.

This makes for a powerful API, but one that lacks composability.

epoll is better because it is composable

It is possible to do almost everything on Linux that kqueue can do on FreeBSD, but instead of having a single monolithic syscall to handle everything, Linux takes the approach of providing syscalls that allow almost anything to be represented as a kernel handle.

Since epoll strictly monitors kernel handles, you can register any kernel handle you have with it and get events back when its state changes. As a comparison to Windows, this basically means that epoll is a kernel-accelerated form of WaitForMultipleObjects in the Win32 API.

You may be wondering how this works, so here is a table of commonly used kqueue event filters and the Linux syscall used to get a kernel handle for use with epoll.

BSD event filter                          Linux equivalent
EVFILT_READ, EVFILT_WRITE, EVFILT_EMPTY   Pass the socket and use EPOLLIN etc.
EVFILT_VNODE                              inotify
EVFILT_SIGNAL                             signalfd
EVFILT_TIMER                              timerfd
EVFILT_USER                               eventfd
EVFILT_PROC, EVFILT_PROCDESC              pidfd; alternatively, bind processes to a cgroup and monitor cgroup.events
EVFILT_AIO                                aiocb.aio_fildes (treat as socket)

Hopefully, as you can see, epoll can automatically monitor any kind of kernel resource without having to be modified, due to its composable design, which makes it superior to kqueue from the perspective of having less technical debt.
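To make the EVFILT_USER row concrete, here is a small sketch that registers an eventfd with epoll in exactly the same way as a socket. It assumes Linux and Python 3.10 or newer, where os.eventfd is available; the helper name wait_on_eventfd is hypothetical.

```python
import os
import select


def wait_on_eventfd(value, timeout=1.0):
    """Signal an eventfd with `value` (> 0) and read it back via epoll."""
    efd = os.eventfd(0)  # a kernel handle wrapping a 64-bit counter
    ep = select.epoll()
    try:
        # Registered with the same call and flag used for a socket.
        ep.register(efd, select.EPOLLIN)
        # Adding to the counter makes the handle readable.
        os.eventfd_write(efd, value)
        for fd, events in ep.poll(timeout):
            if fd == efd and events & select.EPOLLIN:
                # Reading returns the counter and resets it to zero.
                return os.eventfd_read(efd)
        return None
    finally:
        ep.close()
        os.close(efd)


if __name__ == "__main__":
    print(wait_on_eventfd(3))
```

Nothing in the epoll half of this code knows or cares that the descriptor is an eventfd rather than a socket, which is the composability argument in miniature.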

Interestingly, FreeBSD has recently added support for Linux’s eventfd, so it appears that they may take kqueue in this direction as well. Between that and FreeBSD’s process descriptors, it seems likely.

2024-12-29 15:20:00
