In fact, the BSD kqueue is a mountain of technical debt
A side effect of the whole freenode kerfuffle is that I've been looking at IRCD again. IRC, of course, is a weird and interesting place, and the smaller community of people who run IRCDs is even weirder and more interesting.
However, in that community of IRCD administrators there are some incorrect opinions on systems programming that have been cargo-culted over the years. This particular blog post is about one of these bikesheds, namely the kqueue vs epoll debate.
You've probably heard it before. It goes something like this: "BSD is better for networking, because it has kqueue. Linux has nothing like kqueue, epoll doesn't come close." While I agree that epoll doesn't come close, I think that's actually a feature that leads to a more flexible and composable design.
In the beginning…
Originally, IRCD, like most daemons, used select for polling sockets for readiness, as this was the first polling API available on systems with BSD sockets. The select syscall works by taking a set of three bitmaps, with each bit describing a file descriptor number: bit 1 refers to file descriptor 1 and so on. The bitmaps are the read_set, write_set and err_set, which map to sockets that can be read from, written to, or that have errored, respectively. Due to design defects in the select syscall, it can only support up to FD_SETSIZE file descriptors on most systems. This can be mitigated by making fd_set an arbitrarily large bitmap and depending on fdmax to be the upper bound, which is what WinSock has traditionally done on Windows.
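As a rough illustration of the three-set interface, here is a minimal sketch using Python's select module, which wraps the same syscall but takes lists of sockets rather than raw bitmaps. The function names and the irc.example.net hostname are made up for this example.

```python
import select
import socket


def ready_with_select(socks, timeout=1.0):
    """Ask select() which sockets are readable, writable or errored."""
    # The three arguments play the role of read_set, write_set and err_set.
    return select.select(socks, socks, socks, timeout)


def demo_select():
    # A connected pair of UNIX sockets stands in for client connections.
    a, b = socket.socketpair()
    try:
        # irc.example.net is a hypothetical server name.
        b.sendall(b"PING :irc.example.net\r\n")
        readable, writable, errored = ready_with_select([a, b])
        # a has data waiting; b has nothing to read; both can be written to.
        return (a in readable, b in readable, a in writable, errored)
    finally:
        a.close()
        b.close()
```

Note that every call passes the full set of sockets again, which is part of why this approach stops scaling.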
The select syscall clearly had some design deficits that negatively affected scalability, so AT&T introduced the poll syscall in System V UNIX. The poll syscall takes an array of struct pollfd of user-specified length, and updates a bitmap of flags in each struct pollfd entry with the current status of each socket. Then you iterate over the struct pollfd list. This is naturally much more efficient than select, where you have to iterate over all file descriptors up to fdmax and test for membership in each of the three bitmaps to determine each socket's status.
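The same readiness check with poll might look like the sketch below. It uses Python's select.poll wrapper, which maintains the struct pollfd array internally and does the iteration for you, returning only the entries whose returned flags are non-zero. The function name demo_poll is made up for this example.

```python
import select
import socket


def demo_poll():
    a, b = socket.socketpair()
    try:
        poller = select.poll()
        # One pollfd entry per socket; the mask says which events we want.
        poller.register(a.fileno(), select.POLLIN)
        poller.register(b.fileno(), select.POLLIN)
        b.sendall(b"x")
        # The timeout is in milliseconds. Each result is (fd, revents).
        events = poller.poll(1000)
        return [(fd == a.fileno(), bool(revents & select.POLLIN))
                for fd, revents in events]
    finally:
        a.close()
        b.close()
```

Only socket a is reported, because only it has data to read. In C you would walk the whole array yourself and check revents on each entry.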
It can be argued that select was bounded by FD_SETSIZE (which is usually 1024 sockets), while poll begins to have serious scalability issues at around 10240 sockets. These arbitrary benchmarks have been referred to as the C1K and C10K problems, respectively. Dan Kegel has a very lengthy post on his website about his experiences mitigating the C10K problem in the context of running an FTP site.
Then there was kqueue…
In July 2000, Jonathan Lemon introduced kqueue into FreeBSD, which quickly propagated to the other BSD forks as well. kqueue is a kernel-assisted event notification system using two syscalls: kqueue and kevent. The kqueue syscall creates a handle in the kernel represented as a file descriptor, which a developer uses with kevent to add and remove event filters. Event filters can match against file descriptors, processes, filesystem paths, timers, and so on.
This design allows a single-threaded server to process hundreds of thousands of connections at once, because it can register all of the sockets it wants to monitor with the kernel and then lazily iterate over the sockets as they have events.
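A minimal sketch of the register-then-wait pattern, using Python's select.kqueue wrapper. This only exists on BSD-derived systems, so the hypothetical demo_kqueue function below returns None elsewhere.

```python
import select
import socket


def demo_kqueue():
    if not hasattr(select, "kqueue"):
        return None  # not a BSD-derived system
    a, b = socket.socketpair()
    kq = select.kqueue()  # the kernel handle, itself a file descriptor
    try:
        # Add an EVFILT_READ filter for socket a.
        change = select.kevent(a.fileno(),
                               filter=select.KQ_FILTER_READ,
                               flags=select.KQ_EV_ADD)
        kq.control([change], 0, 0)  # apply the change, fetch no events
        b.sendall(b"hello")
        # Wait up to one second for at most 8 events.
        events = kq.control(None, 8, 1.0)
        return [(ev.ident == a.fileno(), ev.filter == select.KQ_FILTER_READ)
                for ev in events]
    finally:
        kq.close()
        a.close()
        b.close()
```

The registration happens once; after that, each wait returns only the sockets that actually have events.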
Most IRCDs have supported kqueue for the past 15 to 20 years.
And then epoll…
In October 2002, Davide Libenzi got his epoll patch merged into Linux 2.5.44. As with kqueue, you use the epoll_create syscall to create a kernel handle which represents the set of descriptors to monitor. You use the epoll_ctl syscall to add or remove descriptors from that set. And finally, you use epoll_wait to wait for kernel events.
In general, the scalability aspects are the same to the application programmer: you have your sockets, you use epoll_ctl to add them to the kernel's epoll handle, and then you wait for events, just like you would with kevent.
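The three steps can be sketched with Python's select.epoll wrapper, which is only available on Linux. The function name demo_epoll is made up for this example, and the comments note which syscall each call corresponds to.

```python
import select
import socket


def demo_epoll():
    if not hasattr(select, "epoll"):
        return None  # not Linux
    a, b = socket.socketpair()
    ep = select.epoll()  # creates the kernel handle (epoll_create)
    try:
        # epoll_ctl with EPOLL_CTL_ADD: add descriptors to the set.
        ep.register(a.fileno(), select.EPOLLIN)
        ep.register(b.fileno(), select.EPOLLIN)
        b.sendall(b"hello")
        # epoll_wait: block for up to one second, get (fd, events) pairs.
        events = ep.poll(1.0)
        # epoll_ctl with EPOLL_CTL_DEL: remove a descriptor from the set.
        ep.unregister(b.fileno())
        return [(fd == a.fileno(), bool(mask & select.EPOLLIN))
                for fd, mask in events]
    finally:
        ep.close()
        a.close()
        b.close()
```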
Like kqueue, most IRCDs have supported epoll for the past 15 years.
What is a file descriptor, anyway?
To understand the argument I am about to make, we need to talk about file descriptors. UNIX uses the term file descriptor a lot, even when referring to things which are obviously not files, such as network sockets. Outside of the UNIX world, a file descriptor is usually referred to as a kernel handle. Indeed, in Windows, kernel-managed resources are given the HANDLE type, which makes this relationship more obvious. Essentially, a kernel handle is an opaque reference to an object in kernel space, and the astute reader may notice some similarities to the object-capability model as a result.
Now that we understand that file descriptors are actually just kernel handles, we can talk about kqueue and epoll, and why epoll is actually the correct design.
The problem with event filters
The key difference between epoll and kqueue is that kqueue operates on the notion of event filters instead of kernel handles. This means that any time you want kqueue to do something new, you have to add a new type of event filter.
FreeBSD currently has 10 different event filter types: EVFILT_READ, EVFILT_WRITE, EVFILT_EMPTY, EVFILT_AIO, EVFILT_VNODE, EVFILT_PROC, EVFILT_PROCDESC, EVFILT_SIGNAL, EVFILT_TIMER and EVFILT_USER. Darwin has additional event filters for monitoring Mach ports.
Other than EVFILT_READ, EVFILT_WRITE and EVFILT_EMPTY, all of these event filter types relate to entirely different concerns in the kernel: they don't monitor kernel handles, but instead monitor specific subsystems other than sockets.
This makes for a powerful API, but one which lacks composability.
epoll is better because it is composable
It is possible to do almost everything that kqueue can do on FreeBSD in Linux, but instead of having a single monolithic syscall to handle everything, Linux takes the approach of providing syscalls which allow almost anything to be represented as a kernel handle.
Since epoll strictly monitors kernel handles, you can register any kernel handle you have with it and get events back when its state changes. As a comparison to Windows, this basically means that epoll is a kernel-accelerated form of WaitForMultipleObjects in the Win32 API.
You may be wondering how this works, so here is a table of commonly used kqueue event filters and the Linux syscall used to get a kernel handle for use with epoll.
BSD event filter | Linux equivalent |
---|---|
EVFILT_READ, EVFILT_WRITE, EVFILT_EMPTY | Pass the socket and use EPOLLIN and so on. |
EVFILT_VNODE | inotify |
EVFILT_SIGNAL | signalfd |
EVFILT_TIMER | timerfd |
EVFILT_USER | eventfd |
EVFILT_PROC, EVFILT_PROCDESC | pidfd; alternatively, bind processes to a cgroup and monitor cgroup.events |
EVFILT_AIO | aiocb.aio_fildes (treat as socket) |
Hopefully, as you can see, epoll can automatically monitor any kind of kernel resource without having to be modified, due to its composable design, which makes it superior to kqueue from the perspective of having less technical debt.
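To make the composability point concrete, here is a small sketch that watches a socket and an eventfd (the rough counterpart of EVFILT_USER) through the same epoll handle. It assumes Linux and Python 3.10 or newer for os.eventfd, and the hypothetical demo_composed_epoll function returns None where either is missing.

```python
import os
import select
import socket


def demo_composed_epoll():
    if not (hasattr(select, "epoll") and hasattr(os, "eventfd")):
        return None  # needs Linux and Python 3.10+
    a, b = socket.socketpair()
    efd = os.eventfd(0)  # a user-triggered event, exposed as a descriptor
    ep = select.epoll()
    try:
        # Both kinds of handle are registered in exactly the same way.
        ep.register(a.fileno(), select.EPOLLIN)
        ep.register(efd, select.EPOLLIN)
        b.sendall(b"x")           # makes the socket readable
        os.eventfd_write(efd, 1)  # makes the eventfd readable
        names = {a.fileno(): "socket", efd: "eventfd"}
        return sorted(names[fd] for fd, _ in ep.poll(1.0))
    finally:
        ep.close()
        os.close(efd)
        a.close()
        b.close()
```

epoll itself needed no knowledge of eventfd here: anything that can be turned into a descriptor can be added to the set.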
Interestingly, FreeBSD has recently added support for Linux's eventfd, so it appears that they will take kqueue in this direction too. Between that and FreeBSD's process descriptors, it seems likely.
2024-12-29 15:20:00