Make your QEMU 10 times faster with this one awesome trick
The work on QEMU described in this post was paid for by Determinate Systems, and the post was co-published on the Determinate Systems blog.
NixOS makes extensive use of QEMU-based virtual machines for running its test suite. To avoid having to create a disk image for each test, the test driver usually boots the VM with a Plan 9 Filesystem Protocol (9p) share (with the server implemented by QEMU) for the Nix store, which contains all the programs and configuration needed for the test.
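For a sense of what such a test looks like, here is a minimal sketch (illustrative only; the real tests, including the one I was working on, are considerably more involved):

# Illustrative only – see the NixOS manual's section on NixOS tests for the
# authoritative interface. pkgs.nixosTest boots a QEMU VM whose Nix store is
# shared from the host over 9p instead of being baked into a disk image.
{ pkgs ? import <nixpkgs> { } }:
pkgs.nixosTest {
  name = "minimal-example";
  nodes.machine = { ... }: { };
  testScript = ''
    machine.wait_for_unit("multi-user.target")
    # Everything the guest runs here is read from the 9p-mounted host store.
    machine.succeed("true")
  '';
}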
I was working on a test that copies a fairly large amount of data (~278k files totalling ~5.3GiB) from a 9p-mounted Nix store, and was surprised by how long the copy took. On NVMe devices, I would expect it to take a few seconds or minutes, but the test actually took more than 2 hours, most of which was spent copying files from 9p. Since that is impractically long for iterating on a test, I decided to dig in a little, and was able to reduce the duration of the test to just 7 minutes. In this post, I'll describe the entire journey.
As a preface: I don't have much experience debugging performance issues! Most of what I used here was new to me. The first thing I wanted to do was find out where most of the time was being spent. My guess was that it was in QEMU and not in the guest, although this guess turning out to be correct was a matter of pure luck. This Stack Overflow question describes a problem not too different from mine, and it first led me down the wrong path of trying the poor man's profiler, a little hack consisting of gdb, some shell, and awk (a minimal sketch follows below).
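For reference, a minimal version of that hack looks roughly like this. This is a sketch only; the exact script I ran (invoked later as pid=<PID> ./pprof) may differ in its details:

#!/usr/bin/env bash
# Minimal "poor man's profiler" sketch: repeatedly snapshot all thread
# backtraces of the target process with gdb, collapse each backtrace into a
# comma-separated line of function names, and count identical stacks.
# Usage: pid=<target PID> ./pprof
nsamples=${nsamples:-100}
sleeptime=${sleeptime:-0.1}

for _ in $(seq 1 "$nsamples"); do
  gdb -batch -ex "set pagination 0" -ex "thread apply all bt" -p "${pid:?set pid}" 2>/dev/null
  sleep "$sleeptime"
done |
awk '
  # A new "Thread" header ends the previous backtrace; print and reset it.
  /^Thread/ { if (s != "") print s; s = "" }
  # Frame lines start with "#"; field 4 is usually the function name.
  /^#/ { s = (s == "") ? $4 : s "," $4 }
  END { if (s != "") print s }
' |
sort | uniq -c | sort -rn -k1,1   # most frequent stacks first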
Tales of surprises and failures: the poor man's profiler
I immediately ran into a small snag with this method: gdb said
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
Nix uses Linux namespaces to provide builds with some isolation from the system they run on, in order to reduce the effect a particular machine's environment can have on the build result ("purity"). This includes PID namespaces, which prevent processes inside the namespace from touching any processes outside it. gdb is not happy being in a different PID namespace from the process it is targeting! So I first tried to get my gdb inside the sandbox using nsenter. The first surprise I encountered here was that entering a PID namespace does not cause utilities from procps, such as ps, pgrep and top, to report only the processes within that namespace:
(root@oak:~)# pgrep -a qemu
1678991 /nix/store/6shk4z9ip57p6vffm5n9imnkwiks9fsa-qemu-host-cpu-only-for-vm-tests-7.0.0/bin/qemu-kvm (...)
(root@oak:~)# nsenter --target 1678991 --pid
🗣 This spawned a new shell within the build's PID namespace
(root@oak:~)# ps -f 1
UID PID PPID C STIME TTY STAT TIME CMD
root 1 0 0 Sep02 ? Ss 1:24 /run/current-system/systemd/lib/systemd/systemd
What!? That's not the PID 1 I was expecting! And I definitely haven't had this build running since the 2nd of September.
I'll spare you the details of the frustrating investigation that followed, but it turns out that the /proc which ps and friends read from still refers to the root PID namespace – even if they are no longer running in it! This can lead to some funny unexpected behaviour when using tools like pkill… but that's a problem for another day.
With my newfound knowledge, I was able to work around this by also creating a new mount namespace and mounting the desired proc filesystem on top of the /proc inherited from the root namespace.
(root@oak:~)# nsenter -t 1684606 -p -- unshare -m
🗣 Now inside the build's PID namespace (through nsenter) and a new mount namespace (created by unshare)
(root@oak:~)# ps -f 1
UID PID PPID C STIME TTY STAT TIME CMD
root 1 0 0 Sep02 ? Ss 1:24 /run/current-system/systemd/lib/systemd/systemd
(root@oak:~)# mount -t proc proc /proc
(root@oak:~)# ps -f 1
UID PID PPID C STIME TTY STAT TIME CMD
nixbld1 1 0 0 12:27 ? Ss 0:00 bash -e /nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh
(root@oak:~)# ps -ef
UID PID PPID C STIME TTY TIME CMD
nixbld1 1 0 0 12:27 ? 00:00:00 bash -e /nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh
nixbld1 6 1 0 12:27 ? 00:00:00 /nix/store/pn7863n7s2p66b0gazcylm6cccdwpzaf-python3-3.9.13/bin/python3.9 /nix/store/kdi82vgfixayxaql77j3nj7
nixbld1 7 6 99 12:27 ? 00:04:00 /nix/store/6shk4z9ip57p6vffm5n9imnkwiks9fsa-qemu-host-cpu-only-for-vm-tests-7.0.0/bin/qemu-kvm -cpu max -na
root 46 0 0 12:29 pts/5 00:00:00 -bash
root 79 46 0 12:30 pts/5 00:00:00 ps -ef
🗣 That's better!
(root@oak:~)# pid=7 ./pprof
1500 __futex_abstimed_wait_common,__new_sem_wait_slow64.constprop.1,qemu_sem_timedwait,worker_thread,qemu_thread_start,start_thread,clone3
743 __lll_lock_wait,pthread_mutex_lock@@GLIBC_2.2.5,qemu_mutex_lock_impl,qemu_mutex_lock_iothread_impl,flatview_read_continue,flatview_read,address_space_rw,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
100 syscall,qemu_event_wait,call_rcu_thread,qemu_thread_start,start_thread,clone3
53 ioctl,kvm_vcpu_ioctl,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
45 get_fid,v9fs_read,coroutine_trampoline,__correctly_grouped_prefixwc,??
15 get_fid,v9fs_walk,coroutine_trampoline,__correctly_grouped_prefixwc,??
11 alloc_fid,v9fs_walk,coroutine_trampoline,__correctly_grouped_prefixwc,??
8 get_fid,v9fs_getattr,coroutine_trampoline,__correctly_grouped_prefixwc,??
5 clunk_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
5 alloc_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
4 __lll_lock_wait,pthread_mutex_lock@@GLIBC_2.2.5,qemu_mutex_lock_impl,qemu_mutex_lock_iothread_impl,flatview_write_continue,flatview_write,address_space_rw,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
3 get_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
3 get_fid,v9fs_open,coroutine_trampoline,__correctly_grouped_prefixwc,??
2 clunk_fid,v9fs_clunk,coroutine_trampoline,__correctly_grouped_prefixwc,??
1 get_fid,v9fs_readlink,coroutine_trampoline,__correctly_grouped_prefixwc,??
1 get_fid,v9fs_readdir,coroutine_trampoline,__correctly_grouped_prefixwc,??
1 address_space_translate_internal,flatview_do_translate.isra,address_space_map,virtqueue_map_desc,virtqueue_split_pop,virtio_blk_handle_vq,virtio_queue_notify_vq.part,aio_dispatch_handler,aio_dispatch,aio_ctx_dispatch,g_main_context_dispatch,main_loop_wait,qemu_main_loop,main
1
The most common stack traces are:
- Locking, in the 1500- and 743-sample entries. Since the first one is in the worker threads, I figured those were uninteresting – they are just waiting for work to become available.
- Waiting for things to happen inside the VM, in the 100- and 53-sample entries. That doesn't seem like a relevant issue either – I would expect a VM monitor to spend most of its time waiting for its guest to need something.
The other counts were so small that I (wrongly!) dismissed them as uninteresting as well. At this point I was frustrated enough with this branch of the investigation that I abandoned it.
I moved on to quickstack, packaged it up, and found that I couldn't get any thread information out of it, before returning to the Stack Overflow question and following the other link given in the answer.
Flame graph with perf
This is what finally got me somewhere. After recording performance data using perf record -F max -a -g -- sleep 20 while the build was running, I produced a flame graph that made the source of the performance problem very apparent. The following command:
# taken from the flamegraph page linked above
perf script | stackcollapse-perf.pl |
# We're only interested in the qemu process
grep ^qemu |
# Make the graph a lot less tall by collapsing multiple consecutive unknown stack frames together
awk '/unknown/ { gsub("(\\(unknown);){1,}", "(unknown...);", $0) } { print }' |
flamegraph.pl > flamegraph.svg # and generate a flamegraph
created this nice interactive SVG graph:
The prevalence of fid-related functions is very clear in this graph. So I dug into the 9p documentation and the QEMU source code, and found that fids are numbers that refer to open files on a 9p connection, much like file descriptors in the POSIX file API – so there is one fid for each file that the guest currently has open.
Let's look at the old implementation of get_fid, which QEMU's 9p server uses in its implementations of stat (retrieving file metadata), open (opening a file), read (reading data from an open file), and so on:
static V9fsFidState *coroutine_fn get_fid(V9fsPDU *pdu, int32_t fid)
{
    int err;
    V9fsFidState *f;
    V9fsState *s = pdu->s;

    // Blog note: I've omitted some parts that are irrelevant to performance here.
    QSIMPLEQ_FOREACH(f, &s->fid_list, next) {
        if (f->fid == fid) {
            return f;
        }
    }
    return NULL;
}
QEMU iterates through fid_list to find the data for the requested fid. Finding an entry in a list by iterating over it has a time complexity of O(n), where n is the size of the list – in this case, the list of all files the guest has open – so it's expensive! Additionally, some inspection of QSIMPLEQ (QEMU's simple queue) reveals that it is implemented as a linked list, a data structure that tends to exhibit poor cache locality.
Since my test copies a lot of files from the 9p filesystem, as well as booting from it, this lookup happens at least:
- once for stat, to get the file's metadata (permissions in particular)
- once for open, to get a handle on the file
- once for read on small files, or many times for large files, to get the file's contents
That makes at least 3 lookups for each of the ~278000 files (over 800,000 list traversals in total), putting an expensive search in a hot code path. This is the cause of the slowness.
What we really want is a data structure that lets us find the entry for a given fid more cheaply. We can't just use an array-backed vector, which would always give us O(1) lookups, because fids are chosen by the client: we can't rely on every newly allocated fid being the smallest free one, and we need to support arbitrary 32-bit integers. I chose a hash table, like the one implemented in glib, which QEMU already depends on. That gives us O(1) best-case complexity while retaining O(n) in the worst case. The exact real-world performance characteristics are rather more complex than for the linked list, and there may be slightly better-suited data structures (or hash table implementations) out there, but we're after a big easy win here, not micro-optimisation.
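To illustrate the idea, here is a self-contained stand-in (not the actual QEMU patch; the struct and names are simplified), showing the lookup becoming a single hash-table operation instead of a list walk:

/* Self-contained sketch of the data-structure change, using glib's GHashTable
 * keyed directly by the 32-bit fid. Compile with:
 *   cc sketch.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>
#include <stdint.h>

typedef struct {
    int32_t fid;
    const char *path;   /* stand-in for the real per-fid state */
} FidState;

int main(void)
{
    /* g_direct_hash/g_direct_equal hash the key pointer itself; we store the
     * fid directly in the key pointer, so no separate key allocation is needed. */
    GHashTable *fids = g_hash_table_new_full(g_direct_hash, g_direct_equal,
                                             NULL, g_free);

    /* "alloc_fid": insert state under the client-chosen fid. */
    FidState *f = g_new0(FidState, 1);
    f->fid = 42;
    f->path = "/nix/store/example";
    g_hash_table_insert(fids, GINT_TO_POINTER(f->fid), f);

    /* "get_fid": O(1) average-case lookup instead of an O(n) list traversal. */
    FidState *found = g_hash_table_lookup(fids, GINT_TO_POINTER(42));
    if (found) {
        printf("fid %d -> %s\n", (int)found->fid, found->path);
    }

    g_hash_table_destroy(fids);
    return 0;
}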
Rewriting the relevant code was surprisingly simple and uneventful, to the point that once I got it to compile, it just worked (an experience I'm not sure I've ever had before in C!). The results are impressive: my previously >2h test completed in 7 minutes. It also reduced the build time for NixOS's ZFS AWS image from 19 minutes to 1. It was clear to me that this needed to go upstream.
QEMU uses an email-based workflow, where patches are sent as emails to a mailing list that the maintainers are subscribed to. I had never done this before, and it got messy, as you can see by looking at the threads (v1, v3, v1a, v5, v6) in the list archives. There is a lot of room for error in email-based patch submission, and I made several mistakes (a rough sketch of the basic submission commands follows the list below):
- Forgetting to reply-all when replying to review comments, so that the reply is visible to all interested parties and not just the reviewer
- Missing the version tag on a resubmission (in my case due to misleading docs)
- Sending a resubmission as a reply to the previous thread (that one was just me failing to read the docs)
- Losing a bunch of patch-documenting work while using git-publish, because git send-email failed (nixpkgs does not enable Git's email functionality in the default package, and git-publish unconditionally deletes its working files when the send command is run)
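For the curious, the basic commands for sending a (revised) series over email look roughly like this. This is a sketch only; QEMU's patch submission docs and git-publish handle the details, and the list address and flags should be double-checked there:

# Illustrative sketch of an email-based patch submission (here: version 6).
git format-patch -v6 --cover-letter origin/master
# Requires Git's send-email support to be installed and configured.
git send-email --to=qemu-devel@nongnu.org v6-*.patch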
But the reviewers, especially Christian Schoenebeck (thank you!), were helpful and patient, and the patch eventually made it in (though a seemingly unrelated bug was discovered along the way) and will soon find its way into a QEMU release near you! If you can't wait and you're using Nix, you can pull in the patch using this overlay:
final: prev: {
  qemu = prev.qemu.overrideAttrs (o: {
    patches = o.patches ++ [
      (prev.fetchpatch {
        name = "qemu-9p-performance-fix.patch";
        url = "https://gitlab.com/lheckemann/qemu/-/commit/8ab70b8958a8f9cb9bd316eecd3ccbcf05c06614.patch";
        sha256 = "sha256-PSOv0dhiEq9g6B1uIbs6vbhGr7BQWCtAoLHnk4vnvVg=";
      })
    ];
  });
}
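If you're unsure where that overlay goes: one common way (the file name here is just an example) is to save it to a file and reference it from your NixOS configuration:

# configuration.nix – illustrative; adjust the path to wherever you saved the overlay.
{
  nixpkgs.overlays = [ (import ./qemu-9p-fix-overlay.nix) ];
}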
While the core of what I changed was not all that complicated, this adventure had many positive side effects for me:
- I wrote Nix packages for quickstack (including an upstream PR that makes the build more generic) and git-publish.
- I used perf to work on a performance issue for the first time, learning how useful it is and the basics of how to use it.
- I learned that different profiling methods yield different results: the results from perf paint a very different picture from the poor man's profiler's. I still don't understand why, but I intend to spend some time learning more about both soon.
- I submitted my first patch to QEMU, getting to know the codebase a little.
- I submitted my second QEMU patch (trying to fix the patch submission docs).
- I learned how to submit patches via email, and what mistakes to avoid.
- I learned about the 9p protocol.
- My tests are now faster!
- And I got that warm fuzzy feeling of making a valuable contribution.
My takeaway from this? Digging into a frustrating problem, even when I'm not familiar with the code involved or the technology I'm using, can be very rewarding!
Not only did I make my own work on the tests more pleasant by reducing iteration times, but the improvement goes out to all QEMU users, and it significantly reduces the load that NixOS's installer tests generate on the build farm. This is what I love about working with open source software: if there's a problem, I have a way of getting in there and fixing it, and passing the benefit on to everyone who uses it.
It doesn't always go this smoothly, but an experience like this can make many less successful adventures worth it.