
Make your QEMU 10 times faster with this one awesome trick

The work on QEMU and this post were funded by Determinate Systems, and the post was co-published on the Determinate Systems blog.

NixOS makes extensive use of QEMU-based virtual machines for running its test suite. To avoid creating a disk image for each test, the test driver usually boots the VMs with a Plan 9 Filesystem Protocol (9p) share (with the server implemented by QEMU) for the Nix store, which contains all the programs and configuration needed for the test.
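
For concreteness, here is a hedged sketch of what the guest side of such a share boils down to, expressed as the raw mount(2) call. The mount tag nix-store, the target path and the option string are illustrative assumptions rather than what the NixOS test driver necessarily uses; on the host side, QEMU exports the directory with something like -virtfs local,path=/nix/store,mount_tag=nix-store,security_model=none.

/* Hedged sketch: mounting a QEMU 9p share inside the guest.
 * The tag "nix-store" and the option string are assumptions for
 * illustration, not necessarily what the NixOS test driver uses. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Equivalent to: mount -t 9p -o trans=virtio,version=9p2000.L \
     *                      nix-store /nix/store */
    if (mount("nix-store", "/nix/store", "9p", 0,
              "trans=virtio,version=9p2000.L") != 0) {
        perror("mount 9p share");
        return 1;
    }
    return 0;
}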

I was working on a test VM that copies a fairly large amount of data (~278k files totalling ~5.3 GiB) out of the 9p-mounted Nix store, and was surprised at how long copying this data took. On NVMe devices, I would expect it to take a few seconds or minutes, but the test actually took over 2 hours, most of which was spent copying files from 9p. Since that was far too long for comfortable iteration, I decided to look into it, and managed to reduce the duration of the test to just 7 minutes. In this post, I will describe the entire journey.

As a preface: I don’t have much experience debugging performance issues! Most of what I used here was new to me. The first thing I wanted to do was find out where most of the time was being spent. My guess was that it was in QEMU rather than in the guest, though that guess turning out to be correct was pure luck. This Stack Overflow question describes a problem not entirely unlike mine, and it first led me down the wrong path of trying the poor man’s profiler: a little hack consisting of gdb, some shell, and awk.

Tales of surprise and failure: the poor man’s profiler

I immediately hit a small obstacle with this method: gdb said

warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.

Nix uses Linux namespaces to provide builds with some isolation from the system running the build, in order to reduce the effect a particular machine’s environment can have on the build result (“purity”). This includes PID namespaces, which prevent processes inside the namespace from touching any processes outside it. gdb is not happy to be in a different PID namespace from the process it is targeting! My first attempt was to get my gdb inside the sandbox using nsenter. And here came the first surprise: entering a PID namespace does not cause utilities from procps, such as ps, pgrep and top, to report only the processes inside that namespace:

(root@oak:~)# pgrep -a qemu
1678991 /nix/store/6shk4z9ip57p6vffm5n9imnkwiks9fsa-qemu-host-cpu-only-for-vm-tests-7.0.0/bin/qemu-kvm (...)

(root@oak:~)# nsenter --target 1678991 --pid

🗣 This spawned a new shell within the build's PID namespace
(root@oak:~)# ps -f 1
UID          PID    PPID  C STIME TTY      STAT   TIME CMD
root           1       0  0 Sep02 ?        Ss     1:24 /run/current-system/systemd/lib/systemd/systemd

What!? That’s not the PID 1 I was expecting! And I definitely didn’t start this build back on September 2nd.

I will spare you the details of my disappointed fumbling; it turns out that ps and friends were still reading a /proc belonging to the root PID namespace – even though they were no longer running in it! This can cause some funny unexpected behaviour when using tools like pkill… but that’s a problem for another day.

Armed with this newfound knowledge, I was able to solve the issue by also creating a new mount namespace and mounting a fresh proc filesystem on top of the /proc inherited from the root namespace.

(root@oak:~)# nsenter -t 1684606 -p -- unshare -m

🗣 Now inside the build's PID namespace (through nsenter) and a new mount namespace (created by unshare)
(root@oak:~)# ps -f 1
UID          PID    PPID  C STIME TTY      STAT   TIME CMD
root           1       0  0 Sep02 ?        Ss     1:24 /run/current-system/systemd/lib/systemd/systemd

(root@oak:~)# mount -t proc proc /proc

(root@oak:~)# ps -f 1
UID          PID    PPID  C STIME TTY      STAT   TIME CMD
nixbld1        1       0  0 12:27 ?        Ss     0:00 bash -e /nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh

(root@oak:~)# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
nixbld1        1       0  0 12:27 ?        00:00:00 bash -e /nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh
nixbld1        6       1  0 12:27 ?        00:00:00 /nix/store/pn7863n7s2p66b0gazcylm6cccdwpzaf-python3-3.9.13/bin/python3.9 /nix/store/kdi82vgfixayxaql77j3nj7
nixbld1        7       6 99 12:27 ?        00:04:00 /nix/store/6shk4z9ip57p6vffm5n9imnkwiks9fsa-qemu-host-cpu-only-for-vm-tests-7.0.0/bin/qemu-kvm -cpu max -na
root          46       0  0 12:29 pts/5    00:00:00 -bash
root          79      46  0 12:30 pts/5    00:00:00 ps -ef
🗣 That's better!
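
For the curious, here is roughly the mechanism behind that nsenter/unshare/mount dance, expressed as the underlying syscalls in a small C sketch (this is not what I actually ran – the commands above are – and error handling is minimal). The key detail is that joining a PID namespace with setns() only takes effect for children, and a procfs instance reflects the PID namespace of the process that mounted it, which is why the fresh /proc has to be mounted after a fork.

/* Sketch of the mechanism behind `nsenter -p`, `unshare -m` and
 * `mount -t proc proc /proc` -- not the commands I actually ran. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <target-pid>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/ns/pid", argv[1]);

    /* Join the build's PID namespace (like nsenter --target PID --pid). */
    int fd = open(path, O_RDONLY);
    if (fd < 0 || setns(fd, CLONE_NEWPID) != 0) {
        perror("setns");
        return 1;
    }

    /* A fresh mount namespace (like unshare -m), so remounting /proc
     * doesn't disturb the rest of the system. */
    if (unshare(CLONE_NEWNS) != 0) {
        perror("unshare");
        return 1;
    }

    /* setns(CLONE_NEWPID) only applies to children, so fork first... */
    pid_t child = fork();
    if (child == 0) {
        /* ...keep mount events from propagating back to the host... */
        mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);
        /* ...and mount a procfs that reflects the PID namespace of the
         * mounting process, i.e. the build's namespace. */
        mount("proc", "/proc", "proc", 0, NULL);
        execlp("ps", "ps", "-ef", (char *)NULL);
        perror("execlp");
        _exit(1);
    }
    waitpid(child, NULL, 0);
    return 0;
}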

(root@oak:~)# pid=7 ./pprof
   1500 __futex_abstimed_wait_common,__new_sem_wait_slow64.constprop.1,qemu_sem_timedwait,worker_thread,qemu_thread_start,start_thread,clone3
    743 __lll_lock_wait,pthread_mutex_lock@@GLIBC_2.2.5,qemu_mutex_lock_impl,qemu_mutex_lock_iothread_impl,flatview_read_continue,flatview_read,address_space_rw,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
    100 syscall,qemu_event_wait,call_rcu_thread,qemu_thread_start,start_thread,clone3
     53 ioctl,kvm_vcpu_ioctl,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
     45 get_fid,v9fs_read,coroutine_trampoline,__correctly_grouped_prefixwc,??
     15 get_fid,v9fs_walk,coroutine_trampoline,__correctly_grouped_prefixwc,??
     11 alloc_fid,v9fs_walk,coroutine_trampoline,__correctly_grouped_prefixwc,??
      8 get_fid,v9fs_getattr,coroutine_trampoline,__correctly_grouped_prefixwc,??
      5 clunk_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
      5 alloc_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
      4 __lll_lock_wait,pthread_mutex_lock@@GLIBC_2.2.5,qemu_mutex_lock_impl,qemu_mutex_lock_iothread_impl,flatview_write_continue,flatview_write,address_space_rw,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
      3 get_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
      3 get_fid,v9fs_open,coroutine_trampoline,__correctly_grouped_prefixwc,??
      2 clunk_fid,v9fs_clunk,coroutine_trampoline,__correctly_grouped_prefixwc,??
      1 get_fid,v9fs_readlink,coroutine_trampoline,__correctly_grouped_prefixwc,??
      1 get_fid,v9fs_readdir,coroutine_trampoline,__correctly_grouped_prefixwc,??
      1 address_space_translate_internal,flatview_do_translate.isra,address_space_map,virtqueue_map_desc,virtqueue_split_pop,virtio_blk_handle_vq,virtio_queue_notify_vq.part,aio_dispatch_handler,aio_dispatch,aio_ctx_dispatch,g_main_context_dispatch,main_loop_wait,qemu_main_loop,main
      1

The most commonly sampled code locations were:

  1. Waiting on locks, in 1500 and 743 thread samples. Since the first of these is in the worker threads, I figured those were uninteresting – they are presumably just waiting for work to become available.
  2. Waiting for things to happen inside the VM, in 100 and 53 thread samples. That doesn’t look like a problem either – I would expect a VM monitor to spend most of its time waiting for its guest to need something.

The other numbers were so small that I (wrongly!) dismissed them as uninteresting too. At this point I was frustrated enough with this branch of the investigation that I abandoned it.

I moved on to quickstack, packaged it up, and found that it could not get any information on the threads at all, before returning to the Stack Overflow question and following the other link given in the answer.

Flame graph with perf

This is what finally got me somewhere. After recording performance data using perf record -F max -a -g -- sleep 20 while the build was running, I created a flame graph that made the source of the performance problem very apparent: QEMU was spending most of its time in the 9p server’s fid lookups (a fid is the handle a 9p client uses to refer to a file), which walk a linked list of every fid currently in use.

The fix was to replace that list with a hash table, implemented in glib, which QEMU already relies on. That gives us O(1) best-case complexity, while retaining the O(n) worst case. The exact real-world performance characteristics are considerably more complex than with a linked list, and there may be slightly more appropriate data structures (or hash table implementations) out there, but we’re after a big quick win here, not micro-optimisation.
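
The change itself lives in QEMU’s 9p server, which is written in C. The snippet below is not the actual patch, just a stand-alone sketch of the idea with a drastically simplified V9fsFidState: a glib GHashTable, keyed directly on the fid number, replaces the linear scan over all live fids.

/* Stand-alone sketch (not the actual QEMU patch): fid lookup via a glib
 * hash table instead of scanning a list of all live fids.
 * Build with: gcc sketch.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdint.h>
#include <stdio.h>

typedef struct V9fsFidState {
    int32_t fid;   /* client-chosen handle; the real struct holds much more */
} V9fsFidState;

int main(void)
{
    /* NULL hash/equal functions select g_direct_hash/g_direct_equal:
     * the fid number itself, cast to a pointer, is the key. */
    GHashTable *fids = g_hash_table_new_full(NULL, NULL, NULL, g_free);

    /* Creating a fid (9p "attach"/"walk"): O(1) insert on average. */
    V9fsFidState *fidp = g_new0(V9fsFidState, 1);
    fidp->fid = 42;
    g_hash_table_insert(fids, GINT_TO_POINTER(fidp->fid), fidp);

    /* The hot path ("get_fid" in the profiles above): a hash lookup
     * instead of walking a linked list of every open fid. */
    V9fsFidState *found = g_hash_table_lookup(fids, GINT_TO_POINTER(42));
    printf("found fid %d\n", found ? found->fid : -1);

    /* Dropping a fid (9p "clunk"): also O(1) on average. */
    g_hash_table_remove(fids, GINT_TO_POINTER(42));
    g_hash_table_destroy(fids);
    return 0;
}

Passing NULL hash and equal functions to g_hash_table_new_full() selects g_direct_hash/g_direct_equal, so the fid number itself (stuffed into the key pointer with GINT_TO_POINTER) serves as the key with no extra allocation.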

Rewriting the relevant code was surprisingly simple and uneventful, to the point where once I got it compiling it just worked (an experience I’m not sure I’ve ever had before in C!). The results are amazing: my previously >2-hour test now completes in 7 minutes. It also reduced the build time of NixOS’s ZFS AWS image from 19 minutes to 1. It was clear to me that this change needed to go upstream.

QEMU uses an email-based workflow, where patches are sent as emails to a mailing list that the maintainers are subscribed to. I had never done this before, and it got messy, as you can see by looking at the threads (v1, v3, v1a, v5, v6) in the list archives. There’s a lot of room for error in email-based patch submission, and I made several mistakes:

  • Forgetting to reply-all when responding to review comments, so that the reply reaches all interested parties rather than just the reviewer
  • Omitting the version tag on a resubmission (in my case due to misleading docs)
  • Sending a resubmission as a reply to the previous thread (that one was simply me failing to read the docs)
  • Losing a bunch of patch-documentation work done with git-publish, because git send-email failed (nixpkgs does not enable Git’s email functionality in the default package) and git-publish unconditionally deletes its working files once the send command has run

But the reviewers, especially Christian Schoenebeck (thank you!), were helpful and patient, and the patch was eventually accepted (though an apparently unrelated bug was discovered along the way); it will soon find its way into a QEMU release near you! If you can’t wait and you’re using Nix, you can pull the patch in ahead of time with a nixpkgs overlay.

So, what did I get out of all this?

  • I packaged up quickstack (including an upstream PR that makes the build more generic) and git-publish.
  • I used perf for the first time to work on a performance issue, and learned how useful it is along with the basics of using it.
  • I learned that different profiling methods can yield very different results — the perf results paint quite a different picture from the poor man’s profiler ones; I still don’t understand why, and I’ll spend some time learning more about both soon.
  • I submitted my first patch to QEMU, getting to know the code a little along the way.
  • I submitted my second QEMU patch (attempting to fix the misleading patch-submission docs).
  • I learned how to submit patches via email, and what mistakes to avoid.
  • I learned a bit about the 9p protocol.
  • My tests are now faster!
  • And I get that warm fuzzy feeling of making a valuable contribution.

My takeaway from this? Digging into a frustrating problem – even when I’m unfamiliar with the code and technology involved – can be very rewarding!

Not only did I make my own work on the tests more pleasant by reducing iteration times, but the improvement reaches all QEMU users and significantly reduces the load that NixOS’s installer tests place on its build farm. This is what I love about working with open source software: when there’s a problem, I have a way to go in, fix it, and pass the benefit on to everyone who uses it.

It doesn’t always go this well, but an experience like this makes many less successful adventures worthwhile.

