Make your QEMU 10 times faster with this one awesome trick
The work on QEMU described in this post was paid for by Determinate Systems, and the post was co-published on the Determinate Systems blog.
NixOS makes extensive use of QEMU-based virtual machines for running its test suite. To avoid having to create a disk image for each test, the test driver usually boots the VM with a Plan 9 Filesystem Protocol (9p) share (with the server implemented by QEMU) for the Nix store, which contains all the programs and configuration needed for the test.
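For a sense of what such a test looks like, here is a minimal sketch (illustrative only; the real tests, including the one I was working on, are considerably more involved):

# Illustrative only – see the NixOS manual's section on NixOS tests for the
# authoritative interface. pkgs.nixosTest boots a QEMU VM whose Nix store is
# shared from the host over 9p instead of being baked into a disk image.
{ pkgs ? import <nixpkgs> { } }:
pkgs.nixosTest {
  name = "minimal-example";
  nodes.machine = { ... }: { };
  testScript = ''
    machine.wait_for_unit("multi-user.target")
    # Everything the guest runs here is read from the 9p-mounted host store.
    machine.succeed("true")
  '';
}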
I was working on a test that copies a fairly large amount of data (~278k files totalling ~5.3GiB) from a 9p-mounted Nix store, and was surprised by how long the copy took. On NVMe devices, I would expect it to take a few seconds or minutes, but the test actually took more than 2 hours, most of which was spent copying files from 9p. Since that is impractically long for iterating on a test, I decided to dig in a little, and was able to reduce the duration of the test to just 7 minutes. In this post, I'll describe the entire journey.
As a preface: I don't have much experience debugging performance issues! Most of what I used here was new to me. The first thing I wanted to do was find out where most of the time was being spent. My guess was that it was in QEMU and not in the guest, although this guess turning out to be correct was a matter of pure luck. This Stack Overflow question describes a problem not too different from mine, and it first led me down the wrong path of trying the poor man's profiler, a little hack consisting of gdb, some shell, and awk (a minimal sketch follows below).
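For reference, a minimal version of that hack looks roughly like this. This is a sketch only; the exact script I ran (invoked later as pid=<PID> ./pprof) may differ in its details:

#!/usr/bin/env bash
# Minimal "poor man's profiler" sketch: repeatedly snapshot all thread
# backtraces of the target process with gdb, collapse each backtrace into a
# comma-separated line of function names, and count identical stacks.
# Usage: pid=<target PID> ./pprof
nsamples=${nsamples:-100}
sleeptime=${sleeptime:-0.1}

for _ in $(seq 1 "$nsamples"); do
  gdb -batch -ex "set pagination 0" -ex "thread apply all bt" -p "${pid:?set pid}" 2>/dev/null
  sleep "$sleeptime"
done |
awk '
  # A new "Thread" header ends the previous backtrace; print and reset it.
  /^Thread/ { if (s != "") print s; s = "" }
  # Frame lines start with "#"; field 4 is usually the function name.
  /^#/ { s = (s == "") ? $4 : s "," $4 }
  END { if (s != "") print s }
' |
sort | uniq -c | sort -rn -k1,1   # most frequent stacks first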
Tales of surprises and failures: the poor man's profiler
I immediately ran into a small snag with this method: gdb said
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
Nix uses Linux namespaces to provide builds with some isolation from the system they run on, in order to reduce the effect a particular machine's environment can have on the build result ("purity"). This includes PID namespaces, which prevent processes inside the namespace from touching any processes outside it. gdb is not happy being in a different PID namespace from the process it is targeting! So I first tried to get my gdb inside the sandbox using nsenter. The first surprise I encountered here was that entering a PID namespace does not cause utilities from procps, such as ps, pgrep and top, to report only the processes within that namespace:
(root@oak:~)# pgrep -a qemu
1678991 /nix/store/6shk4z9ip57p6vffm5n9imnkwiks9fsa-qemu-host-cpu-only-for-vm-tests-7.0.0/bin/qemu-kvm (...)
(root@oak:~)# nsenter --target 1678991 --pid
🗣 This spawned a new shell within the build's PID namespace
(root@oak:~)# ps -f 1
UID PID PPID C STIME TTY STAT TIME CMD
root 1 0 0 Sep02 ? Ss 1:24 /run/current-system/systemd/lib/systemd/systemd
What!? That's not the PID 1 I was expecting! And I definitely haven't had this build running since the 2nd of September.
I'll spare you the details of the frustrating investigation that followed, but it turns out that the /proc which ps and friends read from still refers to the root PID namespace – even if they are no longer running in it! This can lead to some funny unexpected behaviour when using tools like pkill… but that's a problem for another day.
With my newfound knowledge, I was able to work around this by also creating a new mount namespace and mounting the desired proc filesystem on top of the /proc inherited from the root namespace.
(root@oak:~)# nsenter -t 1684606 -p -- unshare -m
🗣 Now inside the build's PID namespace (through nsenter) and a new mount namespace (created by unshare)
(root@oak:~)# ps -f 1
UID PID PPID C STIME TTY STAT TIME CMD
root 1 0 0 Sep02 ? Ss 1:24 /run/current-system/systemd/lib/systemd/systemd
(root@oak:~)# mount -t proc proc /proc
(root@oak:~)# ps -f 1
UID PID PPID C STIME TTY STAT TIME CMD
nixbld1 1 0 0 12:27 ? Ss 0:00 bash -e /nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh
(root@oak:~)# ps -ef
UID PID PPID C STIME TTY TIME CMD
nixbld1 1 0 0 12:27 ? 00:00:00 bash -e /nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh
nixbld1 6 1 0 12:27 ? 00:00:00 /nix/store/pn7863n7s2p66b0gazcylm6cccdwpzaf-python3-3.9.13/bin/python3.9 /nix/store/kdi82vgfixayxaql77j3nj7
nixbld1 7 6 99 12:27 ? 00:04:00 /nix/store/6shk4z9ip57p6vffm5n9imnkwiks9fsa-qemu-host-cpu-only-for-vm-tests-7.0.0/bin/qemu-kvm -cpu max -na
root 46 0 0 12:29 pts/5 00:00:00 -bash
root 79 46 0 12:30 pts/5 00:00:00 ps -ef
🗣 That's better!
(root@oak:~)# pid=7 ./pprof
1500 __futex_abstimed_wait_common,__new_sem_wait_slow64.constprop.1,qemu_sem_timedwait,worker_thread,qemu_thread_start,start_thread,clone3
743 __lll_lock_wait,pthread_mutex_lock@@GLIBC_2.2.5,qemu_mutex_lock_impl,qemu_mutex_lock_iothread_impl,flatview_read_continue,flatview_read,address_space_rw,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
100 syscall,qemu_event_wait,call_rcu_thread,qemu_thread_start,start_thread,clone3
53 ioctl,kvm_vcpu_ioctl,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
45 get_fid,v9fs_read,coroutine_trampoline,__correctly_grouped_prefixwc,??
15 get_fid,v9fs_walk,coroutine_trampoline,__correctly_grouped_prefixwc,??
11 alloc_fid,v9fs_walk,coroutine_trampoline,__correctly_grouped_prefixwc,??
8 get_fid,v9fs_getattr,coroutine_trampoline,__correctly_grouped_prefixwc,??
5 clunk_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
5 alloc_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
4 __lll_lock_wait,pthread_mutex_lock@@GLIBC_2.2.5,qemu_mutex_lock_impl,qemu_mutex_lock_iothread_impl,flatview_write_continue,flatview_write,address_space_rw,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
3 get_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
3 get_fid,v9fs_open,coroutine_trampoline,__correctly_grouped_prefixwc,??
2 clunk_fid,v9fs_clunk,coroutine_trampoline,__correctly_grouped_prefixwc,??
1 get_fid,v9fs_readlink,coroutine_trampoline,__correctly_grouped_prefixwc,??
1 get_fid,v9fs_readdir,coroutine_trampoline,__correctly_grouped_prefixwc,??
1 address_space_translate_internal,flatview_do_translate.isra,address_space_map,virtqueue_map_desc,virtqueue_split_pop,virtio_blk_handle_vq,virtio_queue_notify_vq.part,aio_dispatch_handler,aio_dispatch,aio_ctx_dispatch,g_main_context_dispatch,main_loop_wait,qemu_main_loop,main
1
The most common stack traces are:
- Locking, in the 1500- and 743-sample entries. Since the first one is in the worker threads, I figured those were uninteresting – they are just waiting for work to become available.
- Waiting for things to happen inside the VM, in the 100- and 53-sample entries. That doesn't seem like a relevant issue either – I would expect a VM monitor to spend most of its time waiting for its guest to need something.
The other counts were so small that I (wrongly!) dismissed them as uninteresting as well. At this point I was frustrated enough with this branch of the investigation that I abandoned it.
I moved on to quickstack, packaged it up, and found that I couldn't get any thread information out of it, before returning to the Stack Overflow question and following the other link given in the answer.
Flame graph with perf
This is what finally got me somewhere. After recording performance data using perf record -F max -a -g -- sleep 20 while the build was running, I produced a flame graph that made the source of the performance problem very apparent. The following command:
# taken from the flamegraph page linked above
perf script | stackcollapse-perf.pl |
# We're only interested in the qemu process
grep ^qemu |
# Make the graph a lot less tall by collapsing multiple consecutive unknown stack frames together
awk '/unknown/ { gsub("(\\(unknown);){1,}", "(unknown...);", $0) } { print }' |
flamegraph.pl > flamegraph.svg # and generate a flamegraph
created this nice interactive SVG graph:
The prevalence of fid-related functions is very clear in this graph. So I dug into the 9p documentation and the QEMU source code, and found that fids are numbers that refer to open files on a 9p connection, much like file descriptors in the POSIX file API – so there is one fid for each file that the guest currently has open.
Let's look at the old implementation of get_fid, which QEMU's 9p server uses in its implementations of stat (retrieving file metadata), open (opening a file), read (reading data from an open file), and so on:
static V9fsFidState *coroutine_fn get_fid(V9fsPDU *pdu, int32_t fid)
{
    int err;
    V9fsFidState *f;
    V9fsState *s = pdu->s;

    // Blog note: I've omitted some parts that are irrelevant to performance here.
    QSIMPLEQ_FOREACH(f, &s->fid_list, next) {
        if (f->fid == fid) {
            return f;
        }
    }
    return NULL;
}
QEMU iterates through fid_list to find the data for the requested fid. Finding an entry in a list by iterating over it has a time complexity of O(n), where n is the size of the list – in this case, the list of all files the guest has open – so it's expensive! Additionally, some inspection of QSIMPLEQ (QEMU's simple queue) reveals that it is implemented as a linked list, a data structure that tends to exhibit poor cache locality.
Since my test copies a lot of files from the 9p filesystem, as well as booting from it, this lookup happens at least:
- once for stat, to get the file's metadata (permissions in particular)
- once for open, to get a handle on the file
- once for read on small files, or many times for large files, to get the file's contents
That makes at least 3 lookups for each of the ~278000 files (over 800,000 list traversals in total), putting an expensive search in a hot code path. This is the cause of the slowness.
What we really want is a data structure that lets us find the entry for a given fid more cheaply. We can't just use an array-backed vector, which would always give us O(1) lookups, because fids are chosen by the client: we can't rely on every newly allocated fid being the smallest free one, and we need to support arbitrary 32-bit integers. I chose a hash table, like the one implemented in glib, which QEMU already depends on. That gives us O(1) best-case complexity while retaining O(n) in the worst case. The exact real-world performance characteristics are rather more complex than for the linked list, and there may be slightly better-suited data structures (or hash table implementations) out there, but we're after a big easy win here, not micro-optimisation.
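To illustrate the idea, here is a self-contained stand-in (not the actual QEMU patch; the struct and names are simplified), showing the lookup becoming a single hash-table operation instead of a list walk:

/* Self-contained sketch of the data-structure change, using glib's GHashTable
 * keyed directly by the 32-bit fid. Compile with:
 *   cc sketch.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>
#include <stdint.h>

typedef struct {
    int32_t fid;
    const char *path;   /* stand-in for the real per-fid state */
} FidState;

int main(void)
{
    /* g_direct_hash/g_direct_equal hash the key pointer itself; we store the
     * fid directly in the key pointer, so no separate key allocation is needed. */
    GHashTable *fids = g_hash_table_new_full(g_direct_hash, g_direct_equal,
                                             NULL, g_free);

    /* "alloc_fid": insert state under the client-chosen fid. */
    FidState *f = g_new0(FidState, 1);
    f->fid = 42;
    f->path = "/nix/store/example";
    g_hash_table_insert(fids, GINT_TO_POINTER(f->fid), f);

    /* "get_fid": O(1) average-case lookup instead of an O(n) list traversal. */
    FidState *found = g_hash_table_lookup(fids, GINT_TO_POINTER(42));
    if (found) {
        printf("fid %d -> %s\n", (int)found->fid, found->path);
    }

    g_hash_table_destroy(fids);
    return 0;
}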
Rewriting the relevant code was surprisingly simple and uneventful, to the point that once I got it to compile, it just worked (an experience I'm not sure I've ever had before in C!). The results are impressive: my previously >2h test completed in 7 minutes. It also reduced the build time for NixOS's ZFS AWS image from 19 minutes to 1. It was clear to me that this needed to go upstream.
QEMU uses an email-based workflow, where patches are sent as emails to a mailing list that the maintainers are subscribed to. I had never done this before, and it got messy, as you can see by looking at the threads (v1, v3, v1a, v5, v6) in the list archives. There is a lot of room for error in email-based patch submission, and I made several mistakes (a rough sketch of the basic submission commands follows the list below):
- Forgetting to reply-all when replying to review comments, so that the reply is visible to all interested parties and not just the reviewer
- Missing the version tag on a resubmission (in my case due to misleading docs)
- Sending a resubmission as a reply to the previous thread (that one was just me failing to read the docs)
- Losing a bunch of patch-documenting work while using git-publish, because git send-email failed (nixpkgs does not enable Git's email functionality in the default package, and git-publish unconditionally deletes its working files when the send command is run)
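For the curious, the basic commands for sending a (revised) series over email look roughly like this. This is a sketch only; QEMU's patch submission docs and git-publish handle the details, and the list address and flags should be double-checked there:

# Illustrative sketch of an email-based patch submission (here: version 6).
git format-patch -v6 --cover-letter origin/master
# Requires Git's send-email support to be installed and configured.
git send-email --to=qemu-devel@nongnu.org v6-*.patch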
But the reviewers, especially Christian Schoenebeck (thank you!), were helpful and patient, and the patch eventually made it in (though a seemingly unrelated bug was discovered along the way) and will soon find its way into a QEMU release near you! If you can't wait and you're using Nix, you can pull in the patch using this overlay:
final: prev: {
  qemu = prev.qemu.overrideAttrs (o: {
    patches = o.patches ++ [
      (prev.fetchpatch {
        name = "qemu-9p-performance-fix.patch";
        url = "https://gitlab.com/lheckemann/qemu/-/commit/8ab70b8958a8f9cb9bd316eecd3ccbcf05c06614.patch";
        sha256 = "sha256-PSOv0dhiEq9g6B1uIbs6vbhGr7BQWCtAoLHnk4vnvVg=";
      })
    ];
  });
}
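If you're unsure where that overlay goes: one common way (the file name here is just an example) is to save it to a file and reference it from your NixOS configuration:

# configuration.nix – illustrative; adjust the path to wherever you saved the overlay.
{
  nixpkgs.overlays = [ (import ./qemu-9p-fix-overlay.nix) ];
}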
While the core of what I changed was not all that complicated, this adventure had many positive side effects for me:
- I wrote Nix packages for quickstack (including an upstream PR that makes the build more generic) and git-publish.
- I used perf to work on a performance issue for the first time, learning how useful it is and the basics of how to use it.
- I learned that different profiling methods yield different results: the results from perf paint a very different picture from the poor man's profiler's. I still don't understand why, but I intend to spend some time learning more about both soon.
- I submitted my first patch to QEMU, getting to know the codebase a little.
- I submitted my second QEMU patch (trying to fix the patch submission docs).
- I learned how to submit patches via email, and what mistakes to avoid.
- I learned about the 9p protocol.
- My tests are now faster!
- And I got that warm fuzzy feeling of making a valuable contribution.
My takeaway from this? Digging into a frustrating problem, even when I'm not familiar with the code involved or the technology I'm using, can be very rewarding!
Not only did I make my own work on the tests more pleasant by reducing iteration times, but the improvement goes out to all QEMU users, and it significantly reduces the load that NixOS's installer tests generate on the build farm. This is what I love about working with open source software: if there's a problem, I have a way of getting in there and fixing it, and passing the benefit on to everyone who uses it.
It doesn't always go this smoothly, but an experience like this can make many less successful adventures worth it.