Posted on: 2020-10-01
Update 2023: We've moved many things to CEPH, but ZFS is still in use. The bug hasn't occurred for over a year now, so it's probably fixed.
On one of our Proxmox 6 servers we recently had the issue of Icinga constantly complaining about a high load average. The load average itself is a bit of a weird metric: it measures the length of the run queue, i.e. how many processes are waiting to run - either because they are waiting for CPU time to become available or for data to arrive from the disk. It doesn't count tasks that are deliberately sleep()ing or waiting for network I/O (unless it's an NFS mount, in which case it counts as disk I/O, not network I/O). Basically, if it's too high it tells you something is wrong, but you have to dig deeper to figure out exactly what.
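As a quick illustration (generic commands, not output from the affected server), the load averages and the number of tasks currently stuck in uninterruptible sleep (state D) can be checked like this:
# uptime
# cat /proc/loadavg
# ps -eo state= | grep -c D
uptime and /proc/loadavg both show the 1/5/15 minute averages; the last line simply counts processes in D state, which is usually the interesting number when the CPU looks idle but the load is high.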
So, on our system top shows a CPU utilization of around 10%, and iotop just a few kilobytes up to a few megabytes of disk activity per second... What could it be? There's one command to find out - it shows only the processes actually in the run queue, so we can get a better idea of what's stuck. Let's run it:
# ps r -e
  PID TTY      STAT   TIME COMMAND
  442 ?        D      0:00 [z_unlinked_drai]
10102 ?        D      0:01 [z_unlinked_drai]
14311 ?        D      0:00 zfs recv -F -- rpool/data/subvol-100-disk-0
18114 pts/9    R+     0:00 ps r -e
19302 ?        D      0:00 zfs recv -F -- rpool/data/subvol-101-disk-0
Besides the "ps" process itself we're seeing two ZFS userspace processes and two ZFS kernel threads. They are all coming from the Proxmox replication services. It seems to be a bug in zfsonlinux - too bad :/. You can't kill processes stuck in an uninterruptible syscall or kernel threads. Reading up on it doesn't show a clear reason for the behavior. We're running our ZFS on spinning rust with two NVMe disks for l2arc and zil. Swap is also on NVMe. The only thing to do is install all updates, reboot the server and hope it doesn't happen again.
Update, March 2021: The bug in zfsonlinux still isn't fixed. We are in the process of moving many of our VMs to CEPH storage on NVMe drives for performance reasons. We keep the hard disks on ZFS for bulk storage and backups. Having fewer VMs with ZFS replication should reduce the need for constant reboots.