Re-partition Fedora Copr Hypervisors
This post documents our experience from the recent administrative work with
Fedora Copr, during which we had to re-partition the volume layout on a set of
machines. It can serve as a
reference in the future when working with LVM setups on non-trivial mdadm
arrays.
The problem
The Fedora Copr hypervisors are relatively old machines, hosted in the Fedora Infrastructure lab. They were originally used as Koji builders, then repurposed for OpenStack cloud nodes, and finally, repurposed for Fedora Copr. It’s possible that the disks on these machines are aging, or perhaps there have been recent changes in the structure of Copr build tasks. Regardless, there was a significant slowdown some time ago.
The disks were so slow that everything took ages. In extreme situations, the
disk would hang, causing every process that accessed the
disks, either on the hypervisor host or in the VM, to end up in an
uninterruptible state ‘D’. This even included the LibVirt daemon process,
which, in turn, meant that even the administrator couldn’t recover from this
situation (when I cannot remove the I/O intensive virtual machine, I cannot make
things better…). SSHD was still running on the hypervisor; however, any
attempt to establish a new connection touched the disk and failed (for
example, the /bin/bash binary needs to be loaded from the disk, and it couldn’t be).
The only way out of this situation was a cold reboot.
Actually, not only was SSHD itself running, but the established SSH connections to VMs on the hypervisor were still seemingly working (processing SSH keep-alive packets doesn’t touch the disk, so it kept the controlling connections alive). But any action performed over those SSH connections hung, obviously. This confused our automation miserably.
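For the record, stuck processes like these are easy to enumerate when it happens again. A sketch (an illustration, not the exact command from our debugging session):
$ ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'
Every process reported here waits inside the kernel, typically on I/O, and cannot be killed, not even with SIGKILL.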
The root problem was the disk layout. Despite having 8 (10 in some cases) SAS
disks, they were all part of a raid6 (raid5 in one case) software RAID, and
everything was stored on that array (including the / volume, the host’s
SWAP volume, guest disks and SWAP volumes, etc.). The Copr builders use the
Mock tmpfs feature, and some extreme builds expectedly use SWAP
extensively. Under pressure, the RAID redundancy overhead simply overloaded the
disks, eventually causing them to hang (a deadlock?). Isn’t this possibly a bug in the
kernel RAID code? On the Copr Backend (raid10), the default periodic RAID
checks used to hang our systems similarly (lowering dev.raid.speed_limit_max
helped us to work around that).
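For illustration, that workaround is a single sysctl knob; the threshold below is a made-up example, not the exact value we deployed:
$ sysctl dev.raid.speed_limit_max
dev.raid.speed_limit_max = 200000
$ sysctl -w dev.raid.speed_limit_max=10000
A drop-in file in /etc/sysctl.d/ makes the setting persistent across reboots.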
So, we tried multiple reboots… The solution? Changing the disk layout.
Unfortunately, we had no physical access to the lab, and remote reinstallation
was hardly possible (at least for some of those machines, which lacked a working
console). There was no chance to hot-plug additional storage to offload /
“somewhere.” However, thanks to LVM and the raid6 redundancy, we were able to
“online” re-partition the layout so that (a) the / is now moved to its own
(different and smaller) raid6 and (b) everything else is uniformly “striped”
over all the disks to maximize parallel I/O throughput.
The old disk layout
We used to have this (multiply the lsblk output by 8 or 10, as all the
physical disks used to have the same layout):
sdj 8:144 0 446.6G 0 disk
└─sdj3 8:147 0 445.2G 0 part
└─md127 9:127 0 3.5T 0 raid6
├─vg_guests-root 253:0 0 37.3G 0 lvm /
├─vg_guests-swap 253:1 0 300G 0 lvm [SWAP]
└─vg_guests-images 253:2 0 3.2T 0 lvm /libvirt-images
Note that md127 was raid6. Guest SWAPs were created as sparse files in
/libvirt-images. The host SWAP was on the same md127. Fortunately, LVM was used,
as detailed below.
Please also note that I intentionally filtered out the boot-related raid1
partitions. These remain unchanged in this post (yes, raid1 spread across 8+
disks, resulting in high redundancy, but that’s the only small point of interest
here).
The new layout
We iterated to this (see below how). Again multiply by 8 or 10:
sdj 8:144 0 446.6G 0 disk
├─sdj3 8:147 0 40G 0 part [SWAP]
├─sdj4 8:148 0 15G 0 part
│ └─md13 9:13 0 105G 0 raid6
│ └─vg_server-root 253:0 0 37.3G 0 lvm /
└─sdj5 8:149 0 390.1G 0 part
└─md0 9:0 0 3.8T 0 raid0
└─vg_guests-images 253:1 0 1.9T 0 lvm /libvirt-images
Here, / stays on raid6 as it deserves disk redundancy to keep the machine
bootable upon a disk failure. However, note that it is on a different
vg_server volume group. The SWAP is spread across separate 40G partitions
(8 or 10 of them). Please note that when the swap devices are activated (with
swapon) with the same priority, the kernel stripes the swapped pages across all
of them. And everything else is on an independent, striped raid0!
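Once the swap devices are activated (see the steps below), the equal priorities can be verified with swapon; the output here is illustrative, not captured from the real machines:
$ swapon --show=NAME,SIZE,PRIO
NAME       SIZE PRIO
/dev/sdj3   40G    2
/dev/sdi3   40G    2
...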
How we did this
The trick is in the RAID redundancy (there are some spare disks to use) and the ability to migrate Volume Groups (VGs) across Physical Volumes (PVs) online.
1. Don’t reboot until the end.
2. Stop all the VMs on the machine.
3. Back up the data on the filesystem that is about to be destroyed.
$ cp /libvirt-images/copr-builder-x86_64-20230608_112008 /dev/shm/
Warning: This is not a real backup.
4. Drop the large filesystem and swap.
$ umount /libvirt-images
$ swapoff -a
# Keep the root LV!
$ lvremove /dev/vg_guests/images
$ lvremove /dev/vg_guests/swap
5. For the case of an accidental reboot, comment out the affected entries in /etc/fstab.
$ cat /etc/fstab
...
# LABEL=swap none swap sw 0 0
# LABEL=vmvolumes /libvirt-images ext4 defaults 0 0
6. Rename the VG from vg_guests to vg_server, and the LV to a self-describing name.
$ vgrename vg_guests vg_server
$ lvrename /dev/vg_server/LogVol00 /dev/vg_server/root
7. Fix fstab.
$ cat /etc/fstab
...
/dev/mapper/vg_server-root / ext4 defaults 1 1
8. Cut out one of the disks from the raid6 (we chose the last one). We need to a) fail the disk and b) then remove it from the array.
$ mdadm /dev/md127 -f /dev/sdh3
$ mdadm /dev/md127 -r /dev/sdh3
Check /proc/mdstat; the array is still consistent.
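For illustration (not a capture from the real session), a degraded-but-consistent raid6 shows up roughly like this in /proc/mdstat; the [8/7] and the missing U mark the removed member:
$ cat /proc/mdstat
md127 : active raid6 sdg3[6] sdf3[5] sde3[4] sdd3[3] sdc3[2] sdb3[1] sda3[0]
      3758096384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [UUUUUUU_]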
9. Don’t let the kernel think the partition is still part of a RAID.
$ mdadm --zero-superblock /dev/sdh3
10. Partition the disk.
We used cfdisk. We dropped the 3rd partition, then created the 3rd partition again as a 40G partition, the 4th partition as 15G (for the / raid6), and left the rest for the raid0 on the 5th. Something like:
sdh 8:112 0 446.6G 0 disk
...skip boot partitions...
├─sdh3 8:115 0 40G 0 part
├─sdh4 8:116 0 15G 0 part
└─sdh5 8:117 0 390.1G 0 part
If you have to use extended partitions, an additional sdh partition of 1K size is expected in the lsblk output.
You might need to run partprobe. On one of the Power8 machines with multipath ON, I had to run multipath -r.
11. Move the VG onto the new partition.
Warning: No / redundancy now. We could afford to risk this; you might not. Think twice.
$ vgextend vg_server /dev/sdh5  # does pvcreate in the background
$ pvmove -i 10 /dev/md127       # print status every 10s
This magically evacuates our raid6 volume, and since the array is then unused, it allows us to repartition the rest of the physical disks.
12. Partition the other 7 (or 9) disks the same way as in step 10.
13. Create the raid6 for /. Well, let’s also increase the number of tolerated disk failures with a single hot spare.
$ mdadm --create --name=server --verbose /dev/md/server --level=6 --raid-devices=9 --spare-devices=1 /dev/sd{a,b,c,d,e,f,g,h,i,j}4
It might be a good idea to wait until /proc/mdstat reports the array as synced.
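One way to keep an eye on the initial sync (just a convenience, any equivalent works):
$ watch -n 60 cat /proc/mdstat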
14. Move the data to the new raid6 above:
$ vgextend vg_server /dev/md/server
  Physical volume "/dev/md/server" successfully created.
  Volume group "vg_server" successfully extended.
$ pvmove -i 10 /dev/sdh5
15. Drop the now-evacuated PV from the VG:
$ vgreduce vg_server /dev/sdh5
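Optionally, also wipe the stale LVM label from the evacuated partition so later tooling does not complain about it (a suggestion, not necessarily what we ran):
$ pvremove /dev/sdh5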
16. Create a raid0 array across all the disks and a new VG on it:
$ mdadm --create --name=guests --verbose /dev/md/guests --level=0 --raid-devices=10 /dev/sd{a,b,c,d,e,f,g,h,i,j}5
$ vgcreate vg_guests /dev/md/guests
17. Create a logical volume and /libvirt-images:
$ lvcreate -n images -l 50%FREE vg_guests
$ mkfs.ext4 /dev/mapper/vg_guests-images -L vmvolumes
$ edit fstab             # Uncomment the commented-out /libvirt-images mountpoint.
$ mount /libvirt-images/ # Likely already auto-mounted by systemd.
18. Create the SWAP partitions:
$ x=0; for i in /dev/sd{a,b,c,d,e,f,g,h,i,j}3; do x=$(( x + 1 )); mkswap -L swap-$(printf "%02d" "$x") "$i"; done
Adjust fstab with:
LABEL=swap-01 none swap sw,pri=2 0 0
LABEL=swap-02 none swap sw,pri=2 0 0
...
LABEL=swap-10 none swap sw,pri=2 0 0
And “mount” the swap with:
$ swapon -a
19. Fix LibVirt:
$ cp /dev/shm/copr-builder-x86_64-20230608_112008 /libvirt-images/  # "backup" restore
$ restorecon -RvF /libvirt-images/
Note that we had to migrate the golden image from QCOW2 to RAW format, as QCOW2 performed terribly on the new layout (the drastic slowdown after the move from raid6 to raid0 initially scared me a lot!).
$ qemu-img convert -O raw copr-builder-ppc64le-20230608_110920 image.raw
$ cp --sparse=always image.raw copr-builder-ppc64le-20230608_110920
cp: overwrite 'copr-builder-ppc64le-20230608_110920'? yes
Try to start some VMs.
20. Fix the mdadm config:
Actually, I’m not sure this is needed. The kernel boot process seems to auto-assemble the arrays, but it’s not good to keep the config file inconsistent.
$ mdadm --detail --scan
...
ARRAY /dev/md/server metadata=1.2 name=server UUID=ecd6ef8f:d946eedc:72ba0726:d9287d44
ARRAY /dev/md/guests metadata=1.2 name=guests UUID=9558e2e0:779d8a1b:87630417:e57d7b91
These lines must be in /etc/mdadm.conf, so drop the old, non-existing entries.
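Assuming the stale ARRAY lines are deleted by hand first, the fresh ones can be appended like this:
$ mdadm --detail --scan >> /etc/mdadm.conf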
21. Fix GRUB:
Note that we moved the / volume from one MD to another. This required changing the kernel cmdline.
We also removed the now non-existing resume=/dev/mapper/vg_guests-swap swap partition. We can’t hibernate now, but we never needed or tried that anyway.
We updated the rd.md.uuid hosting the / volume.
There’s the grubby utility that helps with this task:
$ grubby --update-kernel=ALL --args 'root=/dev/mapper/vg_server-root ro crashkernel=auto rd.md.uuid=9c39034c:d618eca6:c7f4f9e7:d05ec5d4 rd.lvm.lv=vg_server/root rd.md.uuid=79366fc3:a8df5fb4:caba2898:5c8bd9d2'
Note that I had a lot of fun with GRUB2, and I had to play with grub2-mkconfig on a few of the hypervisors, too. Not sure why. Make sure you double-check that the /boot/grub2/grubenv config is correct and consistent with /etc/default/grub or /etc/sysconfig/grub, etc. Perhaps some modifications are needed; you really risk losing your remote box (NB: I myself managed to cause headaches on one of the hypervisors by omitting this whole step 😖).
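For reference, regenerating the main config on these BIOS-booting machines is the usual (the output path differs on EFI systems):
$ grub2-mkconfig -o /boot/grub2/grub.cfg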
22. Double-check:
Double-check the GRUB config again. Double-check /etc/fstab.
23. Reboot and enjoy!