2011-01-22

Gentoo, initrd, mdadm, and an online conversion of the root filesystem

My gentoo system froze last night kinda mysteriously. I/O to /var seemed to freeze, which is extremely odd since /var is not a standalone filesystem but rather is on / and access to other random parts of / didn't seem to go wrong. Very mysterious and I eventually got myself into a state where I had to start doing magic sysreq keys to recover. Strangely C-A-S-e (which terminates processes) cleared the problem, whatever that problem was. This also allowed the system to log a number of messages which showed various processes hung in a variety of reiserfs journal calls. Cause or effect? No way to know for sure.

While my system rebooted cleanly, it seemed like I should take this opportunity to convert away from murderfs to a filesystem with a future. Without spending too much time thinking about it, I picked ext4 (remember this is for /—my main storage filesystem remains xfs). Unfortunately I later discovered that I failed to compile ext4 support in my kernel, but that was easy enough to resolve so I will not recount that excursion further.

Because I am not an idiot, I have RAID configured for my filesystems. On this system, I am using RAID-1 for all filesystem activities. I have three identical disks, so I am actually using 3-way RAID-1, meaning I can lose a disk and still have redundancy. Because I am really not an idiot, I am not using hardware RAID. Hardware RAID is nice and all, but I hardly think it is going to give you much performance gain on RAID-1 (ignoring any mythical battery-backed cache) and the control you lose by having an embedded RAID controller is pretty serious. I scrub my raid partitions every week so I can discover any disk corruption before it gets serious, something that is normally difficult or impossible with embedded controllers.

But in any case, what this (software RAID-1) really means is that I can convert my root filesystem online. In fact it is pretty simple. First boot into single user mode. Normally adding "single" to the kernel boot line is sufficient, but with gentoo's stupid initrd option processing, you have to "mother may I" it using "init_opts=single". Once you know this it is only a minor annoyance. I have a fake kernel boot option which documents this so that I don't have to remember or google it during disaster recovery.

mdadm --manage /dev/md0 --fail /dev/sdc1
mdadm --manage /dev/md0 --remove /dev/sdc1
mdadm --create /dev/md3 --level=1 -n 3 --metadata=0.90 /dev/sdc1 missing missing

While the commands are all mostly pretty obvious (remove a partition and then create a new raid using that partition), the last command could probably use some explanation. I decided to inform the newly created raid array that the end goal would be to have three devices. I believe this is not strictly necessary for RAID-1, but this way everything is reserved up front, so I specified 3 devices and said the other two were missing. The other interesting bit is the --metadata option. The newer metadata versions for mdadm provide more power (and specifically power that I could really have used later), but it appears that you have to have a grub which supports the newer mdadm superblock format. Unfortunately, the grub that I have does not support it. (I don't really understand why Gentoo hasn't upgraded; my guess is that you would need to reinstall the boot blocks, which is probably a bit tricky for some users. Certainly it is non-obvious how to do so for software RAID users.) So I went ahead and specified the old-style metadata.
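
A quick way to confirm that an array created with "missing" slots really is degraded the way you intended is to read the status map in /proc/mdstat. What follows is only a sketch: the function name md_missing_members and its optional file argument are invented for illustration, and it assumes the usual layout where the line after the array header ends in a map such as [U__].

```shell
#!/bin/sh
# Sketch: report how many member slots of an md array are empty, by reading
# /proc/mdstat-style text. The function name and the optional file argument
# are made up for this example.
md_missing_members() {
    array=$1                      # e.g. md3
    mdstat=${2:-/proc/mdstat}     # pass a saved copy to try it offline
    awk -v name="$array" '
        $1 == name && $2 == ":" { found = 1; next }   # header line of our array
        found {
            # The next line ends with a status map such as [U__]:
            # U is a working member, _ is an empty or failed slot.
            map = $NF
            print gsub(/_/, "", map)
            exit
        }
    ' "$mdstat"
}

# Possible use right after the --create above:
#   md_missing_members md3
```

For the 3-way array created above with one real partition, I would expect this to report two empty slots.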

mkfs.ext4 /dev/md3
mount /dev/md3 /mnt/usb
mount -r -o bind / /mnt/cdrom

Well, obviously I used two random /mnt directories that I had lying around. The interesting bit here is my use of a bind mount of / so that when I copy the root filesystem, I can see the data underneath the mount points. So I neither copy nor skip /proc (for example); instead I make a faithful copy of the /proc directory hidden underneath the mount point. Speaking of faithful copies:

cd /mnt/cdrom
tar cSf - . | (cd /mnt/usb; tar xSf -)
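
As an aside on the S flag in that pipeline, here is a sketch for checking on scratch directories that a tar copy keeps holes as holes. The names copy_tree and report_sizes are invented for this example, and it assumes GNU tar, GNU stat and a filesystem that supports sparse files.

```shell
#!/bin/sh
# Sketch: copy a directory tree with the same tar flags and report the
# apparent size and allocated size of one file on the destination side.
# copy_tree and report_sizes are names invented for this example.
copy_tree() {
    src=$1
    dst=$2
    # Same flags as the pipeline in the post; && keeps the extracting tar
    # from running in the wrong directory if the cd fails.
    (cd "$src" && tar cSf - .) | (cd "$dst" && tar xSf -)
}

report_sizes() {
    # Prints "<apparent bytes> <allocated KiB>" for one file (GNU stat and du).
    printf '%s %s\n' "$(stat -c %s "$1")" "$(du -k "$1" | cut -f1)"
}

# Try it on scratch directories rather than on a live root:
#   src=$(mktemp -d); dst=$(mktemp -d)
#   printf 'x' > "$src/holey"; truncate -s 10M "$src/holey"
#   copy_tree "$src" "$dst"
#   report_sizes "$src/holey"; report_sizes "$dst/holey"
```

If the allocated size of the copy stays far below its apparent size, the holes survived the trip.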

I use S to support sparse files. Sparse files are a real danger on filesystems with /var/log on them (wtmp, lastlog, etc.), and the flag is a good idea anytime. Because I don't have a separate /boot on this particular system, I need to install boot blocks. Well, need is perhaps too strong of a word since I won't actually be booting from /dev/sdc, but it is still a good habit to get into.

touch /boot/magic-nounce
grub
find /boot/magic-nounce
root (hd2,0)
setup (hd2)
quit
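
The interactive session above could also be scripted, since GRUB legacy's shell accepts commands on standard input when started with --batch. This is a sketch only: grub_setup_commands is a name invented here, and it assumes (as on this system) that the first partition of the disk holds /boot/grub.

```shell
#!/bin/sh
# Sketch: emit the GRUB legacy shell commands that install boot blocks on one
# BIOS disk whose first partition holds /boot/grub. The function name is
# invented for this example.
grub_setup_commands() {
    disk=$1    # GRUB legacy disk name such as hd2, as reported by `find`
    printf 'root (%s,0)\nsetup (%s)\nquit\n' "$disk" "$disk"
}

# Possible use, as root, once `find` has told you which disk is which:
#   grub_setup_commands hd2 | grub --batch
```

I would still run the find step by hand first, since the hdN numbering depends on the BIOS disk order.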

As you see here, I am creating a temporary file to let me know which of the many copies of /boot I have on the system are the new ones which I need to install boot blocks for. I then run grub and ask it to find that file, and then go through the normal installation process for the identified partition. Only one more minor fixup.

emacs /etc/fstab /etc/mdadm.conf
# Replace reiserfs with ext4 for /

Yes, yes. fstab doesn't actually control the mounting of / so it really doesn't matter that the filesystem is accurate. However, best remove any contradictory information to avoid future confusion. In a fit of obsessive/compulsive behavior, I go ahead and update my manually specified raid assembly lines (by UUID) in /etc/mdadm.conf. I don't think many people bother to touch this file nor do I think it is used in my configuration, but whatever.

Now I am ready to do a test boot. At this stage, I was under the (false) impression that the system generated the /dev/md# numbers in the order that it found md partitions on disk, and since I had converted sdc1 (the last disk), I believed that it would not detect the new root first. I was wrong about the reason, but the effect was the same.

init 6
# Interrupt the grub auto-boot
# Replace /dev/md0 with /dev/md3 on the real_root option, add init_opts=s

This lets me boot with the new raid array as the root filesystem. I can validate that everything is OK while in read-only mode. I checked /proc/mdstat to ensure that the raid was created the way I assumed (it was) and then /etc/fstab for ext4 to ensure that my / was from the correct RAID (it was). I then did an "exit" and let the system boot to multi-user. Everything seemed good. Now I am ready to start switching over.

mdadm --manage /dev/md0 --fail /dev/sda1
mdadm --manage /dev/md0 --remove /dev/sda1
mdadm --manage /dev/md3 --add /dev/sda1

I flipped the BIOS boot disk over to the new raid partition so that the BIOS would be using the new boot blocks and so forth. I was still under the impression here that this would automagically set md0 to be the new partition. While the raid was resynchronizing the new device I added, I went again and installed the boot blocks:

grub
find /boot/magic-nounce
root (hd0,0)
setup (hd0)
quit
watch cat /proc/mdstat
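
Watching by eye is fine for a one-off. For a script, a rough equivalent is to poll mdstat until no progress line remains. This is a sketch under assumptions: md_sync_idle and wait_for_md_sync are invented names, and the pattern relies on rebuild progress lines containing "recovery =" or "resync =". I believe `mdadm --wait /dev/md3` does much the same job for a single array.

```shell
#!/bin/sh
# Sketch: succeed only when no array in /proc/mdstat-style text is still
# rebuilding. Function names and the optional file argument are invented.
md_sync_idle() {
    mdstat=${1:-/proc/mdstat}
    # A rebuilding array shows a progress line containing "recovery =" or
    # "resync =" (a queued one shows "resync=DELAYED").
    ! grep -Eq '(recovery|resync) *=' "$mdstat"
}

wait_for_md_sync() {
    # Poll until idle; the 10 second interval is an arbitrary choice.
    until md_sync_idle "$@"; do
        sleep 10
    done
}

# Possible use before rebooting:
#   wait_for_md_sync && init 6
```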

After the boot block installation, I watched (`watch cat /proc/mdstat`) the mdstat file to wait for the raid to be fully synchronized. Once that was done, ^C and a rebooting we go.

init 6
# Interrupt grub auto-boot
# add init_opts=s

After booting into single user mode, I did a quick check (of fstab) to see if I was using the correct root partition. Uh, no. I discovered that I was completely wrong about the kernel numbering by discovery order (which is good in general, just bad for my hopes for a clean and fast conversion). I then started looking at manual pages and google to try and find what the magic was. I quickly found out that there was a preferred minor device number, but…there is no way to manually specify it! (at least for the 0.90 metadata version—more recent versions have the name option which I hope overrides the saved number). Instead it uses the last minor/md number the drive was assembled under. I couldn't believe it. Sure, for partitions other than / it is no trouble to boot into single user, remove any auto-assembly, and then re-assemble the devices with the names you desire, but if you are mucking around with / you are kind of out of luck here. Fortunately there is a way to ask the kernel to assemble the devices the way you want. Hurrah!

init 6
# Interrupt grub auto-boot
# add md=0,/dev/sda1,/dev/sdc1 init_opts=s

Uh…no joy. It is clearly documented so this should work. Well, perhaps the kernel auto-assembly goes first. Fortunately, there is a way to ask the kernel to assemble the devices the way you want. Hurrah!

init 6
# Interrupt grub auto-boot
# add raid=noautodetect md=0,/dev/sda1,/dev/sdc1 init_opts=s

Uh…no joy. It is clearly documented so this should work. Hmm. Well…I notice that the initrd seems to be printing some lines about mdadm assembly. Could it be reverting the kernel's assembly in some fit of idiocy? Clearly it is not using a copy of /etc/mdadm.conf from the true root filesystem since I happened to update that beforehand with the correct UUID when I added ext4 support, so it must be undoing what the kernel did and re-doing auto-assembly. Nice. Not.

init 6
# Interrupt grub auto-boot
# Remove initrd line from current config
# Change root=/dev/ram0 to /dev/md0
# Remove init=/linuxrc real_root=/dev/md0
# Add md=0,/dev/sda1,/dev/sdc1 s
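
Spelled out, the kernel arguments from that last attempt follow the documented md=<number>,<device>,<device>... form for arrays with persistent superblocks. Here is a small sketch that builds the string so it is harder to typo at the grub prompt; md_root_args is a name invented for this example.

```shell
#!/bin/sh
# Sketch: build the kernel command-line fragment that assembles an md array
# with persistent superblocks and uses it as root. The function name is
# invented for this example.
md_root_args() {
    minor=$1          # desired md number, e.g. 0 for /dev/md0
    shift
    members=$(IFS=,; printf '%s' "$*")   # join the member devices with commas
    printf 'root=/dev/md%s md=%s,%s\n' "$minor" "$minor" "$members"
}

# Possible use:
#   md_root_args 0 /dev/sda1 /dev/sdc1
```

Append "s" (or init_opts=s when an initrd is in play) yourself if you want single user mode.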

Joy! I finally am booted to /dev/md0 with the right /etc/fstab. The system appears to have renumbered the old /dev/md0 to /dev/md127. If for some reason you require an initrd to boot (say you don't have the md driver compiled into the kernel), you have two options. The first is to boot off a rescue CD (like the gentoo install CD). The second is to build a new kernel with the md driver built in (mount /dev/md3 onto /mnt/cdrom, chroot /mnt/cdrom, cd /usr/src/linux, make menuconfig, save the config and then do your normal genkernel thing).

cat /proc/mdstat
head /etc/fstab
mdadm --detail /dev/md0
mdadm --examine /dev/sda1
mdadm --examine /dev/sdc1
mdadm --detail /dev/md127
mdadm --examine /dev/sdb1
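
The one field that matters in all of that --examine output is "Preferred Minor", so a small filter saves some squinting. This is a sketch that assumes the 0.90 superblock output format, where the field appears as "Preferred Minor : N"; the function name preferred_minor is invented here.

```shell
#!/bin/sh
# Sketch: pull the "Preferred Minor" field out of `mdadm --examine` output
# for a 0.90 superblock, read from standard input. The function name is
# invented for this example.
preferred_minor() {
    # The field looks like "Preferred Minor : 0"; split on the colon and
    # strip the surrounding blanks.
    awk -F: '/Preferred Minor/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Possible use, as root:
#   for part in /dev/sda1 /dev/sdb1 /dev/sdc1; do
#       printf '%s -> md%s\n' "$part" "$(mdadm --examine "$part" | preferred_minor)"
#   done
```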

Inspecting the output of the detail and examine mdadm commands, we can see that the "Preferred Minor" number has been properly reset…for everything except /dev/sdb1. Well, I remember from the man page how to fix that.

mount -r /dev/md127 /mnt/cdrom
mdadm --examine /dev/sdb1

Everything looks quite nice. Now let's see if going through a normal boot with initrd will work (well, honestly this kinda looks like I might not need initrd, but I will ignore that thought and press on).

init 6
# Interrupt grub auto-boot
# Add init_opts=s

Still nice. fstab shows the right root, I am booted from /dev/md0. Everything OK.

head /etc/fstab
cat /proc/mdstat
exit

I press on to multi-user mode. Everything still looking good. At this point, I have a clean boot using my brand new ext4 filesystem so the last remnants of reiserfs can be swept away.

mdadm --stop /dev/md127
mdadm --manage /dev/md0 --add /dev/sdb1
grub
find /boot/magic-nounce
root (hd1,0)
setup (hd1)
quit
watch cat /proc/mdstat

Once my raid is resynchronized, I can rest on my laurels…though honestly are laurels comfortable to rest on? Back in the days of the Roman Empire sure, but now? A nice memory-foam is probably much more comfortable.