OMG: whitby is being rebooted

#210

Opened by tazjin at 2022-10-12T07·10+00

This is a tracking issue for scheduled maintenance of whitby. It has been up for quite a long time:

tazjin@whitby ~> uptime
 07:05:30  up 831 days 12:59,  1 user,  load average: 0.17, 0.31, 0.32

Rebooting whitby is tricky because the intended process relies on an initrd SSH server through which we enter the disk encryption password, and we have essentially never exercised this in practice.

Before actually rebooting, the following checklist will be run:

- Ensure initrd SSH keys are up to date
- Ensure whitby is canonical
- Ensure sanduny is canonical
- Ensure that restic backups completed successfully
- Ensure that a copy of the Gerrit state is available on sanduny (though not being served)
- Await KVM console attachment notification from Hetzner
- Check that disk encryption password is actually up-to-date
  Verified with zfs load-key -n zroot and the password we have shared.
- Check recent bootloader entries ¹

We will reboot with the KVM console attached and monitor the reboot.

After rebooting, we will run this checklist:

- Ensure that all public-facing services are up
- Ensure that network configuration came back correctly

¹ During whitby's uptime, nixpkgs has had numerous bugs that broke the writing of bootloader entries.

tazjin updated the body of this issue at 2022-10-12T07·14+00
tazjin updated the body of this issue at 2022-10-12T07·15+00
tazjin updated the body of this issue at 2022-10-12T07·27+00
tazjin updated the body of this issue at 2022-10-12T07·33+00
tazjin updated the body of this issue at 2022-10-12T07·37+00
Current whitby system generation is 393 after canonicalising at latest HEAD. This matches the latest entry in grub.cfg, making me think the bootloader is up-to-date.
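
For anyone repeating this check, here is a minimal sketch of the comparison described above. It assumes the system profile link is named like system-393-link and that grub.cfg entries contain the text "Configuration N"; the sample entries and all function names are made up for illustration, not taken from whitby.

```python
import re
from typing import List


def generation_from_profile_link(target: str) -> int:
    """Extract N from a profile link name such as 'system-393-link'."""
    match = re.search(r"system-(\d+)-link$", target)
    if match is None:
        raise ValueError(f"unexpected profile link target: {target!r}")
    return int(match.group(1))


def generations_in_grub_cfg(grub_cfg: str) -> List[int]:
    """Return the sorted generation numbers mentioned in grub.cfg.

    Assumption: entries look like 'NixOS - Configuration 393 (...)'.
    """
    return sorted({int(n) for n in re.findall(r"Configuration (\d+)", grub_cfg)})


def bootloader_is_current(link_target: str, grub_cfg: str) -> bool:
    """True if the newest grub.cfg entry matches the current generation."""
    generations = generations_in_grub_cfg(grub_cfg)
    if not generations:
        return False
    return max(generations) == generation_from_profile_link(link_target)


# Hypothetical sample data; real entries carry more detail.
sample_cfg = """
menuentry "NixOS - Default" { }
menuentry "NixOS - Configuration 393 (2022-10-12)" { }
menuentry "NixOS - Configuration 392 (2022-10-05)" { }
"""

print(bootloader_is_current("system-393-link", sample_cfg))
```

On a real machine the link target would presumably come from os.readlink on the system profile and the text from the grub.cfg file; both locations are left out here because they depend on the host's configuration.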

tazjin at 2022-10-12T07·39+00
tazjin updated the body of this issue at 2022-10-12T07·40+00
tazjin updated the body of this issue at 2022-10-12T07·41+00
First problem: I can't get the HTML5-based KVM console to work anymore. It just shows me some green blobs, although the little thumbnail screenshot looks correct.

Hetzner's fallback thing is a JavaWS application (of course), so I'm trying to figure out how to run that right now.

tazjin at 2022-10-12T07·48+00
Current status:

That's good enough for me, we're going in.

tazjin at 2022-10-12T07·54+00
Unlocking the disk over SSH worked perfectly fine.

tazjin at 2022-10-12T08·01+00

IPv4 works (I connected over it), and it seems like v6 also came back up normally:

tazjin@sanduny ~> ping -6 whitby.tvl.su
PING whitby.tvl.su(whitby.tvl.fyi (2a01:4f8:242:5b21:0:feed:edef:beef)) 56 data bytes
64 bytes from whitby.tvl.fyi (2a01:4f8:242:5b21:0:feed:edef:beef): icmp_seq=1 ttl=52 time=23.3 ms
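
The "network configuration came back correctly" item could also be checked on the host itself rather than by pinging from outside. The sketch below parses the JSON printed by `ip -j addr show` and reports which expected addresses are missing. The IPv6 address is the one from the ping output above; the interface name and the IPv4 address in the sample are placeholders from the documentation range, and the function names are hypothetical.

```python
import ipaddress
import json
from typing import Iterable, List, Set


def configured_addresses(ip_json: str) -> Set[object]:
    """Collect every address listed under 'addr_info' in `ip -j addr` output."""
    addresses = set()
    for interface in json.loads(ip_json):
        for info in interface.get("addr_info", []):
            addresses.add(ipaddress.ip_address(info["local"]))
    return addresses


def missing_addresses(ip_json: str, expected: Iterable[str]) -> List[str]:
    """Return the expected addresses that are not configured on any interface."""
    have = configured_addresses(ip_json)
    wanted = [ipaddress.ip_address(address) for address in expected]
    return sorted(str(address) for address in wanted if address not in have)


# Hypothetical sample: interface name and IPv4 address are placeholders.
sample_ip_json = json.dumps([
    {
        "ifname": "eth0",
        "addr_info": [
            {"family": "inet", "local": "192.0.2.10", "prefixlen": 24},
            {"family": "inet6",
             "local": "2a01:4f8:242:5b21:0:feed:edef:beef",
             "prefixlen": 64},
        ],
    }
])

print(missing_addresses(sample_ip_json, ["2a01:4f8:242:5b21:0:feed:edef:beef"]))
```

Comparing parsed address objects instead of strings means differently abbreviated spellings of the same IPv6 address still match.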

tazjin at 2022-10-12T08·01+00

tazjin updated the body of this issue at 2022-10-12T08·02+00
tazjin closed this issue at 2022-10-12T08·02+00
Minor problems that occurred:
- irccat and dependent services started into a failed state, restarting them fixed it
- panettone also started into a failed state, and restarting it fixed it
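
Failures like these can be spotted right after boot by listing failed units. A rough sketch, assuming the services run as systemd units and that the text comes from `systemctl --failed --no-legend --plain`; the `.service` names and descriptions in the sample are guesses, and `failed_units` is a hypothetical helper.

```python
from typing import List


def failed_units(systemctl_output: str) -> List[str]:
    """Return unit names whose ACTIVE column reads 'failed'.

    Expected columns per line: UNIT LOAD ACTIVE SUB DESCRIPTION.
    """
    units = []
    for line in systemctl_output.splitlines():
        fields = line.split()
        # Without --plain, failed units are prefixed with a bullet marker.
        if fields and fields[0] == "●":
            fields = fields[1:]
        if len(fields) >= 4 and fields[2] == "failed":
            units.append(fields[0])
    return units


# Hypothetical sample output; unit names and descriptions are assumptions.
sample_output = (
    "irccat.service    loaded failed failed irccat\n"
    "panettone.service loaded failed failed panettone\n"
)

print(failed_units(sample_output))
```

The resulting names could then be passed to `systemctl restart`, which is what fixed things here.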
tazjin at 2022-10-12T08·10+00