OMG: whitby is being rebooted
This is a tracking issue for scheduled maintenance of whitby. It has been up for quite a long time:
tazjin@whitby ~> uptime
07:05:30 up 831 days 12:59, 1 user, load average: 0.17, 0.31, 0.32
Rebooting whitby is tricky: the intended process relies on an initrd SSH server through which we enter the disk encryption password, but we have essentially never exercised it.
Before actually rebooting, the following checklist will be run:
- Ensure initrd SSH keys are up to date
- Ensure whitby is canonical
- Ensure sanduny is canonical
- Ensure that restic backups completed successfully
- Ensure that a copy of the Gerrit state is available on sanduny (though not being served)
- Await KVM console attachment notification from Hetzner
- Check that disk encryption password is actually up-to-date
  Verified with zfs load-key -n zroot and the password we have shared.
- Check recent bootloader entries [1]
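A checklist like this could be scripted so that each step reports a clear pass or fail before anyone commits to the reboot. This is only a sketch: `run_check` is a hypothetical helper, and the only real command shown is the `zfs load-key -n zroot` dry run from the checklist. The other steps would need their own commands.

```shell
# Hypothetical helper: run one checklist step and report the result.
# Usage: run_check "description" command [args...]
run_check() {
  description=$1
  shift
  if "$@"; then
    printf 'ok    %s\n' "$description"
  else
    printf 'FAIL  %s\n' "$description"
    return 1
  fi
}

# Example (left commented out, since it prompts for the passphrase):
# -n makes zfs check the key without actually loading it.
# run_check "disk encryption password" zfs load-key -n zroot
```

Returning non-zero on failure means a wrapper script can stop at the first failed step instead of continuing towards the reboot.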
We will reboot with the KVM console attached and monitor the reboot.
After rebooting, we will run this checklist:
- Ensure that all public-facing services are up
- Ensure that network configuration came back correctly
[1] During whitby's uptime, nixpkgs has had numerous bugs that broke the writing of bootloader entries.
- tazjin updated the body of this issue at 2022-10-12T07·14+00
- tazjin updated the body of this issue at 2022-10-12T07·15+00
- tazjin updated the body of this issue at 2022-10-12T07·27+00
- tazjin updated the body of this issue at 2022-10-12T07·33+00
- tazjin updated the body of this issue at 2022-10-12T07·37+00
Current whitby system generation is 393 after canonicalising at latest HEAD. This matches the latest entry in grub.cfg, making me think the bootloader is up-to-date.
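One way such a comparison could be scripted, not necessarily how it was done here, is sketched below. It assumes the system profile link is named like `system-393-link` and that grub.cfg menu entry titles contain `Configuration N`, as NixOS-generated GRUB configs usually do. Both function names and the `/boot/grub/grub.cfg` path in the usage comment are assumptions.

```shell
# Extract the generation number from a profile link name such as
# "system-393-link" (the usual target of /nix/var/nix/profiles/system).
generation_from_link() {
  basename "$1" | sed -n 's/^system-\([0-9][0-9]*\)-link$/\1/p'
}

# Read a grub.cfg on stdin and print the highest "Configuration N"
# number found in it. Assumes NixOS-style menu entry titles.
latest_grub_generation() {
  sed -n 's/.*Configuration \([0-9][0-9]*\).*/\1/p' | sort -n | tail -n 1
}

# Usage (paths are assumptions, adjust to the actual boot setup):
# current=$(generation_from_link "$(readlink /nix/var/nix/profiles/system)")
# latest=$(latest_grub_generation < /boot/grub/grub.cfg)
# [ "$current" = "$latest" ] && echo "bootloader entries look up to date"
```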
tazjin at 2022-10-12T07·39+00
- tazjin updated the body of this issue at 2022-10-12T07·40+00
- tazjin updated the body of this issue at 2022-10-12T07·41+00
First problem: I can't get the HTML5-based KVM console to work anymore. It just shows me some green blobs, although the little thumbnail screenshot looks correct.
Hetzner's fallback thing is a JavaWS application (of course), so I'm trying to figure out how to run that right now.
tazjin at 2022-10-12T07·48+00
Current status:
That's good enough for me, we're going in.
tazjin at 2022-10-12T07·54+00
Unlocking the disk over SSH worked perfectly fine.
tazjin at 2022-10-12T08·01+00
IPv4 works (I connected over it), seems like v6 also came back up normally:
tazjin@sanduny ~> ping -6 whitby.tvl.su
PING whitby.tvl.su(whitby.tvl.fyi (2a01:4f8:242:5b21:0:feed:edef:beef)) 56 data bytes
64 bytes from whitby.tvl.fyi (2a01:4f8:242:5b21:0:feed:edef:beef): icmp_seq=1 ttl=52 time=23.3 ms
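For future reboots, the same reachability check could cover both address families in one go. A minimal sketch, assuming an iputils-style `ping` that accepts `-4`, `-6`, `-c` and `-W`. `check_dual_stack` and the `PING` override are hypothetical names introduced here.

```shell
# Hypothetical helper: probe a host once over IPv4 and once over IPv6.
# PING can be overridden (e.g. for a dry run); defaults to the system ping.
check_dual_stack() {
  host=$1
  status=0
  for family in 4 6; do
    # -c 1: send a single probe; -W 5: wait at most 5 seconds for a reply.
    if "${PING:-ping}" "-$family" -c 1 -W 5 "$host" >/dev/null 2>&1; then
      echo "IPv$family ok"
    else
      echo "IPv$family FAILED"
      status=1
    fi
  done
  return "$status"
}

# Usage:
# check_dual_stack whitby.tvl.su
```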
tazjin at 2022-10-12T08·01+00
- tazjin updated the body of this issue at 2022-10-12T08·02+00
- tazjin closed this issue at 2022-10-12T08·02+00
Minor problems that occurred:
- irccat and dependent services started in a failed state; restarting them fixed it
- panettone also started in a failed state, and restarting it fixed it
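Failures like these can be spotted right after boot by listing failed systemd units. A small sketch, assuming the services run as systemd units. `failed_unit_names` is a hypothetical helper, and the unit name in the restart example is a guess, not taken from the actual configuration.

```shell
# Read the output of
#   systemctl list-units --state=failed --plain --no-legend
# on stdin and print only the unit names (first column).
failed_unit_names() {
  awk 'NF { print $1 }'
}

# Usage:
# systemctl list-units --state=failed --plain --no-legend | failed_unit_names
# systemctl restart irccat.service   # unit name is an assumption
```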
tazjin at 2022-10-12T08·10+00