OMG: whitby is behaving strangely under unexpected load

#310
Opened by tazjin at 2023-09-22T16·16+00

We are having trouble with whitby today. We're seeing several very high spikes in incoming traffic that line up with when the problems started, but the problems continued even after that incoming traffic stopped.

I'm kind of busy right now and haven't had time to analyse the logs to figure out what kind of traffic this is.

I think this might be related to nixery.dev being on whitby, as a huge number of requests are going there. I want to move it to a separate machine.

  1. whitby has been rebooted, as it had stopped responding to almost anything (it was still online, though, as we could see from the buildkite agents pinging it).

    I attached a KVM console because I wasn't sure whether the SSH key I have here (not an RSA key) would work. The console did turn out to be necessary, as I couldn't connect to the initrd SSHD at all; I didn't debug why.

    tazjin at 2023-09-22T16·17+00

  2. This hasn't recurred, and the logs gave no clear indication of what went wrong. Either way, as a preventative measure we'll start splitting some services out to separate machines. nixery.dev is already on its own VM now, so whichever companies decide they can rely on it in a production setup can overload that VM instead.

    tazjin at 2023-10-04T20·29+00

  3. tazjin closed this issue at 2023-10-04T20·29+00