OMG: all CI builds suspended due to Buildkite "minutes" quota

#186
Opened by tazjin at 2022-09-08T13·27+00

A while ago Buildkite switched their billing to a new system based on a "build minutes" quota. We didn't do anything when this happened because we are on the open-source free plan, and in fact we can't even see the Buildkite billing page.

Yesterday we merged a large chain of CLs (about 40 in a row), which caused a lot of builds to slow down for a long time (maybe their first steps ran but other steps were still waiting on "downstream" builders). This probably caused a giant explosion in accounted-for build minutes on the Buildkite end.

This seems to have exceeded some limit on the Buildkite side, and we are now unable to run any builds. Instead we receive this message:

You've used all of the job minutes included in your Free plan for this month. To continue running your builds, select a different Buildkite plan.

This is very impractical. I've reached out to Buildkite, and they said:

Thanks for emailing and letting us know about this, I can see the build minutes have increased significantly over the past 24 hours compared to the rest of the month. I'll take a look into this and see what the best course of action is.

However, I think most of them are located in Australia and it's late evening there, so I'm not sure we'll see any more progress on this today.

In the interim, as we want to make progress on the tvix-eval chains, we will likely use depot-interventions to skip CI. If this issue persists we'll have to find some other way of dealing with builds, but that will be hard because basically every other CI system lacks dynamic pipelining.

  1. Alright, Buildkite increased our limits:

    The job-minutes limit is per month. When we migrated our open-source plans to our new price model, we looked at the job minute usage from the last 6 months and set a limit above the month with the highest usage. It seems that this month you hit your limit of 40000 job-minutes (which is way higher than the previous months). I increased that limit to 100.000, so hopefully, you will not have that issue again.

    We have a huge backlog now, and getting the builds unstuck caused some of them to fail immediately under memory pressure. We should limit the concurrency of :llama: per machine somehow.

    tazjin at 2022-09-08T18·54+00

  2. This sorted itself out eventually after they increased the limit. There is not much we can do to prevent it, as the limit is invisible to us, but then again, one doesn't look a gift horse in the mouth.

    tazjin at 2022-09-11T17·13+00

  3. tazjin closed this issue at 2022-09-11T17·14+00