Angelo Saraceno

Incident Report: Dec 1st, 2023

We recently experienced an outage on our platform that affected 30% of our US-West compute infrastructure and left workloads unreachable. When production outages occur, it is Railway’s policy to publicly share the details of what happened.

Summary

Between 16:40 UTC and 20:53 UTC, Railway’s engineers received alerts from internal monitoring that users were unable to connect to their workloads. The cause was high resource usage that soft-locked all CPU cores on the affected hosts, which led to rolling restarts of hosts as we attempted recovery on machines hosting customer workloads.

Railway engineers began intervening immediately after being notified. Because the symptoms resembled prior incidents the Railway team has had with our underlying compute platform, we believed the issue was related to our vendor infrastructure, though potentially triggered by a code change we had deployed.

The outage affected 30% of the total machine fleet within US-West. Some customer workloads were offline for periods ranging from 10 minutes to 1 hour as we worked to mitigate the issue.

The Railway infrastructure engineering team was able to attribute the root cause of this incident to a newly deployed metrics collection agent that appeared to trigger a CPU core soft lock on these hosts, despite the hosts being well below resource thresholds. We believe the underlying issue potentially lies within the Google Cloud hypervisor. We’ve documented our findings in this blog post.

After diagnosing and remediating the issue, the Infrastructure team was able to restore all workloads with no data loss reported.

Incident Details

Railway’s engineering team is migrating to a new metrics collection service that will deliver significant benefits to Railway’s customers.

Earlier on Dec 1st, at 16:02 UTC, the Railway engineering team began rolling out the new metrics collection service on Railway infrastructure. The rollout had previously been validated on a subset of Railway’s infrastructure and had not triggered any warnings at that point.

However, during the rollout, our deployment scripts began reporting that certain hosts were unavailable and the update could not be applied. After we noticed the first couple of hosts coming offline, we went back to check the deployment logs. Since our first priority was to bring workloads back, by the time we identified the issue in the logs, the incident was already well under way.

As the metrics collector started up, a configuration error caused it to analyze processes beyond its original mandate, essentially sampling every process on a host at high frequency. This added monitoring resulted in large transfers of memory between kernelspace and userspace on our hosts and, even though overall resource pressure on the hosts was within nominal thresholds, triggered cascading CPU core soft locks that brought the hosts down. These symptoms are in line with a bug Google acknowledged here.
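
For illustration only (this is not Railway’s agent or its configuration), the Python sketch below approximates what “sampling every process on a host at high frequency” looks like at the /proc level: every read copies kernel-owned data into userspace, and doing that for every PID many times per second multiplies that copy traffic across the whole host.

```python
# Illustrative sketch only -- not Railway's metrics agent. It approximates an
# unscoped, high-frequency process sampler: each iteration reads kernel-owned
# data (stat, status, smaps_rollup) for every PID on the host, and each read
# is a kernelspace-to-userspace copy.
import os
import time

SAMPLE_INTERVAL = 0.1  # hypothetical: 10 sweeps per second over *all* processes

def sample_all_processes():
    for pid in (p for p in os.listdir("/proc") if p.isdigit()):
        for name in ("stat", "status", "smaps_rollup"):
            try:
                # Each open/read crosses the kernel/userspace boundary;
                # smaps_rollup in particular makes the kernel walk the
                # process's memory mappings before returning.
                with open(f"/proc/{pid}/{name}") as f:
                    f.read()
            except (FileNotFoundError, PermissionError, ProcessLookupError):
                continue  # process exited or is inaccessible; skip it

# Main loop of a hypothetical agent that was never scoped down.
while True:
    sample_all_processes()
    time.sleep(SAMPLE_INTERVAL)
```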

The issue affected only a random subset of machines throughout the fleet; the metrics collector was operating perfectly on the other 70% of hosts, so we initially did not suspect it as the cause. Furthermore, the failure was not immediate: the impacted machines failed unpredictably in bursts, with no discernible pattern in the metrics, workload behavior, or log data we were able to pull from the unaffected machines.

The misconfigured agent soft-locked every CPU core on the affected machines, which made workloads (including the daemons through which we manage each machine) unschedulable on the CPU.

At 17:05 UTC, Railway’s engineering team began manual restarts to recover affected hosts. Because each restart had to be performed by hand, after first confirming via the serial console logs that the host’s CPUs were soft-locked, the process took between 10 and 30 minutes per host. We recovered most hosts before 18:30 UTC, but several others failed during this window and took longer to recover. Railway’s engineering team also engaged additional members of the Customer Success and Support teams to communicate the impact to affected customers across the Community Forum, email, Twitter, and Discord.
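
As an aside on that serial-console check: when a core soft-locks, the kernel watchdog emits lines of the form “watchdog: BUG: soft lockup - CPU#N stuck for Ns!”. The sketch below is not our actual tooling, just a minimal example of how captured serial output could be scanned for those messages to decide whether a host needs a manual restart.

```python
# Illustrative triage helper, not Railway's tooling: given serial console
# output captured from a host (e.g. saved to a file), flag the kernel
# watchdog's soft-lockup messages so an operator knows the host needs a
# manual restart rather than a softer intervention.
import re
import sys

# Standard form of the kernel watchdog message, e.g.:
#   watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [some-agent:4242]
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s!")

def soft_locked_cpus(serial_log_path):
    """Return the set of CPU ids that reported a soft lockup in the log."""
    cpus = set()
    with open(serial_log_path, errors="replace") as log:
        for line in log:
            match = SOFT_LOCKUP.search(line)
            if match:
                cpus.add(int(match.group(1)))
    return cpus

if __name__ == "__main__":
    locked = soft_locked_cpus(sys.argv[1])
    if locked:
        print(f"soft lockups on CPUs {sorted(locked)} -- restart required")
    else:
        print("no soft lockups found in serial output")
```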

Response and Resolution

Around 18:30 UTC, customers started reporting that the active resolution measures were successful. The Railway engineering team then catalogued all failed machines and failed over additional hosts that we suspected were impacted. While this work was ongoing, we continued monitoring to ensure that the machines with customer workloads were all coming back online as expected.

By 20:23 UTC, all hosts were back online and customers had confirmed that their workloads were restored.

We declared the incident to be over at 20:53 UTC.

Preventative Measures

This incident resulted in the following takeaways:

  1. Improve internal deployment tooling: We need to shorten feedback time on deployments, helping us identify issues faster.
  2. Build more resilience: We need to build better mechanisms to detect and automatically recover from OS/hardware/hypervisor level failures by isolating the blast radius of changes.
  3. Remove the black box: We’re already building our cloud-independent distributed storage system and making forays into bare metal. We can only deliver the level of service we intend to by controlling, and having visibility into, the entire stack, from silicon to hypervisor.

Incoming Mitigations

  1. Better System Checks: We're improving how we check the health and status of our systems to catch issues faster.
  2. Mandate Staged Rollouts for all Services: For riskier changes, we already stage rollouts. If this incident has taught us anything, it’s that there’s risk in even the most mundane of changes at our scale. Moving forward, for all services, we’ll start with 10% of our systems, then move to 25%, 50%, and finally 100%, carefully monitoring at each stage (a rough sketch of this policy follows below).
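
To make the staged-rollout policy above concrete, here is a minimal sketch in Python. The deploy and health-check hooks are placeholders for our real tooling, and the soak window is an assumed value, not a committed number.

```python
# Minimal sketch of a 10% -> 25% -> 50% -> 100% staged rollout with a
# monitoring (soak) period between stages. deploy_to and fleet_is_healthy
# are placeholders, not Railway's actual deployment or health-check code.
import time

STAGES = (0.10, 0.25, 0.50, 1.00)   # fraction of the fleet per stage
SOAK_SECONDS = 15 * 60              # hypothetical monitoring window per stage

def staged_rollout(hosts, deploy_to, fleet_is_healthy):
    deployed = 0
    for fraction in STAGES:
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[deployed:target]:
            deploy_to(host)
        deployed = target
        # Soak: watch the updated portion of the fleet before widening.
        time.sleep(SOAK_SECONDS)
        if not fleet_is_healthy(hosts[:deployed]):
            raise RuntimeError(
                f"rollout halted at {int(fraction * 100)}% -- fleet unhealthy"
            )
    return deployed
```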

Moving Forward

The Railway team is looking to continually improve our processes with our customers and our community partners. Although we are glad that our internal systems caught the issue before our customers did, we truly regret the impact. We are in active contact with the customers who faced business impact from this outage.