Incident Report: May 4th, 2024

We recently experienced an outage on our platform that affected 1% of our compute nodes and less than 1% of our workloads.

During this outage, our Asia-Southeast cluster became periodically unreachable. Because the outage affected more than 1% of workloads in that region, it is Railway’s policy to share the public details of what occurred.

At approximately 03:39 UTC, Railway’s On-Call Engineer received alerts from internal monitoring that compute capacity in our Asia-Southeast region had been exhausted. This was followed shortly after by a number of low-disk capacity alarms.

After investigation, it was discovered that all instances in the region were unable to connect to the Dataplane node instance manager. At 04:13 UTC we declared a partial outage for Asia-Southeast. It was quickly determined that all instances were experiencing unexpectedly high memory usage (above 95%) and were relying on swap to cover burst capacity requirements.
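
For illustration only, the node-level condition described above can be surfaced with a simple check against Linux’s /proc/meminfo. This is a minimal sketch, not Railway’s monitoring stack; the 95% threshold mirrors the figure above and is otherwise an arbitrary choice.

```python
# Minimal sketch (not Railway's monitoring stack) of the node-level check that
# corresponds to the observation above: memory utilization above 95% with swap
# in use. Reads Linux's /proc/meminfo, which reports values in KiB.

def read_meminfo(path: str = "/proc/meminfo") -> dict:
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # first token is the value in kB
    return info

def memory_pressure(info: dict) -> tuple[float, int]:
    """Return (fraction of memory in use, KiB of swap in use)."""
    used_fraction = (info["MemTotal"] - info["MemAvailable"]) / info["MemTotal"]
    swap_used_kib = info["SwapTotal"] - info["SwapFree"]
    return used_fraction, swap_used_kib

if __name__ == "__main__":
    used, swap = memory_pressure(read_meminfo())
    if used > 0.95 or swap > 0:  # 95% mirrors the figure in this report
        print(f"memory pressure high: {used:.1%} used, {swap} KiB swap in use")
```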

As a result of the lack of capacity, deployments within the region were temporarily queued.

Initially, after rebooting the affected services, 50% of our capacity came back online. The other 50%, however, not only failed to recover but subsequently became fully unresponsive, necessitating a full restart of those instances to regain access to them.

As a result, at 04:30 UTC the On-Call Engineer made the decision to mitigate the lack of capacity by deploying a new compute node within the region and manually culling certain high-memory workloads from the existing compute nodes to free up capacity. We communicated these mitigations at 05:06 UTC.

By 05:58 UTC the approach of manually culling workloads had yielded further partial recovery; however, one compute node was still fully unresponsive, and any workloads on that node remained unreachable as a result.

We communicated that services were starting to recover at 06:05 UTC. At 06:07 UTC, recovery efforts shifted to understanding the underlying issue affecting the remaining unresponsive compute node.

It was quickly discovered that the instance was irrevocably soft-locked and refusing any further communication, which made further debugging efforts particularly difficult. Any attempt to replicate the issue would put the box into an inaccessible state for 5-10 minutes, and it would then take an additional 30 minutes after restarting the Dataplane instance manager for the compute node to repopulate its image graph and start serving workloads as usual.

With half of compute capacity in the region now back online, at 08:10 UTC we communicated that we had achieved a partial recovery of the region and were actively working on recovering the remaining 50%.

We continued to investigate the remaining unresponsive compute node. After disabling process restore functionality, we identified a newly provisioned workload from a self-serve customer that was exceeding the capacity of the region. As soon as this user workload was restored, the instance would see memory utilization spike back above 95%.

At 10:08 UTC we identified the specific user workload in question. This workload had appropriate resource limits applied both within our database and within the Dataplane node instance manager; however, statistics from the instance showed it spiking to consume all available CPU and memory capacity on the compute node whenever the workload was redeployed.

At 10:12 UTC we applied considerably more aggressive resource limits to the specific user workload and attempted a phased restart of the compute node’s user workloads, which succeeded in restoring the affected compute node.
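
This report does not describe the exact mechanism used to apply those limits, but as a rough sketch under the assumption that each workload runs in its own cgroup v2 hierarchy, tightening a workload’s limits at the host level looks something like the following. The cgroup path and limit values are hypothetical.

```python
# Hypothetical sketch of tightening a workload's resource limits via cgroup v2.
# The cgroup path and limit values are placeholders; this is not Railway's
# actual control plane, just the host-level effect such limits have.
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/workloads/affected-service")  # hypothetical path

def tighten_limits(memory_max_bytes: int, cpu_quota_us: int, cpu_period_us: int = 100_000) -> None:
    # memory.max is a hard cap: if the cgroup cannot reclaim below it, the
    # kernel OOM-kills inside the cgroup instead of letting it grow further.
    (CGROUP / "memory.max").write_text(str(memory_max_bytes))
    # memory.high applies reclaim pressure before the hard cap is reached.
    (CGROUP / "memory.high").write_text(str(int(memory_max_bytes * 0.9)))
    # cpu.max takes "<quota> <period>" in microseconds per scheduling period.
    (CGROUP / "cpu.max").write_text(f"{cpu_quota_us} {cpu_period_us}")

# Example: cap the workload at 2 GiB of memory and 2 CPUs' worth of time.
tighten_limits(memory_max_bytes=2 * 1024**3, cpu_quota_us=200_000)
```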

We also discovered that the workload had locked up our image builder nodes in the region, meaning deployments within the region were again temporarily queued.

By 10:25 UTC we had cycled all builder nodes in the region to new instances, and deployments began to resume for all compute nodes. We communicated that a resolution was in progress at 10:27 UTC.

At 10:32 UTC it was identified that, due to an earlier misconfiguration, the additional compute node introduced at 05:58 UTC was not correctly serving certain public-facing user workloads that had been deployed from 10:25 UTC onwards. A fix for this was deployed and rolled out at 10:32 UTC. We communicated that the incident was resolved at 10:57 UTC.

Immediately after the incident was resolved, we began to investigate its root cause. After analyzing logs and metrics from around the incident, inspecting the image of the affected user workload, and performing extensive code and systems analysis, we believe the following series of events caused the outage:

  • At 03:30 UTC a deployment was triggered for the affected user workload to deploy multiple replicas to the Asia-Southeast region. These replicas were deployed across all available compute nodes in the region. This would not usually trigger any alarms in our other regions, since they are 75x bigger than Asia-Southeast, which is currently our smallest region.
  • At 03:36 UTC, the deployment began to complete and the deployed containers began to come online. Each of these containers proceeded to use the full RAM capacity allocated to it and, because the region is considerably smaller than our others, together they consumed a large share of the region’s available RAM. At this point, on all affected compute nodes, kswapd began to use large amounts of CPU under the high memory pressure in order to compact sparse huge pages.
  • Because the compute nodes this workload was deployed to were already experiencing medium-high memory pressure, this additional spike in usage caused the kernel OOM reaper to kill the daemon process, making all compute nodes in the region unhealthy. With no compute nodes in the region available to accept deployments, a page was triggered to our On-Call Engineer at 03:39 UTC.
  • At 05:54 UTC a deployment was again triggered for the affected user workload. During this deployment cycle, only one of the compute nodes in the region was below the capacity threshold that we require for scheduling new deployments, so all replicas were deployed to that single compute node.
  • By 06:05 UTC the affected user workload had been fully deployed. The workload proceeded to use its full per-container capacity allocation on a single instance, and as a result consumed considerably more memory than we would usually expect a single service to place on one compute node. Since the compute node it was deployed to was already experiencing high memory pressure, this additional spike in I/O usage caused the hypervisor to lock up, preventing any further access to the instance until after a hard restart.
  • Due to the specific timing of the deployment cycle, no individual container exceeded its resource limit, but the sum of the container resource limits we had allocated exceeded the physical RAM of the compute node. This meant that when memory usage spiked, even with system-level memory pressure reaching 99%, no per-container limits were breached and therefore no processes were evicted (a worked example follows this list).
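
To make the overcommit condition concrete, here is a small worked example in Python with hypothetical numbers (the actual node size and replica limits are not disclosed in this report): every container stays within its own limit, yet the sum of the limits exceeds the node’s physical RAM, so node-level memory is exhausted without any per-container limit being breached.

```python
# Worked example with hypothetical numbers: per-container limits are respected,
# but the limits are overcommitted relative to the node's physical RAM, so the
# node runs out of memory without any single container breaching its limit.

NODE_RAM_GIB = 64  # physical RAM on the compute node (hypothetical)

# Ten replicas, each limited to 8 GiB and using 7.5 GiB (within its limit).
replicas = [{"limit_gib": 8.0, "usage_gib": 7.5} for _ in range(10)]

total_limits = sum(r["limit_gib"] for r in replicas)  # 80 GiB of limits on a 64 GiB node
total_usage = sum(r["usage_gib"] for r in replicas)   # 75 GiB actually in use

any_container_breached = any(r["usage_gib"] > r["limit_gib"] for r in replicas)  # False
node_exhausted = total_usage > NODE_RAM_GIB                                      # True

print(f"sum of limits: {total_limits} GiB vs physical RAM: {NODE_RAM_GIB} GiB (overcommitted)")
print(f"per-container limit breached: {any_container_breached}; node memory exhausted: {node_exhausted}")
```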

We issued 6 status updates during this incident via status.railway.app while additionally responding to specifically impacted customers across various touchpoints (email, Slack, Discord, Twitter, and forum posts) throughout.

As several customers indicated in a public thread, this communication cadence was not sufficient. More attention needs to be paid to providing updates in a centralized location on a reliable basis.

Moving forward, we intend to push all public communications through status.railway.app to provide a single source of truth for up-to-date and reliable updates. Additionally, we will put in place a policy, backed by an automated ChatOps prompt, to update affected customers at least every 30 minutes.

This incident resulted in the following takeaways:

  • We are adjusting the configuration of the kernel OOM reaper on our instances to prioritize evicting user workloads before system-critical processes such as the Dataplane instance manager (see the sketch after this list).
  • We are planning to implement mitigations to ensure that, in the event overall system memory pressure reaches a threshold (regardless of whether per-container thresholds are breached), we can safely evict user workloads or migrate them to other, lower-pressure compute nodes.
  • We are investigating adjusting our deployment algorithms to utilize historical metrics, allowing us to account for historical and recent resource usage when re-scheduling.
  • We are increasing the compute node capacity in Asia-Southeast to allow for significantly more burst capacity to decrease the likelihood of all nodes being saturated, even in the event of many large workloads being deployed to the region.
  • We are tuning our alerting and scheduling to notify us of capacity issues sooner for regions with less compute footprint.
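
As an illustration of the first takeaway, on Linux the OOM killer’s victim selection can be biased per process via /proc/<pid>/oom_score_adj. The sketch below is hypothetical (the PIDs and score values are placeholders, and Railway’s actual tooling may differ): a strongly negative score protects the Dataplane instance manager, while a positive score makes user workload processes the preferred victims.

```python
# Hypothetical sketch: bias the Linux OOM killer so user workloads are killed
# before system-critical daemons. Scores range from -1000 (never kill) to 1000
# (kill first); the PIDs below are placeholders.

def set_oom_score_adj(pid: int, score: int) -> None:
    """Write a score into /proc/<pid>/oom_score_adj."""
    if not -1000 <= score <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(score))

# Protect the Dataplane instance manager (PID lookup elided for brevity)...
set_oom_score_adj(pid=1234, score=-1000)
# ...and make a user workload's process the preferred OOM victim.
set_oom_score_adj(pid=5678, score=500)
```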

The Railway team is always looking to improve our processes alongside our customers and community partners. We are in active contact with the customers who faced business impact from this outage.