Angelo Saraceno

Incident Report: Dec 13th, 2023

We recently experienced an outage that prevented deployment of new workloads from 21:45 UTC to 23:24 UTC. When production outages occur, it is Railway’s policy to publicly share the details of what occurred.


Summary

During a routine upgrade of our GKE (Google Kubernetes Engine) cluster, which powers the dashboard, Google’s metadata server degraded severely under load, causing encrypt and decrypt requests to become significantly delayed.

Shortly following this rollout, the issue was compounded by a scheduled send instructing a significant number of users to upgrade their legacy databases to our new services-with-volumes architecture.

At 20:57 UTC, we notified approximately 30% of relevant users to upgrade to the new service architecture.

As a result, we saw a surge of upgrade activity from 21:45 UTC to 23:24 UTC, during which the already-degraded metadata service fell further behind under load.

While deploys went out, they didn’t have the most up-to-date environment variables.

The Railway engineering team rolled out a change to use a statically assigned encryption key instead of one dynamically assigned by the metadata server.
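
As a rough illustration of what a change like this can look like, here is a minimal sketch. It is not Railway’s actual implementation: the STATIC_VARIABLE_KEY environment variable, the AES-256-GCM scheme, and the variable format below are all assumptions. The point it shows is that keeping a data key in the service’s own configuration removes the metadata server from the encrypt/decrypt path entirely.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical: a 256-bit key provisioned statically (e.g. via a Kubernetes
// secret) rather than derived from credentials fetched at runtime.
const key = Buffer.from(process.env.STATIC_VARIABLE_KEY ?? "", "base64"); // 32 bytes

export function encryptVariable(plaintext: string): string {
  const iv = randomBytes(12); // standard 96-bit GCM nonce
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  // Store nonce, auth tag, and ciphertext together as one opaque blob.
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString("base64");
}

export function decryptVariable(encoded: string): string {
  const raw = Buffer.from(encoded, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```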

After applying mitigations to restore our cluster’s metadata server, we were able to process all pending variable encryption events with no data loss reported. By the end of the incident, all encrypted variables had been fully ingested and all services had returned to nominal operation.

Users didn’t experience any downtime with existing workloads. Only new deployments with modified variables were affected by this issue.

Incident Details

Railway’s Dashboard runs on a Kubernetes cluster alongside standard DevOps tooling.

As part of standard operations and aligned with Google’s end-of-life (EOL) schedule, we initiated the rollout of GKE v1.25. This update was necessary to prevent a forced, uncontrolled upgrade over the holiday period. However, we observed elevated errors in our monitoring tools at 08:13 UTC.

As errors persisted, Railway engineering undertook a deeper investigation into the GKE setup. After some digging, we found an issue on Google’s issue tracker that lined up with ours, filed and acknowledged by Google months ago. For GKE v1.25–1.27, the metadata server pods were updated to their latest version, which had a documented issue under high load. We confirmed we were affected by cross-checking that our metadata server version matched the affected versions Google listed.

While the platform team addressed the degraded service, at 20:57 UTC, Railway sent a routine upgrade reminder to users on a legacy version of Plugins. This inadvertently created a spike in load on the metadata service as those users began their migrations.

The customer communications team immediately halted any further migration notices after the outage was raised at 21:45 UTC.

During the outage period, severe delays in the metadata server’s responses affected Railway’s ability to encrypt and decrypt service variables. As a result, new deployments did not receive the most current secrets. Additionally, users were prevented from updating any environment variables, affecting their ability to perform routine maintenance on the platform.
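
To make that dependency concrete, here is a minimal sketch assuming the common GKE pattern of fetching a short-lived access token from the metadata server and using it to call Cloud KMS. Railway has not published its actual encryption path, and fetchAccessToken, encryptVariable, and the keyName parameter are hypothetical. The takeaway is that every encrypt or decrypt call is gated on a metadata server round trip, so when that server slows down, the entire secrets path slows down with it.

```typescript
// GKE metadata server endpoint for the workload's default service account token.
const METADATA_TOKEN_URL =
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token";

async function fetchAccessToken(timeoutMs = 2000): Promise<string> {
  // The metadata server requires this header on every request.
  const res = await fetch(METADATA_TOKEN_URL, {
    headers: { "Metadata-Flavor": "Google" },
    signal: AbortSignal.timeout(timeoutMs), // fail fast instead of hanging on a degraded server
  });
  if (!res.ok) throw new Error(`metadata server returned ${res.status}`);
  const body = (await res.json()) as { access_token: string };
  return body.access_token;
}

// Hypothetical helper: keyName is a full KMS key resource name, e.g.
// projects/<p>/locations/<l>/keyRings/<r>/cryptoKeys/<k>.
async function encryptVariable(keyName: string, plaintext: string): Promise<string> {
  const token = await fetchAccessToken(); // delayed whenever the metadata server degrades
  const res = await fetch(`https://cloudkms.googleapis.com/v1/${keyName}:encrypt`, {
    method: "POST",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify({ plaintext: Buffer.from(plaintext).toString("base64") }),
  });
  if (!res.ok) throw new Error(`KMS encrypt failed: ${res.status}`);
  const { ciphertext } = (await res.json()) as { ciphertext: string };
  return ciphertext;
}
```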

After several short-term mitigation attempts, such as disabling logging to reduce pressure on the affected metadata service, Railway engineering determined that rolling out a newer version of GKE unaffected by the known issue would bring resolution. We began this process at 22:59 UTC while communicating with all affected customers on Discord, email, and Slack.

Response and Resolution

This upgrade stabilized the system. At 23:10 UTC, Railway engineering started seeing variable decrypt events succeed. The team then restored the logging service at 23:20 UTC.

By 23:24 UTC, we had restored all services to working order and received confirmation from customers that new deployments were coming online successfully.

Preventative Measures

This incident resulted in the following takeaways:

  1. Unify Coordination: Unify our internal channels for incident response. Relevant teams weren’t notified promptly, so we’ve added an automated bot for declaring incidents internally, giving us one centralized place for everything.
  2. Message Smarter: In retrospect, we sent the migration messaging too quickly, without regard to the underlying platform changes in flight. Although it wouldn’t have prevented the initial issue, rolling out migration notices more gradually would have prevented the cascade that escalated the severity.
  3. Continue Re-shoring: Continue our move off Google Cloud. Railway’s user workloads are already powered not by GKE but by an orchestrator built internally. We will accelerate our move off GKE in favor of internal services.

Incoming Mitigations

  1. Improved Incident Management: During this incident, customers received delayed information about the state of the platform. We apologize for this oversight and are improving how we communicate during incidents.
  2. Latest and Greatest: Now that all systems are on current versions, we expect a greater level of stability from Kubernetes moving forward.

Moving Forward

This incident underscored the importance of careful planning and monitoring during significant infrastructure changes. While we regret the inconvenience caused to our customers, we are committed to strengthening our systems and processes. We appreciate the understanding and support of our community as we continue to improve our services.

We are in direct communication with all customers impacted by this outage. Again, we apologize for this outage’s impact on our customers and their users.