Angelo Saraceno

Jan 31, 2024

Incident Report: January 31st, 2024

From 18:13 UTC to 23:46 UTC on January 31, 2024, we experienced an outage that prevented access to workloads hosted on Railway whenever a request required a new connection. When production outages occur, it is Railway’s policy to share the public details of what occurred.

Due to the previous DDoS event, we’re extending the Plugin migration timeline to February 23, 2024. For more information, please read the docs.

Summary

Over the past five days, Railway has experienced numerous concentrated Distributed Denial of Service (DDoS) attacks, two of which resulted in end user impact.

From Jan 28th to Jan 31st, we had many events, of which two resulted in customer impact:

1. L7 (HTTP) DDoS attack affecting the Dashboard [01/29/2024]

Result: Dashboard instability meant that users could not view logs, metrics, etc. Builds were still queued, and no service connections were dropped.

Peak RPS: ~250k

2. L4 (TCP) SYN flood attack [01/31/2024]

Result: Users within Europe were unable to dial Railway instances. Requests from all other regions were still served globally.

Peak RPS: ~12 million

In this post we will cover the “behind the scenes” mitigation and resolution of the L4 SYN flood attack.

This is the largest DDoS attack we have witnessed on the platform, and one of very few documented attacks on the internet exceeding 10 million RPS (Requests per Second). According to Cloudflare’s reports, only the top 1% of attacks reach this scale. At this time we believe that these targeted attacks utilized a Mirai botnet.

We apologize for the impact this has had on developer teams’ ability to administer workloads and for the significant business impact on your customers. We are extremely sorry.

The issue has been mitigated. The Railway team has implemented a number of mitigations to prevent similar attacks in the future.

We will go through the timeline, as well as mitigation steps, below.

Incident Details

Before 18:13 UTC, the Railway engineering team diagnosed and triaged multiple attempts to DDoS the Railway platform.

At 18:13 UTC on January 31, 2024, Railway Engineering began to investigate elevated EU edge proxy metrics from a synthetic alert. Railway Engineering was already engaged from prior attacks that we successfully mitigated with no end user impact. After 20 minutes, European Railway users started to raise support threads notifying the team that workloads were experiencing a percentage of delayed and/or dropped requests. The Railway team then confirmed that Railway-hosted applications were inaccessible for users connecting from the EU.

After Railway Engineering identified that the issue was a DDoS related to the prior hostile traffic, at 18:59 UTC we attempted to roll out an earlier mitigation that had resolved a previous DDoS affecting only the Railway API. We recovered those boxes temporarily, but unfortunately they were then overwhelmed by additional connections. We then determined that this DDoS attack had a different profile from the previous ones.

At 19:23 UTC, we confirmed that the DDoS attack was targeting the EU edge proxy hosts. The Railway team then spent time isolating the hostile traffic to make sure that mitigation strategies didn’t terminate legitimate traffic. This was complicated by the attack throughput exceeding 12 million RPS, which is extreme in terms of attack size.

At 20:21 UTC, we inserted additional services in between the serving layer and the router. While this was happening, additional team members worked with our large-scale customers to provide step-by-step guidance on how to proxy workloads using alternative regions.

At 21:32 UTC, the migration strategy was revised to provide gradual service restoration to all impacted users. Around this time, a select few end users noticed that workload connectivity was restored. We then began rollout of the strategy to the remaining 20 percent of global end-user traffic connected on the European proxies.

At 23:16 UTC, most users experienced service recovery.

At 23:46 UTC, the Railway team and our customers confirmed that service was restored for all users.

Response and Resolution

Railway’s internal alerting had the Railway engineering team engaged well before customer impact. Many hostile actors attack Railway, and many of those attacks are neutralized by automated systems or platform mitigations. However, with this attack, the attackers changed both the shape and the source of the attack. The increase from 250K RPS to 12M RPS overwhelmed the European shard of our global edge network.

The Railway Support and Success teams engaged with all of our customers from the very start to make sure that end customer impact was minimized as much as possible. However, we know that we owe our customers an explanation of why this incident lasted as long as it did.

Railway has many safeguards in place to prevent this exact scenario from happening or, in the worst case, from spreading to other tenants, shards, etc.:

We have connection and request-based rate limits on our network
We have shard and cell-based architecture, limiting blast radius
We have redundancy in services, allowing for transient failures
We have telemetry to notify us of issues with any of the above
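
To make the idea of a request-based rate limit concrete, here is a minimal token-bucket sketch. It illustrates one common technique only; it is not a description of Railway’s actual limiter, and the names `TokenBucket` and `PerClientLimiter`, along with the rate and burst values, are hypothetical.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last_seen = 0.0    # timestamp of the previous call

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, never exceeding the capacity.
        elapsed = max(0.0, now - self.last_seen)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client key (for example, a source IP)."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, client: str, now: float) -> bool:
        bucket = self.buckets.setdefault(
            client, TokenBucket(self.rate, self.capacity)
        )
        return bucket.allow(now)
```

A limiter like this only counts traffic it can attribute to a completed connection or request, which matters for the SYN flood discussion later in this post.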

Initially, when we were paged, we noticed that there were transient failures on the EU fleet. This fleet represents 3% of our hosts. While this is awful for our customers, we wanted to make sure it didn’t spread to 20%, 50%, or 100% of our fleet.

We started by having our automated systems page an engineer. They looked into the traffic and found that the vast majority of it was from “clean” IPs. That is, IP addresses which were not flagged in databases of known fraudulent addresses.

Our first thought was that someone had gone mega, MEGA viral, and had created backpressure in our system, overwhelming the edge. We jumped in to figure out who the downstream user was, but were met with a massive delta between “requests into our edge” and “requests out of our edge.”

From here, we analyzed our proxy and determined that there was a pile-up of initializing connections. The afflicted proxy had ~130k connections in a SYN state, which meant we had buffered over 100k connections waiting for the client’s response to the server’s SYN+ACK. We altered the kernel’s TCP settings (backlog size, ack max retry, tcp_syncookies) and rate limiting rules, making them more aggressive, which dropped initiating connections down to single digits.
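
As an illustration of how such a pile-up can be spotted on a Linux host, the sketch below counts sockets per TCP state from the text of `/proc/net/tcp`, where state code `03` means SYN_RECV (a SYN has been received and the server is waiting for the final ACK). The function name `count_tcp_states` is hypothetical, and the mapping of “backlog size” and “ack max retry” onto specific sysctl names is our assumption about which kernel settings the text refers to, not a record of what was changed.

```python
from collections import Counter

# State codes used in the `st` column of /proc/net/tcp (subset).
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "06": "TIME_WAIT",
    "0A": "LISTEN",
}

# Assumption: the settings named in the text correspond to these Linux sysctls.
ASSUMED_SYSCTLS = {
    "backlog size": "net.ipv4.tcp_max_syn_backlog",
    "ack max retry": "net.ipv4.tcp_synack_retries",
    "tcp_syncookies": "net.ipv4.tcp_syncookies",
}


def count_tcp_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state from the contents of /proc/net/tcp."""
    counts: Counter = Counter()
    # Skip the header line; each remaining line describes one socket.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


# Example use on a Linux host (IPv6 sockets live in /proc/net/tcp6):
#   with open("/proc/net/tcp") as f:
#       half_open = count_tcp_states(f.read())["SYN_RECV"]
```

A SYN_RECV count that dwarfs the ESTABLISHED count is the kind of signal described above.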

We dialed the hosts directly through our load balancer, got 200s, and waited for the cluster to recover. Upon seeing it not recover, we shifted our focus to the next most likely scenario:

A SYN Flood Attack.

A SYN flood attack is characterized by massive RPS, where the client intends to exhaust server resources by initiating the TCP connection handshake but never responding to the server’s SYN+ACK response. This leads to the server keeping the potential connection open for a client that never plans on using it.

Additionally, since the SYN+ACK packet isn’t required to be acknowledged by the client creating the connection, the IP doesn’t need to be valid (making it easy to spoof). Finally, this would explain the rate limit being ineffective — no ack means no connection/request.
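
The effect described above can be sketched with a toy model of a bounded backlog of half-open connections. This is a deliberately simplified simulation, not a model of the Linux TCP stack or of Railway’s edge: the function name `simulate_syn_backlog`, the tick-based timing, and all of the numbers are hypothetical. Attack SYNs never complete the handshake and hold a slot until they time out; legitimate clients ACK immediately but still need a free slot to get in.

```python
def simulate_syn_backlog(
    ticks: int,
    attack_syns_per_tick: int,
    legit_syns_per_tick: int,
    backlog_size: int,
    timeout_ticks: int,
) -> tuple[int, int]:
    """Return (legit connections established, legit connections dropped)."""
    half_open: list[int] = []  # expiry tick of each never-ACKed entry
    legit_ok = 0
    legit_dropped = 0
    for now in range(ticks):
        # Entries whose SYN+ACK retries have run out are evicted.
        half_open = [expiry for expiry in half_open if expiry > now]
        # Attack SYNs fill whatever room is left and then sit there.
        for _ in range(attack_syns_per_tick):
            if len(half_open) < backlog_size:
                half_open.append(now + timeout_ticks)
        # Legitimate clients complete the handshake at once, so they free
        # their slot immediately, but only if a slot was available.
        for _ in range(legit_syns_per_tick):
            if len(half_open) < backlog_size:
                legit_ok += 1
            else:
                legit_dropped += 1
    return legit_ok, legit_dropped


if __name__ == "__main__":
    print(simulate_syn_backlog(10, 0, 5, 100, 5))   # no attack
    print(simulate_syn_backlog(10, 50, 5, 100, 5))  # backlog saturated
```

In this model the attacker never establishes a single connection, so a limiter that counts established connections or requests sees nothing to limit, while legitimate clients are turned away as soon as the backlog fills.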

Unfortunately, SYN floods are quite difficult to both diagnose and mitigate. By this time we were about 60 minutes into the incident. Our attempts at implementing rate limits, even on initializing connections, led to us needing to move the attack further downstream by adding additional layers.

Railway’s edge network functions by terminating TLS and landing the traffic onto an encrypted WireGuard network. As we need to terminate TLS ourselves, we were unable to insert a service like Cloudflare into our customers’ stack.

This further complicated the mitigation strategy, and we spent additional time moving as quickly and as carefully as we could to cut traffic over to a service which we could insert into the critical pathway between our load balancer and our edge network (without terminating TLS).

We enabled the service and it blunted the traffic enough for us to rate limit rule our way back to “non-pinned boxes.”

However, traffic was still only trickling through the load balancer. After additional debugging, we had to recreate the load balancer, and we have since filed a ticket with our upstream provider.

Ultimately, this abated the flow of ~12 million RPS and ~650 Gbps of traffic, and returned the shard to a healthy state.

Preventative Measures

As mentioned above, we’ve inserted an additional service in the critical pathway and have begun designs on an additional service to further layer the network and keep abusive traffic away from user workloads. While we’d love to share more information here, sharing our current preventative measures would open us up to additional targeted attacks.

Security is an ongoing effort, and we are preparing to mitigate attacks faster and with an even smaller blast radius than what we’ve contained.

We understand that future incidents can’t take as long to recover from as this one did. We have set up additional failsafes such that, should an even larger DDoS occur, we are a 5-10 minute runbook away from full resolution.

Moving Forward

This incident is the result of a sustained and persistent attack on Railway’s networking infrastructure throughout the last week. During this time (and at all times) we have engineers on call, 24/7, ready to respond to alerts. Our transparency when something goes wrong is critical to earning your trust. Railway is a consistent target of attacks from intentional and benign actors, and through our vendors and learnings from past incidents over the last three years, we have avoided major end-user impact until today.

We know that trust is a bank that makes no loans, and we are doing everything possible to make sure that Railway is a platform that you can depend on. If there remain any doubts, we are working with our customers so they can substitute all or some parts of Railway to meet any service level that they need to meet their demands. The team is working day and night to further harden the platform so your engineering teams don’t have to think about events like this.

If you have any questions, you can reach us at [email protected], or join us in the discussion on our community forum.