Angelo Saraceno

Scaling Railway: Serving 250k Developers with One Support Engineer

In case you’re new here, Railway is an infrastructure company that provides instant application deployments to developers.

We’re doing a series on Scaling Railway that covers various parts of the company, from Platform to Infrastructure, as well as Support, Product Updates, and everything else. In this blog post, we’re taking a look at some of the ways the Support team has automated itself.

One of the problems that we’re facing as we scale is how to maximize our limited bandwidth. We’re a team of 14 and we’re competing with AWS, GCP, and Azure — infrastructure offerings from some of the biggest companies in the history of the world. It takes more than just a solid product with a growing user base to get there — it takes leverage.

This is the story of how, in the time I’ve been at Railway, we’ve increased our users by 25x while increasing our Support Engineers by just 2x.

Railway user growth and support timeline

Let’s get started.

When I joined Railway as the company’s first Support Engineer in July 2021, Railway had nearly 10,000 users and no real Support channel at all. We had 6 employees and Railway was a much simpler product. (You can read about that time in detail here.) Today we have around 250K users.

At the start of my tenure, our support load averaged around two serious cases a day and Jake and I could reasonably respond to most customer requests within the hour. And we could often ship a feature request within a day or two. In retrospect, it was a really nice time.

But by late 2021, I was feeling a constant sense of dread.

Our Support channels (Twitter, Email, and Discord) were handling around 40 serious cases a day. Every day, I would wake up, answer ticket, answer ticket, get lunch, answer ticket, answer ticket, and then log off.

By the time I went to sleep, we’d have more tickets than when we started the day.

We knew that philosophically we wanted to be easy to reach, but we needed to balance that with automation. Since we operate in an industry filled with notoriously terrible support, the way we do business is a big differentiator for us. (A blog post for later.)

We went looking for an approach that would allow us to scale without sacrificing high-quality support engineering for users who need it.

Our first goal was to consolidate our support channels on the backend. We wanted to pull the data from each channel into a single queue on a ChatOps server. We initially chose Discord for this purpose, in part because it’s home to our internal chat and in part because we knew a lot of programmers like ourselves use Discord.

The initial solution was to use the Discord API’s Threads system. We managed conversations in threads, tagged them in a queue if they needed to be investigated further, and then linked to the conversations in Notion. (This UI is actually still in use today!)

Discord queue system

After poring over the Discord.JS docs and the new (at the time) Notion API, we got a working Support bot going.

The bot’s name is Percy and they handle a large number of tasks for us, including:

  • Managing the ticketing queue
  • Pointing users to existing threads with solutions
  • Creating and resolving incidents
  • Letting users self-serve Priority Boarding (our beta program)

Snippet of a Support Document written internally in 2021

Thus began a long coding journey for Support. It’s funny looking at this document 18 months after the fact, but when we wrote down all of our problems with the status quo, we were really writing down our roadmap as a Support org.

Percy gave us the primitives to implement a number of helpful automations and gave us a foothold for tackling customer support issues at scale.

When we started our Teams offering in earnest in June 2021, we were already talking with early customers and design partners in our Discord.

We offer chat support for team users, enabling real-time conversation directly in Discord. We often have the engineer who built a feature jump into the feedback conversation about it, which helps us tighten the feedback loop and iterate faster.

At the start, I would make the channels and permissions myself. My fingers got tired of clicking, so we built a way for the API to connect to Discord to spin up a channel and invite team members. As long as we don’t hit Discord rate limits, we’re good.
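As a rough illustration of what "spin up a channel and invite team members" can look like, here is a minimal sketch that builds the JSON body for Discord's Create Guild Channel endpoint (`POST /guilds/{guild.id}/channels`). It is not Railway's actual code: the helper name `buildTeamChannel`, the `team-` naming scheme, and the choice of permissions are assumptions made for the example. The permission bits and overwrite types come from Discord's documentation.

```typescript
// Sketch only: builds a request body, nothing is sent to Discord here.

// Permission bit flags as documented by Discord.
const VIEW_CHANNEL = 1 << 10;
const SEND_MESSAGES = 1 << 11;

// Overwrite type 0 targets a role, type 1 targets a single member.
interface PermissionOverwrite {
  id: string;
  type: 0 | 1;
  allow: string; // bitset serialized as a string
  deny: string;
}

interface CreateChannelBody {
  name: string;
  type: 0; // 0 = guild text channel
  permission_overwrites: PermissionOverwrite[];
}

// Hypothetical helper: one private text channel per team.
function buildTeamChannel(
  guildId: string,
  teamSlug: string,
  memberIds: string[],
): CreateChannelBody {
  // The @everyone role shares its ID with the guild, so denying it
  // VIEW_CHANNEL hides the channel from the rest of the server.
  const hideFromEveryone: PermissionOverwrite = {
    id: guildId,
    type: 0,
    allow: "0",
    deny: String(VIEW_CHANNEL),
  };
  // Each team member is explicitly allowed to read and write.
  const members = memberIds.map((id): PermissionOverwrite => ({
    id,
    type: 1,
    allow: String(VIEW_CHANNEL | SEND_MESSAGES),
    deny: "0",
  }));
  return {
    name: `team-${teamSlug}`,
    type: 0,
    permission_overwrites: [hideFromEveryone, ...members],
  };
}
```

A caller would send this body with a bot token and, because Discord answers HTTP 429 when a rate limit is hit, wait before retrying rather than firing requests in a tight loop.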

Direct chat support is available to teams

We also actively seek feedback from hobby users who rely on Railway for everything from tutorial applications to side projects. To proactively source high-quality feedback at scale, we built Priority Boarding to be as self-serve as possible. Users can opt in to the beta using Percy and sign up for reminder emails for feedback when they join the program.

To enter Priority Boarding (our beta program), users just need to type /beta in Discord

You can read about how we built this integration here.

In the meantime: as we crossed 25k developers, we added the Community Champions Program.

No support channel can be completely devoid of human-in-the-loop intervention. We’ve invested heavily in a Community Champion program to provide resources to community members to help others out. It’s hard to say enough good things about the Railway enthusiasts who help us out on Discord.

We try our best to empower Community Champions to solve problems within the Discord community, improve the capabilities of Percy, and improve automated things like CLI support.

Community Champions like Brody are able to help out a large number of users in Discord

When we got to 100k users, we needed to make another scaling change. At this point we added a third-party vendor (shoutout to Operand, who are themselves Railway users!) to the mix.

We use Operand to index our Discord forums and surface NLP-backed search results to users. As a result, Percy is able to give users answers gleaned from our own literature.

If you’re wondering how we trained the model, Discord shipped Discord Forums for community servers around October 2022. Forum posts on Discord allow you to tag the post with metadata. Whenever a high-quality thread ends with a correct answer, we mark it as “Solved.” Marking a thread as “Solved” tells Percy to index it, feeding it into Operand’s corpus.
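The tagging step can be sketched roughly as follows. This is an assumption-laden illustration, not Percy's real implementation: `ForumThread`, `IndexDocument`, `toIndexDocument`, and the `SOLVED_TAG_ID` placeholder are all made up for the example, and the actual call that hands the document to Operand is deliberately left out. The one grounded detail is that Discord exposes the tags on a forum post as a list of tag IDs (`applied_tags`).

```typescript
// Hypothetical, trimmed-down view of a forum post.
interface ForumThread {
  id: string;
  name: string;
  applied_tags: string[]; // IDs of the forum tags applied to the post
}

// Hypothetical shape of what gets handed to the search index.
interface IndexDocument {
  id: string;
  title: string;
  body: string;
}

// Placeholder: a real bot would use the ID of its own "Solved" tag.
const SOLVED_TAG_ID = "solved-tag-id";

// Returns a document to index when the thread carries the Solved tag,
// or null when the thread should be left out of the corpus.
function toIndexDocument(
  thread: ForumThread,
  messages: string[],
  solvedTagId: string,
): IndexDocument | null {
  if (thread.applied_tags.indexOf(solvedTagId) === -1) {
    return null;
  }
  return {
    id: thread.id,
    title: thread.name,
    body: messages.join("\n"),
  };
}
```

Gating on a human-applied tag keeps unanswered or wrong threads out of the corpus, which is the point of marking only high-quality threads as solved.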

When a different user asks a question that’s likely been answered, Percy checks Operand, and if the best match crosses a confidence threshold, Percy responds in the thread.
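The threshold check might look something like the sketch below. Everything here is hypothetical: the `SearchHit` shape, the idea that scores fall between 0 and 1, and the `pickAnswer` helper are assumptions for illustration and do not describe Operand's API or Percy's code.

```typescript
// Hypothetical search result; a higher score means a closer match.
interface SearchHit {
  threadUrl: string;
  snippet: string;
  score: number; // assumed to be in the range 0..1
}

// Returns the best hit if it clears the threshold, otherwise null so
// the bot stays quiet instead of posting a weak guess.
function pickAnswer(hits: SearchHit[], minScore: number): SearchHit | null {
  let best: SearchHit | null = null;
  for (const hit of hits) {
    if (best === null || hit.score > best.score) {
      best = hit;
    }
  }
  return best !== null && best.score >= minScore ? best : null;
}
```

Returning `null` below the threshold is the design choice that matters: a bot that answers only when it is reasonably sure is less likely to bury a thread in irrelevant links, and a human can still pick up anything it skips.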

And I didn’t write this feature. One of our Community Champions, Wyz, implemented it.

A recent experimental feature added to Percy provides NLP-backed answers to questions

The result still makes my jaw drop sometimes.

At 200k users and 10k+ Discord members, Discord became unmanageable for SLA reasons. We needed to add a dedicated support tool. We selected Plain, which is a new support backend that is API-first. Instead of using tickets as the first-order primitive, Plain gives us a customer timeline, which helps us manage our relationship with our customers.

We hooked Plain up to Discord using a home-rolled websocket bridge with the help of Plain’s engineering team. This provides a centralized location for support tickets so that anyone in the company can tune in to one or several ongoing user threads. You can send a message to Discord from Plain or move a conversation from Discord to be private over Plain via this system.

Account information is surfaced in the support system sidebar

The support system provides a single source of truth for the status of support and pulls in contextual account information from the database like account status, plan, invoice history, and so forth. This opened up a Pandora’s box of new possibilities for what we can build to better serve developers on our platform.

Once we hit 250k+ users, the giant wishlist of support features to integrate within our product became more than just a wishlist. It provided the engineering justification to add another Support Engineer. Welcome Adarsh “Nebula” Krishna!

The common thread in our contributions as we grow is that they solve business problems. Writing code creates a mechanical advantage everywhere. Our support tooling gives us a heat map of where the problems are and Nebula and I can then tackle the root cause.

For instance, when we saw we had 30+ emails from users complaining about the CLI, it gave us license to pay down some deep technical debt and rewrite the CLI.

I am convinced that Software Engineering is more of an art than an engineering practice. As such, it’s important that the team has the understanding to rebuild and improve our functions as we go along. Contributions come in different sizes, but the key is that we’re painting happy little clouds together.

We automate support because we know that in order for us to be the top infra platform of choice, we need to spend time listening to user needs instead of brute-forcing our own unmanageable systems.

Few office jobs in the U.S. (with the exception of Engineering) have the expectation that their work should scale by automation. The usual painkiller for most roles is headcount.

This comes with a number of disastrous side-effects. For every person added to the system, there are greater levels of complexity, which beget organizational entropy.

We found instead it’s best for people to be closely aligned with their work and empowered with the tools to deal with what’s on their desk. Sometimes hiring is really the best solution to a problem, but most of the time it isn’t.

This means we can out-ship those who require infinite meetings with GPMs and AEs just to talk to a customer.

(I also found that developers are generally pretty hostile to the concept of being subject to shitty support. Press or say 1 if you agree.)

To achieve the level of personal support we want with so few people, we have to embrace the ROI of automating our own jobs. It reduces the drag on other employees, frees us up to solve the next business problem, and makes it easier to compete with companies that ship their org charts. Our automations may not start off as perfect, but we will damn sure iterate to get there.

As a team of one, we were able to support 250k users, and now that we've crossed that bridge, our goal is to support two million users with our 2-person team for a 1M-to-1 ratio.

That will be another story for some months in the future.