We’ve been hard at work on some foundational improvements to the Railway platform. One of the most recent is a rework of the way we handle deployment logs. Before this change, all 500,000+ Railway deployments relied on Docker to provide logs at the user’s request.
Railway Logs V2 with filtering!
Every time the logs view was opened, a request was sent to our deployment infrastructure, logs were fetched from Docker, and those logs were returned to the frontend to be displayed. All of this happened via polling rather than a stream, which meant the whole cycle was repeated every 2 seconds! That put a lot of strain on the Docker daemon, which also has the much-more-important job of keeping deployments online. ☀️
The simplicity of this initial implementation, however, was great in the early stages because it allowed us to launch a product quickly. But Railway has grown tremendously since then and it was time for a more scalable solution.
Introducing Logs V2
Every part of our old logging infrastructure has been rebuilt from the ground up. Docker’s only log-related job now is to forward all logs to a local collector, meaning it can stick to its primary role of keeping deployments online. These logs eventually end up in a centralized logging service that handles log aggregation and querying.
Unfortunately, this huge improvement is mostly invisible to you, apart from two new features!
- The ability to view logs for all historical deployments, not just the running one
- The ability to filter logs by keyword, which applies to both historical fetching and the real-time stream
Log filtering demo
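To make the "one filter, two sources" idea concrete, here is a minimal sketch of a keyword filter that works the same way on a finished list of historical lines and on a live stream. It is an illustration only, not our actual implementation: the `filter_logs` and `live_tail` names, the sample log lines, and the case-insensitive matching are all assumptions made for the example.

```python
from typing import Iterable, Iterator


def filter_logs(lines: Iterable[str], keyword: str) -> Iterator[str]:
    """Yield only the lines containing `keyword`.

    Matching is case-insensitive here (an assumption for this sketch).
    The function is lazy, so `lines` can be a finished list of historical
    logs or a generator that is still producing lines.
    """
    needle = keyword.lower()
    for line in lines:
        if needle in line.lower():
            yield line


# Historical fetch: a finite list of lines (made-up sample data).
historical = ["GET /health 200", "ERROR db timeout", "GET /users 200"]


def live_tail() -> Iterator[str]:
    """Hypothetical stand-in for a real-time log stream."""
    yield "worker started"
    yield "error: retrying job 7"


if __name__ == "__main__":
    print(list(filter_logs(historical, "error")))
    for line in filter_logs(live_tail(), "error"):
        print(line)
```

Because the filter is a generator, it never needs to know whether its input will end, which is what lets the same code path serve both cases.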
Both of these new features were only possible because of the abilities of our new log service. But what is it, you ask?
Building a Centralized Log Service
We used a few neat tools to build a centralized log service capable of aggregating and querying logs at a larger scale 👇
Vector sits at the heart of this solution. We configured Docker’s syslog driver to send container logs to a local Vector instance that runs on each of our nodes. These logs are then forwarded to a different, external Vector instance where they are both streamed to the filesystem and sent in batches to Google Cloud Storage for long-term persistence.
Vector as an agent and central service
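The "stream to the filesystem and batch to long-term storage" behavior can be pictured with a small fan-out sketch. Vector does this work for us through its own configuration, so the code below is not how Vector is implemented or configured; the `FanOut` class, its callbacks, and the batch size are hypothetical names and values chosen purely to show the pattern.

```python
from typing import Callable, List


class FanOut:
    """Send every record to a streaming sink at once and to a batching sink in groups."""

    def __init__(
        self,
        stream: Callable[[str], None],
        upload: Callable[[List[str]], None],
        batch_size: int,
    ) -> None:
        self.stream = stream          # e.g. append to a local file
        self.upload = upload          # e.g. write one object to long-term storage
        self.batch_size = batch_size  # assumption: batches are flushed by record count
        self.pending: List[str] = []

    def write(self, record: str) -> None:
        self.stream(record)           # real-time path: no buffering
        self.pending.append(record)   # long-term path: buffer until the batch is full
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Upload whatever is buffered, if anything."""
        if self.pending:
            self.upload(self.pending)
            self.pending = []


if __name__ == "__main__":
    streamed: List[str] = []
    batches: List[List[str]] = []
    fan = FanOut(streamed.append, batches.append, batch_size=2)
    for i in range(5):
        fan.write(f"line {i}")
    fan.flush()  # push the final partial batch
    print(len(streamed), [len(b) for b in batches])
```

The split matters because the two destinations have different goals: the streaming path favors low latency, while batching keeps the number of writes to object storage manageable.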
The logs that are streamed to the filesystem are used to provide real-time log streams (by tailing files) and short-term query results. However, these filesystem logs are stored in hourly buckets and kept for only a few hours, so Google Cloud Storage is queried for any logs that fall outside this range.
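As a rough illustration of how a query might be split between the two stores, the sketch below truncates timestamps to hourly buckets and routes each bucket to either the filesystem or long-term storage. Everything specific in it is an assumption made for the example: the three-hour `RETENTION` value (the real window is only "a few hours"), the file path layout, and the `plan_query` function itself.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

RETENTION = timedelta(hours=3)  # assumption: stands in for "a few hours"


def bucket_start(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def bucket_path(deployment_id: str, ts: datetime) -> str:
    """Hypothetical file path for the hourly bucket that contains `ts`."""
    return f"/var/log/deployments/{deployment_id}/{bucket_start(ts):%Y-%m-%dT%H}.log"


def plan_query(start: datetime, end: datetime, now: datetime) -> Dict[str, List[datetime]]:
    """Assign each hourly bucket overlapping [start, end] to the store that holds it."""
    cutoff = bucket_start(now - RETENTION)  # oldest bucket still on disk
    plan: Dict[str, List[datetime]] = {"filesystem": [], "cloud_storage": []}
    hour = bucket_start(start)
    while hour <= end:
        source = "filesystem" if hour >= cutoff else "cloud_storage"
        plan[source].append(hour)
        hour += timedelta(hours=1)
    return plan


if __name__ == "__main__":
    now = datetime(2022, 1, 1, 12, 30, tzinfo=timezone.utc)
    plan = plan_query(now - timedelta(hours=5, minutes=15), now, now)
    for source, hours in plan.items():
        print(source, [h.strftime("%H:00") for h in hours])
```

A query that spans the retention boundary simply reads the older buckets from long-term storage and the newer ones from disk, then concatenates the results in time order.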
And finally, to monitor the health of all this, we used Vector’s Datadog sink to send internal metrics to Datadog. We configured some pretty graphs to give us an overview of log volume, as well as some alerts to let us know when something is wrong.
Datadog metrics during the beta rollout
We don’t have time to go over the other half of the architecture (log fetching and streaming), but we’re planning to share more on this later. For now, here’s a brief overview of what happens when a user loads the logs view.
- On the Frontend, a useEffect React hook mounts, calling a react-query hook, which accesses a GraphQL Subscription.
- On our Backend, the GraphQL subscription validates permissions and opens a gRPC stream to the logging service.
- The logging service uses this stream to push directly to the Frontend, causing React to re-render.
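The three steps above can be sketched as a chain of async generators: one standing in for the logging service's stream, one for the backend subscription that checks permissions before relaying, and one for the frontend consumer. This is a simplified model in plain Python, not our GraphQL or gRPC code; the `PERMISSIONS` table, the function names, and the sample log lines are all hypothetical.

```python
import asyncio
from typing import AsyncIterator, List

# Hypothetical permission table: user -> deployments they may read.
PERMISSIONS = {"alice": {"deploy-1"}}


async def logging_service_stream(deployment_id: str) -> AsyncIterator[str]:
    """Stand-in for the gRPC stream opened to the logging service."""
    for i in range(3):
        await asyncio.sleep(0)  # yield control, as a real network read would
        yield f"{deployment_id}: line {i}"


async def subscribe_logs(user: str, deployment_id: str) -> AsyncIterator[str]:
    """Stand-in for the backend subscription: validate, then relay the stream."""
    if deployment_id not in PERMISSIONS.get(user, set()):
        raise PermissionError(f"{user} cannot read logs for {deployment_id}")
    async for line in logging_service_stream(deployment_id):
        yield line  # pushed to the client as it arrives, with no polling


async def collect(user: str, deployment_id: str) -> List[str]:
    """Stand-in for the frontend: consume the subscription and gather lines."""
    return [line async for line in subscribe_logs(user, deployment_id)]


if __name__ == "__main__":
    print(asyncio.run(collect("alice", "deploy-1")))
```

The key contrast with the old design is that the consumer waits for lines to be pushed instead of asking for them every 2 seconds, and the permission check happens once, when the stream is opened.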
The Future of Logs
Since the primary focus of this change was to replace our old log infrastructure, additional features were kept quite light. But there’s a lot more we can do now that we have a solid base to work from. In fact, you can let us know what you’d like to see by submitting a request to feedback.railway.app or commenting on a few log-related requests that already exist.
And, as always, head over to our Discord with any questions or feedback. We’d love to hear what you think!