Sep 29, 2023

Our New Observability Platform for 100B Logs

Welcome to Railway Launch Week 01. The theme of this week is Scale. Throughout the week we’ll be dropping some amazing new features to help your business and applications scale to millions of users.

If this is your first time hearing about us, we’re an infrastructure company that provides instant application deployments to developers.

When the train car derails…

It’s Thursday morning. You’ve just spent the week readying your product for its 1.0 launch, meticulously testing every flow and feature to ensure everything works. The clock strikes 10AM. The press publishes. The waitlist notifies. Potential customers flood in.

Then suddenly — silence.

Your admin dashboard stops responding, database activity drops to 0, and emails flood into your inbox with words like broken, crashed, refund.

…

Unfortunately this story happens all too frequently, and no amount of preparation can eliminate every edge case or bottleneck. The only way forward is to identify and fix problems as fast as possible.

This is why today is all about Observability. 🕵️

We’ve been working on observability for almost a year now, evolving our bare-minimum logging experience to fill in the missing pieces. You can now do things like trace errors across services, distinguish informational logs from errors, and attach additional metadata.

This is the story of how and why we rebuilt our logging infrastructure to make sense of 1 billion new logs per day.

Observability 1.0 — Log Explorer

Today, we’re announcing the first piece of the observability puzzle: Log Explorer.

Log Explorer showcasing a structured error log

Log Explorer provides an environment-level view of all logs, enabling powerful new features:

Search for a generic term like “error” across all services at once
Dive into a specific service, deployment, or replica
View or filter by structured log attributes like user_id=123
Reveal the context around a specific log within search results
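
Filtering on an attribute like user_id=123 relies on the log line carrying that attribute in a structured form. As a rough illustration, here is a minimal sketch of an application emitting one JSON object per line to stdout. The field names (`timestamp`, `level`, `message`) and the JSON-per-line format are assumptions made for this example, not a format confirmed in this post, and `structured_log` and `emit` are hypothetical helpers.

```python
import json
import sys
from datetime import datetime, timezone


def structured_log(level: str, message: str, **attributes) -> str:
    """Build one JSON log line; extra keyword arguments become attributes."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,        # assumed field name
        "message": message,    # assumed field name
        **attributes,          # e.g. user_id=123
    }
    return json.dumps(record)


def emit(level: str, message: str, **attributes) -> None:
    """Write the log line to stdout, one JSON object per line."""
    sys.stdout.write(structured_log(level, message, **attributes) + "\n")


emit("error", "payment failed", user_id=123, order_id="ord_42")
```

With a line shaped like this, an attribute such as `user_id` is available as a distinct key rather than buried in free text.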

This is all possible today, with Log Explorer. The visible interface is only part of the story, however. We completely rebuilt the engine while rocketing down the tracks at breakneck speed.

1 billion logs / day
100 billion logs
60 queries/sec

Rebuilding logging, again

Our first ever implementation of logging was naive. We simply polled the Docker daemon for logs when the UI needed them. However, as we increased efficiency, we shoved more and more deployments onto the same server and the Docker daemon simply couldn’t handle the additional overhead of also serving logs.

So in 2022, when we launched Logging V2, we used a combination of local filesystem and Google Cloud Storage for persistence. This scaled very well and we’re actually still dual-writing all logs to the old system in case catastrophe strikes and we need to revert.

So why replace the current working architecture with a new unproven solution? The main reason for Logging V2 was to have a scalable solution that supported filtering. We built the simplest architecture to suit our needs but, unfortunately, this narrow focus painted us into a corner:

Logs were bucketed by deployment, making cross-service streaming difficult
Query filtering happened in a single Go process, making far-back queries slow
Short-term log storage was filesystem-based, with no easy way to horizontally scale
Aggregate metric queries like COUNT or GROUP BY were not supported

The common theme with the above features is that they would each require a restructuring of the logging architecture to support. There’s a world where we double-down to make it work — Loki does all this with a very similar storage architecture — but we’re not looking to get into the business of building a database.

We decided to go back to the drawing board and see if any off-the-shelf solutions would suffice, and there was one tool that kept coming up over and over again: ClickHouse.

In-house to ClickHouse

ClickHouse is a column-oriented database used for log aggregation and analytics. I had initially brushed it off because, honestly, it felt too good to be true. Scale, speed, and cheap storage, all with a familiar SQL interface? What’s the catch?

After reading four or five technical posts from other tech companies claiming that ClickHouse solved all their problems (here are a few of the standout posts from Cloudflare, eBay, and Uber), I was intrigued enough to start digging in. One thing that stood out immediately was how much it rhymed with the thing we were avoiding building in-house.

ClickHouse is an efficient data storage system with a distributed query engine on top. No foreign keys, no traditional indexes, just raw horsepower applied in an intelligent manner. This simplicity makes cost, performance, and scale easy to reason about. To confirm this reasoning, we spun up a single-node instance with 30 TB of production logs and ran all the queries we’d need to power both our existing and future observability features.

It ticked every. Single. Box.

👶 Simple Queries → Familiar SQL interface, combined with extensive docs on scale-related features like distributed queries, data lifecycles with TTLs, and data ETL with materialized views.

💽 Compact Storage → Column-oriented means column data is stored together. And, since columns usually contain very similar values (think user IDs or enums), compression is extremely efficient, squashing our 28TB of tagged logs into only 5TB of disk space!

🐇 Fast Queries → Our 2-replica/2-shard setup is lightning fast, with an average query response time of 50ms while executing 60 queries/sec!

🧑‍🔬 Advanced Queries → JSON functions allow us to combine structured log queries and advanced filters using dynamically generated WHERE clauses (more in an upcoming post)

🏘️ Simple Ops → Thanks to ClickHouse Operator, all it takes is a single variable change to our Terraform config to scale out the cluster. We recently tested adding a third replica, and it took mere minutes to pull in the data from existing replicas and bootstrap itself.

The configuration variables for our ClickHouse Terraform configuration
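
For a sense of scale, the figures quoted in this post can be turned into a few back-of-the-envelope numbers. The sketch below uses only values stated above (28TB compressed to 5TB, 1 billion logs per day, 60 queries/sec at a 50ms average). The in-flight estimate applies Little's law and assumes steady, evenly spread load, which real traffic will not match exactly.

```python
# Figures quoted in this post.
RAW_TB = 28                   # tagged logs before compression
STORED_TB = 5                 # size on disk in ClickHouse
LOGS_PER_DAY = 1_000_000_000
QUERIES_PER_SEC = 60
AVG_QUERY_SECONDS = 0.050

SECONDS_PER_DAY = 24 * 60 * 60

compression_ratio = RAW_TB / STORED_TB
space_saved = 1 - STORED_TB / RAW_TB
logs_per_second = LOGS_PER_DAY / SECONDS_PER_DAY
# Little's law: average in-flight queries = arrival rate x average latency.
queries_in_flight = QUERIES_PER_SEC * AVG_QUERY_SECONDS

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"disk space saved:  {space_saved:.0%}")
print(f"average ingest:    {logs_per_second:,.0f} logs/sec")
print(f"queries in flight: {queries_in_flight:.1f} on average")
```

That works out to roughly a 5.6x compression ratio, about 11,600 logs ingested per second on average, and around three queries in flight at any moment.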

This proof-of-concept was a success, so we decided to prove it out further, in production.

Live in production in days

We were able to slot ClickHouse seamlessly into our existing logging architecture.

The first step was configuring our ingestion tool Vector to dual-write to ClickHouse in batches once per second. This allowed us to test production ingestion volume and gave us real-world data to play with.
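
Vector handles the batching and fan-out itself, so none of this needs to be hand-written. Purely to illustrate the idea of dual-writing in one-second batches, here is a minimal sketch. `IntervalBatcher`, the sink callables, and the fake clock are all hypothetical names for this example and do not reflect Vector's internals or configuration.

```python
import time
from typing import Callable, List, Sequence

Batch = List[str]


class IntervalBatcher:
    """Buffers log lines and fans each batch out to every sink.

    A batch is flushed when a line arrives and at least `interval_seconds`
    have passed since the previous flush. A real shipper would also flush
    on a timer; this sketch only checks the clock when `add` is called.
    """

    def __init__(self, sinks: Sequence[Callable[[Batch], None]],
                 interval_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sinks = list(sinks)
        self._interval = interval_seconds
        self._clock = clock
        self._buffer: Batch = []
        self._last_flush = clock()

    def add(self, line: str) -> None:
        """Buffer a line, then flush if the interval has elapsed."""
        self._buffer.append(line)
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self) -> None:
        """Send the buffered lines to every sink and start a new batch."""
        if self._buffer:
            for sink in self._sinks:
                sink(list(self._buffer))  # each sink gets its own copy
            self._buffer = []
        self._last_flush = self._clock()


legacy_store: List[Batch] = []      # stand-in for the existing log store
clickhouse_store: List[Batch] = []  # stand-in for a ClickHouse insert

now = [0.0]  # fake clock so the example is deterministic
batcher = IntervalBatcher([legacy_store.append, clickhouse_store.append],
                          clock=lambda: now[0])
batcher.add("GET /health 200")
batcher.add("GET /login 500")     # same second: still buffered
now[0] = 1.0
batcher.add("POST /signup 201")   # one second elapsed: all three lines flush
print(clickhouse_store)
```

Writing one batch per second rather than one insert per log line keeps the number of inserts low regardless of how many lines arrive in that second.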

Next, we modified our query mechanism to fetch from ClickHouse. Our existing Go log service already performed concurrent filtering itself, so all it took was a simple SQL query to fetch from ClickHouse.

And that’s it!

🔥

We wrote a custom parser to convert our filter DSL to SQL clauses in order to take advantage of ClickHouse’s distributed query engine (long story for future post).
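
The actual parser and filter DSL are not shown in this post, so the following is only a sketch of the general technique under stated assumptions. It assumes a hypothetical DSL where `key=value` tokens match structured attributes and any other token is a text search, a hypothetical `attributes` column holding a JSON string whose values are stored as strings, and a hypothetical `message` column. `JSONExtractString`, `positionCaseInsensitive`, and the `{name:Type}` query-parameter syntax are existing ClickHouse features; everything else is illustrative.

```python
import re
import shlex
from typing import Dict, List, Tuple

# Attribute keys are inlined into the SQL, so only plain identifiers pass.
_KEY = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def filter_to_where(query: str) -> Tuple[str, Dict[str, str]]:
    """Turn a filter string into a WHERE clause plus query parameters.

    Values are never inlined; they are referenced as ClickHouse query
    parameters ({p0:String}, {p1:String}, ...) and returned separately.
    """
    clauses: List[str] = []
    params: Dict[str, str] = {}
    for i, token in enumerate(shlex.split(query)):  # honors "quoted phrases"
        name = f"p{i}"
        key, sep, value = token.partition("=")
        if sep and _KEY.fullmatch(key):
            # Structured attribute match (assumes string-valued attributes).
            clauses.append(
                f"JSONExtractString(attributes, '{key}') = {{{name}:String}}"
            )
            params[name] = value
        else:
            # Free-text search over the hypothetical message column.
            clauses.append(
                f"positionCaseInsensitive(message, {{{name}:String}}) > 0"
            )
            params[name] = token
    return " AND ".join(clauses) or "1", params


where, params = filter_to_where('level=error user_id=123 "connection reset"')
print(where)
print(params)
```

Keeping user-supplied values in parameters rather than concatenating them into the SQL text avoids injection problems, and pushing the whole filter into the WHERE clause lets the database do the filtering instead of the application.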

We quickly added a flag to force ClickHouse queries for our internal team members and were using ClickHouse live in production in less than a week!

The initial SQL query to fetch ClickHouse logs in production

Real-world woes

It’s typically bad to run QA experiments in production but there are certain qualities that just can’t be reproduced (or even predicted) artificially.

The main issue we faced was due to asynchronous replication. It can take a few seconds to synchronize data from primary to secondary replicas, and this delay increases under load, which is why we didn’t catch it in testing. This caused some log queries to omit the latest few seconds of data, which was a dealbreaker.

Our first solution to this was adding an HAProxy to direct all inserts and queries to the primary replicas and only fall back to secondaries during service disruptions. This should have worked, in theory, except that ClickHouse automatically proxies queries to other replicas when load increases. This was unexpected, but also demonstrates just one way that ClickHouse scales effortlessly.

Eventually we stumbled upon the insert_distributed_sync setting which forces inserts to propagate to all replicas before returning; no matter which replica fulfills the log query, it will always contain up-to-date data. It would be nice to have learned this sooner, but it forced us to learn how ClickHouse operates and how to debug it. So … lose/win?
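
To make the difference concrete, here is a toy model of the behavior described above. It is not how ClickHouse implements replication. The `Shard` and `Replica` classes are hypothetical, and the `sync` flag simply stands in for the effect this post attributes to insert_distributed_sync: the insert does not return until every replica has the row.

```python
from typing import List


class Replica:
    def __init__(self) -> None:
        self.rows: List[str] = []


class Shard:
    """Toy shard: writes land on replica 0, then reach the other replicas."""

    def __init__(self, replica_count: int = 2) -> None:
        self.replicas = [Replica() for _ in range(replica_count)]
        self._pending: List[str] = []  # rows not yet copied to secondaries

    def insert(self, row: str, sync: bool) -> None:
        self.replicas[0].rows.append(row)
        self._pending.append(row)
        if sync:
            self.replicate()  # do not return until every replica has the row

    def replicate(self) -> None:
        """Background replication catching up (called by hand in this toy)."""
        for replica in self.replicas[1:]:
            replica.rows.extend(self._pending)
        self._pending.clear()

    def query(self, replica_index: int) -> List[str]:
        return list(self.replicas[replica_index].rows)


async_shard = Shard()
async_shard.insert("log-1", sync=False)
stale = async_shard.query(1)      # secondary has not caught up yet
async_shard.replicate()
caught_up = async_shard.query(1)  # now the row is visible

sync_shard = Shard()
sync_shard.insert("log-1", sync=True)
fresh = sync_shard.query(1)       # visible immediately on any replica

print(stale, caught_up, fresh)
```

The trade-off is that a synchronous insert takes as long as the slowest replica, in exchange for queries that see the same data no matter where they land.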

Another benefit of this struggle was that it illuminated which metrics and monitors we should create in order to understand the overall health of the system. We now have 20+ charts and monitors related to logs to help identify and pinpoint issues immediately as they arise.

Monitors to alert the team of performance and availability issues

Two (of many) charts to help understand causation and performance trends

Our ClickHouse cluster has been up and running for four months now and has been worth its weight in gold. As mentioned, we’ve already taken advantage of it to add new features like Structured Logs, but that’s just the beginning.

The future of Observability

Log Explorer is our first major step toward a great observability experience.

ClickHouse isn’t the only upgrade we’ve made under the hood. We’ve also swapped our metrics store from TimescaleDB to VictoriaMetrics, unlocking a huge number of benefits there too.

There’s a lot on our minds as we continue to upgrade the Observability experience. Chief among our plans are:

Including network logs from our external and internal proxies
Alerting based on log filter criteria
Plotting more advanced metrics to showcase trends
Submit your own idea!

Oh, and one more thing. We pride ourselves on not requiring application-level lock-in, so we don’t require any code changes to take advantage of these new features. Simply continue emitting logs as you always have and we’ll present them to you in a faster, cleaner, more searchable way!

Our goal is to deliver the best observability experience of any application deployment platform on the web. We can only do that with your feedback! Let us know what you’d like to see next and, in the meantime, happy shipping.

We hope you enjoyed Launch Week 01! There’s still one more big announcement coming, so be sure to check the Launch Week landing page!