Serverless Inference on GPUs with Banana.dev

Banana.dev is a serverless GPU platform that lets users deploy AI models quickly and easily.

It’s an impressive company with a ton of technical challenges and opportunities. We were excited to learn more about the customers Banana is serving and the way the team is using Railway to power some of their services.

We were lucky enough to speak with Banana founder Erik Dunteman to learn more about how Banana is making it easy to serve models and bring infrastructure simplicity to GPUs.

Let’s dive in!

Banana.dev

Railway: Can you give us a little introduction to what Banana is doing and what kinds of things end users are doing with Banana?

Erik: Banana provides serverless GPUs for machine learning inference. For users who are doing real-time inference like running a prediction, generating an image — whatever the AI-powered process may be — if they have real-time demands, they need GPUs that autoscale to meet demand.

Without autoscaling, you end up with incredibly underutilized GPUs. The naïve hosting approach is to set up a static number of replicas sized to handle peak traffic, so most of that capacity sits idle whenever traffic is below peak. The AI teams that do this end up spending a bunch of money, probably ten times more than they ought to.

On the other hand, it’s critical to have as little queuing as possible. Our users’ products depend on fast responses — there’s often an end user on the other end of their product waiting for a response and end-user retention is directly correlated to how fast the responses are.

So that’s why we’re building Banana in the first place: to give you the ability to scale from 0 to 1 to as many replicas as you want and then back to zero, without sacrificing latency.
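To make the utilization argument concrete, here is a minimal sketch comparing GPU-hours for a static fleet sized for peak against a fleet that scales with demand, including down to zero. The traffic profile and the per-replica throughput are made-up numbers for illustration, not figures from Banana, and the model ignores cold starts and billing granularity. The actual ratio depends entirely on how spiky the traffic is.

```python
import math

# Hypothetical requests/second for each hour of one day (illustrative only).
HOURLY_RPS = [0, 0, 0, 0, 0, 1, 2, 4, 8, 12, 16, 20,
              20, 16, 12, 8, 6, 4, 3, 2, 1, 1, 0, 0]
RPS_PER_REPLICA = 2.0  # assumed throughput of a single GPU replica


def replicas_needed(rps: float, rps_per_replica: float) -> int:
    """Smallest replica count that serves `rps` without queuing; 0 when idle."""
    return math.ceil(rps / rps_per_replica)


def static_gpu_hours(hourly_rps, rps_per_replica) -> int:
    """GPU-hours when the fleet is sized for peak traffic all day."""
    peak = max(replicas_needed(r, rps_per_replica) for r in hourly_rps)
    return peak * len(hourly_rps)


def autoscaled_gpu_hours(hourly_rps, rps_per_replica) -> int:
    """GPU-hours when the replica count follows demand hour by hour."""
    return sum(replicas_needed(r, rps_per_replica) for r in hourly_rps)


static_hours = static_gpu_hours(HOURLY_RPS, RPS_PER_REPLICA)
autoscaled_hours = autoscaled_gpu_hours(HOURLY_RPS, RPS_PER_REPLICA)
utilization = autoscaled_hours / static_hours  # share of the static fleet doing work

print(f"static: {static_hours} GPU-hours, autoscaled: {autoscaled_hours} GPU-hours")
print(f"static fleet utilization: {utilization:.0%}")
```

With this particular profile the static fleet burns several times the GPU-hours of the autoscaled one. A product with short, sharp bursts and long idle stretches would see a larger gap.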

Banana has a large number of templates to get up and running

Railway: Who is the target audience for Banana? Is it primarily for developers integrating a real-time inference pipeline in their application?

Erik: Yeah, exactly. The value is in getting your model into production and then scaling it without thinking about it. Most of our users have already built something in-house and have gotten frustrated that they’re thinking about infrastructure all day instead of their core business.

And beyond just the infrastructure problem itself, running inference opens a Pandora’s Box of issues — there’s the interface with the hardware which is unique in AI, there’s the fact that GPUs are not very easy to provision or autoscale, and then there’s also an overall lack of availability for GPUs period.

When a company chooses Banana, we end up saving them months of time building this out themselves.

Railway: What kinds of models are you seeing people deploy most often?

Erik: We’re seeing a lot of CLIP, Whisper, Stable Diffusion … but of course users have their own custom implementations. Every deployment is unique to the user.

Since we are a code deployment platform rather than an API, users get creative on us and we see a lot of everything.

But we have pretty much every type of model. We have people doing music generation, image generation, and then also more traditional ML applications as well.

Deploying Automatic1111 on Banana

Railway: One thing we think Banana has in common with Railway is this dream of like, push a button, don’t worry about the underlying infrastructure. How do you see the big picture?

Erik: I think with dev tools, and especially infra, we all have this same vision of reducing the things that take up developer mindshare but are unrelated to the core business.

Within GPUs, the main consideration is cost — GPUs need to be serverless and need to scale to zero. I think the multi-tenant model works for that because if we scale a user to zero then that GPU is free for someone else.

If someone is building a GPU cluster in-house, then they need to either build that system themselves, or have so many models that there’s another one to run waiting in the wings.

I see a centralized GPU provider that scales to zero being almost a necessity for an increasing number of AI companies to be born in the first place, especially as fine-tuning tooling gets better and teams can no longer run on off-the-shelf APIs.

Railway: You hear a lot about the cold start problem in GPU land – how are you guys managing to bring down cold starts for your users?

Erik: The cold boot problem is the number one technical problem we’re tackling. We’re working to get multi-gigabyte models to cold boot in sub-second times, consistently, so teams can reliably depend on those boots to happen on demand.

The most I can share about how we’re approaching cold starts is that we have to think a lot about keeping the weights as close to GPU memory as possible without actually using GPU memory. It’s tempting to “fake” serverless by actually keeping resources in memory between calls, but we’ve worked hard to actually solve the technical challenge and boot the servers on-demand.

We’re down at the lower level doing fun stuff like writing Rust and CUDA code to make this happen. We’ve found that it does require us to be opinionated (in our case, through our Potassium HTTP framework) to keep down the noise and deliver a more reliable service. In the end, though, it’s important to us to let users serve arbitrary Python functions in arbitrary environments.

Railway: Can you tell us a little bit about how you started working with Railway and what applications you’re running? You recently tweeted something about Redis, for example?

Erik: We’re an infrastructure company, so our main bread and butter is the inference pipeline on GPUs, which we manage ourselves on our own bare metal GPUs. And much like Railway, we have our own scheduler orchestrating those workloads.

We do have ancillary services which we’d rather not distract ourselves with self-hosting. We transitioned onto Railway from managing our own services with Pulumi and other platforms such as Zeet.

Railway is now hosting the vast majority of our non-GPU services. For example, we run an ingestion layer on Railway. We were trying to figure out how to reduce our total system time on an inference call. Previously that was more than a second end-to-end, inclusive of network calls, reading from the database, and reading from Kubernetes, just to get a call to arrive at the GPU process and return.

Moving the ingestion service onto Railway and moving the caching layer from Kubernetes into Redis allowed us to shave that one-second call down to 4 ms. Redis is now colocated with the service that calls it, and the two are well networked.

Railway: If our readers wanted to find out more about Banana, what should they do? Do you have some resources to get up and running on Banana?

Erik: The best way to get up and running with Banana is to simply use Banana. When you sign up, we give you one hour of GPU credits that you can play around with.

Find us at banana.dev or on Discord or Twitter. We also have a templates page with 1-click deploy models to get you up and running quickly. Our community is quick to publish the most recent exciting models, including Whisper, Stable Diffusion, AUTOMATIC1111, and more. The templates also come with source code that you can fork when it’s time to make it your own.
