Deploying a Multi-Service AI Platform on a Budget

It’s Not One Thing

The first time I tried to deploy this platform, I treated it like a single application. One repository, one deploy, done. That lasted about 30 seconds before I realized this thing has at least four separate processes that all need to run simultaneously:

- A WebSocket relay that maintains a persistent connection to the data source
- An API server that handles REST endpoints, WebSocket broadcasting, and scheduled jobs
- A vector database for semantic search
- A frontend web application

These aren’t optional components. They all need to be running for anything to work. The relay feeds data in. The API serves it out. The vector database stores it. The frontend displays it. Kill any one of them and the whole system is degraded.

And they can’t all run in the same process. The WebSocket relay is a long-running blocking connection. The API server is an async web framework. The vector database is a separate service entirely. The frontend is a static site that gets built and served independently.

The Multi-Service Reality

Cloud platforms generally assume you’re deploying one thing. A web app. A worker process. A database. You pick a template, push your code, and it figures out how to run it.

When you need four services from the same repository, each with different startup commands, different resource requirements, and different networking needs, the deployment story gets complicated fast.

Each service needs its own configuration:

- The relay needs a startup command that runs the WebSocket listener script
- The API needs a startup command that runs the web framework with the right host and port bindings
- The vector database runs as a Docker container with persistent storage
- The frontend needs a build step followed by a static file server

They also need to talk to each other. The relay writes to the vector database. The API reads from both the vector database and the analytics database. The frontend talks to the API. These internal connections need to use private networking so they’re fast and don’t incur external traffic costs.

And only some services need public URLs. The API needs one so the frontend can reach it. The frontend needs one so users can access it. The relay and the database should be internal only.
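One way to keep this straight is to write the layout down as data before touching any platform settings. The sketch below is only an illustration: the service names, the placeholder start commands, and the `depends_on` field are my assumptions, not the actual configuration, and the analytics database is left out because it is not one of the four services.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Service:
    name: str
    start_command: str              # placeholder text, not a real command
    public: bool                    # True if the service gets a public URL
    depends_on: Tuple[str, ...] = ()


# Hypothetical manifest mirroring the four services described above.
SERVICES = [
    Service("vector-db", "<run the vector database container>", public=False),
    Service("relay", "<run the WebSocket listener script>", public=False,
            depends_on=("vector-db",)),
    Service("api", "<run the web framework with host and port>", public=True,
            depends_on=("vector-db",)),
    Service("frontend", "<build, then serve static files>", public=True,
            depends_on=("api",)),
]


def public_services(services: List[Service]) -> List[str]:
    """Names of the services that should be reachable from outside."""
    return [s.name for s in services if s.public]


def unknown_dependencies(services: List[Service]) -> List[Tuple[str, str]]:
    """(service, dependency) pairs that point at a service not in the manifest."""
    known = {s.name for s in services}
    return [(s.name, d) for s in services for d in s.depends_on if d not in known]
```

With this manifest, `public_services(SERVICES)` gives only the API and the frontend, which matches the rule that the relay and the database stay internal.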

The GPU Problem

Here’s a fun one. The local embedding model that runs great on my development machine with a GPU? It doesn’t work on cloud infrastructure that only has CPUs. This seems obvious in retrospect, but the first deployment just crashed with cryptic errors about missing CUDA drivers.

The fix required building a fallback chain for the embedding system.

On startup, the system checks what hardware is available:

- GPU with CUDA? Use the configured model with half-precision acceleration.
- CPU only? Switch to a CPU-optimized model automatically.
- That model fails to load? Try the next one in the fallback chain.
- All models fail? Disable embeddings gracefully and continue running everything else.

The key word is “gracefully.” The platform should still work even if embeddings are broken. You lose semantic search, but the analytics, the agents, and the momentum tracking can all function without embeddings. So the system logs the failure, sets a flag, and keeps going.
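A minimal sketch of that chain, under a few assumptions: the model names are placeholders, `loader` stands in for whatever function actually loads a model, and `has_gpu` is supplied by the caller rather than detected here. Half-precision setup is not shown.

```python
import logging
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger("embeddings")


def choose_candidates(has_gpu: bool) -> List[str]:
    """Ordered list of models to try. All names here are placeholders."""
    cpu_chain = ["cpu-optimized-model", "small-cpu-model"]
    if has_gpu:
        return ["configured-gpu-model"] + cpu_chain
    return cpu_chain


def load_with_fallback(
    candidates: Sequence[str],
    loader: Callable[[str], object],
) -> Optional[object]:
    """Return the first model that loads, or None if every candidate fails."""
    for name in candidates:
        try:
            model = loader(name)
        except Exception as exc:  # any load failure moves us down the chain
            logger.warning("model %s failed to load: %s", name, exc)
            continue
        logger.info("using embedding model %s", name)
        return model
    logger.error("all embedding models failed; semantic search disabled")
    return None
```

The caller can then set the flag with something like `embeddings_enabled = model is not None` and skip semantic search wherever it is false, while everything else keeps running.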

This fallback chain was designed after the first deployment failed at 2am and I had to wake up to figure out why the whole platform was down because one model couldn’t load. Now the worst case is degraded functionality, not a crash.

Environment Variable Hell

Four services, all reading from the same set of environment variables but each needing slightly different values. The API server needs to know its own public URL for CORS headers. The relay needs the internal URL of the vector database. The frontend needs the public URL of the API. The vector database needs its storage configuration.

And then there are the shared secrets. The AI model API keys, the cryptographic wallet key, the database credentials. These need to be the same across services but configured separately in each one because they’re separate deployment units.

I ended up with a master environment template that documents every variable, which services need it, and what the default value should be. Without this, every deployment was a game of “which service is failing because I forgot to set QDRANT_URL in that specific service’s config.”
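The template can double as a startup check. Here is a rough sketch: `QDRANT_URL` is the variable mentioned above, while `API_PUBLIC_URL`, `MODEL_API_KEY`, and the service names are made up for illustration. Default values are left out to keep it short.

```python
import os
from typing import Dict, List, Mapping, Optional

# Variable name -> services that need it.
# Only QDRANT_URL comes from the text; the rest are hypothetical.
ENV_TEMPLATE: Dict[str, List[str]] = {
    "QDRANT_URL": ["relay", "api"],
    "API_PUBLIC_URL": ["api", "frontend"],
    "MODEL_API_KEY": ["api"],
}


def missing_variables(
    service: str, env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Variables this service needs that are unset or blank."""
    env = os.environ if env is None else env
    return sorted(
        name
        for name, services in ENV_TEMPLATE.items()
        if service in services and not env.get(name, "").strip()
    )


def validate_or_exit(service: str, env: Optional[Mapping[str, str]] = None) -> None:
    """Fail fast at startup with a message that names the missing variables."""
    missing = missing_variables(service, env)
    if missing:
        raise SystemExit(
            f"{service}: missing required environment variables: {', '.join(missing)}"
        )
```

Calling `validate_or_exit("relay")` as the first line of the relay's entry point turns a confusing connection error later on into a one-line message at boot.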

The most annoying bugs were always the environment ones. Service A works fine. Service B works fine. But they can’t talk to each other because one is using the public URL and the other is using the internal URL and they’re subtly different. Or one service has a trailing slash in an environment variable and the other doesn’t, and the URL construction breaks.
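The trailing-slash problem in particular is cheap to rule out by normalizing at the point where URLs are built. A small sketch (the function name and the example hostname are mine):

```python
def join_url(base: str, path: str) -> str:
    """Join base and path with exactly one slash between them."""
    return base.rstrip("/") + "/" + path.lstrip("/")


# Both spellings of the base URL produce the same result.
with_slash = join_url("http://vector-db.internal:6333/", "/collections")
without_slash = join_url("http://vector-db.internal:6333", "collections")
```

I would use a helper like this rather than `urllib.parse.urljoin` for this job, because `urljoin` follows relative-reference rules and drops the last path segment of a base that has no trailing slash.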

Cost Management

Running four services 24/7 adds up. The naive approach, giving each service generous resources, would cost $100–200/month. For a project still in development, that’s a lot.

The optimization strategy was:

Right-size everything. The relay is mostly idle between events. It doesn’t need much CPU or memory. The API handles bursty traffic but isn’t under constant load. The vector database needs memory proportional to the active dataset size. The frontend is static files.

Use free tiers where possible. The vector database has a free cloud tier that’s sufficient for moderate-scale usage. The frontend can be hosted on a static site platform for free. That brings you from four paid services to two.

Combine where it makes sense. The relay and the API can technically run as a single process with some careful async management. Less clean architecturally, but it halves the compute cost. I kept them separate in production for reliability but combined them in staging to save money.

Monitor actual usage. I was paying for compute that was 80% idle. Scaling down to the minimum viable resource allocation for each service cut costs by about 40% with no performance impact.

The Things That Break at 3am

Deployments are easy. Keeping things running is hard. Here are the things that actually broke in production:

WebSocket disconnections. The data source occasionally drops the connection without warning. The reconnection logic works, but there’s a gap, usually a few seconds to a minute, during which data is missed. You only notice because the momentum graphs show a dip and then a catch-up spike.
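For illustration, here is one common shape for that kind of reconnection loop: exponential backoff plus a record of how long each outage lasted, so gaps show up in logs instead of only in the graphs. This is a generic sketch, not the relay's actual code; `connect` and `consume` are hypothetical callables standing in for the real WebSocket client.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, List


async def run_with_reconnect(
    connect: Callable[[], Awaitable[Any]],
    consume: Callable[[Any], Awaitable[None]],
    gaps: List[float],
    max_failures: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> None:
    """Reconnect with exponential backoff; append each outage length to gaps."""
    delay = base_delay
    failures = 0          # consecutive failures since the last good connect
    dropped_at = None     # monotonic time when the current outage began
    while failures < max_failures:
        try:
            conn = await connect()
            if dropped_at is not None:
                gaps.append(time.monotonic() - dropped_at)
                dropped_at = None
            delay, failures = base_delay, 0
            await consume(conn)   # returns only on a clean shutdown
            return
        except ConnectionError:
            failures += 1
            if dropped_at is None:
                dropped_at = time.monotonic()
            await sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ConnectionError(f"gave up after {max_failures} consecutive failures")
```

The recorded gaps do not recover the missed data, but they tell you exactly which time windows to distrust or backfill.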

Memory leaks. One of the background jobs was accumulating state in a dictionary that never got cleaned up. Worked fine for days, then the process would OOM and restart. The fix was a periodic cleanup sweep, but finding the leak took longer than fixing it.
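A periodic sweep can be as simple as storing a timestamp next to each entry and dropping whatever is older than a time-to-live. The class below is a sketch of that idea; the name `ExpiringState` and the TTL approach are my assumptions about what "cleanup sweep" looks like, not the original fix.

```python
import time
from typing import Dict, Hashable, Optional, Tuple


class ExpiringState:
    """Dictionary-like state whose entries are swept after ttl_seconds."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._items: Dict[Hashable, Tuple[float, object]] = {}

    def put(self, key: Hashable, value: object, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._items[key] = (now, value)

    def sweep(self, now: Optional[float] = None) -> int:
        """Delete entries older than the TTL and return how many were removed."""
        now = time.monotonic() if now is None else now
        stale = [k for k, (ts, _) in self._items.items() if now - ts > self.ttl]
        for key in stale:
            del self._items[key]
        return len(stale)

    def __len__(self) -> int:
        return len(self._items)
```

Calling `sweep()` from the same scheduler that runs the background job keeps the dictionary bounded, and logging its return value makes a regression visible long before the process runs out of memory.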

Database connection exhaustion. The analytics database has a connection limit. Under heavy agent processing (multiple agents all querying at once), you can hit it. Connection pooling and query timeouts solved this, but not before a few incidents where the API became unresponsive because all connections were stuck.
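The two ideas, a cap on concurrent queries and a timeout on each one, can be sketched without any particular database driver. In the example below `query` is any zero-argument coroutine function; a real connection pool would hand out connections rather than just counting slots, so treat this as the shape of the fix rather than the fix itself.

```python
import asyncio
from typing import Any, Awaitable, Callable


class BoundedQueryRunner:
    """Run at most max_connections queries at once, each with a timeout."""

    def __init__(self, max_connections: int, timeout_seconds: float) -> None:
        # Create this inside the running event loop (for example at app startup).
        self._slots = asyncio.Semaphore(max_connections)
        self._timeout = timeout_seconds

    async def run(self, query: Callable[[], Awaitable[Any]]) -> Any:
        async with self._slots:  # wait for a free slot
            # Raises asyncio.TimeoutError if the query runs too long,
            # which releases the slot instead of holding it forever.
            return await asyncio.wait_for(query(), timeout=self._timeout)
```

Note that the timeout here covers only the query, not the wait for a slot. Under sustained overload you would probably want a limit on that wait as well.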

Clock drift. Two services disagreeing about what time it is by a few seconds. This caused the mover job (which uses timestamps to decide what data is “old enough” to archive) to occasionally skip batches or process the same batch twice. The fix was using database timestamps instead of local clocks for all time-sensitive operations.
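To show what "database timestamps" means in practice, here is a sketch using SQLite from the standard library. The real analytics database, table names, and schema are not described in this article, so everything named below is hypothetical. The point is that the cutoff is computed once by the database and reused for both statements.

```python
import sqlite3

# Hypothetical schema: created_at is filled in by the database, not the app.
SCHEMA = """
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    payload TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE archive (id INTEGER, payload TEXT, created_at TEXT);
"""


def archive_old_rows(conn: sqlite3.Connection, min_age_seconds: int) -> int:
    """Move rows older than min_age_seconds, judged only by the database clock."""
    with conn:  # commit on success, roll back on error
        # Ask the database for the cutoff once so both statements agree on it.
        cutoff = conn.execute(
            "SELECT datetime('now', ?)", (f"-{min_age_seconds} seconds",)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO archive SELECT id, payload, created_at "
            "FROM events WHERE created_at <= ?",
            (cutoff,),
        )
        deleted = conn.execute(
            "DELETE FROM events WHERE created_at <= ?", (cutoff,)
        )
        return deleted.rowcount
```

Because no service's local clock appears anywhere, two services that disagree about the time still agree about which rows are old enough.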

Deployment order. If the API deploys before the database finishes its migration, the API crashes on startup because the schema doesn’t match. I added health checks that wait for the services it depends on to be ready before the application starts accepting traffic.
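The waiting part can be a small polling helper that runs before the server starts. In this sketch `check` is a hypothetical callable that returns True once a dependency is ready, for example by confirming that the expected schema version is present.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_seconds: float = 60.0,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # errors such as "connection refused" mean "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

If it returns False, exiting with a clear message is usually better than starting anyway, since the platform's restart policy will then retry the whole startup.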

What I’d Tell Someone Starting Out

Don’t try to build a monolith and split it later. Design for separate services from the start, even if you deploy them as one thing initially. The separation of concerns pays off immediately in clarity and pays off again when you need to scale or debug individual components.

Invest in your environment configuration management early. One source of truth for all variables, clear documentation of which service needs what, and validation on startup that fails fast with clear error messages.

Build graceful degradation into everything. The system should always prefer “running with reduced functionality” over “crashed.” Users can tolerate a missing feature. They can’t tolerate a blank screen.

And monitor your costs from day one. Cloud services are designed to make spending money easy and tracking spending hard. Set up cost alerts and review usage weekly. The difference between a $300/month deployment and a $3000/month deployment is usually just configuration, not capability.

This article is part of a 10-part series documenting the journey of building a real-time intelligence platform from scratch (https://naiko.io). Start from the beginning with “I Built a Real-Time Intelligence Platform and the Hardest Part Was the Plumbing.”

Deploying a Multi-Service AI Platform on a Budget was originally published in Coinmonks on Medium, where people are continuing the conversation by highlighting and responding to this story.
