Ask HN: Would you use a serverless, pay-per-second model for AI inference?

neuron-enix · 7/18/2025, 7:30:02 PM
Hey HN,

As a developer, I’ve been using the big AI APIs quite a bit for my projects. While the technology is incredible, I keep hitting a financial wall. I've had single, complex prompts with large contexts cost me $1-2 each (Claude Opus 4), which is a tough pill to swallow when you're a solo dev paying out of pocket.
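For a rough sense of where that lands, here's the back-of-envelope math under per-token pricing. The rates below are assumptions based on published Opus-tier list prices and may not be current:

    # Back-of-envelope: why one large-context prompt lands in the $1-2 range.
    # Rates are assumptions from published list prices; verify current ones.
    INPUT_RATE = 15 / 1_000_000   # dollars per input token
    OUTPUT_RATE = 75 / 1_000_000  # dollars per output token

    input_tokens = 100_000  # a big pasted-in context
    output_tokens = 2_000   # a long-ish response

    cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
    print(f"${cost:.2f}")  # -> $1.65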

It feels like the current pricing models can be a real barrier to experimentation and building more ambitious, personal projects that my day job won't cover.

This has led me to explore a "what if" scenario for a more developer-friendly pricing model, and I'd love to get your thoughts on its feasibility.

The Idea: Serverless AI Inference

My proposal is a platform that prices AI inference based on the actual compute time used.

Think of it exactly like AWS Lambda, but for LLM prompts. The model is simple:

You don't host or manage anything. You just send your prompt to an API endpoint.

We handle routing it to an available GPU from a large, shared hardware pool.

You get charged only for the time your task actually occupies a GPU, billed by the second.

This approach provides a direct, transparent link between the resources you use and what you pay. For tasks that are compute-heavy but don't necessarily have a massive token count, this could be a much more predictable and affordable way to build.
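To make that concrete, here's a minimal sketch of what the developer-facing side could look like. The endpoint, model name, response fields, and the per-second rate are all hypothetical; nothing here exists yet:

    import requests

    # Hypothetical endpoint and pricing; purely illustrative.
    ENDPOINT = "https://api.example.com/v1/infer"
    RATE_PER_GPU_SECOND = 0.0011  # assumption: ~$4/hr of raw GPU cost / 3600s

    resp = requests.post(ENDPOINT, json={
        "model": "llama-3-70b",
        "prompt": "Summarize the attached design doc...",
    })
    result = resp.json()

    # The platform meters actual GPU execution time and bills per second,
    # independent of how many tokens went in or came out.
    gpu_seconds = result["usage"]["gpu_seconds"]
    print(f"GPU time: {gpu_seconds:.2f}s -> ${gpu_seconds * RATE_PER_GPU_SECOND:.4f}")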

My Questions for the Community:

Do you also find the cost of using AI APIs to be a barrier?

As a developer, would you prefer paying for compute time in a serverless model like this? What potential downsides do you see?

I'm a backend engineer, and while building this is a challenge I'm willing to take on, is this economically feasible? Am I underestimating the costs of managing a shared GPU pool and competing with existing players? Or could this be a sustainable business that genuinely solves a problem?
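For what it's worth, here's the rough unit-economics question I'd have to answer; every number below is an assumption, not a quote:

    # Rough unit economics for a shared GPU pool; all figures are assumptions.
    gpu_cost_per_hour = 4.00       # e.g., on-demand cloud price for a mid-tier card
    price_per_gpu_second = 0.0020  # what the platform would charge

    cost_per_second = gpu_cost_per_hour / 3600  # ~$0.0011

    # GPUs cost money even while idle, so the real question is utilization:
    # what fraction of each hour must be billed just to break even?
    break_even_utilization = cost_per_second / price_per_gpu_second
    print(f"Break-even utilization: {break_even_utilization:.0%}")  # ~56%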

I haven't started building anything yet, but the high costs I'm facing are pushing me to seriously consider it. I wanted to tap into the collective wisdom of HN to see if this is a problem worth solving or if I'm just shouting into the void.

Comments (1)

lbhdc · 2h ago
I think this is definitely the future for a lot of developers. Though, like other serverless platforms, once you reach scale this will likely be more expensive than always-on variants.

GCP recently announced Cloud Run can do, more or less, what you are proposing. It scales to zero, and you only pay for the request duration.

https://cloud.google.com/run/docs/configuring/services/gpu-b...