Productionising GenAI inside a multi-tenant SaaS — lessons from building HUB

HUB is the e-commerce automation SaaS I built and ship single-handedly, end-to-end. It powers a real retail business and embeds multiple Generative AI use cases — LLM-driven customer engagement, diffusion-model synthetic product photography, AI-assisted CRM, SEO and content automation. This isn’t a tutorial. It’s a working note on what production-grade GenAI actually requires that the demos don’t show.

Table of contents

“Production GenAI” is mostly not the GenAI
Treat the prompt as a versioned artefact
Validate the output, always
Diffusion models — what the demos don’t show
Cost observability per tenant is non-negotiable
The “AI inside enterprise platforms” lens
What I’d build differently if starting again
The honest summary

“Production GenAI” is mostly not the GenAI

The model call is maybe 5% of the work. The other 95% is everything around it: prompt versioning, evaluation harnesses, output schema validation, fallback behaviour when the model is slow or wrong, cost tracking per tenant, observability you can debug from, and the integration glue with the actual business workflow.

If your team’s mental model is “we’ll add an OpenAI call here,” you’ll ship something. You won’t ship something you can operate.

Treat the prompt as a versioned artefact

Prompts drift. Models change. What worked last week breaks on the new model snapshot. The fix isn’t discipline — humans won’t track it. The fix is treating the prompt like code: in source control, with semver, with an evaluation suite that runs on every change.

For HUB, the prompts live in a prompts/ directory with one file per use case. Each file has a version header. The runtime loads by version, never “latest.” When a prompt changes, the change is reviewed like a code change, runs against an eval set (a handful of golden inputs with expected output shapes), and ships under a new version. The old version stays live for tenants that haven’t migrated.
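
To make the "load by version, never latest" rule concrete, here is a minimal Python sketch. It is not HUB's actual code: the header format (`# version: X.Y.Z`), the `PromptRegistry` class, and the `crm_reply` use case are all assumptions made for illustration, and the prompts are registered from strings rather than read from the prompts/ directory so the example stays self-contained.

```python
import re
from dataclasses import dataclass

# Assumed header format (hypothetical): the first line of each prompt file
# reads "# version: MAJOR.MINOR.PATCH".
_HEADER = re.compile(r"^#\s*version:\s*(\d+\.\d+\.\d+)\s*$")


@dataclass(frozen=True)
class Prompt:
    use_case: str
    version: str
    template: str


class PromptRegistry:
    """Holds every published version of every prompt; lookups must name a version."""

    def __init__(self) -> None:
        self._prompts: dict[tuple[str, str], Prompt] = {}

    def register(self, use_case: str, raw_text: str) -> Prompt:
        header, _, body = raw_text.partition("\n")
        match = _HEADER.match(header)
        if match is None:
            raise ValueError(f"{use_case}: missing or malformed version header")
        key = (use_case, match.group(1))
        if key in self._prompts:
            # Published versions are immutable; a change needs a new version.
            raise ValueError(f"{use_case}@{key[1]} is already registered")
        prompt = Prompt(use_case, key[1], body.strip())
        self._prompts[key] = prompt
        return prompt

    def get(self, use_case: str, version: str) -> Prompt:
        # Deliberately no "latest" shortcut: the caller pins an exact version.
        try:
            return self._prompts[(use_case, version)]
        except KeyError:
            raise LookupError(f"no prompt {use_case}@{version}") from None


registry = PromptRegistry()
registry.register("crm_reply", "# version: 1.0.0\nReply politely to: {message}")
registry.register("crm_reply", "# version: 1.1.0\nReply briefly and politely to: {message}")

# A tenant that has not migrated still resolves the old version.
pinned = registry.get("crm_reply", "1.0.0")
print(pinned.template.format(message="Where is my order?"))
```

Asking for `registry.get("crm_reply", "latest")` raises `LookupError`, which is the point: a missing pin fails loudly instead of silently picking up a newer prompt.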

This sounds heavyweight. It pays for itself the first time you have to debug “why did this customer message get a weird reply on Tuesday?”

Validate the output, always

LLMs are probabilistic. Even with structured-output modes, you will see malformed JSON, missing fields, and off-schema responses. Every model response in HUB passes through a schema validator before downstream code sees it. If validation fails, the call is retried up to twice with the same prompt; if it still fails, the result is logged and a deterministic fallback runs.
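
The validate, retry, fallback loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HUB implementation: the schema (`reply` and `sentiment` fields), the `call_validated` helper, and the canned fake model are all hypothetical, and a real system would likely use a proper schema library rather than hand-rolled checks.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
REQUIRED_FIELDS = {"reply": str, "sentiment": str}


def validate(raw: str) -> dict:
    """Parse JSON and check required fields; raise ValueError if off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    return data


def call_validated(call_model, prompt, fallback, max_retries=2, log=print):
    """One initial attempt plus up to max_retries retries with the same prompt.

    If every attempt fails validation, the deterministic fallback runs instead.
    """
    for attempt in range(1 + max_retries):
        try:
            return validate(call_model(prompt))
        except ValueError as exc:
            log(f"attempt {attempt + 1} failed validation: {exc}")
    return fallback(prompt)


# Fake model (assumption): fails twice, then returns a valid response.
canned = iter([
    "not json at all",
    '{"reply": "Hi"}',
    '{"reply": "Hi", "sentiment": "neutral"}',
])
result = call_validated(
    call_model=lambda prompt: next(canned),
    prompt="Greet the customer",
    fallback=lambda prompt: {"reply": "Thanks, we will get back to you.", "sentiment": "unknown"},
)
print(result)
```

The fallback here returns a fixed, schema-conforming reply, so downstream code never has to handle a "validation failed" shape separately.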

The fallback matters. If your “AI-assisted CRM” silently breaks 0.5% of the time, you have a trust problem you won’t see in metrics until customers do.

Diffusion models — what the demos don’t show

Synthetic product photography via diffusion models is the use case where the gap between demo and production is widest. The demos show one good output. Production needs:

Brand consistency — the same product, shot the same way, every time. That’s a fine-tune or LoRA, not raw prompting.
Reference adherence — generated images that look like the actual product, not a creative reinterpretation. ControlNet, IP-Adapter, or equivalents are non-optional.
Reject-and-retry — outputs that miss the brief get rejected by a smaller model (or rules engine) and regenerated. Otherwise the operator becomes the rejection step, which doesn’t scale.
Cost management — diffusion is expensive per call and slow. Batch where you can, async everywhere, and never trigger generation in a user-facing critical path.

HUB’s photography pipeline runs as a background job, regenerates failed outputs up to N times, and only surfaces the result to the operator when a confidence threshold is met. The visible UX is “I clicked, I have a photo.” The plumbing underneath is queue + workers + scoring + fallback.
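
The reject-and-retry core of such a pipeline might look like the Python sketch below. Everything named here is an assumption for illustration: `generate_until_accepted`, the `Candidate` record, and the fake generator and scorer stand in for a real diffusion call and a real scoring model, and the queue and worker machinery around the loop is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    image_id: str
    score: float
    attempts: int


def generate_until_accepted(
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    brief: str,
    max_attempts: int,
    threshold: float,
) -> Optional[Candidate]:
    """Regenerate until the scorer accepts an image or attempts run out.

    Returns None when nothing passed, so the caller can run its fallback
    instead of showing the operator a weak image.
    """
    for attempt in range(1, max_attempts + 1):
        image_id = generate(brief, attempt)   # attempt number doubles as the seed
        quality = score(image_id, brief)      # smaller model or rules engine
        if quality >= threshold:
            return Candidate(image_id, quality, attempt)
    return None


# Stand-ins for the real generator and scorer (both hypothetical).
FAKE_SCORES = {"img-1": 0.35, "img-2": 0.62, "img-3": 0.91}


def fake_generate(brief: str, seed: int) -> str:
    return f"img-{seed}"


def fake_score(image_id: str, brief: str) -> float:
    return FAKE_SCORES[image_id]


accepted = generate_until_accepted(
    fake_generate, fake_score, "white mug, studio light",
    max_attempts=3, threshold=0.8,
)
print(accepted)
```

In a background worker, a `None` result is what routes the job to the fallback path; the operator only ever sees a `Candidate` that cleared the threshold.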

Cost observability per tenant is non-negotiable

If you’re shipping a multi-tenant SaaS with GenAI inside it, every model call needs to be attributed to a tenant, a feature, and a request. Otherwise your bill arrives, you can’t decompose it, and you can’t price the product rationally.

In HUB this is a single middleware layer: every model call is wrapped, the call records {tenant_id, feature, model, input_tokens, output_tokens, latency, cost} to a usage table. A daily job rolls that up into per-tenant cost reports. The pricing model literally falls out of that table.
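
A stripped-down version of that wrapper and rollup, in Python, could look like this. It is a sketch under assumptions: the record fields follow the list above, but the price table, the model name `example-model`, the in-memory `usage_table`, and the `metered_call` signature are hypothetical, and the rollup ignores dates where a daily job would bucket by day.

```python
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    tenant_id: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float


# Hypothetical prices in USD per 1,000 tokens: (input, output).
PRICE_PER_1K = {"example-model": (0.001, 0.002)}

# Stand-in for the usage table; a real system would write to a database.
usage_table: list[UsageRecord] = []


def metered_call(call_model, *, tenant_id: str, feature: str, model: str, prompt: str) -> str:
    """Wrap a model call and record who used what, how long it took, and the cost."""
    start = time.perf_counter()
    text, input_tokens, output_tokens = call_model(model, prompt)
    latency_s = time.perf_counter() - start
    in_price, out_price = PRICE_PER_1K[model]
    cost_usd = input_tokens / 1000 * in_price + output_tokens / 1000 * out_price
    usage_table.append(
        UsageRecord(tenant_id, feature, model, input_tokens, output_tokens, latency_s, cost_usd)
    )
    return text


def rollup_by_tenant(records) -> dict[str, float]:
    """Sum cost per tenant; the daily report is this grouped by date as well."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.tenant_id] += record.cost_usd
    return dict(totals)


# Fake model (assumption): returns text plus token counts, as many APIs report them.
def fake_model(model: str, prompt: str):
    return "ok", 1000, 500


metered_call(fake_model, tenant_id="tenant-a", feature="crm", model="example-model", prompt="hi")
metered_call(fake_model, tenant_id="tenant-a", feature="seo", model="example-model", prompt="hi")
metered_call(fake_model, tenant_id="tenant-b", feature="crm", model="example-model", prompt="hi")
print(rollup_by_tenant(usage_table))
```

Making `tenant_id` and `feature` required keyword arguments is the design choice that matters: a call that cannot be attributed cannot be made.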

If you skip this and try to retrofit it later, you’ll discover you can’t reliably attribute historical calls — and you’ll be flying blind on margin.

The “AI inside enterprise platforms” lens

A lot of what I’ve learned shipping HUB transfers directly to enterprise banking. The same disciplines that keep a small SaaS honest — prompt versioning, output validation, per-tenant cost attribution, deterministic fallbacks — are exactly the disciplines a bank needs before it can ship AI into a customer-facing workflow.

The difference at the bank is the audit story. In a SaaS you log; in a bank you log, retain, and produce on request to a regulator. The mechanism is the same; the rigour is higher.

For anyone evaluating “should we put AI in this enterprise workflow yet,” the honest answer is: only when you’ve answered three questions. Can you explain a wrong answer after the fact? Can you contain the blast radius when it’s wrong at scale? Can you attribute cost cleanly? If any of those are “not yet,” ship the deterministic version first and add AI as an upgrade once those answers exist.

What I’d build differently if starting again

Start with evaluations, not features. The eval harness — the golden set, the scoring, the regression check — is the thing that lets you ship changes without fear. Build it before the second feature, not after the fifth.
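
For a sense of how small a first eval harness can be, here is a Python sketch of a golden set, a shape check, and a regression check. The names (`GoldenCase`, `run_evals`, `regressions`) and the two fake prompt versions are assumptions for illustration; scoring here only checks that required keys are present, which matches "expected output shapes" but is far simpler than scoring output quality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    name: str
    input_text: str
    required_keys: frozenset


def run_evals(candidate: Callable[[str], dict], cases) -> dict[str, bool]:
    """Run every golden input through the candidate and check the output shape."""
    results = {}
    for case in cases:
        try:
            output = candidate(case.input_text)
            results[case.name] = isinstance(output, dict) and case.required_keys.issubset(output.keys())
        except Exception:
            # A crash on a golden input counts as a failed case, not a failed run.
            results[case.name] = False
    return results


def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that passed on the baseline but fail on the candidate."""
    return sorted(name for name, ok in baseline.items() if ok and not current.get(name, False))


GOLDEN_SET = [
    GoldenCase("order_status", "Where is my order?", frozenset({"reply", "intent"})),
    GoldenCase("refund_request", "I want a refund", frozenset({"reply", "intent"})),
]


# Fake pipelines standing in for two prompt versions (hypothetical behaviour).
def prompt_v1(text: str) -> dict:
    return {"reply": "Let me check.", "intent": "support"}


def prompt_v2(text: str) -> dict:
    if "refund" in text:
        return {"reply": "Sure."}  # drops the "intent" key: a shape regression
    return {"reply": "Let me check.", "intent": "support"}


baseline = run_evals(prompt_v1, GOLDEN_SET)
current = run_evals(prompt_v2, GOLDEN_SET)
print(regressions(baseline, current))
```

Wired into CI, a non-empty regression list blocks the prompt change, which is what makes later changes safe to ship.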

Structured outputs by default. Free-text outputs feel powerful and become unmaintainable. Schemas force the model into shapes your code can actually consume.

One model gateway, not direct calls scattered through the codebase. Every model call goes through a single wrapper that handles retries, validation, cost logging, and fallbacks. Refactoring this in after the fact is painful.

The honest summary

GenAI in production isn’t magic, and it isn’t a demo. It’s a normal engineering discipline with one extra failure mode: the model lies sometimes. Build for that failure mode from day one and the rest of the work feels familiar.

If you’re working on bringing AI capability into enterprise platforms, particularly in banking, I’d be happy to compare notes. LinkedIn is the easiest way to reach me.