Architecture Blueprint
Secure API-Key Proxy & Usage Gating
A copy-pasteable, provider-agnostic, seven-layer blueprint for hiding third-party API keys (LLM + Speech-to-Text) from your customers and metering / rate-limiting their usage server-side.
This document is written as a portable specification. It describes the pattern, not one framework. Section 12 maps every abstract component to the concrete reference implementation in this repo (Firebase Functions + Flutter), and to common alternatives (AWS, Cloudflare, Supabase, raw Node, etc.) so another engineer — or another coding agent — can drop it into a different stack.
One-sentence summary: The client never sees a provider key. It calls your backend with a short-lived identity token + an app-integrity token; your backend checks who they are, checks the request came from your real app, checks they still have quota, then forwards the call to the LLM/STT provider using a secret only the server holds, meters the real cost atomically, and returns the result.
1. Prerequisites
Before implementing this blueprint, you need the following building blocks. Each has many vendor choices; pick one per row.
| # | Capability | Why it’s required | Common choices |
|---|---|---|---|
| 1 | A backend you control (serverless function, container, or edge worker) | The only place the provider key is ever decrypted. The whole pattern collapses if the client can call the provider directly. | Firebase Functions, Cloud Run, AWS Lambda, Cloudflare Workers, Supabase Edge Functions, a plain Node/Express server |
| 2 | A secrets manager | Stores provider keys encrypted at rest, injected into the backend at runtime, never in source control or the client bundle. | Firebase/GCP Secret Manager, AWS Secrets Manager, Doppler, Vault, platform env-vars |
| 3 | Per-user authentication | Quotas are per user. You must be able to prove “this request belongs to user X” without trusting a client-supplied user-id. | Firebase Auth, Auth0, Clerk, Supabase Auth, your own JWT issuer |
| 4 | App-integrity / attestation | Auth proves who; attestation proves the request came from your genuine, unmodified app — not a script replaying a stolen token. Without it, a leaked auth token = free unlimited use of your paid API. | Firebase App Check (Play Integrity / DeviceCheck / reCAPTCHA), AppAttest, hCaptcha, a signed-nonce scheme |
| 5 | A datastore with atomic increments + transactions | Usage counters must increment correctly under concurrent requests. Read-modify-write without atomicity leaks quota. | Firestore (FieldValue.increment), DynamoDB (atomic counters), Redis (INCR), Postgres (UPDATE ... SET n = n + x) |
| 6 | Provider accounts | The actual paid services you are proxying. | LLM: OpenAI, OpenRouter, Anthropic. STT: Groq Whisper, OpenAI Whisper, Deepgram, Google STT |
| 7 | (Optional) A billing / entitlement provider | To distinguish free vs paid users and set their quota. Optional if everyone has the same limit. | RevenueCat, Stripe, Play Billing / App Store directly |
Mindset prerequisite: treat every byte from the client as hostile. The client decides nothing that costs money — not the model, not the limit, not the plan, not the user-id. The server re-derives all of it.
2. The threat model (what we are defending against)
Name the attacker so the design is justified, not cargo-culted.
| Threat | What happens if unaddressed | Defended by |
|---|---|---|
| Key extraction from the client | Anyone decompiles your app / opens DevTools, lifts the API key, and runs your provider bill to $50k overnight. Keys in mobile binaries and JS bundles are public. | §3 Layer 0 server-side key, Layer 5 proxy |
| Token replay / scripted abuse | An attacker grabs one valid auth token and scripts millions of calls. | §3 Layer 2 attestation |
| Quota evasion | Client claims plan = premium or tokensUsed = 0. Free tier becomes infinite. | §3 Layer 3 server-derived entitlement, Layer 6 server-side metering |
| Race-condition over-spend | 100 concurrent requests each read “0 used”, all pass the check, user spends 100× their limit. | §3 Layer 6 atomic counters |
| Model / parameter abuse | Client requests an expensive model or max_tokens: 1_000_000. | §3 Layer 5 server-side allowlist + clamps |
| Billing spoofing | Client fakes a “purchase succeeded” message to unlock premium for free. | §8 signed webhook, server-verified entitlement |
| Cost-of-goods blindness | You can’t see who spends what; one user quietly costs more than they pay. | §5.2 per-model usage ledger |
3. Layered architecture
Seven layers, each a single responsibility. Implement them as ordered middleware so the cheapest rejections happen first (reject before you spend).
                             ┌─────────────────────────────────────────────┐
  CLIENT (untrusted)         │           YOUR BACKEND (trusted)            │   PROVIDERS (paid)
┌────────────────────┐       │                                             │   ┌──────────────────┐
│ App / Web client   │       │ L1 Identity     verify auth token     ──401 │   │ LLM provider     │
│                    │       │ L2 Integrity    verify app-check      ──401 │   │ (OpenAI / etc.)  │
│ • holds NO key     │──────▶│ L3 Entitlement  resolve plan (cache)        │   ├──────────────────┤
│ • sends:           │ HTTPS │ L4 Pre-gate     quota check           ──429 │   │ STT provider     │
│     auth token     │       │ L5 Proxy        inject SECRET KEY     ──────┼──▶│ (Whisper / etc.) │
│     app-check tok  │       │ L6 Meter        atomic usage write          │   └──────────────────┘
│     token estimate │◀──────│ L7 Respond      result + quota state        │
└────────────────────┘ JSON  │                                             │
                             │   ▲ secret key injected from Secrets Mgr    │
                             └───┼─────────────────────────────────────────┘
                                 │
                         ┌───────┴────────┐        ┌──────────────────────┐
                         │ Secrets Manager│        │ Billing provider     │
                         │ (server only)  │        │ webhook ──▶ L0 sync  │
                         └────────────────┘        └──────────────────────┘
Layer 0 — Secret storage
Provider keys live in a secrets manager, bound to the function at deploy time. They are read at runtime via SECRET.value() / process.env, trimmed/sanitised, and a client is lazily constructed once per warm instance. Keys are never logged, never returned in a response, never committed.
Layer 1 — Identity (who)
Extract the bearer token, verify its signature against the auth provider, derive a trusted uid from the verified claims. Reject 401 on missing/invalid. Never read the user-id from the request body.
Layer 2 — Integrity (genuine app)
Read the attestation header, verify it with the attestation service. Reject 401 if missing/invalid. This is what stops a stolen token from being weaponised at scale. Keep a tiny, explicit dev-bypass list for your own test accounts — and nothing else.
Layer 3 — Entitlement (what they’re allowed)
Resolve the user’s plan (free / premium / …) → a concrete set of limits. Read it from the server’s own store, never from the client. Cache it in-process with a short TTL (e.g. 10 s, 5k entries LRU) so you don’t hit the DB on every call. Allow an explicit cache-bust header for the “just purchased, refresh now” moment.
Layer 4 — Pre-gate (cheap rejection before spend)
Before calling the provider, check the current usage against the limit using a client-supplied estimate (X-Token-Estimate, or audio-length for STT). If used + estimate > limit for either the daily or monthly window, reject 429 quota_exceeded with {used, limit} so the client can show an upgrade prompt. This is an optimistic gate — it’s cheap and stops most over-spend before a single paid token is bought.
Layer 5 — Proxy (the actual forward)
Validate & clamp the request: enforce a model allowlist, clamp max_tokens, clamp temperature, cap upload size. Inject the secret key into the outbound call. Forward to the provider. Treat provider errors as 502. The client’s request is a suggestion; the server decides the final payload.
Layer 6 — Meter (truth after the fact)
Read the real cost from the provider response (usage.total_tokens; for STT, the charged audio minutes), and atomically increment the per-user daily and monthly counters inside a transaction. Record per-model breakdown for cost attribution. The pre-gate is the estimate; this is the settlement.
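To see why the settlement has to be an atomic increment rather than a read-modify-write, here is a small simulation. It is a sketch, not the repo's code: CounterStore is a hypothetical in-memory stand-in for the usage ledger, and "every request reads before any of them writes" models the worst-case overlap of concurrent requests.

```javascript
// A toy in-memory counter standing in for one usage document.
class CounterStore {
  constructor() { this.value = 0; }
  read() { return this.value; }
  write(v) { this.value = v; }
  increment(delta) { this.value += delta; } // applied by the store in one step
}

// Read-modify-write: all requests take their snapshot first, then all write.
// Each write overwrites the previous one, so updates are lost.
function settleNaively(store, costs) {
  const snapshots = costs.map(() => store.read());
  costs.forEach((cost, i) => store.write(snapshots[i] + cost));
}

// Atomic increment: each request sends only its delta, so nothing is lost.
function settleAtomically(store, costs) {
  costs.forEach((cost) => store.increment(cost));
}

const costs = Array(100).fill(50); // 100 overlapping requests, 50 tokens each
const naive = new CounterStore();
const atomic = new CounterStore();
settleNaively(naive, costs);
settleAtomically(atomic, costs);
console.log(naive.value, atomic.value); // prints "50 5000"
```

In the naive run the ledger records 50 tokens for 5,000 tokens of real spend, which is exactly the quota leak the atomic counter prevents.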
Layer 7 — Respond
Return the provider result plus a small quota envelope ({used, limit, plan}) so the client UI can render remaining usage without a second round-trip.
4. Request lifecycle (sequence)
client             backend (ordered middleware)                     provider          datastore
  │                                                                     │                 │
  │ POST /chat                                                          │                 │
  │ Authorization: Bearer <authToken>                                   │                 │
  │ X-App-Check: <attestationToken>                                     │                 │
  │ X-Token-Estimate: 1200                                              │                 │
  ├──────────────▶ L1 verify auth ─────────────────── invalid ▶ 401     │                 │
  │                L2 verify attestation ──────────── invalid ▶ 401     │                 │
  │                L3 resolve plan (cache→DB) ───────────────────────────────────────────▶ read plans/{uid}
  │                L4 read usage, estimate gate ─── over ─────▶ 429     │◀──────────────── read usage/{uid}/...
  │                L5 validate+clamp, inject SECRET ───────────────────▶│                 │
  │                                                       provider call │                 │
  │                L6 read real usage, atomic incr ◀────────────────────│────────────────▶ tx incr daily+monthly
  │ 200 { ...result, quota:{used,limit}, plan } ◀── L7                  │                 │
  │◀──────────────                                                      │                 │
Key ordering rule: identity → integrity → entitlement → gate → spend → meter. Each step is strictly cheaper and more certain than the paid call, so the request is killed as early as possible.
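The ordering rule can be sketched as a tiny pipeline runner. This is an illustration under stated assumptions, not the repo's Express wiring: runPipeline, makeSteps, and the request flags (validAuth, validAttestation, used, estimate) are hypothetical names chosen to make the ordering observable.

```javascript
// Run steps in order. A step returns a rejection object to stop the chain,
// or undefined to let the request continue.
function runPipeline(steps, req) {
  for (const step of steps) {
    const rejection = step(req);
    if (rejection) return rejection;
  }
  return { status: 200 };
}

// Hypothetical steps that record when they run.
function makeSteps(log) {
  return [
    (req) => { log.push("identity"); if (!req.validAuth) return { status: 401 }; },
    (req) => { log.push("integrity"); if (!req.validAttestation) return { status: 401 }; },
    (req) => { log.push("entitlement"); req.limit = 1000; },
    (req) => { log.push("gate"); if (req.used + req.estimate > req.limit) return { status: 429 }; },
    (req) => { log.push("spend"); },
    (req) => { log.push("meter"); },
  ];
}

const log = [];
const outcome = runPipeline(makeSteps(log), {
  validAuth: true, validAttestation: true, used: 900, estimate: 200,
});
console.log(outcome.status, log.join(" > "));
// prints "429 identity > integrity > entitlement > gate"
```

The over-quota request is rejected at the gate, so the "spend" and "meter" steps never run and no paid call is made.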
5. Data model
Two collections. Names are illustrative; any DB works.
5.1 Entitlement store — plans/{uid}
Written by the billing webhook (§8), read by Layer 3.
{
  "plan_type": "premium",              // 'free' | 'premium'
  "subscription_status": "active",     // active | cancelled | expired | billing_issue
  "expires_date": "2026-07-01T00:00:00Z",
  "limits": {                          // server-authoritative limits
    "tokens_per_day": 500000,
    "tokens_per_month": 3000000,
    "stt_minutes_per_day": 20,
    "stt_minutes_per_month": 200
  },
  "billing_user_id": "rc_or_stripe_id",
  "last_updated": "<server timestamp>"
}
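Layer 3 should not trust this document blindly either. The following is a minimal sketch of a normalising step, assuming hypothetical FREE_LIMITS and HARD_CAPS values (the numbers are illustrative), an output shape that uses the tokensLimit / tokensMonthlyLimit names from the pseudo-code in §6.4, and the assumption that a cancelled subscription stays entitled until its expires_date.

```javascript
// Illustrative numbers only: choose your own limits and caps.
const FREE_LIMITS = { tokens_per_day: 20000, tokens_per_month: 200000 };
const HARD_CAPS = { tokens_per_day: 500000, tokens_per_month: 3000000 };
// Assumption: "cancelled" still counts as entitled until expires_date passes.
const ENTITLED_STATUSES = new Set(["active", "cancelled"]);

// Accept a stored limit only if it is a finite, non-negative number, then cap it.
function clampLimit(value, cap, fallback) {
  return Number.isFinite(value) && value >= 0 ? Math.min(value, cap) : fallback;
}

// Turn a raw plans/{uid} document (possibly null or malformed) into trusted limits.
// Anything doubtful resolves to the free tier.
function normalizePlan(doc, now = new Date()) {
  const free = {
    type: "free",
    tokensLimit: FREE_LIMITS.tokens_per_day,
    tokensMonthlyLimit: FREE_LIMITS.tokens_per_month,
  };
  if (!doc || doc.plan_type !== "premium") return free;
  if (!ENTITLED_STATUSES.has(doc.subscription_status)) return free;
  const expires = Date.parse(doc.expires_date);
  if (!Number.isFinite(expires) || expires <= now.getTime()) return free;
  const limits = doc.limits ?? {};
  return {
    type: "premium",
    tokensLimit: clampLimit(limits.tokens_per_day, HARD_CAPS.tokens_per_day, free.tokensLimit),
    tokensMonthlyLimit: clampLimit(limits.tokens_per_month, HARD_CAPS.tokens_per_month, free.tokensMonthlyLimit),
  };
}
```

Every early return yields the free plan, so a missing field, an expired date, or a non-numeric limit can only ever reduce what the user is allowed to spend.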
5.2 Usage ledger — usage/{uid}/daily/{YYYY-MM-DD} and usage/{uid}/monthly/{YYYY-MM}
Written by Layer 6, read by Layer 4. Date-keyed documents give you free, automatic window resets — a new day = a fresh empty doc, no cron job needed.
// usage/{uid}/daily/2026-06-11
{
  "total_tokens": 12840,
  "tokens": { "openai/gpt-5.4-mini": 12840 },   // per-model attribution
  "stt_minutes": 4.2,
  "stt": { "whisper-large-v3-turbo": 4.2 },
  "last_updated": "<server timestamp>"
}
Why two windows: the daily limit smooths spikes and caps the blast radius of a single bad day; the monthly limit caps the bill. A request must pass both.
Why date-keyed docs: resets are implicit. There is no “reset usage” job to forget — todayKey() simply points at a new document at midnight UTC.
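The todayKey() and monthKey() helpers are referenced throughout but not shown. One possible implementation, offered as a sketch: it relies on Date.prototype.toISOString always formatting in UTC, and takes an optional clock argument (an addition for testability, not something the original specifies).

```javascript
// Window keys are derived from UTC so every server instance agrees on "today".
function todayKey(now = new Date()) {
  return now.toISOString().slice(0, 10); // "YYYY-MM-DD"
}

function monthKey(now = new Date()) {
  return now.toISOString().slice(0, 7); // "YYYY-MM"
}

// One second before and exactly at midnight UTC land in different documents.
console.log(todayKey(new Date("2026-06-11T23:59:59Z"))); // prints "2026-06-11"
console.log(todayKey(new Date("2026-06-12T00:00:00Z"))); // prints "2026-06-12"
console.log(monthKey(new Date("2026-06-11T23:59:59Z"))); // prints "2026-06"
```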
6. Portable reference implementation (pseudo-code)
Language-neutral, framework-neutral. Copy and translate. The control flow is the spec.
6.1 Layer 0 — secrets & lazy provider client
const LLM_KEY = secret("LLM_API_KEY"); // bound at deploy, server-only
const STT_KEY = secret("STT_API_KEY");

let llm; // lazy, per warm instance

function getLlm() {
  if (!llm) {
    // Strip surrounding whitespace, wrapping quotes, and stray newlines from the stored value.
    const key = LLM_KEY.value().trim().replace(/^['"]|['"]$/g, "").replace(/\r?\n/g, "");
    llm = new ProviderClient({ apiKey: key });
  }
  return llm;
}
6.2 Layers 1 & 2 — identity + integrity middleware
async function verifyIdentity(req, res, next) {
  const token = bearer(req.headers.authorization);
  if (!token) return res.status(401).json({ error: "missing_auth" });
  try {
    const claims = await auth.verifyToken(token, /* checkRevoked */ true);
    req.uid = claims.uid; // trusted: taken from the verified token, never the request body
  } catch {
    return res.status(401).json({ error: "invalid_auth" });
  }
  next(); // outside the try, so downstream errors are not reported as auth failures
}

async function verifyIntegrity(req, res, next) {
  if (isDevUser(req.uid)) return next(); // explicit, tiny allowlist
  const tok = req.headers["x-app-check"];
  if (!tok) return res.status(401).json({ error: "missing_app_check" });
  try {
    await appCheck.verifyToken(tok);
  } catch {
    return res.status(401).json({ error: "invalid_app_check" });
  }
  next();
}
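The bearer() helper used by verifyIdentity is left undefined in the pseudo-code. One possible shape, as an assumption rather than the repo's implementation: it accepts only a single-token "Bearer <token>" header and returns null for everything else, so the caller's missing_auth branch handles malformed input.

```javascript
// Extract the token from an "Authorization: Bearer <token>" header value.
// Returns null for anything that is not exactly that shape.
function bearer(headerValue) {
  if (typeof headerValue !== "string") return null;
  const match = /^Bearer\s+(\S+)$/i.exec(headerValue.trim());
  return match ? match[1] : null;
}

console.log(bearer("Bearer abc.def.ghi")); // prints "abc.def.ghi"
console.log(bearer("Basic dXNlcjpwYXNz")); // prints "null"
```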
6.3 Layer 3 — entitlement with cache
const planCache = new LRU({ max: 5000, ttl: 10_000 }); // 5k entries, 10 s TTL

async function loadPlan(req, res, next) {
  const reload = req.headers["x-plan-reload"] === "1";
  if (reload) planCache.delete(req.uid); // "just purchased, refresh now"
  let plan = planCache.get(req.uid);
  if (!plan) {
    try {
      plan = normalizePlan(await db.get(`plans/${req.uid}`)); // server-authoritative
      planCache.set(req.uid, plan);
    } catch {
      plan = normalizePlan(null); // DB read failed: serve FREE limits and do not cache the failure
    }
  }
  req.plan = plan;
  next();
}
// normalizePlan clamps every stored value to known-safe caps.
// A missing/garbage plan ALWAYS falls back to the FREE tier, never premium.
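The LRU constructor in the pseudo-code is left abstract; an off-the-shelf cache package would normally fill that role. For readers who want to see the mechanics, here is a dependency-free sketch. TtlLru and its injectable now clock are hypothetical names, and the eviction relies on Map iterating keys in insertion order.

```javascript
// Minimal TTL + LRU cache built on Map's insertion order (oldest key first).
class TtlLru {
  constructor({ max, ttl, now = Date.now }) {
    this.max = max;
    this.ttl = ttl;
    this.now = now; // injectable clock, handy for tests
    this.map = new Map();
  }

  get(key) {
    const hit = this.map.get(key);
    if (!hit) return undefined;
    if (this.now() >= hit.expires) {
      this.map.delete(key); // expired entries are dropped lazily on read
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, hit);
    return hit.value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, { value, expires: this.now() + this.ttl });
    if (this.map.size > this.max) {
      // Evict the least recently used entry (first key in iteration order).
      this.map.delete(this.map.keys().next().value);
    }
  }

  delete(key) {
    return this.map.delete(key);
  }
}
```

With a 10 s TTL, a plan change written by the billing webhook becomes visible on every warm instance within 10 s even if no explicit cache-bust reaches that instance.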
6.4 Layer 4 — pre-gate (estimate)
async function preGate(req, res, next) {
  // The estimate is client-supplied: NaN or a negative value must not bypass the gate.
  const raw = Number(req.headers["x-token-estimate"]);
  const estimate = Number.isFinite(raw) && raw > 0 ? raw : 0;
  const [day, month] = await Promise.all([
    db.get(`usage/${req.uid}/daily/${todayKey()}`),
    db.get(`usage/${req.uid}/monthly/${monthKey()}`),
  ]);
  const usedDay = day?.total_tokens ?? 0;
  const usedMonth = month?.total_tokens ?? 0;
  if (usedDay + estimate > req.plan.tokensLimit)
    return res.status(429).json({ error: "quota_exceeded",
      used: usedDay, limit: req.plan.tokensLimit });
  if (usedMonth + estimate > req.plan.tokensMonthlyLimit)
    return res.status(429).json({ error: "quota_exceeded_monthly",
      used: usedMonth, limit: req.plan.tokensMonthlyLimit });
  next();
}
6.5 Layer 5 + 6 — proxy then meter
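async function handleChat(req, res) {
  const model = req.body.model ?? DEFAULT_MODEL;
  if (!ALLOWED_MODELS.has(model)) return res.status(400).json({ error: "model_not_allowed" });

  // Server clamps: non-numeric or out-of-range values fall back to safe defaults.
  const clamp = (v, lo, hi, dflt) => (Number.isFinite(v) ? Math.min(Math.max(v, lo), hi) : dflt);
  // Note: spreading req.body forwards every client field; a stricter proxy copies only allowlisted fields.
  const payload = { ...req.body, model,
    max_tokens: clamp(req.body.max_tokens, 1, 4096, 2048),
    temperature: clamp(req.body.temperature, 0, 1.0, 0.7) };

  let result;
  try {
    result = await getLlm().chat(payload); // SECRET injected here
  } catch {
    return res.status(502).json({ error: "provider_error" }); // never echo provider details or the key
  }

  // Settle on the provider's real count; fall back to the sanitised estimate if it is missing.
  const est = Number(req.headers["x-token-estimate"]);
  const realTokens = result?.usage?.total_tokens ?? (Number.isFinite(est) && est > 0 ? est : 0);
  const used = await writeUsage(req.uid, realTokens, model); // atomic settle
  return res.json({ ...result, quota: { used, limit: req.plan.tokensLimit }, plan: { type: req.plan.type } });
}

async function writeUsage(uid, tokens, model) {
  const dailyPath = `usage/${uid}/daily/${todayKey()}`;
  const monthlyPath = `usage/${uid}/monthly/${monthKey()}`;
  return db.transaction(async tx => {
    const day = await tx.get(dailyPath); // read first: Firestore transactions require reads before writes
    const delta = { total_tokens: INCREMENT(tokens), [`tokens.${model}`]: INCREMENT(tokens), last_updated: NOW() };
    tx.merge(dailyPath, delta);
    tx.merge(monthlyPath, delta);
    return (day?.total_tokens ?? 0) + tokens; // new daily total, used for the quota envelope
  });
}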
6.6 Speech-to-text variant
STT is the same seven layers; only the cost unit changes — minutes of audio instead of tokens.
- Client uploads audio as multipart/form-data (never a URL the server must fetch; that’s an SSRF foot-gun).
- Shrink before upload to cut both cost and latency: down-sample to the model’s native rate (16 kHz), mono, a compressed codec (Opus ~32 kbps), and optionally speed up slightly. In this repo a single FFmpeg pass does highpass=80, atempo=1.25, 16 kHz, mono, libopus 32k client-side.
- Pre-gate on an estimate from file size (e.g. ~0.5 MB ≈ 1 min). Settle on the real duration the provider reports (charged_minutes).
- Enforce an upload content-type allowlist and a hard size cap at the proxy.
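The upload checks and the size-based estimate can be sketched as one pure function. Everything concrete here is an assumption for illustration: the 25 MB cap, the content-type allowlist, the 415/413/400 status codes, and the function names. Only the ~0.5 MB per minute rule of thumb comes from the text above.

```javascript
// Assumed values: tune all three for your provider and codec.
const BYTES_PER_MINUTE = 0.5 * 1024 * 1024; // rule of thumb from the text
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;  // hypothetical hard cap
const ALLOWED_AUDIO_TYPES = new Set([
  "audio/ogg", "audio/webm", "audio/mpeg", "audio/wav", "audio/mp4",
]);

// Rough pre-gate estimate; the meter later settles on the provider's real duration.
function estimateMinutes(sizeBytes) {
  return sizeBytes / BYTES_PER_MINUTE;
}

// Validate an upload before any paid call is made.
function checkUpload({ contentType, sizeBytes }) {
  // Drop parameters such as "; codecs=opus" before comparing.
  const type = String(contentType ?? "").split(";")[0].trim().toLowerCase();
  if (!ALLOWED_AUDIO_TYPES.has(type)) {
    return { ok: false, status: 415, error: "unsupported_media_type" };
  }
  if (!Number.isInteger(sizeBytes) || sizeBytes <= 0) {
    return { ok: false, status: 400, error: "empty_upload" };
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    return { ok: false, status: 413, error: "upload_too_large" };
  }
  return { ok: true, estimatedMinutes: estimateMinutes(sizeBytes) };
}

console.log(checkUpload({ contentType: "audio/ogg; codecs=opus", sizeBytes: 1048576 }));
// prints { ok: true, estimatedMinutes: 2 }
```

The size used here should be the byte count the server actually received, not a client-declared length.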
7. The client’s contract
The client is a thin, honest courier. It:
- Obtains a fresh auth token and a fresh attestation token (fetch both in parallel; cache the attestation token ~50 min; apply a short timeout so a slow attestation never hangs the UX).
- Sends them as headers to your backend endpoint, never to the provider.
- Sends an honest token/length estimate so the pre-gate works (lying only hurts the user; the server settles on real cost anyway).
- Handles 429 by showing an upgrade / paywall prompt, not by retrying.
- Holds no provider key, no plan flag the server trusts, no usage counter the server trusts.
// Flutter reference: lib/src/services/chat/api_service.dart
final tokens = await Future.wait([_getIdToken(), _appCheck.get()]);
final headers = {
  'Authorization': 'Bearer ${tokens[0]}',
  'X-Firebase-AppCheck': tokens[1] ?? '',
  'X-Token-Estimate': estimatedTokens.toString(),
  'X-Model': effectiveModel,
};
final res = await http.post(Uri.parse('$_baseUrl/chat'), headers: headers, body: body);
if (res.statusCode == 429) throw QuotaExceededException(/* show paywall */);
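The "fetch both in parallel, with a short timeout on attestation" rule from the contract is not visible in the Flutter excerpt. Below is a sketch of that one idea for a JavaScript client. It is not a port of the repo's Dart code: withTimeout, collectTokens, and the 2-second default are hypothetical.

```javascript
// Resolve with `fallback` if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Clear the timer either way so it never keeps the process or tab busy.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Fetch both tokens in parallel. A slow or failing attestation yields null
// instead of blocking; the server then rejects the request with 401.
async function collectTokens(getIdToken, getAppCheckToken, attestationTimeoutMs = 2000) {
  const [idToken, appCheckToken] = await Promise.all([
    getIdToken(),
    withTimeout(getAppCheckToken().catch(() => null), attestationTimeoutMs, null),
  ]);
  return { idToken, appCheckToken };
}
```

The client fails fast and the server fails closed: a null attestation token is sent as-is and rejected by Layer 2, rather than the app hanging on a slow integrity check.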
8. Billing → entitlement sync (optional but recommended)
Never trust the client’s word that a purchase happened. Establish entitlement server-side:
- Billing provider (RevenueCat / Stripe) sends a webhook on INITIAL_PURCHASE / RENEWAL / CANCELLATION / EXPIRATION.
- Verify the webhook signature: HMAC-SHA256 over the raw body with a shared secret, compared with a timing-safe equality check (or a bearer secret). Reject 401 otherwise.
- Idempotency: store processed event_ids; a duplicate delivery is a no-op.
- Map product_id → limits, write plans/{uid}, and bust the plan cache immediately so the upgrade is live within seconds.
- For the “just purchased, still shows free” race, let the client send a one-shot X-RC-Verify: 1 / X-Plan-Reload: 1 header that forces a fresh entitlement read on the next call.
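The signature check and the idempotency rule can be sketched with Node's built-in crypto module. Assumptions are labeled in the code: the signature arriving as a hex digest, the flat { id, type, product_id } event shape, and the in-memory Set standing in for a persistent store are all illustrative, and real providers differ in header names and payload layout.

```javascript
const nodeCrypto = require("node:crypto");

// Verify an HMAC-SHA256 signature computed over the raw request body.
// Assumption: the provider sends the digest hex-encoded.
function verifyWebhookSignature(rawBody, signatureHex, secret) {
  const expected = nodeCrypto.createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(String(signatureHex ?? ""), "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return given.length === expected.length && nodeCrypto.timingSafeEqual(given, expected);
}

// In-memory stand-in for a persistent "processed events" collection.
const processedEvents = new Set();

// Hypothetical handler: returns the HTTP status the endpoint would send.
function handleWebhook({ rawBody, signatureHex, secret }) {
  if (!verifyWebhookSignature(rawBody, signatureHex, secret)) return { status: 401 };
  const event = JSON.parse(rawBody); // parse only after the signature is verified
  if (processedEvents.has(event.id)) return { status: 200, duplicate: true };
  processedEvents.add(event.id);
  // Here: map event.product_id to limits and write plans/{uid}.
  return { status: 200, duplicate: false };
}
```

The HMAC must be computed over the exact bytes received, so the endpoint needs access to the raw body before any JSON middleware re-serialises it.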
9. Security checklist (use as PR review gate)
- Provider keys exist only in the secrets manager and the server process; grep the client bundle to be sure.
- Every paid endpoint runs identity → integrity → entitlement → pre-gate middleware, in that order, before any provider call.
- uid is taken from the verified token, never from the request body/query.
- App-integrity verification is enforced (not just auth). Dev-bypass is a tiny, explicit, hard-coded allowlist.
- Model is checked against a server allowlist; max_tokens, temperature, and upload size are clamped server-side.
- Usage counters use atomic increments inside a transaction.
- Quota is checked against both daily and monthly windows.
- A missing/garbage entitlement falls back to free, never premium.
- Webhook is signature-verified (timing-safe) and idempotent.
- No secret is ever logged or returned in a response body.
- Provider/transcription endpoints reject non-allowlisted content types and oversized uploads.
- Rate-limit / abuse alerts exist on anomalous per-uid spend (cost attribution from the per-model ledger).
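The "grep the client bundle" item can be automated in CI. The sketch below is a starting point only: the two patterns are assumed key shapes ("sk-" and "gsk_" prefixes followed by a long token) and must be replaced with the actual formats your providers issue. findLikelyKeys is a hypothetical helper name.

```javascript
// Assumed key shapes for illustration; confirm the real formats with each provider.
const KEY_PATTERNS = [
  /\bsk-[A-Za-z0-9_-]{20,}/g, // "sk-" style secret keys
  /\bgsk_[A-Za-z0-9]{20,}/g,  // "gsk_" style secret keys
];

// Return every substring of `text` that looks like a provider key.
function findLikelyKeys(text) {
  const hits = [];
  for (const pattern of KEY_PATTERNS) {
    for (const match of text.matchAll(pattern)) hits.push(match[0]);
  }
  return hits;
}

// Usage: read the built bundle as a string and fail the build on any hit.
const sampleBundle = 'const cfg = { endpoint: "/chat", task: "ask-me-anything" };';
console.log(findLikelyKeys(sampleBundle).length); // prints 0
```

A clean scan is necessary but not sufficient: it only catches keys that match the configured shapes.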
10. Failure modes & how to think about them
| Failure | Right behaviour |
|---|---|
| Attestation service down / slow | Client times out fast and proceeds without the token → server rejects 401. Fail closed on the server, fail fast on the client. Don’t let a slow integrity check freeze the app. |
| Entitlement DB read fails | Fall back to free limits, not premium. Cheaper to under-serve than to give away the paid API. |
| Provider returns no usage block | Settle the meter on the client’s estimate so you still charge something; never charge zero. |
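| Concurrent burst from one user | Atomic increments mean every request settles correctly. The pre-gate may briefly under-count while requests are still in flight, so a burst can overshoot by the cost of those in-flight calls; once they settle, further requests are rejected, and the monthly cap is the backstop. |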
| Webhook arrives twice | Idempotency key makes the second one a no-op. |
| Clock/timezone skew on window keys | Pick one zone (UTC) for todayKey()/monthKey() and use it everywhere — client display can localise separately. |
11. Cost & quality optimisations (worth stealing)
- Two-stage metering (estimate-gate then real-settle) keeps the hot path to two cheap usage reads (daily and monthly) before the paid call, and one transaction after.
- Per-model ledger (tokens.{model}) gives you exact cost attribution per user, which is essential for spotting a single user who costs more than they pay.
- Pre-upload audio compression for STT cuts bytes-on-wire and provider minutes simultaneously (16 kHz/mono/Opus + slight speed-up).
- Short-TTL plan cache (10 s) collapses thousands of entitlement reads into one while keeping upgrades near-instant.
- Premium = “unlimited” but safely capped. Even paid plans get a hard ceiling so a compromised premium account can’t run an unbounded bill.
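Cost attribution from the per-model ledger reduces to a small calculation. The sketch below assumes a hypothetical price table in USD per million tokens and a single blended price per model (real providers usually price input and output tokens separately); the model names and numbers are placeholders.

```javascript
// Hypothetical prices in USD per million tokens; take real prices from your provider.
const PRICE_PER_MTOK = { "model-small": 0.5, "model-large": 5 };

// Sum the provider cost of one ledger document's per-model token counts.
function costOfLedger(ledgerTokens, prices = PRICE_PER_MTOK) {
  let usd = 0;
  for (const [model, tokens] of Object.entries(ledgerTokens)) {
    const price = prices[model];
    // Fail loudly instead of silently treating an unpriced model as free.
    if (price === undefined) throw new Error(`no price configured for ${model}`);
    usd += (tokens / 1_000_000) * price;
  }
  return usd;
}

// Flag users whose monthly provider cost exceeds what they pay.
function isUnprofitable(ledgerTokens, monthlyRevenueUsd, prices = PRICE_PER_MTOK) {
  return costOfLedger(ledgerTokens, prices) > monthlyRevenueUsd;
}

const monthLedger = { "model-small": 2_000_000, "model-large": 1_000_000 };
console.log(costOfLedger(monthLedger));         // prints 6
console.log(isUnprofitable(monthLedger, 4.99)); // prints true
```

Running this over each usage/{uid}/monthly document is one way to feed the anomalous-spend alerts mentioned in the security checklist.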
12. Stack-mapping table — adapt to your architecture
This blueprint is intentionally abstract. Here is how each layer maps to the reference implementation in this repo, and to common alternatives.
| Layer | This repo (Firebase + Flutter) | AWS | Cloudflare | Supabase |
|---|---|---|---|---|
| Backend | Firebase Functions v2 onRequest + Express (Firebase-Functions/functions/src/index.ts, exported api at line 2391) | Lambda + API Gateway | Workers | Edge Functions (Deno) |
| Secrets (L0) | defineSecret(...) bound via secrets:[...] on the function (index.ts:23-28, 2395) | Secrets Manager / SSM | Worker Secrets | Vault / project secrets |
| Identity (L1) | getAuth().verifyIdToken() (index.ts verifyFirebaseIdToken, applied :531,535) | Cognito / custom JWT authorizer | Access JWT | supabase.auth.getUser() |
| Integrity (L2) | Firebase App Check getAppCheck().verifyToken() (verifyAppCheckToken, applied :532,536); client lib/src/services/app_check_token_provider.dart | AppAttest / DeviceCheck + WAF | Turnstile | hCaptcha / Turnstile |
| Entitlement (L3) | plans/{uid} + LRU cache 10 s (loadPlanFromCache, planCache :47) | DynamoDB + in-mem cache | KV | Postgres row + cache |
| Pre-gate (L4) | fastPrecheck, X-Token-Estimate header | same logic | same | same |
| Proxy (L5) | /chat → OpenRouter, /whisper → Groq (index.ts:548, 612); model allowlist :48 | same | same | same |
| Meter (L6) | Firestore runTransaction + FieldValue.increment (writeRealUsage); usage/{uid}/daily and usage/{uid}/monthly | DynamoDB atomic counters | Durable Objects | Postgres += in txn |
| Billing sync (§8) | RevenueCat webhook, HMAC + idempotency (/webhook/revenuecat, verifyRcSignature) | Stripe webhook | same | same |
| Client (§7) | api_service.dart, audio_service.dart, quota_service.dart | any HTTP client | — | supabase-js |
Reference providers used here: LLM via OpenRouter (openai/gpt-5.4-mini), STT via Groq Whisper (whisper-large-v3-turbo), billing via RevenueCat. Swap any of them without touching the seven-layer structure.
13. TL;DR for an implementing agent
Build one backend endpoint per paid capability. Put four middlewares in front of it, in this order: verify identity → verify app integrity → resolve entitlement from your DB → check usage estimate against per-user daily+monthly limits. Only then inject the secret key (from a secrets manager) and call the provider with a server-clamped payload (allowlisted model, capped tokens/size). After the provider responds, read the real cost and atomically increment date-keyed daily+monthly counters in a transaction. Return the result plus a {used, limit} envelope. Establish free-vs-paid entitlement from a signature-verified, idempotent billing webhook, never from the client. The client holds no key, no plan flag, and no counter the server trusts.
Last updated: 2026-06-11