Architecture Blueprint

Secure API-Key Proxy & Usage Gating

Hide third-party API keys (LLM + Speech-to-Text) from your customers and meter their usage server-side. A provider-agnostic, seven-layer architecture blueprint.

Updated 11 June 2026 · 3,388 words · LLM-ready markdown

A copy-pasteable, provider-agnostic blueprint for hiding third-party API keys (LLM + Speech-to-Text) from your customers and metering / rate-limiting their usage server-side.

This document is written as a portable specification. It describes the pattern, not one framework. Section 12 maps every abstract component to the concrete reference implementation in this repo (Firebase Functions + Flutter), and to common alternatives (AWS, Cloudflare, Supabase, raw Node, etc.) so another engineer — or another coding agent — can drop it into a different stack.

One-sentence summary: The client never sees a provider key. It calls your backend with a short-lived identity token + an app-integrity token; your backend checks who they are, checks the request came from your real app, checks they still have quota, then forwards the call to the LLM/STT provider using a secret only the server holds, meters the real cost atomically, and returns the result.


1. Prerequisites

Before implementing this blueprint, you need the following building blocks. Each has many vendor choices; pick one per row.

| # | Capability | Why it’s required | Common choices |
|---|---|---|---|
| 1 | A backend you control (serverless function, container, or edge worker) | The only place the provider key is ever decrypted. The whole pattern collapses if the client can call the provider directly. | Firebase Functions, Cloud Run, AWS Lambda, Cloudflare Workers, Supabase Edge Functions, a plain Node/Express server |
| 2 | A secrets manager | Stores provider keys encrypted at rest, injected into the backend at runtime, never in source control or the client bundle. | Firebase/GCP Secret Manager, AWS Secrets Manager, Doppler, Vault, platform env-vars |
| 3 | Per-user authentication | Quotas are per user. You must be able to prove “this request belongs to user X” without trusting a client-supplied user-id. | Firebase Auth, Auth0, Clerk, Supabase Auth, your own JWT issuer |
| 4 | App-integrity / attestation | Auth proves who; attestation proves the request came from your genuine, unmodified app, not a script replaying a stolen token. Without it, a leaked auth token = free unlimited use of your paid API. | Firebase App Check (Play Integrity / DeviceCheck / reCAPTCHA), AppAttest, hCaptcha, a signed-nonce scheme |
| 5 | A datastore with atomic increments + transactions | Usage counters must increment correctly under concurrent requests. Read-modify-write without atomicity leaks quota. | Firestore (FieldValue.increment), DynamoDB (atomic counters), Redis (INCR), Postgres (UPDATE ... SET n = n + x) |
| 6 | Provider accounts | The actual paid services you are proxying. | LLM: OpenAI, OpenRouter, Anthropic. STT: Groq Whisper, OpenAI Whisper, Deepgram, Google STT |
| 7 | (Optional) A billing / entitlement provider | To distinguish free vs paid users and set their quota. Optional if everyone has the same limit. | RevenueCat, Stripe, Play Billing / App Store directly |

Mindset prerequisite: treat every byte from the client as hostile. The client decides nothing that costs money — not the model, not the limit, not the plan, not the user-id. The server re-derives all of it.


2. The threat model (what we are defending against)

Name the attacker so the design is justified, not cargo-culted.

| Threat | What happens if unaddressed | Defended by |
|---|---|---|
| Key extraction from the client | Anyone can decompile your app or open DevTools, lift the API key, and run your provider bill up to $50k overnight. Keys in mobile binaries and JS bundles are effectively public. | Layer 0 server-side key, Layer 5 proxy |
| Token replay / scripted abuse | An attacker grabs one valid auth token and scripts millions of calls. | Layer 2 attestation |
| Quota evasion | Client claims plan = premium or tokensUsed = 0. Free tier becomes infinite. | Layer 3 server-derived entitlement, Layer 6 server-side metering |
| Race-condition over-spend | 100 concurrent requests each read “0 used”, all pass the check, user spends 100× their limit. | Layer 6 atomic counters |
| Model / parameter abuse | Client requests an expensive model or max_tokens: 1_000_000. | Layer 5 server-side allowlist + clamps |
| Billing spoofing | Client fakes a “purchase succeeded” message to unlock premium for free. | Billing sync (§8): signed webhook, server-verified entitlement |
| Cost-of-goods blindness | You can’t see who spends what; one user quietly costs more than they pay. | Layer 6 per-model usage ledger |

3. Layered architecture

Seven layers, each a single responsibility. Implement them as ordered middleware so the cheapest rejections happen first (reject before you spend).

                              ┌─────────────────────────────────────────────┐
   CLIENT (untrusted)         │            YOUR BACKEND (trusted)            │      PROVIDERS (paid)
 ┌────────────────────┐       │                                             │   ┌──────────────────┐
 │  App / Web client  │       │  L1  Identity     verify auth token  ──401  │   │   LLM provider   │
 │                    │       │  L2  Integrity    verify app-check   ──401  │   │  (OpenAI / etc.) │
 │  • holds NO key    │──────▶│  L3  Entitlement  resolve plan (cache)      │   ├──────────────────┤
 │  • sends:          │ HTTPS │  L4  Pre-gate     quota check        ──429  │   │   STT provider   │
 │     auth token     │       │  L5  Proxy        inject SECRET KEY  ───────┼──▶│  (Whisper / etc.)│
 │     app-check tok  │       │  L6  Meter        atomic usage write        │   └──────────────────┘
 │     token estimate │◀──────│  L7  Respond      result + quota state      │
 └────────────────────┘ JSON  │                                             │
                              │   ▲ secret key injected from Secrets Mgr    │
                              └───┼─────────────────────────────────────────┘

                          ┌───────┴────────┐        ┌──────────────────────┐
                          │ Secrets Manager│        │  Billing provider     │
                          │  (server only) │        │  webhook ──▶ L3 sync  │
                          └────────────────┘        └──────────────────────┘

Layer 0 — Secret storage

Provider keys live in a secrets manager, bound to the function at deploy time. They are read at runtime via SECRET.value() / process.env, trimmed/sanitised, and a client is lazily constructed once per warm instance. Keys are never logged, never returned in a response, never committed.

Layer 1 — Identity (who)

Extract the bearer token, verify its signature against the auth provider, derive a trusted uid from the verified claims. Reject 401 on missing/invalid. Never read the user-id from the request body.

Layer 2 — Integrity (genuine app)

Read the attestation header, verify it with the attestation service. Reject 401 if missing/invalid. This is what stops a stolen token from being weaponised at scale. Keep a tiny, explicit dev-bypass list for your own test accounts — and nothing else.

Layer 3 — Entitlement (what they’re allowed)

Resolve the user’s plan (free / premium / …) → a concrete set of limits. Read it from the server’s own store, never from the client. Cache it in-process with a short TTL (e.g. 10 s, 5k entries LRU) so you don’t hit the DB on every call. Allow an explicit cache-bust header for the “just purchased, refresh now” moment.
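The short-TTL LRU described above can be sketched without any external library. This is a minimal illustration, not the repo's cache: the class name `TtlLru` and its injectable `now` clock are assumptions made for the example.

```javascript
// Hypothetical in-process cache: LRU eviction plus a per-entry TTL.
// A Map iterates in insertion order, so its first key is the least recently used.
class TtlLru {
  constructor({ max = 5000, ttl = 10_000, now = Date.now } = {}) {
    this.max = max;
    this.ttl = ttl;
    this.now = now;            // injectable clock, handy for tests
    this.map = new Map();
  }

  get(key) {
    const hit = this.map.get(key);
    if (!hit) return undefined;
    if (this.now() >= hit.expires) {   // stale: drop it and report a miss
      this.map.delete(key);
      return undefined;
    }
    this.map.delete(key);              // re-insert to mark as most recently used
    this.map.set(key, hit);
    return hit.value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, { value, expires: this.now() + this.ttl });
    if (this.map.size > this.max) {
      this.map.delete(this.map.keys().next().value);   // evict the oldest entry
    }
  }

  delete(key) {
    return this.map.delete(key);       // used by the cache-bust header path
  }
}
```

A 10 s TTL bounds how long a stale plan can be served; the explicit `delete` is what the "just purchased" cache-bust would call.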

Layer 4 — Pre-gate (cheap rejection before spend)

Before calling the provider, check the current usage against the limit using a client-supplied estimate (X-Token-Estimate, or audio-length for STT). If used + estimate > limit for either the daily or monthly window, reject 429 quota_exceeded with {used, limit} so the client can show an upgrade prompt. This is an optimistic gate — it’s cheap and stops most over-spend before a single paid token is bought.

Layer 5 — Proxy (the actual forward)

Validate & clamp the request: enforce a model allowlist, clamp max_tokens, clamp temperature, cap upload size. Inject the secret key into the outbound call. Forward to the provider. Treat provider errors as 502. The client’s request is a suggestion; the server decides the final payload.

Layer 6 — Meter (truth after the fact)

Read the real cost from the provider response (usage.total_tokens; for STT, the charged audio minutes), and atomically increment the per-user daily and monthly counters inside a transaction. Record per-model breakdown for cost attribution. The pre-gate is the estimate; this is the settlement.
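To see why the increment has to be atomic, the following sketch simulates a burst of concurrent settlements against an in-memory counter. Everything here (`Counter`, `naiveAdd`, `atomicAdd`, `burst`) is hypothetical and stands in for a real datastore; it only illustrates the lost-update problem.

```javascript
// Hypothetical in-memory counter standing in for one usage document.
class Counter {
  constructor() { this.n = 0; }
  async read() { return this.n; }               // simulated network read
  async write(v) { this.n = v; }                // simulated network write
  async increment(delta) { this.n += delta; }   // applied in one step by the store
}

// Unsafe: read, add, write back. Concurrent callers all read the same old value.
async function naiveAdd(counter, delta) {
  const current = await counter.read();
  await counter.write(current + delta);
}

// Safe: the store applies the delta itself, so no update is lost.
async function atomicAdd(counter, delta) {
  await counter.increment(delta);
}

// Fire `requests` settlements at once and report the final counter value.
async function burst(addFn, requests, delta) {
  const counter = new Counter();
  await Promise.all(Array.from({ length: requests }, () => addFn(counter, delta)));
  return counter.n;
}
```

With 100 concurrent requests of 10 tokens each, `burst(naiveAdd, 100, 10)` resolves to 10 because every caller read 0 before anyone wrote, while `burst(atomicAdd, 100, 10)` resolves to 1000.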

Layer 7 — Respond

Return the provider result plus a small quota envelope ({used, limit, plan}) so the client UI can render remaining usage without a second round-trip.


4. Request lifecycle (sequence)

client                 backend (ordered middleware)                provider     datastore
  │                                                                   │            │
  │ POST /chat                                                        │            │
  │  Authorization: Bearer <authToken>                                │            │
  │  X-App-Check: <attestationToken>                                  │            │
  │  X-Token-Estimate: 1200                                           │            │
  ├──────────────▶ L1 verify auth ─────────────────── invalid ▶ 401  │            │
  │                L2 verify attestation ───────────── invalid ▶ 401  │            │
  │                L3 resolve plan (cache→DB) ───────────────────────────────────▶ read plans/{uid}
  │                L4 read usage, estimate gate ─────── over ────▶ 429│◀───────────── read usage/{uid}/...
  │                L5 validate+clamp, inject SECRET ─────────────────▶│            │
  │                                                    provider call  │            │
  │                L6 read real usage, atomic incr ◀─────────────────│───────────▶ tx incr daily+monthly
  │ 200 { ...result, quota:{used,limit}, plan } ◀── L7              │            │
  │◀──────────────                                                    │            │

Key ordering rule: identity → integrity → entitlement → gate → spend → meter. Each step is strictly cheaper and more certain than the paid call, so the request is killed as early as possible.
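The ordering rule can be sketched as a tiny middleware runner in which a rejecting layer stops everything after it. This is an illustrative stand-in for a real framework's middleware chain; `runChain`, the layer objects, and the request fields (`uid`, `attested`, `used`, `estimate`) are all assumptions.

```javascript
// Hypothetical minimal middleware runner: each layer either calls next() or rejects.
async function runChain(layers, req) {
  const res = { status: 200, body: null, ran: [] };
  for (const layer of layers) {
    res.ran.push(layer.name);
    let passed = false;
    await layer.run(req, res, () => { passed = true; });
    if (!passed) return res;       // rejected: later, costlier layers never run
  }
  return res;
}

const reject = (res, status, error) => { res.status = status; res.body = { error }; };

// Cheapest checks first; the paid call is last.
const layers = [
  { name: "identity",    run: (req, res, next) => (req.uid ? next() : reject(res, 401, "invalid_auth")) },
  { name: "integrity",   run: (req, res, next) => (req.attested ? next() : reject(res, 401, "invalid_app_check")) },
  { name: "entitlement", run: (req, res, next) => { req.limit = 1000; next(); } },
  { name: "pre-gate",    run: (req, res, next) =>
      (req.used + req.estimate > req.limit ? reject(res, 429, "quota_exceeded") : next()) },
  { name: "proxy+meter", run: (req, res, next) => { res.body = { result: "ok" }; next(); } },
];
```

A request with no verified `uid` is rejected after running only the identity layer, so it never costs a datastore read or a provider call.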


5. Data model

Two collections. Names are illustrative; any DB works.

5.1 Entitlement store — plans/{uid}

Written by the billing webhook (§8, billing sync), read by Layer 3.

{
  "plan_type": "premium",                  // 'free' | 'premium'
  "subscription_status": "active",         // active | cancelled | expired | billing_issue
  "expires_date": "2026-07-01T00:00:00Z",
  "limits": {                              // server-authoritative limits
    "tokens_per_day":          500000,
    "tokens_per_month":        3000000,
    "stt_minutes_per_day":     20,
    "stt_minutes_per_month":   200
  },
  "billing_user_id": "rc_or_stripe_id",
  "last_updated": "<server timestamp>"
}

5.2 Usage ledger — usage/{uid}/daily/{YYYY-MM-DD} and usage/{uid}/monthly/{YYYY-MM}

Written by Layer 6, read by Layer 4. Date-keyed documents give you free, automatic window resets — a new day = a fresh empty doc, no cron job needed.

// usage/{uid}/daily/2026-06-11
{
  "total_tokens":     12840,
  "tokens":           { "openai/gpt-5.4-mini": 12840 },   // per-model attribution
  "stt_minutes":      4.2,
  "stt":              { "whisper-large-v3-turbo": 4.2 },
  "last_updated":     "<server timestamp>"
}

Why two windows: the daily limit smooths spikes and caps the blast radius of a single bad day; the monthly limit caps the bill. A request must pass both.

Why date-keyed docs: resets are implicit. There is no “reset usage” job to forget — todayKey() simply points at a new document at midnight UTC.
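One possible shape for the window-key helpers, assuming plain JavaScript `Date` objects. The function names match the ones used in this document, but the bodies are a sketch rather than the repo's code; the optional `now` parameter is added here so the clock can be pinned in tests.

```javascript
// Date.prototype.toISOString() always renders in UTC, so slicing it
// yields window keys that do not depend on the server's local timezone.
function todayKey(now = new Date()) {
  return now.toISOString().slice(0, 10);   // "YYYY-MM-DD"
}

function monthKey(now = new Date()) {
  return now.toISOString().slice(0, 7);    // "YYYY-MM"
}
```

For example, `todayKey(new Date("2026-06-11T23:30:00-05:00"))` returns `"2026-06-12"`, because that instant is already past midnight in UTC.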


6. Portable reference implementation (pseudo-code)

Language-neutral, framework-neutral. Copy and translate. The control flow is the spec.

6.1 Layer 0 — secrets & lazy provider client

const LLM_KEY = secret("LLM_API_KEY");     // bound at deploy, server-only
const STT_KEY = secret("STT_API_KEY");

let llm;                                    // lazy, per warm instance
function getLlm() {
  if (!llm) {
    const key = LLM_KEY.value().trim().replace(/^['"]|['"]$/g, "").replace(/\r?\n/g, "");
    llm = new ProviderClient({ apiKey: key });
  }
  return llm;
}

6.2 Layers 1 & 2 — identity + integrity middleware

async function verifyIdentity(req, res, next) {
  const token = bearer(req.headers.authorization);
  if (!token) return res.status(401).json({ error: "missing_auth" });
  try { req.uid = (await auth.verifyToken(token, /*checkRevoked*/ true)).uid; }
  catch { return res.status(401).json({ error: "invalid_auth" }); }
  next();                                             // outside the try: downstream errors are not auth errors
}

async function verifyIntegrity(req, res, next) {
  const tok = req.headers["x-app-check"];
  if (isDevUser(req.uid)) return next();              // explicit, tiny allowlist
  if (!tok) return res.status(401).json({ error: "missing_app_check" });
  try { await appCheck.verifyToken(tok); }
  catch { return res.status(401).json({ error: "invalid_app_check" }); }
  next();
}
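The `bearer()` helper used above is not defined in this document. A minimal sketch of what it might look like, assuming the standard `Authorization: Bearer <token>` header shape:

```javascript
// Hypothetical helper: returns the token from "Bearer <token>", or null.
// The scheme name is matched case-insensitively; anything else is rejected.
function bearer(header) {
  if (typeof header !== "string") return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match ? match[1] : null;
}
```

Returning `null` for every malformed shape keeps the caller's check to a single `if (!token)`.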

6.3 Layer 3 — entitlement with cache

const planCache = new LRU({ max: 5000, ttl: 10_000 });

async function loadPlan(req, res, next) {
  const reload = req.headers["x-plan-reload"] === "1";
  if (reload) planCache.delete(req.uid);

  let plan = planCache.get(req.uid);
  if (!plan) {
    plan = normalizePlan(await db.get(`plans/${req.uid}`));   // server-authoritative
    planCache.set(req.uid, plan);
  }
  req.plan = plan;
  next();
}

// normalizePlan clamps every client/DB value to known-safe caps.
// A missing/garbage plan ALWAYS falls back to the FREE tier, never premium.
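A possible `normalizePlan`, sketched under stated assumptions: the input follows the plans/{uid} document shape from §5.1, only an active and unexpired premium record counts as paid (how to treat cancelled or billing_issue is a policy choice this sketch does not settle), and the free-tier numbers are invented for illustration.

```javascript
// Illustrative numbers only. FREE limits are made up; HARD_CAPS mirror the §5.1 example.
const FREE = Object.freeze({ type: "free", tokensLimit: 20_000, tokensMonthlyLimit: 200_000 });
const HARD_CAPS = Object.freeze({ tokensLimit: 500_000, tokensMonthlyLimit: 3_000_000 });

function normalizePlan(doc, now = new Date()) {
  // Anything missing, malformed, inactive, or expired falls back to FREE.
  if (!doc || doc.plan_type !== "premium") return FREE;
  if (doc.subscription_status !== "active") return FREE;
  const expires = Date.parse(doc.expires_date);
  if (!Number.isFinite(expires) || expires <= now.getTime()) return FREE;

  // Clamp each stored limit to a hard ceiling; garbage values get the free limit.
  const cap = (value, max, fallback) => {
    const n = Number(value);
    return Number.isFinite(n) && n > 0 ? Math.min(Math.floor(n), max) : fallback;
  };
  return {
    type: "premium",
    tokensLimit: cap(doc.limits?.tokens_per_day, HARD_CAPS.tokensLimit, FREE.tokensLimit),
    tokensMonthlyLimit: cap(doc.limits?.tokens_per_month, HARD_CAPS.tokensMonthlyLimit, FREE.tokensMonthlyLimit),
  };
}
```

Because every path either returns `FREE` or passes through `cap`, a corrupted limits field can lower a user's quota but never raise it above the hard ceiling.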

6.4 Layer 4 — pre-gate (estimate)

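async function preGate(req, res, next) {
  // Client-supplied, so sanitise: missing, NaN, infinite, or non-positive estimates count as 1.
  // (A negative or NaN estimate would otherwise slip past the comparison below.)
  const raw = Number(req.headers["x-token-estimate"]);
  const estimate = Number.isFinite(raw) && raw > 0 ? raw : 1;

  const [day, month] = await Promise.all([            // one parallel round of reads
    db.get(`usage/${req.uid}/daily/${todayKey()}`),
    db.get(`usage/${req.uid}/monthly/${monthKey()}`),
  ]);
  const usedDay = day?.total_tokens ?? 0;
  const usedMonth = month?.total_tokens ?? 0;

  if (usedDay + estimate > req.plan.tokensLimit)
    return res.status(429).json({ error: "quota_exceeded",
                                  used: usedDay, limit: req.plan.tokensLimit });

  if (usedMonth + estimate > req.plan.tokensMonthlyLimit)
    return res.status(429).json({ error: "quota_exceeded_monthly",
                                  used: usedMonth, limit: req.plan.tokensMonthlyLimit });

  next();
}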

6.5 Layer 5 + 6 — proxy then meter

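async function handleChat(req, res) {
  const model = req.body.model ?? DEFAULT_MODEL;
  if (!ALLOWED_MODELS.has(model)) return res.status(400).json({ error: "model_not_allowed" });

  // Build the payload from an explicit field list (field names illustrative);
  // spreading req.body would forward any extra, possibly costly, client parameter.
  const payload = { model, messages: req.body.messages,
    max_tokens:  clamp(req.body.max_tokens,  1, 4096, 2048),     // server clamps
    temperature: clamp(req.body.temperature, 0, 1.0,  0.7) };

  let result;
  try { result = await getLlm().chat(payload); }                 // SECRET injected here
  catch { return res.status(502).json({ error: "provider_error" }); }

  // No usage block from the provider: settle on the sanitised estimate, never on zero.
  // MAX_ESTIMATE is a server-side constant, like DEFAULT_MODEL.
  const realTokens = result?.usage?.total_tokens
    ?? clamp(req.headers["x-token-estimate"], 1, MAX_ESTIMATE, 1);

  const quota = await writeUsage(req.uid, realTokens, model, req.plan);   // atomic settle
  return res.json({ ...result, quota, plan: { type: req.plan.type } });
}

// Coerce to a number, fall back to dflt when not finite, then bound to [lo, hi].
function clamp(value, lo, hi, dflt) {
  const n = Number(value);
  return Math.min(hi, Math.max(lo, Number.isFinite(n) ? n : dflt));
}

async function writeUsage(uid, tokens, model, plan) {
  const dayPath = `usage/${uid}/daily/${todayKey()}`;
  // Nested map rather than a dotted key: model ids such as "gpt-5.4-mini" contain dots.
  const delta = { total_tokens: INCREMENT(tokens), tokens: { [model]: INCREMENT(tokens) }, last_updated: NOW() };
  return db.transaction(async tx => {
    const day = await tx.get(dayPath);                            // reads before writes
    tx.merge(dayPath, delta);
    tx.merge(`usage/${uid}/monthly/${monthKey()}`, delta);
    return { used: (day?.total_tokens ?? 0) + tokens, limit: plan.tokensLimit };   // quota envelope
  });
}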

6.6 Speech-to-text variant

STT is the same seven layers; only the cost unit changes — minutes of audio instead of tokens.

  • Client uploads audio as multipart/form-data (never a URL the server must fetch — that’s an SSRF foot-gun).
  • Shrink before upload to cut both cost and latency: down-sample to the model’s native rate (16 kHz), mono, a compressed codec (Opus ~32 kbps), and optionally speed up slightly. In this repo a single FFmpeg pass does highpass=80, atempo=1.25, 16 kHz, mono, libopus 32k client-side.
  • Pre-gate on an estimate from file size (e.g. ~0.5 MB ≈ 1 min). Settle on the real duration the provider reports (charged_minutes).
  • Enforce an upload content-type allowlist and a hard size cap at the proxy.
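The size-based estimate and the upload checks can be sketched as follows. The allowlist, the 25 MB cap, and the function names are assumptions for illustration, and the bytes-per-minute ratio should be derived from the codec you actually ship: at 32 kbps, Opus works out to about 0.24 MB per minute, so treat any fixed ratio as an example rather than a constant.

```javascript
// Hypothetical limits; tune them to your provider and codec.
const ALLOWED_AUDIO_TYPES = new Set(["audio/ogg", "audio/webm", "audio/mpeg", "audio/wav", "audio/mp4"]);
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;

// Minutes of audio implied by a file size at a given bitrate (kilobits per second).
function estimateMinutes(bytes, kbps = 32) {
  const bytesPerMinute = (kbps * 1000 / 8) * 60;   // 32 kbps -> 240,000 bytes per minute
  return bytes / bytesPerMinute;
}

// Validate an upload before it reaches the provider; returns a status on rejection.
function checkUpload({ contentType, bytes }) {
  // Drop parameters such as "; codecs=opus" before checking the allowlist.
  const baseType = String(contentType ?? "").split(";")[0].trim().toLowerCase();
  if (!ALLOWED_AUDIO_TYPES.has(baseType)) return { ok: false, status: 415, error: "unsupported_media_type" };
  if (!Number.isFinite(bytes) || bytes <= 0) return { ok: false, status: 400, error: "empty_upload" };
  if (bytes > MAX_UPLOAD_BYTES) return { ok: false, status: 413, error: "upload_too_large" };
  return { ok: true, estimatedMinutes: estimateMinutes(bytes) };
}
```

The estimate only feeds the pre-gate; settlement still uses the duration the provider reports.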

7. The client’s contract

The client is a thin, honest courier. It:

  1. Obtains a fresh auth token and a fresh attestation token (fetch both in parallel; cache the attestation token ~50 min; apply a short timeout so a slow attestation never hangs the UX).
  2. Sends them as headers to your backend endpoint — never to the provider.
  3. Sends an honest token/length estimate so the pre-gate works (lying only hurts the user — the server settles on real cost anyway).
  4. Handles 429 by showing an upgrade / paywall prompt, not by retrying.
  5. Holds no provider key, no plan flag the server trusts, no usage counter the server trusts.

// Flutter reference: lib/src/services/chat/api_service.dart
final tokens = await Future.wait([_getIdToken(), _appCheck.get()]);
final headers = {
  'Authorization':       'Bearer ${tokens[0]}',
  'X-Firebase-AppCheck': tokens[1] ?? '',
  'X-Token-Estimate':    estimatedTokens.toString(),
  'X-Model':             effectiveModel,
};
final res = await http.post(Uri.parse('$_baseUrl/chat'), headers: headers, body: body);
if (res.statusCode == 429) throw QuotaExceededException(/* show paywall */);

8. Billing sync

Never trust the client’s word that a purchase happened. Establish entitlement server-side:

  1. Billing provider (RevenueCat / Stripe) sends a webhook on INITIAL_PURCHASE / RENEWAL / CANCELLATION / EXPIRATION.
  2. Verify the webhook signature — HMAC-SHA256 over the raw body with a shared secret, compared with a timing-safe equality check (or a bearer secret). Reject 401 otherwise.
  3. Idempotency: store processed event_ids; a duplicate delivery is a no-op.
  4. Map product_id → limits, write plans/{uid}, and bust the plan cache immediately so the upgrade is live within seconds.
  5. For the “just purchased, still shows free” race, let the client send a one-shot X-RC-Verify: 1 / X-Plan-Reload: 1 header that forces a fresh entitlement read on the next call.
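Steps 2 and 3 can be sketched with Node's built-in `crypto` module. The assumptions are loud ones: the signature is taken to be a hex-encoded HMAC-SHA256 of the raw body, the event carries an `event_id` field, and an in-memory `Set` stands in for a durable store of processed events. Check your billing provider's documentation for its actual signing scheme.

```javascript
const crypto = require("node:crypto");

// Assumes a hex-encoded HMAC-SHA256 of the raw request body.
function verifySignature(rawBody, signatureHex, secret) {
  const expected = crypto.createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(String(signatureHex ?? ""), "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}

const seenEvents = new Set();   // stand-in for a durable "processed events" table

// Returns an HTTP-like status; `applyEvent` is the caller's plan-writing function.
function handleWebhook({ rawBody, signature, secret, applyEvent }) {
  if (!verifySignature(rawBody, signature, secret)) return { status: 401 };
  const event = JSON.parse(rawBody);
  if (seenEvents.has(event.event_id)) return { status: 200, duplicate: true };   // no-op
  applyEvent(event);
  seenEvents.add(event.event_id);   // mark processed only after the write succeeds
  return { status: 200, duplicate: false };
}
```

The signature is computed over the raw bytes, so the body must be captured before any JSON parsing middleware rewrites it.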

9. Security checklist (use as PR review gate)

  • Provider keys exist only in the secrets manager and the server process — grep the client bundle to be sure.
  • Every paid endpoint runs identity → integrity → entitlement → pre-gate middleware, in that order, before any provider call.
  • uid is taken from the verified token, never from the request body/query.
  • App-integrity verification is enforced (not just auth). Dev-bypass is a tiny, explicit, hard-coded allowlist.
  • Model is checked against a server allowlist; max_tokens, temperature, and upload size are clamped server-side.
  • Usage counters use atomic increments inside a transaction.
  • Quota is checked against both daily and monthly windows.
  • A missing/garbage entitlement falls back to free, never premium.
  • Webhook is signature-verified (timing-safe) and idempotent.
  • No secret is ever logged or returned in a response body.
  • Provider/transcription endpoints reject non-allowlisted content types and oversized uploads.
  • Rate-limit / abuse alerts exist on anomalous per-uid spend (cost attribution from the per-model ledger).

10. Failure modes & how to think about them

| Failure | Right behaviour |
|---|---|
| Attestation service down / slow | Client times out fast and proceeds without the token → server rejects 401. Fail closed on the server, fail fast on the client. Don’t let a slow integrity check freeze the app. |
| Entitlement DB read fails | Fall back to free limits, not premium. Cheaper to under-serve than to give away the paid API. |
| Provider returns no usage block | Settle the meter on the client’s estimate so you still charge something; never charge zero. |
| Concurrent burst from one user | Atomic increments mean every request settles correctly; the pre-gate may briefly under-count, but the monthly cap is the backstop. |
| Webhook arrives twice | Idempotency key makes the second one a no-op. |
| Clock/timezone skew on window keys | Pick one zone (UTC) for todayKey()/monthKey() and use it everywhere; client display can localise separately. |

11. Cost & quality optimisations (worth stealing)

  • Two-stage metering (estimate-gate then real-settle) keeps the hot path to one round of usage reads (daily + monthly) before the paid call, and one transaction after.
  • Per-model ledger (tokens.{model}) gives you exact cost attribution per user — essential for spotting a single user who costs more than they pay.
  • Pre-upload audio compression for STT cuts bytes-on-wire and provider minutes simultaneously (16 kHz/mono/Opus + slight speed-up).
  • Short-TTL plan cache (10 s) collapses thousands of entitlement reads into one while keeping upgrades near-instant.
  • Premium = “unlimited” but safely capped. Even paid plans get a hard ceiling so a compromised premium account can’t run an unbounded bill.
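The per-model ledger makes cost attribution a small calculation. In this sketch the prices are placeholders, not real provider rates, and `costOfGoods` / `isUnprofitable` are hypothetical names; the input has the shape of a usage/{uid}/monthly/{YYYY-MM} document.

```javascript
// Hypothetical prices in USD per million tokens; substitute your provider's real rates.
const PRICE_PER_MTOK = { "model-a": 0.5, "model-b": 4 };

// Sum provider cost across the per-model token counts in one usage document.
function costOfGoods(usageDoc, prices = PRICE_PER_MTOK) {
  let usd = 0;
  for (const [model, tokens] of Object.entries(usageDoc.tokens ?? {})) {
    const price = prices[model];
    if (price === undefined) throw new Error(`no price configured for ${model}`);
    usd += (tokens / 1_000_000) * price;
  }
  return usd;
}

// True when a user's provider cost for the window exceeds what they paid.
function isUnprofitable(usageDoc, revenueUsd, prices = PRICE_PER_MTOK) {
  return costOfGoods(usageDoc, prices) > revenueUsd;
}
```

Throwing on an unpriced model is deliberate: a silently skipped model would under-report exactly the cost this ledger exists to surface.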

12. Stack-mapping table — adapt to your architecture

This blueprint is intentionally abstract. Here is how each layer maps to the reference implementation in this repo, and to common alternatives.

| Layer | This repo (Firebase + Flutter) | AWS | Cloudflare | Supabase |
|---|---|---|---|---|
| Backend | Firebase Functions v2 onRequest + Express (Firebase-Functions/functions/src/index.ts, exported api at line 2391) | Lambda + API Gateway | Workers | Edge Functions (Deno) |
| Secrets (L0) | defineSecret(...) bound via secrets:[...] on the function (index.ts:23-28, 2395) | Secrets Manager / SSM | Worker Secrets | Vault / project secrets |
| Identity (L1) | getAuth().verifyIdToken() (index.ts verifyFirebaseIdToken, applied :531,535) | Cognito / custom JWT authorizer | Access JWT | supabase.auth.getUser() |
| Integrity (L2) | Firebase App Check getAppCheck().verifyToken() (verifyAppCheckToken, applied :532,536); client lib/src/services/app_check_token_provider.dart | AppAttest / DeviceCheck + WAF | Turnstile | hCaptcha / Turnstile |
| Entitlement (L3) | plans/{uid} + LRU cache 10 s (loadPlanFromCache, planCache :47) | DynamoDB + in-mem cache | KV | Postgres row + cache |
| Pre-gate (L4) | fastPrecheck, X-Token-Estimate header | same logic | same | same |
| Proxy (L5) | /chat → OpenRouter, /whisper → Groq (index.ts:548, 612); model allowlist :48 | same | same | same |
| Meter (L6) | Firestore runTransaction + FieldValue.increment (writeRealUsage); usage/{uid}/daily and usage/{uid}/monthly | DynamoDB atomic counters | Durable Objects | Postgres += in txn |
| Billing sync (§8) | RevenueCat webhook, HMAC + idempotency (/webhook/revenuecat, verifyRcSignature) | Stripe webhook | same | same |
| Client (§7) | api_service.dart, audio_service.dart, quota_service.dart | any HTTP client | | supabase-js |

Reference providers used here: LLM via OpenRouter (openai/gpt-5.4-mini), STT via Groq Whisper (whisper-large-v3-turbo), billing via RevenueCat. Swap any of them without touching the seven-layer structure.


13. TL;DR for an implementing agent

Build one backend endpoint per paid capability. Put four middlewares in front of it, in this order: verify identity → verify app integrity → resolve entitlement from your DB → check usage estimate against per-user daily+monthly limits. Only then inject the secret key (from a secrets manager) and call the provider with a server-clamped payload (allowlisted model, capped tokens/size). After the provider responds, read the real cost and atomically increment date-keyed daily+monthly counters in a transaction. Return the result plus a {used, limit} envelope. Establish free-vs-paid entitlement from a signature-verified, idempotent billing webhook — never from the client. The client holds no key, no plan flag, and no counter the server trusts.


Last updated: 2026-06-11
