Production Engineering
The bridge from "works on Hardhat" to "survives a Monday morning with 50k users and a flaky RPC". This phase is pure DevOps with Web3 quirks.
Goal — ship a full-stack dApp: verified contracts, monitored indexer, HA RPC, CI/CD pipeline, pager runbooks.
1. The deploy checklist
- [ ] Contracts audited (or at least: Slither clean + 90%+ test coverage + fuzzing).
- [ ] Deployment is a script in Git (no "I ran it from my laptop").
- [ ] Constructor args committed; contract verified on Etherscan/Basescan.
- [ ] Owner is a multisig (Gnosis Safe) or timelock, not a single EOA.
- [ ] Emergency pause tested.
- [ ] Frontend reads contract addresses from config per chain, not hard-coded.
- [ ] Indexer has backfill strategy and reorg safety.
- [ ] Monitoring & alerts wired for stuck txs, RPC errors, event gap.
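The "addresses from config per chain" item can be a small typed lookup that fails loudly instead of silently using a wrong address. A minimal sketch; the chain IDs are Base mainnet/Sepolia, but the contract names and addresses are placeholders:

```typescript
// Per-chain contract addresses live in config, never hard-coded in components.
type ChainConfig = { rpcUrl: string; contracts: Record<string, string> };

const chains: Record<number, ChainConfig> = {
  // Placeholder addresses, for illustration only.
  8453: {
    rpcUrl: "https://mainnet.base.org",
    contracts: { vault: "0x1111111111111111111111111111111111111111" },
  },
  84532: {
    rpcUrl: "https://sepolia.base.org",
    contracts: { vault: "0x2222222222222222222222222222222222222222" },
  },
};

// Throw on an unknown chain or missing deployment rather than defaulting.
function contractAddress(chainId: number, name: string): string {
  const chain = chains[chainId];
  if (!chain) throw new Error(`unsupported chain ${chainId}`);
  const addr = chain.contracts[name];
  if (!addr) throw new Error(`no ${name} deployment on chain ${chainId}`);
  return addr;
}
```

In practice the object would be loaded from `deployments/<chain>.json` or `public/config.json` so a redeploy only changes config, not code.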
2. RPC strategy — your biggest single dependency
| Provider | Sweet spot |
|---|---|
| Alchemy | Most features (debug, trace, webhooks, NFT APIs) |
| Infura | Battle-tested; part of Consensys stack |
| QuickNode | Wide chain support |
| Ankr / Public endpoints | Dev/backup only |
| Self-hosted (Erigon / Geth / Reth) | At scale, cheaper + no rate limits |
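The core failover logic across providers is small: try each call in order and move on when one rejects. A hand-rolled sketch of just that ordering; ethers v6's `FallbackProvider` does this properly, with quorum checks on results:

```typescript
// Minimal sequential failover: try each RPC call in order, moving on
// when a call rejects (5xx, rate limit, timeout). Provider calls are
// passed in as thunks so this stays library-agnostic.
type Call<T> = () => Promise<T>;

async function withFailover<T>(calls: Call<T>[]): Promise<T> {
  let lastErr: unknown;
  for (const call of calls) {
    try {
      return await call();
    } catch (err) {
      lastErr = err; // also log + bump a per-provider error metric here
    }
  }
  throw lastErr;
}
```

Each thunk would wrap the same method on a different `JsonRpcProvider`; the per-provider error metric is what feeds the observability table below.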
Always use ≥2 providers behind a load balancer. Fail over on 5xx / rate limit. Ethers v6 supports FallbackProvider; or use a router (e.g., dRPC).
3. Key management
- User wallets — not your problem; MetaMask / WalletConnect / RainbowKit.
- Hot keys (indexer relayer, bots) — KMS (AWS KMS, GCP KMS). Never raw keys in env vars in prod. Use @aws-sdk/client-kms + an ethers Signer adapter.
- Admin keys — Gnosis Safe multisig with threshold N-of-M. Add a timelock (e.g., 48h) so users can react to malicious proposals.
```
// Gnosis Safe + Timelock upgrade flow
Safe → schedule(timelock, contract.upgradeTo(newImpl))
  │     48h pass, users can exit
  ▼
Safe → execute(timelock, contract.upgradeTo(newImpl))
```
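The gate a timelock enforces is just a minimum delay between schedule and execute. A toy model of that check; real timelocks such as OpenZeppelin's TimelockController also track operation ids, predecessors, and roles:

```typescript
// Toy timelock: an operation scheduled at time t may only execute at t + delay.
const DELAY_SECONDS = 48 * 3600; // 48h, matching the flow above (an assumption)

interface Operation {
  scheduledAt: number; // unix seconds
}

// The on-chain execute() would revert when this returns false.
function canExecute(op: Operation, now: number, delay = DELAY_SECONDS): boolean {
  return now >= op.scheduledAt + delay;
}
```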
4. Upgradeability — only if you need it
- Transparent proxy (EIP-1967) — classic, simple.
- UUPS — upgrade logic in the implementation. Cheaper to deploy, risk of locking yourself out.
- Diamond (EIP-2535) — multi-facet; powerful, complex.
- Don't upgrade — simplest and safest. Deploy v2 and migrate.
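All three proxy patterns share the same hard constraint: existing storage slots must keep their meaning across upgrades. A toy prefix check over declared variables makes the rule concrete; OpenZeppelin's hardhat-upgrades does the real, slot-aware analysis (packing, gaps, inheritance):

```typescript
type Slot = { name: string; type: string };

// An upgrade layout is (roughly) compatible if the old layout is a prefix
// of the new one: same names and types in the same order, new variables
// only appended at the end.
function layoutCompatible(oldLayout: Slot[], newLayout: Slot[]): boolean {
  if (newLayout.length < oldLayout.length) return false;
  return oldLayout.every(
    (s, i) => newLayout[i].name === s.name && newLayout[i].type === s.type,
  );
}
```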
Gotcha — upgradeable contracts have storage-layout constraints. Changing variable order breaks everything. Use OpenZeppelin's hardhat-upgrades with storage checks.
5. CI/CD
```yaml
# .github/workflows/contracts.yml
name: contracts
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx hardhat test
      - run: npx hardhat coverage
      - name: Slither
        uses: crytic/slither-action@v0.4.0
        with: { fail-on: medium }
      - name: Gas report
        run: REPORT_GAS=true npx hardhat test
```
Ship a deploy workflow that is gated on manual approval + tag, and writes deployments/<chain>.json back to the repo as a PR.
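The write-back step of that deploy workflow can be a tiny script. A sketch; the `deployments/` directory follows the convention above, and the record shape (commit, timestamp, per-contract tx hash) is an assumption:

```typescript
import { writeFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";

interface DeploymentRecord {
  chainId: number;
  contracts: Record<string, { address: string; txHash: string }>;
  commit: string;     // git SHA the artifacts were built from
  deployedAt: string; // ISO timestamp
}

// Write deployments/<chain>.json; CI then opens a PR with this file so the
// deployed addresses are reviewed and versioned like any other change.
function recordDeployment(repoRoot: string, chain: string, rec: DeploymentRecord): string {
  const dir = join(repoRoot, "deployments");
  mkdirSync(dir, { recursive: true });
  const file = join(dir, `${chain}.json`);
  writeFileSync(file, JSON.stringify(rec, null, 2) + "\n");
  return file;
}
```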
6. Observability
| Signal | Why | Tool |
|---|---|---|
| RPC latency & error rate | Detects provider issues | Prom + Grafana |
| Indexer lag (head - cursor) | Alerts on stuck ingest | Custom metric |
| Stuck tx in mempool | Relayer nonce jam | Tenderly / custom |
| Contract balances | Treasury drift, drains | Forta / custom |
| Event-rate anomalies | Attack detection | OpenZeppelin Defender Sentinels |
| Gas spikes | Exec budget awareness | Blocknative / EthGasStation |
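The indexer-lag signal in the table is one subtraction plus a threshold. A sketch of the alert condition; the 300-block default is an assumption (~1 hour at 12s Ethereum blocks):

```typescript
// lag_blocks = chain head - indexer cursor; alert when it exceeds a budget.
function lagBlocks(head: bigint, cursor: bigint): bigint {
  return head > cursor ? head - cursor : 0n;
}

// Emitted on every poll; the alerting rule lives in Grafana/Alertmanager,
// but keeping the predicate in code makes it unit-testable.
function shouldAlert(head: bigint, cursor: bigint, maxLag = 300n): boolean {
  return lagBlocks(head, cursor) > maxLag;
}
```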
7. Docker + infra shape
```
              [ CDN / Cloudflare ]
                      │
            ┌─────────┼──────────┐
            ▼         ▼          ▼
           UI        API    WebSocket gateway
        (static)   (Node)   (Node, ethers WSS)
                      │
                      ▼
  Postgres (managed, RDS/Neon)    Redis (cache, queues)
                      │
                      ▼
  Indexer workers (Node/Go) ── RPC LB ── Alchemy / Infura / self-hosted geth
                      │
                      ▼
  Relayer (KMS-signed) → tx submission
```
For 10–50k DAU, a single VPS + managed DB + Alchemy free tier is enough. Scale by adding worker replicas and a pull-oracle pattern for price fetch.
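The single-VPS shape can be captured in a compose file. A sketch only: service names, build paths, and env vars are illustrative, and the managed Postgres deliberately lives outside compose:

```yaml
# docker-compose.yml — illustrative single-VPS layout
services:
  api:
    build: ./api
    environment:
      RPC_URL: ${RPC_URL}           # points at the RPC load balancer
      DATABASE_URL: ${DATABASE_URL} # managed Postgres (RDS/Neon)
    ports: ["3000:3000"]
  indexer:
    build: ./indexer
    environment:
      RPC_URL: ${RPC_URL}
      DATABASE_URL: ${DATABASE_URL}
    restart: unless-stopped        # workers must come back after a crash
  redis:
    image: redis:7-alpine
```

Scaling to worker replicas is then `docker compose up --scale indexer=3` (or the equivalent in your orchestrator) rather than an architecture change.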
8. Deploying the frontend
- Static build (Vite) → Vercel / Cloudflare Pages. No secrets ship client-side.
- Config per chain in public/config.json; don't rebuild per env.
- Set Content-Security-Policy; forbid inline scripts except vetted ones.
- Pin RPC endpoints in your own backend; don't expose API keys to the browser.
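A starting-point CSP for a wallet-connecting frontend might look like the fragment below. Hedged: the exact connect-src entries depend on your wallet SDK and backend origins, and `rpc.example.com` is a placeholder for your own RPC proxy:

```
Content-Security-Policy:
  default-src 'self';
  script-src 'self';
  connect-src 'self' https://rpc.example.com wss://relay.walletconnect.com;
  frame-ancestors 'none';
```

`frame-ancestors 'none'` also blocks clickjacking-style "sign this transaction" overlays from embedding your dApp.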
9. Monitoring with OpenZeppelin Defender / Tenderly
- Defender Sentinels — condition-based alerts on events (e.g., "paging if withdraw > $100k").
- Tenderly — tx simulator + alerts + Web3 Actions (serverless reactors). Great for incident forensics.
- Forta — federated monitoring network; detect protocol anomalies with custom bots.
10. Incident response
- Pause switch runbook: who calls it, how many sigs, how to communicate.
- Public post-mortem template (within 72h of major incident).
- Bug bounty on Immunefi with clear scope and severity payouts.
- "War room" Slack channel; dedicated status page.
11. Project
Capstone — integrate everything: Phase 6 contracts, Phase 7 UI, Phase 8 indexer. Deploy to Base Sepolia. Wrap ownership in a Safe multisig. Add Grafana dashboards for RPC + indexer lag. Write a README that a fresh engineer can use to redeploy from scratch in < 30 minutes.
Quiz
Q. Your indexer silently fell behind by 4 hours because the RPC returned a 200 with an empty logs: [] array. How do you detect this kind of failure?
- Check HTTP status codes
- Trust Alchemy's SLA
- Emit a metric lag_blocks = head − cursor and alert when it exceeds a threshold; cross-check with a second RPC
- Restart nightly
"Empty result" is indistinguishable from "nothing happened". Always compute lag against the current head and run a shadow check against a second provider.
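The shadow check from the answer can be a periodic comparison of heads from two independent providers. A sketch with the provider calls injected as functions (the 5-block tolerance is an assumption to absorb normal propagation skew):

```typescript
// Compare block heads from two independent providers. A large split means
// one of them is stalled or lying even though it still returns HTTP 200.
type HeadFn = () => Promise<number>;

async function shadowCheck(
  primary: HeadFn,
  secondary: HeadFn,
  maxSplit = 5,
): Promise<boolean> {
  const [a, b] = await Promise.all([primary(), secondary()]);
  return Math.abs(a - b) <= maxSplit; // false ⇒ page someone
}
```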