Production Engineering
The bridge from "works on Hardhat" to "survives a Monday morning with 50k users and a flaky RPC". This phase is pure DevOps with Web3 quirks.
Goal — ship a full-stack dApp: verified contracts, monitored indexer, HA RPC, CI/CD pipeline, pager runbooks.
1. The deploy checklist
- [ ] Contracts audited (or at least: Slither clean + 90%+ test coverage + fuzzing).
- [ ] Deployment is a script in Git (no "I ran it from my laptop").
- [ ] Constructor args committed; contract verified on Etherscan/Basescan.
- [ ] Owner is a multisig (Gnosis Safe) or timelock, not a single EOA.
- [ ] Emergency pause tested.
- [ ] Frontend reads contract addresses from config per chain, not hard-coded.
- [ ] Indexer has backfill strategy and reorg safety.
- [ ] Monitoring & alerts wired for stuck txs, RPC errors, event gap.
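The "addresses from config per chain" item can be a small typed lookup that fails loudly instead of silently using a wrong address. A minimal sketch; the chain IDs are Base mainnet/Sepolia, but the contract names and addresses are placeholders:

```typescript
// Per-chain contract addresses live in config, never hard-coded in components.
type ChainConfig = { rpcUrl: string; contracts: Record<string, string> };

const chains: Record<number, ChainConfig> = {
  // Placeholder addresses, for illustration only.
  8453: {
    rpcUrl: "https://mainnet.base.org",
    contracts: { vault: "0x1111111111111111111111111111111111111111" },
  },
  84532: {
    rpcUrl: "https://sepolia.base.org",
    contracts: { vault: "0x2222222222222222222222222222222222222222" },
  },
};

// Throw on an unknown chain or missing deployment rather than defaulting.
function contractAddress(chainId: number, name: string): string {
  const chain = chains[chainId];
  if (!chain) throw new Error(`unsupported chain ${chainId}`);
  const addr = chain.contracts[name];
  if (!addr) throw new Error(`no ${name} deployment on chain ${chainId}`);
  return addr;
}
```

In practice the object would be loaded from `deployments/<chain>.json` or `public/config.json` so a redeploy only changes config, not code.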
2. RPC strategy — your biggest single dependency
| Provider | Sweet spot |
|---|---|
| Alchemy | Most features (debug, trace, webhooks, NFT APIs) |
| Infura | Battle-tested; part of Consensys stack |
| QuickNode | Wide chain support |
| Ankr / Public endpoints | Dev/backup only |
| Self-hosted (Erigon / Geth / Reth) | At scale, cheaper + no rate limits |
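The core failover logic across providers is small: try each call in order and move on when one rejects. A hand-rolled sketch of just that ordering; ethers v6's `FallbackProvider` does this properly, with quorum checks on results:

```typescript
// Minimal sequential failover: try each RPC call in order, moving on
// when a call rejects (5xx, rate limit, timeout). Provider calls are
// passed in as thunks so this stays library-agnostic.
type Call<T> = () => Promise<T>;

async function withFailover<T>(calls: Call<T>[]): Promise<T> {
  let lastErr: unknown;
  for (const call of calls) {
    try {
      return await call();
    } catch (err) {
      lastErr = err; // also log + bump a per-provider error metric here
    }
  }
  throw lastErr;
}
```

Each thunk would wrap the same method on a different `JsonRpcProvider`; the per-provider error metric is what feeds the observability table below.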
Always use ≥2 providers behind a load balancer. Fail over on 5xx / rate limit. Ethers v6 supports FallbackProvider; or use a router (e.g., dRPC).
3. Key management
- User wallets — not your problem; MetaMask / WalletConnect / RainbowKit.
- Hot keys (indexer relayer, bots) — KMS (AWS KMS, GCP KMS). Never raw keys in env vars in prod. Use @aws-sdk/client-kms + an ethers Signer adapter.
- Admin keys — Gnosis Safe multisig with threshold N-of-M. Add a timelock (e.g., 48h) so users can react to malicious proposals.
```
// Gnosis Safe + Timelock upgrade flow
Safe → schedule(timelock, contract.upgradeTo(newImpl))
  │     48h pass, users can exit
  ▼
Safe → execute(timelock, contract.upgradeTo(newImpl))
```
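The gate a timelock enforces is just a minimum delay between schedule and execute. A toy model of that check; real timelocks such as OpenZeppelin's TimelockController also track operation ids, predecessors, and roles:

```typescript
// Toy timelock: an operation scheduled at time t may only execute at t + delay.
const DELAY_SECONDS = 48 * 3600; // 48h, matching the flow above (an assumption)

interface Operation {
  scheduledAt: number; // unix seconds
}

// The on-chain execute() would revert when this returns false.
function canExecute(op: Operation, now: number, delay = DELAY_SECONDS): boolean {
  return now >= op.scheduledAt + delay;
}
```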
4. Upgradeability — only if you need it
- Transparent proxy (EIP-1967) — classic, simple.
- UUPS — upgrade logic in the implementation. Cheaper to deploy, risk of locking yourself out.
- Diamond (EIP-2535) — multi-facet; powerful, complex.
- Don't upgrade — simplest and safest. Deploy v2 and migrate.
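All three proxy patterns share the same hard constraint: existing storage slots must keep their meaning across upgrades. A toy prefix check over declared variables makes the rule concrete; OpenZeppelin's hardhat-upgrades does the real, slot-aware analysis (packing, gaps, inheritance):

```typescript
type Slot = { name: string; type: string };

// An upgrade layout is (roughly) compatible if the old layout is a prefix
// of the new one: same names and types in the same order, new variables
// only appended at the end.
function layoutCompatible(oldLayout: Slot[], newLayout: Slot[]): boolean {
  if (newLayout.length < oldLayout.length) return false;
  return oldLayout.every(
    (s, i) => newLayout[i].name === s.name && newLayout[i].type === s.type,
  );
}
```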
Gotcha — upgradeable contracts have storage-layout constraints. Changing variable order breaks everything. Use OpenZeppelin's hardhat-upgrades with storage checks.
5. CI/CD
```yaml
# .github/workflows/contracts.yml
name: contracts
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx hardhat test
      - run: npx hardhat coverage
      - name: Slither
        uses: crytic/slither-action@v0.4.0
        with: { fail-on: medium }
      - name: Gas report
        run: REPORT_GAS=true npx hardhat test
```
Ship a deploy workflow that is gated on manual approval + tag, and writes deployments/<chain>.json back to the repo as a PR.
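The write-back step of that deploy workflow can be a tiny script. A sketch; the `deployments/` directory follows the convention above, and the record shape (commit, timestamp, per-contract tx hash) is an assumption:

```typescript
import { writeFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";

interface DeploymentRecord {
  chainId: number;
  contracts: Record<string, { address: string; txHash: string }>;
  commit: string;     // git SHA the artifacts were built from
  deployedAt: string; // ISO timestamp
}

// Write deployments/<chain>.json; CI then opens a PR with this file so the
// deployed addresses are reviewed and versioned like any other change.
function recordDeployment(repoRoot: string, chain: string, rec: DeploymentRecord): string {
  const dir = join(repoRoot, "deployments");
  mkdirSync(dir, { recursive: true });
  const file = join(dir, `${chain}.json`);
  writeFileSync(file, JSON.stringify(rec, null, 2) + "\n");
  return file;
}
```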
6. Observability
| Signal | Why | Tool |
|---|---|---|
| RPC latency & error rate | Detects provider issues | Prom + Grafana |
| Indexer lag (head - cursor) | Alerts on stuck ingest | Custom metric |
| Stuck tx in mempool | Relayer nonce jam | Tenderly / custom |
| Contract balances | Treasury drift, drains | Forta / custom |
| Event-rate anomalies | Attack detection | OpenZeppelin Defender Sentinels |
| Gas spikes | Exec budget awareness | Blocknative / EthGasStation |
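The indexer-lag signal in the table is one subtraction plus a threshold. A sketch of the alert condition; the 300-block default is an assumption (~1 hour at 12s Ethereum blocks):

```typescript
// lag_blocks = chain head - indexer cursor; alert when it exceeds a budget.
function lagBlocks(head: bigint, cursor: bigint): bigint {
  return head > cursor ? head - cursor : 0n;
}

// Emitted on every poll; the alerting rule lives in Grafana/Alertmanager,
// but keeping the predicate in code makes it unit-testable.
function shouldAlert(head: bigint, cursor: bigint, maxLag = 300n): boolean {
  return lagBlocks(head, cursor) > maxLag;
}
```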
7. Docker + infra shape
```
              [ CDN / Cloudflare ]
                      │
            ┌─────────┼──────────┐
            ▼         ▼          ▼
           UI        API    WebSocket gateway
        (static)   (Node)   (Node, ethers WSS)
                      │
                      ▼
  Postgres (managed, RDS/Neon)    Redis (cache, queues)
                      │
                      ▼
  Indexer workers (Node/Go) ── RPC LB ── Alchemy / Infura / self-hosted geth
                      │
                      ▼
  Relayer (KMS-signed) → tx submission
```
For 10–50k DAU, a single VPS + managed DB + Alchemy free tier is enough. Scale by adding worker replicas and a pull-oracle pattern for price fetch.
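The single-VPS shape can be captured in a compose file. A sketch only: service names, build paths, and env vars are illustrative, and the managed Postgres deliberately lives outside compose:

```yaml
# docker-compose.yml — illustrative single-VPS layout
services:
  api:
    build: ./api
    environment:
      RPC_URL: ${RPC_URL}           # points at the RPC load balancer
      DATABASE_URL: ${DATABASE_URL} # managed Postgres (RDS/Neon)
    ports: ["3000:3000"]
  indexer:
    build: ./indexer
    environment:
      RPC_URL: ${RPC_URL}
      DATABASE_URL: ${DATABASE_URL}
    restart: unless-stopped        # workers must come back after a crash
  redis:
    image: redis:7-alpine
```

Scaling to worker replicas is then `docker compose up --scale indexer=3` (or the equivalent in your orchestrator) rather than an architecture change.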
8. Deploying the frontend
- Static build (Vite) → Vercel / Cloudflare Pages. No secrets ship client-side.
- Config per chain in public/config.json; don't rebuild per env.
- Set Content-Security-Policy; forbid inline scripts except vetted ones.
- Pin RPC endpoints in your own backend; don't expose API keys to the browser.
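A starting-point CSP for a wallet-connecting frontend might look like the fragment below. Hedged: the exact connect-src entries depend on your wallet SDK and backend origins, and `rpc.example.com` is a placeholder for your own RPC proxy:

```
Content-Security-Policy:
  default-src 'self';
  script-src 'self';
  connect-src 'self' https://rpc.example.com wss://relay.walletconnect.com;
  frame-ancestors 'none';
```

`frame-ancestors 'none'` also blocks clickjacking-style "sign this transaction" overlays from embedding your dApp.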
9. Monitoring with OpenZeppelin Defender / Tenderly
- Defender Sentinels — condition-based alerts on events (e.g., "paging if withdraw > $100k").
- Tenderly — tx simulator + alerts + Web3 Actions (serverless reactors). Great for incident forensics.
- Forta — federated monitoring network; detect protocol anomalies with custom bots.
10. Incident response
- Pause switch runbook: who calls it, how many sigs, how to communicate.
- Public post-mortem template (within 72h of major incident).
- Bug bounty on Immunefi with clear scope and severity payouts.
- "War room" Slack channel; dedicated status page.
11. Project
Capstone — integrate everything: Phase 6 contracts, Phase 7 UI, Phase 8 indexer. Deploy to Base Sepolia. Wrap ownership in a Safe multisig. Add Grafana dashboards for RPC + indexer lag. Write a README that a fresh engineer can use to redeploy from scratch in < 30 minutes.
Quiz
Q. Your indexer silently fell behind by 4 hours because the RPC returned a 200 with an empty logs: [] array. How do you detect this kind of failure?
- Check HTTP status codes
- Trust Alchemy's SLA
- Emit a metric lag_blocks = head − cursor and alert when it exceeds a threshold; cross-check with a second RPC
- Restart nightly
"Empty result" is indistinguishable from "nothing happened". Always compute lag against the current head and run a shadow check against a second provider.
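The shadow check from the answer can be a periodic comparison of heads from two independent providers. A sketch with the provider calls injected as functions (the 5-block tolerance is an assumption to absorb normal propagation skew):

```typescript
// Compare block heads from two independent providers. A large split means
// one of them is stalled or lying even though it still returns HTTP 200.
type HeadFn = () => Promise<number>;

async function shadowCheck(
  primary: HeadFn,
  secondary: HeadFn,
  maxSplit = 5,
): Promise<boolean> {
  const [a, b] = await Promise.all([primary(), secondary()]);
  return Math.abs(a - b) <= maxSplit; // false ⇒ page someone
}
```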