Principles
- User value first: business outcomes over ticket throughput.
- Small, shippable slices: feature flags, trunk-based dev, fast feedback.
- Interfaces > Internals: stable APIs, strong contracts, typed schemas.
- Observability by default: logs, metrics, and traces in the day-one PR.
- Automate the boring: CI/CD, codegen, scaffolds, tests, lint/format.
APIs
- Design: start with an API spec (OpenAPI/AsyncAPI). Keep nouns consistent, verbs predictable, and pagination standard.
- Versioning: /v1 in path; backwards-compatible additions; use deprecation headers.
- Contracts: JSON Schema + examples; validation at edge; typed SDKs generated from spec.
- Resilience: idempotency keys, retries with jitter, timeouts, and circuit breakers.
- Docs: living reference + task-centric guides; include copy-pasteable cURL and code.
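The resilience bullet above can be illustrated with a minimal sketch of a client-side retry loop. It assumes exponential backoff with "full jitter" and a single idempotency key reused across attempts. The names `TransientError`, `backoff_delays`, and `call_with_retries`, and the `send(payload, key)` signature, are hypothetical and not part of any particular SDK. Timeouts are left to the `send` callable, and circuit breaking is out of scope here.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one full-jitter delay per retry: rng() * min(cap, base * 2**n)."""
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(send, payload, attempts=4, sleep=time.sleep):
    """Call send(payload, idempotency_key) until it succeeds or attempts run out.

    The same idempotency key is reused on every attempt so the server can
    deduplicate a request that succeeded but whose response was lost.
    """
    key = str(uuid.uuid4())
    delays = backoff_delays(attempts - 1)  # one delay between each pair of attempts
    while True:
        try:
            return send(payload, key)
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted; surface the last error
            sleep(delay)
```

Reusing one key for the whole retry sequence is what makes the retries safe for non-idempotent operations such as POST. Generating a fresh key per attempt would defeat server-side deduplication.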
Dashboards & UX
- Jobs to be done: one main task per view; secondary tasks tucked away.
- State & errors: explicit loading/empty/error states; optimistic updates where safe.
- Accessibility: keyboard navigation, ARIA roles, color contrast, RTL support.
- Perf: code-split, cache queries, avoid N+1 fetches; instrument Web Vitals.
Services & Architecture
Client
→ API Gateway (auth, rate limits)
→ Services (stateless) ↔ Data Stores (managed)
→ Async Workers / Queues (idempotent)
→ Observability Stack (logs/metrics/traces)
- Choose scope: modular monolith → microservices only when necessary.
- Data: pick the simplest store that works; single writer per entity; migrations in code.
- Asynchronicity: queues for slow/fragile work; design for at-least-once.
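Designing for at-least-once delivery usually means the consumer must tolerate redelivered messages. The sketch below is one way to do that, assuming every message carries a unique `id` field. The `IdempotentConsumer` class is hypothetical, and the in-memory set stands in for what would need to be a durable store in a real worker.

```python
class IdempotentConsumer:
    """Apply each message's effect once, even if the queue redelivers it."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: a durable store in a real deployment

    def handle(self, message):
        """Return True if the message was processed, False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self._seen:
            return False
        self._handler(message)
        # Record the ID only after the handler succeeds, so a crash mid-handler
        # leads to a redelivery rather than a silently dropped message.
        self._seen.add(message_id)
        return True
```

A crash between the handler finishing and the ID being recorded still causes one repeat. Closing that gap typically requires writing the effect and the ID in the same transaction, or making the handler itself safe to repeat.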
Observability
- Logs: structured (JSON), request IDs, PII scrubbing, sampling.
- Metrics: RED (rate, errors, duration) for services; USE for infra; SLOs with error budgets.
- Traces: propagate traceparent; instrument hot paths; tag user/org/feature flags.
- Dashboards & alerts: symptom-based, low-noise; paging for SLO burn, not every 500.
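"Paging for SLO burn" can be made concrete with a burn-rate calculation: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a two-window check, one long and one short, and an illustrative threshold of 14.4. The function names and the threshold are assumptions to tune per service, not values taken from this playbook.

```python
def burn_rate(errors, requests, slo_target):
    """Ratio of the observed error rate to the error rate the SLO allows."""
    if requests == 0:
        return 0.0
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only when both windows burn fast; each window is (errors, requests)."""
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)
```

Requiring both windows to exceed the threshold keeps a brief spike from paging anyone, and lets the alert clear soon after the errors stop.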
Testing Strategy
- Pyramid: fast unit tests → focused integration → a few e2e happy paths.
- Contracts: consumer-driven tests for APIs; schema checks in CI.
- Fixtures: deterministic seeds; ephemeral envs for PRs; test containers for deps.
- Non-func: load tests for P95/P99; security scans; migration dry-runs.
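A load-test gate on P95/P99 needs an agreed definition of percentile. The sketch below uses the nearest-rank method, which is one common choice among several. The `percentile` helper and the budget check in the usage note are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

A CI step could then fail the build with something like `assert percentile(latencies_ms, 95) <= budget_ms`, where `latencies_ms` and `budget_ms` come from the load-test run and the service's performance budget.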
CI/CD
- CI: lint/format → unit → integration (containers) → security checks → artifact build.
- CD: blue/green or canary; automated migrations; instant rollback and feature flags.
- Policy: required reviews, status checks, conventional commits, signed images.
- Speed: cache deps, parallelize jobs, fail fast; target <10 min CI wall time.
Security & Compliance
- AuthN/Z: OAuth/OIDC, least privilege, per-tenant scoping; service-to-service with mTLS.
- Secrets: never in env files or code; use a secrets manager; short-lived tokens.
- Data: encryption in transit/at rest; audit trails; data retention and deletion jobs.
- Supply chain: SBOMs, image signing, dep-update bots, SAST/DAST in CI.
Performance & Scalability
- Budgets: SLO P50/P95, cold-start targets, memory/CPU caps.
- Caching: request-level, computed results, and read-through; bust with care.
- Backpressure: queues, bulkheads, rate-limits, adaptive concurrency.
- Cost: per-request cost and per-tenant cost tracked in metrics.
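Of the backpressure mechanisms listed above, rate limiting is the simplest to sketch. The example below is a minimal single-process token bucket with an injectable clock so it can be tested deterministically. The `TokenBucket` class is hypothetical and is not thread-safe as written.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost=1.0):
        """Spend `cost` tokens if available; return False to shed the request."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the bucket's capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Keeping one bucket per tenant would also fit the per-tenant cost tracking mentioned above, since rejected and accepted requests can be counted per tenant.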
Release & On-call
- Runbooks: link from alerts; include quick triage, dashboards, and rollback steps.
- Incident lifecycle: severity levels, comms templates, postmortems with actions.
- Change management: weekly release notes; feature flag kill switches.
Sane Defaults (copy/paste)
APIs: OpenAPI, JSON Schema, idempotency keys, retries, timeouts
Dashboards: optimistic UI; error/loading/empty states
Services: stateless; queues for long tasks; one writer per entity
Observability: logs(JSON) + RED + traces; SLOs + burn alerts
Testing: unit > integration > e2e; contract tests; test containers
CI/CD: <10m CI; canary deploys; instant rollback; feature flags
Security: OIDC; mTLS between services; secrets manager
Perf: P95 budgeted; caching; backpressure; per-tenant cost
30-60-90 Roadmap
- 30d: scaffold service template (spec, tracing, health), CI linters/tests, staging env.
- 60d: contract tests, canary deploy, basic SLOs + dashboards, incident runbooks.
- 90d: load tests in CI, error budget policy, cost dashboards, automated dep updates.