Deployment¶
Environments¶
| Env | Purpose | Cluster |
|---|---|---|
| `dev` | Day-to-day engineering | GKE Autopilot |
| `staging` | Integration; always-on | GKE Autopilot |
| `uat` | Customer acceptance | GKE Autopilot |
| `prod` | Production | GKE Autopilot (plus dedicated AI / BIM node pools) |
Each environment has its own Terraform state, AlloyDB instances, GCS buckets, and KMS keys. Environments are symmetric — same Helm chart, different values.
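The "same chart, different values" pattern can be sketched as a per-environment values override; the file path, keys, and resource names below are illustrative, not the chart's actual layout:

```yaml
# deploy/charts/siss/values-staging.yaml — hypothetical override file;
# the base values.yaml stays identical across all four environments.
global:
  environment: staging
  gcpProject: siss-staging        # assumed project naming convention
alloydb:
  instance: siss-staging-primary  # each env gets its own instance
gcs:
  bucket: siss-staging-artifacts
kms:
  keyRing: siss-staging
```

A deploy then differs only in which values file is passed, e.g. `helm upgrade siss deploy/charts/siss -f values-staging.yaml`.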
Compute¶
- GKE Autopilot for all services.
- Dedicated node pools for AI workers (high memory) and BIM workers (high CPU + optional GPU).
- GPU node pool is provisioned on-demand; default is managed Vertex AI calls, not self-hosted inference.
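Scheduling work onto the dedicated pools is typically done with a node selector plus a matching taint and toleration; the pool name and taint key below are assumptions, not the cluster's actual values:

```yaml
# Hypothetical fragment of an AI-worker Deployment pod spec: pin pods
# to the high-memory pool and tolerate its dedicated taint.
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: ai-workers   # assumed pool name
  tolerations:
    - key: workload                             # assumed taint key
      operator: Equal
      value: ai
      effect: NoSchedule
  containers:
    - name: ai-worker
      resources:
        requests:
          memory: 16Gi   # high-memory profile for AI workloads
          cpu: "4"
```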
Release pipeline¶
```mermaid
flowchart LR
    PR[PR merged] --> CI[CI<br/>build · test · scan]
    CI --> IMG[Push image<br/>Artifact Registry]
    IMG --> DEV[ArgoCD<br/>dev rollout]
    DEV --> STG[Promote<br/>staging]
    STG --> UAT[Promote<br/>uat]
    UAT --> PROD[Canary<br/>prod]
    PROD --> WATCH[SLO burn-rate alert]
    WATCH -->|breach| RB[Automated rollback]
    WATCH -->|healthy| FULL[Full rollout]
```
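The SLO burn-rate gate in the diagram is commonly implemented as a multi-window alert; the metric names, labels, and thresholds below are illustrative, not the platform's actual rules:

```yaml
# Hypothetical PrometheusRule: fire (and trigger rollback) when the
# canary burns error budget fast over both a long and a short window,
# which filters out brief blips while still catching fast burns.
groups:
  - name: canary-slo
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5..",track="canary"}[1h]))
            / sum(rate(http_requests_total{track="canary"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5..",track="canary"}[5m]))
            / sum(rate(http_requests_total{track="canary"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
```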
Infrastructure as code¶
| Concern | Tool |
|---|---|
| GCP resources | Terraform (modules under infra/terraform/modules/*, environments under infra/terraform/env/<env>/) |
| Service charts | Helm per service; umbrella chart deploy/charts/siss/ |
| Continuous deploy | ArgoCD (or Cloud Deploy — interchangeable) |
| Container images | Artifact Registry |
Every service ships:
- A Helm chart with values per environment.
- A Terraform module defining its AlloyDB instance, GCS buckets, KMS keys, and IAM.
- A `.argocd/` manifest wiring it into the rollout.
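Wiring a service into the rollout might look like the following ArgoCD `Application`; the repo URL, service name, and project are placeholders, not the actual values:

```yaml
# Hypothetical per-service ArgoCD Application for the dev environment.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-service-dev       # placeholder service name
  namespace: argocd
spec:
  project: siss
  source:
    repoURL: https://example.com/siss/platform.git  # placeholder repo
    path: deploy/charts/siss
    helm:
      valueFiles:
        - values-dev.yaml         # per-environment values override
  destination:
    server: https://kubernetes.default.svc
    namespace: example-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```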
Progressive delivery¶
- Canary rollout in `prod` — new version serves a small percentage of traffic first.
- SLO burn-rate alerts wired to the rollout controller — a breach triggers automated rollback before it reaches the full fleet.
- Manual approval gate for any schema migration that isn't additive.
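One way to express the canary-plus-rollback behaviour is an Argo Rollouts strategy with an analysis gate; the step weights, pause durations, and template name are illustrative assumptions:

```yaml
# Hypothetical Argo Rollouts canary strategy: shift a small slice of
# traffic, hold while the SLO analysis runs, then continue or abort.
strategy:
  canary:
    steps:
      - setWeight: 5                        # small percentage first
      - pause: {duration: 10m}
      - analysis:
          templates:
            - templateName: slo-burn-rate   # assumed AnalysisTemplate
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100                      # full rollout
```

A failed analysis run aborts the rollout, which maps to the "breach triggers automated rollback" path above.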
Secrets¶
- Google Secret Manager only. No secrets in Kustomize, Helm values, or env files.
- CMEK on AlloyDB and GCS.
- mTLS between services via Anthos Service Mesh (or Istio).
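How workloads consume Secret Manager values is not specified here; one common pattern is External Secrets Operator syncing a value into a Kubernetes Secret at runtime. Treat the store name and secret paths below as assumptions:

```yaml
# Hypothetical ExternalSecret: pull a value from Google Secret Manager
# at deploy time; nothing is committed to Helm values or env files.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-service-db
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager      # assumed store backed by gcpsm
  target:
    name: example-service-db      # resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: example-service-db-password  # Secret Manager secret name
```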
What each environment is for¶
- dev — engineers push freely; cluster is cheap and can be torn down.
- staging — integration target; contract tests run here continuously.
- uat — customer sign-off per milestone; stable for the duration of a UAT cycle.
- prod — canary-first, SLO-gated.
Hypercare window¶
M7 (go-live) includes a hypercare window where the delivery team remains on rotation alongside ops. After hypercare, support reverts to the standard on-call rotation with runbooks.