Deployment¶
Environments¶
| Env | Purpose | Cluster |
|---|---|---|
| `dev` | Day-to-day engineering | GKE Autopilot |
| `staging` | Integration; always-on | GKE Autopilot |
| `uat` | Customer acceptance | GKE Autopilot |
| `prod` | Production | GKE Autopilot (plus dedicated AI / BIM node pools) |
Each environment has its own Terraform state, AlloyDB instances, GCS buckets, and KMS keys. Environments are symmetric — same Helm chart, different values.
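The "same chart, different values" pattern can be sketched as a per-environment values override; the file path, keys, and resource names below are illustrative, not the chart's actual layout:

```yaml
# deploy/charts/siss/values-staging.yaml — hypothetical override file;
# the base values.yaml stays identical across all four environments.
global:
  environment: staging
  gcpProject: siss-staging        # assumed project naming convention
alloydb:
  instance: siss-staging-primary  # each env gets its own instance
gcs:
  bucket: siss-staging-artifacts
kms:
  keyRing: siss-staging
```

A deploy then differs only in which values file is passed, e.g. `helm upgrade siss deploy/charts/siss -f values-staging.yaml`.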
Compute¶
- GKE Autopilot for all services.
- Dedicated node pools for AI workers (high memory) and BIM workers (high CPU + optional GPU).
- GPU node pool is provisioned on-demand; default is managed Vertex AI calls, not self-hosted inference.
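Scheduling work onto the dedicated pools is typically done with a node selector plus a matching taint and toleration; the pool name and taint key below are assumptions, not the cluster's actual values:

```yaml
# Hypothetical fragment of an AI-worker Deployment pod spec: pin pods
# to the high-memory pool and tolerate its dedicated taint.
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: ai-workers   # assumed pool name
  tolerations:
    - key: workload                             # assumed taint key
      operator: Equal
      value: ai
      effect: NoSchedule
  containers:
    - name: ai-worker
      resources:
        requests:
          memory: 16Gi   # high-memory profile for AI workloads
          cpu: "4"
```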
Release pipeline¶
```mermaid
flowchart LR
    PR[PR merged] --> CI[CI<br/>build · test · scan]
    CI --> IMG[Push image<br/>Artifact Registry]
    IMG --> DEV[ArgoCD<br/>dev rollout]
    DEV --> STG[Promote<br/>staging]
    STG --> UAT[Promote<br/>uat]
    UAT --> PROD[Canary<br/>prod]
    PROD --> WATCH[SLO burn-rate alert]
    WATCH -->|breach| RB[Automated rollback]
    WATCH -->|healthy| FULL[Full rollout]
```
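The SLO burn-rate gate in the diagram is commonly implemented as a multi-window alert; the metric names, labels, and thresholds below are illustrative, not the platform's actual rules:

```yaml
# Hypothetical PrometheusRule: fire (and trigger rollback) when the
# canary burns error budget fast over both a long and a short window,
# which filters out brief blips while still catching fast burns.
groups:
  - name: canary-slo
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5..",track="canary"}[1h]))
            / sum(rate(http_requests_total{track="canary"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5..",track="canary"}[5m]))
            / sum(rate(http_requests_total{track="canary"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
```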
Infrastructure as code¶
| Concern | Tool |
|---|---|
| GCP resources | Terraform (modules under infra/terraform/modules/*, environments under infra/terraform/env/<env>/) |
| Service charts | Helm per service; umbrella chart deploy/charts/siss/ |
| Continuous deploy | ArgoCD (or Cloud Deploy — interchangeable) |
| Container images | Artifact Registry |
Every service ships:
- A Helm chart with values per environment.
- A Terraform module defining its AlloyDB instance, GCS buckets, KMS keys, and IAM.
- A `.argocd/` manifest wiring it into the rollout.
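Wiring a service into the rollout might look like the following ArgoCD `Application`; the repo URL, service name, and project are placeholders, not the actual values:

```yaml
# Hypothetical per-service ArgoCD Application for the dev environment.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-service-dev       # placeholder service name
  namespace: argocd
spec:
  project: siss
  source:
    repoURL: https://example.com/siss/platform.git  # placeholder repo
    path: deploy/charts/siss
    helm:
      valueFiles:
        - values-dev.yaml         # per-environment values override
  destination:
    server: https://kubernetes.default.svc
    namespace: example-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```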
Progressive delivery¶
- Canary rollout in `prod` — new version serves a small percentage of traffic first.
- SLO burn-rate alerts wired to the rollout controller — a breach triggers automated rollback before it reaches the full fleet.
- Manual approval gate for any schema migration that isn't additive.
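One way to express the canary-plus-rollback behaviour is an Argo Rollouts strategy with an analysis gate; the step weights, pause durations, and template name are illustrative assumptions:

```yaml
# Hypothetical Argo Rollouts canary strategy: shift a small slice of
# traffic, hold while the SLO analysis runs, then continue or abort.
strategy:
  canary:
    steps:
      - setWeight: 5                        # small percentage first
      - pause: {duration: 10m}
      - analysis:
          templates:
            - templateName: slo-burn-rate   # assumed AnalysisTemplate
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100                      # full rollout
```

A failed analysis run aborts the rollout, which maps to the "breach triggers automated rollback" path above.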
Secrets¶
- Google Secret Manager only. No secrets in Kustomize, Helm values, or env files.
- CMEK on AlloyDB and GCS.
- mTLS between services via Anthos Service Mesh (or Istio).
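How workloads consume Secret Manager values is not specified here; one common pattern is External Secrets Operator syncing a value into a Kubernetes Secret at runtime. Treat the store name and secret paths below as assumptions:

```yaml
# Hypothetical ExternalSecret: pull a value from Google Secret Manager
# at deploy time; nothing is committed to Helm values or env files.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-service-db
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager      # assumed store backed by gcpsm
  target:
    name: example-service-db      # resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: example-service-db-password  # Secret Manager secret name
```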
What each environment is for¶
- dev — engineers push freely; cluster is cheap and can be torn down.
- staging — integration target; contract tests run here continuously.
- uat — customer sign-off per milestone; stable for the duration of a UAT cycle.
- prod — canary-first, SLO-gated.
Hypercare window¶
M7 (go-live) includes a hypercare window where the delivery team remains on rotation alongside ops. After hypercare, support reverts to the standard on-call rotation with runbooks.