Deployment

Environments

| Env | Purpose | Cluster |
| --- | --- | --- |
| dev | Day-to-day engineering | GKE Autopilot |
| staging | Integration; always-on | GKE Autopilot |
| uat | Customer acceptance | GKE Autopilot |
| prod | Production | GKE Autopilot (plus dedicated AI / BIM node pools) |

Each environment has its own Terraform state, AlloyDB instances, GCS buckets, and KMS keys. Environments are symmetric — same Helm chart, different values.

Compute

  • GKE Autopilot for all services.
  • Dedicated node pools for AI workers (high memory) and BIM workers (high CPU + optional GPU).
  • GPU node pool is provisioned on-demand; default is managed Vertex AI calls, not self-hosted inference.
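Steering AI and BIM workers onto their dedicated pools is typically expressed with node selectors and tolerations in each worker's pod spec. A minimal sketch, assuming illustrative pool labels and taint keys (the compute class, taint key, and image path below are placeholders, not actual cluster config):

```yaml
# Illustrative Deployment fragment pinning BIM workers to the high-CPU pool.
# Label, taint, and image values are assumptions for this sketch.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bim-worker
spec:
  selector:
    matchLabels:
      app: bim-worker
  template:
    metadata:
      labels:
        app: bim-worker
    spec:
      nodeSelector:
        cloud.google.com/compute-class: Performance  # Autopilot compute class (example)
      tolerations:
        - key: workload-type        # hypothetical taint on the BIM pool
          operator: Equal
          value: bim
          effect: NoSchedule
      containers:
        - name: bim-worker
          image: REGION-docker.pkg.dev/PROJECT/siss/bim-worker:TAG
          resources:
            requests:
              cpu: "8"
              memory: 16Gi
```

The same pattern, with a high-memory compute class and a GPU resource request, would apply to the AI workers.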

Release pipeline

```mermaid
flowchart LR
  PR[PR merged] --> CI[CI<br/>build · test · scan]
  CI --> IMG[Push image<br/>Artifact Registry]
  IMG --> DEV[ArgoCD<br/>dev rollout]
  DEV --> STG[Promote<br/>staging]
  STG --> UAT[Promote<br/>uat]
  UAT --> PROD[Canary<br/>prod]
  PROD --> WATCH[SLO burn-rate alert]
  WATCH -->|breach| RB[Automated rollback]
  WATCH -->|healthy| FULL[Full rollout]
```
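The ArgoCD stages of this pipeline are usually declared as one Application per service and environment. A hedged sketch of the dev stage — repo URL, paths, and names below are invented for illustration, not taken from the repo:

```yaml
# Illustrative Argo CD Application for the dev rollout stage.
# repoURL, path, namespace, and value-file names are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: siss-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/siss-deploy.git
    targetRevision: main
    path: deploy/charts/siss
    helm:
      valueFiles:
        - values-dev.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: siss
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from git
      selfHeal: true   # revert manual drift
```

Promotion to staging/uat is then a matter of pointing the next environment's Application at the promoted revision or value file.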

Infrastructure as code

| Concern | Tool |
| --- | --- |
| GCP resources | Terraform (modules under `infra/terraform/modules/*`, environments under `infra/terraform/env/<env>/`) |
| Service charts | Helm per service; umbrella chart `deploy/charts/siss/` |
| Continuous deploy | ArgoCD (or Cloud Deploy; the two are interchangeable) |
| Container images | Artifact Registry |

Every service ships:

  • A Helm chart with values per environment.
  • A Terraform module defining its AlloyDB instance, GCS buckets, KMS keys, and IAM.
  • A .argocd/ manifest wiring it into the rollout.
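The "values per environment" item above usually reduces to small override files layered on the chart defaults. A minimal sketch with illustrative keys (none of these key names or resource paths come from the actual charts):

```yaml
# values-dev.yaml — illustrative per-environment overrides for one service.
# Keys and the AlloyDB instance path are assumptions for this sketch.
replicaCount: 1
image:
  tag: latest            # dev tracks the newest image
resources:
  requests:
    cpu: 250m
    memory: 256Mi
alloydb:
  instance: projects/PROJECT/locations/REGION/clusters/dev/instances/primary
```

A `values-prod.yaml` would differ mainly in replica count, pinned image tags, resource requests, and the canary settings described below.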

Progressive delivery

  • Canary rollout in prod — new version serves a small percentage of traffic first.
  • SLO burn-rate alerts wired to the rollout controller — a breach triggers automated rollback before it reaches the full fleet.
  • Manual approval gate for any schema migration that isn't additive.
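One common way to implement this canary-plus-rollback loop is Argo Rollouts (an assumption here; the text only says "rollout controller"). A sketch with a hypothetical analysis template standing in for the SLO burn-rate check:

```yaml
# Illustrative Argo Rollouts canary with an SLO burn-rate gate.
# The template name, weights, and image path are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: REGION-docker.pkg.dev/PROJECT/siss/api:TAG
  strategy:
    canary:
      steps:
        - setWeight: 5               # new version serves a small slice first
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: slo-burn-rate  # a failing run aborts and rolls back
        - setWeight: 50
        - pause: {duration: 10m}
```

The analysis template would query the SLO burn-rate metric; a failed run aborts the rollout before the new version reaches the full fleet, matching the breach path in the release diagram.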

Secrets

  • Google Secret Manager only. No secrets in Kustomize, Helm values, or env files.
  • CMEK on AlloyDB and GCS.
  • mTLS between services via Anthos Service Mesh (or Istio).
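Mounting Secret Manager values into pods without ever writing them into Helm values or env files is commonly done with the GCP provider for the Secrets Store CSI driver (an assumption here; the text only mandates Secret Manager as the source of truth). A sketch with placeholder project and secret names:

```yaml
# Illustrative SecretProviderClass for the GCP Secret Manager CSI provider.
# PROJECT and the secret name are placeholders.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: siss-db-creds
spec:
  provider: gcp
  parameters:
    secrets: |
      - resourceName: "projects/PROJECT/secrets/db-password/versions/latest"
        path: "db-password"   # file name inside the mounted volume
```

Pods reference this class through a `csi` volume, so the secret material exists only in the mounted filesystem at runtime, never in git.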

What each environment is for

  • dev — engineers push freely; cluster is cheap and can be torn down.
  • staging — integration target; contract tests run here continuously.
  • uat — customer sign-off per milestone; stable for the duration of a UAT cycle.
  • prod — canary-first, SLO-gated.

Hypercare window

M7 (go-live) includes a hypercare window where the delivery team remains on rotation alongside ops. After hypercare, support reverts to the standard on-call rotation with runbooks.