Kore CI/CD pipeline: Harness, Terraform, Artifactory, and the multi-region story

[DRAFT NOTE] This is a structural skeleton. Sections are accurate to the toolchain you’ve used; specifics tagged [TODO] — pipeline counts, change-failure-rate numbers, specific Harness/Terraform patterns — need verification before promoting.

A CI/CD pipeline is the single piece of infrastructure that every engineer interacts with every day. Get it right and shipping is invisible. Get it wrong and it’s the thing every team complains about in retros.

Kore’s pipeline has been through a few major shapes over the years. This deep dive covers what it looks like today, the reasoning behind the key choices, and the operational realities of running CI/CD across two clouds plus the residual VM workloads.

The shape of the pipeline

  Developer commit
        │
        ▼
   ┌──────────┐      ┌───────────────┐      ┌──────────────────┐
   │   Git    │─────▶│    Harness    │─────▶│   Artifactory    │
   │ (review, │      │ (orchestrator)│      │  (Docker + libs) │
   │   CI)    │      └───────────────┘      └──────────────────┘
   └──────────┘             │
                            ▼
                  ┌─────────────────────┐
                  │   Terraform plan/   │
                  │   apply per region  │
                  └─────────────────────┘
                            │
            ┌───────────────┼───────────────┐
            ▼               ▼               ▼
        ┌───────┐       ┌───────┐       ┌───────┐
        │  EKS  │       │  AKS  │       │  VMs  │
        │ (US)  │       │ (EU)  │       │(misc) │
        └───────┘       └───────┘       └───────┘

The pieces:

Git for source. Pull request reviews, branch-based workflow, CI checks on every PR before merge.
Harness as the pipeline orchestrator. Multi-stage pipelines, approval gates, environment promotion, rollback automation.
Artifactory for artefact storage (Docker images, internal libraries, Helm charts). Self-hosted — moved off JFrog Cloud for the multi-region story.
Terraform for cloud infrastructure. Modules per cloud, per region. Plan-and-apply through Harness rather than from developer machines.
Three deployment targets: EKS, AKS, and the still-existing VM-based services that haven’t migrated.

Why Harness, not Jenkins / GitHub Actions / Argo

[TODO: verify these were the actual reasons]

Harness was picked over the obvious alternatives for a few reasons:

First-class environment promotion. Pipelines that span dev → staging → multiple production regions are a built-in primitive, not a thing you script yourself.
Approval gates and audit trails. Compliance-friendly out of the box. Important for some enterprise customer contracts.
Templated pipelines. A new service starts from a template rather than from a blank Jenkinsfile. This is the same enablement principle as golden-path K8s manifests — give people a working starting point.

The cost: Harness is paid, the configuration model has its own learning curve, and there’s vendor lock-in we periodically re-evaluate. The benefits have outweighed these so far; if I were starting fresh today I’d evaluate Argo Workflows + a GitOps approach (Argo CD) seriously, because the OSS story has matured.

The Artifactory move from JFrog Cloud

[TODO: details and timeline of the migration]

We originally used JFrog’s hosted Artifactory. The reasons for moving to a self-hosted Artifactory deployment:

Multi-region performance. Pulling large Docker images from a US-hosted registry into our Asian and European regions was slow enough to add minutes to deploys. Self-hosted Artifactory with regional replicas put images close to where they’re pulled.
Cost scaling. Hosted pricing scaled hard with our storage and bandwidth. Self-hosting trades subscription fees for compute and operations cost, and at our volume the math flipped.
Cross-region replication on our terms. We can control what’s replicated, when, and to where.

What we lost: the JFrog SaaS team handling upgrades, capacity, and incidents. Self-hosted means we own all of that. The migration itself was [TODO: weeks/months] of work and required careful coordination with every team that pushed or pulled artefacts.

If you don’t have the multi-region need and can’t justify the operational cost, JFrog Cloud is genuinely good. Don’t move off it because of a blog post.

Terraform: per cloud, per region

Terraform is the source of truth for cloud infrastructure: VPCs, security groups, EKS/AKS clusters, the managed databases we use (RDS and its equivalents), S3 buckets, IAM, the whole substrate.

Module structure:

Cloud-agnostic modules for things that have direct equivalents on both clouds (DNS, certificate management).
Cloud-specific modules for things that genuinely differ (EBS gp3 vs Azure managed disks; ALB vs Application Gateway).
Region-specific roots that compose modules into actual deployments.

Terraform runs through Harness pipelines, not from developer machines. State is stored centrally with locking. PR review on infrastructure changes is mandatory; emergency changes have a documented break-glass process.

What I’ve learned about this setup:

Drift detection matters. Manual changes happen, even with the best intentions. A scheduled terraform plan that compares actual state to declared state catches drift before it bites.
Modules want to be smaller than you think. A 500-line module is hard to reason about; a 100-line module composed from smaller pieces is easier to test and reuse.
terraform import is your friend for the migration cases. Lots of legacy resources were created manually before Terraform; importing them into state is the discipline of “make the declared model match reality before changing reality.”
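
A scheduled drift check can be sketched roughly as below. This is a minimal sketch, not the job Kore actually runs: the function and enum names are hypothetical, and it assumes the `terraform` binary is on the path and the root module is already initialised. It relies on the documented `-detailed-exitcode` behaviour of `terraform plan` (0 for an empty diff, 1 for an error, 2 for a non-empty diff). On a branch where everything has already been applied, a non-empty diff is a reasonable signal of drift.

```python
import subprocess
from enum import Enum


class PlanResult(Enum):
    NO_CHANGES = "no_changes"
    DRIFT = "drift"
    ERROR = "error"


def classify_plan_exit_code(code: int) -> PlanResult:
    """Map a `terraform plan -detailed-exitcode` exit code to a result.

    0 = plan succeeded with an empty diff,
    2 = plan succeeded with a non-empty diff,
    anything else (normally 1) = the plan itself failed.
    """
    if code == 0:
        return PlanResult.NO_CHANGES
    if code == 2:
        return PlanResult.DRIFT
    return PlanResult.ERROR


def check_drift(root_dir: str) -> PlanResult:
    """Run a plan in one root module (hypothetical path) and classify it."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=root_dir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(proc.returncode)
```

Keeping the exit-code mapping separate from the subprocess call makes the decision logic testable without a Terraform installation.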

Docker, Kubernetes, and the VM holdouts

Most workloads ship as Docker images deployed to Kubernetes. The pipeline for these is standard: build image, tag, push to Artifactory, update Helm chart values, Harness rolls the deployment through the K8s API.
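
The tag-and-update step can be illustrated with a small sketch. Everything named here is an assumption for illustration: the registry hostname is made up, tagging by a shortened commit SHA is one common convention rather than a confirmed Kore practice, and the `image.repository` / `image.tag` value names depend on how a given Helm chart is written.

```python
import re


def image_reference(registry: str, repo: str, git_sha: str) -> str:
    """Build an image reference tagged with a shortened commit SHA.

    Tagging by commit (rather than a moving tag such as `latest`) means
    each deploy points at exactly one build.
    """
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_sha):
        raise ValueError(f"not a hex commit SHA: {git_sha!r}")
    return f"{registry}/{repo}:{git_sha[:12]}"


def helm_set_args(image_ref: str) -> list[str]:
    """Turn an image reference into `--set` arguments for a Helm command.

    Assumes the chart exposes `image.repository` and `image.tag` values.
    """
    repository, tag = image_ref.rsplit(":", 1)
    return [
        "--set", f"image.repository={repository}",
        "--set", f"image.tag={tag}",
    ]


# Hypothetical registry host and repository name.
ref = image_reference("artifactory.example.internal", "kore/api",
                      "0123456789abcdef0123")
args = helm_set_args(ref)
```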

The few remaining VM-based services use a different deployment path — Ansible playbooks, AMI builds, instance refresh through the cloud provider’s API. Slower, more brittle, but it works for workloads that haven’t been migrated yet.

The push to retire the VM holdouts is constant but slow. Each service has its own reasons for not having moved (state, customer-specific integration, plugin model that doesn’t fit cleanly into K8s). The migration story for each is documented; the timing is driven by the team owning the service.

Multi-region rollout

A production change typically flows:

Merged PR triggers Harness pipeline.
Build artefacts, push to Artifactory.
Deploy to dev region (full automation).
Deploy to staging region (automated; some change types require approval first).
Deploy to first production region (with canary, automated rollback if SLOs degrade).
Wait period (24 hours typical for non-critical changes; longer for risky ones).
Deploy to remaining production regions in tranches.
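
The canary step in the list above boils down to a comparison between the canary and the current baseline. The sketch below is illustrative only: the metric names and thresholds are assumptions chosen for the example, not Kore’s actual SLOs, and a real gate would read these numbers from a monitoring system rather than take them as arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloSample:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def canary_verdict(
    baseline: SloSample,
    canary: SloSample,
    max_error_rate_increase: float = 0.005,  # assumed threshold
    max_latency_ratio: float = 1.2,          # assumed threshold
) -> str:
    """Return "rollback" if the canary is meaningfully worse, else "promote"."""
    if canary.error_rate - baseline.error_rate > max_error_rate_increase:
        return "rollback"
    if (
        baseline.p99_latency_ms > 0
        and canary.p99_latency_ms / baseline.p99_latency_ms > max_latency_ratio
    ):
        return "rollback"
    return "promote"
```

Comparing against a live baseline rather than a fixed number keeps the gate from firing on degradation that predates the change.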

Critical changes (security patches, urgent bug fixes) have a fast-path that compresses this timeline but never skips the canary step. The fast-path requires explicit approval.
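
To make the normal and fast-path flows concrete, here is a sketch of a rollout planner. The region names, tranche size, and the 15-minute fast-path soak are all assumptions for illustration; the only properties taken from the text are the 24-hour typical wait, the tranches, the explicit approval, and the rule that the canary step is never skipped.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Step:
    name: str
    regions: tuple[str, ...]
    canary: bool = False
    wait_after: timedelta = timedelta(0)


def build_rollout(
    prod_regions: list[str],
    tranche_size: int = 2,
    fast_path: bool = False,
    approved: bool = False,
) -> list[Step]:
    """Plan dev -> staging -> canary region -> remaining regions in tranches."""
    if not prod_regions:
        raise ValueError("at least one production region is required")
    if fast_path and not approved:
        raise PermissionError("fast path requires explicit approval")

    # The fast path shortens the soak time; it never removes the canary step.
    soak = timedelta(minutes=15) if fast_path else timedelta(hours=24)
    first, rest = prod_regions[0], prod_regions[1:]

    steps = [
        Step("dev", ("dev",)),
        Step("staging", ("staging",)),
        Step("prod-canary", (first,), canary=True, wait_after=soak),
    ]
    for i in range(0, len(rest), tranche_size):
        tranche = tuple(rest[i:i + tranche_size])
        steps.append(Step(f"prod-tranche-{i // tranche_size + 1}", tranche))
    return steps
```

Calling `build_rollout(["us-1", "eu-1", "eu-2", "us-2"])` with these hypothetical region names yields five steps: dev, staging, a canary in `us-1`, then two tranches.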

What this gives us: a change can be in production within minutes for urgent cases, or rolled across all regions over days for normal cases. Bad changes get caught at the canary step in one region rather than blowing up everywhere simultaneously.

The pain points I’d flag honestly

[TODO: verify and expand based on real experience]

Pipeline duration is creeping up. Adding test coverage and security scanning to every build is right; the cumulative effect on pipeline time isn’t free. We periodically prune low-value steps and parallelize what we can.
Harness is a paid dependency we periodically re-evaluate. Open-source alternatives (Argo Workflows, Jenkins X, Tekton) have matured. We haven’t migrated because, so far, the cost of moving has outweighed the cost of staying. The calculation is worth re-running every couple of years.
Terraform state can become a contention point. Long-running plans hold locks; concurrent changes queue. We mitigate with state file splitting (per cloud, per region) but it’s a constant tuning exercise.
VM workloads are a maintenance tax. Every Ansible playbook, every AMI build, every AWS/Azure VM instance refresh is friction the K8s side doesn’t have. The case for finishing the migration is “stop paying this tax forever.”
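
The state-splitting mitigation mentioned above for Terraform lock contention can be sketched as a naming scheme. The key layout and the change identifiers below are hypothetical; the point is only that changes mapped to different state files do not wait on each other’s locks, while changes sharing a state file still queue.

```python
from collections import defaultdict

ALLOWED_CLOUDS = {"aws", "azure"}


def state_key(cloud: str, region: str) -> str:
    """Derive one state file path per (cloud, region) pair."""
    if cloud not in ALLOWED_CLOUDS:
        raise ValueError(f"unknown cloud: {cloud!r}")
    return f"{cloud}/{region}/terraform.tfstate"


def group_by_state(changes: list[tuple[str, str, str]]) -> dict[str, list[str]]:
    """Group (change_id, cloud, region) tuples by the state file they lock.

    Changes in the same group serialise behind one lock; separate groups
    can plan and apply concurrently.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for change_id, cloud, region in changes:
        groups[state_key(cloud, region)].append(change_id)
    return dict(groups)
```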

What I’d do differently if starting fresh

[TODO: this is the section to fill in with strong opinions you actually hold]

GitOps over imperative deploys. Argo CD (or Flux) plus Argo Workflows would give the same outcomes with the desired state always in Git. The “what’s actually deployed?” question becomes trivial.
Container-only from day one. No VM holdouts. New services have to be K8s-native; legacy services have an explicit retirement timeline.
Self-hosted Artifactory from the start if multi-region. The migration was painful; building on hosted and migrating off is more work than building on self-hosted from the beginning.
One Terraform repository per cloud, not per region. The region-specific stuff is parametric; the cloud-specific stuff is structural. Aligning the repo structure to the structural boundary makes refactoring easier.
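
The GitOps point about “what’s actually deployed?” can be illustrated with a toy comparison of desired versus live versions. This is not how Argo CD or Flux work internally; it is a sketch with made-up service names and versions, showing that once desired state lives in Git the question reduces to a diff.

```python
from typing import Optional


def deployment_diff(
    desired: dict[str, str],
    live: dict[str, str],
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return {service: (desired_version, live_version)} for every mismatch.

    A missing side is reported as None: (None, "x") means the service is
    running but not declared; ("x", None) means declared but not running.
    """
    diff: dict[str, tuple[Optional[str], Optional[str]]] = {}
    for service in sorted(desired.keys() | live.keys()):
        want, have = desired.get(service), live.get(service)
        if want != have:
            diff[service] = (want, have)
    return diff


# Hypothetical services and versions.
out_of_sync = deployment_diff(
    desired={"api": "1.4.2", "web": "2.0.0"},
    live={"api": "1.4.1", "web": "2.0.0", "legacy": "0.9"},
)
```

An empty result means the cluster matches Git; anything else is either a pending rollout or a manual change.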

Related reading

Kore infrastructure overview: what this pipeline actually deploys
VM-to-K8s migration pillar: the work the VM holdouts are gradually moving onto
K8s misconceptions: pipeline-related misconceptions (like image tagging) covered here