#UPGRADE AUTOMATION

Upgrade Execution Engine

Human-gated autonomous upgrades with rollback safety. Execute infrastructure changes with confidence and full audit trails.

Executing a Kubernetes upgrade is one of the highest-stakes operations in infrastructure management. Control plane components restart. Workloads reschedule. API compatibility changes take effect. A single misconfigured step can cascade into application downtime that affects every team running on the cluster. Medulla's Upgrade Execution Engine transforms this process from a manual, high-stress runbook into a structured, checkpoint-driven operation with built-in safety gates and automatic rollback. Every step is auditable, every state change is captured, and the operator remains in control throughout.

The Problem

Upgrade execution in most organizations follows a familiar pattern. An engineer opens a runbook, starts executing steps manually, and monitors dashboards for signs of trouble. Each step requires judgment calls: Is this warning expected? Should I proceed or roll back? How long should I wait before checking pod health? The process is sequential, error-prone, and heavily dependent on the experience of the person running it.

When something goes wrong, the pressure intensifies. Rolling back a partially completed upgrade is significantly more complex than rolling back a clean failure. State has changed. Some nodes are on the new version, others are not. Addons may have been upgraded in anticipation of the new control plane version. The blast radius of a failed upgrade extends well beyond the cluster itself, affecting deployments, CI/CD pipelines, and on-call teams. Engineers who experience a failed upgrade carry that stress into the next upgrade cycle, leading to excessive caution and further delays that compound the version debt problem.

Observability tools can show you that something is wrong. Workflow engines can automate the steps. But neither combines execution automation with the domain-specific health intelligence needed to make safe proceed-or-abort decisions at each checkpoint.

The most dangerous moment in a Kubernetes upgrade is not the failure itself. It is the decision point immediately after a failure, when an operator must choose between rolling back, pressing forward, or waiting. Informing this decision with health-gate intelligence, and falling back to automatic rollback when no operator intervenes, takes the highest-risk judgment call out of a high-pressure moment.

How Medulla Solves It

Medulla executes upgrades as a sequence of steps, each bounded by checkpoint snapshots that capture the complete cluster state before and after. At every checkpoint, automated health gates evaluate the cluster against a set of domain-specific criteria: pod readiness across all namespaces, crashloop detection, webhook health, API server availability, and CoreDNS resolution.
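As a rough illustration of how gates like these could be evaluated against a checkpoint snapshot, here is a minimal Python sketch. It is not Medulla's implementation: the `Snapshot` fields, the `GateResult` type, the gate function names, and the restart threshold are all hypothetical, and a real engine would populate the snapshot from live cluster probes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Snapshot:
    """Hypothetical, simplified cluster state captured at a checkpoint."""
    pods_ready: Dict[str, bool] = field(default_factory=dict)     # "namespace/pod" -> Ready?
    restart_counts: Dict[str, int] = field(default_factory=dict)  # "namespace/pod" -> restarts
    webhooks_healthy: bool = True
    api_server_available: bool = True
    coredns_resolves: bool = True


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str = ""


def pod_readiness(s: Snapshot) -> GateResult:
    # Fails if any pod in any namespace is not Ready.
    not_ready = sorted(p for p, ready in s.pods_ready.items() if not ready)
    return GateResult("pod_readiness", not not_ready, ", ".join(not_ready))


def crashloop_detection(s: Snapshot, max_restarts: int = 5) -> GateResult:
    # Assumption: a restart count above max_restarts stands in for a crashloop.
    looping = sorted(p for p, n in s.restart_counts.items() if n > max_restarts)
    return GateResult("crashloop_detection", not looping, ", ".join(looping))


def flag_gate(name: str, attr: str) -> Callable[[Snapshot], GateResult]:
    # Builds a gate from a boolean snapshot field that a real probe would set.
    return lambda s: GateResult(name, bool(getattr(s, attr)))


GATES: List[Callable[[Snapshot], GateResult]] = [
    pod_readiness,
    crashloop_detection,
    flag_gate("webhook_health", "webhooks_healthy"),
    flag_gate("api_server_availability", "api_server_available"),
    flag_gate("coredns_resolution", "coredns_resolves"),
]


def evaluate_gates(snapshot: Snapshot) -> List[GateResult]:
    """Run every gate; execution should pause if any result has passed=False."""
    return [gate(snapshot) for gate in GATES]


def failed_gates(results: List[GateResult]) -> List[str]:
    return [r.name for r in results if not r.passed]
```

Running every gate instead of stopping at the first failure keeps the full picture available to the operator who inspects a paused execution.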

If any health gate fails, execution pauses automatically. The operator can inspect the before-and-after state comparison, evaluate the failure, and choose to retry the step, skip it, or abort the upgrade. If no operator intervenes, Medulla's automatic rollback engages, reverting the cluster to the last healthy checkpoint state.
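The pause-then-decide behavior described here can be sketched as a small decision function. This is an assumption-laden illustration only: the `Decision` and `Action` enums, the use of a queue to deliver operator input, and the idea of a fixed timeout before rollback are hypothetical, since the source does not say how operator absence is detected.

```python
import queue
from enum import Enum


class Decision(Enum):
    RETRY = "retry"
    SKIP = "skip"
    ABORT = "abort"


class Action(Enum):
    RERUN_STEP = "rerun_step"
    CONTINUE = "continue"
    ROLLBACK = "rollback"


def next_action(decisions: "queue.Queue[Decision]", timeout_s: float) -> Action:
    """Wait for an operator decision while paused on a failed health gate.

    Assumption: if nothing arrives within timeout_s seconds, the operator is
    treated as unavailable and the engine rolls back to the last healthy
    checkpoint.
    """
    try:
        decision = decisions.get(timeout=timeout_s)
    except queue.Empty:
        return Action.ROLLBACK
    return {
        Decision.RETRY: Action.RERUN_STEP,
        Decision.SKIP: Action.CONTINUE,
        Decision.ABORT: Action.ROLLBACK,  # abort triggers rollback
    }[decision]
```

In this sketch an explicit abort and an unanswered pause lead to the same outcome, so the cluster never stays in a failed intermediate state waiting for input.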

Every execution produces a complete audit trail. Each step, each health gate result, each state snapshot, and each operator decision is recorded and exportable as JSON or HTML. This audit trail is not just an operational record. It is a compliance artifact that demonstrates controlled, governed infrastructure changes.
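To make the shape of such a trail concrete, the following sketch records events and exports them as JSON. The `AuditEvent` fields, the `AuditTrail` class, and the event kinds are hypothetical; Medulla's actual export schema is not described in this document.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


def _now() -> str:
    # UTC timestamp in ISO 8601 form.
    return datetime.now(timezone.utc).isoformat()


@dataclass
class AuditEvent:
    kind: str   # e.g. "step", "health_gate", "snapshot", "operator_decision"
    step: str   # name of the upgrade step the event belongs to
    detail: Dict[str, Any] = field(default_factory=dict)
    timestamp: str = field(default_factory=_now)


class AuditTrail:
    """Append-only, in-memory record of everything that happened in one execution."""

    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def record(self, kind: str, step: str, **detail: Any) -> AuditEvent:
        event = AuditEvent(kind=kind, step=step, detail=detail)
        self._events.append(event)
        return event

    def to_json(self) -> str:
        # Events are exported in the order they were recorded.
        return json.dumps([asdict(e) for e in self._events], indent=2)


trail = AuditTrail()
trail.record("health_gate", "drain-node-a", gate="pod_readiness", passed=False)
trail.record("operator_decision", "drain-node-a", decision="abort")
exported = trail.to_json()
```

Keeping the trail append-only and timestamped at record time is what lets the export serve as evidence of the order in which gates fired and decisions were made.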

Key Capabilities

Checkpoint-driven execution — Every upgrade step is bounded by state snapshots. Before and after comparisons are always available, enabling precise rollback to any checkpoint. This structured approach ensures that no upgrade step proceeds without a verified recovery point.
Automated health gates — Pod readiness, crashloop detection, webhook health, API server availability, and CoreDNS resolution are evaluated at every checkpoint. Health gates use domain-specific criteria that go beyond generic health checks to assess Kubernetes-aware cluster stability.
Pause, resume, and abort controls — Operators maintain full control over execution flow. Pause at any point to investigate, resume when ready, or abort to trigger rollback. These controls ensure that automation never proceeds beyond the operator's comfort level, preserving human judgment for high-stakes decisions.
Automatic rollback — If health gates fail and no operator intervention occurs, Medulla automatically reverts to the last healthy checkpoint state. Automatic rollback eliminates the most dangerous decision point in an upgrade: the moment after failure when an operator must choose between reverting and pressing forward under pressure.
Before and after state comparison — Visual diff of cluster state at each checkpoint, showing exactly what changed and whether the change matches expectations.
Full audit trail — Every step, health gate result, state snapshot, and operator decision is recorded. Exportable as JSON or HTML for compliance and post-mortem review. Audit trails provide change management evidence for SOC 2, ISO 27001, and other governance frameworks.
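The before and after state comparison listed above can be pictured as a diff between two checkpoint snapshots, filtered against what the step was supposed to change. The sketch below is a simplified assumption: it models state as a flat mapping of component name to version, the function names are hypothetical, and the version strings are illustrative only.

```python
from typing import Dict, NamedTuple, Tuple


class StateDiff(NamedTuple):
    added: Dict[str, str]
    removed: Dict[str, str]
    changed: Dict[str, Tuple[str, str]]  # key -> (before, after)


def diff_state(before: Dict[str, str], after: Dict[str, str]) -> StateDiff:
    """Compare two checkpoint snapshots keyed by component name."""
    added = {k: v for k, v in after.items() if k not in before}
    removed = {k: v for k, v in before.items() if k not in after}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return StateDiff(added, removed, changed)


def unexpected_changes(
    diff: StateDiff, expected: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Changed keys whose new value is not what the step was expected to produce."""
    return {k: pair for k, pair in diff.changed.items() if expected.get(k) != pair[1]}


# Illustrative data: the step was only supposed to upgrade node-a.
before = {"node-a": "v1.29.4", "node-b": "v1.29.4", "coredns": "1.11.1"}
after = {"node-a": "v1.30.1", "node-b": "v1.29.4", "coredns": "1.11.3"}
diff = diff_state(before, after)
surprises = unexpected_changes(diff, expected={"node-a": "v1.30.1"})
```

Here `surprises` contains only the `coredns` change, which is the kind of result an operator would want highlighted before deciding whether to proceed.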

The Upgrade Execution Engine makes Kubernetes upgrades a controlled, repeatable process. Platform teams stop relying on heroic manual effort and start relying on structured automation with built-in safety. Every upgrade is auditable. Every failure is recoverable. Every decision point is informed by real-time health intelligence rather than dashboard watching and intuition. Execution results feed back into Medulla's confidence scoring, improving future upgrade predictions based on actual outcomes. Infrastructure changes become routine operations, not organizational events.