Writing a Kubernetes Cluster Autoscaler Provider with externalgrpc

The Kubernetes Cluster Autoscaler ships with integrations for all the major clouds and a long tail of smaller ones. There is also a Cluster API provider, which composes nicely with the setup I described in my previous post on private Talos on Hetzner. But I have use cases where clusters are not managed by CAPI and I still need autoscaling on cloudscale.ch, which has no upstream integration.

So what do you do if you’re a random dude on the internet like me who needs autoscaling for a cloud that is not in-tree? The answer is the External gRPC Cloud Provider. You implement a gRPC service, the Cluster Autoscaler connects to it, and your service translates scaling decisions into cloud API calls.

This post walks through the autoscaler internals, the externalgrpc interface, the implementation that became autoscaler-cloudscale, and the decisions that ended up mattering. I wrote it both as a reference for anyone building their own provider, and because most of what is interesting about this interface is not documented anywhere outside of a proto file and the source code.

Note: I am not associated with cloudscale.ch. I built this for a personal use case, out of pure joy, and because I like the product the folks at cloudscale.ch are building.

The split: the autoscaler knows when, not how

This is the load-bearing concept for the whole post.

The Cluster Autoscaler (CA) decides when to scale. It looks at pending pods and idle nodes and answers “do we need more capacity?” and “can we drop a node?”. It does not decide how to create a VM. That part is delegated to a cloud provider plugin.

Upstream ships about thirty in-tree providers: AWS, GCP, Azure, DigitalOcean, OVH, Hetzner, Equinix, Linode, Vultr, Scaleway, and so on. The provider answers questions (“what node groups exist”, “what would a new node look like”) and executes actions (“create N servers”, “delete these specific servers”). The CA owns everything else: scheduling simulation, scale-up and scale-down decisions, backoff, eviction with PDB respect, safety logic.

Hold that split in mind. Every subsequent section either lives on the CA side of the line or on the provider side.

Every ten seconds, the autoscaler asks itself five questions

The CA runs a single-threaded, leader-elected loop. The entry point is StaticAutoscaler.RunOnce() in core/static_autoscaler.go. Roughly every ten seconds it asks:

  1. Are there any unschedulable pods?
  2. Would they fit on a node from group X? (scheduler simulator)
  3. If yes, call NodeGroupIncreaseSize(X, delta).
  4. Has any node been idle for more than ten minutes, with pods that can move elsewhere?
  5. If yes, call NodeGroupDeleteNodes(X, [nodes]).

That word simulator in step 2 is the one to hold onto. The CA needs to know what a new node would look like before the node exists. The whole TemplateNodeInfo saga later in this post falls out of that requirement.

What hides inside one IncreaseSize call

The ten-second loop is the shape. Inside one decision to scale up, several distinct components are doing work, none of which the provider sees:

| Component | Job |
| --- | --- |
| Estimator | How many nodes of group X do we need to fit these pods? Default is binpacking. It is the only built-in. |
| Expander | Multiple groups would fit, which one wins? Six advertised in the --expander help text (random, most-pods, least-waste, price, priority, grpc) plus a seventh (least-nodes) that the factory registers but the help text omits. Default is least-waste as of CA v1.33.0 (was random from inception through v1.32, changed to prevent accidentally picking expensive groups). |
| Upcoming nodes | Tracks scale-ups already in-flight so the CA does not double-scale while VMs are booting. |
| DaemonSet accounting | Subtracts DS pods from each candidate’s allocatable before binpacking. Skipping this overestimates by your DaemonSet footprint. |
| Backoff | Group that recently failed a create is skipped for 5 min initially, doubles up to a 30 min cap, resets after 3 h. Prevents tight-looping on stockouts or quota issues. |

The provider sees none of this. It receives IncreaseSize(group=X, delta=N) and that is the entire contract. But the quality of that decision is bounded by the quality of one RPC: NodeGroupTemplateNodeInfo. Get it wrong and the estimator, the expander, and the upcoming-nodes accounting all reason from wrong numbers, on every scale-up evaluation.
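To make the estimator and the DaemonSet accounting a bit more concrete, here is a minimal sketch of binpacking on a single resource. It is not the CA's estimator (which runs the real scheduler plugins against the snapshot); estimateNodes and its parameters are names I made up for illustration, and it only looks at CPU millicores.

```go
package main

import (
	"fmt"
	"sort"
)

// estimateNodes returns how many template nodes are needed to place the pods,
// using first-fit decreasing on CPU millicores only (a deliberate simplification).
// dsOverheadMilli is subtracted from every node first, mirroring the
// DaemonSet accounting step. Pods larger than one node are skipped.
func estimateNodes(podsMilli []int64, allocatableMilli, dsOverheadMilli int64) int {
	usable := allocatableMilli - dsOverheadMilli
	if usable <= 0 {
		return 0
	}
	sorted := append([]int64(nil), podsMilli...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })

	var free []int64 // remaining CPU on each simulated node
	for _, p := range sorted {
		if p > usable {
			continue // would never fit on a node from this group
		}
		placed := false
		for i := range free {
			if free[i] >= p {
				free[i] -= p
				placed = true
				break
			}
		}
		if !placed {
			free = append(free, usable-p) // open a new simulated node
		}
	}
	return len(free)
}

func main() {
	pods := []int64{4000, 3900}
	fmt.Println(estimateNodes(pods, 7950, 0))   // 1: both pods share a node
	fmt.Println(estimateNodes(pods, 7950, 100)) // 2: a 100m DaemonSet pushes the second pod out
}
```

The second call is the point: a template that ignores 100m of DaemonSet overhead answers "one node" where the real cluster needs two.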

The simulator is the scheduler

The Cluster Autoscaler does not reimplement scheduling. It imports k8s.io/kubernetes/pkg/scheduler/framework and runs the actual Filter plugins:

  • NodeResourcesFit, NodeAffinity, NodeUnschedulable
  • PodTopologySpread, InterPodAffinity, TaintToleration
  • VolumeBinding, NodeVolumeLimits, NodePorts

These are the exact same plugins kube-scheduler runs at the Filter extension point, returning the same verdicts.

What the CA adds on top is the ClusterSnapshot. An in-memory cluster view that supports Fork and Revert. The default implementation, DeltaSnapshotStore, is copy-on-write. Fork and Revert are O(1). That is how scale-down can simulate removing every candidate node, hundreds of forks per loop on big clusters, without deep-copying the cluster.

If you ever want a fun afternoon, read autoscaler#2799, where the delta snapshot landed.
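The idea behind a copy-on-write snapshot fits in a few lines. The sketch below is my own toy version, not the CA's DeltaSnapshotStore: Snapshot, layer, and the method bodies are assumptions, and it tracks node names only. Fork pushes an empty layer and Revert pops it, so neither copies the cluster.

```go
package main

import "fmt"

// layer records the changes made since the last Fork.
type layer struct {
	added   map[string]int64 // node name -> allocatable CPU (millicores)
	removed map[string]bool
}

func newLayer() *layer {
	return &layer{added: map[string]int64{}, removed: map[string]bool{}}
}

// Snapshot is a stack of layers; only the top layer is ever mutated.
type Snapshot struct {
	layers []*layer
}

func NewSnapshot() *Snapshot { return &Snapshot{layers: []*layer{newLayer()}} }

func (s *Snapshot) top() *layer { return s.layers[len(s.layers)-1] }

// Fork pushes an empty layer: O(1), nothing is copied.
func (s *Snapshot) Fork() { s.layers = append(s.layers, newLayer()) }

// Revert drops the top layer and every change made since the last Fork.
func (s *Snapshot) Revert() {
	if len(s.layers) > 1 {
		s.layers = s.layers[:len(s.layers)-1]
	}
}

func (s *Snapshot) AddNode(name string, cpu int64) {
	t := s.top()
	delete(t.removed, name)
	t.added[name] = cpu
}

func (s *Snapshot) RemoveNode(name string) {
	t := s.top()
	delete(t.added, name)
	t.removed[name] = true
}

// HasNode walks from the newest layer to the oldest; the first layer that
// mentions the node decides.
func (s *Snapshot) HasNode(name string) bool {
	for i := len(s.layers) - 1; i >= 0; i-- {
		if s.layers[i].removed[name] {
			return false
		}
		if _, ok := s.layers[i].added[name]; ok {
			return true
		}
	}
	return false
}

func main() {
	s := NewSnapshot()
	s.AddNode("worker-1", 7950)

	s.Fork()
	s.RemoveNode("worker-1") // simulate draining a scale-down candidate
	fmt.Println(s.HasNode("worker-1")) // false

	s.Revert()
	fmt.Println(s.HasNode("worker-1")) // true
}
```

The trade-off in this toy is that reads get slower as the layer stack grows. It is only meant to show why Fork and Revert are cheap.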

What if your cloud is not on the list?

kubernetes/autoscaler is currently a monolith: the core CA logic plus thirty-something cloud provider implementations, all vendored in-tree, all tied to one release cycle. SIG Autoscaling has been honest about the cost. From their own refactor proposal autoscaler#9264 (opened Feb 2026):

Dependency hell. Release coupling. Many cloud providers are not owned by any SIG-Autoscaling maintainer.

If you want a cloud that is not in the tree, you have three options:

| Approach | Effort | Maintenance |
| --- | --- | --- |
| In-tree PR to k/autoscaler | High | Rebase hell, ride upstream releases, no external deps allowed |
| Fork the autoscaler | High | Own ~100k lines of Go forever (GKE, AKS, Datadog have done this) |
| externalgrpc | Low | Standalone binary, your own release cycle, language-agnostic |

Option three is the only one a single maintainer can sustain. You write one binary, deploy it next to the CA, and the contract between them is a proto file.

Where it is heading: the split (#9264)

That refactor proposal does not just complain. It proposes a structural split, with lazy consensus invoked in March 2026 and buy-in from Microsoft, Red Hat, and SIG Autoscaling reviewers.

| Repo | Role |
| --- | --- |
| kubernetes-sigs/cluster-autoscaler (new) | Pure library: core logic, scale decisions, test providers, and externalgrpc. |
| Per-cloud repos (trending toward cluster-autoscaler-provider-<name>, not finalized) | Vendor core, implement the CloudProvider interface. Karpenter pattern. |
| kubernetes/autoscaler (current) | Temporary back-compat home. Eventually shrinks. |

The proposal explicitly retains externalgrpc in the new core, alongside the test providers, while every other cloud moves out. Two integration models survive the split:

  • Library (compile-time): vendor core, implement the Go interface, ship a binary. Same shape Karpenter uses today.
  • externalgrpc (wire-time): stay a separate process. Language-agnostic. Independent release. Process isolation, mTLS, the whole envelope.

Jack Francis, Azure provider maintainer at Microsoft, summarized the direction in the thread:

“Urge providers to maintain their own projects in repositories that they are 100% responsible for.”

For an outsider building a provider today, externalgrpc is the realistic path. Post-refactor, it stays as the only language-agnostic, process-isolated option.

Why I picked externalgrpc

In short: nothing else was viable.

  • No upstream provider for cloudscale.ch and no realistic channel to land one as a non-maintainer.
  • Forking the CA is a permanent commitment I am not making in my spare time.
  • externalgrpc lets me ship a binary, own my release cycle, and not become anyone’s maintainer.

It happens to also be one of the two integration models SIG is keeping. That worked out.

Architecture

[Figure: Cluster Autoscaler externalgrpc architecture. Pending pods feed the autoscaler, which simulates with the imported kube-scheduler and talks gRPC over mTLS to autoscaler-cloudscale, which calls the cloudscale.ch REST API. New VMs boot with userData, the kubelet joins, and the CCM stamps providerID.]

Two processes inside the cluster. The upstream cluster-autoscaler is the brain. autoscaler-cloudscale is the hands. They talk gRPC with mutual TLS. The provider speaks REST to cloudscale.ch. One tag, k8s-autoscaler-group, joins everything together.

Walking the diagram in both directions:

  • Scale up. A pending pod shows up, and the CA simulates it against the template for a node group using the imported scheduler. If the pod would fit, the CA sends IncreaseSize over gRPC, and the provider POSTs to the cloud with the right tags applied at creation. The VM boots with the configured userData, the kubelet joins the cluster, and the CCM stamps spec.providerID onto the new Node. On the next Refresh the CA reconciles, and the pod schedules.
  • Scale down. Once a Node has sat idle for ten minutes with pods that could move elsewhere, the CA cordons it and evicts the pods itself (that part is squarely on the CA side of the line) and only then sends DeleteNodes over gRPC. The provider validates that the UUID belongs to this group via the tag check before issuing a DELETE to the cloud, and the target size decrements per successful delete, clamped at minSize.

Same diagram, two directions, same engine.

The contract: six RPCs and a tag

The proto defines fifteen RPCs. Most discussions handwave at that. Here is the honest shape:

| RPC | Status | Notes |
| --- | --- | --- |
| NodeGroups, NodeGroupForNode, Refresh | Required | Called every loop (~10 s) |
| NodeGroupTargetSize, NodeGroupIncreaseSize, NodeGroupDeleteNodes, NodeGroupDecreaseTargetSize, NodeGroupNodes | Required | Scaling operations and bookkeeping |
| Cleanup | Required | Called once on autoscaler shutdown |
| NodeGroupTemplateNodeInfo | “Optional” | Without it, scale-from-zero and phantom-node injection both break. Implement it. |
| GPULabel, GetAvailableGPUTypes | Required | Return empty values if no GPU flavors |
| PricingNodePrice, PricingPodPrice | Optional | Return Unimplemented. Only used by the pricing-aware expander. |
| NodeGroupGetOptions | Optional | Return Unimplemented. Per-group autoscaler tuning. |

Six RPCs do the heavy work (Refresh, NodeGroups, NodeGroupForNode, NodeGroupIncreaseSize, NodeGroupDeleteNodes, NodeGroupTemplateNodeInfo). Three are bookkeeping (NodeGroupTargetSize, NodeGroupDecreaseTargetSize, NodeGroupNodes). Three are required but answer with constants (GPULabel returns a single label string, GetAvailableGPUTypes returns an empty map, Cleanup returns an empty ack). The remaining three are the ones the proto explicitly marks as optional and we return codes.Unimplemented for (PricingNodePrice, PricingPodPrice, NodeGroupGetOptions).

The generated Go interface won’t compile if any RPC method is missing, so every one of the fifteen has to exist as a function on the server type. The CA’s externalgrpc client only explicitly handles codes.Unimplemented for the four optional RPCs, degrading gracefully on those. For required RPCs, the client behavior is per-RPC and uneven: Refresh errors are propagated as a CloudProviderError that aborts the current RunOnce iteration, NodeGroups errors are logged at V(1) and the call returns an empty slice (so the autoscaler quietly does nothing for that iteration), and NodeGroupForNode errors propagate to whatever caller invoked it. Worth reading the client source if you ever ship a half-implemented server.

Group membership is carried by one tag: k8s-autoscaler-group=<name>. The CA does not track that. The provider stamps every server it creates and reads the tag on every Refresh. Pick a key, stick with it.

Refresh: trust the cloud, not yourself

Every ~10 seconds, the CA calls Refresh. This is the chance to sync state with reality. The implementation:

  1. Fetch all servers from the cloudscale API (filtered by cluster tag if set).
  2. Rebuild the in-memory server cache.
  3. Refresh the flavor cache (1-hour TTL, flavors do not change often).
  4. For each node group, set targetSize to the actual server count.

The mindset matters more than the code. The provider keeps no persistent state of its own. It does not track “I created X, so X must exist.” Drift happens: manual deletes, failed creates, partial outages. Every Refresh reconciles. The cloud is the source of truth.

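func (c *APIClient) Refresh(ctx context.Context) error {
    // Restrict the listing to this cluster's servers when a cluster tag is set.
    var modifiers []cloudscale.ListRequestModifier
    if c.clusterTag != "" {
        modifiers = append(modifiers, cloudscale.WithTagFilter(cloudscale.TagMap{
            "k8s-cluster": c.clusterTag,
        }))
    }

    // One list call per Refresh; every other lookup reads from the cache.
    servers, err := c.api.Servers.List(ctx, modifiers...)
    if err != nil {
        return err
    }

    // Index by UUID so per-node lookups are O(1).
    byUUID := make(map[string]*cloudscale.Server, len(servers))
    for i := range servers {
        byUUID[servers[i].UUID] = &servers[i]
    }

    // Swap the whole map under the lock so readers never see a half-built cache.
    // (The flavor cache and per-group targetSize steps are not part of this excerpt.)
    c.mu.Lock()
    c.serversByUUID = byUUID
    c.mu.Unlock()
    return nil
}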
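Step 4 of the list above, setting each group's target size to the actual server count, could look roughly like the following. This is a standalone sketch under my own assumptions: server, countByGroup, and reconcileTargets are illustrative names, not types from the cloudscale SDK or the provider.

```go
package main

import "fmt"

// server is a stand-in for the few fields of a cloud server that matter here.
type server struct {
	UUID string
	Tags map[string]string
}

const groupTagKey = "k8s-autoscaler-group"

// countByGroup returns how many servers carry each group tag value.
// Servers without the tag are ignored: the provider does not manage them.
func countByGroup(servers []server) map[string]int {
	counts := make(map[string]int)
	for _, s := range servers {
		if g, ok := s.Tags[groupTagKey]; ok {
			counts[g]++
		}
	}
	return counts
}

// reconcileTargets overwrites every configured group's target size with the
// observed server count. Groups with no servers drop to zero.
func reconcileTargets(targets map[string]int, servers []server) {
	counts := countByGroup(servers)
	for name := range targets {
		targets[name] = counts[name]
	}
}

func main() {
	targets := map[string]int{"worker": 5, "gpu": 2} // stale in-memory state
	servers := []server{
		{UUID: "a", Tags: map[string]string{groupTagKey: "worker"}},
		{UUID: "b", Tags: map[string]string{groupTagKey: "worker"}},
		{UUID: "c", Tags: map[string]string{}}, // untagged, not ours
	}
	reconcileTargets(targets, servers)
	fmt.Println(targets["worker"], targets["gpu"]) // 2 0
}
```

Whatever the provider believed before the call is discarded, which is the whole point of treating the cloud as the source of truth.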

Gotcha: the lock-ordering deadlock. Two mutexes in this code: a cache mutex on the cloudscale client (client.mu) and a per-node-group mutex (ng.mu). They protect different state and they get called in nested patterns. They have to be acquired in a consistent order, cache first. Reverse it and a Refresh running concurrently with DecreaseTargetSize will eventually deadlock under load.

The fix in DecreaseTargetSize is one line of code and one comment:

// Get server count BEFORE acquiring ng.mu to avoid lock-ordering
// issues with client.mu inside ServersByTag.
serverCount := len(ng.Servers())

ng.mu.Lock()
defer ng.mu.Unlock()

Caught in a test. Grateful that it was not caught at 3 AM in production.
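For readers who want to see the ordering rule in isolation, here is a self-contained sketch. The types and method names are simplified stand-ins I invented (the real provider has more state), and the delta is positive here for readability. What matters is that decreaseTargetSize finishes its trip through client.mu before it touches ng.mu.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type client struct {
	mu      sync.Mutex // cache mutex
	servers []string
}

func (c *client) serverCount() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.servers)
}

type nodeGroup struct {
	mu         sync.Mutex // per-group mutex
	targetSize int
	client     *client
}

// refresh takes client.mu first and ng.mu second: the agreed order.
func (c *client) refresh(servers []string, ng *nodeGroup) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.servers = servers
	ng.mu.Lock()
	ng.targetSize = len(servers)
	ng.mu.Unlock()
}

// decreaseTargetSize reads the server count BEFORE locking ng.mu, so it never
// holds ng.mu while waiting for client.mu.
func (ng *nodeGroup) decreaseTargetSize(by int) error {
	count := ng.client.serverCount() // client.mu taken and released here

	ng.mu.Lock()
	defer ng.mu.Unlock()
	if ng.targetSize-by < count {
		return errors.New("cannot shrink target below the number of existing servers")
	}
	ng.targetSize -= by
	return nil
}

func main() {
	c := &client{}
	ng := &nodeGroup{client: c}

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			c.refresh([]string{"a", "b", "c"}, ng)
		}()
		go func() {
			defer wg.Done()
			_ = ng.decreaseTargetSize(1) // result depends on timing; ignored here
		}()
	}
	wg.Wait()
	fmt.Println("finished without deadlock")
}
```

Moving the serverCount call below ng.mu.Lock() recreates the reversed order, and with it the chance of a hang.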

Scale-from-zero needs a lie

Here is a question that should bother you: how does the CA decide whether a pending pod fits on a node group that currently has zero nodes?

There is no v1.Node in the snapshot to feed the simulator. The CA cannot run its scheduler simulator against a node that does not exist.

So it asks the provider to make one up. NodeGroupTemplateNodeInfo returns a synthetic v1.Node representing what a node from this group would look like if one existed. The CA runs its simulator against that fake.

Get the numbers wrong and scale-from-zero silently never triggers. The CA’s scheduler looks at the fake, decides the pods do not fit, and never calls IncreaseSize. No error, no event, no log line. The autoscaler is “working” and nothing scales.

That is one of the two things this RPC powers.

TemplateNodeInfo’s other job: phantom nodes

The other job is more clever, and the consequences of getting it wrong are worse.

After IncreaseSize, real VMs take 60 to 120 seconds to register as Nodes. The CA’s next loop fires in ten seconds. It cannot wait. If it did, it would see the pods are still pending, decide it needs more nodes, and call IncreaseSize again. Then again. Then again.

So the CA injects fake v1.Node objects, built from your TemplateNodeInfo, into the next snapshot. They carry an annotation: cluster-autoscaler.k8s.io/upcoming-node. The simulator sees pending pods as schedulable on these phantoms. No duplicate scale-up while the real VMs are booting. When the real nodes register, the phantoms vanish.

The implication: one TemplateNodeInfo feeds two mechanisms.

  1. Scale-from-zero, simulating against a node that may not yet exist.
  2. Phantom-node injection, simulating against nodes that will exist soon.

Wrong numbers in TemplateNodeInfo mean the CA either double-provisions or stalls. Both silent. One mistake, two failure modes.
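A toy calculation shows how the phantom mechanism depends on the template. This is not CA code: upcoming and stillPending are hypothetical helpers, and "pods per node" stands in for the full scheduler simulation.

```go
package main

import "fmt"

// upcoming returns how many phantom nodes a group gets: servers that were
// requested (targetSize) but have not registered as Nodes yet.
func upcoming(targetSize, registered int) int {
	if targetSize <= registered {
		return 0
	}
	return targetSize - registered
}

// stillPending returns how many pending pods remain unplaced once the
// phantoms are filled, given how many pods the template says fit per node.
func stillPending(pendingPods, phantoms, podsPerTemplateNode int) int {
	capacity := phantoms * podsPerTemplateNode
	if pendingPods <= capacity {
		return 0
	}
	return pendingPods - capacity
}

func main() {
	phantoms := upcoming(3, 1) // target 3, 1 registered: 2 servers still booting

	fmt.Println(stillPending(8, phantoms, 4)) // 0: template matches reality, no extra scale-up
	fmt.Println(stillPending(8, phantoms, 2)) // 4: template too small, looks like more nodes are needed
}
```

In the second case the VMs already on their way would have been enough, but the simulation cannot know that, so the leftover pods look like a reason to scale up again.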

Lying convincingly

The synthetic Node is built from the configured flavor (vCPU, memory) and the node group config (volume size, labels, taints). The response carries the Node as proto-marshaled bytes in nodeBytes (v1.Node#Marshal()), per the proto’s doc comment.

Simplified shape (real code uses resource.NewQuantity(...) for each field and computes Allocatable via a helper; it also sets Conditions: NodeReady=True):

// Capacity is what the flavor and the volume provide on paper.
capacity := v1.ResourceList{
    v1.ResourceCPU:              flavor.VCPUs,
    v1.ResourceMemory:           flavor.Memory,
    v1.ResourceEphemeralStorage: nodeGroup.VolumeSize,
    v1.ResourcePods:             resource.MustParse("110"),
}

node := &v1.Node{
    ObjectMeta: metav1.ObjectMeta{Labels: nodeGroup.Labels},
    Spec:       v1.NodeSpec{Taints: nodeGroup.Taints},
    Status: v1.NodeStatus{
        Capacity: capacity,
        // Allocatable is Capacity minus the kubelet reservations.
        Allocatable: subtractReserved(capacity),
    },
}

The critical detail is that Allocatable is not equal to Capacity. Real nodes never expose full capacity because the kubelet reserves resources:

Allocatable = Capacity − kubeReserved − systemReserved − evictionHard
  • kubeReserved: for kubelet, container runtime, kube-proxy
  • systemReserved: for OS daemons (systemd, sshd)
  • evictionHard: headroom so the kubelet can evict pods before the OOM killer fires

If the template claims full capacity, the CA over-provisions. Pods that look schedulable on the template will not fit on the real node, and the CA will scale up again the next loop. Tight loop.

Worth noting: vanilla kubelet ships built-in defaults for evictionHard only; kubeReserved and systemReserved are empty unless someone sets them. Some distros and managed services configure these for you, others leave them empty, so the actual values on your nodes depend on what you run. Check your kubelet config and mirror those numbers here, otherwise the template will drift from reality. The numbers below match what I actually run:

// Values matching the kubelet config:
//   systemReserved: cpu=50m, memory=384Mi, ephemeral-storage=256Mi
//   evictionHard:   memory.available=100Mi, nodefs.available=10%
cpuMillis := max(int64(vcpus)*1000-cpuReservedMillis, 0)
memBytes  := max(int64(memoryGB)*1024*1024*1024-memReserved, 0)
ephBytes  := max(capacityBytes-ephReservedFixed-ephEviction, 0)

For an 8 vCPU / 16 GB flavor that comes out to roughly 7950m CPU and 15.5 GB memory available to pods. Conservative on purpose. Underestimating allocatable is better than overestimating it. A consumer with a different kubelet config would want these as per-node-group settings.
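The arithmetic for that 8 vCPU / 16 GB example, written out as a runnable sketch. The reserved struct and the allocatable function are my own illustrative names, the 100 GB volume comes from the example node group config, and applying the 10% nodefs threshold to the full volume size is an assumption on my part.

```go
package main

import "fmt"

const (
	mi = int64(1) << 20
	gi = int64(1) << 30
)

// reserved mirrors the kubelet settings quoted above.
type reserved struct {
	cpuMilli       int64 // systemReserved cpu
	memBytes       int64 // systemReserved memory + evictionHard memory.available
	ephFixedBytes  int64 // systemReserved ephemeral-storage
	ephEvictionPct int64 // evictionHard nodefs.available, in percent
}

func nonNegative(v int64) int64 {
	if v < 0 {
		return 0
	}
	return v
}

// allocatable subtracts the reservations from a flavor's raw capacity.
// memoryGB and diskGB are treated as GiB, as in the snippet above.
func allocatable(vcpus, memoryGB, diskGB int64, r reserved) (cpuMilli, memBytes, ephBytes int64) {
	cpuMilli = nonNegative(vcpus*1000 - r.cpuMilli)
	memBytes = nonNegative(memoryGB*gi - r.memBytes)
	disk := diskGB * gi
	ephBytes = nonNegative(disk - r.ephFixedBytes - disk*r.ephEvictionPct/100)
	return cpuMilli, memBytes, ephBytes
}

func main() {
	r := reserved{
		cpuMilli:       50,
		memBytes:       384*mi + 100*mi,
		ephFixedBytes:  256 * mi,
		ephEvictionPct: 10,
	}
	cpu, mem, eph := allocatable(8, 16, 100, r)
	fmt.Printf("cpu=%dm memory=%dMi ephemeral=%dMi\n", cpu, mem/mi, eph/mi)
	// cpu=7950m memory=15900Mi ephemeral=91904Mi
}
```

15900Mi is about 15.5 GiB, which is where the rounded figure in the text comes from.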

Gotcha: ResourcePods defaults to zero.

Early in development, scale-from-zero just did not work. No error, no event, no log. Root cause: v1.ResourcePods on the constructed v1.Node defaults to zero. The simulator saw a node with capacity for zero pods, decided nothing could ever fit there, and never called IncreaseSize. Set it to 110 explicitly (or whatever your kubelet’s --max-pods is).

IncreaseSize: optimistic, concurrent, with rollback

NodeGroupIncreaseSize is the scale-up path. The CA decides it needs N more nodes and calls this with a delta. The naive approach (for-loop, create, bail on first error) is wrong, because cloud APIs fail partially. Two non-obvious decisions here:

First, bump targetSize optimistically before creating any servers, and reject the request entirely if that would exceed maxSize. The CA sees the rejection immediately. If we bumped after, the CA could call IncreaseSize again during the slow API calls and double-count in-flight scale-ups.

Second, create servers in parallel with bounded concurrency, then roll back targetSize based on failures.

ng.mu.Lock()
newTarget := ng.targetSize + delta
if newTarget > ng.cfg.MaxSize {
    ng.mu.Unlock()
    return fmt.Errorf("increasing by %d would exceed max %d (current: %d)",
        delta, ng.cfg.MaxSize, ng.targetSize)
}
ng.targetSize = newTarget
ng.mu.Unlock()

var wg sync.WaitGroup
sem := make(chan struct{}, maxConcurrentAPICalls)
errsCh := make(chan error, delta)
for range delta {
    wg.Go(func() {
        sem <- struct{}{}
        defer func() { <-sem }()
        if err := ng.createServer(ctx); err != nil {
            errsCh <- err
        }
    })
}
wg.Wait()
close(errsCh)

var errs []error
for err := range errsCh {
    errs = append(errs, err)
}

if len(errs) > 0 {
    ng.mu.Lock()
    ng.targetSize -= len(errs)
    ng.mu.Unlock()
    return fmt.Errorf("failed to create %d/%d servers: %v", len(errs), delta, errs)
}
return nil

The semaphore caps concurrency at maxConcurrentAPICalls (10) to avoid bursting the cloud API.

On partial failure, the target size is rolled back to reflect only the successes. The CA sees the discrepancy on the next Refresh and decides whether to retry. Undoing the whole batch (deleting the servers that were created successfully) would be wrong: those nodes might be needed, and a follow-up delete adds another opportunity to fail.

There is a small window between the optimistic bump and the rollback where a concurrent Refresh can observe targetSize = initial + delta even though only initial + (delta - failures) servers exist. That is harmless: Refresh is the source of truth and the next iteration re-snaps the cache from the cloud API.

Gotcha: orphan VMs from create-then-tag.

An earlier iteration did server creation and tag application in two separate API calls: create, then PATCH the tags. Restarting the provider pod during an in-flight create, between the create succeeding and the tag PATCH firing, left a VM running on cloudscale with no k8s-autoscaler-group tag. The next Refresh filtered it out, the autoscaler had no record of it, and it kept running until I noticed.

The fix: tag at create time, in a single API request. Most cloud APIs let you pass tags in the create body; cloudscale’s does. If yours does not, you need a reconciliation loop that picks up orphans. Either way, do not split create and tag across two requests.
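The shape of "tag at create time" is simply a create request that already carries the tags. The struct below is a hypothetical request body for illustration; the field names are my assumptions and not copied from the cloudscale API reference or SDK, so check those before reusing them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// createServerRequest is a hypothetical create body. The important part is
// that Tags travels in the same request as everything else.
type createServerRequest struct {
	Name   string            `json:"name"`
	Flavor string            `json:"flavor"`
	Image  string            `json:"image"`
	Zone   string            `json:"zone"`
	Tags   map[string]string `json:"tags"`
}

// newCreateRequest builds the body with group (and optional cluster) tags
// already attached, so there is no window in which the server exists untagged.
func newCreateRequest(name, flavor, image, zone, group, cluster string) createServerRequest {
	tags := map[string]string{"k8s-autoscaler-group": group}
	if cluster != "" {
		tags["k8s-cluster"] = cluster
	}
	return createServerRequest{Name: name, Flavor: flavor, Image: image, Zone: zone, Tags: tags}
}

func main() {
	req := newCreateRequest("worker-abc12", "flex-8-2", "custom:talos-v1.13.0", "rma1", "worker", "my-cluster")
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

If the provider pod dies right after this request is sent, the server either does not exist or exists with its tags. There is no third state.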

Scale-down and node deletion

NodeGroupDeleteNodes is the inverse of scale-up. The CA picks specific nodes (based on utilization, pod disruption budgets, drain status) and sends their provider IDs. The provider extracts UUIDs, validates ownership, and deletes.

func (ng *NodeGroup) DeleteNodes(ctx context.Context, uuids []string) error {
    tagKey, tagVal := ng.cfg.ManagedTag()

    for _, uuid := range uuids {
        server := ng.client.ServerByUUID(uuid)
        if server == nil {
            return fmt.Errorf("server %q not found", uuid)
        }
        if server.Tags[tagKey] != tagVal {
            return fmt.Errorf("server %q does not belong to node group %q", uuid, ng.cfg.Name)
        }
    }

    var (
        wg   sync.WaitGroup
        emu  sync.Mutex
        errs []error
        sem  = make(chan struct{}, maxConcurrentAPICalls)
    )

    for _, uuid := range uuids {
        wg.Go(func() {
            sem <- struct{}{}
            defer func() { <-sem }()

            if err := ng.client.DeleteServer(ctx, uuid); err != nil {
                emu.Lock()
                errs = append(errs, err)
                emu.Unlock()
                return
            }
            ng.mu.Lock()
            ng.targetSize--
            if ng.targetSize < ng.cfg.MinSize {
                ng.targetSize = ng.cfg.MinSize
            }
            ng.mu.Unlock()
        })
    }
    wg.Wait()
    // ...
}

The validation loop before any deletion is the load-bearing piece. If the CA sends a UUID that does not belong to this group (wrong tag) or does not exist in the cache, the entire request is rejected before any server is deleted. A bug in node-to-group mapping cannot cascade into deleting the wrong infrastructure.

The per-success decrement is the mirror of scale-up’s optimistic bump. Each successful delete decrements targetSize individually, clamped to minSize. If 2 out of 3 deletes succeed, targetSize reflects the 2 that are gone. The next Refresh reconciles whatever is left.

After create: bootstrapping the Node

One thing the proto contract does not cover at all: how does a newly created server actually become a Kubernetes node?

The CA does not know and does not care. It calls IncreaseSize, the provider creates a server with userData, and the CA waits for a new Node to appear in the API server. If it does not show up within --max-node-provision-time (default 15 min), the CA stops counting the missing node toward the group’s in-flight tally, may try a different group if pods are still pending, and eventually attempts to remove the unregistered server. When that happens, the bootstrap is broken, not the provider.

userData is opaque bytes to the provider. It gets passed through to the cloudscale API at server creation and that is the end of the provider’s involvement. What happens next depends on the OS:

  • With Talos Linux, userData is a machine config. The server reads it on first boot, configures the kubelet, and joins the cluster. No shell, no scripts.
  • With a cloud-init-capable image like Ubuntu, userData is a #cloud-config YAML. You can install k3s (a script that curls k3s and runs k3s agent --server=... --token=...), run kubeadm join, or do whatever you need to get a kubelet running and pointed at the API server.
  • With Fedora CoreOS, userData is Ignition.

Either way, the chain is: provider creates server → server boots with userData → kubelet registers Node → CCM sets spec.providerID → CA sees the node on the next Refresh and maps it to its group.

Creating the server is honestly the easy part of the whole stack. Once you have an OS image that joins your cluster autonomously from userData, the autoscaler glue is a couple thousand lines on top.

NodeGroupForNode: whose node is this?

Before the CA can scale down a node, it asks the provider: which group does this v1.Node belong to?

The chain is: providerID (set by the CCM) → strip prefix → server UUID → O(1) cache lookup → read k8s-autoscaler-group tag → node group.

func (p *Provider) NodeGroupForNode(ctx context.Context, req *pb.NodeGroupForNodeRequest) (*pb.NodeGroupForNodeResponse, error) {
    providerID := req.GetNode().GetProviderID()
    if providerID == "" {
        return &pb.NodeGroupForNodeResponse{}, nil
    }

    uuid, err := nodegroup.UUIDFromProviderID(providerID)
    if err != nil {
        return &pb.NodeGroupForNodeResponse{}, nil
    }

    server := p.client.ServerByUUID(uuid)
    if server == nil {
        return &pb.NodeGroupForNodeResponse{}, nil
    }

    groupName, ok := server.Tags["k8s-autoscaler-group"]
    if !ok {
        return &pb.NodeGroupForNodeResponse{}, nil
    }

    ng, ok := p.nodeGroups[groupName]
    if !ok {
        return &pb.NodeGroupForNodeResponse{}, nil
    }

    return &pb.NodeGroupForNodeResponse{NodeGroup: pbNodeGroup(ng)}, nil
}

Every step that does not resolve returns an empty response. The CA interprets that as “not managed by this provider, leave it alone”. Returning an error here would make the CA think the cloud API is broken and back off.

Two things to call out here.

Performance. This RPC is on the hot path. The CA calls it once per known node, every loop. The math: 500 nodes × 6 loops/min = 3000 RPCs/min. Hitting the cloud API on every call would melt the rate limit. The cache map above keeps it at one cloud API call per Refresh, so 6 list calls per minute instead of 3000 lookups. A 500× reduction in two lines of code.

CCM dependency. The cloudscale Cloud Controller Manager is what sets spec.providerID on nodes. If the CCM is not running, no node has a providerID, every lookup returns empty, and the autoscaler treats every node as unmanaged. The CCM is on the critical path.

Gotcha: each CCM picks its own providerID format.

A few examples:

aws:///us-east-1a/i-abc       (AWS)
gce://project/zone/instance   (GCE)
cloudscale://<uuid>           (cloudscale)

Check your CCM’s docs (or source) for the exact format and parse on that.
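For the cloudscale format shown above, the parsing step is a prefix check. This is a minimal sketch rather than the provider's actual UUIDFromProviderID; the function name and error messages are mine, and it does not validate that the remainder is a well-formed UUID.

```go
package main

import (
	"fmt"
	"strings"
)

const providerIDPrefix = "cloudscale://"

// uuidFromProviderID strips the scheme prefix and returns what is left.
// Anything that does not start with the prefix belongs to another provider.
func uuidFromProviderID(providerID string) (string, error) {
	if !strings.HasPrefix(providerID, providerIDPrefix) {
		return "", fmt.Errorf("providerID %q does not start with %q", providerID, providerIDPrefix)
	}
	uuid := strings.TrimPrefix(providerID, providerIDPrefix)
	if uuid == "" {
		return "", fmt.Errorf("providerID %q carries no UUID", providerID)
	}
	return uuid, nil
}

func main() {
	for _, id := range []string{
		"cloudscale://3f2b8c1e-0000-4000-8000-000000000001",
		"aws:///us-east-1a/i-abc",
		"cloudscale://",
	} {
		uuid, err := uuidFromProviderID(id)
		fmt.Printf("%q -> uuid=%q err=%v\n", id, uuid, err)
	}
}
```

In NodeGroupForNode a parse error is not surfaced to the CA. It turns into the empty "not mine" response.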

mTLS

The CA-to-provider connection uses mTLS. Both sides present certificates, both verify the other. cert-manager issues the certs and the Helm chart wires it up automatically.

This matters because the gRPC interface has no other authentication. The externalgrpc README calls mTLS “recommended”, which in practice means “without it, anyone who can reach the provider’s port can trigger server creation and deletion on your cloud account.”

return credentials.NewTLS(&tls.Config{
    Certificates: []tls.Certificate{cert},
    ClientAuth:   tls.RequireAndVerifyClientCert,
    ClientCAs:    caPool,
    MinVersion:   tls.VersionTLS13,
}), nil

RequireAndVerifyClientCert is the line that does the work. Without it the server accepts any client. TLS 1.3 minimum because there is no reason to support older versions on an internal gRPC channel.
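The snippet above assumes cert and caPool already exist. One way to build them from PEM bytes with the standard library is sketched below; serverTLSConfig is a name I made up, and where the bytes come from (files mounted from the cert-manager Secret, for example) is left out.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// serverTLSConfig builds a server-side mTLS config from PEM-encoded inputs.
// It fails closed: without a usable CA there is no config at all.
func serverTLSConfig(certPEM, keyPEM, caPEM []byte) (*tls.Config, error) {
	caPool := x509.NewCertPool()
	if !caPool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificate found in PEM input")
	}
	cert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, fmt.Errorf("loading server key pair: %w", err)
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientAuth:   tls.RequireAndVerifyClientCert, // reject clients without a cert signed by caPool
		ClientCAs:    caPool,
		MinVersion:   tls.VersionTLS13,
	}, nil
}

func main() {
	// Garbage input on purpose: the function must refuse to produce a config.
	_, err := serverTLSConfig(nil, nil, []byte("not a certificate"))
	fmt.Println("error:", err)
}
```

Checking the AppendCertsFromPEM return value matters. With an empty ClientCAs pool, tls.Config falls back to the host's root CA set, which is not the trust boundary you want here.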

Multi-cluster isolation

autoscaler-cloudscale takes a clusterTag setting. When set, every API call is filtered to only see servers tagged k8s-cluster=<value>, and every created server gets that tag automatically.

Without this, multiple clusters in the same cloudscale project would see each other’s servers. NodeGroupForNode would scan the entire account. Scale operations could touch the wrong cluster’s nodes.

The config validation auto-injects the cluster tag into every node group’s tag set, and refuses to start if a group declares a conflicting k8s-cluster value. Catching that at startup beats catching it at runtime when the autoscaler deletes the wrong server.
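A sketch of that startup check, with made-up type and function names (nodeGroupConfig, applyClusterTag) rather than the provider's real config code. It validates every group before touching any of them, so a conflict leaves the config unmodified.

```go
package main

import "fmt"

const clusterTagKey = "k8s-cluster"

type nodeGroupConfig struct {
	Name string
	Tags map[string]string
}

// applyClusterTag injects the cluster tag into every node group and returns
// an error if any group already declares a different value.
func applyClusterTag(clusterTag string, groups []nodeGroupConfig) error {
	if clusterTag == "" {
		return nil // isolation not enabled
	}
	// First pass: look for conflicts without mutating anything.
	for _, g := range groups {
		if v, ok := g.Tags[clusterTagKey]; ok && v != clusterTag {
			return fmt.Errorf("node group %q sets %s=%q, conflicting with clusterTag %q",
				g.Name, clusterTagKey, v, clusterTag)
		}
	}
	// Second pass: inject the tag.
	for i := range groups {
		if groups[i].Tags == nil {
			groups[i].Tags = make(map[string]string)
		}
		groups[i].Tags[clusterTagKey] = clusterTag
	}
	return nil
}

func main() {
	groups := []nodeGroupConfig{
		{Name: "worker"},
		{Name: "gpu", Tags: map[string]string{clusterTagKey: "other-cluster"}},
	}
	fmt.Println(applyClusterTag("my-cluster", groups)) // conflict on "gpu"

	groups[1].Tags[clusterTagKey] = "my-cluster"
	fmt.Println(applyClusterTag("my-cluster", groups)) // <nil>
	fmt.Println(groups[0].Tags[clusterTagKey])         // my-cluster
}
```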

Observability

The provider exposes Prometheus metrics on a separate HTTP port (:9090). The interesting ones:

# How often the cloud API is called and whether it is healthy
autoscaler_cloudscale_api_requests_total{operation="list_servers|create_server|delete_server", result="success|error"}
autoscaler_cloudscale_api_request_duration_seconds{operation}

# Whether scaling is happening and succeeding
autoscaler_cloudscale_node_group_scale_up_total{node_group, result="success|partial_failure"}
autoscaler_cloudscale_node_group_scale_down_total{node_group, result}

# Current state at a glance
autoscaler_cloudscale_node_group_current_size{node_group}
autoscaler_cloudscale_node_group_target_size{node_group}

The gap between current_size and target_size tells you whether servers are being created (target > current) or whether something is stuck. Combined with the API error rate, you can tell whether scaling is blocked by cloud API issues or by provider bugs.

A gRPC interceptor chain handles cross-cutting concerns (panic recovery, request metrics, per-RPC logging) without cluttering the business logic:

grpc.ChainUnaryInterceptor(
    interceptor.Recovery,
    interceptor.Metrics,
    interceptor.Logging,
)

Deploying it

Two Helm releases: autoscaler-cloudscale first (creates the cloud-config ConfigMap and TLS Secret), then the upstream cluster-autoscaler chart configured with cloudProvider: externalgrpc. Prerequisites: cert-manager and the cloudscale CCM.

1. Install autoscaler-cloudscale:

helm install autoscaler-cloudscale \
  oci://ghcr.io/kubeterm-sh/charts/autoscaler-cloudscale \
  --namespace kube-system \
  --set cloudscaleAPI.token="your-token" \
  -f autoscaler-cloudscale-values.yaml

# autoscaler-cloudscale-values.yaml
config:
  clusterTag: my-cluster
  nodeGroups:
    - name: worker
      minSize: 0
      maxSize: 10
      flavor: flex-8-2
      image: "custom:talos-v1.13.0"
      zone: rma1
      volumeSizeGB: 100
      usePrivateNetwork: true
      networkUUID: <uuid>
      subnetUUID: <uuid>
      userData: "@/etc/autoscaler-cloudscale/machineconfig/machineconfig.yaml"
      labels:
        node.kubernetes.io/role: worker

The @-prefix on userData loads from a file mounted via Secret. The provider passes the contents through to the cloudscale API unchanged.

2. Install cluster-autoscaler with externalgrpc:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f cluster-autoscaler-values.yaml

# cluster-autoscaler-values.yaml
cloudProvider: externalgrpc

autoDiscovery:
  clusterName: my-cluster

extraArgs:
  cloud-config: /etc/cloud-config/cloud-config.yaml

extraVolumes:
  - name: cloud-config
    configMap:
      name: autoscaler-cloudscale-cloud-config
  - name: autoscaler-tls
    secret:
      secretName: autoscaler-cloudscale-tls

extraVolumeMounts:
  - name: cloud-config
    mountPath: /etc/cloud-config
  - name: autoscaler-tls
    mountPath: /etc/autoscaler-cloudscale/tls
    readOnly: true

The cloud-config tells cluster-autoscaler where to find the gRPC service. The TLS Secret makes mTLS work. Both created by the first chart.

Run cluster-autoscaler at -v=4 and you will see it call Refresh, NodeGroups, and NodeGroupForNode every loop. Create a Deployment with more replicas than your cluster can schedule and you will see NodeGroupIncreaseSize fire and a cloudscale server appear seconds later.

Building one for your cloud

If your cloud is not in the upstream tree and you are thinking about doing this, here is the recipe:

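  1. Implement the six RPCs that do the heavy work: Refresh, NodeGroups, NodeGroupForNode, NodeGroupTemplateNodeInfo, NodeGroupIncreaseSize, NodeGroupDeleteNodes. Add the three bookkeeping RPCs (NodeGroupTargetSize, NodeGroupDecreaseTargetSize, NodeGroupNodes) and the three constant answers (GPULabel, GetAvailableGPUTypes, Cleanup). Return codes.Unimplemented for the remaining three (PricingNodePrice, PricingPodPrice, NodeGroupGetOptions). All fifteen must exist; only those last three get to do nothing.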
  2. Pick one tag scheme for group membership. k8s-autoscaler-group=<name> is fine. The CA does not care what the key is. Do not overthink it.
  3. Confirm your cloud has a CCM. providerID is the bridge between Kubernetes identity and cloud identity. No CCM, no scale-down. If your cloud does not have a CCM, you have a bigger project than the autoscaler.
  4. Decide your userData shape. Talos, cloud-init, Ignition, whatever your OS speaks. The provider passes bytes through. You own the bootstrap.
  5. Get TemplateNodeInfo right. Match the kubelet’s --system-reserved and --eviction-hard flags. Set ResourcePods explicitly. Marshal the response as protobuf bytes, not JSON. Remember it feeds both scale-from-zero and phantom-node injection.
  6. mTLS the gRPC channel. cert-manager plus a self-signed Issuer chain is about five minutes of YAML.

The Cluster Autoscaler does the orchestration. You write under 2000 lines of cloud-specific glue.

Wrapping up

externalgrpc is how you plug virtually any cloud into the Cluster Autoscaler without forking the CA, without taking a dependency on CAPI, and without becoming a SIG maintainer. The CA keeps owning the parts that are genuinely hard (scheduling simulation, scale decisions, drain coordination, backoff). Your provider owns the parts that are cloud-specific and only that: list servers, create N, delete these. Six RPCs do the real work, three more are bookkeeping, the rest are stubs that just need to exist.

If your cloud has a working API, a Cloud Controller Manager that stamps providerID, and an image the kubelet can join from userData, you have everything you need. The blocking parts of this project were the ones that are not visible from the proto file: getting TemplateNodeInfo’s allocatable numbers right so the CA does not loop, realizing that same response also drives phantom-node injection, the lock ordering between caches, the optimistic-then-rollback pattern for partial failures, and treating the refresh loop as the single source of truth for target size. Everything else falls out of the contract.

The provider does not know anything about workloads. It translates “create 2 more workers” into API calls and gets out of the way. That separation is the whole point, and it scales to any cloud willing to expose those primitives.

Code: github.com/kubeterm-sh/autoscaler-cloudscale, Apache 2.0.

© 2026 mdnix