ADR 0010: Capture Ownership Arbitration (multi-instance ownership map + ownershipEpoch fencing)
Status
Proposed; Accepted for experimental implementation — 2026-06-01.
Builds on ADR 0008: Capture Coordination via Fencing Tokens and the Selective Address Mobility dataplane. Addresses issue #76. Its consumer is ADR 0011: Generalized Failover (#74). Experimental.
Context
At scale a single cloud router cannot hold every captured secondary IP (ENI/NIC/
VNIC slot limits), so an N+1 set of same-provider routers must distribute
captured addresses. Today routerd has no cross-node ownership map and no mutual
exclusion:
- Coordination is single-node local projection: every node independently
projects the same federation event stream to the same
AddressLeasestate (pkg/controller/mobility/controller.go). There is no distributed lock, quorum, or consensus. - "Single owner" is implicit (capturePolicy
all-non-owner-sites+ deterministicevaluatePlacement), andcaptureEpoch(pkg/state/mobility_capture_epoch.go) is a per-node, per-(pool,address, captureDomain) monotonic token that fences stale provider actions at the import/execute gate (ADR 0008). - The reserved
MobilityPoolSpec.Authorityfield is unused.
#76 asks for a centralized ownership map, conflict exclusion, and split-brain prevention. ADR 0008 deliberately avoided consensus (Paxos/Raft/etcd) and built safety from a monotonic fencing token + the provider's structural single-assignment
- idempotent convergence. This ADR keeps that philosophy.
What "ownership" can and cannot guarantee without consensus (honest scope)
This is not a linearizable distributed lock. With event-ordered arbitration + fencing + the cloud's single-assignment semantics we guarantee:
- all nodes that see the same event stream converge to the same owner map;
- a node that has seen ownershipEpoch N+1 will not execute an epoch-N action (fenced at the gate);
- a cloud secondary IP can belong to exactly one NIC, so provider state converges to a single assignment.
We cannot guarantee that an old owner that is still alive but partitioned from federation (has not seen N+1) never re-grabs the address via the provider API — eliminating that needs consensus / STONITH / provider conditional-fencing, which we do not add. So the property is "fenced eventual ownership + provider-enforced single assignment," not "split-brain prevention." On-prem proxy-ARP is weaker still (no provider single-assignment): the cap there is VRRP-master authority + fail-closed (per ADR 0008).
Decision
ownershipEpoch — a per-(pool, address) cluster fence token
Introduce ownershipEpoch, a higher-level concept than captureEpoch: a
per-(pool, address) monotonic token that increments only on a confirmed owner
change (not while a lease is candidate/holding). It is the fence token that spans
cloud / on-prem / provider / action. captureEpoch is retained as a
compatibility/derived annotation; the source of truth moves to ownershipEpoch.
Ownership map — leader-less deterministic convergence
There is no elected leader (leader election needs consensus). The ownership map is a converged view each node builds deterministically from the federated event stream:
- For each
(pool, address), the owner is chosen by a deterministic arbitration: preferNodes → placement priority → stable tie-break over the eligible members (eligibility defined by ADR 0011: not drained, healthy, live, VRRP-master where applicable). - Multi-instance distribution: within a placement group, each address is arbitrated to one owner; the set of addresses spreads across the eligible members (future: least-loaded). One IP → one owner at a time.
- The map is surfaced (status DB + metrics + control/
routerctl) so operators see "which IP is owned by which node" — the "centralized ownership map" #76 wants, realized as a converged view rather than a single-writer store.
ipOwnershipPolicy on MobilityPool
spec:
ipOwnershipPolicy:
type: centralized # converged deterministic map (only mode)
epochLocking: true # stamp + fence actions by ownershipEpoch
preferNodes: [aws-router-a, aws-router-b]
autoFailover: true # consumed by ADR 0011 (liveness-driven seize)
preferNodes biases arbitration; epochLocking enables ownershipEpoch fencing;
autoFailover is the hook ADR 0011 uses. type has one mode now (centralized =
converged-deterministic).
Action idempotency key
Provider-action idempotency keys carry, at minimum, pool / address / ownerNode / ownershipEpoch / actionVerb / provider / nicRef — so a stale-epoch or
wrong-owner action is fenced deterministically.
Phasing (this ADR)
- Phase 1 (this ADR's minimal scope): the
ownershipEpochtoken, the deterministic ownership record + arbitration (preferNodes/priority/tie-break),ipOwnershipPolicyspec + validation, and ownership-map visibility (status + metrics +routerctl). No automatic seizure — Phase 1 only computes and exposes desired ownership and fences actions by ownershipEpoch; the existing static placement still drives who acts. - Liveness-driven failover/seize is ADR 0011.
Consequences
- routerd gains a cluster-converged, fenced ownership model for distributing captured IPs across N+1 same-provider routers, without adding a consensus store.
- The safety scope is stated honestly ("fenced eventual ownership," not a distributed lock); cloud is structurally strong, on-prem is VRRP-authority best-effort.
ownershipEpochis the single cross-cutting fence token that ADR 0011's seize and Phase-4 cloud-inventory/drift detection build on.