ADR 0006: CloudEdge Event Federation (routerd-to-routerd typed events)
Status
Accepted; experimental implementation in progress — 2026-05-30.
Phases 1, 1.5, 2, and 3 are implemented on event-federation:
- Phase 1 (event envelope + EventGroup Kind + SQLite local store + routerctl federation event emit/list): done.
- Phase 1.5 (EventPeer/EventSubscription Kinds + validation): done.
- Phase 2 (peer delivery over the overlay via routerd-eventd, HMAC, retry, retention prune): done; lab-smoke PASS (transport evidence).
- Phase 3 (subscription → plugin → RemoteAddressClaim DynamicConfigPart): done; lab-smoke PASS (subscription evidence, how-to).
Phase 4 (provider actionPlan plugins, dry-run) is next, not started.
Phase 5 (provider action execution) is out of MVP.
Context
SAM (reference, milestone) is clean-validated on Azure×PVE, AWS×PVE, and OCI×PVE (3-cloud parity). It proves the capture (provider-specific) / delivery+claim (routerd-common) split. But the RemoteAddressClaim that drives it is hand-authored today. The next step is to discover, propagate, and materialize claims event-driven:
on-prem routerd observes a client IPv4 (ARP/Clients/DHCP) → emits a typed event → federation bus delivers it to the cloud routerd → a subscription triggers a provider plugin → plugin returns a RemoteAddressClaim as a DynamicConfigPart (+ a provider secondary-IP actionPlan) → cloud is ready for provider-secondary-ip capture, with no human editing the cloud config.
What already exists (so the MVP is not greenfield)
Grounding the design in the current tree — most building blocks are present; the genuinely new work is the node-to-node federation transport and the event→plugin subscription trigger:
- Typed event envelope: pkg/daemonapi DaemonEvent{Type,Time,Daemon,Resource,Severity,Reason,Message,Attributes} + NewEvent(...). Today it flows daemon→main, but it is already a typed topic-carrying envelope.
- Daemon→routerd transport pattern: daemons POST to the control socket over HTTP-over-unix (cmd/routerd-dhcp-event-relay → controlapi.Prefix + /dhcp-lease-event via unix:/run/routerd/routerd.sock). There is even an event-relay daemon precedent.
- Separate long-lived daemon precedent: 13 cmd/routerd-* daemons (routerd-bgp, routerd-ra-observer, routerd-dhcp-event-relay, …). The gobgp pivot (ADR 0004) established "separate long-lived process over in-process" to avoid restart-drops.
- Plugin → DynamicConfigPart pipeline: pkg/plugin/runner.go, pkg/plugin/dynamic_config.go, pkg/dynamicconfig/{types,merge}.go, PluginRequest/PluginResult. effective = startup + active dynamic − masks.
- State: SQLite (pkg/state/sqlite.go).
- Provider profile + external auth: CloudProviderProfile with auth.mode=external-command (specs.go:1193), already the hook for provider-specific plugins. provider: oci|aws|azure|gcp already validated.
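The merge rule "effective = startup + active dynamic − masks" can be illustrated with a minimal sketch. The Part type, the Active flag, and the rule that startup config wins on a name conflict are assumptions made for illustration; they are not taken from pkg/dynamicconfig.

```go
package main

import (
	"fmt"
	"sort"
)

// Part is a hypothetical stand-in for a DynamicConfigPart: a set of named
// resources that only contributes to the effective config while Active.
type Part struct {
	Active    bool
	Resources map[string]string // resource name -> kind (illustrative)
}

// Effective computes startup + active dynamic - masks.
// Assumption: on a name conflict the startup resource wins.
func Effective(startup map[string]string, parts []Part, masks []string) map[string]string {
	out := make(map[string]string, len(startup))
	for name, spec := range startup {
		out[name] = spec
	}
	for _, p := range parts {
		if !p.Active {
			continue // inactive parts contribute nothing
		}
		for name, spec := range p.Resources {
			if _, exists := out[name]; !exists {
				out[name] = spec
			}
		}
	}
	for _, name := range masks {
		delete(out, name) // masks remove resources regardless of origin
	}
	return out
}

func main() {
	startup := map[string]string{"wg-hybrid": "Interface", "lan0": "Interface"}
	parts := []Part{
		{Active: true, Resources: map[string]string{"onprem-10-88-60-9": "RemoteAddressClaim"}},
		{Active: false, Resources: map[string]string{"onprem-10-88-60-10": "RemoteAddressClaim"}},
	}
	eff := Effective(startup, parts, []string{"lan0"})

	names := make([]string, 0, len(eff))
	for n := range eff {
		names = append(names, n)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```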
Decision
Build CloudEdge Event Federation as the next experimental MVP, on a new branch on top of merged-experimental SAM. Do not cut scope — decompose into ordered, independently-acceptable phases and drive each phase as a workflow. Each phase ships a working, demoable slice and gates the next.
Design principles
- Events are observed facts, not config. A node sends routerd.client.ipv4.observed, never a raw RemoteAddressClaim. The receiver's trusted local plugin decides whether/how to turn it into a typed claim + actionPlan. The wire never carries commands to execute.
- At-least-once + idempotent, not exactly-once. Store idempotency is keyed on the event id (a duplicate id is a no-op insert); dedupeKey is a subscription-side grouping key for collapsing repeated observations of the same fact, not a DB unique constraint in Phase 1. Dynamic resource names are deterministic (onprem-10-88-60-9); provider actions are no-op if already satisfied. No consensus, no gossip, no total ordering.
- Reuse, don't reinvent. Reuse the DaemonEvent envelope, the control-socket HTTP transport idiom, the Plugin→DynamicConfigPart pipeline, SQLite state, and CloudProviderProfile/Plugin (no new CloudProviderPlugin Kind).
- Minimize new Kinds. MVP introduces three: EventGroup (bus identity + auth + retention), EventPeer (delivery target + inline push/receive filters), EventSubscription (received-event → local plugin trigger). Fold the proposed standalone EventFilter into EventPeer for now; promote to its own Kind only if filters need to be shared across peers.
- Separate daemon. Federation send/receive lives in a new cmd/routerd-eventd long-lived daemon (per ADR 0004 precedent), not the reconcile loop. It binds to the overlay (wg-hybrid) only.
- Provider mutation stays dry-run in the MVP. Plugins emit actionPlans; execution is a later phase behind an explicit approval/auto-apply policy.
Transport & security (MVP)
- Receiver = HTTP listener bound to the WireGuard overlay interface/address only (e.g. 169.254.x.y:9443). The WG tunnel is the confidentiality boundary; add message-level HMAC (shared secret from file) for integrity/anti-misroute. Defer TLS: a TLS listener needs cert provisioning, which reintroduces exactly the bootstrap friction the SAM stocktake flagged. (Future: mTLS / per-peer Ed25519 / cloud-KMS signing.)
- Push-only for MVP (onprem→cloud observations; cloud→onprem claim/result acks).
- Retry with backoff; per-(event,peer) delivery status in SQLite.
Critical invariants to review at the state-machine level (not just diff)
Per the project's rule for out-of-process stateful daemons, these are the correctness conditions, stated as invariants:
- No feedback loop. A node MUST NOT re-emit *.observed for an address it itself captures (provider-secondary-ip or proxy-arp). Observation is scoped by ownerSide+domain; captured/secondary addresses are excluded from the observer's source set. Without this, cloud's own secondary .9 gets re-observed → re-propagated → flap.
- Asymmetric provision vs de-provision. Provisioning (claim appears) may be prompt. De-provisioning (TTL expiry / *.expired) must be hysteretic, with a much longer grace + debounce than the 300s observe TTL, because a flapping client must not drive repeated cloud secondary-IP assign/unassign (API rate limits + cost + dataplane churn). TTL→teardown policy is explicit and conservative.
- Single writer per (domain, address). The owning side is authoritative; the receiver only ever proposes a claim for an address whose ownerSide is the sender.
- Idempotent provider actions. "already assigned" ⇒ success/no-op across aws/azure/oci.
Provider plugin framework
OS-CLI-invoking local executables, not SDKs statically linked into routerd (keeps SDK churn/auth out of core; enables cloud-native identity; easier debug):
- AWS: aws ec2 assign-private-ip-addresses. Auth: IAM instance profile first, AWS_PROFILE/env fallback.
- Azure: az network nic ip-config …. Auth: managed identity first, az login/SP env fallback.
- OCI: oci network private-ip create/vnic. Auth: instance principal first, OCI config profile fallback.
Plugin.capabilities gate what a plugin may do
(observe.events/propose.dynamicConfig/propose.providerAction).
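A capability gate and a dry-run action plan might look roughly like the sketch below. The Plugin and ActionPlan types, the Gate function, and the eni-EXAMPLE placeholder are hypothetical; the actionPlan format is not yet standardized. The AWS argument vector uses the --network-interface-id and --private-ip-addresses options of aws ec2 assign-private-ip-addresses, and the sketch only builds it, never executes it.

```go
package main

import (
	"fmt"
	"strings"
)

// CapProviderAction is one of the capability names listed above.
const CapProviderAction = "propose.providerAction"

// Plugin is a hypothetical view of a Plugin resource and its capabilities.
type Plugin struct {
	Name         string
	Capabilities []string
}

// Has reports whether the plugin declares the given capability.
func (p Plugin) Has(capability string) bool {
	for _, c := range p.Capabilities {
		if c == capability {
			return true
		}
	}
	return false
}

// ActionPlan is a hypothetical dry-run plan: the command that would be run.
type ActionPlan struct {
	Provider string
	Argv     []string
	DryRun   bool
}

// AWSAssignPlan builds, but does not execute, the secondary-IP assignment.
func AWSAssignPlan(eni, ip string) ActionPlan {
	return ActionPlan{
		Provider: "aws",
		DryRun:   true,
		Argv: []string{
			"aws", "ec2", "assign-private-ip-addresses",
			"--network-interface-id", eni,
			"--private-ip-addresses", ip,
		},
	}
}

// Gate rejects provider action plans from plugins that lack the capability.
func Gate(p Plugin, plans []ActionPlan) ([]ActionPlan, error) {
	if len(plans) == 0 {
		return nil, nil
	}
	if !p.Has(CapProviderAction) {
		return nil, fmt.Errorf("plugin %q lacks capability %s", p.Name, CapProviderAction)
	}
	return plans, nil
}

func main() {
	plan := AWSAssignPlan("eni-EXAMPLE", "10.88.60.9")

	allowed := Plugin{Name: "aws-address-claim", Capabilities: []string{"propose.dynamicConfig", CapProviderAction}}
	if plans, err := Gate(allowed, []ActionPlan{plan}); err == nil {
		fmt.Println("would run:", strings.Join(plans[0].Argv, " "))
	}

	denied := Plugin{Name: "observer", Capabilities: []string{"observe.events"}}
	if _, err := Gate(denied, []ActionPlan{plan}); err != nil {
		fmt.Println("rejected:", err)
	}
}
```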
Phased decomposition (one workflow per phase, run in order)
Each phase = an independently-acceptable slice; later phases gated on earlier acceptance. Implementation delegated to codex; claude orchestrates + reviews.
- ✅ DONE. Phase 1: Event model + local store. EventGroup Kind; reuse/extend DaemonEvent as the external Event envelope (id, group, sourceNode, type, subject, ttl, dedupeKey, payload); SQLite federation_events table; routerctl federation event emit/list. Accept: emit→stored w/ TTL; dup id idempotent; expired ignored.
- ✅ DONE (lab-smoke PASS). Phase 1.5: EventPeer/EventSubscription Kinds + validation.
- ✅ DONE (lab-smoke PASS). Phase 2: Peer delivery over overlay. EventPeer Kind; routerd-eventd receiver bound to wg-hybrid; HMAC; push + backoff; event_deliveries. Accept: onprem pushes to cloud over wg-hybrid; dup push idempotent; bad HMAC rejected; routerctl event deliveries; routerd-eventd periodically prunes federation_events per EventGroup retention (maxAge/maxEvents), and routerctl federation event prune --dry-run reports what would be removed.
- ✅ DONE (lab-smoke PASS). Phase 3: Subscription-triggered plugin → DynamicConfigPart. EventSubscription Kind; event batch → PluginRequest; PluginResult → DynamicConfigPart (with routerd.net/dynamic-source, event-id, event-group annotations); debounce/batchWindow; event_subscription_runs. Accept: cloud receives client.ipv4.observed for 10.88.60.9/32 → plugin → RemoteAddressClaim DynamicConfigPart visible in routerctl dynamic render; actionPlan shown, not executed.
- ⏭ NEXT (not started). Phase 4: Provider actionPlan plugins (dry-run). aws/azure/oci-address-claim example plugins; standardized actionPlan format; instance-identity auth. Accept: plugins propose assign-secondary-IP; no mutation; plan visible in routerctl plugin/dynamic.
- Phase 5 (post-MVP): provider action execution. Approval/auto-apply policy; action journal; best-effort undo; identity docs. Out of MVP.
The first end-to-end smoke is manual routerctl federation event emit →
federation → DynamicConfigPart (Phases 1–3). The ARP/Clients observer plugin comes after
that smoke (modeled on routerd-ra-observer), so failures are isolatable.
MVP event types
routerd.client.ipv4.observed, …ipv4.expired, …dynamic.part.accepted/rejected,
…provider.action.planned/succeeded/failed. observed+expired alone suffice for
the first smoke.
Consequences
- Positive: turns SAM from hand-authored to event-driven; small, demoable phases; reuses existing envelope/transport/plugin/state; no new Kind sprawl (3); provider mutation stays gated; cloud-native identity from day one.
- Negative / risks: a new network listener (overlay-bound + HMAC mitigates); loop/flap and provision/de-provision asymmetry must be enforced as invariants (above); at-least-once pushes idempotency onto plugins and naming; TLS/mTLS deferred. De-provisioning automation is deliberately the last thing enabled.
- Scope-out (MVP): consensus, exactly-once, gossip mesh, arbitrary remote command execution, automatic provider mutation, full IP lifecycle automation, remote plugin registry, cross-node config rewrite.
Known limitations (experimental)
- routerd-eventd supervision is systemd-only. No FreeBSD rc.d unit is generated yet; on FreeBSD the daemon must be supervised manually until the rc.d resource is added.
- EventSubscription batchWindow/debounce are accepted but coarse. The fields validate and are honored at poll granularity: the controller batches events per poll tick, not on a precise sub-tick timer. Tight debounce windows are effectively rounded up to the tick interval.
Out of scope / open questions for later
- Whether cloud→onprem needs more than acks (e.g. capture-ready signal that toggles on-prem proxy-arp only after cloud secondary exists).
- Sharing filters across peers (promote EventFilter to its own Kind).
- Multi-peer / >2-node groups (MVP targets the validated pair topology).