Federation Auto-Remediation Readiness Matrix

This matrix classifies each remediation action emitted by routerctl doctor federation --remediation-plan by execution readiness. P4 only generates plans; wiring those plans to execution is deferred to P5 or later.

Actions

| # | Action constant | Check code | Safe | Operator approval | Auto-execute ready | Rationale |
|---|---|---|---|---|---|---|
| 1 | retry-failed-deliveries | failed-deliveries | yes | no | ready | Re-enqueues outbox entries; idempotent delivery with HMAC; receiver deduplicates. |
| 2 | investigate-pending-deliveries | pending-deliveries | yes | no | inspect-only | Pending may be in-flight. Action is diagnostic (list pending), not mutating. |
| 3 | force-repush-stale-ttl | stale-ttl | yes | no | ready | Re-pushes events whose TTL was refreshed locally but not yet delivered. Idempotent; receiver applies latest TTL. |
| 4 | check-peer-connectivity | delivery-lag | yes | no | inspect-only | Probes overlay/TCP reachability to peer endpoint. Diagnostic; cannot fix network issues. |
| 5 | configure-peer-endpoint | expected-delivery-no-endpoint | no | yes | not ready | Requires config mutation (add EventPeer endpoint). Must be operator-reviewed; wrong endpoint = data leak risk. |
| 6 | investigate-missing-delivery-rows | expected-delivery | yes | no | inspect-only | Expected peer has endpoint but no delivery rows. Diagnostic query only. |
| 7 | inspect-failed-subscription-runs | subscription-runs | yes | no | inspect-only | Lists recent failed/pending subscription runs. Diagnostic; does not retry. |

Readiness categories

  • ready: Action is safe, idempotent, and has no side effects beyond the intended fix. Can be wired to auto-execute with FederationSLO thresholds gating frequency.
  • inspect-only: Action is diagnostic. It collects information but does not mutate state. Useful for triage dashboards and alerting, not auto-remediation.
  • not ready: Action requires operator judgment or config changes. Must remain behind approval gating even in future auto-execute phases.
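The matrix and category definitions above can be written down as typed data, so that the classification can be checked mechanically. The following is a minimal sketch, not the project's actual representation: the names Readiness, MatrixRow, MATRIX, validate, and auto_execute_candidates are hypothetical, and the two consistency rules are assumptions read off the category definitions (a ready action must be safe, and an action needing operator approval must be classified not ready).

```python
from dataclasses import dataclass
from enum import Enum


class Readiness(Enum):
    READY = "ready"
    INSPECT_ONLY = "inspect-only"
    NOT_READY = "not ready"


@dataclass(frozen=True)
class MatrixRow:
    action: str              # action constant from the matrix
    check_code: str          # doctor check code that triggers the action
    safe: bool
    operator_approval: bool
    readiness: Readiness


# The seven rows of the matrix, in document order.
MATRIX = (
    MatrixRow("retry-failed-deliveries", "failed-deliveries", True, False, Readiness.READY),
    MatrixRow("investigate-pending-deliveries", "pending-deliveries", True, False, Readiness.INSPECT_ONLY),
    MatrixRow("force-repush-stale-ttl", "stale-ttl", True, False, Readiness.READY),
    MatrixRow("check-peer-connectivity", "delivery-lag", True, False, Readiness.INSPECT_ONLY),
    MatrixRow("configure-peer-endpoint", "expected-delivery-no-endpoint", False, True, Readiness.NOT_READY),
    MatrixRow("investigate-missing-delivery-rows", "expected-delivery", True, False, Readiness.INSPECT_ONLY),
    MatrixRow("inspect-failed-subscription-runs", "subscription-runs", True, False, Readiness.INSPECT_ONLY),
)


def validate(rows):
    """Return a list of consistency problems; an empty list means the matrix is coherent."""
    problems = []
    seen = set()
    for row in rows:
        if row.action in seen:
            problems.append(f"{row.action}: duplicate action constant")
        seen.add(row.action)
        # Assumed rule: "ready" implies the action is marked safe.
        if row.readiness is Readiness.READY and not row.safe:
            problems.append(f"{row.action}: ready but not safe")
        # Assumed rule: anything needing approval stays "not ready".
        if row.operator_approval and row.readiness is not Readiness.NOT_READY:
            problems.append(f"{row.action}: needs approval but is not classified 'not ready'")
    return problems


def auto_execute_candidates(rows):
    """Actions that could later be wired to auto-execute (readiness == ready)."""
    return [row.action for row in rows if row.readiness is Readiness.READY]
```

Under these assumptions, validate(MATRIX) reports no problems and only retry-failed-deliveries and force-repush-stale-ttl are auto-execute candidates.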

P4 contract

In P4, --remediation-plan emits all 7 actions as a plan-only JSON document. The plan:

  1. Never mutates state (read-only doctor run + plan generation).
  2. Uses stable typed action constants (not free-text strings).
  3. Deduplicates by (action, group, peer, resource).
  4. Sorts deterministically for diff-friendly output.
  5. Includes safe and requiresOperatorApproval flags per action.
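Points 3 to 5 of the contract can be illustrated with a short sketch. It is an assumption-laden illustration rather than routerctl's implementation: the PlannedAction type, the build_plan function, the top-level "actions" key, and the example values are hypothetical. Only the dedup key (action, group, peer, resource) and the safe and requiresOperatorApproval flags come from the contract above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedAction:
    action: str
    group: str
    peer: str
    resource: str
    safe: bool
    requires_operator_approval: bool

    def key(self):
        # Dedup and sort key named in the contract.
        return (self.action, self.group, self.peer, self.resource)


def build_plan(findings):
    """Deduplicate, sort, and serialize planned actions; reads inputs only."""
    unique = {}
    for item in findings:
        # First occurrence wins; later duplicates of the same key are dropped.
        unique.setdefault(item.key(), item)
    ordered = [unique[k] for k in sorted(unique)]
    doc = {
        "actions": [
            {
                "action": a.action,
                "group": a.group,
                "peer": a.peer,
                "resource": a.resource,
                "safe": a.safe,
                "requiresOperatorApproval": a.requires_operator_approval,
            }
            for a in ordered
        ]
    }
    # sort_keys plus a fixed indent keeps the output byte-stable for diffs.
    return json.dumps(doc, indent=2, sort_keys=True)
```

Because the key tuple drives both deduplication and ordering, the same set of findings yields the same document regardless of the order in which the doctor checks reported them.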

Pre-conditions for P5+ auto-execute

Before any action graduates from plan-only to auto-execute:

  1. The action must be classified as ready in this matrix.
  2. A FederationSLO resource must exist for the target EventGroup.
  3. The ProviderActionPolicy (or a new FederationRemediationPolicy) must gate execution.
  4. Rate limiting must prevent remediation storms (e.g., max 3 retries per peer per hour).
  5. Every execution must be journaled in the action_executions table.
  6. Qualification evidence must demonstrate the action resolving the fault without side effects.
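Pre-conditions 1 to 5 can be sketched as a single gate in front of execution. Everything here is hypothetical: the RemediationGate class, its decide method, the in-memory journal list standing in for the action_executions table, and the boolean inputs standing in for FederationSLO and policy lookups. The sketch journals every decision, including denials, which goes beyond the stated requirement to journal executions. Pre-condition 6 (qualification evidence) is a process requirement and is not modeled.

```python
from collections import defaultdict, deque

# Actions classified as ready in the matrix.
READY_ACTIONS = frozenset({"retry-failed-deliveries", "force-repush-stale-ttl"})


class RemediationGate:
    def __init__(self, max_per_window=3, window_seconds=3600.0):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # peer -> timestamps of allowed runs
        self.journal = []                  # stand-in for action_executions

    def decide(self, action, peer, now, slo_exists, policy_allows):
        """Return True if the action may run for this peer at time `now` (seconds)."""
        if action not in READY_ACTIONS:
            return self._record(action, peer, now, "denied: not ready")
        if not slo_exists:
            return self._record(action, peer, now, "denied: no FederationSLO")
        if not policy_allows:
            return self._record(action, peer, now, "denied: policy")
        window = self._recent[peer]
        # Sliding window: forget runs that are a full window old or older.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_per_window:
            return self._record(action, peer, now, "denied: rate limit")
        window.append(now)
        return self._record(action, peer, now, "allowed")

    def _record(self, action, peer, now, outcome):
        self.journal.append(
            {"action": action, "peer": peer, "at": now, "outcome": outcome}
        )
        return outcome == "allowed"
```

With the defaults, a fourth retry for the same peer inside one hour is denied, while a different peer is unaffected. Keying the window by peer matches the "per peer per hour" example; a real policy might key by (action, peer) instead.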