DPI observability architecture backlog
Scope
This note records the implementation direction for using DPI as an observation and analysis signal in routerd. DPI must not sit on the forwarding verdict path for this work. Packet handling should remain copy-based and best-effort: if DPI is slow, missing, or crashing, traffic forwarding continues and only analysis quality degrades.
The goal is to make connection and client analysis more useful than port-based
guesses. The first useful outcome is a lower port-fallback ratio and richer
flow metadata in the Web Console, not application-aware blocking.
Current state
- routerd-firewall-logger receives NFLOG copies from nftables accept/drop logging and can call routerd-dpi-classifier.
- routerd-dpi-classifier exposes HTTP+JSON over a Unix socket and supports builtin, ndpi-agent, and auto engines. auto falls back to the built-in parser when the agent is unavailable, returns unknown, times out, or errors.
- The external ndpiReader option is deprecated compatibility only. Its presence does not change classification behavior.
- routerd-ndpi-agent is implemented as an optional CGO/libndpi service. The non-libndpi build reports an unavailable service boundary instead of silently pretending to classify.
- traffic-flows.db remains conntrack-derived and payload-free. It is enriched from dpi_flow, DNS/SNI/HTTP host metadata, resolved hostnames, and port fallback while preserving existing DPI fields when later conntrack updates do not carry payload-derived details.
- The Web Console separates protocol/application evidence from provider labels more clearly than before. Client summaries use the recent traffic window and map provider-only nDPI names such as Google, AWS, Microsoft, Apple, Cloudflare, and Nintendo back to the observed transport protocol where appropriate.
Target architecture
Keep routerd core and the existing Go helpers independent of libndpi. Add a
separate optional native service for nDPI-backed analysis:
nftables NFLOG packet copy
  -> routerd-firewall-logger
  -> routerd-dpi-classifier    pure Go facade, cache, fallback, API
  -> routerd-ndpi-agent        optional CGO/libndpi service
  -> firewall-logs.db dpi_flow
  -> traffic-flow enrichment / Web Console / client analysis
routerd-dpi-classifier remains the stable integration point. It owns timeout
policy, fallback to the built-in parser, source labeling, and compatibility with
callers. routerd-ndpi-agent owns libndpi flow state and native resource
lifetime.
Process boundaries
routerd-firewall-logger
- Continues reading NFLOG/pflog packet copies.
- Sends only candidate packets to routerd-dpi-classifier.
- Never waits on DPI for a forwarding verdict.
- Records accepted flow classifications into dpi_flow.
- Enriches deny events from cached dpi_flow entries when available.
routerd-dpi-classifier
- Remains pure Go.
- Provides the public /v1/status, /v1/healthz, and /v1/classify API.
- Maintains engine selection: builtin, ndpi-agent, or auto.
- Calls routerd-ndpi-agent when enabled and falls back to the built-in parser on timeout, unavailable socket, unknown result, or agent error.
- Adds result source metadata:
  - ndpi-agent
  - builtin
  - port-fallback
  - future sources such as dns-cache, sni-cache, and cloud-ip
- Applies policy limits before forwarding work to the agent:
  - first payload packets per flow
  - copy range
  - flow TTL
  - flow limit
  - per-request timeout
routerd-ndpi-agent
- Is optional and isolated from routerd core.
- Uses CGO and links to libndpi.
- Keeps nDPI detection modules and per-flow detection state in one process.
- Accepts packet observations over a Unix socket.
- Returns protocol, application, category, confidence, risk, and metadata.
- Exposes status counters and resource usage so routerd can show whether nDPI is actually contributing.
- Failure is non-fatal; systemd may restart it without disrupting forwarding.
Configuration direction
Introduce a DPI classifier resource or extend the existing generated systemd resources with a typed policy shape. Prefer a resource once the fields are stable enough to expose:
- apiVersion: net.routerd.net/v1alpha1
  kind: DPIClassifier
  metadata:
    name: default
  spec:
    engine: auto # builtin | ndpi-agent | auto
    socket: /run/routerd/dpi-classifier/default.sock
    ndpiAgentSocket: /run/routerd/ndpi-agent/default.sock
    firstPayloadPackets: 10
    copyRange: 2048
    flowTTL: 1h
    flowLimit: 100000
    requestTimeout: 200ms
FirewallLog.spec.log.acceptSampleRate still controls how much accepted traffic
is copied into NFLOG. The DPI policy controls what happens after the packet copy
reaches routerd.
Data model backlog
- Extend dpi.ClassifyResult with Engine and Source.
- Extend dpi.ClassifyResult with typed protocol and risk fields such as DetectedProtocol, MasterProtocol, ApplicationProtocol, Category, Risk, and generic Metadata.
- Keep existing fields such as AppName, AppCategory, TLSSNI, HTTPHost, and DNSQuery for API compatibility until a cleaner typed shape replaces them.
- Add source and engine details to firewall log hints.
- Add structured source and engine columns to DPIFlowEntry and propagate them to traffic-flows.db.
- Add schema migration tests for new traffic_flows SQLite columns.
- Add schema migration coverage for dpi_flow and traffic_flows source/engine columns.
- Keep traffic-flows.db payload-free. Enrich traffic flows by joining against dpi_flow, DNS query history, SNI/HTTP host metadata, and port fallback.
- Expose nDPI-agent status counters through the agent status endpoint.
- Store enough cross-component counters to explain value in the Web Console:
  - packets observed
  - packets sent to built-in parser
  - packets sent to nDPI agent
  - nDPI classified flows
  - fallback classified flows
  - unknown flows
  - timeout/error counts
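One way to add Engine and Source while keeping the existing fields compatible is to make the new fields optional in the JSON encoding. The sketch below assumes JSON key names and `omitempty` tags that are not taken from the routerd code; only the Go field names come from this backlog.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClassifyResult is a hypothetical subset of dpi.ClassifyResult.
// The JSON key names are assumptions for illustration.
type ClassifyResult struct {
	AppName     string `json:"appName,omitempty"`
	AppCategory string `json:"appCategory,omitempty"`
	TLSSNI      string `json:"tlsSNI,omitempty"`
	HTTPHost    string `json:"httpHost,omitempty"`
	DNSQuery    string `json:"dnsQuery,omitempty"`
	Engine      string `json:"engine,omitempty"` // new field
	Source      string `json:"source,omitempty"` // new field
}

// encode renders a result as compact JSON.
func encode(r ClassifyResult) string {
	b, err := json.Marshal(r)
	if err != nil {
		panic(err)
	}
	return string(b)
}

// decode parses a result; keys missing from older payloads stay empty.
func decode(s string) (ClassifyResult, error) {
	var r ClassifyResult
	err := json.Unmarshal([]byte(s), &r)
	return r, err
}

func main() {
	fmt.Println(encode(ClassifyResult{
		AppName: "TLS",
		TLSSNI:  "example.com",
		Engine:  "ndpi-agent",
		Source:  "ndpi-agent",
	}))

	// A payload written before Engine/Source existed still decodes.
	legacy, err := decode(`{"appName":"DNS","dnsQuery":"example.com"}`)
	fmt.Println(legacy.AppName, legacy.Source == "", err)
}
```

With optional fields, a consumer can treat an empty Source as "not recorded" rather than guessing a value for older rows.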
Implementation backlog
Phase 1: Make the current state honest
- Rename user-facing engine labels from nDPI to built-in parser where appropriate.
- Remove or de-emphasize ndpiReader status fields that imply active nDPI classification.
- Keep routerd-dpi-classifier API stable.
- Add tests proving ndpiReader availability does not change classification.
Phase 2: Introduce engine abstraction
- Add an internal classifier engine interface in routerd-dpi-classifier.
- Move the existing parser behind a builtin engine.
- Add result source and engine fields.
- Add unit tests for agent fallback behavior and unknown results.
- Add an explicit timeout-path unit test with a deliberately slow agent.
- Update Web Console labels so provider labels are not treated as protocol labels in connection and client summaries.
- Add Web Console source counters for ndpi-agent, builtin, and port-fallback where structured source data is available.
- Add Web Console connection filtering by classification source (dpi, port-fallback, identifying, none).
- Add Web Console traffic-flow filtering by structured source (ndpi-agent, builtin, port-fallback).
Phase 3: Add optional routerd-ndpi-agent
- Add a new command under cmd/routerd-ndpi-agent.
- Keep the default command as an unavailable service boundary when the CGO libndpi backend is not enabled.
- Build the libndpi backend only behind the explicit libndpi build tag, when CGO and libndpi headers are available.
- Define a small Unix-socket HTTP+JSON API:
  - GET /v1/status
  - GET /v1/healthz
  - POST /v1/observe-packet
  - optionally POST /v1/reset-flow
- Keep nDPI flow state in the agent, not in routerd-dpi-classifier.
- Enforce flow TTL, flow limit, and inspected-packet limit in the service boundary.
- Add a selftest/status path that reports whether libndpi is loaded.
- Add a libndpi-tagged test that classifies a synthetic TLS ClientHello with the native backend.
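The Unix-socket HTTP+JSON boundary from this phase can be exercised with the Go standard library alone. The sketch below serves only /v1/status and /v1/healthz and reads the status back through an `http.Client` whose transport dials the socket. The `status` body shape is an assumption (only the libndpiLoaded name appears in this note), the handlers do not restrict HTTP methods, and POST /v1/observe-packet is left out because its request body is not specified here.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// status is a hypothetical response body for GET /v1/status.
type status struct {
	LibndpiLoaded bool `json:"libndpiLoaded"`
}

// serve starts a stand-in agent on the given Unix socket path.
func serve(sock string, loaded bool) (net.Listener, error) {
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return nil, err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/status", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(status{LibndpiLoaded: loaded})
	})
	mux.HandleFunc("/v1/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	go http.Serve(ln, mux) // returns once the listener is closed
	return ln, nil
}

// unixClient returns an HTTP client that always dials the Unix socket.
// The host part of request URLs is ignored by this dialer.
func unixClient(sock string) *http.Client {
	return &http.Client{Transport: &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", sock)
		},
	}}
}

// fetchStatus calls GET /v1/status over the socket and decodes the body.
func fetchStatus(sock string) (status, error) {
	var st status
	resp, err := unixClient(sock).Get("http://ndpi-agent/v1/status")
	if err != nil {
		return st, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return st, fmt.Errorf("status endpoint returned %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&st)
	return st, err
}

// demo runs the stand-in agent in a temporary directory and queries it.
func demo() (bool, error) {
	dir, err := os.MkdirTemp("", "ndpi-agent")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "agent.sock")
	ln, err := serve(sock, true)
	if err != nil {
		return false, err
	}
	defer ln.Close()
	st, err := fetchStatus(sock)
	return st.LibndpiLoaded, err
}

func main() {
	loaded, err := demo()
	fmt.Println(loaded, err)
}
```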
Phase 4: Wire service management
- Render systemd/OpenRC/FreeBSD/NixOS service scaffolding for routerd-ndpi-agent only when configured by classifier engine selection.
- Make routerd-dpi-classifier.service want and start after routerd-ndpi-agent.service when engine is auto or ndpi-agent.
- Keep routerd-firewall-logger.service depending only on routerd-dpi-classifier.service.
- Update install.sh dependency handling so libndpi is optional and explicit.
- Ensure upgrades restart active helper services that are running deleted binaries.
Phase 5: Improve analysis surfaces
- Add Web Console observed-flow source metrics for nDPI, built-in, port fallback, and unidentified traffic where data is available.
- Add Web Console service health metrics for DPI engine status and nDPI error counters.
- Add explicit classifier timeout-rate counters.
- Add client-level protocol/category summaries based on the same one-hour window used for traffic, DNS, firewall, and DHCP evidence.
- Add filters for source=ndpi-agent, source=builtin, and source=port-fallback.
- Keep destination provider labels separate from protocol/application labels.
- Persist traffic-flow enrichment from dpi_flow, SNI, HTTP host, DNS query, resolved hostname, and port fallback.
- Persist typed DPI fields (detectedProtocol, applicationProtocol, category, confidence, risk, and metadata) through dpi_flow, active traffic flows, control API JSON, and the Web Console classification path.
- Keep read-only Web Console queries compatible with legacy SQLite files until the writer has had a chance to add the new DPI columns.
Phase 6: Production evaluation on homert02
- Enable routerd-ndpi-agent on homert02 only.
- Keep acceptSampleRate: 1 initially, but cap copyRange to 1536 or 2048 bytes.
- Confirm current nDPI-agent health on homert02, including libndpiLoaded and errorPackets.
- Confirm current traffic-flow enrichment counts on homert02.
- Capture full before/after metrics:
  - port-fallback ratio
  - unknown ratio
  - nDPI classified ratio
  - CPU and RSS for routerd-firewall-logger, routerd-dpi-classifier, and routerd-ndpi-agent
  - NFLOG backlog/drop indicators if available
  - Web Console first-load latency
- Exercise rollback by disabling only routerd-ndpi-agent; built-in DPI remains.
Production notes from homert02 on 2026-05-16:
- The nDPI agent was healthy with libndpiLoaded=true, libndpiVersion=4.2.0, and errorPackets=0.
- Before the new deploy, the nDPI agent reported 3,118 active flows, 5,531 observed packets, 4,791 backend packets, 1,530 classified packets, 3,261 unknown packets, and 740 skipped packets.
- After enabling copyRange: 2048, routerd needed a service restart so the long-running daemon reloaded the edited config and regenerated /run/routerd/firewall.nft. A one-shot apply alone wrote the snaplen-enabled file used by that process, but the already-running daemon still had the previous in-memory config.
- The live nftables ruleset then exposed snaplen: 2048 on routerd firewall log rules.
- A Web Console summary fetch with 600 traffic flows returned HTTP 200 in 2.73s. The sample contained 51 nDPI-agent flows (8.5%), 1 built-in flow (0.2%), 444 port-fallback flows (74.0%), and 104 unidentified flows (17.3%).
- Process snapshots after deploy showed roughly 35 MiB RSS / 22% CPU for routerd-firewall-logger, 12 MiB RSS / 0.3% CPU for routerd-dpi-classifier, and 17 MiB RSS / 0.2% CPU for routerd-ndpi-agent.
- Host counters did not expose a direct NFLOG drop counter. Netlink inspection showed the firewall logger socket present, and UDP receive-buffer error counters were zero.
- The routerd supervisor restarted routerd-ndpi-agent quickly when it was stopped. A classifier self-test with an intentionally missing agent socket fell back to the built-in TLS SNI classifier, preserving analysis without inline packet verdict impact.
Non-goals
- Do not use DPI for firewall verdicts in this work.
- Do not introduce NFQUEUE or inline packet verdict handling.
- Do not make routerd core depend on CGO or libndpi.
- Do not store packet payloads in persistent databases.
- Do not claim application identity is authoritative when the source is only a port guess or CDN/provider heuristic.
Open questions
- Whether DPIClassifier should be a first-class resource or remain generated from FirewallLog until the configuration shape settles.
- Whether routerd-ndpi-agent should use HTTP+JSON initially or a smaller binary protocol after the API stabilizes.
- Which libndpi package names and versions are acceptable for Ubuntu, Alpine, FreeBSD, and NixOS.
- How much nDPI risk metadata should be exposed in the v1alpha1 API.
- Whether cloud IP/FQDN intelligence should live in the same classifier pipeline or a separate enrichment pipeline.