Rosa Del Mar

Daily Brief

Issue 65 2026-03-06

Distributed SDN: On-Router Controller Replication

General
Sources: 1 • Confidence: Medium • Updated: 2026-04-11 19:23

Key takeaways

  • DSDN is described as data-plane-agnostic and able to use other source-routing data planes such as SRv6, with data-plane programming implemented as a separate controller module.
  • A contributor argues that complexity in centralized SDN architectures has contributed to significant network failures despite redundancy and best practices.
  • DSDN is described as incrementally deployable in brownfield networks by programming a separate or lower-priority VRF and initially steering a small fraction of traffic or a subset of destinations through DSDN paths.
  • MPLS label-stack depth limits are described as a constraint for strict source routing at global-WAN scale, and chips supporting roughly 12 to 14 labels may still be insufficient for some traffic-engineered paths.
  • The DSDN churn evaluation results described were generated with fast reroute disabled to reflect a worst-case impact scenario.

Sections

Distributed SDN: On-Router Controller Replication

  • Distributed SDN (DSDN) is described as running a full copy of the SDN controller on every router, co-resident with the data plane, to avoid reliance on an out-of-band controller network.
  • In DSDN, each router is described as running a global traffic-engineering optimization and programming only source routes for flows originating at that router.
  • DSDN is described as using rich flooding, where routers flood local topology state plus additional information such as utilization and an aggregated demand vector so each router reconstructs a global view.
  • To manage churn, DSDN is described as aggregating small topology and demand changes over a short window and triggering immediate recomputation only for large events like link failures.
  • DSDN is described as data-plane-agnostic and able to use other source-routing data planes such as SRv6, with data-plane programming implemented as a separate controller module.
  • Modern routers are described as capable of hosting operator-run containers and exposing standardized management/programming APIs such as gNMI and gRIBI/OpenConfig.
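The control loop implied by these bullets can be sketched in a few lines. Everything below is an illustrative assumption (class and method names, the batch window semantics, the stub TE solver), not the actual DSDN implementation:

```python
# Illustrative sketch of the per-router DSDN control loop described above.
# All names and the stub TE solver are assumptions, not the real code.

class RouterControlLoop:
    def __init__(self, router_id):
        self.router_id = router_id
        self.global_view = {}     # peer state reconstructed from rich flooding
        self.pending = []         # small changes batched over a short window
        self.programmed = {}      # flow -> path, only for locally sourced flows
        self.recompute_count = 0

    def on_flood_received(self, origin, state):
        # Rich flooding: topology plus utilization and an aggregated demand vector.
        self.global_view[origin] = state

    def on_change(self, kind, detail):
        # Large events (link failures) recompute immediately; small ones batch.
        if kind == "link_failure":
            self._recompute()
        else:
            self.pending.append((kind, detail))

    def flush_window(self):
        # Invoked once per aggregation window (e.g. every few hundred ms).
        if self.pending:
            self.pending.clear()
            self._recompute()

    def _recompute(self):
        self.recompute_count += 1
        # Stand-in for the global traffic-engineering optimization.
        all_paths = {("A", "B"): ["A", "X", "B"], ("C", "B"): ["C", "B"]}
        # Each router programs only the source routes for flows it originates.
        self.programmed = {flow: path for flow, path in all_paths.items()
                           if flow[0] == self.router_id}

loop = RouterControlLoop("A")
loop.on_change("utilization_delta", {"link": ("A", "X")})
loop.on_change("utilization_delta", {"link": ("X", "B")})
loop.flush_window()                                   # two small changes, one recompute
loop.on_change("link_failure", {"link": ("X", "B")})  # immediate recompute
print(loop.recompute_count)      # 2
print(list(loop.programmed))     # [('A', 'B')]
```

The key property the sketch captures is that every router solves the same global problem but writes only its own slice of forwarding state, so no router depends on a remote controller being reachable.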

Centralized SDN: Operational Reality and Failure Domains

  • A contributor argues that complexity in centralized SDN architectures has contributed to significant network failures despite redundancy and best practices.
  • Centralized SDN uses a controller (or controller hierarchy) to compute engineered paths and push forwarding state into the network.
  • At planet scale, centralized SDN typically requires replicated controllers, a controller hierarchy, and a separate globally distributed out-of-band control-plane network to connect controllers to routers during in-band failures.
  • Large SDN deployments require structured telemetry aggregation to build an abstract global view of topology and demands for controller path computation.
  • In production, centralized SDN commonly retains an in-band distributed routing protocol as a fallback when the SDN control plane is disconnected (running headless).
  • It is claimed that over half of Google's largest WAN outages over a four-year period were root-caused to the control plane or control-plane infrastructure.

Deployment and Change Management Constraints

  • DSDN is described as incrementally deployable in brownfield networks by programming a separate or lower-priority VRF and initially steering a small fraction of traffic or a subset of destinations through DSDN paths.
  • High change risk and large failure blast radius are described as key reasons legacy networking components persist even when better ideas exist.
  • Rolling out DSDN software updates is described as requiring careful versioning and capability negotiation, and traffic-engineering algorithm changes may require coordinated rollout via shadow runs and gradual traffic shifts.
  • Network architectures are described as tending to ossify because radical forwarding paradigm shifts are hard to implement in complex environments intertwined with business systems.
  • Google is described as actively investigating deploying DSDN, and the implementation described was written as production-grade code rather than throwaway research code.
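One gating mechanism mentioned above, capability negotiation during mixed-version rollout, can be illustrated with a minimal sketch: a feature is enabled network-wide only once every participating router advertises support for it. Router names and feature strings here are hypothetical, not from the source:

```python
# Hypothetical capability negotiation for mixed-version rollout: a feature
# activates only when every participating router advertises it.

def negotiated_features(advertisements):
    """advertisements: dict of router_id -> set of supported feature strings."""
    feature_sets = list(advertisements.values())
    return set.intersection(*feature_sets) if feature_sets else set()

ads = {
    "r1": {"te_v1", "te_v2", "demand_vector"},
    "r2": {"te_v1", "demand_vector"},        # not yet upgraded to te_v2
    "r3": {"te_v1", "te_v2", "demand_vector"},
}
print(sorted(negotiated_features(ads)))  # ['demand_vector', 'te_v1']
```

Under this model, a traffic-engineering algorithm change ships behind a feature string and takes effect only after the last router upgrades, with shadow runs and gradual traffic shifts layered on top.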

Data-Plane and Hardware Constraints for Source Routing

  • DSDN is described as data-plane-agnostic and able to use other source-routing data planes such as SRv6, with data-plane programming implemented as a separate controller module.
  • MPLS label-stack depth limits are described as a constraint for strict source routing at global-WAN scale, and chips supporting roughly 12 to 14 labels may still be insufficient for some traffic-engineered paths.
  • A technique is described that encodes two hops inside a single MPLS label and programs routers to act on the label only when they are the intended hop, extending effective source-routed path length on existing hardware.
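A minimal sketch of the two-hops-per-label idea, assuming an illustrative bit layout (the source does not specify the actual encoding): pack two 10-bit node IDs into the 20-bit MPLS label field, and have each router act only when it recognizes itself as the intended hop.

```python
# Illustrative encoding only: the bit layout is an assumption. Two 10-bit
# hop IDs are packed into one 20-bit MPLS label; a router acts on the
# label only when it is the intended hop, otherwise it forwards unchanged.

HOP_BITS = 10                       # two 10-bit hops fit the 20-bit label field
HOP_MASK = (1 << HOP_BITS) - 1

def encode_two_hops(first_hop, second_hop):
    assert 0 <= first_hop <= HOP_MASK and 0 <= second_hop <= HOP_MASK
    return (first_hop << HOP_BITS) | second_hop

def next_action(label, my_node_id):
    first, second = label >> HOP_BITS, label & HOP_MASK
    if my_node_id == first:
        return ("forward_to", second)      # keep the label, steer to second hop
    if my_node_id == second:
        return ("pop_and_continue", None)  # both hops consumed, pop the label
    return ("transit", None)               # not our hop: forward unchanged

label = encode_two_hops(17, 42)
print(next_action(label, 17))   # ('forward_to', 42)
print(next_action(label, 42))   # ('pop_and_continue', None)
print(next_action(label, 99))   # ('transit', None)
```

The payoff is that each label-stack entry now expresses two hops instead of one, so a chip limited to roughly 12 to 14 labels could, under this scheme, express strict source routes of roughly twice that many hops on existing hardware.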

Convergence and Evaluation Claims Under Stress

  • The DSDN churn evaluation results described were generated with fast reroute disabled to reflect a worst-case impact scenario.
  • It is claimed that, in experiments with failure and churn rates increased up to 20x production levels, DSDN had faster end-to-end reaction and convergence than traditional approaches.
  • It is claimed that DSDN can converge roughly 100 to 120 times faster than centralized SDN because computation and programming occur locally on routers and avoid multi-hop coordination.

Watchlist

  • The production readiness and operational maturity of the DSDN implementation remains an open question in this excerpt and should be validated before assuming deployability.
  • Alex Krenzel calls for increased dialogue between academic networking researchers and network operators because academic “clean” protocol narratives often diverge from operational reality where legacy systems persist and radical shifts are risky.
  • Ethan Banks is watching for production use of the discussed distributed SDN paradigm to gather real-world feedback and assess outcomes.

Unknowns

  • What primary data supports the claim that over half of Google's largest WAN outages in a four-year period were rooted in the control plane or control-plane infrastructure, and what definitions and categories were used?
  • What are the exact experimental setups and baselines behind the reported 20x-churn stress tests and the reported 100–120x convergence improvement, and do these results hold with fast reroute enabled?
  • What is the scaling overhead of rich flooding (control traffic, CPU, memory) as topology size, churn rate, and demand-vector dimensionality increase?
  • How are containerized control applications on routers isolated, scheduled, and resource-limited to prevent control-plane instability from co-resident software faults?
  • What are the real-world operational procedures for mixed-version operation, capability negotiation, and safe rollback during DSDN upgrades, and how often do algorithm changes require coordinated rollout?

Investor overlay

Read-throughs

  • If on-router replicated SDN gains production traction, demand could increase for router platforms that can reliably host containerized control applications with strong isolation, scheduling, and resource limits, since co-resident software faults are a highlighted operational risk.
  • Strict source routing feasibility depends on data-plane encoding and ASIC limits. If MPLS label-stack depth remains constraining at WAN scale, there could be a read-through toward SRv6 capable platforms or alternative encodings, and away from shallow label-stack dependencies.
  • Incremental brownfield deployment via separate or lower-priority VRFs and small traffic steering suggests a path where tooling for mixed-version interoperability, capability negotiation, safe rollback, and shadow deployment becomes a gating enabler for adoption and operator confidence.

What would confirm

  • Documented production deployments showing incremental rollout using VRF isolation and limited traffic steering, plus real-world feedback on outcomes and operator procedures for mixed-version operation, capability negotiation, and rollback during upgrades.
  • Independent, reproducible evaluations of convergence and churn behavior with transparent experimental setups and baselines, including results with fast reroute enabled, and quantified overhead of rich flooding as topology size and churn increase.
  • Evidence of hardware and encoding viability at WAN scale, such as acceptable operational use of SRv6 or validated mitigations for MPLS label-stack depth, along with demonstrated source-routed path coverage for traffic-engineered use cases.

What would kill

  • Production trials show control-plane instability from on-router containerized applications, inability to enforce resource limits, or unacceptable operational complexity from co-resident faults, undermining the premise that decentralization reduces failure domains.
  • Scaling studies show rich flooding overhead grows too quickly in control traffic, CPU, or memory under realistic topology and churn, or convergence claims fail to hold under representative conditions such as with fast reroute enabled.
  • Operational constraints block incremental deployment, such as lack of safe mixed-version interoperability, unreliable capability negotiation, or high-risk upgrade and rollback requirements, making brownfield adoption impractical despite architectural benefits.

Sources