Rosa Del Mar

Daily Brief

Issue 65 2026-03-06

Constraints And Rollout Mechanics For Distributed SDN

General
Sources: 1 • Confidence: Medium • Updated: 2026-03-08 21:25

Key takeaways

  • To manage churn, distributed SDN aggregates small topology and demand changes over a short window and triggers immediate recomputation for large events like link failures.
  • There is an ongoing debate about whether BGP is a flawed protocol or an elegant, extensible design.
  • Distributed SDN is data-plane-agnostic and can use other source-routing data planes such as SRv6, with data-plane programming implemented as a separate controller module.
  • Centralized SDN uses a controller or controller hierarchy to compute engineered paths and push forwarding state into the network.
  • The distributed SDN churn results discussed were generated without fast reroute enabled to reflect a worst-case impact scenario.

Sections

Constraints And Rollout Mechanics For Distributed SDN

  • To manage churn, distributed SDN aggregates small topology and demand changes over a short window and triggers immediate recomputation for large events like link failures.
  • Distributed SDN can be incrementally deployed in brownfield networks by programming entries in a separate or lower-priority VRF and initially steering a small fraction of traffic or a subset of destinations through distributed SDN paths.
  • MPLS label-stack depth limits constrain strict source routing at global-WAN scale, and router chips supporting roughly 12 to 14 labels may be insufficient for some traffic-engineered paths.
  • Distributed SDN can extend effective source-routed path length by encoding two hops inside a single MPLS label and programming routers to act on the label only when they are the intended hop.
  • Fast reroute reduces loss during convergence but cannot prevent impact when backup paths are already near capacity and capacity is not reserved for failures.
  • Rolling out distributed SDN software updates requires careful versioning and capability negotiation, and traffic-engineering algorithm changes may require coordinated rollout using shadow runs and gradual traffic shifts.
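The batching rule in the first bullet can be illustrated with a minimal sketch. The window length, the event classification, and all names here are illustrative assumptions, not details from the source: minor deltas are batched over a short window, while major events flush immediately.

```python
# Sketch of the churn-management rule: minor topology/demand deltas are
# batched over a short aggregation window, while major events (e.g. a link
# failure) trigger immediate recomputation. Window length and the set of
# "major" events are assumed for illustration.
AGGREGATION_WINDOW = 0.5  # seconds; assumed value, not from the source

class ChurnAggregator:
    MAJOR_EVENTS = {"link_down", "node_down"}

    def __init__(self, recompute):
        self.recompute = recompute      # callback that runs the TE solver
        self.pending = []
        self.window_opened_at = None

    def on_event(self, kind, detail, now):
        self.pending.append((kind, detail))
        if kind in self.MAJOR_EVENTS:
            # Large events bypass batching entirely.
            self.flush()
        elif self.window_opened_at is None:
            self.window_opened_at = now  # open a fresh batching window
        elif now - self.window_opened_at >= AGGREGATION_WINDOW:
            self.flush()

    def flush(self):
        if self.pending:
            self.recompute(self.pending)
        self.pending = []
        self.window_opened_at = None
```

A production implementation would flush on a timer rather than piggybacking the window check on the next event, but the trade-off is the same: bounded recomputation frequency for small churn, immediate reaction for failures.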
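The two-hops-per-label idea can be made concrete with a packing sketch. The 20-bit label width is the real MPLS label field size; splitting it into two 10-bit hop identifiers (up to 1,024 distinct next hops each) is an assumed encoding for illustration, not the actual scheme used by the system discussed.

```python
# Illustrative sketch: packing two hop identifiers into one 20-bit MPLS label.
# The 20-bit width is the real MPLS label field size; the 10-bit split and the
# helper names are assumptions, not the source's actual encoding.
HOP_BITS = 10
HOP_MAX = (1 << HOP_BITS) - 1  # 1023

def pack_label(first_hop: int, second_hop: int) -> int:
    """Encode two hop IDs into a single 20-bit label value."""
    if not (0 <= first_hop <= HOP_MAX and 0 <= second_hop <= HOP_MAX):
        raise ValueError("hop IDs must fit in 10 bits")
    return (first_hop << HOP_BITS) | second_hop

def unpack_label(label: int) -> tuple[int, int]:
    """Decode a 20-bit label back into its two hop IDs."""
    return (label >> HOP_BITS) & HOP_MAX, label & HOP_MAX

# With a push budget of roughly 12-14 labels, two hops per label roughly
# doubles the expressible strict path length, to about 24-28 hops.
label = pack_label(5, 700)
assert unpack_label(label) == (5, 700)
```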

Adoption, Ecosystem Ossification, And The Practice-Theory Gap

  • There is an ongoing debate about whether BGP is a flawed protocol or an elegant, extensible design.
  • High change risk and large blast radius of failures are key reasons legacy networking components persist even when better ideas exist.
  • There is a clear difference between how academia describes SDN and what SDN deployments look like in practice at Google.
  • Alex Krenzel states that he views BGP as elegant and extensible rather than fundamentally broken.
  • Networks tend to ossify because radical forwarding paradigm shifts are hard to implement in complex environments intertwined with business systems.
  • Academia often lacks access to the operational rationale behind current network designs, and missing that context can render research output useless.

Distributed SDN Architecture On Routers

  • Distributed SDN is data-plane-agnostic and can use other source-routing data planes such as SRv6, with data-plane programming implemented as a separate controller module.
  • Distributed SDN uses rich flooding, in which each router floods its local topology state plus additional information such as utilization and an aggregated demand vector, so that every router reconstructs a global view similar to a link-state database.
  • Modern routers can host operator-run containers and expose standardized management/programming APIs such as gNMI and gRIBI/OpenConfig.
  • Distributed SDN can be implemented by running a full copy of the SDN controller on every router so control and data planes remain conceptually separate but co-resident, avoiding reliance on an out-of-band controller network.
  • In distributed SDN, each router runs an operator-specified global traffic-engineering optimization using the shared global view, then programs only the source routes for flows that originate at that router.
  • Distributed SDN can be consensus-free in the core because strict source routing encodes the full path in the packet header, so changing a flow’s path requires only an ingress update rather than per-hop reservation or agreement.
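The per-router loop described in these bullets can be sketched minimally: every router ingests flooded state into a shared-shape global view, runs the same global TE computation, and programs only the flows that originate locally. All class and function names here are illustrative assumptions; a real implementation would sit behind APIs such as gRIBI and run a real TE solver.

```python
# Minimal sketch of the per-router distributed SDN loop. LinkState, Router,
# and compute_te_paths are illustrative assumptions, not the real system's
# interfaces.
from dataclasses import dataclass

@dataclass
class LinkState:
    router: str
    links: dict          # neighbor -> capacity
    utilization: dict    # neighbor -> current load
    demand_vector: dict  # destination -> aggregated local demand

def compute_te_paths(view):
    # Placeholder for the operator-specified global TE optimization: here we
    # just emit a direct src->dst "path" for every demanded destination.
    paths = {}
    for name, state in view.items():
        for dst in state.demand_vector:
            paths[(name, dst)] = [name, dst]
    return paths

class Router:
    def __init__(self, name: str):
        self.name = name
        self.global_view: dict[str, LinkState] = {}

    def on_flooded_state(self, state: LinkState) -> None:
        # Rich flooding: every router stores every other router's state,
        # reconstructing a link-state-database-like global view.
        self.global_view[state.router] = state

    def recompute(self) -> dict:
        # Every router runs the SAME deterministic optimization over the
        # SAME global view...
        all_paths = compute_te_paths(self.global_view)
        # ...but programs only source routes for flows that ORIGINATE here.
        # Because the full path travels in the packet header, no per-hop
        # agreement is needed: this is the consensus-free property.
        return {flow: path for flow, path in all_paths.items()
                if flow[0] == self.name}
```

Determinism of the shared optimization matters here: since each router computes over the same inputs, filtering to locally originated flows yields globally consistent paths without any coordination protocol.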

Centralized SDN In Practice Is Multi-Infrastructure And Dual-Control

  • Centralized SDN uses a controller or controller hierarchy to compute engineered paths and push forwarding state into the network.
  • At planet scale, centralized SDN typically requires replicated controllers, a controller hierarchy, and a separate globally distributed out-of-band control-plane network to connect controllers to routers during in-band failures.
  • Large SDN deployments require structured telemetry aggregation to build an abstract global view of topology and demands for the controller to compute paths.
  • In production, centralized SDN does not eliminate distributed routing protocols because operators often keep an in-band protocol running as a fallback when the SDN control plane is disconnected.
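The dual-control behavior in the last bullet amounts to a preference rule in route selection: use SDN-programmed state while the controller session is healthy, and fall back to in-band protocol routes otherwise. This is a hedged sketch of that rule; the staleness bound and all names are assumptions, not a specific vendor's behavior.

```python
# Sketch of dual-control forwarding: prefer controller-programmed routes,
# fall back to in-band (IGP/BGP) routes when controller state goes stale.
# The TTL-based staleness rule is an illustrative assumption.
import time

SDN_STATE_TTL = 30.0  # assumed staleness bound, in seconds

class Fib:
    def __init__(self):
        self.sdn_routes: dict[str, str] = {}   # prefix -> next hop (engineered)
        self.igp_routes: dict[str, str] = {}   # prefix -> next hop (in-band)
        self.last_controller_update = 0.0

    def controller_push(self, prefix, nexthop, now=None):
        # State pushed over the out-of-band control network.
        self.sdn_routes[prefix] = nexthop
        self.last_controller_update = now if now is not None else time.time()

    def lookup(self, prefix, now=None):
        now = now if now is not None else time.time()
        fresh = (now - self.last_controller_update) <= SDN_STATE_TTL
        if fresh and prefix in self.sdn_routes:
            return self.sdn_routes[prefix]     # engineered path wins
        return self.igp_routes.get(prefix)     # in-band fallback
```

The point of the sketch is the failure mode it encodes: when the controller is unreachable longer than the staleness bound, forwarding degrades to the distributed protocol's paths rather than blackholing, which is why production deployments keep that protocol running.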

Evaluation Claims And Methodology Caveats

  • The distributed SDN churn results discussed were generated without fast reroute enabled to reflect a worst-case impact scenario.
  • In experiments, the distributed SDN evaluation injected failure and churn rates of up to 20 times production levels and still measured faster end-to-end reaction and convergence than traditional approaches.
  • Fast reroute reduces loss during convergence but cannot prevent impact when backup paths are already near capacity and capacity is not reserved for failures.
  • Distributed SDN is reported to converge roughly 100 to 120 times faster than centralized SDN because computation and programming occur locally on routers and avoid multi-hop coordination required by central controllers.

Watchlist

  • The production readiness and operational maturity of the distributed SDN implementation remains an open question in the excerpt and should be validated before assuming deployability.
  • Ethan Banks is watching for production use of the discussed distributed SDN paradigm to gather real-world feedback and assess outcomes.
  • Alex Krenzel calls for increased dialogue between academic networking researchers and network operators because academic “clean” protocol narratives often diverge from operational reality where legacy systems persist and radical shifts are risky.

Unknowns

  • What published evidence (paper, dataset, or methodology details) supports the outage attribution and convergence speedup claims, and what are the exact metrics and baselines used?
  • Is distributed SDN running in any production or partial-production deployment, and if so what are the observed incident root causes, change-failure rates, and recovery characteristics?
  • What are the scaling limits and operational stability characteristics of rich flooding (control traffic overhead, CPU/memory impact, and behavior under churn) in representative topologies?
  • How does distributed SDN interoperate with existing routing/TE systems during incremental deployment, especially given that centralized SDN often retains in-band routing fallbacks?
  • What concrete hardware constraints (label-stack depth by platform, MTU impacts, and operational constraints of two-hops-per-label encoding) bound strict source-routing path expressiveness in practice?

Investor overlay

Read-throughs

  • If distributed SDN that runs controller logic on routers matures operationally, it could shift spend toward router software stacks and in-router compute, and away from external controller and telemetry aggregation infrastructure.
  • Data-plane-agnostic distributed SDN that can use SRv6-like source routing suggests optionality for multiple forwarding encodings, potentially increasing demand for platforms supporting strict source routing within practical label stack and MTU constraints.
  • Centralized SDN described as dual-control with controller hierarchies and separate telemetry pipelines implies sustained operational complexity, creating ongoing demand for tooling that reduces change risk, validates path programming, and supports incremental brownfield migration.

What would confirm

  • Verified production or partial-production deployments of distributed SDN, with disclosed incident root causes, change-failure rates, and recovery characteristics that show acceptable operational maturity.
  • Published methodology details, datasets, and baselines supporting claimed convergence speedups and churn behavior, including comparisons with fast reroute enabled and under representative utilization.
  • Clear interoperability stories for incremental deployment alongside existing routing and TE, plus documented hardware constraint boundaries for source-routing expressiveness across common platforms.

What would kill

  • Evidence that rich flooding control traffic or in-router software containers create instability or unacceptable CPU and memory overhead under realistic churn and topology sizes.
  • Real-world results showing that convergence and outage outcomes are highly sensitive to baseline choices such as fast reroute settings, making the claimed speedups non-reproducible or not meaningful versus operational practice.
  • Incremental deployment proving impractical, whether due to poor interoperation with existing routing fallbacks or because label-stack depth and MTU constraints materially limit strict source-routed paths.

Sources