Vendor Opacity, Supplier Dynamics, And Fix Path
Sources: 1 • Confidence: Medium • Updated: 2026-04-11 19:45
Key takeaways
- Flex initially claimed they had never seen the problem in the field.
- MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
- Historical logs showed the issue had occurred before at least one sled shipped, and Oxide added manufacturing test checks to avoid shipping additional affected sleds.
- Reproducing and measuring the event required intrusive scope probing and repeated multi-day runs due to the sled’s cableless design and long time-to-failure.
- The team wants to implement an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.
Sections
Vendor Opacity, Supplier Dynamics, And Fix Path
- Flex initially claimed they had never seen the problem in the field.
- Oxide argued the symptom is subtle enough that typical field monitoring would likely miss it.
- Oxide wants to avoid applying the prior mitigation to the fixed part revision in order to reduce software complexity; accidentally applying the mitigation to a fixed part is considered safe, so the concern is complexity rather than correctness.
- Oxide could not obtain vendor-provided debug firmware or equivalent introspection from Flex despite documentation hints, limiting visibility into IBC internal behavior.
- Flex issued a PCN stating that the auxiliary power circuit had insufficient margin, such that certain combinations of component tolerances could cause the IBC to falsely sense low input voltage and shut down, matching the observed behavior.
- Flex’s mitigation consisted of programming undocumented register values to effectively disable the IBC’s undervoltage detector, relying instead on upstream hot-swap undervoltage protection; this change appeared to eliminate the droop.
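The revision-gating concern above can be sketched as a small policy check. Everything here is illustrative, not Oxide's actual implementation: the revision encoding, `FIXED_REVISION`, and the step names are invented for the sketch.

```python
# Illustrative sketch: apply Flex's register-programming workaround only to
# unfixed IBC revisions. FIXED_REVISION and the revision encoding are
# invented for illustration; they are not the real part's values.
FIXED_REVISION = 2  # assumed first revision incorporating the PCN fix


def needs_mitigation(revision: int) -> bool:
    """True when the part predates the fix and needs the workaround.

    Accidentally mitigating a fixed part is believed safe, so this check
    exists to keep software state simple, not to prevent damage.
    """
    return revision < FIXED_REVISION


def plan_bringup(revision: int) -> list:
    """Assemble an ordered bring-up plan, inserting the workaround step
    only for unfixed parts."""
    steps = ["power_on", "read_revision"]
    if needs_mitigation(revision):
        # Undocumented register writes that disable the IBC's own
        # undervoltage detector (upstream hot-swap protection still applies).
        steps.append("apply_uv_detector_workaround")
    steps.append("enable_output")
    return steps
```

Keeping the conditional in one place like this is the point of the complexity argument: a mixed fleet only needs one branch, and it disappears once unfixed parts age out.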
Failure Chain From Transient Power Event To Storage Panic
- MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
- During the ~8V input event, most downstream regulators and the CPU remained operational while drives and the 12V-generation path were the first to fail due to tolerance differences.
- The observed 12V rail droop event is short (about 12 milliseconds) and resembles a controlled ramp, implying the controller deliberately reduced its output rather than the rail passively discharging.
- The initial operational symptom was one sled showing much lower uptime than its peers, suggesting unexpected resets or panics.
- Crash dumps indicated ZFS panicked after being blocked on I/O to the pool for minutes.
- Dump analysis suggested the pool hang was not a ZFS deadlock but an underlying device-layer failure where device state appeared uninitialized.
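The "controlled ramp, not passive discharge" inference above rests on waveform shape: a controlled ramp-down is roughly linear in time, while an RC discharge decays exponentially (linear in log-voltage). A minimal sketch of that discrimination on synthetic samples (all data and thresholds invented; this is not the team's analysis code):

```python
import math


def _rel_residual(xs, ys):
    """RMS residual of a least-squares line fit, normalized by std(ys),
    so fits on different scales (volts vs log-volts) are comparable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rms = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n)
    std = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return rms / std


def classify_droop(t, v):
    """Return 'ramp' when v-vs-t fits a line better than log(v)-vs-t does.

    A controlled ramp is roughly linear in v; a passive RC discharge is
    linear in log(v). Compare normalized residuals of the two fits.
    """
    linear_fit = _rel_residual(t, v)
    exp_fit = _rel_residual(t, [math.log(x) for x in v])
    return "ramp" if linear_fit < exp_fit else "exponential"
```

On a scope capture the same comparison would run over just the falling edge; the point is only that the two hypotheses are cheaply distinguishable once samples are in hand.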
Manufacturing Containment And Defect Prevalence
- Historical logs showed the issue had occurred before at least one sled shipped, and Oxide added manufacturing test checks to avoid shipping additional affected sleds.
- The added manufacturing test check detected two to three affected sleds in a rack about to ship.
- The incidence of the issue in manufacturing was estimated at roughly 3% to 5% of sleds.
- The issue appeared only in full rack context and not on the bench, and affected sleds were being set aside in manufacturing as a growing quarantined inventory pool.
- If a sled is going to exhibit the issue, it appears to do so in manufacturing rather than emerging later in the field.
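The ~3% to 5% incidence figure presumably rests on a small count of affected sleds over a modest build volume, which is exactly the uncertainty flagged in the Unknowns below. A binomial confidence interval makes the width of that uncertainty concrete; the counts here (3 affected of 80 built) are invented for illustration:

```python
import math


def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half


# With hypothetical counts of 3 affected sleds out of 80 built, the 95%
# interval spans roughly 1% to 10%: consistent with "3% to 5%" but far
# from pinned down at these sample sizes.
lo, hi = wilson_interval(3, 80)
```

The takeaway is that at fleet sizes like these, "3% to 5%" and "1% to 10%" are statistically the same claim.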
Environmental Dependency And Measurement Friction
- Reproducing and measuring the event required intrusive scope probing and repeated multi-day runs due to the sled’s cableless design and long time-to-failure.
- The problem did not reproduce on a bench setup but did reproduce when installed in a rack environment after running for days.
- Operating the part at 54.5V may make the issue less likely than operating closer to 50V, implying input voltage affects failure incidence.
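The setpoint observation above is a margin argument: if the aux-supply problem injects error into the IBC's input-voltage sensing, a higher bus setpoint leaves more headroom before a worst-case error crosses the trip point. A back-of-envelope sketch, where the trip threshold and sensing error are invented numbers (only the 50V and 54.5V setpoints come from the notes):

```python
# Illustrative only: UV_THRESHOLD and SENSE_ERROR are assumed values, not
# figures from the PCN or the part's datasheet.
UV_THRESHOLD = 47.0  # assumed IBC undervoltage trip point (volts)
SENSE_ERROR = 2.5    # assumed worst-case sensing error from the aux-supply
                     # margin problem (volts)


def false_trip_margin(setpoint):
    """Volts of headroom before a worst-case sensing error would make a
    healthy input look like it crossed the trip point."""
    return (setpoint - SENSE_ERROR) - UV_THRESHOLD
```

Under these assumed numbers, `false_trip_margin(54.5)` is 5.0 V while `false_trip_margin(50.0)` is only 0.5 V, which would be consistent with the issue appearing more readily near 50V.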
Operational Readthrough, Serviceability, And Resilience Roadmap
- The team wants to implement an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.
- A manufacturing mishap showed that if the heatsink is not installed on this part, the sled overheats.
- At a customer site, the team needed software support to reconfigure around the affected sled and safely remove it because a distributed system was running on it.
- The illumos hot-plug framework could be used to offline and re-online the affected devices to recover, but the millisecond-scale event is far faster than the framework expects, making mitigation non-trivial.
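One shape such an OS-level resilience change could take is an offline/online recovery cycle around the affected devices. The sketch below models only the control flow; the injected callables stand in for whatever the hot-plug framework actually provides on illumos, and every name and parameter here is an assumption for illustration:

```python
import time


def recover_device(offline, online, is_healthy, attempts=3, base_delay=0.01):
    """Cycle a device offline and back online until it reports healthy.

    Returns True on recovery, False once all attempts are exhausted.
    offline/online/is_healthy are injected callables (in a real system they
    might wrap hot-plug framework operations); exponential backoff between
    cycles gives the hardware time to settle after the transient.
    """
    for attempt in range(attempts):
        offline()
        online()
        if is_healthy():
            return True
        time.sleep(base_delay * (2 ** attempt))
    return False
```

Injecting the three operations keeps the retry policy testable without real hardware, which matters for a path that, per the notes, only fires under a millisecond-scale rack-level event.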
Watchlist
- The team wants to implement an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.
Unknowns
- What are the underlying counts, observation windows, and confidence intervals behind the stated manufacturing incidence rate and the reported post-mitigation drop to zero failures?
- Which specific rack-level variables (power distribution, supply voltage setpoint, airflow/thermal profile, EMI, load pattern) are necessary and sufficient for reproduction?
- What exact internal IBC sensing path/auxiliary supply margin issue is described in the PCN, and how does it map to the observed VN undervoltage reporting with stable measured inputs?
- How does disabling the IBC undervoltage detector affect protection coverage across all credible fault scenarios (true undervoltage, transient dips, upstream faults), and what validation was performed?
- How will Oxide reliably identify fixed versus unfixed part revisions in manufacturing and at runtime to prevent unnecessary software branching and reduce operational errors in mixed fleets?