Vendor Opacity, Supplier Dynamics, And Fix Path
Sources: 1 • Confidence: Medium • Updated: 2026-04-11 19:45
Key takeaways
- Flex initially claimed they had never seen the problem in the field.
- MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
- Historical logs showed the issue had occurred before at least one sled shipped, and Oxide added manufacturing test checks to avoid shipping additional affected sleds.
- Reproducing and measuring the event required intrusive scope probing and repeated multi-day runs due to the sled’s cableless design and long time-to-failure.
- The team wants to implement an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.
Sections
Vendor Opacity, Supplier Dynamics, And Fix Path
- Flex initially claimed they had never seen the problem in the field.
- Oxide argued the symptom is subtle enough that typical field monitoring would likely miss it.
- Oxide wants to avoid applying the prior mitigation to the fixed part revision in order to reduce software complexity; accidentally applying the mitigation to a fixed part is considered safe, so the concern is complexity rather than correctness.
- Oxide could not obtain vendor-provided debug firmware or equivalent introspection from Flex despite documentation hints, limiting visibility into IBC internal behavior.
- Flex issued a PCN stating that the auxiliary power circuit had insufficient margin, such that certain combinations of component tolerances could cause the IBC to falsely sense low input voltage and shut down, matching the observed behavior.
- Flex’s mitigation consisted of programming undocumented register values to effectively disable the IBC’s undervoltage detector, relying instead on upstream hot-swap undervoltage protection; this change appeared to eliminate the droop.
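The revision-gating concern above can be sketched as a small policy check. Everything here is illustrative, not Oxide's actual implementation: the revision encoding, `FIXED_REVISION`, and the step names are invented for the sketch.

```python
# Illustrative sketch: apply Flex's register-programming workaround only to
# unfixed IBC revisions. FIXED_REVISION and the revision encoding are
# invented for illustration; they are not the real part's values.
FIXED_REVISION = 2  # assumed first revision incorporating the PCN fix


def needs_mitigation(revision: int) -> bool:
    """True when the part predates the fix and needs the workaround.

    Accidentally mitigating a fixed part is believed safe, so this check
    exists to keep software state simple, not to prevent damage.
    """
    return revision < FIXED_REVISION


def plan_bringup(revision: int) -> list:
    """Assemble an ordered bring-up plan, inserting the workaround step
    only for unfixed parts."""
    steps = ["power_on", "read_revision"]
    if needs_mitigation(revision):
        # Undocumented register writes that disable the IBC's own
        # undervoltage detector (upstream hot-swap protection still applies).
        steps.append("apply_uv_detector_workaround")
    steps.append("enable_output")
    return steps
```

Keeping the conditional in one place like this is the point of the complexity argument: a mixed fleet only needs one branch, and it disappears once unfixed parts age out.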
Failure Chain From Transient Power Event To Storage Panic
- MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
- During the ~8V input event, most downstream regulators and the CPU remained operational while drives and the 12V-generation path were the first to fail due to tolerance differences.
- The observed 12V rail droop event is short (about 12 milliseconds) and resembles a controlled ramp, implying the controller deliberately reduced its output rather than the rail passively discharging.
- The initial operational symptom was one sled showing much lower uptime than its peers, suggesting unexpected resets or panics.
- Crash dumps indicated ZFS panicked after being blocked on I/O to the pool for minutes.
- Dump analysis suggested the pool hang was not a ZFS deadlock but an underlying device-layer failure where device state appeared uninitialized.
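The "controlled ramp, not passive discharge" inference above rests on waveform shape: a controlled ramp-down is roughly linear in time, while an RC discharge decays exponentially (linear in log-voltage). A minimal sketch of that discrimination on synthetic samples (all data and thresholds invented; this is not the team's analysis code):

```python
import math


def _rel_residual(xs, ys):
    """RMS residual of a least-squares line fit, normalized by std(ys),
    so fits on different scales (volts vs log-volts) are comparable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rms = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n)
    std = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return rms / std


def classify_droop(t, v):
    """Return 'ramp' when v-vs-t fits a line better than log(v)-vs-t does.

    A controlled ramp is roughly linear in v; a passive RC discharge is
    linear in log(v). Compare normalized residuals of the two fits.
    """
    linear_fit = _rel_residual(t, v)
    exp_fit = _rel_residual(t, [math.log(x) for x in v])
    return "ramp" if linear_fit < exp_fit else "exponential"
```

On a scope capture the same comparison would run over just the falling edge; the point is only that the two hypotheses are cheaply distinguishable once samples are in hand.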
Manufacturing Containment And Defect Prevalence
- Historical logs showed the issue had occurred before at least one sled shipped, and Oxide added manufacturing test checks to avoid shipping additional affected sleds.
- The added manufacturing test check detected two to three affected sleds in a rack about to ship.
- The incidence of the issue in manufacturing was estimated at roughly 3% to 5% of sleds.
- The issue appeared only in full rack context and not on the bench, and affected sleds were being set aside in manufacturing as a growing quarantined inventory pool.
- If a sled is going to exhibit the issue, it appears to do so in manufacturing rather than emerging later in the field.
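The ~3% to 5% incidence figure presumably rests on a small count of affected sleds over a modest build volume, which is exactly the uncertainty flagged in the Unknowns below. A binomial confidence interval makes the width of that uncertainty concrete; the counts here (3 affected of 80 built) are invented for illustration:

```python
import math


def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half


# With hypothetical counts of 3 affected sleds out of 80 built, the 95%
# interval spans roughly 1% to 10%: consistent with "3% to 5%" but far
# from pinned down at these sample sizes.
lo, hi = wilson_interval(3, 80)
```

The takeaway is that at fleet sizes like these, "3% to 5%" and "1% to 10%" are statistically the same claim.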
Environmental Dependency And Measurement Friction
- Reproducing and measuring the event required intrusive scope probing and repeated multi-day runs due to the sled’s cableless design and long time-to-failure.
- The problem did not reproduce on a bench setup but did reproduce when installed in a rack environment after running for days.
- Operating the part at 54.5V may make the issue less likely than operating closer to 50V, implying input voltage affects failure incidence.
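The setpoint observation above is a margin argument: if the aux-supply problem injects error into the IBC's input-voltage sensing, a higher bus setpoint leaves more headroom before a worst-case error crosses the trip point. A back-of-envelope sketch, where the trip threshold and sensing error are invented numbers (only the 50V and 54.5V setpoints come from the notes):

```python
# Illustrative only: UV_THRESHOLD and SENSE_ERROR are assumed values, not
# figures from the PCN or the part's datasheet.
UV_THRESHOLD = 47.0  # assumed IBC undervoltage trip point (volts)
SENSE_ERROR = 2.5    # assumed worst-case sensing error from the aux-supply
                     # margin problem (volts)


def false_trip_margin(setpoint):
    """Volts of headroom before a worst-case sensing error would make a
    healthy input look like it crossed the trip point."""
    return (setpoint - SENSE_ERROR) - UV_THRESHOLD
```

Under these assumed numbers, `false_trip_margin(54.5)` is 5.0 V while `false_trip_margin(50.0)` is only 0.5 V, which would be consistent with the issue appearing more readily near 50V.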
Operational Readthrough, Serviceability, And Resilience Roadmap
- The team wants to implement an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.
- A manufacturing mishap showed that if the heatsink is not installed on this part, the sled overheats.
- At a customer site, the team needed software support to reconfigure around the affected sled and safely remove it because a distributed system was running on it.
- The illumos hot-plug framework could be used to offline and re-online the affected devices to recover, but the millisecond-scale event is far faster than the framework expects, making mitigation non-trivial.
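One shape such an OS-level resilience change could take is an offline/online recovery cycle around the affected devices. The sketch below models only the control flow; the injected callables stand in for whatever the hot-plug framework actually provides on illumos, and every name and parameter here is an assumption for illustration:

```python
import time


def recover_device(offline, online, is_healthy, attempts=3, base_delay=0.01):
    """Cycle a device offline and back online until it reports healthy.

    Returns True on recovery, False once all attempts are exhausted.
    offline/online/is_healthy are injected callables (in a real system they
    might wrap hot-plug framework operations); exponential backoff between
    cycles gives the hardware time to settle after the transient.
    """
    for attempt in range(attempts):
        offline()
        online()
        if is_healthy():
            return True
        time.sleep(base_delay * (2 ** attempt))
    return False
```

Injecting the three operations keeps the retry policy testable without real hardware, which matters for a path that, per the notes, only fires under a millisecond-scale rack-level event.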
Watchlist
- The team wants to implement an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.
Unknowns
- What are the underlying counts, observation windows, and confidence intervals behind the stated manufacturing incidence rate and the reported post-mitigation drop to zero failures?
- Which specific rack-level variables (power distribution, supply voltage setpoint, airflow/thermal profile, EMI, load pattern) are necessary and sufficient for reproduction?
- What exact internal IBC sensing path/auxiliary supply margin issue is described in the PCN, and how does it map to the observed VN undervoltage reporting with stable measured inputs?
- How does disabling the IBC undervoltage detector affect protection coverage across all credible fault scenarios (true undervoltage, transient dips, upstream faults), and what validation was performed?
- How will Oxide reliably identify fixed versus unfixed part revisions in manufacturing and at runtime to prevent unnecessary software branching and reduce operational errors in mixed fleets?