Rosa Del Mar

Daily Brief

Issue 77 2026-03-18

Vendor Opacity, Supplier Dynamics, and Fix Path

Issue 77 • 2026-03-18 • 8 min read
General
Sources: 1 • Confidence: Medium • Updated: 2026-04-11 19:45

Key takeaways

  • Flex initially claimed they had never seen the problem in the field.
  • MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
  • Historical logs showed that at least one sled that shipped had already exhibited the issue before it left manufacturing, and Oxide added manufacturing test checks to avoid shipping additional affected sleds.
  • Reproducing and measuring the event required intrusive scope probing and repeated multi-day runs due to the sled’s cableless design and long time-to-failure.
  • The team wants to implement an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.

Sections

Vendor Opacity, Supplier Dynamics, and Fix Path

  • Flex initially claimed they had never seen the problem in the field.
  • Oxide argued the symptom is subtle enough that typical field monitoring would likely miss it.
  • To reduce software complexity, Oxide needs to avoid applying the prior mitigation to the fixed part revision, even though accidentally applying the mitigation to a fixed part is considered safe (see the sketch after this list).
  • Oxide could not obtain vendor-provided debug firmware or equivalent introspection from Flex despite documentation hints, limiting visibility into IBC internal behavior.
  • Flex issued a PCN stating that the IBC’s auxiliary power circuit had insufficient margin, so certain combinations of component tolerances could cause the IBC to falsely sense low input voltage and shut down, matching the observed behavior.
  • Flex’s mitigation was to program undocumented register values that effectively disable the IBC undervoltage detector, relying instead on upstream hot-swap undervoltage protection; this appeared to eliminate the droop.
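
A minimal sketch in Rust of the revision-gating point above; the revision names and the should_apply_uv_mitigation routine are hypothetical, not from the source. The idea is only that the register workaround gets applied where the part revision calls for it, with the conservative fallback of applying it when the revision cannot be determined (reported safe on fixed parts).

    /// Hypothetical IBC part revisions; the real revision identifiers are not public.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum IbcRevision {
        /// Revision affected by the false undervoltage shutdown.
        Unfixed,
        /// Revision carrying the PCN fix.
        Fixed,
        /// Revision could not be determined.
        Unknown,
    }

    /// Decide whether to program the (undocumented) workaround registers.
    /// Applying the mitigation to a fixed part is reported to be safe, so the
    /// conservative choice for an unknown revision is to apply it anyway; the
    /// cost is extra software complexity, not a protection gap.
    fn should_apply_uv_mitigation(rev: IbcRevision) -> bool {
        match rev {
            IbcRevision::Unfixed | IbcRevision::Unknown => true,
            IbcRevision::Fixed => false,
        }
    }

    fn main() {
        for rev in [IbcRevision::Unfixed, IbcRevision::Fixed, IbcRevision::Unknown] {
            println!("{rev:?}: apply mitigation = {}", should_apply_uv_mitigation(rev));
        }
    }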

Failure Chain From Transient Power Event To Storage Panic

  • MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
  • During the droop to about 8V, most downstream regulators and the CPU remained operational, while the drives and the 12V-generation path were the first to fail because of differing tolerances.
  • The observed 12V rail droop is short (about 12 milliseconds) and resembles a controlled ramp, implying the controller deliberately ramped its output down rather than the rail passively discharging (a detection sketch follows this list).
  • The initial operational symptom was one sled showing much lower uptime than its peers, suggesting unexpected resets or panics.
  • Crash dumps indicated ZFS panicked after being blocked on I/O to the pool for minutes.
  • Dump analysis suggested the pool hang was not a ZFS deadlock but an underlying device-layer failure where device state appeared uninitialized.
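
A minimal sketch in Rust of how the droop signature could be flagged from hot-swap controller telemetry; the sample format, threshold, and synthetic data are assumptions, not the MAX5970 register interface. It scans a series of 12V-rail readings for an excursion below a threshold and reports its depth and duration, which is the shape of check a telemetry- or log-based screen would perform.

    /// One hypothetical telemetry sample: time in milliseconds and the
    /// measured 12V-rail voltage in volts.
    struct Sample {
        t_ms: f64,
        volts: f64,
    }

    /// A detected droop: when it started, how long it lasted, how low it went.
    struct Droop {
        start_ms: f64,
        duration_ms: f64,
        min_volts: f64,
    }

    /// Scan samples for an excursion below `threshold` volts. The event in the
    /// brief would show up as a dip to about 8V lasting on the order of 12 ms.
    fn find_droop(samples: &[Sample], threshold: f64) -> Option<Droop> {
        let mut start: Option<f64> = None;
        let mut last_below = 0.0;
        let mut min_volts = f64::INFINITY;

        for s in samples {
            if s.volts < threshold {
                start.get_or_insert(s.t_ms);
                last_below = s.t_ms;
                min_volts = min_volts.min(s.volts);
            }
        }

        start.map(|start_ms| Droop {
            start_ms,
            duration_ms: last_below - start_ms,
            min_volts,
        })
    }

    fn main() {
        // Synthetic data resembling the reported event: a controlled ramp from
        // 12V down toward 8V and back, one sample per millisecond.
        let samples: Vec<Sample> = (0..40)
            .map(|i| {
                let t_ms = i as f64;
                let volts = match t_ms {
                    t if (10.0..16.0).contains(&t) => 12.0 - (t - 10.0) * 0.65,
                    t if (16.0..22.0).contains(&t) => 8.1 + (t - 16.0) * 0.65,
                    _ => 12.0,
                };
                Sample { t_ms, volts }
            })
            .collect();

        if let Some(d) = find_droop(&samples, 11.0) {
            println!(
                "droop starting at t = {} ms: {} ms below threshold, minimum {:.2} V",
                d.start_ms, d.duration_ms, d.min_volts
            );
        }
    }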

Manufacturing Containment and Defect Prevalence

  • Historical logs showed that at least one sled that shipped had already exhibited the issue before it left manufacturing, and Oxide added manufacturing test checks to avoid shipping additional affected sleds.
  • The added manufacturing test check detected two to three affected sleds in a rack about to ship.
  • The incidence of the issue in manufacturing was estimated at roughly 3% to 5% of sleds (see the statistics sketch after this list).
  • The issue appeared only in full rack context and not on the bench, and affected sleds were being set aside in manufacturing as a growing quarantined inventory pool.
  • If a sled is going to exhibit the issue, it appears to do so in manufacturing rather than emerging later in the field.
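
The counts behind the roughly 3% to 5% incidence estimate are not given in the source; the Rust sketch below uses purely hypothetical counts to show how a point estimate and a 95% Wilson score interval would be computed, and how wide the interval stays at small build volumes.

    /// 95% Wilson score interval for a binomial proportion: returns the point
    /// estimate and the lower/upper bounds for `k` affected out of `n` built.
    fn wilson_95(k: u32, n: u32) -> (f64, f64, f64) {
        let z = 1.96_f64; // two-sided 95% normal quantile
        let n = n as f64;
        let p = k as f64 / n;
        let denom = 1.0 + z * z / n;
        let center = (p + z * z / (2.0 * n)) / denom;
        let half = z * (p * (1.0 - p) / n + z * z / (4.0 * n * n)).sqrt() / denom;
        (p, center - half, center + half)
    }

    fn main() {
        // Hypothetical counts only: 2 affected out of 50 built is a 4% point
        // estimate, but the interval around it spans several percent.
        for (k, n) in [(2u32, 50u32), (8, 200), (40, 1000)] {
            let (p, lo, hi) = wilson_95(k, n);
            println!(
                "{k}/{n}: point estimate {:.1}%, 95% interval {:.1}% to {:.1}%",
                p * 100.0,
                lo * 100.0,
                hi * 100.0
            );
        }
    }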

Environmental Dependency and Measurement Friction

  • Reproducing and measuring the event required intrusive scope probing and repeated multi-day runs due to the sled’s cableless design and long time-to-failure.
  • The problem did not reproduce on a bench setup, but it did reproduce once the sled was installed in a rack environment and had been running for days.
  • Operating the part at 54.5V may make the issue less likely than operating closer to 50V, implying input voltage affects failure incidence.

Operational Readthrough: Serviceability and Resilience Roadmap

  • The team wants to implement an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.
  • A manufacturing mishap showed that if the heatsink is not installed on the hot part, the sled will overheat.
  • At a customer site, the team needed software support to reconfigure around the affected sled and safely remove it because a distributed system was running on it.
  • The illumos hot-plug framework could be used to offline and re-online the affected devices to recover, but the millisecond-scale event is faster than the framework expects, which makes mitigation non-trivial.
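
A purely conceptual Rust sketch of the offline/re-online recovery idea; it is not the illumos hot-plug API, and DeviceState, offline, and online below are hypothetical stand-ins. It only illustrates the bounded retry flow an OS-level resilience change might take; as noted above, the hard part is that the triggering event is over in milliseconds, far faster than the hot-plug machinery expects.

    use std::{thread, time::Duration};

    /// Hypothetical view of a device's state; the real state lives in the OS.
    #[derive(Debug, PartialEq, Eq)]
    enum DeviceState {
        Online,
        /// Present, but its state looks uninitialized, as seen in the crash dumps.
        Uninitialized,
    }

    /// Hypothetical stand-ins for OS-level offline/online operations.
    fn offline(slot: u32) {
        println!("offlining device in slot {slot}");
    }

    fn online(slot: u32) -> DeviceState {
        println!("re-onlining device in slot {slot}");
        // Assume the power event has passed and the device comes back cleanly.
        DeviceState::Online
    }

    /// Cycle a wedged device offline and back online, bounded by a retry budget
    /// so a genuinely dead device is surfaced to an operator instead of being
    /// retried forever.
    fn recover(slot: u32, mut state: DeviceState, max_attempts: u32) -> Result<(), String> {
        let mut attempts = 0;
        while state == DeviceState::Uninitialized {
            if attempts == max_attempts {
                return Err(format!("slot {slot}: no recovery after {attempts} attempts"));
            }
            attempts += 1;
            offline(slot);
            thread::sleep(Duration::from_millis(100)); // settle time; arbitrary
            state = online(slot);
        }
        Ok(())
    }

    fn main() {
        match recover(7, DeviceState::Uninitialized, 3) {
            Ok(()) => println!("slot 7 back online"),
            Err(e) => eprintln!("{e}"),
        }
    }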

Watchlist

  • The team wants to implement an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.

Unknowns

  • What are the underlying counts, observation windows, and confidence intervals behind the stated manufacturing incidence rate and the reported post-mitigation drop to zero failures?
  • Which specific rack-level variables (power distribution, supply voltage setpoint, airflow/thermal profile, EMI, load pattern) are necessary and sufficient for reproduction?
  • What exact internal IBC sensing path or auxiliary supply margin issue is described in the PCN, and how does it map to the observed VIN undervoltage reporting despite stable measured inputs?
  • How does disabling the IBC undervoltage detector affect protection coverage across all credible fault scenarios (true undervoltage, transient dips, upstream faults), and what validation was performed?
  • How will Oxide reliably identify fixed versus unfixed part revisions in manufacturing and at runtime to prevent unnecessary software branching and reduce operational errors in mixed fleets?

Investor overlay

Read-throughs

  • Supplier component revision issues can force customers into software workarounds that alter protection behavior, increasing qualification burden and mixed fleet operational complexity.
  • Low initial failure counts may delay supplier engagement, extending time to root cause and raising debug cost when reproduction requires full system context and long runs.
  • Adding manufacturing screening based on telemetry logs can materially reduce escapes for intermittent, power-related defects, but may introduce throughput and cycle-time impacts during ramp.

What would confirm

  • Public or customer disclosures describing a specific component revision fix and a documented workaround that requires nonstandard register programming or protection setting changes.
  • Manufacturing or field reports that screening based on logs or targeted tests reduced failures, including denominator data, observation window, and confidence intervals.
  • Evidence that reproduction depends on rack-level variables such as supply setpoint, thermal profile, EMI, or load pattern, and that long-duration, environment-accurate tests are required.

What would kill

  • Independent validation that the reported undervoltage events were measurement artifacts and not linked to component sensing margin or protection settings.
  • Data showing incidence was negligible or that the post-mitigation zero-failure claim is not statistically meaningful once denominators and observation windows are provided.
  • Demonstration that the workaround does not change protection coverage and that mixed-revision identification is solved cleanly, removing meaningful operational complexity.

Sources

  1. 2026-03-18 share.transistor.fm