Vendor Relationship Friction And Observability Dispute
Sources: 1 • Confidence: Medium • Updated: 2026-03-25 17:55
Key takeaways
- Flex initially claimed they had never seen the problem in the field, and Oxide argued the symptom could be subtle enough to evade typical field monitoring.
- On Gimlet, drive reset behavior is mediated by the sharkfin board and MAX5970 hot-plug controller circuitry rather than directly by the AMD platform PERST signal.
- MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
- Oxide needs to avoid applying the prior mitigation to the fixed part revision to keep software complexity down, although applying it to the fixed revision is considered safe if done accidentally.
- Historical logs showed the issue had occurred before at least one sled shipped, leading Oxide to add manufacturing-test checks to avoid shipping additional affected sleds.
Sections
Vendor Relationship Friction And Observability Dispute
- Flex initially claimed they had never seen the problem in the field, and Oxide argued the symptom could be subtle enough to evade typical field monitoring.
- Oxide could not obtain vendor-provided debug firmware or equivalent introspection for the intermediate bus converter, limiting visibility into its internal software behavior.
- The supplier initially dismissed Oxide’s defect reports and eventually stopped responding to emails.
- Oxide attributed early supplier dismissal to the supplier having produced thousands of parts while Oxide initially only had about three failing units, making it easier to label as anomalous.
- A Flex contact provided a raw output mode exposing internal ADC readings, but it did not resolve the issue.
- The supplier later recognized a root cause found elsewhere resembled Oxide’s issue and communicated that information to Oxide.
Fault-Domain Narrowing From Zfs To Device Resets
- On Gimlet, drive reset behavior is mediated by the sharkfin board and MAX5970 hot-plug controller circuitry rather than directly by the AMD platform PERST signal.
- Dump analysis suggested the pool hang was not a ZFS deadlock but an underlying device-layer failure where device state appeared uninitialized.
- Investigators concluded all ten front drives experienced an unexpected PCIe fundamental reset event during runtime.
- The droop triggered a simultaneous PCI reset across all ten drives and generated a hot-plug-like interrupt indicating the data link layer went down.
- Oxide treated an unexpected runtime PCIe reset that effectively drops all ten disk drives at once as abnormal for their system.
Power Transient As Proximate Cause Of Storage Collapse
- MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
- Most downstream regulators and the CPU remained operational at an 8V input because of wide input tolerance, while drives and the 12V-generation path failed first.
- The observed rail droop event was about 12 milliseconds and resembled a controlled ramp, implying a controller purposefully reduced output rather than passive discharge.
- Oxide ruled out noise or instability on relevant external inputs and observed the intermediate bus converter output still drooped despite stable inputs, indicating the issue was within the converter.
- The intermediate bus converter reported VN undervoltage events even though measured voltage at the converter pins and upstream nodes remained stable.
Vendor-Stated Root Cause And Mitigations, Plus Revision-Management Constraints
- Oxide needs to avoid applying the prior mitigation to the fixed part revision to keep software complexity down, although applying it to the fixed revision is considered safe if done accidentally.
- Flex issued a product change notice stating the auxiliary power circuit had insufficient margin and that component tolerance combinations could cause the converter to falsely sense low input voltage and shut down.
- Flex’s mitigation involved programming undocumented register values to effectively disable the converter’s undervoltage detector and rely on upstream undervoltage protection in the hot-swap controller, and this appeared to eliminate the droop.
- As Cosmo scaled, the issue appeared at roughly the same rate as on Gimlet; an accidental shipment without the mitigation failed at the expected rate; after mitigation deployment, failures dropped to zero.
- A fixed revision of the part exists and includes significant rework of the relevant input circuitry involving roughly a dozen components.
Manufacturing Containment And Prevalence
- Historical logs showed the issue had occurred before at least one sled shipped, leading Oxide to add manufacturing-test checks to avoid shipping additional affected sleds.
- A new manufacturing test check detected two to three affected sleds in a rack that was about to ship.
- The incidence of the issue in manufacturing was estimated at roughly 3% to 5% of sleds.
- If a sled is going to exhibit the issue, it appears to do so in manufacturing rather than emerging later in the field.
Watchlist
- Oxide planned an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.
Unknowns
- What specific internal converter condition or mechanism produces the controlled-ramp droop while external pins and upstream nodes measure stable voltage?
- What is the true duration, shape, and recurrence distribution of the droop event across affected units under representative rack conditions?
- How accurate is the estimated 3% to 5% manufacturing incidence, and does it vary by lot, date code, or operating voltage?
- Does the mitigation of disabling undervoltage detection introduce any safety or reliability risks under genuine undervoltage conditions, and how is that validated end-to-end with upstream protection?
- What is the operational plan and reliability impact for mixed populations of converter revisions, including detection of revision and correct application or non-application of mitigations?