Vendor Relationship Friction And Observability Dispute

Issue 77 Edition 2026-03-18 8 min read

General

Sources: 1 • Confidence: Medium • Updated: 2026-03-25 17:55

Key takeaways

Flex initially claimed they had never seen the problem in the field, and Oxide argued the symptom could be subtle enough to evade typical field monitoring.
On Gimlet, drive reset behavior is mediated by the sharkfin board and MAX5970 hot-plug controller circuitry rather than directly by the AMD platform PERST signal.
MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
Oxide needs to avoid applying the prior mitigation to the fixed part revision to keep software complexity down, although applying it to the fixed revision is considered safe if done accidentally.
Historical logs showed the issue had occurred before at least one sled shipped, leading Oxide to add manufacturing-test checks to avoid shipping additional affected sleds.

Sections

Vendor Relationship Friction And Observability Dispute

Flex initially claimed they had never seen the problem in the field, and Oxide argued the symptom could be subtle enough to evade typical field monitoring.
Oxide could not obtain vendor-provided debug firmware or equivalent introspection for the intermediate bus converter, limiting visibility into its internal software behavior.
The supplier initially dismissed Oxide’s defect reports and eventually stopped responding to emails.
Oxide attributed early supplier dismissal to the supplier having produced thousands of parts while Oxide initially only had about three failing units, making it easier to label as anomalous.
A Flex contact provided a raw output mode exposing internal ADC readings, but it did not resolve the issue.
The supplier later recognized a root cause found elsewhere resembled Oxide’s issue and communicated that information to Oxide.

Fault-Domain Narrowing From Zfs To Device Resets

On Gimlet, drive reset behavior is mediated by the sharkfin board and MAX5970 hot-plug controller circuitry rather than directly by the AMD platform PERST signal.
Dump analysis suggested the pool hang was not a ZFS deadlock but an underlying device-layer failure where device state appeared uninitialized.
Investigators concluded all ten front drives experienced an unexpected PCIe fundamental reset event during runtime.
The droop triggered a simultaneous PCI reset across all ten drives and generated a hot-plug-like interrupt indicating the data link layer went down.
Oxide treated an unexpected runtime PCIe reset that effectively drops all ten disk drives at once as abnormal for their system.

Power Transient As Proximate Cause Of Storage Collapse

MAX5970 telemetry showed the 12V rail minimum dipped to about 8V during failure conditions.
Most downstream regulators and the CPU remained operational at an 8V input because of wide input tolerance, while drives and the 12V-generation path failed first.
The observed rail droop event was about 12 milliseconds and resembled a controlled ramp, implying a controller purposefully reduced output rather than passive discharge.
Oxide ruled out noise or instability on relevant external inputs and observed the intermediate bus converter output still drooped despite stable inputs, indicating the issue was within the converter.
The intermediate bus converter reported VN undervoltage events even though measured voltage at the converter pins and upstream nodes remained stable.

Vendor-Stated Root Cause And Mitigations, Plus Revision-Management Constraints

Oxide needs to avoid applying the prior mitigation to the fixed part revision to keep software complexity down, although applying it to the fixed revision is considered safe if done accidentally.
Flex issued a product change notice stating the auxiliary power circuit had insufficient margin and that component tolerance combinations could cause the converter to falsely sense low input voltage and shut down.
Flex’s mitigation involved programming undocumented register values to effectively disable the converter’s undervoltage detector and rely on upstream undervoltage protection in the hot-swap controller, and this appeared to eliminate the droop.
As Cosmo scaled, the issue appeared at roughly the same rate as on Gimlet; an accidental shipment without the mitigation failed at the expected rate; after mitigation deployment, failures dropped to zero.
A fixed revision of the part exists and includes significant rework of the relevant input circuitry involving roughly a dozen components.

Manufacturing Containment And Prevalence

Historical logs showed the issue had occurred before at least one sled shipped, leading Oxide to add manufacturing-test checks to avoid shipping additional affected sleds.
A new manufacturing test check detected two to three affected sleds in a rack that was about to ship.
The incidence of the issue in manufacturing was estimated at roughly 3% to 5% of sleds.
If a sled is going to exhibit the issue, it appears to do so in manufacturing rather than emerging later in the field.

Watchlist

Oxide planned an operating-system-level resilience change to help work around similar issues even though a hardware fix exists.

Unknowns

What specific internal converter condition or mechanism produces the controlled-ramp droop while external pins and upstream nodes measure stable voltage?
What is the true duration, shape, and recurrence distribution of the droop event across affected units under representative rack conditions?
How accurate is the estimated 3% to 5% manufacturing incidence, and does it vary by lot, date code, or operating voltage?
Does the mitigation of disabling undervoltage detection introduce any safety or reliability risks under genuine undervoltage conditions, and how is that validated end-to-end with upstream protection?
What is the operational plan and reliability impact for mixed populations of converter revisions, including detection of revision and correct application or non-application of mitigations?

Investor overlay

Read-throughs

Supplier support opacity and deteriorating responsiveness can extend time to root cause and raise integration costs, increasing delivery and reliability risk for hardware dependent on vendor debug access.
A millisecond scale 12V droop causing simultaneous drive resets suggests a systemic power integrity vulnerability that can produce correlated storage failures, potentially affecting uptime metrics and support burden until fully contained.
Mixed converter revisions plus a software mitigation that should not be applied to fixed units implies configuration management complexity, raising risk of operational errors and ongoing engineering overhead.

What would confirm

Clear vendor communication that provides debug access or definitive root cause details, plus improved responsiveness, aligning with faster closure and fewer recurring incidents.
Measured prevalence data by lot and date code and rack conditions, showing whether the 3% to 5% incidence is accurate and whether manufacturing gating reduces escapes to near zero.
Documented operating plan for mixed revision fleets, including reliable revision detection and mitigation application rules, with evidence of reduced drive reset events and stable storage behavior.

What would kill

Evidence that disabling undervoltage detection introduces safety or reliability regressions under true undervoltage conditions, undermining the mitigation approach.
Droop events persist on the fixed hardware revision or occur broadly outside the suspected population, indicating the PCN explanation is incomplete or incorrect.
Manufacturing test gates fail to prevent shipment of affected units or field incidence rises over time, contradicting the claim that manifestation is primarily manufacturing limited.

Sources

When Nine Nines Isn't Enough

2026-03-18 share.transistor.fm