On the evening of May 17, our production storage cluster experienced a near-simultaneous failure of 14 storage devices distributed across multiple physical hosts. The cluster, which uses Ceph (an industry-standard distributed storage system) to provide redundant storage for customer workloads, was unable to keep certain data accessible during the failure event.
The cluster has been fully recovered. The vast majority of customer workloads resumed normal operation. A small portion of data was unrecoverable due to the specific nature of the software defect; affected customers are being contacted directly.
Our investigation identified intermittent connections at multiple drive bays, across more than one of our storage servers (chassis), that were silently dropping write operations. A storage device in an affected bay would accept a write command, report it as successful, and then fail to actually persist the data. This failure occurs at the SAS communication layer between the drive bay and the storage controller, a category of failure that standard drive health monitoring (SMART) does not detect. We have so far confirmed this fault on two separate servers (drive bay 6 on host hv30, and drive bay 2 on host hv31) and are auditing our remaining storage servers for the same condition.
When a device lost writes in this manner, the storage system's internal metadata became inconsistent with what was actually on disk. These silent write losses, which occurred on more than one server, were the triggering events for the incident.
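To illustrate the kind of lost write described above, the following is a minimal sketch of a write-then-read-back check. It is not the tooling we used during the incident; the function names `write_block` and `verify_block` are hypothetical and exist only for this example. It also assumes the read actually reaches the device: on a real block device the read could be served from the operating system's page cache, so a practical test would need to bypass the cache (for example with direct I/O) before the comparison means anything.

```python
import hashlib
import os


def write_block(fd, offset, payload):
    """Write payload at offset, flush it, and return its SHA-256 digest."""
    written = os.pwrite(fd, payload, offset)
    if written != len(payload):
        raise OSError(f"short write: {written} of {len(payload)} bytes")
    os.fsync(fd)  # ask the kernel to push the data down to the device
    return hashlib.sha256(payload).hexdigest()


def verify_block(fd, offset, length, expected_digest):
    """Read the block back and report whether it matches what was written."""
    data = os.pread(fd, length, offset)
    return hashlib.sha256(data).hexdigest() == expected_digest
```

A device that acknowledged the write but did not persist it would cause `verify_block` to return `False` once the read is forced to come from the media rather than from memory.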
We were running Ceph storage software version 18.2.7, which is documented by the Ceph project as containing the fix for a known BlueFS subsystem defect (upstream Tracker Issue #70747, "BlueFS race condition between truncate() and unlink()"; reference: https://tracker.ceph.com/issues/70747). Despite running the supposedly patched version, our cluster's storage daemons crashed with the very assertion this fix was intended to prevent (ceph_assert(cut_off == p->length) in BlueFS::truncate).
The failure caused three storage placement groups (small partitions of the storage pool) to lose redundancy. Of those:
One placement group was recovered using an incomplete secondary copy. A small number of recently written data objects could not be recovered from this placement group.
Two placement groups had no recoverable copies due to the cascading nature of the failures, combined with the Ceph defect preventing data extraction.
The full recovery effort spanned approximately 36 hours and included:
Immediate stabilization to prevent further data loss
Capacity relief operations to allow the cluster to heal
Multiple attempts at OSD-level recovery using offline tools, including the --bluefs-replay-recovery flag (which is the documented Ceph workaround for this class of bug)
Direct upstream Ceph code review to attempt alternate recovery paths
Acceptance of bounded data loss after all recovery options were exhausted
Restoration of cluster health and customer service availability
We are individually contacting customers whose workloads may have been affected to provide specific remediation guidance
We have taken the affected drive bays out of service and are inspecting/repairing their backplane connections (the drives themselves tested healthy and have been redeployed to known-good locations); we are auditing the backplane connections on our remaining storage servers for the same fault
We are implementing kernel-level I/O error monitoring on all hosts (a category of warning that we now know SMART self-diagnostics will not surface)
We are validating each surviving drive with direct I/O testing in addition to SMART checks before redeploying it as a storage device
We are actively recreating the failed storage devices as fresh OSDs (most have already been recreated)
We are monitoring the cluster for any signs of recurrence
We will be upgrading from Ceph 18.2.7 to a newer version, where this defect class has been more comprehensively addressed
We will be implementing enhanced monitoring for the early-warning indicators of this bug (specifically, BlueFS slow-operation warnings, kernel SCSI/SAS layer errors, and lost-write detection)
We will be reviewing drive health validation procedures to ensure that SMART blind spots (transport-layer failures, lost async writes) are caught by complementary monitoring
We are formally tracking this incident in our internal postmortem process
We are reviewing our storage architecture to provide additional resilience against software defects of this nature (separate WAL/DB devices, increased redundancy options)
We are reviewing our backup coverage with customers to ensure rapid recovery options are available going forward
We are evaluating contingency plans for similar incidents, both hardware and software-driven
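As an illustration of the kernel-level I/O error monitoring listed above, the following is a minimal sketch that counts transport-layer error messages per device from kernel log lines. It is a simplified example, not our production monitoring. The message fragments in `PATTERNS` are assumptions about how typical Linux kernels word these errors and would need adjusting to the kernel version and storage controller driver in use; the function name `count_transport_errors` is hypothetical.

```python
import re
from collections import Counter

# Assumed message fragments; adjust to the kernel and HBA driver in use.
PATTERNS = {
    "lost_async_write": re.compile(
        r"Buffer I/O error on dev (?P<dev>[\w-]+).*lost async page write"
    ),
    "block_io_error": re.compile(r"I/O error, dev (?P<dev>[\w-]+)"),
    "scsi_cmd_failed": re.compile(r"\[(?P<dev>sd[a-z]+)\].*FAILED Result:"),
}


def count_transport_errors(lines):
    """Return a Counter keyed by (device, error kind) for matching lines."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group("dev"), kind)] += 1
                break  # count each log line at most once
    return counts
```

The function accepts any iterable of text lines, so it could be fed the output of a kernel log reader. An alert could then be raised whenever a device's count is non-zero, since these messages do not appear in SMART self-diagnostics.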
We deeply regret the impact this incident has had on your operations. We had engineered our infrastructure with multi-copy redundancy specifically to survive single hardware failures, but the combination of silent hardware faults at multiple drive bays (across more than one server) and an incompletely fixed Ceph software defect produced a failure mode our redundancy could not isolate. We are implementing both hardware-monitoring and software-recovery improvements so that we are better protected against this class of event going forward.
Sincerely,
Matthew Midgett
Trick Solutions
Incident Reference: TS-2026-0517-01
Date of Incident: May 17, 2026 – May 18, 2026
Date of Notification: May 25, 2026