FiDe: Reliable and Fast Crash Failure Detection to Boost Datacenter Coordination

  • Davide Rovelli
  • Pavel Chuprikov
  • Philipp Berdesinski
  • Ali Pahlevan
  • Patrick Jahnke
  • Patrick Eugster

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Failure detection is one of the most fundamental primitives on which distributed fault-tolerant services and applications rely to achieve liveness. Typical crash failure detectors resort to timeouts that must account for the unpredictability of interaction times among remote processes, caused by resource contention in the network and in end-host processors. While modern (gray) failure detectors have improved at detecting a wide range of failures, the problem of prohibitively large and unreliable timeouts for crash failures persists, hampering the performance of both the failure detectors themselves and the modern µs-scale services sitting on top of them. We propose a novel, fully reliable failure detector (FiDe) that can report the crash of a remote process in a datacenter in less than 30 µs (7.2× faster than the current state of the art) with extremely high reliability, thanks to a ground-up design that provides stable end-to-end process interactions. By reliably lowering worst-case crash detection time, FiDe enables a class of algorithms that can be used to boost coordination services even in the absence of failures. We devise two novel, highly efficient FiDe-based consensus protocols and integrate them into a key-value store and a synchronization service, improving throughput by up to 2.23× and reducing latency to as low as 0.46×.

Original language: English
Title of host publication: Proceedings of the 2025 USENIX Annual Technical Conference, ATC 2025
Publisher: USENIX Association
Pages: 765-788
Number of pages: 24
ISBN (Electronic): 9781939133489
Publication status: Published - 1 Jan 2025
Event: 2025 USENIX Annual Technical Conference, ATC 2025 - Boston, United States
Duration: 7 Jul 2025 to 9 Jul 2025

Publication series

Name: Proceedings of the 2025 USENIX Annual Technical Conference, ATC 2025

Conference

Conference: 2025 USENIX Annual Technical Conference, ATC 2025
Country/Territory: United States
City: Boston
Period: 7/07/25 to 9/07/25
