TY - GEN
T1 - FiDe
T2 - 2025 USENIX Annual Technical Conference, ATC 2025
AU - Rovelli, Davide
AU - Chuprikov, Pavel
AU - Berdesinski, Philipp
AU - Pahlevan, Ali
AU - Jahnke, Patrick
AU - Eugster, Patrick
N1 - Publisher Copyright: © 2025 by The USENIX Association. All rights reserved.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Failure detection is one of the most fundamental primitives on which distributed fault-tolerant services and applications rely to achieve liveness. Typical crash failure detectors resort to timeouts that must account for the unpredictability of interaction times among remote processes, caused by resource contention in the network and in end-host processors. While modern (gray) failure detectors have improved at detecting a wide range of failures, the problem of prohibitively large and unreliable timeouts for crash failures persists, hampering the performance of both the failure detectors themselves and the modern µs-scale services built on top of them. We propose a novel, fully reliable failure detector (FiDe) that can report the crash of a remote process in a datacenter in less than 30 µs (7.2× faster than the current state of the art) with extremely high reliability, thanks to a ground-up design that provides stable end-to-end process interactions. By reliably lowering worst-case crash detection time, FiDe enables a class of algorithms that can be used to boost coordination services even in the absence of failures. We devise two novel, FiDe-based, highly efficient consensus protocols and integrate them into a key-value store and a synchronization service, improving throughput by up to 2.23× and reducing latency down to 0.46×.
AB - Failure detection is one of the most fundamental primitives on which distributed fault-tolerant services and applications rely to achieve liveness. Typical crash failure detectors resort to timeouts that must account for the unpredictability of interaction times among remote processes, caused by resource contention in the network and in end-host processors. While modern (gray) failure detectors have improved at detecting a wide range of failures, the problem of prohibitively large and unreliable timeouts for crash failures persists, hampering the performance of both the failure detectors themselves and the modern µs-scale services built on top of them. We propose a novel, fully reliable failure detector (FiDe) that can report the crash of a remote process in a datacenter in less than 30 µs (7.2× faster than the current state of the art) with extremely high reliability, thanks to a ground-up design that provides stable end-to-end process interactions. By reliably lowering worst-case crash detection time, FiDe enables a class of algorithms that can be used to boost coordination services even in the absence of failures. We devise two novel, FiDe-based, highly efficient consensus protocols and integrate them into a key-value store and a synchronization service, improving throughput by up to 2.23× and reducing latency down to 0.46×.
UR - https://www.scopus.com/pages/publications/105011648610
M3 - Conference contribution
AN - SCOPUS:105011648610
T3 - Proceedings of the 2025 USENIX Annual Technical Conference, ATC 2025
SP - 765
EP - 788
BT - Proceedings of the 2025 USENIX Annual Technical Conference, ATC 2025
PB - USENIX Association
Y2 - 7 July 2025 through 9 July 2025
ER -