Users want predictable latency, but providing it is hard in large distributed datacenters. The authors draw a parallel between 'fault tolerance' (tolerating component failures without failing the request) and 'tail tolerance' (tolerating slow components without excessive request latency).
They identify causes of latency variability such as shared resources on a node, shared global resources (such as networks and locks), daemons running on nodes, maintenance activities (such as garbage collection), and queueing delays. Parallelization does not help: fanning a request out into parallel parts amplifies latency outliers, because the request always waits for the slowest straggler.
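The amplification effect is easy to quantify. A small sketch, assuming each server independently exceeds some latency threshold with probability p (the function name and the illustrative numbers below are my own, not from the summary):

```python
def prob_request_slow(p: float, n: int) -> float:
    """Probability that a request fanned out to n servers is slow,
    assuming each server is independently slow with probability p:
    the request is slow if ANY of the n servers is slow."""
    return 1 - (1 - p) ** n

# If 1 in 100 responses is slow, a single-server request is slow 1%
# of the time, but a 100-way fan-out request is slow ~63% of the time.
single = prob_request_slow(0.01, 1)
fanout = prob_request_slow(0.01, 100)
```

This is why techniques that merely make the *average* server faster are not enough at scale: the tail of the distribution dominates once fan-out is large.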
Several mitigations are proposed, including maintaining QoS via classes of service, and breaking very expensive requests into slices so they cannot excessively delay all other requests ('head-of-line blocking').
However, because some tail-latency events are inevitable, additional strategies are proposed: hedged requests (send a second request to another replica after a brief delay, and cancel whichever request is still outstanding once a response arrives) and tied requests (enqueue the request on two servers at once, each tagged with the identity of the other, so the server that begins execution first can cancel the backup almost immediately).
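A hedged request can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; `call_replica` and `replicas` are hypothetical stand-ins for an RPC call and a replica list:

```python
import concurrent.futures as cf

def hedged_request(call_replica, replicas, hedge_delay):
    """Send the request to the first replica; if no response arrives
    within hedge_delay seconds, hedge by sending a backup request to
    the second replica, then return whichever response comes first."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(call_replica, replicas[0])]
        done, _ = cf.wait(futures, timeout=hedge_delay)
        if not done:
            # Primary is slow: fire the backup request.
            futures.append(pool.submit(call_replica, replicas[1]))
        done, pending = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best-effort cancellation of the loser
        return done.pop().result()
```

A sensible hedge delay is one that keeps the extra load small, e.g. waiting roughly a high percentile of expected latency before hedging, so backups are sent only for requests already in the tail.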
Longer-term solutions include micro-partitioning (splitting work into many more partitions than there are machines) for fine-grained load balancing, selective (additional) replication of hot spots, excluding slow machines or placing them on probation, and canary requests, where a request targeted at thousands of machines is first tried on a smaller set to prevent correlated failures due to untested code paths.
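The micro-partitioning idea can be illustrated with a toy assignment. A sketch under my own assumptions (round-robin placement, illustrative counts); the point is that with many small partitions, moving one partition shifts only a tiny fraction of the load:

```python
def assign_partitions(num_partitions, machines):
    """Round-robin many small partitions across few machines, so
    rebalancing can move load in fine-grained increments."""
    assignment = {m: [] for m in machines}
    for p in range(num_partitions):
        assignment[machines[p % len(machines)]].append(p)
    return assignment

# 1000 partitions over 5 machines: each holds 200 partitions, and
# migrating a single partition shifts only 0.1% of the total load.
layout = assign_partitions(1000, ["m0", "m1", "m2", "m3", "m4"])
```

Contrast this with one partition per machine, where the only rebalancing move available shifts an entire machine's worth of work at once.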
Our Take: Practical, Insightful and Comprehensive