Sunday, August 18, 2013

Review: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Article: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Authors: Benjamin H. Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag
Published: April 2010

Dapper is Google's low-overhead distributed tracing system. Design goals are to be always turned on, to catch irreproducible behavior, and yet be low-overhead. It should also be transparent to the application level. The work on behalf of one request may span a whole tree of servers, and all of those requests must be captured.
Dapper is annotation-based: requests are annotated so any piece of work can be traced back to the originating user request. Each span carries a trace id shared by the whole request tree as well as a parent id, both random 64-bit identifiers. A unified mechanism allows most code to pass annotations along uniformly. Traces are sampled to reduce overhead, using adaptive sampling that still collects meaningful samples from low-traffic services.
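The trace-id/parent-id scheme described above can be sketched as follows. This is a minimal illustration, not Dapper's actual data model; the `Span` class and helper names are hypothetical, and only the id-propagation rule (shared trace id, parent id pointing at the caller's span) comes from the paper.

```python
import random
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Hypothetical Dapper-style span: the trace id is shared by the
    whole request tree; the parent id links a span to its caller."""
    trace_id: int
    span_id: int
    parent_id: Optional[int]
    annotations: list = field(default_factory=list)

def new_trace() -> Span:
    # Root span: fresh random 64-bit trace id, no parent.
    return Span(trace_id=random.getrandbits(64),
                span_id=random.getrandbits(64),
                parent_id=None)

def child_span(parent: Span) -> Span:
    # A child inherits the trace id; its parent id is the caller's span id.
    return Span(trace_id=parent.trace_id,
                span_id=random.getrandbits(64),
                parent_id=parent.span_id)

root = new_trace()
child = child_span(root)
assert child.trace_id == root.trace_id
assert child.parent_id == root.span_id
```

Because every server in the tree stamps its work with the same trace id, the collection pipeline can later reassemble the full request tree offline.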
Benefits gained from Dapper include performance analysis, identifying unnecessary requests and bottlenecks, ensuring correctness, and helping developers gain a better understanding of the system and its flow. It has also proven useful in testing and exception monitoring, and in debugging latency issues, service dependencies, and network usage. However, due to its request-based nature, Dapper does not handle batched workloads perfectly, as it is not always clear which originating request a batched piece of work belongs to.

Our take: While tracing is not perceived as sexy, it is certainly useful, and the Dapper paper was clear and thorough in describing Google's solution to this thorny problem.

Wednesday, August 14, 2013

Paper Review: The Tail at Scale by Jeffrey Dean and Luiz André Barroso


Users want predictable latency, but providing it is hard in distributed datacenters. The authors draw a parallel between 'fault tolerance' (tolerating vertex failure without request failure) and 'tail tolerance' (tolerating slow vertices without excessive request latency).
They identify sources of latency variability such as shared nodes and shared global resources (such as networks and locks), daemons executing on nodes, maintenance activities (such as garbage collection), and queueing delays. Parallelization does not help: breaking a request into parallel parts amplifies latency outliers, since the request always waits for the stragglers.
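The amplification effect is easy to quantify. If each server independently lands in its latency tail with probability p, a request that must wait on n servers is slow with probability 1 - (1 - p)^n; the paper's own example is a 1% per-server tail and a fan-out of 100, which drags roughly 63% of requests into the tail. A quick check:

```python
def p_request_slow(p_server_slow: float, fanout: int) -> float:
    """Probability a fan-out request is slow, assuming each of the
    `fanout` servers is independently slow with p_server_slow."""
    return 1 - (1 - p_server_slow) ** fanout

# One server with a 1% tail: 1% of requests are slow.
print(round(p_request_slow(0.01, 1), 2))    # 0.01
# Fan out to 100 such servers: ~63% of requests are slow.
print(round(p_request_slow(0.01, 100), 2))  # 0.63
```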
Some solutions are proposed, including maintaining QoS and classes of service, and breaking very expensive requests into slices to prevent them from excessively delaying all other requests ('head-of-line blocking').
However, to handle the inevitable tail-latency events, additional strategies are proposed, including hedged requests (send a second request after a brief delay, and cancel outstanding requests once a response is received) and tied requests (enqueue the request on two servers at once, each tagged with the identity of the other, so the server that begins execution first can cancel the backup quickly).
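A hedged request as described above can be sketched with standard Python threading primitives. This is an illustrative toy, not the paper's implementation: `backend_call`, the replica numbers, and the 50 ms hedge delay are all assumptions, and the "cancellation" here is best-effort, as with real RPC hedging.

```python
import concurrent.futures as cf
import random
import time

def backend_call(replica: int) -> str:
    # Stand-in for an RPC; occasionally a replica is a straggler.
    time.sleep(random.choice([0.01, 0.01, 0.5]))
    return f"reply from replica {replica}"

def hedged_request(hedge_delay: float = 0.05) -> str:
    """Fire the primary request; if no reply arrives within
    hedge_delay, fire a backup to a second replica. Whichever
    answers first wins; the loser is cancelled (best effort)."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(backend_call, 1)
        try:
            # Fast path: primary answers before the hedge delay expires.
            return primary.result(timeout=hedge_delay)
        except cf.TimeoutError:
            # Slow path: hedge with a backup request to another replica.
            backup = pool.submit(backend_call, 2)
            done, pending = cf.wait([primary, backup],
                                    return_when=cf.FIRST_COMPLETED)
            for f in pending:
                f.cancel()  # best-effort cancellation of the straggler
            return done.pop().result()

print(hedged_request())
```

The paper notes that a good hedge delay (e.g. the 95th-percentile expected latency) bounds the extra load to a few percent of requests while sharply cutting tail latency.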
Longer-term solutions include micro-partitioning (splitting work into many more chunks than there are machines) for smooth load balancing, selective (additional) replication of hot spots, excluding slow machines or placing them on probation, and canary requests, where a request targeted at thousands of machines is first tried on a smaller set to prevent correlated failures due to untested code paths.

Our take: Practical, insightful, and comprehensive.