Article: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Authors: Benjamin H. Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag
Published: April 2010
Link: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36356.pdf
Summary:
Dapper is Google's low-overhead distributed tracing system. Its design goals are to be always on, so that it catches behavior that is hard to reproduce, while keeping overhead low. It should also be transparent at the application level. The work done on behalf of one request may span a whole tree of servers, and all of the resulting requests must be captured.
Dapper is annotation-based: requests are annotated so that they can be traced back to the originating user request. Each one carries a parent id as well as a trace id, both random 64-bit identifiers. A unified mechanism allows most code to pass annotations uniformly. Traces are sampled to reduce overhead, using adaptive sampling that raises the sampling rate for low-traffic services enough to collect meaningful samples.
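The id scheme described above can be illustrated with a small sketch. This is not Dapper's code: the names SpanContext, start_trace, and start_child are hypothetical, the fixed sample_rate stands in for the adaptive sampling the paper describes, and making the sampling decision once at the root is an assumption of this sketch.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical annotation carried along with each request."""
    trace_id: int             # shared by every request in one tree
    span_id: int              # identifies this particular request
    parent_id: Optional[int]  # None for the original user request
    sampled: bool             # whether this trace is being recorded


def new_id(rng: random.Random) -> int:
    # Random 64-bit identifier, as in the summary above.
    return rng.getrandbits(64)


def start_trace(rng: random.Random, sample_rate: float) -> SpanContext:
    # Assumption: the sampling decision is made once, at the root,
    # with a fixed rate (not the adaptive scheme from the paper).
    sampled = rng.random() < sample_rate
    return SpanContext(new_id(rng), new_id(rng), None, sampled)


def start_child(parent: SpanContext, rng: random.Random) -> SpanContext:
    # A downstream request keeps the trace id, gets a fresh id of its own,
    # and records the caller's id as its parent id.
    return SpanContext(parent.trace_id, new_id(rng), parent.span_id, parent.sampled)


rng = random.Random(0)
root = start_trace(rng, sample_rate=1.0)
child = start_child(root, rng)
grandchild = start_child(child, rng)
```

Following parent_id from grandchild to child to root rebuilds the request tree, and the shared trace_id groups all three under one user request.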
Benefits gained from Dapper include performance analysis, identifying unnecessary requests and bottlenecks, ensuring correctness, and helping developers gain a better understanding of the system and its flow. It has also proven useful for testing and exception monitoring, and for debugging latency issues, service dependencies, and network usage. However, due to its request-based nature, Dapper does not handle batched workloads perfectly, as it is not clear which of the batched requests a given piece of work is actually associated with.
Our take: While tracing is not perceived as sexy, it is certainly useful, and the Dapper paper was clear and thorough in describing Google's solution to this thorny problem.