Distributed Tracing with Zipkin and Spring

Dave Syer, 2016
Twitter: @david_syer
Email: dsyer@pivotal.io

Agenda

Correlation

Add a Correlation ID to logs, and propagate it between processes (e.g. in HTTP or message headers)

2016-03-05 12:07:03.748 INFO [service1,5b7b3ceafd06fceb,5b7b3ceafd06fceb,false] 18487 --- ...
2016-03-05 12:07:03.748 INFO [service1,5b7b3ceafd06fceb,5b7b3ceafd06fceb,false] 18487 --- ...
2016-03-05 12:07:05.540 INFO [service2,5b7b3ceafd06fceb,162e2f236273a756,false] 18487 --- ...
2016-03-05 12:07:05.541 INFO [service2,5b7b3ceafd06fceb,162e2f236273a756,false] 18487 --- ...

Correlated logs are a building block.

You can answer a lot of questions just with standard log analysis tools (ELK, Splunk, etc.)

Adding a Log Analyser

KibanaUI

Spring Cloud Sleuth

Recipe:

Distributed Tracing

Why is this POST /thing so slow?

           +-------------------------------------+
           |  POST /thing                        |
           +-------------------------------------+

Troubleshooting Latency Problems

When was it and how long did it take?

           Server Received: 15:31:29:103         Server Sent: 15:31:30:530
           v                                     v
           +-------------------------------------+
           |  POST /thing                        | 1427ms
           +-------------------------------------+

First log statement: 15:31:29:103, last one: 15:31:30:530

Where did this happen?

           Server Received: 15:31:29:103         Server Sent: 15:31:30:530
           v                                     v
           +-------------------------------------+
           |  POST /thing                        | 1427ms
           +-------------------------------------+
                            | peer.ipv4 | 1.2.3.4 |

Server is a shard in the wombat cluster, listening on 10.2.3.4:8080

Server log says Client IP was 1.2.3.4

Which Business Event Was It?

           Server Received: 15:31:29:103         Server Sent: 15:31:30:530
           v                                     v
           +-------------------------------------+
           |  POST /thing                        | 1427ms
           +-------------------------------------+
                            | peer.ipv4       | 1.2.3.4  |
                            | http.request-id | abcd-ffe |

The http response header had request-id: abcd-ffe? Is that what you mean?

Is it Abnormal?

           Server Received: 15:31:29:103         Server Sent: 15:31:30:530
           v                                     v
           +-------------------------------------+
           |  POST /thing                        | 1427ms
           +-------------------------------------+
                            | peer.ipv4       | 1.2.3.4  |
                            | http.request-id | abcd-ffe |

Well, average response time for POST /things in the last 2 days is 100ms

I’ll check other logs for this request id and see what I can find out.

Achieving Understanding

           Server Received: 15:31:29:103         Server Sent: 15:31:30:530
           v                                     v
           +-------------------------------------+
           |  POST /thing                        | 1427ms
           +-------------------------------------+
                            | peer.ipv4           | 1.2.3.4                 |
                            | http.request-id     | abcd-ffe                |
                            | http.content.length | 15 MB                   |
                            | http.url            | ...&features=HD-uploads |

Ok, looks like this client is in the experimental group for HD uploads

I searched the logs for others in that group. took about the same time.

Often Two Sides to the Story

    Client Sent: 15:31:28:500                               Client Received: 15:31:31:000
    v                                                       v
    +-------------------------------------------------------+
    |  POST /some                                           | 2500ms
    +-------------------------------------------------------+
       +-------------------------------------+
       |  POST /thing                        | 1427ms
       +-------------------------------------+
       ^                                     ^
       Server Received: 15:31:29:103         Server Sent: 15:31:30:530

And Not All Operations are on the Critical Path

    +-------------------------------------------------------+
    |  POST /some                                           |
    +-------------------------------------------------------+
       +-------------------------------------+
       |  POST /thing                        |
       +-------------------------------------+
               +---------------------------+
               |  Sync Store               |
               +---------------------------+
                     +--------------------------------------------------------------+
                     |  Async Store                                                 |
                     +--------------------------------------------------------------+

Distributed Tracing adds structure to correlation data

It's like centralized logging, but structured at source

Distributed Tracing commoditizes knowledge

Distributed tracing systems collect end-to-end latency graphs (traces) in near real-time.

You can compare traces to understand why certain requests take longer than others.

Distributed Tracing Vocabulary

A Span is an individual operation that took place.

A Span contains timestamped events and tags.

A Trace is an end-to-end latency graph, composed of spans.

In a Picture

           Server Received: 15:31:29:103         Server Sent: 15:31:30:530
           v                                     v
           +-------------------------------------+
           |  POST /thing                        | 1427ms
           +-------------------------------------+
                            | peer.ipv4           | 1.2.3.4                |
                            | http.request-id     | abcd-ffe               |
                            | http.content.length | 15 MB                  |
                            | http.url            |...&features=HD-uploads |

Tracers send trace data out of process

Tracers propagate IDs in-band, to tell the receiver there's a trace in progress

Completed spans are reported out-of-band, to reduce overhead and allow for batching

Zipkin

ZipkinUI

Getting Started

$ java -jar zipkin-server.jar
@EnableZipkinServer
@SpringBootApplication
public class ZipkinServer {

...

}

Then:

$ curl -s localhost:9411/api/v1/services | jq .
[ "service1", "service2" ]

Spring Cloud Sleuth and Zipkin

To instrument your application and have it send span data to a collector:

#