概述

我们现在所处的生产环境是一个集Nodejs, Go, Java, Ruby, Scala等多种语言程序的混合场景.Twitter的Finagle框架, 是一个基于Thrift协议的RPC框架,其中Zipkin是针对Finagle框架的一个基于Thrift协议的RPC调用链跟踪的工具，可搜集各服务调用数据，并提供分析查询展示功能。帮助识别分布式RPC调用链里，哪一个调用比较耗时，性能有问题，以及是否有异常等，使得诊断分布式系统性能成为可能。

基本概念

Trace

一次服务调用追踪链路，由一组Span组成。需在web总入口处生成TraceID，并确保在当前请求上下文里能访问到

Annotation

表示某个时间点发生的Event

Event类型

cs：Client Send 请求
sr：Server Receive到请求
ss：Server 处理完成、并Send Response
cr：Client Receive 到响应

什么时候生成
客户端发送Request、接受到Response、服务器端接受到Request、发送 Response时生成。Annotation属于某个Span，需把新生成的Annotation添加到当前上下文里Span的annotations数组里

thrift数据结构

/**
 * Associates an event that explains latency with a timestamp.
 *
 * Unlike log statements, annotations are often codes: for example "sr".
 */
struct Annotation {
  /**
   * Microseconds from epoch.
   *
   * This value should use the most precise value possible. For example,
   * gettimeofday or syncing nanoTime against a tick of currentTimeMillis.
   */
  1: i64 timestamp
  /**
   * Usually a short tag indicating an event, like "sr" or "finagle.retry".
   */
  2: string value
  /**
   * The host that recorded the value, primarily for query by service name.
   */
  3: optional Endpoint host
  // don't reuse 4: optional i32 OBSOLETE_duration         // how long did the operation take? microseconds
}

BinaryAnnotation

存放用户自定义信息，比如：sessionID、userID、userIP、异常等

什么时候生成?
在任意需要记录自定义跟踪信息时都可生成。比如：异常、SessionID等。如Annotation一样，BinaryAnnotation也属于某个Span。需把新生成的BinaryAnnotation，添加到当前上下文里Span的binary_annotations数组.

Thrift数据结构

    /**
 * Binary annotations are tags applied to a Span to give it context. For
 * example, a binary annotation of "http.uri" could the path to a resource in a
 * RPC call.
 *
 * Binary annotations of type STRING are always queryable, though more a
 * historical implementation detail than a structural concern.
 *
 * Binary annotations can repeat, and vary on the host. Similar to Annotation,
 * the host indicates who logged the event. This allows you to tell the
 * difference between the client and server side of the same key. For example,
 * the key "http.uri" might be different on the client and server side due to
 * rewriting, like "/api/v1/myresource" vs "/myresource. Via the host field,
 * you can see the different points of view, which often help in debugging.
 */
 struct BinaryAnnotation {
  /**
   * Name used to lookup spans, such as "http.uri" or "finagle.version".
   */
  1: string key,
  /**
   * Serialized thrift bytes, in TBinaryProtocol format.
   *
   * For legacy reasons, byte order is big-endian. See THRIFT-3217.
   */
  2: binary value,
  /**
   * The thrift type of value, most often STRING.
   *
   * annotation_type shouldn't vary for the same key.
   */
  3: AnnotationType annotation_type,
  /**
   * The host that recorded value, allowing query by service name or address.
   *
   * There are two exceptions: when key is "ca" or "sa", this is the source or
   * destination of an RPC. This exception allows zipkin to display network
   * context of uninstrumented services, such as browsers or databases.
   */
  4: optional Endpoint host
}

AnnotationType 结构

/**
* A subset of thrift base types, except BYTES.
*/
enum AnnotationType {
    BOOL,  BYTES,I16,  I32,  I64,  DOUBLE,  STRING
}

Span

表示一次完整RPC调用，是由一组Annotation和BinaryAnnotation组成。是追踪服务调用的基本结构，多span形成树形结构组合成一次Trace追踪记录。Span是有父子关系的，比如：Client A、Client A -> B、B ->C、C -> D、分别会产生4个Span。Client A接收到请求会时生成一个Span A、Client A -> B发请求时会再生成一个Span A-B，并且Span A是 Span A-B的父节点

什么时候生成
- 服务接受到 Request时，若当前Request没有关联任何Span，便生成一个Span，包括：Span ID、TraceID
- 向下游服务发送Request时，需生成一个Span，并把新生成的Span的父节点设置成上一步生成的Span

Thrift结构

/**
* A trace is a series of spans (often RPC calls) which form a latency tree.
*
* Spans are usually created by instrumentation in RPC clients or servers, but
* can also represent in-process activity. Annotations in spans are similar to
* log statements, and are sometimes created directly by application developers
* to indicate events of interest, such as a cache miss.
*
* The root span is where parent_id = Nil; it usually has the longest duration
* in the trace.
*
* Span identifiers are packed into i64s, but should be treated opaquely.
* String encoding is fixed-width lower-hex, to avoid signed interpretation.
*/
struct Span {
/**
* Unique 8-byte identifier for a trace, set on all spans within it.
*/
1: i64 trace_id
/**
* Span name in lowercase, rpc method for example. Conventionally, when the
* span name isn't known, name = "unknown".
*/
3: string name,
/**
* Unique 8-byte identifier of this span within a trace. A span is uniquely
* identified in storage by (trace_id, id).
*/
4: i64 id,
/**
* The parent's Span.id; absent if this the root span in a trace.
*/
5: optional i64 parent_id,
/**
* Associates events that explain latency with a timestamp. Unlike log
* statements, annotations are often codes: for example SERVER_RECV("sr").
* Annotations are sorted ascending by timestamp.
*/
6: list annotations,
/**
* Tags a span with context, usually to support query or aggregation. For
* example, a binary annotation key could be "http.uri".
*/
8: list binary_annotations
/**
* True is a request to store this span even if it overrides sampling policy.
*/
9: optional bool debug = 0
/**
* Epoch microseconds of the start of this span, absent if this an incomplete
* span.
*
* This value should be set directly by instrumentation, using the most
* precise value possible. For example, gettimeofday or syncing nanoTime
* against a tick of currentTimeMillis.
*
* For compatibilty with instrumentation that precede this field, collectors
* or span stores can derive this via Annotation.timestamp.
* For example, SERVER_RECV.timestamp or CLIENT_SEND.timestamp.
*
* Timestamp is nullable for input only. Spans without a timestamp cannot be
* presented in a timeline: Span stores should not output spans missing a
* timestamp.
*
* There are two known edge-cases where this could be absent: both cases
* exist when a collector receives a span in parts and a binary annotation
* precedes a timestamp. This is possible when..
*  - The span is in-flight (ex not yet received a timestamp)
*  - The span's start event was lost
*/
10: optional i64 timestamp,
/**
* Measurement in microseconds of the critical path, if known.
*
* This value should be set directly, as opposed to implicitly via annotation
* timestamps. Doing so encourages precision decoupled from problems of
* clocks, such as skew or NTP updates causing time to move backwards.
*
* For compatibility with instrumentation that precede this field, collectors
* or span stores can derive this by subtracting Annotation.timestamp.
* For example, SERVER_SEND.timestamp - SERVER_RECV.timestamp.
*
* If this field is persisted as unset, zipkin will continue to work, except
* duration query support will be implementation-specific. Similarly, setting
* this field non-atomically is implementation-specific.
*
* This field is i64 vs i32 to support spans longer than 35 minutes.
*/
11: optional i64 duration
}

服务之间需传递的信息

Trace的基本信息需在上下游服务之间传递，如下信息是必须的：

Trace ID：起始(根)服务生成的TraceID
Span ID：调用下游服务时所生成的Span ID
Parent Span ID：父Span ID
Is Sampled：是否需要采样
Flags：告诉下游服务，是否是debug Reqeust

Trace Tree组成

一个完整Trace 由一组Span组成，这一组Span必须具有相同的TraceID；Span具有父子关系，处于子节点的Span必须有parent_id，Span由一组 Annotation和BinaryAnnotation组成。整个Trace Tree通过Trace Id、Span ID、parent Span ID串起来的。

其他要求

Web入口处，需把SessionID、UserID(若登陆)、用户IP等信息记录到BinaryAnnotation里
关键子子调用也需用zipkin追踪，比如：订单调用了Mysql，也许把个调用的耗时情况记录到 Annotation里
关键出错日志或者异常也许记录到BinaryAnnotation里

经过上述三条，用户任何访问所引起的后台服务间调用，完全可以串起来，并形成一颗调用树。通过调用树，哪个调用耗时多久，是否有异常等都可清晰定位到。

完整示例

testService(Web服务) -> OrderServ(Thrift) -> StockServ & PayServ(Thrift)。一共有四个服务，testService 调用 OrderServ、OrderServ同时调用 StockServ和PayServ。需生成的Trace信息如下：

testService收到Http Reqeust时，需在入口处生成TraceID、SpanID，以及一个Span对象，假若叫Span1。
testService向OrderServ发送 Thrift Request时，需新生成一Span2，并把parent ID设置成Span1的spanID。同事需修改Thrift Header，把Span2的spanID、parent ID、TraceID 传递给下游服务。也需生成"cs" Annotation，关联到span2上；当接受到OrderServ的Response时，再生成"cr" Annotation，也关联到span2上。
OrderServ接受到Thrift Request后，从Thrift Header里解析到TraceID、parent ID、 Span ID(span2)、并保留到上下文里。同时生成"sr"Annotaition，并关联到span2上；当处理完成发送response时，再生成"ss"Annotation，并关联到span2上。
OrderServ向StockServ发送 Thrift Request时，需新生成一Span3，并把parentID设置成上一步(Span2)的span ID。Annotation处理如上。
Order Serv向PayServ发送请求时，新生成一Span4，并把parentID设置Span2的span ID。Annotation处理如上。

总结客户端要做哪些事情？

Thrift协议扩展（参照Finagle）

Thrift Server端向Thrift协议Header里增加TraceID、SpanID、ParentID、Flags、Is Sampled
Thrift Client从Thrift Request里解析出TraceID、SpanID、ParentID、flags、Is Sampled

Trace数据生成

Web入口处，生成TraceID、SpanID、以及Span对象，并把sessionID、userID、userIP记录到BinaryAnnotation里.
调用下游服务时生成Span对象.
Thrift客户端Send Request时，生成"cs" Annotation.
Thrift客户端Receive Response时，生成"cr"Annotation.
Thrift服务器端Receive Reqeust时，生成"sr" Annotation.
Thrift服务器端Send Response时，生成"ss"Annotation.
异常、关键系统调用数据也要记录到Annotation里.

Trace数据传递与收集

数据刷新不能影响业务性能，刷新失败更不能影响正常业务逻辑.
上下文传递
- 入口处生成的TraceID、SpanID需要在当前Request上下文里读取到
- Ruby 、java、NodeJS上下文传递要做到业务代码透明， Golang单独处理
客户端生成Trace数据，并写入到本地log文件
由logstash(类似工具)统一搜集日志，并写入kafka
由kafka consumer从kafka读取数据，并索引到ES里

Zipkin跟踪数据收集格式定义

我们针对跟踪数据的收集的统一接入点, 为kafka.所有应用产生的zipkin数据统一发送到Kafka集群中.具体的channel为: EAGLEYE_ZIPKIN_CHANNEL.

Span(记录每次调用信息)

JSON串范例(格式化)：

span{
    "app": "app", //所属应用
    "flag": "flag", //cscr标明是调用端，srss标明是被调用端。标识, 一个span数据 ,会有基本的cr, cs, sr, ss四个点. 但是获取数据时,一般只能两两获取, 所以, 一个span通常会被分割为cr, cs和 sr, ss分别发送
    "ip": "192.168.10.100", //ip地址
    "mname": "mname", //方法名
    "pid": "pid", //进程id
    "port": 1000, //端口, 如果是cscr客户端, 则为0
    "psid": "psid", //parent span id
    "sid": "sid", //spanId
    "sname": "sname", //服务名
    "etime": 1449072000000, //结束时间点, 单位到ms,如果是client端,则是cr时间点, 如果是server端,则是ss时间点
    "stime": 1449039924512, //开始时间点, 单位到ms, 如果是client端,则是cs时间点, 如果是server端,则是sr时间点
    "duration": 32075488, //结束时间点减去开始时间点的, 调用的区间时长, 单位ms
    "tid": "tid", //traceId
    "timestamp": 1449039924548 //产生的时间点
}

备注: span信息的json,添加'span'头信息, 在通过kafka做收集时, 通过该头信息将span信息路由到独立的es的index中. 每一条span记录其实是半个span信息, 要么是client端产生的, 要么是server端产生的.

BinaryAnnotation(可以用来传递用户session_id的信息, 也可以传递其他业务信息)

JSON串范例(格式化)

{
    "app": "app", //所属应用
    "ip": "ip", //ip地址,冗余信息
    "key": "key", //key, 可以设为存储用户session的key, 如果是用来传递用户session信息的, 可以统一约定为: session_id
    "mname": "mname",  //方法名
    "pid": "10000", //进程id,冗余信息
    "sid": "sid", //spanId
    "sname": "sname", //服务名
    "tid": "tid", //traceId
    "timestamp": 1449038780194, //产生的时间戳, 长整型, 精确到毫秒
    "type": "type", //类型,用来区分是记录异常的还是业务流程的等等, 默认是'common'即可
    "value": "value" //如果是传递用户session信息 ,可以直接写在该字段中.
}

备注:这里将BinaryAnnotation独立记录, 并发送到kafka消息队列, 其通过sid关联具体的span信息.

java使用zipkin经验

客户端调用服务端失败（服务端没启动），客户端的annotations有cs、cr
客户端超时（等待响应超时），服务端的annotations有sr、ss，客户端的annotations有cs、cr
服务端超时，服务端的annotations有timeout和sr，没有ss、没有binary_annotation,客户端的annotations有cs、cr
客户端无法记录binary_annotation

实际效果

收集了Zipkin产生的RPC调用链信息,并给服务治理框架提供跟踪信息检索(双击某行有问题的Span记录可以查看某次调用链详情,了解具体的网络情况和业务执行情况)

rpc的span信息

rpc调用链详情

基于Zipkin的Thrift服务RPC调用链跟踪

概述