Introduction to the BBR Congestion Control Algorithm and Code Analysis

Network model

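A congestion control algorithm must resolve congestion in network transmission while using the available bandwidth as efficiently as possible. Based on studies of network behavior, BBR simplifies the network into the following model: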

network_mode.png

Abstract model:

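The network path behaves like a pipe with one narrowest point (the bottleneck). When the sending rate exceeds the capacity of that narrowest point, packets start to queue. When the queue grows beyond what the bottleneck buffer can hold, packets are dropped.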

Key concepts:

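BtlBw: bottleneck bandwidth. This is the capacity of the narrowest point of the pipe, i.e. the pipe's minimum diameter.
RTprop: round-trip propagation time. This is the time a packet takes to travel through the pipe and back when there is no queuing (the absence of queuing is the key point). It corresponds to the length of the pipe.
BDP (bandwidth-delay product): the amount of data that has been sent but not yet acknowledged (called inflight) when the pipe is exactly full. BDP = BtlBw * RTprop. For example, if BtlBw = 10 and RTprop = 6, then BDP = 60. Of these 60 units, 30 are still travelling through the pipe toward the receiver. The other 30 have already arrived, but the sender has not yet received ACKs for any of the 60 units, because each ACK needs another 3 time units to come back.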
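The arithmetic of the example can be checked with a few lines of C. This is a minimal sketch, not picoquic code. It assumes abstract data units per time unit and a symmetric path (one-way delay = RTprop / 2), which is what the example implies. The function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: bandwidth-delay product = BtlBw * RTprop. */
uint64_t bbr_demo_bdp(uint64_t btl_bw, uint64_t rt_prop)
{
    return btl_bw * rt_prop;
}

/* Hypothetical helper: data still travelling toward the receiver on the
 * forward path, assuming the one-way delay is half of RTprop. */
uint64_t bbr_demo_forward_data(uint64_t btl_bw, uint64_t rt_prop)
{
    return btl_bw * (rt_prop / 2);
}
```

With BtlBw = 10 and RTprop = 6, `bbr_demo_bdp` gives 60 and `bbr_demo_forward_data` gives 30. The remaining 60 - 30 = 30 units have reached the receiver, but their ACKs are still on the way back.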

Three regimes:

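app limited: little data is being sent and inflight is below BDP. The pipe is not yet full, so the sender can keep raising its rate without building a queue. The RTT stays at RTprop, and the delivery rate is bounded by inflight / RTprop rather than by BtlBw.
bandwidth limited: the pipe is full and inflight exceeds BDP. The network node at the bottleneck has a buffer, so any additional data queues there. The delivery rate is now capped at BtlBw and the RTT grows with the queue.
buffer limited: once the bottleneck buffer is full, packets start to be dropped.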
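The three regimes can be expressed as a classification on inflight, together with the RTT the idealised model predicts. This is a minimal sketch under the pipe model above, not part of BBR or picoquic. The enum, the function names and the explicit `buffer` parameter (bottleneck buffer size) are hypothetical.

```c
#include <stdint.h>

/* Hypothetical labels for the three regimes of the pipe model. */
typedef enum {
    REGIME_APP_LIMITED,       /* inflight < BDP: pipe not yet full */
    REGIME_BANDWIDTH_LIMITED, /* BDP <= inflight <= BDP + buffer: queue building */
    REGIME_BUFFER_LIMITED     /* inflight > BDP + buffer: bottleneck drops packets */
} demo_regime_t;

/* Classify the regime from unacknowledged data, path BDP and bottleneck
 * buffer size, all in the same unit (e.g. bytes). */
demo_regime_t demo_classify_regime(uint64_t inflight, uint64_t bdp, uint64_t buffer)
{
    if (inflight < bdp) {
        return REGIME_APP_LIMITED;
    }
    if (inflight <= bdp + buffer) {
        return REGIME_BANDWIDTH_LIMITED;
    }
    return REGIME_BUFFER_LIMITED;
}

/* RTT predicted by the idealised model: RTprop while the pipe is not full,
 * otherwise RTprop plus the time to drain the queue (inflight - BDP) at
 * BtlBw. Only meaningful up to BDP + buffer; beyond that data is dropped. */
double demo_model_rtt(double inflight, double btl_bw, double rt_prop)
{
    double bdp = btl_bw * rt_prop;
    if (inflight <= bdp) {
        return rt_prop;
    }
    return rt_prop + (inflight - bdp) / btl_bw;
}
```

Take the numbers from the BDP example (BtlBw = 10, RTprop = 6, so BDP = 60) and a hypothetical buffer of 20. Then 30 units in flight is app limited with RTT 6, 70 is bandwidth limited, and 90 overflows the buffer. With 80 units in flight, the model RTT rises to 8 because 20 units are queued at the bottleneck.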

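Loss-based congestion control algorithms react only in the third regime, which is clearly too late. By then the buffer is full and the RTT is already high.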

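Leonard Kleinrock showed that the optimal operating point lies on the BDP line in the figure above. However, Jeffrey M. Jaffe proved that no distributed algorithm can converge to that point.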

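BBR was proposed by a team at Google. It operates in the second regime, close to the BDP line. In other words, it needs a small amount of queuing in the bottleneck buffer in order to measure BtlBw.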

The BBR algorithm

BBR does two things:

Periodically probe the bottleneck bandwidth BtlBw

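As the figure above shows, probing the bottleneck bandwidth requires entering the second regime and letting a small queue build in the buffer. BBR therefore cycles through a pacing_gain, e.g. pacing_gain_cycle = { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.25, 0.75}, to probe BtlBw upward periodically. To clear the resulting queue quickly, each probe-up phase (1.25) is immediately followed by a drain phase (0.75), each lasting roughly one RTT. When the path bandwidth suddenly increases, the BtlBw estimate grows exponentially, by up to a factor of 1.25 per gain cycle (1.25^n after n cycles). The sending rate therefore follows the new bandwidth quickly.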
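The effect of the gain cycle can be illustrated with a small fluid-model sketch. It is not the picoquic implementation. It assumes the delivery rate measured during the 1.25 phase is min(sending rate, true bandwidth), and that the estimate keeps the largest sample. The drain and 1.0 phases are ignored, and the function names are hypothetical.

```c
#include <stddef.h>

/* The 8-phase gain cycle quoted above, in the order picoquic lists it. */
static const double demo_gain_cycle[8] = {
    1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.25, 0.75
};

/* Hypothetical helper: pacing rate for a phase = pacing_gain * BtlBw. */
double demo_pacing_rate(size_t phase, double btl_bw)
{
    return demo_gain_cycle[phase % 8] * btl_bw;
}

/* Idealised model of tracking a bandwidth increase: in each gain cycle the
 * 1.25 phase measures min(1.25 * estimate, true_bw), and the estimate keeps
 * the largest sample. Returns the number of gain cycles needed for the
 * estimate to reach true_bw, or -1 for a non-positive estimate. */
int demo_cycles_to_track(double estimate, double true_bw)
{
    int cycles = 0;
    if (estimate <= 0.0) {
        return -1;
    }
    while (estimate < true_bw) {
        double sample = demo_pacing_rate(6, estimate); /* phase 6 has gain 1.25 */
        if (sample > true_bw) {
            sample = true_bw;
        }
        estimate = sample; /* sample > estimate here, so it is the new max */
        cycles++;
    }
    return cycles;
}
```

The gains average to exactly 1.0 over one cycle ((6 * 1.0 + 1.25 + 0.75) / 8), so the extra data sent while probing is taken back in the following phase. In this model, a doubling of the bottleneck bandwidth is tracked in 4 cycles, because 1.25^3 < 2 <= 1.25^4.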

Periodically probe RTprop

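To probe RTprop, the connection has to return to the first regime. The buffer queue must be drained and the pipe must not be full, i.e. inflight < BDP.
Probing RTprop therefore costs some throughput, so it is done much less often than bandwidth probing, typically once every 10 s.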
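The throughput cost can be estimated roughly. In the picoquic code analysed below, ProbeRTT ends only after BBR_PROBE_RTT_DURATION has elapsed and a full round trip has completed. One probe therefore lasts about max(duration, RTT). The sketch below assumes a 200 ms probe duration (the value used in the BBR design; picoquic's constant is assumed to be similar) and the 10 s interval mentioned above. It ignores the time needed to drain inflight down to a few packets. The function name is hypothetical.

```c
/* Hypothetical helper: rough fraction of time spent in ProbeRTT.
 * One probe lasts about max(probe_duration, rtt) because the exit
 * condition needs both the timer and a completed round trip.
 * All times are in seconds. */
double demo_probe_rtt_fraction(double probe_duration, double rtt, double probe_interval)
{
    double one_probe = probe_duration > rtt ? probe_duration : rtt;
    return one_probe / probe_interval;
}
```

With a 50 ms RTT this gives about 0.2 / 10 = 2% of the time. On a 500 ms RTT path the round-trip condition dominates and the share rises to about 5%.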

Send window (cwnd)

cwnd = cwnd_gain * BDP = cwnd_gain * BtlBw * RTprop

When probing BtlBw, cwnd_gain is raised so that a small queue can form in the buffer.

When probing RTprop, cwnd_gain is lowered so that the buffer drains.
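Both gains act on the same model: the window is cwnd_gain * BtlBw * RTprop and the pacing rate is pacing_gain * BtlBw. The picoquic code below calls BBRInflight(bbr_state, gain) for the window-style quantity. Its body is not shown, so the stand-in here is a hypothetical sketch. It assumes BtlBw in bytes per second and RTprop in microseconds (assumed to match picoquic's time unit).

```c
#include <stdint.h>

/* Hypothetical stand-in for gain * BtlBw * RTprop, the quantity
 * BBRInflight(bbr_state, gain) is used for in the code below.
 * btl_bw: bytes per second; rt_prop_us: microseconds (assumed unit). */
uint64_t demo_inflight_target(double gain, uint64_t btl_bw, uint64_t rt_prop_us)
{
    double bdp = (double)btl_bw * (double)rt_prop_us / 1000000.0;
    return (uint64_t)(gain * bdp);
}

/* Hypothetical helper: pacing rate in bytes per second = pacing_gain * BtlBw. */
uint64_t demo_pacing_rate_bps(double pacing_gain, uint64_t btl_bw)
{
    return (uint64_t)(pacing_gain * (double)btl_bw);
}
```

At 1,250,000 bytes/s (10 Mbit/s) and a 40 ms RTprop, the BDP is 50,000 bytes. The ProbeBW cwnd_gain of 1.5 used in the code below then allows 75,000 bytes in flight.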

Four phases of congestion control

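Startup: covers the app limited regime and part of the bandwidth limited regime. The TCP counterpart is slow start.
Drain: inflight > BDP. The queue built up during startup is drained so that the RTT falls back to RTprop and RTprop can be measured. Roughly analogous to TCP congestion avoidance.
Bottleneck bandwidth probing: the gain is periodically raised so that a small queue forms in the buffer and BtlBw can be measured. Roughly analogous to the congestion phase of a TCP algorithm.
~~Loss recovery: the buffer queue is full, the path starts dropping packets, and cwnd_gain is reduced to stop the loss~~ (BBR avoids filling the bottleneck buffer and losing packets, so it has no such phase)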

bbr_stage.png
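The transitions between these phases, plus the periodic RTprop probe, can be summarised as a small state machine. This is a simplified sketch, not picoquic's code. The state names mirror picoquic_bbr_alg_startup, _drain, _probe_bw and _probe_rtt, but the enum and function are hypothetical. The exit from ProbeRTT follows the BBR draft: back to ProbeBW if the pipe was already filled, otherwise back to Startup.

```c
/* Hypothetical state names mirroring picoquic_bbr_alg_startup, _drain,
 * _probe_bw and _probe_rtt. */
typedef enum {
    ST_STARTUP, ST_DRAIN, ST_PROBE_BW, ST_PROBE_RTT
} demo_bbr_state_t;

/* One simplified transition step. The flags stand for:
 *   filled_pipe     - startup has seen BtlBw stop growing
 *   inflight_le_bdp - bytes in transit <= 1.0 * BDP
 *   rt_prop_expired - RTprop has not been refreshed for ~10 s
 *   probe_rtt_done  - ProbeRTT has lasted long enough */
demo_bbr_state_t demo_bbr_next_state(demo_bbr_state_t s, int filled_pipe,
    int inflight_le_bdp, int rt_prop_expired, int probe_rtt_done)
{
    if (s != ST_PROBE_RTT && rt_prop_expired) {
        return ST_PROBE_RTT;
    }
    switch (s) {
    case ST_STARTUP:
        return filled_pipe ? ST_DRAIN : ST_STARTUP;
    case ST_DRAIN:
        return inflight_le_bdp ? ST_PROBE_BW : ST_DRAIN;
    case ST_PROBE_BW:
        return ST_PROBE_BW; /* gain cycling happens inside this state */
    case ST_PROBE_RTT:
    default:
        if (!probe_rtt_done) {
            return ST_PROBE_RTT;
        }
        return filled_pipe ? ST_PROBE_BW : ST_STARTUP;
    }
}
```

In the real code, these checks are spread across BBRCheckDrain, BBRCheckProbeRTT and BBRHandleProbeRTT, which run on every ACK.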

Code analysis

The analysis below is based on the open-source picoquic code.

Startup:

void BBREnterStartup(picoquic_bbr_state_t* bbr_state)
{
    bbr_state->state = picoquic_bbr_alg_startup;
    /* Raise the sending rate quickly during startup */
    bbr_state->pacing_gain = BBR_HIGH_GAIN; 
    bbr_state->cwnd_gain = BBR_HIGH_GAIN;
}
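BBR_HIGH_GAIN is not defined in this excerpt. The BBR paper and the IETF draft use a startup gain of 2/ln2 ≈ 2.885, the smallest gain that lets the sending rate double every round trip. This sketch assumes picoquic uses the same value. The helper names are hypothetical, and ln 2 is written out as a literal so that no math library is needed.

```c
/* ln(2), written out so the sketch needs no libm. */
#define DEMO_LN2 0.69314718055994530942

/* Hypothetical helper: startup gain from the BBR paper/draft, 2 / ln 2. */
double demo_bbr_high_gain(void)
{
    return 2.0 / DEMO_LN2;
}

/* Idealised startup: the delivery rate doubles every round trip until it
 * reaches the bottleneck bandwidth. Returns the number of rounds needed,
 * or -1 for a non-positive initial rate. */
int demo_startup_rounds(double initial_rate, double btl_bw)
{
    int rounds = 0;
    if (initial_rate <= 0.0) {
        return -1;
    }
    while (initial_rate < btl_bw) {
        initial_rate *= 2.0;
        rounds++;
    }
    return rounds;
}
```

Under this model, startup needs about log2(BtlBw / initial rate) rounds. For example, a 1000-fold gap is closed in 10 rounds because 2^10 = 1024.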

Drain:

void BBRCheckDrain(picoquic_bbr_state_t* bbr_state, uint64_t bytes_in_transit, uint64_t current_time)
{
    /* Startup -> Drain: filled_pipe means inflight has reached BDP, the pipe is full and a queue is building in the buffer */
    if (bbr_state->state == picoquic_bbr_alg_startup && bbr_state->filled_pipe) {
        BBREnterDrain(bbr_state);
    }
    
    /* Drain -> ProbeBW: inflight <= BDP, so the buffer queue is gone */
    if (bbr_state->state == picoquic_bbr_alg_drain && bytes_in_transit <= BBRInflight(bbr_state, 1.0)) {
        BBREnterProbeBW(bbr_state, current_time);  /* we estimate queue is drained */
    }
}

void BBREnterDrain(picoquic_bbr_state_t* bbr_state)
{
    bbr_state->state = picoquic_bbr_alg_drain;
    /* Entering drain: lower the sending rate below BtlBw to empty the queue */
    bbr_state->pacing_gain = 1.0 / BBR_HIGH_GAIN;  /* pace slowly */
    bbr_state->cwnd_gain = BBR_HIGH_GAIN;   /* maintain cwnd */
}
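The drain phase can be sized with a fluid-model sketch. The model assumes that at the end of startup inflight is about cwnd_gain * BDP = BBR_HIGH_GAIN * BDP (value assumed to be 2/ln2). While inflight stays above one BDP, the bottleneck is saturated and delivers one BDP per RTprop, while the sender injects only BDP / BBR_HIGH_GAIN. The function name is hypothetical.

```c
/* Hypothetical fluid-model sketch of the drain phase. inflight_bdp is
 * measured in units of BDP; each round the bottleneck delivers 1 BDP while
 * the sender adds pacing_gain = 1 / high_gain BDP. Returns the number of
 * rounds until inflight <= 1 BDP, or -1 if high_gain <= 1. */
int demo_drain_rounds(double inflight_bdp, double high_gain)
{
    double pacing_gain;
    int rounds = 0;
    if (high_gain <= 1.0) {
        return -1; /* the gain must exceed 1 for the queue to shrink */
    }
    pacing_gain = 1.0 / high_gain;
    while (inflight_bdp > 1.0) {
        inflight_bdp -= 1.0 - pacing_gain; /* delivered minus newly sent */
        rounds++;
    }
    return rounds;
}
```

Starting from 2.885 BDP, the excess of 1.885 BDP shrinks by 0.653 BDP per round, so the queue is gone after 3 rounds. In general, (g - 1) / (1 - 1/g) = g, so the drain takes about BBR_HIGH_GAIN round trips in this model.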

Bottleneck bandwidth probing (ProbeBW)

/* Decide whether to move on to the next pacing_gain phase */
int BBRIsNextCyclePhase(picoquic_bbr_state_t* bbr_state, uint64_t prior_in_flight, uint64_t packets_lost, uint64_t current_time)
{
    /* Each gain phase must last at least one RTprop */
    int is_full_length = (current_time - bbr_state->cycle_stamp) > bbr_state->rt_prop;
    
    if (bbr_state->pacing_gain != 1.0) {
        if (bbr_state->pacing_gain > 1.0) {
            /* Leave the probe-up phase once losses appear (buffer limited) or inflight has reached pacing_gain * BDP */
            is_full_length &=
                (packets_lost > 0 ||
                    prior_in_flight >= BBRInflight(bbr_state, bbr_state->pacing_gain));
        }
        else {  /*  (BBR.pacing_gain < 1) */
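            /* Drain phase done once inflight is back to one BDP. The BBR draft ends this phase when is_full_length OR this condition holds; this code requires both. */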
            is_full_length &= prior_in_flight <= BBRInflight(bbr_state, 1.0);
        }
    }
    return is_full_length;
}

/* Move to the next pacing_gain phase; pacing_gain_cycle = { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.25, 0.75} */
void BBRAdvanceCyclePhase(picoquic_bbr_state_t* bbr_state, uint64_t current_time)
{
    bbr_state->cycle_stamp = current_time;
    bbr_state->cycle_index++;
    /* One full cycle completed */
    if (bbr_state->cycle_index >= BBR_GAIN_CYCLE_LEN) {
        int start = bbr_state->cycle_start;
        /* BtlBw increased: start the next cycle later in the table so the 1.25 probe comes sooner */
        if (bbr_state->btl_bw_increased) {
            bbr_state->btl_bw_increased = 0;
            start++;
            if (start > BBR_GAIN_CYCLE_MAX_START) {
                start = BBR_GAIN_CYCLE_MAX_START;
            }
        }
        else if (start > 0) {
        /* BtlBw did not increase: move the start point gradually back toward 0 */
            start--;
        }
        bbr_state->cycle_index = start;
        bbr_state->cycle_start = start;
    }
   
    bbr_state->pacing_gain = bbr_pacing_gain_cycle[bbr_state->cycle_index];
}

void BBREnterProbeBW(picoquic_bbr_state_t* bbr_state, uint64_t current_time)
{
    int start = 0;
    bbr_state->state = picoquic_bbr_alg_probe_bw;
    bbr_state->pacing_gain = 1.0;
    bbr_state->cwnd_gain = 1.5;

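    /* Choose where to start in the pacing_gain cycle: the longer rt_prop is relative to PICOQUIC_TARGET_RENO_RTT, the later the start (capped at BBR_GAIN_CYCLE_MAX_START) */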
    if (bbr_state->rt_prop > PICOQUIC_TARGET_RENO_RTT) {
        start = (int)(bbr_state->rt_prop / PICOQUIC_TARGET_RENO_RTT);
        if (start > BBR_GAIN_CYCLE_MAX_START) {
            start = BBR_GAIN_CYCLE_MAX_START;
        }
    }
    else {
        start = 2;
    }

    bbr_state->cycle_index = start;
    bbr_state->cycle_start = start;
    bbr_state->btl_bw_increased = 1;

    BBRAdvanceCyclePhase(bbr_state, current_time);
}

/* Track the round count using the "delivered" counter. The value carried per
 * packet is the delivered count when this packet was sent. If it is greater
 * than or equal to next_round_delivered, it means that the packet was sent at
 * or after the beginning of the round, and thus that at least one RTT has
 * elapsed for this round. */

void BBRUpdateBtlBw(picoquic_bbr_state_t* bbr_state, picoquic_path_t* path_x)
{
    uint64_t bandwidth_estimate = path_x->bandwidth_estimate;

    /* In startup, use at least half of the maximum bandwidth estimate so that BtlBw ramps up quickly */
    if (bbr_state->state == picoquic_bbr_alg_startup &&
        bandwidth_estimate < (path_x->max_bandwidth_estimate / 2)) {
        bandwidth_estimate = path_x->max_bandwidth_estimate/2;
    }

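    /* Each packet records the connection's 'delivered' count at the time it was sent. If the acknowledged packet's value is >= next_round_delivered, it was sent after the current round began, so a new round has started */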
    if (path_x->delivered_last_packet >= bbr_state->next_round_delivered)
    {
        bbr_state->next_round_delivered = path_x->delivered;
        bbr_state->round_count++;
        bbr_state->round_start = 1;
    }
    else {
        bbr_state->round_start = 0;
    }

    /* At the start of a new round, free a slot for that round's bandwidth sample */
    if (bbr_state->round_start) {
        /* Forget the oldest BW round, shift by 1, compute the max BTL_BW for
         * the remaining rounds, set current round max to current value */

        bbr_state->btl_bw = 0;

        for (int i = BBR_BTL_BW_FILTER_LENGTH - 2; i >= 0; i--) {
            uint64_t b = bbr_state->btl_bw_filter[i];
            bbr_state->btl_bw_filter[i + 1] = b;
            if (b > bbr_state->btl_bw) {
                bbr_state->btl_bw = b;
            }
        }

        bbr_state->btl_bw_filter[0] = 0;
    }

    /* BtlBw is the maximum delivery rate (ACK rate) seen in the filter window */
    if (bandwidth_estimate > bbr_state->btl_bw_filter[0]) {
        bbr_state->btl_bw_filter[0] = bandwidth_estimate;
        if (bandwidth_estimate > bbr_state->btl_bw) {
            bbr_state->btl_bw = bandwidth_estimate;
            bbr_state->btl_bw_increased = 1;
        }
    }
}
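The windowed max filter in BBRUpdateBtlBw can be isolated and exercised on its own. This is a standalone sketch of the same shift-and-max logic, not picoquic code. The window of 10 rounds (DEMO_BW_FILTER_LEN) is an assumption taken from Linux BBR; picoquic's BBR_BTL_BW_FILTER_LENGTH may differ. The type, the functions and the driver that feeds one sample per round are all hypothetical.

```c
#include <stdint.h>

/* Assumed window length in rounds; picoquic's BBR_BTL_BW_FILTER_LENGTH may differ. */
#define DEMO_BW_FILTER_LEN 10

typedef struct {
    uint64_t slot[DEMO_BW_FILTER_LEN]; /* slot[0] = current round, slot[i] = i rounds ago */
    uint64_t btl_bw;                   /* max over all slots */
} demo_bw_filter_t;

/* New round: drop the oldest slot, shift the others by one, recompute the
 * max of the kept rounds and clear the current slot. */
void demo_bw_filter_new_round(demo_bw_filter_t* f)
{
    f->btl_bw = 0;
    for (int i = DEMO_BW_FILTER_LEN - 2; i >= 0; i--) {
        f->slot[i + 1] = f->slot[i];
        if (f->slot[i + 1] > f->btl_bw) {
            f->btl_bw = f->slot[i + 1];
        }
    }
    f->slot[0] = 0;
}

/* Feed one delivery-rate sample for the current round. */
void demo_bw_filter_add(demo_bw_filter_t* f, uint64_t sample)
{
    if (sample > f->slot[0]) {
        f->slot[0] = sample;
        if (sample > f->btl_bw) {
            f->btl_bw = sample;
        }
    }
}

/* Hypothetical driver: one sample per round, returns the filtered BtlBw. */
uint64_t demo_bw_filter_run(const uint64_t* samples, int n_rounds)
{
    demo_bw_filter_t f = { {0}, 0 };
    for (int r = 0; r < n_rounds; r++) {
        demo_bw_filter_new_round(&f);
        demo_bw_filter_add(&f, samples[r]);
    }
    return f.btl_bw;
}
```

Consider a peak of 100 followed by rounds at 50. The filter still reports 100 after 10 rounds and drops to 50 in the 11th, when the peak falls out of the window. This is how BtlBw eventually follows a bandwidth decrease.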

Periodic RTprop probing (ProbeRTT)

void BBRCheckProbeRTT(picoquic_bbr_state_t* bbr_state, picoquic_path_t* path_x, uint64_t bytes_in_transit, uint64_t current_time)
{
    /* RTprop has expired: enter ProbeRTT */
    if (bbr_state->state != picoquic_bbr_alg_probe_rtt &&
        bbr_state->rt_prop_expired &&
        !bbr_state->idle_restart) {
        BBREnterProbeRTT(bbr_state);
        bbr_state->prior_cwnd = BBRSaveCwnd(bbr_state, path_x);
        bbr_state->probe_rtt_done_stamp = 0;
    }
    
    /* While in ProbeRTT, run the probe; RTT samples taken with the queue drained refresh RTprop */
    if (bbr_state->state == picoquic_bbr_alg_probe_rtt) {
        BBRHandleProbeRTT(bbr_state, path_x, bytes_in_transit, current_time);
        bbr_state->idle_restart = 0;
    }
}

void BBREnterProbeRTT(picoquic_bbr_state_t* bbr_state)
{
    bbr_state->state = picoquic_bbr_alg_probe_rtt;
    /* Limit the window to one BDP and start draining the queue */
    bbr_state->pacing_gain = 1.0;
    bbr_state->cwnd_gain = 1.0;
}

void BBRHandleProbeRTT(picoquic_bbr_state_t* bbr_state, picoquic_path_t * path_x, uint64_t bytes_in_transit, uint64_t current_time)
{
#if 0
    /* Ignore low rate samples during ProbeRTT: */
    C.app_limited =
        (BW.delivered + bytes_in_transit) ? 0 : 1;
#endif

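    /* Once inflight has dropped to about 4 packets (BBR_MIN_PIPE_CWND), the queue is empty: hold for BBR_PROBE_RTT_DURATION and at least one round */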
    if (bbr_state->probe_rtt_done_stamp == 0 &&
        bytes_in_transit <= BBR_MIN_PIPE_CWND(path_x->send_mtu)) {
        bbr_state->probe_rtt_done_stamp =
            current_time + BBR_PROBE_RTT_DURATION;
        bbr_state->probe_rtt_round_done = 0;
        bbr_state->next_round_delivered = path_x->delivered;
    }
    else if (bbr_state->probe_rtt_done_stamp != 0) {
        if (bbr_state->round_start) {
            bbr_state->probe_rtt_round_done = 1;
        }
        
        if (bbr_state->probe_rtt_round_done &&
            current_time > bbr_state->probe_rtt_done_stamp) {
            bbr_state->rt_prop_stamp = current_time;
            BBRRestoreCwnd(bbr_state, path_x);
            BBRExitProbeRTT(bbr_state, current_time);
        }
    }
}

/* This will use one way samples if available */
/* Should augment that with common RTT filter to suppress jitter */
void BBRUpdateRTprop(picoquic_bbr_state_t* bbr_state, uint64_t rtt_sample, uint64_t current_time)
{
    bbr_state->rt_prop_expired =
        current_time > bbr_state->rt_prop_stamp + BBR_PROBE_RTT_INTERVAL;
    /* Keep the smallest RTT sample as RTprop; once the estimate has expired, accept the new sample regardless */
    if (rtt_sample <= bbr_state->rt_prop || bbr_state->rt_prop_expired) {
        bbr_state->rt_prop = rtt_sample;
        bbr_state->rt_prop_stamp = current_time;
    }
    else {
        uint64_t delta = rtt_sample - bbr_state->rt_prop;
        /* A sample within 5% of RTprop confirms it and refreshes the timestamp */
        if (20 * delta < bbr_state->rt_prop) {
            bbr_state->rt_prop_stamp = current_time;
        }
    }
}
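The behaviour of this filter, including the 5% refresh rule, can be checked with a standalone copy of the same rules. This sketch is not picoquic code. It assumes microsecond timestamps and a 10 s BBR_PROBE_RTT_INTERVAL (DEMO_PROBE_RTT_INTERVAL_US). The type, the functions and the driver are hypothetical.

```c
#include <stdint.h>

/* Assumed 10 s RTprop lifetime in microseconds (stands in for BBR_PROBE_RTT_INTERVAL). */
#define DEMO_PROBE_RTT_INTERVAL_US 10000000ull

typedef struct {
    uint64_t rt_prop;       /* current RTprop estimate (us) */
    uint64_t rt_prop_stamp; /* time of the last refresh (us) */
    int rt_prop_expired;
} demo_rtprop_t;

/* Same rules as BBRUpdateRTprop above. */
void demo_rtprop_update(demo_rtprop_t* r, uint64_t rtt_sample, uint64_t now)
{
    r->rt_prop_expired = now > r->rt_prop_stamp + DEMO_PROBE_RTT_INTERVAL_US;
    if (rtt_sample <= r->rt_prop || r->rt_prop_expired) {
        r->rt_prop = rtt_sample; /* new minimum, or the old one is stale */
        r->rt_prop_stamp = now;
    } else if (20 * (rtt_sample - r->rt_prop) < r->rt_prop) {
        r->rt_prop_stamp = now;  /* within 5%: keep the estimate alive */
    }
}

/* Hypothetical driver: feed n (rtt, time) pairs, return the final RTprop. */
uint64_t demo_rtprop_run(const uint64_t* rtts, const uint64_t* times, int n)
{
    demo_rtprop_t r = { UINT64_MAX, 0, 0 };
    for (int i = 0; i < n; i++) {
        demo_rtprop_update(&r, rtts[i], times[i]);
    }
    return r.rt_prop;
}
```

A 50 ms minimum survives a 60 ms sample 1 s later but is replaced by it after 11 s. A 52 ms sample at 9 s (within 5%) refreshes the timestamp, so a 60 ms sample at 15 s still finds RTprop = 50 ms. A 58 ms sample at 9 s does not refresh it, so at 15 s the estimate has expired and becomes 60 ms.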

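BBR spends most of its time probing the bottleneck bandwidth BtlBw (ProbeBW), interrupted periodically by probes of the path's round-trip propagation time RTprop (ProbeRTT).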

On every ACK, BBR updates its model and checks both the pacing_gain cycle and the RTprop probing schedule:

void BBRUpdateModelAndState(picoquic_bbr_state_t* bbr_state, picoquic_path_t* path_x,
    uint64_t rtt_sample, uint64_t bytes_in_transit, uint64_t packets_lost, uint64_t current_time)
{
    /* Bottleneck bandwidth probing */
    BBRUpdateBtlBw(bbr_state, path_x);
    BBRCheckCyclePhase(bbr_state, packets_lost, current_time);
    BBRCheckFullPipe(bbr_state, path_x->last_bw_estimate_path_limited);
    BBRCheckDrain(bbr_state, bytes_in_transit, current_time);
    /* RTprop probing */
    BBRUpdateRTprop(bbr_state, rtt_sample, current_time);
    BBRCheckProbeRTT(bbr_state, path_x, bytes_in_transit, current_time);
}

BBR performance evaluation

To be completed.