An Analysis of the cuTree Principle in x264 and x265

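    mbtree is an innovative technique introduced in x264 that effectively improves both objective and subjective quality (see Table 1 at the end of this article). x265 inherited the algorithm and renamed it cuTree. The implementation is fairly complex, so this article walks through the principle behind cuTree and uses the code to examine the implementation details.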

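    Both cuTree and mbtree adjust the qpOffset of a block according to how heavily that block is referenced. To find out how heavily a block is referenced, the encoder clearly has to work backward through the coding order.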

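    For inter prediction, the quality of the reference frame directly affects the quality of the current frame. The cost of a reference block therefore has to account not only for its own coding cost but also for its influence on the later blocks that reference it. For this reason, cuTree introduces the notion of PropagateInCost when analyzing the cost of each block: the cost of a block is its own coding cost plus the cost that later blocks derive from depending on it. This second part is called PropagateInCost, so the key question is how to determine it.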

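    Consider a simplified case: block B is predicted entirely from block A, and the intra and inter prediction costs of block B are IntraCostB and InterCostB. Look at the trend: if IntraCostB and InterCostB are about the same, block B obtains very little information from block A; if IntraCostB is much larger than InterCostB, most of block B's information can be obtained from block A. Following this idea, the amount of information that block B itself obtains from block A can be expressed as (IntraCostB - InterCostB). Going one step further, block B is also referenced by other blocks, so its cost includes PropagateInCostB as well. Putting this together, the cost by which block B depends on block A is: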

    (IntraCostB - InterCostB) + PropagateInCostB * (IntraCostB - InterCostB) / IntraCostB = (IntraCostB + PropagateInCostB) * (IntraCostB - InterCostB) / IntraCostB

    Here (IntraCostB - InterCostB) / IntraCostB is the fraction of block B's PropagateInCostB that is passed on to block A.

    The x265 function that computes the propagate cost is shown below; its basic idea is the one just described.

/* Estimate the total amount of influence on future quality that could be had if we
 * were to improve the reference samples used to inter predict any given CU. */
static void estimateCUPropagateCost(int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len)
{
    double fps = *fpsFactor / 256;  // range[0.01, 1.00]
    for (int i = 0; i < len; i++)
    {
        int intraCost = intraCosts[i];
        int interCost = X265_MIN(intraCosts[i], interCosts[i] & LOWRES_COST_MASK);
        double propagateIntra = intraCost * invQscales[i]; // Q16 x Q8.8 = Q24.8
        double propagateAmount = (double)propagateIn[i] + propagateIntra * fps; // Q16.0 + Q24.8 x Q0.x = Q25.0
        double propagateNum = (double)(intraCost - interCost); // Q32 - Q32 = Q33.0
        double propagateDenom = (double)intraCost;             // Q32
        dst[i] = (int)(propagateAmount * propagateNum / propagateDenom + 0.5);
    }
}

    As noted earlier, when block B is predicted entirely from block A, the cost propagated to block A is PropagateInCostA = (IntraCostB + PropagateInCostB) * (IntraCostB - InterCostB) / IntraCostB
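
    To make the rule concrete, here is a minimal floating-point sketch of the per-block propagation. The function name propagateToReference is hypothetical, and the sketch deliberately leaves out the invQscale and frame-rate scaling and the cost masking that the x265 routine applies.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Simplified sketch of the per-block propagation rule (hypothetical helper,
// not x265 code): returns the amount block B passes back to its reference.
double propagateToReference(double intraCost, double interCost, double propagateIn)
{
    assert(intraCost > 0.0);
    // An inter cost at or above the intra cost means the block gains nothing
    // from its reference, so nothing is propagated.
    interCost = std::min(interCost, intraCost);
    // Fraction of block B's information that comes from the reference block.
    double fraction = (intraCost - interCost) / intraCost;
    // B's own cost plus what B inherited from later blocks, scaled by that fraction.
    return (intraCost + propagateIn) * fraction;
}
```

    With IntraCostB = 1000, InterCostB = 250 and PropagateInCostB = 500, the fraction is 0.75 and the sketch returns 1500 * 0.75 = 1125. If InterCostB is at least IntraCostB, the result is 0.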

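    Now consider the more general case. A motion vector rarely points at exactly one whole coding block, so the propagate cost of block B has to be added proportionally to the blocks it overlaps in the reference frame. The x265 cuTree function is shown below: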

void Lookahead::estimateCUPropagate(Lowres **frames, double averageDuration, int p0, int p1, int b, int referenced)
{
    uint16_t *refCosts[2] = { frames[p0]->propagateCost, frames[p1]->propagateCost };
    int32_t distScaleFactor = (((b - p0) << 8) + ((p1 - p0) >> 1)) / (p1 - p0);
    int32_t bipredWeight = m_param->bEnableWeightedBiPred ? 64 - (distScaleFactor >> 2) : 32;
    int32_t bipredWeights[2] = { bipredWeight, 64 - bipredWeight };
    int listDist[2] = { b - p0 - 1, p1 - b - 1 };

    memset(m_scratch, 0, m_8x8Width * sizeof(int));

    uint16_t *propagateCost = frames[b]->propagateCost;

    x265_emms();
    double fpsFactor = CLIP_DURATION((double)m_param->fpsDenom / m_param->fpsNum) / CLIP_DURATION(averageDuration);

    /* For non-referred frames the source costs are always zero, so just memset one row and re-use it. */
    if (!referenced)
        memset(frames[b]->propagateCost, 0, m_8x8Width * sizeof(uint16_t));

    int32_t strideInCU = m_8x8Width;
    for (uint16_t blocky = 0; blocky < m_8x8Height; blocky++)
    {
        int cuIndex = blocky * strideInCU;
        // Compute the propagate cost of each block in this row of frames[b]; results go to m_scratch
        if (m_param->rc.qgSize == 8)
            primitives.propagateCost(m_scratch, propagateCost,
                       frames[b]->intraCost + cuIndex, frames[b]->lowresCosts[b - p0][p1 - b] + cuIndex,
                       frames[b]->invQscaleFactor8x8 + cuIndex, &fpsFactor, m_8x8Width);
        else
            primitives.propagateCost(m_scratch, propagateCost,
                       frames[b]->intraCost + cuIndex, frames[b]->lowresCosts[b - p0][p1 - b] + cuIndex,
                       frames[b]->invQscaleFactor + cuIndex, &fpsFactor, m_8x8Width);

        if (referenced)
            propagateCost += m_8x8Width;

        // Add the propagate cost from frames[b] proportionally to the blocks of the reference frame(s)
        for (uint16_t blockx = 0; blockx < m_8x8Width; blockx++, cuIndex++)
        {
            int32_t propagate_amount = m_scratch[blockx];
            /* Don't propagate for an intra block. */
            if (propagate_amount > 0)
            {
                /* Access width-2 bitfield. */
                int32_t lists_used = frames[b]->lowresCosts[b - p0][p1 - b][cuIndex] >> LOWRES_COST_SHIFT;
                /* Follow the MVs to the previous frame(s). */
                for (uint16_t list = 0; list < 2; list++)
                {
                    if ((lists_used >> list) & 1)
                    {
#define CLIP_ADD(s, x) (s) = (uint16_t)X265_MIN((s) + (x), (1 << 16) - 1)
                        int32_t listamount = propagate_amount;
                        /* Apply bipred weighting. */
                        if (lists_used == 3)
                            listamount = (listamount * bipredWeights[list] + 32) >> 6;

                        MV *mvs = frames[b]->lowresMvs[list][listDist[list]];

                        /* Early termination for simple case of mv0. */
                        // MV is (0, 0): add directly to the reference frame's propagateCost array
                        if (!mvs[cuIndex].word)
                        {
                            CLIP_ADD(refCosts[list][cuIndex], listamount);
                            continue;
                        }

                        // For a nonzero MV the reference area overlaps four blocks, idx0, idx1, idx2 and idx3, weighted by idx0weight, idx1weight, idx2weight and idx3weight
                        int32_t x = mvs[cuIndex].x;
                        int32_t y = mvs[cuIndex].y;
                        int32_t cux = (x >> 5) + blockx;
                        int32_t cuy = (y >> 5) + blocky;
                        int32_t idx0 = cux + cuy * strideInCU;
                        int32_t idx1 = idx0 + 1;
                        int32_t idx2 = idx0 + strideInCU;
                        int32_t idx3 = idx0 + strideInCU + 1;
                        x &= 31;
                        y &= 31;
                        int32_t idx0weight = (32 - y) * (32 - x);
                        int32_t idx1weight = (32 - y) * x;
                        int32_t idx2weight = y * (32 - x);
                        int32_t idx3weight = y * x;

                        /* We could just clip the MVs, but pixels that lie outside the frame probably shouldn't
                         * be counted. */
                        if (cux < m_8x8Width - 1 && cuy < m_8x8Height - 1 && cux >= 0 && cuy >= 0)
                        {
                            CLIP_ADD(refCosts[list][idx0], (listamount * idx0weight + 512) >> 10);
                            CLIP_ADD(refCosts[list][idx1], (listamount * idx1weight + 512) >> 10);
                            CLIP_ADD(refCosts[list][idx2], (listamount * idx2weight + 512) >> 10);
                            CLIP_ADD(refCosts[list][idx3], (listamount * idx3weight + 512) >> 10);
                        }
                        else /* Check offsets individually */
                        {
                            if (cux < m_8x8Width && cuy < m_8x8Height && cux >= 0 && cuy >= 0)
                                CLIP_ADD(refCosts[list][idx0], (listamount * idx0weight + 512) >> 10);
                            if (cux + 1 < m_8x8Width && cuy < m_8x8Height && cux + 1 >= 0 && cuy >= 0)
                                CLIP_ADD(refCosts[list][idx1], (listamount * idx1weight + 512) >> 10);
                            if (cux < m_8x8Width && cuy + 1 < m_8x8Height && cux >= 0 && cuy + 1 >= 0)
                                CLIP_ADD(refCosts[list][idx2], (listamount * idx2weight + 512) >> 10);
                            if (cux + 1 < m_8x8Width && cuy + 1 < m_8x8Height && cux + 1 >= 0 && cuy + 1 >= 0)
                                CLIP_ADD(refCosts[list][idx3], (listamount * idx3weight + 512) >> 10);
                        }
                    }
                }
            }
        }
    }

    if (m_param->rc.vbvBufferSize && m_param->lookaheadDepth && referenced)
        cuTreeFinish(frames[b], averageDuration, b == p1 ? b - p0 : 0);
}
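
    The four-way split for a nonzero MV can be isolated into a small sketch. The names SplitResult, splitByMv and weightedShare are hypothetical. The sketch assumes 32 MV units per block, as implied by the >> 5 and & 31 operations in the listing, and relies on arithmetic right shift for negative values, which mainstream compilers provide and C++20 guarantees.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper types for illustration; not part of x265.
struct SplitResult
{
    int32_t blockDx;                 // whole-block horizontal displacement
    int32_t blockDy;                 // whole-block vertical displacement
    std::array<int32_t, 4> weights;  // weights for idx0..idx3, summing to 1024
};

// Split an MV into a whole-block offset and four bilinear weights.
SplitResult splitByMv(int32_t mvx, int32_t mvy)
{
    SplitResult r;
    r.blockDx = mvx >> 5;    // floor division by 32 (assumes arithmetic shift)
    r.blockDy = mvy >> 5;
    int32_t fx = mvx & 31;   // remaining offset inside the block, 0..31
    int32_t fy = mvy & 31;
    r.weights = { (32 - fy) * (32 - fx),   // idx0: top-left
                  (32 - fy) * fx,          // idx1: top-right
                  fy * (32 - fx),          // idx2: bottom-left
                  fy * fx };               // idx3: bottom-right
    assert(r.weights[0] + r.weights[1] + r.weights[2] + r.weights[3] == 1024);
    return r;
}

// Rounded share of `amount` for a block with the given weight (out of 1024).
int32_t weightedShare(int32_t amount, int32_t weight)
{
    return (amount * weight + 512) >> 10;
}
```

    For an MV of (40, 16), the reference area starts one block to the right and the weights are 384, 128, 384 and 128, so an amount of 1000 is split into 375, 125, 375 and 125. Because the weights always sum to 32 * 32 = 1024, the shift by 10 normalizes them back to a fraction of the original amount.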

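    Finally, the QP offset of the current CU naturally depends on PropagateInCost: the larger PropagateInCost is, the lower the CU's QP should be, so the QP offset, which is negative, becomes more negative. In x265, cuTree uses QPOffset = -strength * log2(1 + PropagateInCost / IntraCost). The code is in the function cuTreeFinish, shown below.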

void Lookahead::cuTreeFinish(Lowres *frame, double averageDuration, int ref0Distance)
{
    int fpsFactor = (int)(CLIP_DURATION(averageDuration) / CLIP_DURATION((double)m_param->fpsDenom / m_param->fpsNum) * 256);
    double weightdelta = 0.0;

    if (ref0Distance && frame->weightedCostDelta[ref0Distance - 1] > 0)
        weightdelta = (1.0 - frame->weightedCostDelta[ref0Distance - 1]);

    frame->qpAvgFrmCuTreeOffset = 0.0;
    for (int cuIndex = 0; cuIndex < m_cuCount; cuIndex++)
    {
        int intracost = (frame->intraCost[cuIndex] * frame->invQscaleFactor[cuIndex] + 128) >> 8;
        if (intracost)
        {
            int propagateCost = (frame->propagateCost[cuIndex] * fpsFactor + 128) >> 8;
            double log2_ratio = X265_LOG2(intracost + propagateCost) - X265_LOG2(intracost) + weightdelta;
            frame->qpCuTreeOffset[cuIndex] = frame->qpAqOffset[cuIndex] - m_cuTreeStrength * log2_ratio;
            frame->qpAvgFrmCuTreeOffset += frame->qpCuTreeOffset[cuIndex];
        }
    }
    frame->qpAvgFrmCuTreeOffset /= m_cuCount;
}
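
    The QP offset formula can be tried on its own with a short floating-point sketch. The function cuTreeQpOffset is hypothetical, its strength parameter stands in for m_cuTreeStrength, and the value 2.0 used in the example is an illustrative assumption. Frame-rate scaling, the AQ offset and the weighted-prediction delta from the listing are left out.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch of the cuTree QP offset for one block:
// QPOffset = -strength * log2(1 + PropagateInCost / IntraCost).
double cuTreeQpOffset(double intraCost, double propagateCost, double strength)
{
    // The listing skips blocks whose intra cost is zero; here it is a precondition.
    assert(intraCost > 0.0 && propagateCost >= 0.0);
    return -strength * std::log2(1.0 + propagateCost / intraCost);
}
```

    With IntraCost = 1000, PropagateInCost = 3000 and an assumed strength of 2.0, the ratio inside the logarithm is 4 and the offset is -2.0 * 2 = -4. A block that nothing references (PropagateInCost = 0) gets an offset of 0.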

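    Tables 1 and 2 below show the effect of cuTree on objective coding quality in x265 v2.4. The encoder configuration is preset=medium, ratecontrol=ABR, BFrames=3 (or 0), aq-mode=off, and the test sequences are the HEVC Class B (1080p) set. With BFrames=3, enabling cuTree saves 6.52% bitrate on Y, 15.38% on U and 15.56% on V, a very clear gain in compression efficiency. With BFrames=0, enabling cuTree increases the Y bitrate by 0.72% while saving 2.4% on U and 1.4% on V, so there is essentially no gain. The reason is that with BFrames=3, cuTree lowers the QP of I and P frames by a large amount, lowers the QP of referenced B frames moderately, and leaves non-referenced B frames unchanged, which is essentially similar to the hierarchical QP scheme in HM. With BFrames=0, the QP of every P frame is lowered by roughly the same amount, which is effectively the same as not adjusting QP at all.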

Table 1. Bitrate savings from cuTree in x265 (BFrames=3)

Sequence          BD-Rate Y   BD-Rate U   BD-Rate V
BasketballDrive   -4.8%       -8.5%       -3.5%
BQTerrace         -3.5%       -19.2%      -21.3%
Cactus            -9.6%       -15.9%      -13.9%
Kimono            -3.3%       -13.4%      -17.1%
ParkScene         -11.4%      -19.9%      -22.0%
Average           -6.52%      -15.38%     -15.56%

Table 2. Bitrate savings from cuTree in x265 (BFrames=0)

Sequence          BD-Rate Y   BD-Rate U   BD-Rate V
BasketballDrive   2.4%        2.3%        4.9%
BQTerrace         4.1%        -0.9%       0.8%
Cactus            -1.8%       -4.6%       -2.3%
Kimono            2.3%        -0.6%       -2.7%
ParkScene         -3.4%       -8.2%       -7.7%
Average           0.72%       -2.4%       -1.4%

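    One point to note: as described above, cuTree derives qpOffset by working from later frames back to earlier ones. When rate control is enabled, x264 and x265 turn on the lookahead mechanism, which looks ahead from the current frame and uses the upcoming frames to assign a suitable QP, choose a suitable frame type, and so on. In the code, the number of frames cuTree looks ahead equals lookahead_num. For example, with the x265 medium preset lookahead_num defaults to 20, so cuTree starts at the 20th frame after the current one and propagates backward to the current frame to compute qpOffset. lookahead_num therefore directly affects the cuTree result: different values give slightly different QP offsets, although the effect is not large.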

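    Also note that in x265 both the frame-type decision and the cuTree QP offset computation work on one mini-GOP at a time, i.e. each pass settles the coding parameters for one mini-GOP. With 3 B-frames, cuTree is therefore invoked once every 4 frames (bBbP), whereas with 0 B-frames it is invoked for every P frame. Each invocation walks back over 20 (lookahead_num) frames, which is a considerable amount of computation. So when encoding at very high resolutions, 0 B-frames can sometimes be slower than 3 B-frames, and this is a likely cause.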
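
    The difference in cuTree workload can be estimated with a rough counting model. This is an illustrative assumption rather than measured data: the hypothetical function cuTreeFramePasses assumes a fixed mini-GOP size of BFrames + 1 and one backward pass over lookahead_num frames per mini-GOP, and it ignores adaptive B-frame placement and scene cuts.

```cpp
#include <cassert>
#include <cstdint>

// Rough cost model (hypothetical, for illustration only): counts how many
// per-frame propagation passes cuTree performs over a whole sequence.
int64_t cuTreeFramePasses(int64_t totalFrames, int bframes, int lookahead)
{
    assert(totalFrames >= 0 && bframes >= 0 && lookahead > 0);
    int64_t miniGopSize = bframes + 1;                              // e.g. bBbP -> 4
    int64_t runs = (totalFrames + miniGopSize - 1) / miniGopSize;   // ceiling division
    return runs * lookahead;                                        // each run walks `lookahead` frames
}
```

    Under this model, 1000 frames with a lookahead of 20 need 250 * 20 = 5000 frame passes with 3 B-frames and 1000 * 20 = 20000 with 0 B-frames, a fourfold difference in cuTree work.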
