The Scalable Video Coding Amendment of the H.264/AVC Standard

http://ip.hhi.de/imagecom_G1/savce/index.htm

 

Summary

The Scalable Video Coding amendment (SVC) of the H.264/AVC standard (H.264/AVC) provides network-friendly scalability at a bit stream level with a moderate increase in decoder complexity relative to single-layer H.264/AVC. It supports functionalities such as bit rate, format, and power adaptation, graceful degradation in lossy transmission environments as well as lossless rewriting of quality-scalable SVC bit streams to single-layer H.264/AVC bit streams. These functionalities provide enhancements to transmission and storage applications. SVC has achieved significant improvements in coding efficiency with an increased degree of supported scalability relative to the scalable profiles of prior video coding standards.
 

Fig. 1. The Scalable Video Coding (SVC) principle.

 

Introduction
International video coding standards such as H.261, MPEG-1, H.262/MPEG-2 Video, H.263, MPEG-4 Visual, and H.264/AVC have played an important role in the success of digital video applications. They provide interoperability among products from different manufacturers while allowing a high flexibility for implementations and optimizations in various application scenarios. The H.264/AVC specification represents the current state-of-the-art in video coding. Compared to prior video coding standards, it significantly reduces the bit rate necessary to represent a given level of perceptual quality – a property also referred to as increase of the coding efficiency.


The desire for scalable video coding, which allows on-the-fly adaptation to certain application requirements such as display and processing capabilities of target devices, and varying transmission conditions, originates from the continuous evolution of receiving devices and the increasing usage of transmission systems that are characterized by a widely varying connection quality. Video coding today is used in a wide range of applications ranging from multimedia messaging, video telephony and video conferencing over mobile TV, wireless and Internet video streaming, to standard- and high-definition TV broadcasting. In particular, the Internet and wireless networks gain more and more importance for video applications. Video transmission in such systems is exposed to variable transmission conditions, which can be dealt with using scalability features. Furthermore, video content is delivered to a variety of decoding devices with heterogeneous display and computational capabilities (see Fig. 2). In these heterogeneous environments, flexible adaptation of once-encoded content is desirable, at the same time enabling interoperability of encoder and decoder products from different manufacturers.

 

Fig. 2. Example of video streaming with heterogeneous receiving devices and variable network conditions.


Scalability has already been present in the video coding standards MPEG-2 Video, H.263, and MPEG-4 Visual in the form of scalable profiles. However, the provision of spatial and quality scalability in these standards comes along with a considerable growth in decoder complexity and a significant reduction in coding efficiency (i.e., a bit rate increase for a given level of reconstruction quality) as compared to the corresponding non-scalable profiles. These drawbacks, which reduced the success of the scalable profiles of the former specifications, are addressed by the new SVC amendment of the H.264/AVC standard.

Types of scalability
A video bit stream is called scalable when parts of the stream can be removed in a way that the resulting sub-stream forms another valid bit stream for some target decoder, and the sub-stream represents the source content with a reconstruction quality that is less than that of the complete original bit stream but is high when considering the lower quantity of remaining data. Bit streams that do not provide this property are referred to as single-layer bit streams. The usual modes of scalability are temporal, spatial, and quality scalability. Spatial scalability and temporal scalability describe cases in which subsets of the bit stream represent the source content with a reduced picture size (spatial resolution) or frame rate (temporal resolution), respectively. With quality scalability, the sub-stream provides the same spatio-temporal resolution as the complete bit stream, but with a lower fidelity – where fidelity is often informally referred to as signal-to-noise ratio (SNR). Quality scalability is also commonly referred to as fidelity or SNR scalability. More rarely required scalability modes are region-of-interest (ROI) and object-based scalability, in which the sub-streams typically represent spatially contiguous regions of the original picture area. The different types of scalability can also be combined, so that a multitude of representations with different spatio-temporal resolutions and bit rates can be supported within a single scalable bit stream.
 

 

Fig. 3. The basic types of scalability in video coding.


Application areas for scalable video coding
Efficient scalable video coding provides a number of benefits in terms of applications. Consider, for instance, the scenario of a video transmission service with heterogeneous clients, where multiple bit streams of the same source content differing in picture size, frame rate, and bit rate should be provided simultaneously. With the application of a properly configured scalable video coding scheme, the source content has to be encoded only once – for the highest required resolution and bit rate, resulting in a scalable bit stream from which representations with lower resolution and/or quality can be obtained by discarding selected data. For instance, a client with restricted resources (display resolution, processing power, or battery power) needs to decode only a part of the delivered bit stream. Similarly, in a multicast scenario, terminals with different capabilities can be served by a single scalable bit stream. In an alternative scenario, an existing video format can be extended in a backward compatible way by an enhancement video format.

 

 

Fig. 4. On-the-fly adaptation of scalable coded video content in a media-aware network element (MANE).


Another benefit of scalable video coding is that a scalable bit stream usually contains parts with different importance in terms of decoded video quality. This property in conjunction with unequal error protection is especially useful in any transmission scenario with unpredictable throughput variations and/or relatively high packet loss rates. By using a stronger protection of the more important information, error resilience with graceful degradation can be achieved up to a certain degree of transmission errors. Media-aware network elements (MANEs), which receive feedback messages about the terminal capabilities and/or channel conditions, can remove the non-required parts from a scalable bit stream, before forwarding it (see Fig. 4). Thus, the loss of important transmission units due to congestion can be avoided and the overall error robustness of the video transmission service can be substantially improved.

Basic concepts for extending H.264/AVC toward a scalable video coding standard
Apart from the required support of all common types of scalability, the most important design criteria for a successful scalable video coding standard are coding efficiency and complexity. Since SVC was developed as an extension of H.264/AVC with all of its well-designed core coding tools being inherited, one of the design principles of SVC was that new tools should only be added if necessary for efficiently supporting the required types of scalability.

Temporal scalability
A bit stream provides temporal scalability when the set of corresponding access units can be partitioned into a temporal base layer and one or more temporal enhancement layers with the following property. Let the temporal layers be identified by a temporal layer identifier T, which starts from 0 for the base layer and is increased by 1 from one temporal layer to the next. Then for each natural number k, the bit stream that is obtained by removing all access units of all temporal layers with a temporal layer identifier T greater than k forms another valid bit stream for the given decoder.
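The definition above amounts to a simple filter over access units. The following is a minimal sketch, assuming a hypothetical AccessUnit record that carries only a position and the temporal layer identifier T; a real extractor would operate on NAL units and their headers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessUnit:
    index: int        # position in the stream (hypothetical bookkeeping field)
    temporal_id: int  # temporal layer identifier T (0 = temporal base layer)


def extract_temporal_substream(access_units, k):
    """Remove all access units whose temporal layer identifier T exceeds k."""
    return [au for au in access_units if au.temporal_id <= k]


# Illustrative stream: nine access units with T in {0, 1, 2, 3}.
stream = [AccessUnit(i, t) for i, t in enumerate([0, 3, 2, 3, 1, 3, 2, 3, 0])]
base_only = extract_temporal_substream(stream, 0)
up_to_t2 = extract_temporal_substream(stream, 2)
```

With this layout, keeping only T = 0 leaves the first and last access unit, and keeping T ≤ 2 drops every second access unit.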


For hybrid video codecs, temporal scalability can generally be enabled by restricting motion-compensated prediction to reference pictures with a temporal layer identifier that is less than or equal to the temporal layer identifier of the picture to be predicted. The prior video coding standards MPEG-1, H.262/MPEG-2 Video, H.263, and MPEG-4 Visual all support temporal scalability to some degree. H.264/AVC provides a significantly increased flexibility for temporal scalability because of its reference picture memory control. It allows the coding of picture sequences with arbitrary temporal dependencies, which are only restricted by the maximum usable DPB size. Hence, for supporting temporal scalability with a reasonable number of temporal layers, no changes to the design of H.264/AVC were required. The only related change in SVC refers to the signalling of temporal layers.
 

Fig. 5. Hierarchical prediction structures for enabling temporal scalability: (a) coding with hierarchical B or P pictures, (b) non-dyadic hierarchical prediction structure, (c) hierarchical prediction structure with a structural encoder/decoder delay of zero. The numbers directly below the pictures specify the coding order; the symbols Tk specify the temporal layers with k representing the corresponding temporal layer identifier.
 


Temporal scalability with dyadic temporal enhancement layers can be very efficiently provided with the concept of hierarchical B or P pictures as illustrated in Fig. 5a. The enhancement layer pictures are typically coded as B pictures, where the reference picture lists 0 and 1 are restricted to the temporally preceding and succeeding picture, respectively, with a temporal layer identifier less than the temporal layer identifier of the predicted picture. Since backward prediction is not necessarily coupled with the use of B slices in H.264/AVC, the temporal coding structure of Fig. 5a can also be realized using P slices. Each set of temporal layers {T0,…,Tk} can be decoded independently of all layers with a temporal layer identifier T > k. In the following, the set of pictures between two successive pictures of the temporal base layer together with the succeeding base layer picture is referred to as a group of pictures (GOP).
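For the dyadic case, the temporal layer of a picture follows directly from its position inside the GOP. The sketch below is one possible way to compute it, assuming the GOP size is a power of two and that pictures are indexed in display order; the function name and the indexing convention are illustrative and not taken from the standard.

```python
def dyadic_temporal_layer(display_index, gop_size):
    """Return the temporal layer identifier T for a dyadic hierarchy.

    Pictures whose display index is a multiple of gop_size form the
    temporal base layer (T = 0). gop_size must be a power of two.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    levels = gop_size.bit_length() - 1  # number of temporal enhancement layers
    pos = display_index % gop_size
    if pos == 0:
        return 0
    # Number of trailing zero bits of pos: more zeros means a coarser layer.
    trailing_zeros = (pos & -pos).bit_length() - 1
    return levels - trailing_zeros


# GOP size 8 gives four temporal layers T0..T3.
layers = [dyadic_temporal_layer(i, 8) for i in range(9)]
```

For a GOP size of 8, the pictures at display positions 0 to 8 are assigned the layers 0, 3, 2, 3, 1, 3, 2, 3, 0, so each additional layer doubles the frame rate.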


Although the described prediction structure with hierarchical B or P pictures provides temporal scalability and also shows excellent coding efficiency, it represents a special case. In general, hierarchical prediction structures for enabling temporal scalability can always be combined with the multiple reference picture concept of H.264/AVC. This means that the reference picture lists can be constructed by using more than one reference picture, and they can also include pictures with the same temporal layer identifier as the picture to be predicted. Furthermore, hierarchical prediction structures are not restricted to the dyadic case. As an example, Fig. 5b illustrates a non-dyadic hierarchical prediction structure, which provides two independently decodable sub-sequences with 1/9 and 1/3 of the full frame rate. It is further possible to arbitrarily adjust the structural delay between encoding and decoding a picture by restricting motion-compensated prediction from pictures that follow the picture to be predicted in display order. As an example, Fig. 5c shows a hierarchical prediction structure that does not employ motion-compensated prediction from pictures in the future. Although this structure provides the same degree of temporal scalability as the prediction structure of Fig. 5a, its structural delay is zero, compared to 7 pictures for the prediction structure in Fig. 5a.

Spatial scalability
For supporting spatial scalable coding, SVC follows the conventional approach of multi-layer coding, which is also used in H.262/MPEG-2 Video, H.263, and MPEG-4 Visual. In each spatial layer, motion-compensated prediction and intra prediction are employed as for single-layer coding. In addition to these basic coding tools of H.264/AVC, SVC provides so-called inter-layer prediction methods (see Fig. 6), which allow an exploitation of the statistical dependencies between different layers for improving the coding efficiency (reducing the bit rate) of enhancement layers.

 


Fig. 6. Multi-layer structure with additional inter-layer prediction (black arrows).
 


In H.262/MPEG-2 Video, H.263, and MPEG-4 Visual, the only supported inter-layer prediction methods employ the reconstructed samples of the lower layer signal. The prediction signal is either formed by motion-compensated prediction inside the enhancement layer, by upsampling the reconstructed lower layer signal, or by averaging such an upsampled signal with a temporal prediction signal. Although the reconstructed lower layer samples represent the complete lower layer information, they are not necessarily the most suitable data that can be used for inter-layer prediction. Usually, the inter-layer predictor has to compete with the temporal predictor, and especially for sequences with slow motion and high spatial detail, the temporal prediction signal typically represents a better approximation of the original signal than the upsampled lower layer reconstruction. In order to improve the coding efficiency for spatial scalable coding, two additional inter-layer prediction concepts have been added in SVC: prediction of macroblock modes and associated motion parameters and prediction of the residual signal. All inter-layer prediction tools can be chosen on a macroblock or sub-macroblock basis allowing an encoder to select the coding mode that gives the highest coding efficiency.

 

Inter-layer intra prediction: For SVC enhancement layers, an additional macroblock coding mode (signalled by the syntax element base_mode_flag equal to 1) is provided, in which the macroblock prediction signal is completely inferred from co-located blocks in the reference layer without transmitting any additional side information. When the co-located reference layer blocks are intra-coded, the prediction signal is built by the up-sampled reconstructed intra signal of the reference layer – a prediction method also referred to as inter-layer intra prediction.
   
Inter-layer macroblock mode and motion prediction: When base_mode_flag is equal to 1 and at least one of the co-located reference layer blocks is not intra-coded, the enhancement layer macroblock is inter-picture predicted as in single-layer H.264/AVC coding, but the macroblock partitioning – specifying the decomposition into smaller blocks with different motion parameters – and the associated motion parameters are completely derived from the co-located blocks in the reference layer. This concept is also referred to as inter-layer motion prediction. For the conventional inter-coded macroblock types of H.264/AVC, the scaled motion vector of the reference layer blocks can also be used as a replacement for the usual spatial motion vector predictor.
   
Inter-layer residual prediction: A further inter-layer prediction tool, referred to as inter-layer residual prediction, targets a reduction of the bit rate required for transmitting the residual signal of inter-coded macroblocks. With the usage of residual prediction (signalled by the syntax element residual_prediction_flag equal to 1), the up-sampled residual of the co-located reference layer blocks is subtracted from the enhancement layer residual (difference between the original and the inter-picture prediction signal) and only the resulting difference, which often has a smaller energy than the original residual signal, is encoded using transform coding as specified in H.264/AVC.
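The residual prediction step described above can be illustrated on a one-dimensional signal. This is a minimal sketch under simplifying assumptions: the up-sampling is plain sample repetition (a stand-in, not the interpolation filter the standard specifies), transform coding and quantization are left out, and all function names and sample values are hypothetical.

```python
def upsample_repeat(samples, factor=2):
    """Stand-in up-sampler: repeat each reference layer sample `factor` times."""
    return [s for s in samples for _ in range(factor)]


def residual_difference(enh_residual, base_residual, factor=2):
    """Encoder side: subtract the up-sampled reference layer residual."""
    predicted = upsample_repeat(base_residual, factor)
    return [e - p for e, p in zip(enh_residual, predicted)]


def reconstruct_residual(difference, base_residual, factor=2):
    """Decoder side: add the up-sampled reference layer residual back."""
    predicted = upsample_repeat(base_residual, factor)
    return [d + p for d, p in zip(difference, predicted)]


def energy(signal):
    return sum(x * x for x in signal)


base_residual = [4, -2, 0, 6]               # reference layer residual (made up)
enh_residual = [5, 4, -2, -3, 1, 0, 6, 7]   # enhancement layer residual (made up)
difference = residual_difference(enh_residual, base_residual)
```

In this made-up example the difference signal has an energy of 4, compared to 140 for the original enhancement layer residual, which is the effect the tool aims for when the two layers are well correlated.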

 

 

 

Fig. 7. Illustration of inter-layer prediction tools: (left) upsampling of intra-coded macroblock for inter-layer intra prediction, (middle) upsampling of macroblock partition in dyadic spatial scalability for inter-layer prediction of macroblock modes, (right) upsampling of residual signal for inter-layer residual prediction.

 


As an important feature of the SVC design, each spatial enhancement layer can be decoded with a single motion compensation loop. For the employed reference layers, only the intra-coded macroblocks and residual blocks that are used for inter-layer prediction need to be reconstructed (including the deblocking filter operation) and the motion vectors need to be decoded. The computationally complex operations of motion-compensated prediction and the deblocking of inter-picture predicted macroblocks only need to be performed for the target layer to be displayed.


Similar to H.262/MPEG-2 Video and MPEG-4 Visual, SVC supports spatial scalable coding with arbitrary resolution ratios. The only restriction is that neither the horizontal nor the vertical resolution can decrease from one layer to the next. The SVC design further includes the possibility that an enhancement layer picture represents only a selected rectangular area of its corresponding reference layer picture, which is coded with a higher or identical spatial resolution. Alternatively, the enhancement layer picture may contain additional parts beyond the borders of the reference layer picture. This reference and enhancement layer cropping, which may also be combined, can even be modified on a picture-by-picture basis.

Quality scalability
Quality scalability can be considered as a special case of spatial scalability with identical picture sizes for base and enhancement layer. This case, which is also referred to as coarse-grain quality scalable coding (CGS), is supported by the general concept for spatial scalable coding as described above. The same inter-layer prediction mechanisms are employed, but without using the corresponding upsampling operations. When utilizing inter-layer prediction, a refinement of texture information is typically achieved by re-quantizing the residual texture signal in the enhancement layer with a smaller quantization step size relative to that used for the preceding CGS layer. As a specific feature of this configuration, the deblocking of the reference layer intra signal for inter-layer intra prediction is omitted. Furthermore, inter-layer intra and residual prediction are directly performed in the transform coefficient domain in order to reduce the decoding complexity.
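The refinement by re-quantization with a smaller step size can be illustrated for a single transform coefficient. The sketch below assumes a plain uniform scalar quantizer with explicit step sizes; the actual H.264/AVC quantizer is driven by a quantization parameter and scaling tables, so the numbers here are purely illustrative.

```python
def quantize(value, step):
    """Uniform scalar quantizer: map a value to an integer level."""
    return round(value / step)


def cgs_refine(coeff, base_step, enh_step):
    """Return (base reconstruction, refined reconstruction) of one coefficient.

    The enhancement layer codes only the difference between the original
    coefficient and the base layer reconstruction, using a smaller step.
    """
    base_recon = quantize(coeff, base_step) * base_step
    enh_level = quantize(coeff - base_recon, enh_step)
    enh_recon = base_recon + enh_level * enh_step
    return base_recon, enh_recon


coeffs = [37, -12, 5, 0, -21]  # made-up transform coefficients
results = [cgs_refine(c, base_step=16, enh_step=4) for c in coeffs]
```

After refinement, the reconstruction error of each coefficient is bounded by half the enhancement step size, whereas the base layer error is bounded by half the base step size.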


The CGS concept only allows a few selected bit rates to be supported in a scalable bit stream. In general, the number of supported rate points is identical to the number of layers. Switching between different CGS layers can only be done at defined points in the bit stream. Furthermore, the CGS concept becomes less efficient when the relative rate difference between successive CGS layers gets smaller. To increase the flexibility of bit stream adaptation and error robustness, and also to improve the coding efficiency for bit streams that have to provide a variety of bit rates, a variation of the CGS approach, referred to as medium-grain quality scalability (MGS), is included in the SVC design. The differences from the CGS concept are a modified high-level signalling, which allows switching between different MGS layers in any access unit, and the so-called key picture concept, which allows the adjustment of a suitable trade-off between drift and enhancement layer coding efficiency for hierarchical prediction structures.
 

 

Fig. 8. Various concepts for trading off enhancement layer coding efficiency and drift for packet-based quality scalable coding: (a) base layer only control, (b) enhancement layer only control, (c) two-loop control, (d) key picture concept of SVC for hierarchical prediction structures, where key pictures are marked by the black-framed boxes.
 


Drift describes the effect that the motion-compensated prediction loops at encoder and decoder are not synchronized, e.g., because quality refinement packets are discarded from a bit stream. Fig. 8 illustrates different concepts for trading off enhancement layer coding efficiency and drift for packet-based quality scalable coding.

 

Base layer only control: For fine-grain quality scalable (FGS) coding in MPEG-4 Visual, the prediction structure was chosen in a way that drift is completely avoided. As illustrated in Fig. 8a, motion compensation in MPEG-4 FGS is only performed using the base layer reconstruction as reference, and thus any loss or modification of a quality refinement packet does not have any impact on the motion compensation loop. The drawback of this approach, however, is that it significantly decreases enhancement layer coding efficiency in comparison to single-layer coding. Since only base layer reconstruction signals are used for motion-compensated prediction, the portion of bit rate that is spent for encoding MPEG-4 FGS enhancement layers of a picture cannot be exploited for the coding of following pictures that use this picture as reference.
   
Enhancement layer only control: For quality scalable coding in H.262/MPEG-2 Video, the other extreme case of possible prediction structures was specified. Here, the reference with the highest available quality is always employed for motion-compensated prediction as depicted in Fig. 8b. This enables highly efficient enhancement layer coding and ensures low complexity, since only a single reference picture needs to be stored for each time instant. However, any loss of quality refinement packets results in a drift that can only be controlled by intra updates. It should be noted that H.262/MPEG-2 Video does not allow partial discarding of quality refinement packets inside a video sequence, and thus the drift issue can be completely avoided in conforming H.262/MPEG-2 Video bit streams by controlling the reconstruction quality of both the base and the enhancement layer during encoding.
   
Two-loop control: As an alternative, a concept with two motion compensation loops as illustrated in Fig. 8c could be employed. This concept is similar to spatial scalable coding as specified in H.262/MPEG-2 Video, H.263, and MPEG-4 Visual. Although the base layer is not influenced by packet losses in the enhancement layer, any loss of a quality refinement packet results in a drift for the enhancement layer reconstruction.
   
SVC key picture concept: For MGS coding in SVC an alternative approach using so-called key pictures (see Fig. 8d) has been introduced. For each picture a flag is transmitted, which signals whether the base quality reconstruction or the enhancement layer reconstruction of the reference pictures is employed for motion-compensated prediction. In order to limit the memory requirements, a second syntax element signals whether the base quality representation of a picture is additionally reconstructed and stored in the decoded picture buffer. In order to limit the decoding overhead for such key pictures, SVC specifies that motion parameters must not change between the base and enhancement layer representations of key pictures, and thus also for key pictures, the decoding can be done with a single motion-compensation loop.

 


Fig. 8d illustrates how the key picture concept can be efficiently combined with hierarchical prediction structures. All pictures of the coarsest temporal layer are transmitted as key pictures, and only for these pictures the base quality reconstruction is inserted in the decoded picture buffer. Thus, no drift is introduced in the motion compensation loop of the coarsest temporal layer. In contrast to that, all temporal refinement pictures typically use the reference with the highest available quality for motion-compensated prediction, which enables a high coding efficiency for these pictures. Since the key pictures serve as re-synchronization points between encoder and decoder reconstruction, drift propagation is efficiently limited to neighbouring pictures of higher temporal layers. The trade-off between enhancement layer coding efficiency and drift can be adjusted by the choice of the GOP size or the number of hierarchy stages. It should be noted that both the quality scalability structure in H.262/MPEG-2 Video (no picture is coded as key picture) and the FGS coding approach in MPEG-4 Visual (all pictures are coded as key pictures) basically represent special cases of the SVC key picture concept.

 

 

Fig. 9. Example for a partitioning of transform coefficients.


With the MGS concept, any enhancement layer NAL unit can be discarded from a quality scalable bit stream, and thus packet-based quality scalable coding is provided. In addition, SVC supports the following two features for quality scalable coding:

 

 

Partitioning of transform coefficients: SVC provides the possibility to distribute the enhancement layer transform coefficients among several slices. To this end, the first and the last scan index for transform coefficients are signalled in the slice headers, and the slice data only include transform coefficient levels for scan indices inside the signalled range. Thus, the information for a quality refinement picture that corresponds to a certain quantization step size can be distributed over several NAL units corresponding to different quality refinement layers, each of them containing refinement coefficients for particular transform basis functions only. In this way, the granularity of quality scalable coding can be increased.
   
SVC-to-AVC rewriting: The SVC design also supports the creation of quality scalable bit streams that can be converted into bit streams that conform to one of the non-scalable H.264/AVC profiles by using a low-complexity rewriting process. For this mode of quality scalability, the same syntax as for CGS or MGS is used, but two aspects of the decoding process are modified:
     
  (1) For the inter-layer intra prediction, the prediction signal is not formed by the upsampled intra signal of the reference layer, but instead the spatial intra prediction modes are inferred from the co-located reference layer blocks, and a spatial intra prediction as in single-layer H.264/AVC coding is performed in the target layer, i.e., the highest quality refinement layer that is decoded for a picture. Additionally, the residual signal is predicted as for motion-compensated macroblock types.
     
  (2) The residual prediction for inter-coded macroblocks and for inter-layer intra-coded macroblocks (base_mode_flag is equal to 1 and the co-located reference layer blocks are intra-coded) is performed in the transform coefficient level domain, i.e., not the scaled transform coefficients, but the quantization levels for transform coefficients are scaled and accumulated.
     
  These two modifications ensure that such a quality scalable bit stream can be converted into a non-scalable H.264/AVC bit stream that yields exactly the same decoding result as the quality scalable SVC bit stream. The conversion can be achieved by a rewriting process which is significantly less complex than transcoding the SVC bit stream. The usage of the modified decoding process in terms of inter-layer prediction is signalled by a flag in the slice header of the enhancement layer slices.
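The partitioning of transform coefficients described above can be sketched as a split of a scan-ordered coefficient list by index range. This is a minimal illustration, assuming a 4×4 block with 16 scan positions and hypothetical range boundaries; the function names are not taken from the standard, and entropy coding of the resulting slices is omitted.

```python
def partition_by_scan_range(coeffs, ranges):
    """Split scan-ordered coefficient levels into one partition per range.

    ranges is a list of (first, last) scan indices, both inclusive.
    Positions outside a partition's range are set to zero.
    """
    return [
        [c if first <= i <= last else 0 for i, c in enumerate(coeffs)]
        for first, last in ranges
    ]


def merge_partitions(parts):
    """Recombine the partitions that were actually received."""
    return [sum(values) for values in zip(*parts)]


# Made-up coefficient levels of one 4x4 block in scan order.
coeffs = [12, -5, 3, 0, 2, -1, 0, 1, 0, 0, 1, 0, 0, 0, 0, -1]
ranges = [(0, 3), (4, 7), (8, 15)]  # hypothetical first/last scan indices
parts = partition_by_scan_range(coeffs, ranges)
partial = merge_partitions(parts[:2])  # last partition discarded
full = merge_partitions(parts)
```

Discarding the last partition removes only the coefficients at scan indices 8 to 15, so each partition acts as a separate quality refinement step.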

 

SVC High-Level Design
In the SVC extension of H.264/AVC, the basic concepts for temporal, spatial, and quality scalability can be combined. However, an SVC bit stream does not need to provide all types of scalability. Since the support of quality and spatial scalability usually comes along with a loss in coding efficiency relative to single-layer coding, the trade-off between coding efficiency and the provided degree of scalability can be adjusted according to the needs of an application.


As described above, temporal scalability is provided on the basis of access units (set of packets that correspond to a time instant), and each access unit is associated with a temporal layer identifier T. Inside an access unit, the coding structure is organized in dependency layers as illustrated in Fig. 10. A dependency layer usually represents a specific spatial resolution. In an extreme case, the spatial resolution of two dependency layers can be identical, in which case the different layers provide coarse-grain scalability (CGS) in terms of quality. Dependency layers are identified by a dependency identifier D. Each dependency layer contains one or more quality layers, which represent the video source for a time instant with a specific spatial resolution and a specific fidelity. All quality layers inside a dependency layer correspond to the same spatial resolution and are identified by a quality identifier Q. For quality refinement layers with a quality identifier Q > 0, the preceding quality layer with a quality identifier Q – 1 is always employed for inter-layer prediction. For quality layers with Q = 0, any present quality layer of a lower dependency layer can be selected as reference layer for inter-layer prediction.
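The three identifiers D, Q, and T suggest a straightforward way to select the packets needed for a given target representation. The following is a minimal sketch with a hypothetical SvcNalUnit record; it conservatively keeps all quality layers of lower dependency layers, since the text above only states that any of them may serve as reference layer, and a real extractor would consult the signalled dependencies instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SvcNalUnit:
    dependency_id: int  # D
    quality_id: int     # Q
    temporal_id: int    # T


def needed_for_target(nal, target_d, target_q, target_t):
    """Decide whether a NAL unit is kept for the target (D, Q, T)."""
    if nal.temporal_id > target_t or nal.dependency_id > target_d:
        return False
    if nal.dependency_id == target_d:
        return nal.quality_id <= target_q
    return True  # lower dependency layer: kept in full (conservative choice)


def extract_sub_stream(nal_units, target_d, target_q, target_t):
    return [n for n in nal_units
            if needed_for_target(n, target_d, target_q, target_t)]


# One access unit with two dependency layers and two quality layers each.
access_unit = [SvcNalUnit(d, q, 1) for d in range(2) for q in range(2)]
sub = extract_sub_stream(access_unit, target_d=1, target_q=0, target_t=1)
```

Here the quality refinement (D = 1, Q = 1) is dropped, while both quality layers of the lower dependency layer are retained.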
 


     

    Fig. 10. Example for the structure of an SVC access unit.
     


    One important difference between the concept of dependency layers and quality refinements is that switching between different dependency layers is only envisaged at defined switching points (IDR access units). However, switching between different quality refinement layers is virtually possible in any access unit. Quality refinements can either be transmitted as new dependency layers (CGS: different D) or as additional quality refinement layers (MGS: same D, different Q). This does not change the basic decoding process. Only the high-level signalling and the error-detection capabilities are influenced. When quality refinements are coded inside a dependency layer (same D, different Q), the decoder cannot detect whether a quality refinement packet is missing or has been intentionally discarded. This configuration is mainly suitable in connection with hierarchical prediction structures and the usage of key pictures in order to enable efficient packet-based quality scalable coding.


    In the SVC context, the classification of an access unit as IDR access unit, and thus as a switching point between different dependency layer identifiers D, generally depends on the target layer. An IDR access unit for a dependency layer D signals that the reconstruction of layer D for the current and all following access units is independent of all previously transmitted access units. Thus, it is always possible to switch to the dependency layer (or to start the decoding of the dependency layer) for which the current access unit represents an IDR access unit. But it is not required that the decoding of any other dependency layer can be started at that point. IDR access units only provide random access points for a specific dependency layer. For instance, when an access unit represents an IDR access unit for an enhancement layer and thus no motion-compensated prediction can be used, it is still possible to employ motion-compensated prediction in the lower layers in order to improve their coding efficiency.

    System interface
    In order to assist easy bit stream manipulations, the 1-byte NAL unit header of H.264/AVC is extended by 3 additional bytes for SVC NAL unit types. This extended header includes the identifiers D, Q, and T as well as additional information assisting bit stream adaptations. One of the additional syntax elements is a priority identifier P, which signals the importance of a NAL unit. It can be used for simple bit stream adaptations with a single comparison per NAL unit.
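Reading the identifiers from the extended header could look like the sketch below. The bit layout and the NAL unit types 14 and 20 used here are not given in this text; they are an assumption based on the commonly documented SVC NAL unit header extension and should be checked against the specification before use. The function name and the example bytes are hypothetical.

```python
SVC_NAL_UNIT_TYPES = (14, 20)  # assumed: prefix NAL unit, SVC coded slice


def parse_svc_nal_header(data: bytes) -> dict:
    """Parse the 1-byte NAL unit header plus the assumed 3-byte SVC extension."""
    if len(data) < 4:
        raise ValueError("need at least 4 header bytes")
    nal_unit_type = data[0] & 0x1F
    if nal_unit_type not in SVC_NAL_UNIT_TYPES:
        raise ValueError("not an SVC NAL unit type")
    ext = int.from_bytes(data[1:4], "big")  # 24 extension bits, MSB first
    return {
        "nal_ref_idc": (data[0] >> 5) & 0x3,
        "nal_unit_type": nal_unit_type,
        "idr_flag": (ext >> 22) & 0x1,
        "priority_id": (ext >> 16) & 0x3F,   # P
        "dependency_id": (ext >> 12) & 0x7,  # D
        "quality_id": (ext >> 8) & 0xF,      # Q
        "temporal_id": (ext >> 5) & 0x7,     # T
        "discardable_flag": (ext >> 3) & 0x1,
    }


# Hand-built example header: type 20, P = 5, D = 2, Q = 1, T = 3.
header = parse_svc_nal_header(bytes([0x34, 0x85, 0x21, 0x6F]))
```

Because all identifiers sit at fixed bit positions, a network element can make its forwarding decision from the first four bytes of a NAL unit without parsing the slice data.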


    Each SVC bit stream includes a sub-stream that is compliant with a non-scalable profile of H.264/AVC. Standard H.264/AVC NAL units (non-SVC NAL units) do not include the extended SVC NAL unit header. However, the information in this header extension is not only useful for bit stream adaptations; some of it is also required for the SVC decoding process. In order to attach this SVC-related information to non-SVC NAL units, so-called prefix NAL units are introduced. These NAL units directly precede all non-SVC VCL NAL units in an SVC bit stream and contain the SVC NAL unit header extension.
    SVC also specifies additional SEI messages (SEI – Supplemental Enhancement Information), which for example contain information like spatial resolution or bit rate of the layers that are included in an SVC bit stream and which can further assist the bit stream adaptation process.

    Profiles
    The SVC Amendment of H.264/AVC specifies three profiles for scalable video coding:

     

    The Scalable Baseline profile is mainly targeted at mobile broadcast, conversational, and surveillance applications that require a low decoding complexity. In this profile, the support for spatial scalable coding is restricted to resolution ratios of 1.5 and 2 between successive spatial layers in both the horizontal and the vertical direction, and to macroblock-aligned cropping. Furthermore, the coding tools for interlaced sources are not included in this profile. Bit streams conforming to the Scalable Baseline profile contain a base layer bit stream that conforms to the restricted Baseline profile of H.264/AVC. It should be noted that the Scalable Baseline profile supports B slices, weighted prediction, CABAC entropy coding, and the 8×8 luma transform in enhancement layers (CABAC and the 8×8 transform are only supported for certain levels), although the base layer has to conform to the restricted Baseline profile, which does not support these tools.
       
    The Scalable High profile was designed for broadcast, streaming, and storage applications. In this profile, the restrictions of the Scalable Baseline profile are removed and spatial scalable coding with arbitrary resolution ratios and cropping parameters is supported. Bit streams conforming to the Scalable High profile contain a base layer bit stream that conforms to the High profile of H.264/AVC.
       
    The Scalable High Intra profile was mainly designed for professional applications. Bit streams conforming to this profile contain only IDR pictures (for all layers). Apart from that, the same set of coding tools as for the Scalable High profile is supported.
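    The resolution-ratio restriction of the Scalable Baseline profile can be expressed as a short check. This is a sketch under stated assumptions: it tests only the ratios 1.5 and 2 named above, treats the horizontal and vertical directions independently, and ignores cropping, levels, and all other profile constraints. The function name is illustrative.

```python
from fractions import Fraction
from typing import List, Tuple

# Ratios between successive spatial layers named in the text.
ALLOWED_RATIOS = {Fraction(3, 2), Fraction(2)}


def baseline_ratios_ok(layers: List[Tuple[int, int]]) -> bool:
    """Check that each pair of successive (width, height) layers has
    a horizontal and a vertical resolution ratio of exactly 1.5 or 2.

    Exact rational arithmetic avoids floating-point rounding issues.
    """
    for (w_low, h_low), (w_high, h_high) in zip(layers, layers[1:]):
        if Fraction(w_high, w_low) not in ALLOWED_RATIOS:
            return False
        if Fraction(h_high, h_low) not in ALLOWED_RATIOS:
            return False
    return True


cif_to_4cif = baseline_ratios_ok([(352, 288), (704, 576)])
mixed_ratios = baseline_ratios_ok([(320, 240), (480, 360), (960, 720)])
arbitrary = baseline_ratios_ok([(352, 288), (1280, 720)])
```

    A layer configuration such as the last one, with an arbitrary ratio, would need the Scalable High profile according to the description above.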
       


    Performance
    The diagrams in Fig. 11 show two examples in which the coding efficiency of SVC is compared with that of single-layer H.264/AVC coding. The coding efficiency is measured in terms of bit rate and average luma peak signal-to-noise ratio (PSNR) of the video frames. Temporal scalability with 5 levels is provided by all bit streams used for comparison; it does not have any negative impact on the rate-distortion results and is already supported in single-layer H.264/AVC. For both the SVC bit streams and the H.264/AVC bit streams, the same basic encoder configuration was used and a similar type of encoder optimization was applied. For the SVC bit streams, an additional cross-layer optimization was applied that made it possible to trade off the coding efficiency of the base and enhancement layers.
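    For reference, the distortion measure can be sketched as follows. The sketch assumes 8-bit luma samples (peak value 255), frames given as flat sequences of sample values, and an average taken as the arithmetic mean of the per-frame PSNR values. The exact averaging used for Fig. 11 is not stated in the text, and the function names are illustrative.

```python
import math
from typing import Sequence


def luma_psnr(reference: Sequence[int],
              reconstructed: Sequence[int],
              peak: int = 255) -> float:
    """PSNR in dB between two luma frames of equal size:
    10 * log10(peak^2 / MSE). Returns infinity for identical frames."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("frames must be non-empty and of equal size")
    squared_error = sum((a - b) ** 2
                        for a, b in zip(reference, reconstructed))
    mse = squared_error / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def average_luma_psnr(reference_frames: Sequence[Sequence[int]],
                      reconstructed_frames: Sequence[Sequence[int]]) -> float:
    """Arithmetic mean of the per-frame luma PSNR values."""
    values = [luma_psnr(ref, rec)
              for ref, rec in zip(reference_frames, reconstructed_frames)]
    return sum(values) / len(values)


# Tiny synthetic frames: every sample is off by one, so MSE = 1.
ref_frame = [100, 100, 100, 100]
rec_frame = [101, 99, 101, 99]
psnr_db = luma_psnr(ref_frame, rec_frame)
```

    With an MSE of 1 the result equals 20·log10(255), roughly 48.13 dB, which is a convenient sanity check for an implementation.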
     

     


    Fig. 11. Rate-distortion comparison of SVC, single-layer H.264/AVC coding and simulcast with H.264/AVC using the test sequence Soccer: (a) SVC with quality scalability, (b) SVC with dyadic spatial scalability. CIF (Common Intermediate Format) and 4CIF denote sizes of pictures equal to 352x288 and 704x576 luma samples, respectively.
     


    In Fig. 11(a), SVC with quality scalability is compared to H.264/AVC single layer coding. All SVC rate-distortion points are extracted from one single bit stream, while for single layer coding each rate-distortion point represents a separate non-scalable bit stream. The diagram additionally shows a rate-distortion curve representing the simulcast of the single-layer H.264/AVC bit stream with the fidelity of the SVC base layer and another single-layer H.264/AVC bit stream with the fidelity specified in the diagram.


    In Fig. 11(b), SVC with spatial scalability is compared to H.264/AVC single layer coding and simulcast of two spatial resolutions with H.264/AVC. Two spatial layers with a dyadic relation are provided. This means that the horizontal and vertical resolutions of the enhancement layer (4CIF) are twice those of the base layer (CIF). For each point on the 4CIF curve, the spatial scalable bit stream comprises the corresponding point on the CIF curve with about 1/2 to 1/3 of the overall bit rate.


    The comparison shows that SVC can provide a suitable degree of scalability at the cost of approximately 10% bit rate increase in comparison to the bit rate of single-layer H.264/AVC coding. This bit rate increase usually depends on the degree of scalability, the bit rate range, and the spatial resolution of the included representations. The comparison also shows that SVC is clearly superior to simulcasting single-layer H.264/AVC streams for different spatial resolutions or bit rates.
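    The arithmetic behind such a comparison can be sketched as follows. All bit rates in the example are hypothetical round numbers chosen for illustration; they are not read from Fig. 11. The function names are illustrative.

```python
from typing import Iterable


def overhead_percent(rate: float, single_layer_rate: float) -> float:
    """Bit rate increase of `rate` relative to single-layer coding
    at the same fidelity, in percent."""
    return 100.0 * (rate - single_layer_rate) / single_layer_rate


def simulcast_rate(single_layer_rates: Iterable[float]) -> float:
    """Simulcast sends one independent single-layer stream per
    representation, so the rates add up."""
    return sum(single_layer_rates)


# Hypothetical rates in kbit/s (NOT measured values):
avc_4cif = 1000.0   # single-layer stream at the high resolution
avc_cif = 400.0     # single-layer stream at the low resolution
svc_total = 1100.0  # one scalable stream containing both layers

svc_overhead = overhead_percent(svc_total, avc_4cif)
simulcast_total = simulcast_rate([avc_cif, avc_4cif])
simulcast_overhead = overhead_percent(simulcast_total, avc_4cif)
```

    With these assumed numbers the scalable stream costs 10% more than the single-layer 4CIF stream, whereas simulcast costs 40% more for the same two representations.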


    Further simulation results evaluating the coding efficiency of SVC in relation to single-layer H.264/AVC coding, including subjective test results, are presented on our page on the MPEG Verification Test for SVC.

    Conclusion
    In comparison to the scalable profiles of prior video coding standards, the H.264/AVC extension for scalable video coding (SVC) provides various tools for reducing the loss in coding efficiency relative to single-layer coding. The most important differences are:

     

    The possibility to employ hierarchical prediction structures for providing temporal scalability with several layers while improving the coding efficiency and increasing the effectiveness of quality and spatial scalable coding.
       
    New methods for inter-layer prediction of macroblock modes, motion, and residual improving the coding efficiency of spatial scalable and quality scalable coding.
       
    The concept of key pictures for efficiently controlling the drift for packet-based quality scalable coding with hierarchical prediction structures.
       
    Single motion compensation loop decoding for spatial and quality scalable coding providing a decoder complexity close to that of single-layer coding.
       
    The support of a modified decoding process that allows a lossless and low-complexity rewriting of a quality scalable bit stream into a bit stream that conforms to a non-scalable H.264/AVC profile.


    These new features provide SVC with a competitive rate-distortion performance while only requiring a single motion compensation loop at the decoder side. Our experiments further illustrate that:

     

    Temporal scalability: can typically be achieved without losses in rate-distortion performance.
       
    Spatial scalability: when applying a suitable SVC encoder control, the bit rate increase relative to non-scalable H.264/AVC coding at the same fidelity can be as low as 10% for dyadic spatial scalability. It should be noted that the results typically become worse as the spatial resolution of both layers decreases and improve as the spatial resolution increases.
       
    Quality scalability: when applying a suitable SVC encoder control, the bit rate increase relative to non-scalable H.264/AVC coding at the same fidelity can be as low as 10% for all supported rate points when spanning a bit rate range with a factor of 2-3 between the lowest and highest supported rate point.

     


    Resources

    Overview Paper on SVC
    Final Draft of ITU-T Rec. H.264 & 14496-10 AVC (including SVC Amendment)
    SVC Reference Software (JSVM software)
    MPEG Verification Test for SVC


    Contact

    Heiko Schwarz
    Thomas Wiegand



