Notes on iOS VideoToolbox

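1. Apple devices have supported hardware video decoding since at least the iPhone 4, but the hardware decoding API was private for a long time and not open to developers. It could only be used on jailbroken devices, and an app that uses private APIs cannot be submitted to the App Store.

2. Starting with iOS 8, Apple opened up the hardware decoding and encoding APIs under the name VideoToolbox.framework. It requires iOS 8 or later and is not available on iOS 7.x.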

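The hardware decoding API is a handful of plain C functions, so it can be called from any Objective-C or C++ code.

First add VideoToolbox.framework to the project and include its header, VideoToolbox/VideoToolbox.h.

VTDecompressionSessionCreate creates a decode session

VTDecompressionSessionDecodeFrame decodes one frame

VTDecompressionSessionInvalidate destroys the decode session

Start by creating the decode session: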

        OSStatus status = VTDecompressionSessionCreate(kCFAllocatorDefault,
                                              decoderFormatDescription,
                                              NULL, attrs,
                                              &callBackRecord,
                                              &deocderSession);

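Here decoderFormatDescription is a CMVideoFormatDescriptionRef that describes the video format. It is built from the H.264 SPS and PPS data by calling CMVideoFormatDescriptionCreateFromH264ParameterSets.

Note that the SPS and PPS data passed in here must not include the "00 00 00 01" start code.

attrs is the attribute dictionary passed to the decode session: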

        CFDictionaryRef attrs = NULL;
        const void *keys[] = { kCVPixelBufferPixelFormatTypeKey };
//      kCVPixelFormatType_420YpCbCr8Planar is YUV420
//      kCVPixelFormatType_420YpCbCr8BiPlanarFullRange is NV12
        uint32_t v = kCVPixelFormatType_420YpCbCr8BiPlanarFullRange;
        const void *values[] = { CFNumberCreate(NULL, kCFNumberSInt32Type, &v) };
        attrs = CFDictionaryCreate(NULL, keys, values, 1, NULL, NULL);


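callBackRecord specifies the callback function. The decoder supports an asynchronous mode, and it calls this callback once a frame has been decoded.

Once the decoder session has been created successfully, decoding can start.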
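            VTDecodeFrameFlags flags = 0;
            //kVTDecodeFrame_EnableTemporalProcessing | kVTDecodeFrame_EnableAsynchronousDecompression;
            VTDecodeInfoFlags flagOut = 0;
            CVPixelBufferRef outputPixelBuffer = NULL;
            // the fourth argument (sourceFrameRefCon) is handed to the callback,
            // which is expected to store the decoded frame through it
            OSStatus decodeStatus = VTDecompressionSessionDecodeFrame(deocderSession,
                                                                      sampleBuffer,
                                                                      flags,
                                                                      &outputPixelBuffer,
                                                                      &flagOut);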
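Setting flags to 0 selects synchronous decoding, which is simpler.
sampleBuffer is the input H.264 video data, one frame per call.
First create a CMBlockBufferRef from the H.264 data with CMBlockBufferCreateWithMemoryBlock.
Then create a CMSampleBufferRef with CMSampleBufferCreateReady.
Note that the H.264 data passed in must be in MP4 style (AVCC): the first four bytes hold the length of the data as a big-endian integer, not the "00 00 00 01" start code.
Data read from a video stream usually begins with "00 00 00 01", so you have to convert it yourself.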
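
The conversion described above can be sketched in plain C. This is a minimal sketch, not code from the original post: the function name annexb_to_avcc_in_place is made up for illustration, and it assumes the buffer holds exactly one NALU that begins with the 4-byte start code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Overwrite the 4-byte start code of a single NALU with the big-endian
 * length of the bytes that follow it (the AVCC / MP4 style header).
 * Returns 0 on success, -1 if the buffer is too short or does not begin
 * with 00 00 00 01. */
int annexb_to_avcc_in_place(uint8_t *nalu, size_t size)
{
    assert(nalu != NULL);
    if (size < 5 || nalu[0] != 0 || nalu[1] != 0 || nalu[2] != 0 || nalu[3] != 1)
        return -1;

    uint32_t payload = (uint32_t)(size - 4);   /* bytes after the start code */
    nalu[0] = (uint8_t)(payload >> 24);        /* most significant byte first */
    nalu[1] = (uint8_t)(payload >> 16);
    nalu[2] = (uint8_t)(payload >> 8);
    nalu[3] = (uint8_t)(payload);
    return 0;
}
```

Writing the four bytes with shifts keeps the result big-endian regardless of the host byte order, so no separate byte-swap call is needed.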

After a successful decode, outputPixelBuffer holds one frame of YUV image data in NV12 format.
    CVPixelBufferLockBaseAddress(outputPixelBuffer, 0);
    void *baseAddress = CVPixelBufferGetBaseAddress(outputPixelBuffer);
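
If the rest of the pipeline wants planar YUV420 (I420) instead of NV12, the two chroma channels have to be separated. The following is a rough sketch on plain byte buffers, not taken from the original post; nv12_to_i420 is a hypothetical helper. On a real CVPixelBufferRef the per-plane pointers and strides would presumably come from CVPixelBufferGetBaseAddressOfPlane and CVPixelBufferGetBytesPerRowOfPlane while the buffer is locked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy an NV12 image (a Y plane followed by one interleaved CbCr plane)
 * into three tightly packed I420 planes. Strides are in bytes; width and
 * height must be even. */
void nv12_to_i420(const uint8_t *src_y, size_t src_y_stride,
                  const uint8_t *src_uv, size_t src_uv_stride,
                  uint8_t *dst_y, uint8_t *dst_u, uint8_t *dst_v,
                  size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    /* luma: row-by-row copy that drops any stride padding */
    for (size_t row = 0; row < height; row++)
        memcpy(dst_y + row * width, src_y + row * src_y_stride, width);

    /* chroma: split Cb,Cr,Cb,Cr,... into separate U and V planes */
    for (size_t row = 0; row < height / 2; row++) {
        const uint8_t *uv = src_uv + row * src_uv_stride;
        for (size_t col = 0; col < width / 2; col++) {
            dst_u[row * (width / 2) + col] = uv[2 * col];
            dst_v[row * (width / 2) + col] = uv[2 * col + 1];
        }
    }
}
```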



Calling CVOpenGLESTextureCacheCreateTextureFromImage creates an OpenGL texture directly from the pixel buffer.

Creating a UIImage from a CVPixelBufferRef:

    CIImage *ciImage = [CIImage imageWithCVPixelBuffer:pixelBuffer];
    UIImage *uiImage = [UIImage imageWithCIImage:ciImage];

When decoding is finished, destroy the decoder session with VTDecompressionSessionInvalidate.






3. For the complete flow, see this example on GitHub: github.com/adison/-Vide Note that the codecCtx and packet data may differ slightly between creating the session and decoding with it; adjust to fit your own case.



[PATCH 1/2] video chroma: add a Nv12 copy function which outputs I420
[PATCH 2/2] Add VideoToolbox based decoder


NALUs: A NALU is simply a chunk of data of varying length that has a NALU start code header 0x00 00 00 01 YY, where the low 5 bits of YY tell you what type of NALU this is and therefore what type of data follows the header. (Since you only need those 5 bits, I use YY & 0x1F to get just the relevant bits.) I list what all these types are in the array NSString * const naluTypesStrings[], but you don't need to know what they all are.
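
For reference, the header byte YY carries two more fields besides the type. The following C sketch splits all three apart; the struct and function names are made up for illustration and do not appear in the original answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the one-byte H.264 NAL unit header. */
typedef struct {
    uint8_t forbidden_zero_bit; /* bit 7, 0 in a valid stream */
    uint8_t nal_ref_idc;        /* bits 6-5, 0 means "not used for reference" */
    uint8_t nal_unit_type;      /* bits 4-0: 1 = non-IDR, 5 = IDR, 7 = SPS, 8 = PPS */
} nal_header;

/* `nalu` points at the first byte after the start code. */
nal_header parse_nal_header(const uint8_t *nalu)
{
    assert(nalu != NULL);
    nal_header h;
    h.forbidden_zero_bit = (uint8_t)((nalu[0] >> 7) & 0x01);
    h.nal_ref_idc        = (uint8_t)((nalu[0] >> 5) & 0x03);
    h.nal_unit_type      = (uint8_t)(nalu[0] & 0x1F);
    return h;
}
```

For example, a header byte of 0x67 parses as type 7 (SPS) and 0x65 as type 5 (IDR slice).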

Parameters: Your decoder needs parameters so it knows how the H.264 video data is stored. The 2 you need to set are Sequence Parameter Set (SPS) and Picture Parameter Set (PPS) and they each have their own NALU type number. You don't need to know what the parameters mean, the decoder knows what to do with them.

H.264 Stream Format: In most H.264 streams, you will receive an initial set of SPS and PPS parameters followed by an I frame (aka IDR frame or flush frame) NALU. Then you will receive several P frame NALUs (maybe a few dozen or so), then another set of parameters (which may be the same as the initial parameters) and an I frame, more P frames, etc. I frames are much bigger than P frames. Conceptually you can think of the I frame as an entire image of the video, and the P frames are just the changes that have been made to that I frame, until you receive the next I frame.


  1. Generate individual NALUs from your H.264 stream. I cannot show code for this step since it depends a lot on what video source you're using. I made this graphic to show what I was working with ("data" in the graphic is "frame" in my following code), but your case may and probably will differ. My method receivedRawVideoFrame: is called every time I receive a frame (uint8_t *frame) which was one of 2 types. In the diagram, those 2 frame types are the 2 big purple boxes.

  2. Create a CMVideoFormatDescriptionRef from your SPS and PPS NALUs with CMVideoFormatDescriptionCreateFromH264ParameterSets( ). You cannot display any frames without doing this first. The SPS and PPS may look like a jumble of numbers, but VTD knows what to do with them. All you need to know is that CMVideoFormatDescriptionRef is a description of video data, like width/height, format type (kCMPixelFormat_32BGRA, kCMVideoCodecType_H264 etc.), aspect ratio, color space etc. Your decoder will hold onto the parameters until a new set arrives (sometimes parameters are resent regularly even when they haven't changed).

  3. Re-package your IDR and non-IDR frame NALUs according to the "AVCC" format. This means removing the NALU start codes and replacing them with a 4-byte header that states the length of the NALU. You don't need to do this for the SPS and PPS NALUs. (Note that the 4-byte NALU length header is big-endian, so if you have a UInt32 value on a little-endian host it must be byte-swapped before it is copied into the CMBlockBuffer. I do this in my code with the htonl function call.)

  4. Package the IDR and non-IDR NALU frames into CMBlockBuffer. Do not do this with the SPS PPS parameter NALUs. All you need to know about CMBlockBuffers is that they are a method to wrap arbitrary blocks of data in core media. (Any compressed video data in a video pipeline is wrapped in this.)

  5. Package the CMBlockBuffer into CMSampleBuffer. All you need to know about CMSampleBuffers is that they wrap up our CMBlockBuffers with other information (here it would be the CMVideoFormatDescription and CMTime, if CMTime is used).

  6. Create a VTDecompressionSessionRef and feed the sample buffers into VTDecompressionSessionDecodeFrame( ). Alternatively, you can use AVSampleBufferDisplayLayer and its enqueueSampleBuffer: method and you won't need to use VTDecompSession. It's simpler to set up, but will not throw errors if something goes wrong like VTD will.

  7. In the VTDecompSession callback, use the resultant CVImageBufferRef to display the video frame. If you need to convert your CVImageBuffer to a UIImage, see my StackOverflow answer here.

Other notes:

  • H.264 streams can vary a lot. From what I learned, NALU start code headers are sometimes 3 bytes (0x00 00 01) and sometimes 4 (0x00 00 00 01). My code works for 4 bytes; you will need to change a few things around if you're working with 3.

  • If you want to know more about NALUs, I found this answer to be very helpful. In my case, I found that I didn't need to ignore the "emulation prevention" bytes as described, so I personally skipped that step but you may need to know about that.

  • If your VTDecompressionSession outputs an error number (like -12909), look up the error code in your Xcode project. Find the VideoToolbox framework in your project navigator, open it and find the header VTErrors.h. If you can't find it, I've also included all the error codes below in another answer.
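
If your source mixes 3-byte and 4-byte start codes, a small scanner that accepts both may help. This is a sketch in plain C, separate from the code example below; find_start_code and next_nalu are hypothetical names. It assumes a standard Annex B byte stream, where emulation prevention bytes guarantee that 00 00 01 never appears inside a NALU payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find the next start code at or after `from`. Returns the offset of its
 * first byte and stores its length (3 or 4) in *code_len; returns `size`
 * when there is none. */
size_t find_start_code(const uint8_t *buf, size_t size, size_t from, size_t *code_len)
{
    assert(buf != NULL && code_len != NULL);
    for (size_t i = from; i + 3 <= size; i++) {
        if (buf[i] != 0 || buf[i + 1] != 0)
            continue;
        if (buf[i + 2] == 1) {                       /* 00 00 01 */
            *code_len = 3;
            return i;
        }
        if (i + 4 <= size && buf[i + 2] == 0 && buf[i + 3] == 1) {
            *code_len = 4;                           /* 00 00 00 01 */
            return i;
        }
    }
    return size;
}

/* Step to the NALU that starts at or after *pos. Returns 1 and fills
 * *payload / *payload_len (start code excluded), leaving *pos at the
 * following start code; returns 0 when no NALU is left. */
int next_nalu(const uint8_t *buf, size_t size, size_t *pos,
              const uint8_t **payload, size_t *payload_len)
{
    size_t code_len = 0, next_len = 0;
    size_t start = find_start_code(buf, size, *pos, &code_len);
    if (start == size)
        return 0;
    size_t begin = start + code_len;
    size_t end = find_start_code(buf, size, begin, &next_len);
    *payload = buf + begin;
    *payload_len = end - begin;
    *pos = end;
    return 1;
}
```

Calling next_nalu in a loop with *pos starting at 0 walks every NALU in the buffer; payload[0] & 0x1F then gives the NALU type.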

Code Example:

So let's start by declaring some properties and including the VT framework (VT = Video Toolbox).

#import <AVFoundation/AVFoundation.h>
#import <VideoToolbox/VideoToolbox.h>

@interface THISCLASSNAME ()
@property (nonatomic, assign) CMVideoFormatDescriptionRef formatDesc;
@property (nonatomic, assign) VTDecompressionSessionRef decompressionSession;
@property (nonatomic, retain) AVSampleBufferDisplayLayer *videoLayer;
@property (nonatomic, assign) int spsSize;
@property (nonatomic, assign) int ppsSize;
@end

The following array is only used so that you can print out what type of NALU frame you are receiving. If you know what all these types mean, good for you, you know more about H.264 than me :) My code only handles types 1, 5, 7 and 8.

NSString * const naluTypesStrings[] =
{
    @"0: Unspecified (non-VCL)",
    @"1: Coded slice of a non-IDR picture (VCL)",    // P frame
    @"2: Coded slice data partition A (VCL)",
    @"3: Coded slice data partition B (VCL)",
    @"4: Coded slice data partition C (VCL)",
    @"5: Coded slice of an IDR picture (VCL)",      // I frame
    @"6: Supplemental enhancement information (SEI) (non-VCL)",
    @"7: Sequence parameter set (non-VCL)",         // SPS parameter
    @"8: Picture parameter set (non-VCL)",          // PPS parameter
    @"9: Access unit delimiter (non-VCL)",
    @"10: End of sequence (non-VCL)",
    @"11: End of stream (non-VCL)",
    @"12: Filler data (non-VCL)",
    @"13: Sequence parameter set extension (non-VCL)",
    @"14: Prefix NAL unit (non-VCL)",
    @"15: Subset sequence parameter set (non-VCL)",
    @"16: Reserved (non-VCL)",
    @"17: Reserved (non-VCL)",
    @"18: Reserved (non-VCL)",
    @"19: Coded slice of an auxiliary coded picture without partitioning (non-VCL)",
    @"20: Coded slice extension (non-VCL)",
    @"21: Coded slice extension for depth view components (non-VCL)",
    @"22: Reserved (non-VCL)",
    @"23: Reserved (non-VCL)",
    @"24: STAP-A Single-time aggregation packet (non-VCL)",
    @"25: STAP-B Single-time aggregation packet (non-VCL)",
    @"26: MTAP16 Multi-time aggregation packet (non-VCL)",
    @"27: MTAP24 Multi-time aggregation packet (non-VCL)",
    @"28: FU-A Fragmentation unit (non-VCL)",
    @"29: FU-B Fragmentation unit (non-VCL)",
    @"30: Unspecified (non-VCL)",
    @"31: Unspecified (non-VCL)",
};

Now this is where all the magic happens.

-(void) receivedRawVideoFrame:(uint8_t *)frame withSize:(uint32_t)frameSize isIFrame:(int)isIFrame
{
    OSStatus status = noErr;

    uint8_t *data = NULL;
    uint8_t *pps = NULL;
    uint8_t *sps = NULL;

    // I know what my H.264 data source's NALUs look like so I know start code index is always 0.
    // if you don't know where it starts, you can use a for loop similar to how i find the 2nd and 3rd start codes
    int startCodeIndex = 0;
    int secondStartCodeIndex = 0;
    int thirdStartCodeIndex = 0;

    long blockLength = 0;

    CMSampleBufferRef sampleBuffer = NULL;
    CMBlockBufferRef blockBuffer = NULL;

    int nalu_type = (frame[startCodeIndex + 4] & 0x1F);
    NSLog(@"~~~~~~~ Received NALU Type \"%@\" ~~~~~~~~", naluTypesStrings[nalu_type]);

    // if we havent already set up our format description with our SPS PPS parameters, we
    // can't process any frames except type 7 that has our parameters
    if (nalu_type != 7 && _formatDesc == NULL)
    {
        NSLog(@"Video error: Frame is not an I Frame and format description is null");
        return;
    }

    // NALU type 7 is the SPS parameter NALU
    if (nalu_type == 7)
    {
        // find where the second start code begins (the 0x00 00 00 01 code of the PPS),
        // from which we also get the length of the first SPS code
        for (int i = startCodeIndex + 4; i < startCodeIndex + 40; i++)
        {
            if (frame[i] == 0x00 && frame[i+1] == 0x00 && frame[i+2] == 0x00 && frame[i+3] == 0x01)
            {
                secondStartCodeIndex = i;
                _spsSize = secondStartCodeIndex;   // includes the header in the size
                break;
            }
        }

        // find what the second NALU type is
        nalu_type = (frame[secondStartCodeIndex + 4] & 0x1F);
        NSLog(@"~~~~~~~ Received NALU Type \"%@\" ~~~~~~~~", naluTypesStrings[nalu_type]);
    }

    // type 8 is the PPS parameter NALU
    if(nalu_type == 8)
    {
        // find where the NALU after this one starts so we know how long the PPS parameter is
        for (int i = _spsSize + 4; i < _spsSize + 30; i++)
        {
            if (frame[i] == 0x00 && frame[i+1] == 0x00 && frame[i+2] == 0x00 && frame[i+3] == 0x01)
            {
                thirdStartCodeIndex = i;
                _ppsSize = thirdStartCodeIndex - _spsSize;
                break;
            }
        }

        // allocate enough data to fit the SPS and PPS parameters into our data objects.
        // VTD doesn't want you to include the start code header (4 bytes long) so we add the - 4 here
        sps = malloc(_spsSize - 4);
        pps = malloc(_ppsSize - 4);

        // copy in the actual sps and pps values, again ignoring the 4 byte header
        memcpy (sps, &frame[4], _spsSize-4);
        memcpy (pps, &frame[_spsSize+4], _ppsSize-4);

        // now we set our H264 parameters
        uint8_t*  parameterSetPointers[2] = {sps, pps};
        size_t parameterSetSizes[2] = {_spsSize-4, _ppsSize-4};

        status = CMVideoFormatDescriptionCreateFromH264ParameterSets(kCFAllocatorDefault, 2, 
                                                (const uint8_t *const*)parameterSetPointers, 
                                                parameterSetSizes, 4, 
                                                &_formatDesc);

        // the format description keeps its own copy of the parameter sets
        free (sps);
        free (pps);

        NSLog(@"\t\t Creation of CMVideoFormatDescription: %@", (status == noErr) ? @"successful!" : @"failed...");
        if(status != noErr) NSLog(@"\t\t Format Description ERROR type: %d", (int)status);

        // See if decomp session can convert from previous format description 
        // to the new one, if not we need to remake the decomp session.
        // This snippet was not necessary for my applications but it could be for yours
        /*BOOL needNewDecompSession = (VTDecompressionSessionCanAcceptFormatDescription(_decompressionSession, _formatDesc) == NO);
         if(needNewDecompSession)
         {
             [self createDecompSession];
         }*/

        // now lets handle the IDR frame that (should) come after the parameter sets
        // I say "should" because that's how I expect my H264 stream to work, YMMV
        nalu_type = (frame[thirdStartCodeIndex + 4] & 0x1F);
        NSLog(@"~~~~~~~ Received NALU Type \"%@\" ~~~~~~~~", naluTypesStrings[nalu_type]);
    }

    // create our VTDecompressionSession.  This isnt neccessary if you choose to use AVSampleBufferDisplayLayer
    if((status == noErr) && (_decompressionSession == NULL))
    {
        [self createDecompSession];
    }

    // type 5 is an IDR frame NALU.  The SPS and PPS NALUs should always be followed by an IDR (or IFrame) NALU, as far as I know
    if(nalu_type == 5)
    {
        // find the offset, or where the SPS and PPS NALUs end and the IDR frame NALU begins
        int offset = _spsSize + _ppsSize;
        blockLength = frameSize - offset;
        data = malloc(blockLength);
        data = memcpy(data, &frame[offset], blockLength);

        // replace the start code header on this NALU with its size.
        // AVCC format requires that you do this.  
        // htonl converts the unsigned int from host to network byte order
        uint32_t dataLength32 = htonl (blockLength - 4);
        memcpy (data, &dataLength32, sizeof (uint32_t));

        // create a block buffer from the IDR NALU
        status = CMBlockBufferCreateWithMemoryBlock(NULL, data,  // memoryBlock to hold buffered data
                                                    blockLength,  // block length of the mem block in bytes.
                                                    kCFAllocatorNull, NULL,
                                                    0, // offsetToData
                                                    blockLength,   // dataLength of relevant bytes, starting at offsetToData
                                                    0, &blockBuffer);

        NSLog(@"\t\t BlockBufferCreation: \t %@", (status == kCMBlockBufferNoErr) ? @"successful!" : @"failed...");
    }

    // NALU type 1 is non-IDR (or PFrame) picture
    if (nalu_type == 1)
    {
        // non-IDR frames do not have an offset due to SPS and PPS, so the approach
        // is similar to the IDR frames just without the offset
        blockLength = frameSize;
        data = malloc(blockLength);
        data = memcpy(data, &frame[0], blockLength);

        // again, replace the start header with the size of the NALU
        uint32_t dataLength32 = htonl (blockLength - 4);
        memcpy (data, &dataLength32, sizeof (uint32_t));

        status = CMBlockBufferCreateWithMemoryBlock(NULL, data,  // memoryBlock to hold data. If NULL, block will be alloc when needed
                                                    blockLength,  // overall length of the mem block in bytes
                                                    kCFAllocatorNull, NULL,
                                                    0,     // offsetToData
                                                    blockLength,  // dataLength of relevant data bytes, starting at offsetToData
                                                    0, &blockBuffer);

        NSLog(@"\t\t BlockBufferCreation: \t %@", (status == kCMBlockBufferNoErr) ? @"successful!" : @"failed...");
    }

    // now create our sample buffer from the block buffer
    // (skipped for NALU types other than 1 and 5, which never built a block buffer)
    if(status == noErr && blockBuffer != NULL)
    {
        // here I'm not bothering with any timing specifics since in my case we displayed all frames immediately
        const size_t sampleSize = blockLength;
        status = CMSampleBufferCreate(kCFAllocatorDefault,
                                      blockBuffer, true, NULL, NULL,
                                      _formatDesc, 1, 0, NULL, 1,
                                      &sampleSize, &sampleBuffer);

        NSLog(@"\t\t SampleBufferCreate: \t %@", (status == noErr) ? @"successful!" : @"failed...");
    }

    if(status == noErr && sampleBuffer != NULL)
    {
        // set some values of the sample buffer's attachments
        CFArrayRef attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, YES);
        CFMutableDictionaryRef dict = (CFMutableDictionaryRef)CFArrayGetValueAtIndex(attachments, 0);
        CFDictionarySetValue(dict, kCMSampleAttachmentKey_DisplayImmediately, kCFBooleanTrue);

        // either send the samplebuffer to a VTDecompressionSession or to an AVSampleBufferDisplayLayer
        [self render:sampleBuffer];
    }

    // free memory to avoid a memory leak, do the same for the blockbuffer and samplebuffer.
    // Caution: the block buffer wraps `data` without copying it (kCFAllocatorNull), so this
    // free is only safe once the decoder has finished reading the sample buffer
    if (NULL != data)
    {
        free (data);
        data = NULL;
    }
}

The following method creates your VTD session. Recreate it whenever you receive new parameters. (You don't have to recreate it every time you receive parameters, pretty sure.)

If you want to set attributes for the destination CVPixelBuffer, read up on CoreVideo PixelBufferAttributes values and put them in NSDictionary *destinationImageBufferAttributes.

-(void) createDecompSession
{
    // make sure to destroy the old VTD session
    if (_decompressionSession != NULL)
    {
        VTDecompressionSessionInvalidate(_decompressionSession);
        CFRelease(_decompressionSession);
    }
    _decompressionSession = NULL;
    VTDecompressionOutputCallbackRecord callBackRecord;
    callBackRecord.decompressionOutputCallback = decompressionSessionDecodeFrameCallback;

    // this is necessary if you need to make calls to Objective C "self" from within in the callback method.
    callBackRecord.decompressionOutputRefCon = (__bridge void *)self;

    // you can set some desired attributes for the destination pixel buffer.  I didn't use this but you may
    // if you need to set some attributes, be sure to uncomment the dictionary in VTDecompressionSessionCreate
    // (the OpenGL ES compatibility key below is only an example attribute)
    NSDictionary *destinationImageBufferAttributes = [NSDictionary dictionaryWithObjectsAndKeys:
                                                      [NSNumber numberWithBool:YES],
                                                      (id)kCVPixelBufferOpenGLESCompatibilityKey,
                                                      nil];

    OSStatus status =  VTDecompressionSessionCreate(NULL, _formatDesc, NULL,
                                                    NULL, // (__bridge CFDictionaryRef)(destinationImageBufferAttributes)
                                                    &callBackRecord, &_decompressionSession);
    NSLog(@"Video Decompression Session Create: \t %@", (status == noErr) ? @"successful!" : @"failed...");
    if(status != noErr) NSLog(@"\t\t VTD ERROR type: %d", (int)status);
}

Now this method gets called every time VTD is done decompressing any frame you sent to it. This method gets called even if there's an error or if the frame is dropped. 

void decompressionSessionDecodeFrameCallback(void *decompressionOutputRefCon,
                                             void *sourceFrameRefCon,
                                             OSStatus status,
                                             VTDecodeInfoFlags infoFlags,
                                             CVImageBufferRef imageBuffer,
                                             CMTime presentationTimeStamp,
                                             CMTime presentationDuration)
{
    THISCLASSNAME *streamManager = (__bridge THISCLASSNAME *)decompressionOutputRefCon;

    if (status != noErr)
    {
        NSError *error = [NSError errorWithDomain:NSOSStatusErrorDomain code:status userInfo:nil];
        NSLog(@"Decompressed error: %@", error);
    }
    else
    {
        NSLog(@"Decompressed successfully");

        // do something with your resulting CVImageBufferRef that is your decompressed frame
        [streamManager displayDecodedFrame:imageBuffer];
    }
}

This is where we actually send the sampleBuffer off to the VTD to be decoded.

- (void) render:(CMSampleBufferRef)sampleBuffer
{
    VTDecodeFrameFlags flags = kVTDecodeFrame_EnableAsynchronousDecompression;
    VTDecodeInfoFlags flagOut;
    NSDate* currentTime = [NSDate date];
    VTDecompressionSessionDecodeFrame(_decompressionSession, sampleBuffer, flags,
                                      (void*)CFBridgingRetain(currentTime), &flagOut);

    // if you're using AVSampleBufferDisplayLayer, you only need to use this line of code
    // [self.videoLayer enqueueSampleBuffer:sampleBuffer];
}

If you're using AVSampleBufferDisplayLayer, be sure to init the layer like this, in viewDidLoad or inside some other init method.

-(void) viewDidLoad
{
    [super viewDidLoad];

    // create our AVSampleBufferDisplayLayer and add it to the view
    self.videoLayer = [[AVSampleBufferDisplayLayer alloc] init];
    self.videoLayer.frame = self.view.frame;
    self.videoLayer.bounds = self.view.bounds;
    self.videoLayer.videoGravity = AVLayerVideoGravityResizeAspect;

    // set Timebase, you may need this if you need to display frames at specific times
    // I didn't need it so I haven't verified that the timebase is working
    CMTimebaseRef controlTimebase;
    CMTimebaseCreateWithMasterClock(CFAllocatorGetDefault(), CMClockGetHostTimeClock(), &controlTimebase);

    // the two calls below only take effect once the timebase is assigned to the layer
    //self.videoLayer.controlTimebase = controlTimebase;
    CMTimebaseSetTime(self.videoLayer.controlTimebase, kCMTimeZero);
    CMTimebaseSetRate(self.videoLayer.controlTimebase, 1.0);

    [[self.view layer] addSublayer:self.videoLayer];
}

Frequently Asked Questions

  1. What kinds of encoders are supported?

    The protocol specification does not limit the encoder selection. However, the current Apple implementation should interoperate with encoders that produce MPEG-2 Transport Streams containing H.264 video and AAC audio (HE-AAC or AAC-LC). Encoders that are capable of broadcasting the output stream over UDP should also be compatible with the current implementation of the Apple provided segmenter software.

  2. What are the specifics of the video and audio formats supported?

    Although the protocol specification does not limit the video and audio formats, the current Apple implementation supports the following formats:

    • Video:

      • H.264 Baseline Level 3.0, Baseline Level 3.1, Main Level 3.1, and High Profile Level 4.1.

    • Audio: 

      • HE-AAC or AAC-LC up to 48 kHz, stereo audio

      • MP3 (MPEG-1 Audio Layer 3) 8 kHz to 48 kHz, stereo audio

      • AC-3 (for Apple TV, in pass-through mode only)

      Note: iPad, iPhone 3G, and iPod touch (2nd generation and later) support H.264 Baseline 3.1. If your app runs on older versions of iPhone or iPod touch, however, you should use H.264 Baseline 3.0 for compatibility. If your content is intended solely for iPad, Apple TV, iPhone 4 and later, and Mac OS X computers, you should use Main Level 3.1.

  3. What duration should media files be?

    The main point to consider is that shorter segments result in more frequent refreshes of the index file, which might create unnecessary network overhead for the client. Longer segments will extend the inherent latency of the broadcast and initial startup time. A duration of 10 seconds of media per file seems to strike a reasonable balance for most broadcast content.

  4. How many files should be listed in the index file during a continuous, ongoing session?

    The normal recommendation is 3, but the optimum number may be larger.

    The important point to consider when choosing the optimum number is that the number of files available during a live session constrains the client's behavior when doing play/pause and seeking operations. The more files in the list, the longer the client can be paused without losing its place in the broadcast, the further back in the broadcast a new client begins when joining the stream, and the wider the time range within which the client can seek. The trade-off is that a longer index file adds to network overhead—during live broadcasts, the clients are all refreshing the index file regularly, so it does add up, even though the index file is typically small.

  5. What data rates are supported?

    The data rate that a content provider chooses for a stream is most influenced by the target client platform and the expected network topology. The streaming protocol itself places no limitations on the data rates that can be used. The current implementation has been tested using audio-video streams with data rates as low as 64 Kbps and as high as 3 Mbps to iPhone. Audio-only streams at 64 Kbps are recommended as alternates for delivery over slow cellular connections.

    For recommended data rates, see Preparing Media for Delivery to iOS-Based Devices.

    Note: If the data rate exceeds the available bandwidth, there is more latency before startup and the client may have to pause to buffer more data periodically. If a broadcast uses an index file that provides a moving window into the content, the client will eventually fall behind in such cases, causing one or more segments to be dropped. In the case of VOD, no segments are lost, but inadequate bandwidth does cause slower startup and periodic stalling while data buffers.

  6. What is a .ts file?

    A .ts file contains an MPEG-2 Transport Stream. This is a file format that encapsulates a series of encoded media samples, typically audio and video. The file format supports a variety of compression formats, including MP3 audio, AAC audio, H.264 video, and so on. Not all compression formats are currently supported in the Apple HTTP Live Streaming implementation, however. (For a list of currently supported formats, see Media Encoder.)

    MPEG-2 Transport Streams are containers, and should not be confused with MPEG-2 compression.

  7. What is an .M3U8 file?

    An .M3U8 file is an extensible playlist file format. It is an m3u playlist containing UTF-8 encoded text. The m3u file format is a de facto standard playlist format suitable for carrying lists of media file URLs. This is the format used as the index file for HTTP Live Streaming. For details, see IETF Internet-Draft of the HTTP Live Streaming specification.

  8. How does the client software determine when to switch streams?

    The current implementation of the client observes the effective bandwidth while playing a stream. If a higher-quality stream is available and the bandwidth appears sufficient to support it, the client switches to a higher quality. If a lower-quality stream is available and the current bandwidth appears insufficient to support the current stream, the client switches to a lower quality.

    Note: For seamless transitions between alternate streams, the audio portion of the stream should be identical in all versions.

  9. Where can I find a copy of the media stream segmenter from Apple?

    The media stream segmenter, file stream segmenter, and other tools are frequently updated, so you should download the current version of the HTTP Live Streaming Tools from the Apple Developer website. See Download the Tools for details.

  10. What settings are recommended for a typical HTTP stream, with alternates, for use with the media segmenter from Apple?

    See Preparing Media for Delivery to iOS-Based Devices.

    These settings are the current recommendations. There are also certain requirements. The current mediastreamsegmenter tool works only with MPEG-2 Transport Streams as defined in ISO/IEC 13818. The transport stream must contain H.264 (MPEG-4, part 10) video and AAC or MPEG audio. If AAC audio is used, it must have ADTS headers. H.264 video access units must use Access Unit Delimiter NALs, and must be in unique PES packets.

    The segmenter also has a number of user-configurable settings. You can obtain a list of the command line arguments and their meanings by typing man mediastreamsegmenter from the Terminal application. A target duration (length of the media segments) of 10 seconds is recommended, and is the default if no target duration is specified.

  11. How can I specify what codecs or H.264 profile are required to play back my stream?

    Use the CODECS attribute of the EXT-X-STREAM-INF tag. When this attribute is present, it must include all codecs and profiles required to play back the stream. The following values are currently recognized:







    H.264 Baseline Profile level 3.0

    "avc1.42001e" or "avc1.66.30"

    Note: Use "avc1.66.30" for compatibility with iOS versions 3.0 to 3.1.2.

    H.264 Baseline Profile level 3.1


    H.264 Main Profile level 3.0

    "avc1.4d001e" or "avc1.77.30"

    Note: Use "avc1.77.30" for compatibility with iOS versions 3.0 to 3.1.2.

    H.264 Main Profile level 3.1


    H.264 Main Profile level 4.0


    H.264 High Profile level 3.1


    H.264 High Profile level 4.0


    H.264 High Profile level 4.1


    The attribute value must be in quotes. If multiple values are specified, one set of quotes is used to contain all values, and the values are separated by commas. An example follows.

    #EXT-X-STREAM-INF:PROGRAM-ID=1, BANDWIDTH=3000000, CODECS="avc1.4d001e,mp4a.40.5", RESOLUTION=1920x1080
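
    The hexadecimal form of the avc1 values above appears to follow a fixed pattern: two hex digits each for profile_idc, the constraint-flag byte, and level_idc (ten times the level number). The C sketch below builds the string on that assumption; avc1_codec_string is a hypothetical helper, not part of any Apple tool, and only the two values listed above ("avc1.42001e" and "avc1.4d001e") are confirmed by this FAQ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format "avc1.PPCCLL", where PP, CC and LL are profile_idc, the
 * constraint-flag byte and level_idc as two-digit lowercase hex.
 * Returns 0 on success, -1 if the output buffer is too small. */
int avc1_codec_string(char *out, size_t out_size,
                      unsigned profile_idc, unsigned constraint_flags,
                      unsigned level_idc)
{
    assert(out != NULL);
    int n = snprintf(out, out_size, "avc1.%02x%02x%02x",
                     profile_idc & 0xFF, constraint_flags & 0xFF, level_idc & 0xFF);
    return (n > 0 && (size_t)n < out_size) ? 0 : -1;
}
```

    Baseline is profile_idc 66 and Main is 77, so (66, 0, 30) and (77, 0, 30) reproduce the two level 3.0 strings listed above.
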
  12. How can I create an audio-only stream from audio/video input?

    Add the -audio-only argument when invoking the stream or file segmenter.

  13. How can I add a still image to an audio-only stream?

    Use the -meta-file argument when invoking the stream or file segmenter with -meta-type=picture to add an image to every segment. For example, this would add an image named poster.jpg to every segment of an audio stream created from the file track01.mp3:

    mediafilesegmenter -f /Dir/outputFile -a --meta-file=poster.jpg --meta-type=picture track01.mp3

    Remember that the image is typically resent every ten seconds, so it’s best to keep the file size small.

  14. How can I specify an audio-only alternate to an audio-video stream?

    Use the CODECS and BANDWIDTH attributes of the EXT-X-STREAM-INF tag together. 

    The BANDWIDTH attribute specifies the bandwidth required for each alternate stream. If the available bandwidth is enough for the audio alternate, but not enough for the lowest video alternate, the client switches to the audio stream.

    If the CODECS attribute is included, it must list all codecs required to play the stream. If only an audio codec is specified, the stream is identified as audio-only. Currently, it is not required to specify that a stream is audio-only, so use of the CODECS attribute is optional.

    The following is an example that specifies video streams at 500 Kbps for fast connections, 150 Kbps for slower connections, and an audio-only stream at 64 Kbps for very slow connections. All the streams should use the same 64 Kbps audio to allow transitions between streams without an audible disturbance.

  15. What are the hardware requirements or recommendations for servers?

    See question #1 for encoder hardware recommendations.

    The Apple stream segmenter is capable of running on any Intel-based Mac. We recommend using a Mac with two Ethernet network interfaces, such as a Mac Pro or an XServe. One network interface can be used to obtain the encoded stream from the local network, while the second network interface can provide access to a wider network.

  16. Does the Apple implementation of HTTP Live Streaming support DRM?

    No. However, media can be encrypted, and key access can be limited by requiring authentication when the client retrieves the key from your HTTPS server.

  17. What client platforms are supported?

    iPhone, iPad, and iPod touch (requires iOS version 3.0 or later), Apple TV (version 2 and later), and Mac OS X computers.

  18. Is the protocol specification available?

    Yes. The protocol specification is an IETF Internet-Draft, at http://tools.ietf.org/html/draft-pantos-http-live-streaming.

  19. Does the client cache content?

    The index file can contain an instruction to the client that content should not be cached. Otherwise, the client may cache data for performance optimization when seeking within the media.

  20. Is this a real-time delivery system?

    No. It has inherent latency corresponding to the size and duration of the media files containing stream segments. At least one segment must fully download before it can be viewed by the client, and two may be required to ensure seamless transitions between segments. In addition, the encoder and segmenter must create a file from the input; the duration of this file is the minimum latency before media is available for download. Typical latency with recommended settings is in the neighborhood of 30 seconds.

  21. What is the latency?

    Approximately 30 seconds, with recommended settings. See question #20.

  22. Do I need to use a hardware encoder?

    No. Using the protocol specification, it is possible to implement a software encoder.

  23. What advantages does this approach have over RTP/RTSP?

    HTTP is less likely to be disallowed by routers, NAT, or firewall settings. No ports need to be opened that are commonly closed by default. Content is therefore more likely to get through to the client in more locations and without special settings. HTTP is also supported by more content-distribution networks, which can affect cost in large distribution models. In general, more available hardware and software works unmodified and as intended with HTTP than with RTP/RTSP. Expertise in customizing HTTP content delivery using tools such as PHP is also more widespread.

    Also, HTTP Live Streaming is supported in Safari and the media player framework on iOS. RTSP streaming is not supported.

  24. Why is my stream’s overall bit rate higher than the sum of the audio and video bitrates?

    MPEG-2 transport streams can include substantial overhead. They utilize fixed packet sizes that are padded when the packet contents are smaller than the default packet size. Encoder and multiplexer implementations vary in their efficiency at packing media data into these fixed packet sizes. The amount of padding can vary with frame rate, sample rate, and resolution.
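
    As a rough illustration of the padding effect, transport stream packets are 188 bytes long with a 4-byte header, leaving 184 bytes of payload each. The C sketch below estimates the bytes needed to carry one PES packet; ts_bytes_for_pes is a hypothetical helper, and it deliberately ignores PES headers, adaptation fields other than stuffing, and the PAT/PMT tables, so real overhead is somewhat higher.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TS_PACKET_SIZE = 188, TS_HEADER_SIZE = 4 };

/* Bytes occupied by one PES packet of `pes_bytes` bytes when it is split
 * across fixed-size transport packets; the last packet is padded up to
 * the full 188 bytes. */
size_t ts_bytes_for_pes(size_t pes_bytes)
{
    const size_t payload = TS_PACKET_SIZE - TS_HEADER_SIZE;   /* 184 */
    assert(pes_bytes <= SIZE_MAX - payload);                  /* no overflow below */
    size_t packets = (pes_bytes + payload - 1) / payload;     /* round up */
    return packets * TS_PACKET_SIZE;
}
```

    A 185-byte PES packet, for example, needs two transport packets and so occupies 376 bytes, roughly double its size, which is why small audio frames tend to inflate the overall bit rate the most.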

  25. How can I reduce the overhead and bring the bit rate down?

    Using a more efficient encoder can reduce the amount of overhead, as can tuning the encoder settings.

  26. Do all media files have to be part of the same MPEG-2 Transport Stream?

    No. You can mix media files from different transport streams, as long as they are separated by EXT-X-DISCONTINUITY tags. See the protocol specification for more detail. For best results, however, all video media files should have the same height and width dimensions in pixels.

  27. Where can I get help or advice on setting up an HTTP audio/video server?

    You can visit the Apple Developer Forum at http://devforums.apple.com/.

    Also, check out Best Practices for Creating and Deploying HTTP Live Streaming Media for the iPhone and iPad.
