First, a link to a useful new Technote for AV Foundation, released 1 December 2014.

TechNote 2404 – a short note on AV Foundation API added in Yosemite; see in particular the sections on AVSampleCursor and AVSampleBufferGenerator.

Now on to WWDC 2013 Session 612, Advanced Editing with AV Foundation, on this page.

Talk Overview

  • Custom video compositing
    • Existing architecture
    • New custom video compositing
    • Choosing pixel formats
    • Tweening
    • Performance
  • Debugging compositions
    • Common pitfalls

Existing Architecture

AV Foundation editing today

  • Available since iOS 4.0 and OS X Lion
  • Used in video editing apps from Apple and in the store
  • Video editing
    • Temporal composition
    • Video composition
    • Audio mixing

Custom Video Compositor

What is a Video Compositor?

  • A unit of video mixing code
  • Receives multiple source frames
  • Blends or transforms pixels
  • Delivers a single output frame
  • Part of the composition architecture

What is the Composition Model?


Instruction objects in an AVVideoComposition


The video compositor takes multiple source frames in and produces a single frame out.

For example, we can encode a dissolve as a property of an instruction: an opacity ramp from 1 down to 0.
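With the built-in compositor, that dissolve is encoded using the standard mutable instruction and layer-instruction classes. A minimal sketch (the fromTrack, toTrack and transitionRange variables are assumed to exist already):

```objc
// Encode a dissolve for the built-in compositor: an opacity ramp from
// 1 down to 0 on the outgoing track over the transition time range.
AVMutableVideoCompositionInstruction *instruction =
    [AVMutableVideoCompositionInstruction videoCompositionInstruction];
instruction.timeRange = transitionRange;

AVMutableVideoCompositionLayerInstruction *fromLayer =
    [AVMutableVideoCompositionLayerInstruction
        videoCompositionLayerInstructionWithAssetTrack:fromTrack];
[fromLayer setOpacityRampFromStartOpacity:1.0
                             toEndOpacity:0.0
                                timeRange:transitionRange];

AVMutableVideoCompositionLayerInstruction *toLayer =
    [AVMutableVideoCompositionLayerInstruction
        videoCompositionLayerInstructionWithAssetTrack:toTrack];

// Layer instructions are listed front to back, so the fading track sits on top.
instruction.layerInstructions = @[ fromLayer, toLayer ];
```

As the outgoing track's opacity falls from 1 to 0, the incoming track underneath shows through, which is exactly the dissolve.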

New Custom Video Compositing

As of Mavericks there is a new custom compositing API: you can replace the built-in compositor with your own. Instructions with mixing parameters are bundled up together with the source frames into a request object. You implement the AVVideoCompositing protocol, whose entry point receives the new AVAsynchronousVideoCompositionRequest object, and you also implement the AVVideoCompositionInstruction protocol.

func startVideoCompositionRequest(_ asyncVideoCompositionRequest: AVAsynchronousVideoCompositionRequest!)

Once you have rendered the frame you deliver it with

func finishWithComposedVideoFrame(_ composedVideoFrame: CVPixelBuffer!)

But you can also finish with one of:

func finishCancelledRequest()
func finishWithError(_ error: NSError!)

Choosing Pixel Formats

  • Source Pixel Formats – small subset
    • YUV 8-bit 4:2:0
    • YUV 8-bit 4:4:4
    • YUV 10-bit 4:2:2
    • YUV 10-bit 4:4:4
    • RGB 24-bit
    • BGRA 32-bit
    • ARGB 32-bit
    • ABGR 32-bit

When decoding H.264 video your source pixel format is typically YUV 8-bit 4:2:0.

You may not be able to deal with that format, or with whatever the native format of the source pixels happens to be. Your custom video compositor specifies the format it requires via the method sourcePixelBufferAttributes, which returns a dictionary. The key kCVPixelBufferPixelFormatTypeKey should be specified, and it takes an array of possible pixel formats. If you want the compositor to work with a Core Animation video layer, provide a single entry in the array with the value kCVPixelFormatType_32BGRA.

This will cause the source frames to be converted into the format required by your custom compositor.

Output Pixel Formats

For the output pixel formats there is also a method, requiredPixelBufferAttributesForRenderContext, where you specify the formats your custom renderer can provide.

To get hold of a new empty frame to render into, we go back to the request object and ask it for the render context, which contains information about the size and aspect ratio we are rendering to as well as the required pixel format. We ask the context for a new pixel buffer, which comes from a managed pool, and we can then render into it to produce our dissolve.

The Hello World equivalent for a custom compositor

@interface MyCompositor1 : NSObject <AVVideoCompositing>
@end

@implementation MyCompositor1

// Sources as BGRA please
- (NSDictionary *)sourcePixelBufferAttributes {
    return @{ (id)kCVPixelBufferPixelFormatTypeKey :
                  @[ @(kCVPixelFormatType_32BGRA) ] };
}

// We'll output BGRA
- (NSDictionary *)requiredPixelBufferAttributesForRenderContext {
    return @{ (id)kCVPixelBufferPixelFormatTypeKey :
                  @[ @(kCVPixelFormatType_32BGRA) ] };
}

// Render a frame - the action happens here on receiving a request object.
- (void)startVideoCompositionRequest:(AVAsynchronousVideoCompositionRequest *)request {
    if ([request.sourceTrackIDs count] != 2) {
        // A dissolve needs exactly two sources ("MyCompositor1" is a made-up error domain).
        [request finishWithError:[NSError errorWithDomain:@"MyCompositor1"
                                                     code:-1
                                                 userInfo:nil]];
        return;
    }

    // There'll be an attempt to back the pixel buffers with IOSurfaces,
    // which means they may live in GPU memory.
    CVPixelBufferRef srcPixelsBackground =
        [request sourceFrameByTrackID:[request.sourceTrackIDs[0] intValue]];
    CVPixelBufferRef srcPixelsForeground =
        [request sourceFrameByTrackID:[request.sourceTrackIDs[1] intValue]];
    CVPixelBufferRef outPixels = [[request renderContext] newPixelBuffer];

    // Because we want to manipulate the pixels ourselves we lock the pixel
    // buffer base addresses, which makes sure the pixel data is in main
    // memory so that we can access it.
    CVPixelBufferLockBaseAddress(srcPixelsForeground, kCVPixelBufferLock_ReadOnly);
    CVPixelBufferLockBaseAddress(srcPixelsBackground, kCVPixelBufferLock_ReadOnly);
    CVPixelBufferLockBaseAddress(outPixels, 0);

    // Calculate the tween value.
    CMTime renderTime = request.compositionTime;
    CMTimeRange range = request.videoCompositionInstruction.timeRange;
    CMTime elapsed = CMTimeSubtract(renderTime, range.start);
    float tween = CMTimeGetSeconds(elapsed) / CMTimeGetSeconds(range.duration);

    size_t height = CVPixelBufferGetHeight(srcPixelsBackground);
    size_t foregroundBytesPerRow = CVPixelBufferGetBytesPerRow(srcPixelsForeground);
    size_t backgroundBytesPerRow = CVPixelBufferGetBytesPerRow(srcPixelsBackground);
    size_t outBytesPerRow = CVPixelBufferGetBytesPerRow(outPixels);
    const char *foregroundRow = CVPixelBufferGetBaseAddress(srcPixelsForeground);
    const char *backgroundRow = CVPixelBufferGetBaseAddress(srcPixelsBackground);
    char *outRow = CVPixelBufferGetBaseAddress(outPixels);
    for (size_t y = 0; y < height; ++y) {
        // Blend backgroundRow and foregroundRow by tween into outRow
        // (the per-pixel code was hand-waved in the demo).
        foregroundRow += foregroundBytesPerRow;
        backgroundRow += backgroundBytesPerRow;
        outRow += outBytesPerRow;
    }

    CVPixelBufferUnlockBaseAddress(srcPixelsForeground, kCVPixelBufferLock_ReadOnly);
    CVPixelBufferUnlockBaseAddress(srcPixelsBackground, kCVPixelBufferLock_ReadOnly);
    CVPixelBufferUnlockBaseAddress(outPixels, 0);

    // deliver output
    [request finishWithComposedVideoFrame:outPixels];
    CVBufferRelease(outPixels); // newPixelBuffer follows Create/Copy naming, so release it
}

@end


Tweening is the parameterisation of the transition from one state to another. For a dissolve transition, where the generated output starts with images from one video track and ends with images from another, the tween is an opacity ramp whose input is time.


The image above shows the two input video tracks and the opacity ramp. The image below shows the calculation of the tween value once you are 10% of the way through the transition. In this case the output video frame will display the first video at 90% opacity and the second video at 10%.
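The tween arithmetic above is just a linear interpolation, and can be sketched in plain C. The helper names below are my own, and the times are assumed to be already converted to seconds (as CMTimeGetSeconds would produce):

```c
#include <stdint.h>

/* Fraction of the way through the transition: 0 at the start, 1 at the end. */
static double dissolve_tween(double renderTime, double rangeStart, double rangeDuration)
{
    return (renderTime - rangeStart) / rangeDuration;
}

/* Blend one colour component: at tween == 0 the output is all background
   (the outgoing video), at tween == 1 it is all foreground (the incoming
   video). The +0.5 rounds to the nearest integer. */
static uint8_t blend_component(uint8_t background, uint8_t foreground, double tween)
{
    return (uint8_t)((1.0 - tween) * background + tween * foreground + 0.5);
}
```

At 10% of the way through, dissolve_tween returns 0.1, so blend_component weights the outgoing pixel at 90% and the incoming pixel at 10%, matching the figure.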



Properties of the AVVideoCompositionInstruction protocol help the compositor optimise performance.

@protocol AVVideoCompositionInstruction <NSObject>
@property (nonatomic, readonly) CMPersistentTrackID passthroughTrackID;
@property (nonatomic, readonly) NSArray *requiredSourceTrackIDs;
@property (nonatomic, readonly) BOOL containsTweening;
@end

By setting these values appropriately there are performance wins to be had.


Some instructions are simpler than others: they might take just one source and often not even change the frames. For example, in the frames leading up to a transition, the output frames are just the input frames from a particular track. If in the instruction you set passthroughTrackID to the ID of that track, the compositor will be bypassed.


Use requiredSourceTrackIDs to specify the required tracks when we do want the compositor to be called. If we have just a single track but want to modify the contents of its frames in some way, then requiredSourceTrackIDs will contain just that single track. Leaving requiredSourceTrackIDs set to nil means deliver all frames from all tracks.


Even if the source frames are the same (two static images, for example), a picture-in-picture effect where the smaller image moves within the bigger picture needs containsTweening set to YES, because the output varies over time: we have a time-extended source. If the smaller image doesn't move, leaving containsTweening as YES just re-renders identical output, so containsTweening should instead be set to NO; then, after the initial frame is rendered, the compositor can optimise by reusing the identical output.
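A custom instruction adopting the protocol might look like the sketch below. The class name, initializer and chosen return values are my own, for a dissolve between two tracks:

```objc
// Hypothetical dissolve instruction adopting AVVideoCompositionInstruction.
@interface MyDissolveInstruction : NSObject <AVVideoCompositionInstruction>
- (instancetype)initWithTimeRange:(CMTimeRange)timeRange
                   sourceTrackIDs:(NSArray *)sourceTrackIDs;
@end

@implementation MyDissolveInstruction {
    CMTimeRange _timeRange;
    NSArray *_sourceTrackIDs;
}

- (instancetype)initWithTimeRange:(CMTimeRange)timeRange
                   sourceTrackIDs:(NSArray *)sourceTrackIDs {
    if ((self = [super init])) {
        _timeRange = timeRange;
        _sourceTrackIDs = sourceTrackIDs;
    }
    return self;
}

- (CMTimeRange)timeRange { return _timeRange; }
- (NSArray *)requiredSourceTrackIDs { return _sourceTrackIDs; } // both tracks are needed
- (BOOL)containsTweening { return YES; }  // a dissolve varies over time
- (BOOL)enablePostProcessing { return NO; }
- (CMPersistentTrackID)passthroughTrackID {
    return kCMPersistentTrackID_Invalid;  // don't bypass the compositor
}
@end
```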

Pixel buffer formats

  • Performance hit converting sources
    • H.264 decodes to YUV 4:2:0 natively.
    • For best performance, work in YUV 4:2:0
  • Output format less critical, display can accept multiple formats, for example:
    • BGRA
    • YUV 4:2:0

The AVCustomEdit example code is available here.

Debugging Compositions

  • Common pitfalls
    • Gaps between segments
      • Results in black frames or hanging onto the last frame.
    • Misaligned track segments
      • Rounding errors when working with CMTime etc.
      • Results in a short gap between the end of one segment and the beginning of the next.
    • Misaligned layer instructions
      • Tracks/layers are rendered in the wrong order.
    • Misaligned opacity/audio ramps
      • Opacity/audio ramps over- or undershoot their final value.
    • Bogus layer transforms
      • Errors in your transformation matrix make layers disappear or fall outside the boundaries of the output frame.

Being able to view the structure of the composition is useful, and this is where AVCompositionDebugView comes in.

There is also the composition validation API, which you can adopt to receive callbacks when something in the video composition appears to be incorrect.
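Validation is driven by asking the AVVideoComposition whether it is valid for an asset, passing a delegate conforming to AVVideoCompositionValidationHandling. A minimal sketch (the delegate class name and logging are my own; only one of the callback methods is shown):

```objc
@interface MyValidationDelegate : NSObject <AVVideoCompositionValidationHandling>
@end

@implementation MyValidationDelegate

// Called when an instruction's time range is out of order, overlaps another
// instruction, or extends beyond the composition.
- (BOOL)videoComposition:(AVVideoComposition *)videoComposition
    shouldContinueValidatingAfterFindingInvalidTimeRangeInInstruction:
        (id<AVVideoCompositionInstruction>)videoCompositionInstruction {
    NSLog(@"Bad instruction time range: %@", videoCompositionInstruction);
    return YES;  // keep validating so every problem is reported
}

@end
```

You would then call isValidForAsset:timeRange:validationDelegate: on the video composition, passing the asset and an instance of this delegate, and fix whatever the callbacks report.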