Posts Tagged avfoundation

The Pleasure and Pain of AVAssetWriterInputPixelBufferAdaptor

For a class with such a small interface it seems remarkable that I would feel that it deserves a blog post all on its own. But a mixture of appreciation for what it does, and the pain I have had working with it requires catharsis.

AVAssetWriterInputPixelBufferAdaptor objects are used for taking in pixel data and providing it to a AVAssetWriterInput in a suitable format for writing that pixel data into a movie file.

Read the rest of this entry »


AV Foundation editing movie file content

Firstly a link to a new useful Technote for AV Foundation released 1 December 2014.

TechNote 2404 – A short note on added AV Foundation API in Yosemite, specifically see section on AVSampleCursor and AVSampleBufferGenerator

Now onto WWDC 2013 session 612 Advanced Editing with AV Foundation on this page

Talk Overview

  • Custom Video compositing
    • Existing architecture
    • New custom video compositing
    • Choosing pixel formats
    • Tweening
    • Performance
  • Debugging compositions

    • Common pitfalls

Existing Architecture

AV Foundation editing today

  • Available since iOS 4.0 and OS X Lion
  • Used in video editing apps from Apple and in the store
  • Video editing
    • Temporal composition
    • Video composition
    • Audio mixing

Custom Video Compositor

What is a Video Compositor?

  • Unit of video mixing code
    • A chunk of video mixing code takes multiple sources.
  • Receives multiple source frames
  • Blends or transforms pixels
  • Delivers single output frame
  • Part of the composition architecture

What is a Composition model


Instruction objects in a AVVideoComposition


The Video compositor takes multiple source frames in and produces a single frame out.

For example we can encode a dissolve as a property of an instruction. For example an opacity ramp for 1 down to 0.

New Custom Video Compositing

As of Mavericks there is a new custom compositing API. You can replace the built in compositor with your own compositor. Instructions with mixing parameters are bundled up together with the source frames into a request object. You’ll be implementing the protocol @AVVideoCompositing which receives the new object AVAsynchronousVideoCompositionRequest and also implementing the protocol AVVideoCompositionInstruction.

func startVideoCompositionRequest(_ asyncVideoCompositionRequest: AVAsynchronousVideoCompositionRequest!)

Once you have rendered the frame you deliver it with

func finishWithComposedVideoFrame(_ composedVideoFrame: CVPixelBuffer!)

But you can also finish with one of:

func finishCancelledRequest()
func finishWithError(_ error: NSError!)

Choosing Pixel Formats

  • Source Pixel Formats – small subset
    • YUV 8-bit 4:2:0
    • YUV 8-bit 4:4:4
    • YUV 10-bit 4:2:2
    • YUV 10-bit 4:4:4
    • RGB 24-bit
    • BGRA 32-bit
    • ARGB 32-bit
    • ABGR 32-bit

When decoding H264 Video your source pixel format is typically YUV 8-bit 4:2:0

You may not be able to deal with that format or whatever the native format of the source pixels are that you require. You can specify what format you require by your custom video compositor using method: func sourcePixelBufferAttributes which should return a dictionary. The key kCVPixelBufferPixelFormatTypeKey should be specified and it takes an array of possible pixel formats and if you want the compositor to work with a CoreAnimation video layer then you should provide a single entry in the array with a value kCVPixelFormatType_32BGRA.

This will cause the source frames to be converted into the format required by your custom compositor.

Output Pixel Formats

For the output pixel formats there is also a method requiredPixelBufferAttributesForRenderContext where you specify the formats your custom renderer can provide.

To get hold of a new empty frame to render into, we go back to the request object ask it for the render context which contains information about the aspect ratio and size that we are rendering to and also the required pixel format. We ask for a new pixel buffer which comes from a managed pool and we can then render into it to get our dissolve.

The Hello World equivalent for a custom compositor

@interface MyCompositor1 : NSObject<AVVideoCompositing>

// Sources as BGRA please
-(NSDictionary *)sourcePixelBufferAttributes {
    return @{ (id)kCVPixelBufferPixelFormatTypeKey :
            @[ @(kCVPixelFormatType_32BGRA) ] }

// We'll output BGRA
-(NSDictionary *)requiredPixelBufferAttributesForRenderContext {
    return @{ (id)kCVPixelBufferPixelFormatTypeKey :
            @[ @(kCVPixelFormatType_32BGRA) ] }

// Render a frame - the action happens here on receiving a request object.
-(void)startVideoCompositionRequest:(AVAsynchronousVideoCompositionRequest *)request {
    if (request.sourceTrackIDs count] != 2) {

    // There'll be an attempt to back the pixel buffer with an IOSurface which means
    // that they will be in GPU memory.
    CVPixelBufferRef srcPixelsBackground = [request sourceFrameByTrackID:[request.sourceTrackIDs[0] intValue]];
    CVPixelBufferRef srcPixelsForeground = [request sourceFrameByTrackID:[request.sourceTrackIDs[1] intValue]];
    CVPixelBufferRef outPixels = [[request renderContext] newPixelBuffer];

    // render - this is really only in its own scope so that code folding is poss in demo.
        // However here because we want to manipulate pixels ourselves we will lock
        // the pixel buffer base address which I think makes sure the pixel data is in
        // main memory so that we can access it.
        CVPixelBufferLockBaseAddress(srcPixelsForeground, kCVPixelBufferLock_ReadOnly);
        CVPixelBufferLockBaseAddress(srcPixelsBackground, kCVPixelBufferLock_ReadOnly);
        CVPixelBufferLockBaseAddress(outPixels, 0);

        // Calculate the tween block.
        float tween;
        CMTime renderTime = request.compositionTime;
        CMTimeRange range = request.videoCompositionInstruction.timeRange;
        CMTime elapsed = CMTimeSubtract(renderTime, range.start);
        tween = CMTimeGetSeconds(elapsed) / CMTimeGetSeconds(range.duration)

        size_t height = CVPixelBufferGetHeight(srcPixelsBackground, kCVPixelBufferLock_ReadOnly);
        size_t foregroundBytesPerRow = CVPixelBufferGetBytesPerRow(srcPixelsForeground, kCVPixelBufferLock_ReadOnly);
        size_t backgroundBytesPerRow = CVPixelBufferGetBytesPerRow(srcPixelsBackground, kCVPixelBufferLock_ReadOnly);
        outBytesPerRow = CVPixelBufferGetBaseAddress(outPixels);
        const char *foregroundRow = CVPixelBufferGetBaseAddress(srcPixelsForeground)
        const char *backgroundRow = CVPixelBufferGetBaseAddress(srcPixelsBackground)
        outRow = CVPixelBufferGetBaseAddress(outputPixels)
        for (size_t y = 0; y < height; ++y)
            // Some hacky code for copy bytes from one buffer to another.
        CVPixelBufferUnlockBaseAddress(srcPixelsForeground, kCVPixelBufferLock_ReadOnly);
        CVPixelBufferUnlockBaseAddress(srcPixelsBackground, kCVPixelBufferLock_ReadOnly);
        CVPixelBufferUnlockBaseAddress(outPixels, 0);

    // deliver output
    [request finishWithcomposedVideoFrame:outPixels];


Tweening is the parameterisation of the transition from one state to another. In the case where you are transitioning so that the new video track generated images start with images from one video track and end with images from another, for a dissolve transition the tween is an opacity ramp where the input for the opacity ramp is time.


The image above shows the two input video tracks and the opacity ramp. The image below shows the calculation of the tween value once you are 10% of the way through the transition. In this case the output video frame will display the first video at 90% opacity and the second video at 10%.



Instruction properties for the AVVideoCompositionInstruction protocol help with the compositor optimising performance.

@protocol AVVideoCompositionInstruction<NSObject>
    @property CMPersistentTrackID passthroughTrackID;
    @property NSArray *requiredSourceTrackIDs;
    @property BOOL containsTweening;

By setting these values appropriately there are performance wins to be had.


Some instructions are simpler than others, they might just take one source and often not even change the frames, for example in the frames leading up to a transition, the output frames are just the input frames from a particular track. In the instruction if you set the passthroughTrackID to the id of a particular track then the compositor will be bypassed.


Use this to specify the required tracks and that we do want the compositor to be called. If we have just a single track but we want to modify the contents of the frame in some then requiredSourceTrackIDs will contain just the single track. If you leave requiredSourceTrackIDs set to nil then that means deliver all frames from all tracks.


Even if source frames are the same, two static images for example but if we want to have a picture in a picture effect where the smaller image moves within the bigger picture then containsTweening needs to be set to YES. We have time extended source. If the smaller image doesn’t move then if we leave containsTweening to be YES then we are just re-rendering identical output so instead containsTweening should be set to NO. Then after the initial frame is rendered the compositor can optimise by just reusing the identical output.

Pixel buffer formats

  • Performance hit converting sources
    • H.264 decodes to YUV 4:2:0 natively.
    • For best performance, work in YUV 4:2:0
  • Output format less critical, display can accept multiple formats for example:

    • BGRA
    • YUV 4:2:0

The AVCustomEdit example code is available here.

Debugging Compositions

  • Common pitfalls
    • Gaps between segments
      • Results in black frames or hanging onto the last frame.
    • Misaligned track segments

      • Rounding errors when working with CMTime etc.
      • Results in a short gap between end of one segment & beginning of next.
    • Misaligned layer instructions

      • Track/Layers are rendered in wrong order.
    • Misaligned opacity/audio ramps

      • Opacity/Audio ramps over or undershoot their final value.
    • Bogus layer transforms

      • Errors in your transformation matrix so layers disappear
      • Outside boundaries of output frame

Being able to view the structure of the composition is useful and this is where AVCompositionDebugView comes in.

There is also the composition validation API which you can implement. You will receive callbacks when something appears to not be correct in the video composition.


Image generation performance

I’m wanting random access to individual movie frames so I’m using the AVAssetImageGenerator class but for this part of the project using generateCGImagesAsynchronously is not appropriate. Now clearly performance is not the crucial component here but at the same time you don’t want to do something that is stupidly slow.

I’d like to not have to hold onto a AVAssetImageGenerator object to use each time I need an image but just create one at the time a image is requested. So I thought I’d find out the penalty of creating a AVAssetImageGenerator object each time.

To compare the performance I added some performance tests and ran those tests on my i7 MBP with an SSD and on my iPad Mini 2. I’ve confirmed that the images are generated. See code at end.

On my iPad Mini 2 the measure block in performance test 1 took between 0.25 and 0.45 seconds to run. Most results clustering around 0.45 seconds. It was the second run that returned the 0.25 second result. Performance test 2 on the iPad Mini 2 was much more consistent with times ranging between 0.5 and 0.52 seconds. But reversing the order in which the tests run reverses these results. I’m not sure what to think about this, but in relation to what I’m testing for I feel comfortable that the cost of creating an AVAssetImageGenerator object before generating an image is minimal in comparison to generating the CGImage.

Strangely my MBP pro is slower, but doesn’t have the variation observed on the iPad. The measure block in both performance tests take 1.1 seconds.

Whatever performance difference there is in keeping a AVAssetImageGenerator object around or not is inconsequential.

    func testAVAssetImageGeneratorPerformance1() {
        let options = [
            AVURLAssetPreferPreciseDurationAndTimingKey: true,
        let asset = AVURLAsset(URL: movieURL, options: options)!
        self.measureBlock() {
            let generator = AVAssetImageGenerator(asset: asset)
            var actualTime:CMTime = CMTimeMake(0, 600)
            for i in 0..<10 {
                let image = generator.copyCGImageAtTime(
                    CMTimeMake(i * 600 + 30, 600),
                    actualTime: &actualTime, error: nil)
    func functionToTestPerformance(#movieAsset:AVURLAsset, index:Int) -> Void {
        let generator = AVAssetImageGenerator(asset: movieAsset)
        var actualTime:CMTime = CMTimeMake(0, 600)
        let image = generator.copyCGImageAtTime(
            CMTimeMake(index * 600 + 30, 600),
            actualTime: &actualTime, error: nil)
    func testAVAssetImageGeneratorPerformance2() {
        let options = [
            AVURLAssetPreferPreciseDurationAndTimingKey: true,
        let asset = AVURLAsset(URL: movieURL, options: options)!
        self.measureBlock() {
            for i in 0..<10 {
                self.functionToTestPerformance(movieAsset: asset, index: i)

Tags: ,

Getting tracks from an AVAsset

I’ve been playing around with the AVAsset AVFoundation API.

The AVAsset object is at the core of representing an imported movie. An AVAssetTrack is AVFoundation’s representation of a track in a movie. There are multiple ways to get AVAssetTracks from an AVAsset.

You can get a list of all the tracks:

let movie:AVAsset = ...
let tracks:[AVAssetTrack] = movie.tracks

You can get a list of tracks with a specific characteristic, for example a visual characteristic:

let movie:AVAsset = ...
let tracks:[AVAssetTrack] = movie.tracksWithMediaCharacteristic(AVMediaCharacteristicVisual)

Or you can get a list of tracks which have a specific media type, for example audio:

let movie:AVAsset = ...
let tracks:[AVAssetTrack] = movie.tracksWithMediaType(AVMediaTypeAudio)

You can obtain a single AVAssetTrack object if you know it’s persistent track identifier value.

The persistent track identifier is of type CMPersitentTrackID which is 32 bit integer typedef and the invalid track reference kCMPersistentTrackID_Invalid is an anonymous enum with value 0.

Unfortunately the only way to get the track id of a track in an imported movie is querying a AVAssetTrack object so the persistent track id is useful when later on you want to reference a track that you have previously identified.

From what I understand the AVAssetTrack objects are fairly lightweight so keeping a list of AVAssetTrack objects is not going to be too much of a drain, but you might still prefer to keep a list of persistent track id values to request a AVAssetTrack object when you need it rather than holding onto a reference to an AVAssetTrack object.

To get a track using the tracks persistent identifier

let track:AVAssetTrack = movie.trackWithTrackID(2)

Tracks have segments and a segment specifies when one bit of content in a track starts and finishes based on a within the time range of a track. Each segment contains a time mapping between the source and target. You can get a list of all the segments in a track and this will often be a list of 1 segment which lasts the full length of the track but this is not necessarily the case.

To get a list of segments from a track:

let segments = track.segments

You can get the segment that corresponds to a specific track time:

let trackTime = CMTimeMake(60000, 600)
let segment = track.segmentForTrackTime(trackTime)

I’ve created a gist which is a very simple command line tool written in swift demonstrating this blog post.


SSD versus HDD, Movie Frame Grabs and the importance of profiling

There was an e-mail to Apple’s cocoa-dev e-mail list that provoked a bit of discussion. The discussion thread starts with this e-mail to cocoa dev by Trygve.

Basically Trygve was wanting to get better performance from his code for taking frame grabs from movies and drawing those frame grabs as thumbnails to what I call a cover sheet. Trygve was using NSImage to do the drawing and was complaining that based on profiling his code was spending 90% of the time in a method called drawInRect.

Read the rest of this entry »

Tags: , , , , , , , ,