Update: Nov 28 2011 – The OpenCV framework has been rebuilt using opencv svn revision 7017
Introduction
Hot on the heels of our last article, in which we showed you how to build an OpenCV framework for iOS, we are turning our attention to capturing live video and processing video frames with OpenCV. This is the foundation for augmented reality, the latest buzz topic in computer vision. The article is accompanied by a demo app that detects faces in a real-time video feed from your iOS device’s camera. You can check out the source code for the app at GitHub or follow the direct download link at the end of the article.
As shown in our last article, OpenCV supports video capture on iOS devices using the cv::VideoCapture class from the highgui module. Calling the grab and retrieve methods of this class captures a single video frame and returns it as a cv::Mat object for processing. However, the class is not optimized for processing live video:
- Each video frame is copied several times before being made available to your app for processing.
- You are required to 'pull' frames from cv::VideoCapture at a rate that you decide rather than being 'pushed' frames in real time as they become available.
- No video preview is supported. You are required to display frames manually in your UI.
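For comparison, a typical pull-model capture loop with cv::VideoCapture looks something like the sketch below; the structure is illustrative only and is not taken from the demo app.
// Pull-model capture with cv::VideoCapture (illustrative only; the demo app does not use this)
#include <opencv2/highgui/highgui.hpp>

cv::VideoCapture capture(0); // open the default camera
cv::Mat frame;
while (capture.isOpened()) {
    // We decide when to ask for a frame; frames are not pushed to us
    if (!capture.grab()) continue;
    capture.retrieve(frame); // decode the grabbed frame into a cv::Mat (another copy)
    // ... process 'frame' with OpenCV and draw the result in our own UI ...
}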
In designing image processing apps for iOS devices we recommend that you use OpenCV for what it excels at, namely image processing, but use standard iOS support for accessing hardware and implementing UI. It may be a philosophical standpoint, but we find that cross-platform layers such as OpenCV's highgui module always incur performance and design restrictions in trying to support the lowest common denominator. With that in mind, we have implemented a reusable view controller subclass (VideoCaptureViewController) that enables high-performance processing of live video using the video capture support provided by the AVFoundation framework. The controller automatically manages a video preview layer and throttles the rate at which video frames are supplied to your processing implementation to accommodate processing load. The components of the underlying AVFoundation video capture stack are also made available to you so that you can tweak behaviour to match your exact requirements.
The Video Capture View Controller
The AVFoundation video capture stack and video preview layer are conveniently wrapped up in the VideoCaptureViewController class provided with the demo source code. This class handles creation of the video capture stack, insertion of the video preview layer into the controller's view hierarchy and conversion of video frames to cv::Mat instances for processing with OpenCV. It also provides convenience methods for turning the iPhone 4's torch on and off, switching between the front and back cameras while capturing video, and displaying the current frames per second.
The details of how to set up the AVFoundation video capture stack are beyond the scope of this article and we refer you to the documentation from Apple and the canonical application sample AVCam. If you are interested in how the stack is created, however, then take a look at the implementation of the createCaptureSessionForCamera:qualityPreset:grayscale: method, which is called from viewDidLoad. There are a number of interesting aspects of the implementation, which we will go into next.
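In outline, that setup looks roughly like the sketch below. This is a simplification of the ideas described above rather than a copy of the demo code (the preset, queue name and error handling are placeholders), so consult createCaptureSessionForCamera:qualityPreset:grayscale: and Apple's AVCam sample for the real thing.
// Simplified sketch of an AVFoundation video capture stack (not the exact demo code)
// In real code the session would be stored in a retained property and released in dealloc
AVCaptureSession *session = [[AVCaptureSession alloc] init];
session.sessionPreset = AVCaptureSessionPreset640x480;

// Camera input
AVCaptureDevice *camera = [AVCaptureDevice defaultDeviceWithMediaType:AVMediaTypeVideo];
AVCaptureDeviceInput *input = [AVCaptureDeviceInput deviceInputWithDevice:camera error:nil];
[session addInput:input];

// Video data output delivering frames to a delegate on a private serial queue
AVCaptureVideoDataOutput *output = [[AVCaptureVideoDataOutput alloc] init];
output.alwaysDiscardsLateVideoFrames = YES; // drop frames if we fall behind
dispatch_queue_t queue = dispatch_queue_create("videoQueue", NULL);
[output setSampleBufferDelegate:self queue:queue];
[session addOutput:output];
dispatch_release(queue); // the output retains the queue
[output release];

// Preview layer inserted into the controller's view hierarchy
AVCaptureVideoPreviewLayer *previewLayer = [AVCaptureVideoPreviewLayer layerWithSession:session];
previewLayer.frame = self.view.bounds;
previewLayer.videoGravity = AVLayerVideoGravityResizeAspectFill;
[self.view.layer insertSublayer:previewLayer atIndex:0];

[session startRunning];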
Hardware acceleration of grayscale capture
For many image processing applications the first processing step is to reduce the full-color BGRA data received from the video hardware to a grayscale image, maximizing processing speed when color information is not required. With OpenCV, this is usually achieved using the cv::cvtColor function, which produces a single-channel image by calculating a weighted average of the R, G and B components of the original image. In VideoCaptureViewController we perform this conversion in hardware using a little trick and save processor cycles for the more interesting parts of your image processing pipeline.
If grayscale mode is enabled then the video format is set to kCVPixelFormatType_420YpCbCr8BiPlanarFullRange. The video hardware will then supply YUV-formatted video frames in which the Y channel contains the luminance data and the color information is encoded in the U and V chrominance channels. The luminance channel is used by the controller to create a single-channel grayscale image and the chrominance channels are ignored. Note that the video preview layer will still display the full-color video feed whether grayscale mode is enabled or not.
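A sketch of how this pixel format is requested on the video data output is shown below; videoOutput here stands for the AVCaptureVideoDataOutput created during session setup, and the BGRA variant is what full-color mode would use instead.
// Grayscale mode: ask for biplanar YUV frames so the Y plane can be used directly
videoOutput.videoSettings = [NSDictionary dictionaryWithObject:
    [NSNumber numberWithUnsignedInt:kCVPixelFormatType_420YpCbCr8BiPlanarFullRange]
    forKey:(id)kCVPixelBufferPixelFormatTypeKey];

// Full-color mode: ask for 32-bit BGRA frames instead
videoOutput.videoSettings = [NSDictionary dictionaryWithObject:
    [NSNumber numberWithUnsignedInt:kCVPixelFormatType_32BGRA]
    forKey:(id)kCVPixelBufferPixelFormatTypeKey];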
Processing video frames
VideoCaptureViewController implements the AVCaptureVideoDataOutputSampleBufferDelegate protocol and is set as the delegate for receiving video frames from AVFoundation via the captureOutput:didOutputSampleBuffer:fromConnection: method. This method takes the supplied sample buffer containing the video frame and creates a cv::Mat object. If grayscale mode is enabled then a single-channel cv::Mat is created; for full-color mode a BGRA-format cv::Mat is created. This cv::Mat object is then passed on to processFrame:videoRect:videoOrientation:, where the OpenCV heavy lifting is implemented. Note that no video data is copied here: the cv::Mat that is created points right into the hardware video buffer and must be processed before captureOutput:didOutputSampleBuffer:fromConnection: returns. If you need to keep references to video frames then use the cv::Mat::clone method to create a deep copy of the video data.
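The shape of that delegate method is sketched below. This is a reconstruction from the description above rather than the verbatim demo source, so treat the exact processFrame: signature and the locking details as assumptions and refer to VideoCaptureViewController.mm for the authoritative version.
// Sketch: wrap the hardware pixel buffer in a cv::Mat without copying any pixels
- (void)captureOutput:(AVCaptureOutput *)captureOutput
didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
       fromConnection:(AVCaptureConnection *)connection
{
    CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
    CVPixelBufferLockBaseAddress(pixelBuffer, 0);

    cv::Mat mat;
    if (CVPixelBufferGetPixelFormatType(pixelBuffer) == kCVPixelFormatType_420YpCbCr8BiPlanarFullRange) {
        // Grayscale: wrap the luminance (Y) plane as a single-channel matrix
        mat = cv::Mat((int)CVPixelBufferGetHeightOfPlane(pixelBuffer, 0),
                      (int)CVPixelBufferGetWidthOfPlane(pixelBuffer, 0),
                      CV_8UC1,
                      CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 0),
                      CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 0));
    } else {
        // Full color: wrap the whole buffer as a four-channel BGRA matrix
        mat = cv::Mat((int)CVPixelBufferGetHeight(pixelBuffer),
                      (int)CVPixelBufferGetWidth(pixelBuffer),
                      CV_8UC4,
                      CVPixelBufferGetBaseAddress(pixelBuffer),
                      CVPixelBufferGetBytesPerRow(pixelBuffer));
    }

    // 'mat' points into the hardware buffer, so it must be used (or cloned) before we return
    CGRect videoRect = CGRectMake(0.0f, 0.0f, mat.cols, mat.rows);
    [self processFrame:mat videoRect:videoRect videoOrientation:connection.videoOrientation];

    CVPixelBufferUnlockBaseAddress(pixelBuffer, 0);
}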
Note that captureOutput:didOutputSampleBuffer:fromConnection: is called on a private GCD queue created by the view controller. Your overridden processFrame:videoRect:videoOrientation: method is also called on this queue. If you need to update the UI based on your frame processing then you will need to use dispatch_sync or dispatch_async to dispatch those updates on the main application queue.
VideoCaptureViewController also monitors video frame timing information and uses it to calculate a running average of performance measured in frames per second. Set the showDebugInfo property of the controller to YES to display this information in an overlay on top of the video preview layer.
Video orientation and the video coordinate system
Video frames are supplied by the iOS device hardware in landscape orientation irrespective of the physical orientation of the device. Specifically, the front camera orientation is AVCaptureVideoOrientationLandscapeLeft (as if you were holding the device in landscape with the Home button on the left) and the back camera orientation is AVCaptureVideoOrientationLandscapeRight (as if you were holding the device in landscape with the Home button on the right). The video preview layer automatically rotates the video feed to the upright orientation and also mirrors the feed from the front camera to give the reflected image that we are used to seeing when we look in a mirror. The preview layer also scales the video according to its current videoGravity mode: either stretching the video to fill its full bounds or fitting the video while maintaining the aspect ratio.
All these transformations create a problem when we need to map from a coordinate in the original video frame to the corresponding coordinate in the view as seen by the user and vice versa. For instance, you may have the location of a feature detected in the video frame and need to draw a marker at the corresponding position in the view. Or a user may have tapped on the view and you need to convert that view coordinate into the corresponding coordinate in the video frame.
All this complexity is handled in -[VideoCaptureViewController affineTransformForVideoRect:orientation:], which creates an affine transform that you can use to convert CGPoints and CGRects between the video coordinate system and the view coordinate system. If you need to convert in the opposite direction then create the inverse transform using the CGAffineTransformInvert function. If you are not sure what an affine transform is then just look at the following code snippet to see how one is used to convert CGPoints and CGRects between different coordinate systems.
// Create the affine transform for converting from the video coordinate system to the view coordinate system
CGAffineTransform t = [self affineTransformForVideoRect:videoRect orientation:videoOrientation];
// Convert CGPoint from video coordinate system to view coordinate system
viewPoint = CGPointApplyAffineTransform(videoPoint, t);
// Convert CGRect from video coordinate system to view coordinate system
viewRect = CGRectApplyAffineTransform(videoRect, t);
// Create the inverse transform for converting from the view coordinate system to the video coordinate system
CGAffineTransform invT = CGAffineTransformInvert(t);
// Convert CGPoint and CGRect from view coordinate system to video coordinate system
videoPoint = CGPointApplyAffineTransform(viewPoint, invT);
videoRect = CGRectApplyAffineTransform(viewRect, invT);
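As a usage example, a hit-test in a subclass might map a tap back into video frame coordinates like this (a sketch: lastVideoRect and lastVideoOrientation are assumed ivars in which the subclass has cached the values most recently passed to processFrame:videoRect:videoOrientation:):
// Sketch: convert a tap in the view into video frame coordinates
- (void)touchesEnded:(NSSet *)touches withEvent:(UIEvent *)event
{
    CGPoint viewPoint = [[touches anyObject] locationInView:self.view];

    CGAffineTransform t = [self affineTransformForVideoRect:lastVideoRect
                                                orientation:lastVideoOrientation];
    CGAffineTransform invT = CGAffineTransformInvert(t);

    CGPoint videoPoint = CGPointApplyAffineTransform(viewPoint, invT);
    NSLog(@"Tapped video pixel: %@", NSStringFromCGPoint(videoPoint));
}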
Using VideoCaptureViewController in your own projects
VideoCaptureViewController is designed to be reusable in your own projects by subclassing it, just as you would subclass Apple-provided controllers like UIViewController and UITableViewController. Add the header and implementation files (VideoCaptureViewController.h and VideoCaptureViewController.mm) to your project and modify your application-specific view controller(s) to derive from VideoCaptureViewController instead of UIViewController. If you want to add additional controls over the top of the video preview you can use Interface Builder and connect up IBOutlets as usual. See the demo app for how this is done to overlay the video preview with UIButtons. You implement your application-specific video processing by overriding the processFrame:videoRect:videoOrientation: method in your controller. Which leads us to face tracking…
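Before moving on to the demo, here is what a minimal subclass might look like. This is a sketch only: MyViewController is a placeholder name and the parameter types of processFrame:videoRect:videoOrientation: are assumptions based on the description above, so check VideoCaptureViewController.h for the exact signature.
// MyViewController.h
#import "VideoCaptureViewController.h"

@interface MyViewController : VideoCaptureViewController
@end

// MyViewController.mm
#import "MyViewController.h"

@implementation MyViewController

// Called on the private video-processing queue for every captured frame
- (void)processFrame:(cv::Mat &)mat videoRect:(CGRect)rect videoOrientation:(AVCaptureVideoOrientation)orientation
{
    // Run OpenCV processing on 'mat' here, then dispatch any UI updates
    // to the main queue with dispatch_sync or dispatch_async.
}

@end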
Face tracking
Face tracking seems to be the ‘Hello World’ of computer vision and, judging by the number of questions about it on StackOverflow, many developers are looking for an iOS implementation. We couldn’t resist choosing it as the subject for our demo app either. The implementation can be found in the DemoVideoCaptureViewController class. This is a subclass of VideoCaptureViewController and, as described above, we’ve added our app-specific processing code by overriding the processFrame:videoRect:videoOrientation: method of the base class. We have also added three UIButton controls in Interface Builder to demonstrate how to extend the user interface. These buttons allow you to turn the iPhone 4 torch on and off, switch between the front and back cameras and toggle the frames-per-second display.
Processing the video frames
The VideoCaptureViewController base class handles capturing frames and wrapping them up as cv::Mat instances. Each frame is supplied to our app-specific subclass via the processFrame:videoRect:videoOrientation: method, which is overridden to implement the detection.
The face detection is performed using OpenCV’s CascadeClassifier and the ‘haarcascade_frontalface_alt2’ cascade provided with the OpenCV distribution. The details of the detection are beyond the scope of this article but you can find lots of information about the Viola-Jones method and Haar-like features on Wikipedia.
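Loading the cascade from the application bundle is straightforward; a sketch, assuming the XML file has been added to the project as a bundle resource, follows.
// Sketch: load the bundled Haar cascade into a cv::CascadeClassifier
NSString *cascadePath = [[NSBundle mainBundle] pathForResource:@"haarcascade_frontalface_alt2"
                                                        ofType:@"xml"];
cv::CascadeClassifier faceDetector;
if (!faceDetector.load([cascadePath UTF8String])) {
    NSLog(@"Failed to load face detection cascade");
}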
The first task is to rotate the video frame from the hardware-supplied landscape orientation to portrait orientation. We do this to match the orientation of the video preview layer and also to allow OpenCV’s CascadeClassifier to operate as it will only detect upright features in an image. Using this technique, the app can only detect faces when the device is held in the portrait orientation. Alternatively, we could have rotated the video frame based on the current physical orientation of the device to allow faces to be detected when the device is held in any orientation.
The rotation is performed quickly by combining a cv::transpose, which swaps the x axis and y axis of a matrix, and a cv::flip, which mirrors a matrix about a specified axis. Video frames from the front camera need to be mirrored to match the video preview display so we can perform the rotation with just a transpose and no flip.
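In code, the rotation amounts to something like the sketch below, where mat is the frame passed to processFrame:videoRect:videoOrientation:. The mapping from orientation to flip code is an assumption pieced together from the description above, so the demo source remains the authoritative version.
// Sketch: rotate the landscape frame to portrait with a transpose and (optionally) a flip
cv::Mat transposed, rotated;
cv::transpose(mat, transposed);
if (videoOrientation == AVCaptureVideoOrientationLandscapeRight) {
    // Back camera: complete the 90-degree rotation by flipping about the vertical axis
    cv::flip(transposed, rotated, 1);
} else {
    // Front camera: skipping the flip leaves the frame mirrored, matching the preview
    rotated = transposed;
}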
Once the video frame is in the correct orientation, it is passed to the CascadeClassifier for detection. Detected faces are returned as an STL vector of rectangles. The classification is run using the CV_HAAR_FIND_BIGGEST_OBJECT flag, which instructs the classifier to look for faces at decreasing sizes and stop when it finds the first face. Removing this flag (it is defined at the start of DemoVideoCaptureViewController.mm) instead instructs the classifier to start small, look for faces at increasing sizes and return all the faces it detects in the frame.
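The detection call is then roughly of the form below; the scale factor, minimum-neighbour count and minimum size shown here are illustrative values, not necessarily those used by the demo app.
// Sketch: detect faces in the upright grayscale frame
std::vector<cv::Rect> faces;
faceDetector.detectMultiScale(rotated, faces,
                              1.1,                         // scale step between detection passes
                              2,                           // minimum neighbouring detections to accept
                              CV_HAAR_FIND_BIGGEST_OBJECT, // stop after the largest face is found
                              cv::Size(60, 60));           // ignore faces smaller than 60x60 pixels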
The STL vector of face rectangles (if any) is passed to the displayFaces:forVideoRect:videoOrientation: method for display. We use GCD’s dispatch_sync here to dispatch the call on the main application thread. Remember that processFrame:videoRect:videoOrientation: is called on our private video processing thread but UI updates must be performed on the main application thread. We use dispatch_sync rather than dispatch_async so that the video processing thread is blocked while the UI updates are being performed on the main thread. This causes AVFoundation to discard video frames automatically while our UI updates are taking place and ensures that we are not processing video frames faster than we can display the results. In practice, processing the frame will take longer than any UI update associated with the frame, but it’s worth bearing in mind if your app is doing simple processing accompanied by lengthy UI updates.
// Dispatch updating of face markers to main queue
dispatch_sync(dispatch_get_main_queue(), ^{
[self displayFaces:faces
forVideoRect:videoRect
videoOrientation:videoOrientation];
});
Displaying the face markers
For each detected face, the method creates an empty CALayer of the appropriate size with a 10 pixel red border and adds it into the layer hierarchy above the video preview layer. These ‘FaceLayers’ are re-used from frame to frame and repositioned within a CATransaction block to disable the default layer animation. This technique gives us a high-performance method for adding markers without having to do any drawing.
// Create a new feature marker layer
featureLayer = [[CALayer alloc] init];
featureLayer.name = @"FaceLayer";
featureLayer.borderColor = [[UIColor redColor] CGColor];
featureLayer.borderWidth = 10.0f;
[self.view.layer addSublayer:featureLayer];
[featureLayer release];
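Repositioning a re-used FaceLayer for the next frame then looks something like the sketch below, where faceRect is assumed to have already been transformed into view coordinates as described next.
// Reposition an existing FaceLayer without triggering the default implicit animation
[CATransaction begin];
[CATransaction setDisableActions:YES];
featureLayer.frame = faceRect;
[CATransaction commit];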
The face rectangles passed to this method are in the video frame coordinate space. For them to line up correctly with the video preview they need to be transformed into the view’s coordinate space. To do this we create a CGAffineTransform using the affineTransformForVideoRect:orientation: method of the VideoCaptureViewController class and use it to transform each rectangle in turn.
The displayFaces:forVideoRect:videoOrientation: method supports display of multiple face markers even though, with the current settings, OpenCV’s CascadeClassifier will return the single largest face that it detects. Remove the CV_HAAR_FIND_BIGGEST_OBJECT flag at the start of DemoVideoCaptureViewController.mm to enable detection of multiple faces in a frame.
Performance
On an iPhone 4 using the CV_HAAR_FIND_BIGGEST_OBJECT option the demo app achieves up to 4 fps when a face is in the frame. This drops to around 1.5 fps when no face is present. Without the CV_HAAR_FIND_BIGGEST_OBJECT option, multiple faces can be detected in a frame at around 1.8 fps. Note that the live video preview always runs at the full 30 fps irrespective of the processing frame rate, and processFrame:videoRect:videoOrientation: is called at 30 fps if you only perform minimal processing.
The face detection could obviously be optimized to achieve a faster effective frame rate and this has been discussed at length elsewhere. However, the purpose of this article is to demonstrate how to efficiently capture live video on iOS devices. What you do with those frames and how you process them is really up to you. We look forward to seeing all your augmented reality apps in the App Store!
Links to demo project source code
Git – https://github.com/aptogo/FaceTracker
Download zip – https://github.com/aptogo/FaceTracker/zipball/master