Using FPGAs for DSP Image Processing

R. Williams, Hunt Engineering

Original source: http://www.fpgajournal.com/articles/imaging_hunt.htm

Abstract - This article looks at the most commonly used image processing functions, which turn out to fall into three distinct categories. It then examines how FPGAs can perform those functions and compares this approach with using a conventional processor. Finally, it considers how such an approach fits into a real image processing system, from image acquisition through processing to the presentation of results, and shows how a modular approach such as that of HERON real-time systems can quickly deliver an FPGA-based image processing solution.

The main goal of image processing is to create systems that can inspect objects and make judgements about them at rates many times faster than a skilled human observer can manage. Naturally, the first step in building such a system is to identify the imaging functions that allow a computer-based system to behave like a trained human operator. With this achieved, the emphasis shifts to making the system run faster, and to do that you need to find the biggest performance bottleneck in the system and remove it.

For any imaging system of average complexity, the biggest performance bottleneck is the time taken to process each captured image. The simple solution is to use a more advanced processor to implement the algorithms: the faster the processor, the faster the production line. An alternative is dedicated hardware built specially for the job, which has traditionally been a far more expensive option. A third solution is programmable logic in the form of Field Programmable Gate Arrays (FPGAs).

Figure 1: The imaging framework that is provided by HUNT ENGINEERING to demonstrate the use of FPGA image processing at frame rate.

Application example

Visiglas SA is a HUNT ENGINEERING customer that uses DSP-based boards to inspect glass containers. Its systems are successfully installed all over the world, inspecting hundreds of objects per minute. With the increased processing speed offered by the FPGA-based modules, the systems can be improved to process higher-resolution images, allowing better detection of faults, and to handle more objects per minute, allowing the production line speed to be increased.

The Maths of Image Processing

Looking at the kinds of operation performed on images, we see that a small number of techniques make up the majority of the processing. Typically the processing involves applying the same repetitive function to each pixel in the image, creating a new output image in the process.

Figure 2: A real world example of where image processing is used – FPGA processing can significantly increase performance.

The bulk of image processing can be split into three main types of operation. In the first type, one fixed-coefficient operation is performed identically on each pixel in the image. In the second type there are two input images rather than one; the maths performed may be the same as in the first type, but the operation combines the pixels that sit at the same position in each image. The third type is neighbourhood processing, an example being convolution. Here there is only one input frame, and the result created for each pixel location depends on a window of pixels centred at that location.
What can be seen from these operations is that although the exact mathematical operator may vary, there is a high degree of repetition of that processing across the entire image. This kind of processing is ideally suited to a hardware pipeline that performs the same fixed mathematical operation over and over on a stream of data.
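
To make the first two categories concrete, the following C fragment is a minimal sketch of them (it is not taken from the HUNT ENGINEERING imaging libraries; the image dimensions, function names and parameter values are illustrative assumptions). The first function applies one fixed operation identically to every pixel; the second combines the pixels that sit at the same position in two input frames.

    /* Illustrative sketch of the first two operation types on an 8-bit
     * monochrome image of WIDTH x HEIGHT pixels. Names and values here
     * are assumptions for illustration only.                            */
    #include <stdint.h>

    #define WIDTH  640
    #define HEIGHT 480

    /* Type 1: the same fixed-coefficient operation applied to every pixel,
     * here a simple threshold producing a binary output image.            */
    void threshold_image(const uint8_t *in, uint8_t *out, uint8_t level)
    {
        for (int i = 0; i < WIDTH * HEIGHT; i++)
            out[i] = (in[i] > level) ? 255 : 0;
    }

    /* Type 2: an operation between two input images, combining the pixels
     * at the same position in each frame, e.g. subtracting a stored
     * reference frame from the live frame.                                */
    void subtract_reference(const uint8_t *live, const uint8_t *ref,
                            uint8_t *out)
    {
        for (int i = 0; i < WIDTH * HEIGHT; i++) {
            int diff = (int)live[i] - (int)ref[i];
            out[i] = (uint8_t)(diff < 0 ? 0 : diff);   /* clamp at zero */
        }
    }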

DSP versus FPGA

For a Digital Signal Processor (DSP) to process one frame of data it will have to fetch the data to be processed, perform the required mathematical operation, and then store the result back to memory. For an average sized image it is likely that the whole frame may be too large to be stored in on-chip memory and will therefore need to be fetched from and stored back to external memory. This process will add to the total number of cycles involved in processing each pixel of the image.

In addition to this memory overhead several cycles may be required by the CPU just to perform the mathematical operations that have been specified. Add to this the possibility of processor branches to handle interrupts and servicing of other high priority threads and the overall data rate drops significantly. The end result is that many image-processing functions tend to require several processor clock cycles per pixel to complete.

As the majority of image processing breaks down into highly repetitive tasks, FPGAs present a very interesting alternative to the DSP. What matters when using an FPGA is that the data rate through the FPGA is higher than through a DSP. This can easily be achieved, as the individual logic cells of an FPGA map well onto the individual mathematical steps involved in image processing.

FPGAs such as the Xilinx Virtex-II series provide a large two-dimensional array of logic blocks where each block contains several flip-flops and look-up-tables capable of implementing many logic functions. In addition, there are also dedicated resources for multiplication and memory storage that can be used to further improve performance. Through the use of Virtex-II FPGAs we can implement image-processing tasks at very high data rates, rates which reach hundreds of MHz. These functions can be directly performed on a stream of camera data as it arrives without introducing any extra processing delay, significantly reducing and in some cases removing the performance bottleneck that currently exists.

In particular, the more complex functions such as convolution map very successfully to FPGAs. When convolving an image, a window of pixels is combined with a mask in which the individual locations are weighted according to a set of previously defined coefficients. For each position of the window, every pixel is multiplied by its respective coefficient. The final result is then scaled to produce a single output pixel for the centre location of the window.

In essence, the whole convolution process is a matrix multiplication and as such requires several multiplications to be performed for each pixel. The exact number of multipliers required depends on the size of the convolution window: a 3x3 kernel (window) needs 9 multipliers and a 5x5 kernel needs 25.
If a conventional DSP were used, the overall performance would be limited by the number of multiplications that could be done in parallel. In practice a DSP will require several clock cycles to perform all of the multiplications and additions needed to calculate a single pixel. The FPGA, on the other hand, can implement as many multipliers as are necessary to calculate one pixel at the full input data rate, whether the convolution uses a 3x3 kernel or a larger 5x5. For example, the one-million-gate Virtex-II provides 40 dedicated multipliers, and the eight-million-gate part provides 168. By mapping convolution to FPGAs that already provide dedicated multipliers among their sea of gates, it becomes very easy to build a processing pipeline that can convolve at very high data rates.
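
As a point of reference for the arithmetic involved, the C fragment below is a minimal software sketch of a 3x3 convolution (it is not the FPGA implementation described above, and the function name, image size and scaling scheme are assumptions). The nine multiply-accumulates in the inner loops are exactly the operations that the FPGA can map onto nine dedicated multipliers and evaluate in parallel, producing one output pixel per clock.

    /* Illustrative 3x3 convolution in C. A DSP executes the nine
     * multiply-accumulates below sequentially for every output pixel;
     * an FPGA can perform them in parallel using dedicated multipliers. */
    #include <stdint.h>

    #define WIDTH  640
    #define HEIGHT 480

    void convolve3x3(const uint8_t *in, uint8_t *out,
                     const int16_t kernel[3][3], int shift)
    {
        for (int y = 1; y < HEIGHT - 1; y++) {
            for (int x = 1; x < WIDTH - 1; x++) {
                int32_t acc = 0;
                for (int ky = -1; ky <= 1; ky++)        /* 3x3 window of   */
                    for (int kx = -1; kx <= 1; kx++)    /* nine multiplies */
                        acc += kernel[ky + 1][kx + 1] *
                               in[(y + ky) * WIDTH + (x + kx)];
                if (acc < 0)   acc = 0;     /* clamp negative results       */
                acc >>= shift;              /* scale back to the 8-bit range */
                if (acc > 255) acc = 255;
                out[y * WIDTH + x] = (uint8_t)acc;
            }
        }
    }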

A Role for the DSP?

Although a large proportion of image processing algorithms are simply highly repetitive processes, there is still a role for the DSP. In a system that benefits from the performance advantages of the FPGA there is often a point in the data flow where a decision has to be made, and this decision will often take the form of if-then-else logic rather than a pixel-by-pixel iteration.

For control loops and complex branches in operation, the DSP can still prove a highly effective tool. Implementing equivalent logic in an FPGA can quickly eat up the available gates and reduce the overall data rate, so a compromise may be necessary. The simple solution is to use both types of resource in a single system: the high-data-rate FPGA acts as a data-reducing engine, feeding results downstream to a DSP that makes the accept/reject, pass/fail decision for the overall system.
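
The sketch below illustrates the DSP's side of such a split in C. It assumes the FPGA has already reduced each frame to a handful of measurements; the structure and the threshold values are hypothetical, not part of the HUNT ENGINEERING libraries. The point is simply that this branching accept/reject logic is far more natural on a DSP than in FPGA fabric.

    /* Hypothetical per-frame decision logic running on the DSP, applied
     * to results already reduced by the FPGA processing pipeline.       */
    #include <stdbool.h>

    typedef struct {
        int   blob_count;      /* number of defect candidates found       */
        int   largest_area;    /* area of the largest candidate, pixels   */
        float mean_intensity;  /* average grey level of the inspected ROI */
    } frame_result_t;

    bool accept_object(const frame_result_t *r)
    {
        if (r->blob_count == 0)
            return true;                  /* nothing suspicious found        */
        if (r->largest_area > 50)
            return false;                 /* reject: defect too large        */
        if (r->mean_intensity < 20.0f)
            return false;                 /* reject: image too dark to trust */
        return true;                      /* small, borderline cases pass    */
    }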

Image Acquisition and Processing with HERON

In providing a flexible high performance solution to image processing it is necessary to have the ability to mix both FPGAs and DSPs in one system. Where a system does not require the DSP it may then be possible to leave the DSP out altogether.

With the HERON module range, exactly that is possible. HERON-FPGA modules that include the Virtex-II series of FPGA present resource nodes that are highly suited to a wide range of tasks, in particular the repetitive tasks of image processing. These FPGA modules are also entirely capable of being directly connected to cameras, accepting data in formats such as Camera Link and RS422. Combine that with the HERON processor modules based around the DSPs from the Texas Instruments TMS320C6000 series and a complete imaging solution becomes possible.

In addition to the hardware resources required at the heart of the system, firmware and software are also necessary to implement the appropriate algorithms in the FPGA and DSP. In addressing this, HUNT ENGINEERING have written imaging libraries for both the DSP and FPGA, downloadable from www.hunteng.co.uk, enabling the key algorithm components to be quickly and easily assembled into a working imaging system. VHDL is provided for the development of the FPGA and C libraries are provided for the DSP.

Adding Memory

Where an image processing pipeline needs to perform operations between multiple frames, such as the addition of two images, it becomes necessary to have an area of memory available that can store entire frames.
Unless the image is very small, the internal RAM resources of the FPGA will not be enough for this type of operation. In this situation a module like the HERON-FPGA5 can be used: the reference image is stored in SDRAM external to the FPGA and read into the FPGA as it is required.

Figure 3: HUNT ENGINEERING HERON-FPGA5 module with Virtex-II FPGA, 256Mbytes of SDRAM and digital I/Os for connecting cameras.

Although this operation involves an external data source, it can still be performed at very high pixel rates (greater than 100 Mpixels/sec), because the SDRAM accesses, the incoming image and the output of results all use dedicated hardware resources of the FPGA and run in parallel. Contrast that with a processor-based approach, where multi-frame processing adds further memory accesses for the processor to perform, making these operations likely to be even slower than single-image, pixel-based operations.

Neighbourhood processing, on the other hand, requires several lines of image data to be stored before processing can begin. The image size determines the amount of storage required for one line, and the kernel size of the operation determines the number of lines that need to be stored. It may be possible to use the Block RAM internal to the FPGA for this storage, but the amount available depends on the size of FPGA being used and on what else the design requires. As an example, a one-million-gate Virtex-II FPGA has 90Kbytes of Block RAM. If nothing else in the design requires Block RAM, then the convolution can use all 90Kbytes. With 8-bit monochrome data, 90Kpixels can be stored. If the image is 2K pixels per line, that is 45 lines of data – more than enough for a large convolution function!
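
As a rough model of this line storage, the C fragment below sketches the buffer that a KxK neighbourhood operation implies (the sizes and names are assumptions; on the FPGA this buffer would live in Block RAM rather than in C arrays). K-1 previous lines must be held before the first output pixel can be produced.

    /* Illustrative line buffer for a KxK neighbourhood operation on an
     * image with LINE_LEN pixels per line. With K = 5 and LINE_LEN = 2048
     * this needs (K-1) * LINE_LEN = 8Kbytes, well within the 90Kbytes of
     * Block RAM mentioned above.                                          */
    #include <stdint.h>
    #include <string.h>

    #define LINE_LEN 2048               /* pixels per line (2K example) */
    #define K        5                  /* 5x5 neighbourhood            */

    static uint8_t line_buf[K - 1][LINE_LEN];

    /* Push each newly arrived line into the buffer; the oldest stored
     * line is discarded and the newest occupies the last slot.         */
    void push_line(const uint8_t *new_line)
    {
        memmove(line_buf[0], line_buf[1], (size_t)(K - 2) * LINE_LEN);
        memcpy(line_buf[K - 2], new_line, LINE_LEN);
    }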

Where the FPGA design uses Block RAM for other functions, it becomes attractive to use hardware like the HERON-FPGA5, where the image can be stored in off-chip SDRAM.

Conclusions

Many of the key imaging functions break down into highly repetitive tasks that are well suited to modern FPGAs, especially those with built-in hardware multipliers and on-chip RAM. The remaining tasks require hardware more suited to control flow and decision making, such as DSPs.

To gain the full benefit of both approaches, systems are required that can effectively combine FPGAs and DSPs. With the addition of standard imaging functions written in either VHDL or C, all of the key building blocks are available for building an image processing system.

The next logical step is to add devices such as the Virtex-II Pro series from Xilinx to the HERON module range. With a processor core in the form of the PowerPC, a sea of gates, built-in multipliers and on-chip RAM, a self-contained high-performance imaging solution becomes possible in a single chip. For more information, keep an eye on www.hunteng.co.uk, where new information will be added as it becomes available.

