
A Hardware Accelerator for the Semi-Global Matching Stereo Algorithm: An Efficient Implementation for the Stratix V and Zynq UltraScale+ FPGA Technology

Published: 27 January 2024

Abstract

The semi-global matching stereo algorithm is a top-performing algorithm in stereo vision. The recursive nature of the computations involved in this algorithm introduces an inherent data dependency problem, hindering the progressive computation of disparities at pixel clock. In this work, a novel hardware implementation of the semi-global matching algorithm is presented. A hardware structure of parallel comparators is proposed for the fast computation of the minima among large cost arrays in one clock cycle. Also, a hardware-friendly algorithm is proposed for the computation of the minima among far-indexed disparity costs, shortening the length of computations in the datapath. As a result, the recursive path cost computation is accelerated considerably. The system is implemented in a Stratix V device and in a Zynq UltraScale+ device. A throughput of 55.1 million disparities per second is achieved with a maximum disparity of 128 pixels and a frame resolution of 1280 × 720. The proposed architecture is less elaborate and more resource-efficient than other systems in the literature, and its performance compares favorably to them. An implementation on an actual FPGA board is also presented and serves as a real-world verification of the proposed system.

1 INTRODUCTION

1.1 Review of Stereo Methods

Dense stereo matching is one of the most interesting problems in computer vision. It relates to applications like navigation, obstacle avoidance [1] and 3D reconstruction [2]. Using stereo-camera systems to derive dense depth maps is an established technique in robotics, virtual reality and automotive applications. Developments in these fields push for higher performance and better accuracy, and advances in video technology push the limits further: high-definition images require processing acceleration in order to achieve high frame rates.

The implementation of sophisticated, accurate stereo algorithms in hardware is of paramount importance for high-performance vision applications. Typical image resolution for robot navigation, as in visual SLAM, is 640 × 480 pixels. An image size of 1280 × 720 pixels or higher is required for object recognition and industrial inspection. Vehicle applications require a dense depth frame rate of 30 frames per second (fps), while a rate of 50 fps is required for high-speed and extended reality (XR) applications.

The goal of stereo matching algorithms is to establish the disparity between corresponding points in the left and right image of the stereo pair. This process is complicated and demanding in terms of computational resources; therefore, the design of an accurate real-time implementation of a state-of-the-art stereo algorithm is often an open and challenging problem. Although there has been progress in incorporating fast versions of stereo algorithms in software libraries like OpenCV, high-performing real-time systems are usually implemented using specific processors in FPGAs or ASICs.

FPGA technology provides an unparalleled substrate for the hardware implementation of stereo processors. FPGAs present advantages over alternative solutions, like ASICs and GPUs, mainly because of their flexible, reconfigurable nature and their low power consumption. They can carry out massive parallel computations and they can significantly reduce the design time and development cost.

Typically, stereo algorithms apply to rectified images, where image scanlines are identical to epipolar lines. Rectification can be accomplished using standard methods [3] once the stereo camera system has been calibrated. Algorithms for stereo correspondence are classified into local and global methods [4, 5]. Local methods establish correspondences by finding matching windows of interest around each pixel in the reference and target images. A common way of computing matches in local methods is by minimizing a metric, like the aggregated absolute differences in the comparison windows. Local methods are simple to implement and they are usually fast; however, they are prone to noise and inaccuracies, especially in regions of low texture and occlusion, where there is not much local support for finding correspondences. Global methods, like Belief Propagation [6] and Graph Cuts [7], formulate the stereo problem in terms of a Markov Random Field (MRF) problem. They define an energy function E(D) that measures the quality of the overall disparity map D. The best disparity map is computed as the map that minimizes the energy function over all image pixels. The cost function usually incorporates both the pixel dissimilarity cost and a smoothness term computed for all possible neighboring disparities in the map D. Both local and global methods have been accelerated using special-purpose hardware [8–10] or using GPUs [11]. However, global methods usually require huge hardware resources, including large memory resources, and are not easy to implement. As a result, most existing hardware implementations use local methods.

The semi-global matching stereo algorithm (SGM) [12, 13] is one of the top performing algorithms in stereo vision. In essence, it is a global method applying disparity optimization along specific 1D paths to each image pixel p. The path costs L along the selected directions are summed for all possible disparities, as presented in Section 2. The disparity that minimizes the overall path cost to pixel p is selected and enters the disparity map D. Energy minimization along 1D lines is a more manageable problem than 2D global minimization; however, performing the above computation in real-time with a high frame rate is still a very challenging problem, mainly because of the recurrent nature of the computation of path costs at each pixel.

The effort to devise solutions to the bottlenecks of the implementation of global and semi-global stereo methods in hardware can lead to advances in the state-of-the-art of industrial stereo cameras. Novel hardware designs can lead to new processors with more accurate depth maps. In this paper, we present the stages of a hardware system that accelerates the computations of SGM in the recursive loop and achieves purely streaming, low-power and real-time performance at pixel-clock, with low complexity and minimal hardware resources.

1.2 Related Hardware Architectures

The hardware implementation of the semi-global matching algorithm for real-time depth extraction has drawn the attention of several researchers [14–18]. They proposed solutions for problems related to computation speed, high complexity, data dependency, the demand for resources and the need for large memory buffers. Such problems are especially relevant in the case of large disparity ranges and large image resolutions. Various parallelization techniques have been proposed in order to increase the inherently low rate of processing. The limitation of performance is due to the recurrent nature of the computations and the large range of disparities required for large image resolutions. Using disparity-level parallelism, many computation elements are allocated for the parallel processing at each disparity value. Using row-level parallelism, a number of rows are buffered and processed in parallel.

Gehrig et al. [14] proposed the first FPGA implementation of the SGM algorithm. Their system works on images with active resolution 340 × 200 and processes in parallel 64 levels of disparity. Path cost values are stored in fast memory inside the FPGA. The SGM system processes four directions in parallel and the final accumulated costs over the whole image are stored in external RAM. At a second pass, the pair of images is read bottom-up from external memory and the same calculations are performed in the opposite direction, achieving the computation of costs along all eight main directions. In real-world experiments, the system achieves a maximum frame rate of 27 Hz.

Banz et al. [15] employ pixel-level parallelism, without disparity-level parallelism. Processing of each pixel is carried out sequentially over all pixel disparities. Their main SGM processing structure is a systolic 2D array performing heterogeneous tasks, which are synchronized by a custom RISC modular co-processor. Some tasks compute the minima among cost values in the disparity range, while another task computes the disparity corresponding to the overall minimum. All tasks, together with the co-processor, are mapped on a Virtex-5 FPGA. The authors clock their system at 133 MHz and can therefore process 128 disparities per pixel sequentially, at a pixel rate of approximately 1 MHz. However, they employ row-level parallelism, processing simultaneously up to 30 image rows, thus increasing the overall throughput to approximately 30 MHz.

Wang et al. [16, 17] arrange the computation of the disparity of a single pixel into a complex scheme of multiple clock cycles, each one processing PD disparities, where PD represents the disparity-level parallelism. In K clock cycles they compute path costs along the whole range of ND = K × PD disparities. In order to arrange the cost values correctly between the K passes, they adopt a buffering scheme, an approach which is demanding in terms of memory resources. They further divide the computation period into two parts, each one consisting of K/2 cycles. In the first interval of K/2 cycles, they compute the minimum path cost among the path costs of the previous pixel p-r, namely min_d{Lr(p-r, d)}, while in the second interval they compute the path costs Lr(p, d) for the current pixel p. In order to increase the throughput of disparities in this scheme, they use row-level parallelism, implementing the above computations in parallel for a number of rows PR.

Cambuim et al. [18] introduce serialized processing of pixels in a number of lines that have been buffered on chip. They divide the whole frame into slices of Nr lines and store lines into FIFO buffers. Pixels are read serially, one pixel from each line, until all Nr lines are read, before the reading returns to the first FIFO buffer. A FIFO controller reads the pixels in a non-aligned spatial manner, in order to ensure that the path costs of neighboring pixels are ready to be used in the processing of the current pixel. The SGM module is implemented as a systolic array, where all computation steps of the SGM take place serially. The authors implement the system in a Stratix IV FPGA and achieve a practical frame rate of 25 Hz for a resolution of 1024 × 768.

1.3 Contributions of the Present Work

There is a consensus in the literature that the main challenge in the implementation of the SGM algorithm in hardware “is imposed by the recursively defined paths and complex, non-scan-aligned data dependencies” (Banz et al. in [15]). As Wang et al. put it, “the disparity level parallelism is a big problem due to data dependency in adjacent pixels” [17]. This problem is clarified further in Section 2, where the intricacies of the algorithm are presented. As summarized in Section 1.2, previous works either abandon the parallel processing of the whole disparity range and process the disparities of each pixel serially, as in [15], or divide the processing of each pixel into many cycles. The latter strategy is followed in [17], where “the minimum of ND costs could not be computed in one cycle”, so additional cycles are inserted between adjacent pixels. In the works cited above, the overall throughput is increased using an ingenious but complex and resource-consuming row-level parallelism scheme. Row-level parallelism means the simultaneous processing of properly buffered and aligned multiple image rows.

The present work focuses on the data dependency problem and addresses it using single-row processing. Processing a single row instead of multiple rows reduces the need for large row-buffers and as a result, the memory requirement of the proposed implementation is minimized in comparison to other implementations in the literature. Contrary to other solutions, we come up with a fast way to compute the minima of large sets of values in a single clock cycle. For this purpose, we propose a cycle-efficient hardware structure for the fast computation of the partial minima of the previous cost values L(p-r, d) and thus we avoid cycle-consuming large comparison trees.

Another bottleneck in the hardware implementation of SGM is the parallel computation of the minimum among the path costs of far-indexed displacements, for each one of the possible displacement values. This is a formidable task, since it requires multiple computations of minima, as discussed in Section 4.4. In the present paper, a new hardware-friendly approach is proposed for the computation of this array of minimum costs. Reducing this part of the algorithm to only the necessary computations is key to our implementation. Also, we present implementation details in the various stages of the computations, with emphasis on the novel design elements.

The performance of the proposed implementation is also studied. The pipeline produces one output disparity value at each cycle of the pixel clock, following progressive scan, and achieves a clock frequency of 63.5 MHz on an Intel Stratix V device. This performance corresponds to a frame rate of 206.7 frames per second for an image resolution of 640 × 480, when the disparity range is 64 pixels. We also implement the design using a Zynq UltraScale+ device by Xilinx/AMD and compare the performance and resources of both implementations. The design is fully scalable in terms of maximum disparity range and image resolution.

A practical implementation and assessment of the design is presented, using a PYNQ-Z1 board and a StereoPi camera. The pixel clock has been successfully raised to 74 MHz. Our tests produced depth maps in real time, with a refresh rate of 60 Hz at VGA resolution.

The rest of the paper is organized as follows. In Section 2, the SGM algorithm is outlined. In Section 3, an overview of the main stages of the proposed system is presented. In Section 4, the details of the design are laid out and hardware-friendly algorithm modifications are proposed. Emphasis is given to novel design elements. In Section 5, the extension of the system to multiple directions is described. Section 6 presents results obtained with hardware simulation, using reference stereo pairs. In Section 7, resource requirements and performance evaluation are given. Also, the real-world implementation of the proposed system is presented. Section 8 concludes the paper.

2 THE SGM STEREO ALGORITHM

The semi-global stereo matching algorithm (SGM) proposed by Hirschmüller [13] is a well-known method that minimizes path cost along the main directions around each pixel of the stereo frames. Contrary to Dynamic Programming, it does not make use of the monotonic ordering constraint [19]. SGM assigns to each pixel, and for all possible disparities, a path cost L. This cost depends on the pixelwise matching cost C(p, d), computed for each possible disparity value d, plus the minimum transition cost from the previous pixel to the present pixel along an image direction r. This is described by the following recursive relationship:

(1) \(L(p, d) = C(p, d) + \min\{L(p_{-1}, d - 1) + P_1,\ L(p_{-1}, d),\ L(p_{-1}, d + 1) + P_1,\ \min_{d_{far}}\{L(p_{-1}, d_{far})\} + P_2\}\)

p represents the current pixel, processed at the present clock cycle and p-1 represents the pixel prior to the current pixel p, along the direction r on which the path cost is being minimized. In the above formula, L(p, d) denotes the path cost to the current pixel p, for a horizontal displacement d. Displacement d takes values in a range bounded by dmax. Therefore, path cost L is a quantity computed in the disparity space (for all different image displacements) instead of the pixel space (for all different pixel correspondences). First, for all different horizontal displacements d, the displacement cost C(p, d) is computed for a specific image point p. The second term in Equation (1) is the minimum among four values, explained in detail below. Since Equation (1) is computed for all possible integer displacements, L constitutes a set of values attributed to each image pixel, for all possible disparity values in the given range. Therefore, Equation (1) is a multivalued equation.

Computation of Equation (1) gathers cost information along one image direction and attributes a disparity value to a particular pixel p, by minimizing the cost L of the path to pixel p, along the selected direction. In a semi-global implementation of the SGM algorithm, path cost is computed along the main straight directions around a pixel p, by repeating the computation of L(p-1, d) for all eight neighbors to pixel p. In the above equation, d represents a displacement value, with d - 1 and d + 1 being the adjacent disparity values. dfar denotes all other disparity values, excluding d - 1, d, d + 1. P1 and P2 are penalty values, with P2 > P1.

After the above computation of the L set, we assign to the current pixel p the disparity value attributed to the minimum L. Since the above computation is performed along various image directions r, for each pixel p we add the costs Lr corresponding to each disparity value d:

(2) \(S(p, d) = \sum_r L_r(p, d)\)

Then, we find the argument d corresponding to the minimum total cost S:

(3) \(d = \arg\min_d S(p, d)\)

In the following, we expand on the above equations, in order to clarify the required computations. For the computation of path costs according to Equation (1), we need

(a)

all matching costs corresponding to all possible horizontal displacements d of the left image with regard to the right image, at point p: C(p, d) = abs(Ip-left − Ip-right)d, i.e., the absolute difference between the left-image pixel and the right-image pixel at displacement d. Here we consider, for simplicity, the absolute difference (AD) cost and displacements to the left.

(b)

the path costs L(p-1) of the adjacent pixel (p-1) along the direction r, for all possible disparity values d. This is an array of path costs. Each path cost in the array is indexed with the corresponding displacement d.

Now, we consider the array of the above L(p-1, d) values, which are computed for each disparity d and proceed as follows:

(c)

for each displacement value d, we take the triad of the path cost values L(p-1, dclose), corresponding to the closest disparity indices dclose ∈ {d - 1, d, d + 1}. We add a penalty P with values P1, 0, P1, respectively, and we find the minimum value: min{L (p-1, dclose) + P}.

(d)

then, for each index d, we exclude from the L(p-1) set the above path costs L(p-1, dclose) corresponding to the close disparities dclose. Considering the remaining far-indexed path costs L(p-1, dfar), we add the penalty value P2 and compute the minimum value: min{L(p-1, dfar) + P2}.

(e)

for each displacement d, we compute the minimum among the two minima described in steps c and d. This minimum is computed for all possible disparities d in the range 0 to dmax.

The index d that corresponds to the minimum cost L(p, d), is considered as the optimum disparity value for pixel p. For multiple directions r, the arrays of path costs are added according to (2) and then the minimizing argument d is produced.

As noted in the original publication of the algorithm by Hirschmüller [12], the computation in Equation (1) inevitably produces values that grow over time, since Equation (1) is a cumulative sum. For this reason, it is suggested that the path cost value L be normalized by subtracting at every step the overall minimum value among the set of all L(p-1, dall) values, which is constant over d. This is the minimum L computed for the previous pixel p-1, among the Ls computed for all disparities. The new equation for path cost computation becomes:

(4) \(L(p, d) = C(p, d) + \min\{L(p_{-1}, d - 1) + P_1,\ L(p_{-1}, d),\ L(p_{-1}, d + 1) + P_1,\ \min_{d_{far}}\{L(p_{-1}, d_{far})\} + P_2\} - \min\{L(p_{-1}, d_{all})\}\)

At the image border, L(p, d) takes the value of the net cost C(p, d), and then the recursive computation sets in.
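For illustration, the per-pixel update of Equation (4) can be written as a minimal Python sketch that mirrors steps (c), (d) and (e) above. The function and penalty names are ours, the brute-force far-minimum shown here is exactly what the hardware-friendly scheme of Section 4.4 replaces, and indices run over the useful disparity range 0 to dmax - 1 only:

```python
import numpy as np

P1, P2 = 11, 17   # illustrative penalty values, P2 > P1

def path_cost_update(C_p, L_prev):
    """One step of Equation (4): L(p, d) from L(p-1, d), for all d."""
    dmax = len(C_p)
    overall_min = L_prev.min()                          # min over all L(p-1, d)
    Lp = np.pad(L_prev, 1, constant_values=2**10 - 1)   # border-extend with +1023
    L = np.empty_like(L_prev)
    for d in range(dmax):
        close = min(Lp[d] + P1, Lp[d + 1], Lp[d + 2] + P1)    # step (c)
        far_ix = np.arange(max(d - 1, 0), min(d + 2, dmax))   # close indices
        far = np.delete(L_prev, far_ix).min() + P2            # step (d), brute force
        L[d] = C_p[d] + min(close, far) - overall_min         # step (e) + normalization
    return L
```

A row of costs is processed by seeding the recursion with L = C at the border and then calling path_cost_update once per pixel.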

The end result of the above recursive path cost computation is that one assigns to the next pixel a disparity value that minimizes the total path cost along a particular image direction, taking into account the net assignment cost C, and the homogeneity of particular image objects with respect to their distance from the camera. Applying a larger penalty P2 for paths corresponding to larger disparity differences than to paths corresponding to adjacent disparities, as in Equation (1), results in a homogeneous disparity map, without large jumps among neighboring disparity values.

Let us discuss the intricacies of implementing the above algorithm in hardware. The recursive nature of the computations in Equation (4) means that for each new pixel p, one actually needs the set of computations in the previous pixel p-1 along the specific direction r. In a purely streaming pipeline, where a new disparity value is computed at each pixel-clock cycle, this requirement means that it is necessary to compute all the minima detailed in points c and d, above, in one clock cycle. For a path cost array L with dmax elements, it is necessary for the required computations to occur simultaneously for all d values. In other words, disparity-level parallelism is needed. In addition to the minima detailed in points c, d, e, above, the overall minimum among all dmax path cost elements L(p-1) is also required. In the latter computation, dmax can be a large number (e.g., 128). Moreover, in points c and d, dmax minima have to be computed in parallel, in one clock cycle. Point c computes the minima among triads of path costs, which is not demanding in terms of timing, however, point d can be a real bottleneck, because in principle it requires the parallel computation of dmax minima, each one among a large set of path cost values.

In the present work, a streaming pipeline for SGM algorithm is designed based on a timing-efficient block for the computation of the minima of large sets of values. This block is detailed in Section 4.3. Also, the computations in point d, above, are minimized according to a hardware-friendly approach, detailed in Section 4.4. The overall solution is efficient in terms of resources as well.

Timing efficiency in the above computations results in a high pixel-clock frequency and an increased frame rate. We set the lower target frequency of the pixel clock at 25 MHz, which corresponds to streaming images of VGA resolution at 60 fps. A pixel-clock frequency of 74 MHz corresponds to streaming images of resolution 1280 × 720 at a refresh rate of 60 fps. In the above frequencies, the blanking intervals of the video signal have also been taken into account. Apart from adhering to streaming video refresh rates, another crucial specification in the design process is keeping the system resources as low as possible. Increasing the maximum disparity and adding many path directions can easily cause the required resources to grow beyond those available in a medium FPGA chip.
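As a quick check of these targets, the pixel clock equals the total pixels per frame (active plus blanking) times the refresh rate; the frame totals below are the standard timing values for these two video modes:

```python
# Pixel clock = total pixels per frame (active + blanking) x refresh rate.
def pixel_clock_mhz(total_width, total_height, fps):
    return total_width * total_height * fps / 1e6

print(pixel_clock_mhz(800, 525, 60))    # VGA 640x480 @ 60 Hz -> 25.2 (nominal 25.175 MHz)
print(pixel_clock_mhz(1650, 750, 60))   # 720p @ 60 Hz        -> 74.25 MHz
```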

Initially, we used VHDL to design a basic system, where the computation of path cost L is limited to the direction of the horizontal scanline. We then replicate the design for multiple directions, namely the horizontal (0°), the diagonal (45°), the vertical (90°) and the opposite diagonal (135°).

In the next section, an overview of the proposed system is presented. The description is given mainly for the horizontal direction. For all other directions, only the shift registers storing cost values of the previous pixel are redesigned and presented accordingly.

3 OVERVIEW OF THE PROPOSED ARCHITECTURE

The main system parameters are dmax and line_width, the maximum disparity range, and the horizontal image width. The hardware description language entities are designed to scale automatically with dmax and line_width. Versions of the system with dmax = 32, 64, 96 and 128 pixels have been implemented for an Intel Stratix V device and a Xilinx/AMD Zynq UltraScale+ device and have been studied with respect to resource requirements and Fmax.

The block diagram of the system is shown in Figure 1. The first of the main system blocks in Figure 1 is the introductory block (intro_block), where the net cost value C(p, d) is computed for all displacements of the right image with regard to the left image. For simplicity, cost computation is shown in Figure 1 as the absolute difference (AD) between the intensities of the left and right image and is implemented with a shift register and a series of subtractors, followed by the computation of the absolute value. An array of cost values is produced at the output, containing all costs for displacements from -1 to dmax. These costs are indexed using natural indices, in the range 0 to dmax + 1. A detailed view of the hardware in this block, for the AD cost, is given in the upper part of Figure 2.

Fig. 1. The block diagram of the implemented system.

Fig. 2. Detailed presentation of the cost computation intro_block (upper part) and of the computation of the minimum adjacent cost (lower part), for all disparities in the range 0 to dmax – 1.

The second block in the pipeline (recursive_path) adds recursively the minimum value between min{L(p-1, dclose) + P (P = P1, 0, P1)} and min{L(p-1, dfar) + P2} to the cost value C(p, d), as postulated in points c, d and e of the algorithm steps outlined in Section 2. Also, it subtracts the overall minimum min{L(p-1, dall)}. These minima are computed among the path cost values L(p-1, d) of the previous pixel, over all possible integer disparity values. The computation follows Equation (4).

The output of the subtractor in recursive_path is pipelined using a layer of registers. In this way, in the proposed streaming system, the output L at time t is the value computed at t - 1, which corresponds to the cost value L(p-1, d) of the previous pixel. A single layer of registers is sufficient when the selected direction for cost computation is the horizontal scanline. More elaborate buffer schemes are needed in order to store the previous pixel path costs on other straight directions, for example, on the vertical or the diagonal direction.

The output L(p-1, d) of the recursive_path computation is input to blocks where the minimum values among sets of values are computed. The array of the minima of L values corresponding to adjacent disparities is computed by taking triads of Ls and finding the minimum L among the three values L(p-1, d - 1) + P1, L(p-1, d), L(p-1, d + 1) + P1. This computation is detailed in the lower part of Figure 2. The minimum among the three cost values of adjacent disparities is computed using a block for finding the minimum in a two-tier comparison tree structure. These values are exported in the form of the min_close array with dmax elements, indexed from 0 to dmax – 1. A for…generate structure is employed in order to scale this computation automatically for any dmax value.

Apart from the path costs indexed by the triad d – 1, d, d + 1 of disparities close to the indexed disparity d, there is also a set of far disparities, resulting from excluding the set of the above path costs close to the disparity index d. The set of path costs of far disparities is obviously different for each index value d. A hardware-friendly approach is detailed in Section 4.4, for the computation of the minimum among the costs of far-indexed disparities, for each index d.

In Figure 1, the sub-block min_of_far_L_paths contains multiple circuits, each one producing the minimum among path costs corresponding to far disparities, for a particular d index. This circuit is part of the block far/overall_min in Figure 1 and its output is the min_far array, indexed for d.

In the same far/overall_min block, a circuit for the fast computation of the overall minimum of path costs L(p-1, dall) to the previous pixel p-1 is also instantiated (see overall min among previous Ls, in Figure 1). A fully parallel structure has been devised for the fast computation of the minimum value among a set of values. This structure is central to our implementation and is detailed in Section 4.3.

The penalty P2 is added to the array min_far and each minimum value is compared with the min_close value corresponding to the same disparity d. The comparison is made by the bank of min2 circuits shown at the lower center of Figure 1. The output of these circuits forms an array, indexed for all disparities, which feeds back to the recursive_path block for the computation of Equation (4). The overall_previous_min output also feeds back to the recursive_path block and is subtracted for normalization purposes.

A final block labeled tag_of_min_value is shown at the lower right of Figure 1. It receives as input the array of path costs and produces the tag of the minimum path cost. This tag is an index between 0 and dmax + 1 and represents the disparity value.

The various system blocks are described in their hardware details in the next section.

4 COMPUTATIONAL DETAILS

In this section, details of the design principles used to accelerate the recursive computations in the computational loop of Figure 1 are presented. In Section 4.1, the computational details of the introductory block are described. In Section 4.2, design details of the blocks in the recursive computation loop are given. In Section 4.3, a structure for the fast computation of the minimum among a set of N values is proposed and its implementation in the context of the SGM stereo algorithm is discussed. In Section 4.4, a fast and compact solution to the problem of computing the array of minima of the far-indexed path costs L(p-1, dfar), for each displacement value d, is proposed.

4.1 The Introductory Block

The stereo-pair is introduced in the system in the form of streaming 8-bit grayscale pixel values. The two images, namely image_left and image_right, are processed as separate image streams.

The introductory block of the stereo processing system (intro_block) computes the net cost of matching the current pixel of the left image with all possible candidate pixels of the right image. The possible candidate correspondences belong to the range 0 to dmax - 1, where dmax stands for the maximum number of disparity levels in a given application. However, the correspondence costs in the region -1 to dmax are computed, since for every value d the algorithm takes into account the costs of the adjacent disparities d - 1 and d + 1. The additional border values -1 and dmax therefore allow the computation of the previous or the next displacement costs of the borderline pixels.

For the cost computation, signed integer arithmetic is employed. For this purpose, all cost values are considered to be 11-bit signed numbers and all pixel values are likewise extended to 11-bit signed integers. It has been verified [13] that the normalized path cost computation described by Equation (4) is always viable using 11-bit signed arithmetic.

The shift-register, in the upper part of Figure 1, hosts in an array q all pixel intensities of the right image required for the computation of net cost values C(p, d), for all displacements indexed from 0 to dmax + 1. The left image pixels in Figure 1 are registered for a single step, so that the current pixel t of the right image corresponds to the t - 1 pixel of the left image, in the same clock cycle. Thus, displacements start from -1 and range up to dmax. The cost array is indexed using natural indices in the range 0 to dmax + 1.

In the above description of Figure 1, we consider the image pixels to flow left-wise through the pipeline, as is the case in hardware. Therefore, the right image pixel always arrives in the datapath before its left counterpart in the same scanline, as exemplified in Figure 3. For this reason, in our hardware implementation, the right image pixels are buffered in array q, while the registered left pixel is subtracted from all right pixels in the buffer. Net cost values C(i) are computed for all pixels i in the buffer, using a for loop, for i in the range 0 to dmax + 1:

\(C(i) := abs(q(i) - im\_left\_reg);\)

where q is the right image buffer array, implemented as a shift-register, and im_left_reg is the registered left pixel at step t - 1. In the above statement, the simple absolute difference cost is considered:

(5) \(C(p, d) = |I_L - I_R|\)

By using the naïve formula in Equation (5) for cost computation, we focus on the demands of the recursive cost computation in the subsequent blocks of the system (Figures 1 and 2). However, the correspondence cost can be measured with more sophisticated methods, which can alleviate problems of photometric differences between the stereo pair images or the effects of image sampling. Examples of such measures are the cost calculation based on Mutual Information [20] and the Birchfield and Tomasi cost [21], based on interpolated intensities. Therefore, after the implementation of the basic system as a proof of the proposed concept, we proceeded to implement a more sophisticated intro_block, based on the Birchfield and Tomasi (BT) definition of pixel dissimilarity. The computation of BT dissimilarity requires a linear interpolation between adjacent intensities in the left and right image pixels. For this reason, for every pixel p in the left and right scanline, with intensities IL(p) and IR(p), we also need the adjacent intensities IL(p + 1), IL(p - 1) and IR(p + 1), IR(p - 1). A shift-register with a depth of two pixel intensities suffices to store the adjacent intensities. The computation then follows a signed-integer version of the simple interpolation mathematics introduced by Birchfield and Tomasi in [21]. This dissimilarity measure alleviates the effect of image sampling and produces smoother cost functions. The BT-based implementation of the introductory block is more demanding in hardware resources than pure AD; however, the intro_block is implemented only once, no matter the number of parallel stages for cost optimization along different directions.

Fig. 3. Left pixel and its right counterpart in the stereo pair. In hardware, the scanline streams left-wise through the datapath.

The net cost C array computed above is registered for one step before it is output to the next block, in order to achieve better clock rates.
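A cycle-accurate behavioral sketch of this stage in Python follows (the names q and im_left_reg mirror the text; the generator yields one cost array per pixel clock, with index i corresponding to displacement i - 1):

```python
from collections import deque

def intro_block(left_pixels, right_pixels, dmax):
    """Per-cycle model of the AD cost stage: right pixels fill the shift
    register q, the left pixel is delayed by one register stage, and
    C(i) = |q(i) - im_left_reg| for i = 0 .. dmax + 1."""
    q = deque([0] * (dmax + 2), maxlen=dmax + 2)  # right-image shift register
    im_left_reg = 0                               # one-cycle delay of the left pixel
    for left, right in zip(left_pixels, right_pixels):
        q.appendleft(right)                       # newest right pixel enters at index 0
        C = [abs(qi - im_left_reg) for qi in q]   # costs for displacements -1 .. dmax
        im_left_reg = left                        # register the left pixel for step t-1
        yield C
```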

4.2 The Recursive Computation Blocks

The blocks downstream compute the other terms of path cost, beyond net displacement cost, according to Equation (4). As already explained, these computations apply to the path cost values L(p-1, d) computed for the previous pixel along the same image direction. These previous pixel path costs are given in the form of an array at the output of the recursive_path block. In the recursive_path block, the net cost array C(p, d) is added to the array of minima, derived from the second term of Equation (4) and computed as described in Section 3.

The array size of the inputs and outputs of the above recursive computation blocks needs to be resolved. Net cost values C(p, d), which are output from the intro_block, are indexed in the range 0 to dmax + 1. C(p, d) values include at index 0 the displacement value -1, as already explained in earlier sections. This value is adjacent to the first useful disparity value, which is the zero-displacement value and is indexed as 1. On the other hand, the upper cost index limit dmax + 1, corresponds to displacement dmax and is adjacent to the last useful disparity value dmax – 1.

Cost values corresponding to the whole range of disparities, from -1 to dmax, are needed for the computation of the various minima described above. However, the output arrays min_far and min_close, computed by the min_of_far_L_paths and close_min blocks, are indexed from 0 to dmax – 1. The reason is that in these computations, borderline adjacent disparities are dropped. Therefore, when the array of minima of the path costs L(p-1) feeds back to the recursive_path block, it needs to be properly extended in order to comply with the size of the C(p, d) array. Border extension is performed by shifting the cost array one step to the right and filling the 0 and dmax + 1 borders with the maximum 11-bit signed number (+1023). The recursive_path block supports signed integer arithmetic.

The design of the block computing the array of minima min_far of far indexed path costs, in the range of indices 0 to dmax – 1, is described in detail in Section 4.4.

4.3 Fast Computation of the Minimum Value

In this subsection, we present the proposed solution to the problem of the fast computation of the minimum among a large set of N values. As explained in Section 2, the proposed implementation of the SGM algorithm requires the parallel computation of many minima of path cost arrays along the disparity range. It is necessary that these computations are performed in the feedback loop in one pixel-clock cycle. The commonly used comparison trees do not suffice for this purpose, since they require many comparison cycles and therefore are not efficient in terms of timing. Our solution to the problem of the computation of the minima is implemented using a rectangular structure of comparators, as shown in Figure 4. Every value a(i) is compared with every other value using a row of comparators a(i) ≤ a(j), where j = 0 to N - 1. The outputs of the comparators of every row are ANDed, and in case the overall output is 1, a(i) is chosen as the overall minimum. A total of N ANDs, each with N inputs, is used. Obviously, only one row of comparisons will produce 1 at the AND output, unless the minimum value occurs repeatedly among the set of inputs. The first row index corresponding to a 1 at the output of the row AND gate is used as a tag to denote the corresponding index. In the proposed system, the tag is represented as an integer in the range 0 to dmax - 1.

Fig. 4. Rectangular structure of comparators for the fast computation of the minimum value (parallel_min block). Comparators a(i) < a(j) are shown with dots, with i denoting the row and j denoting the column of the rectangular structure.

The above structure is named parallel_min. It can be designed effectively using HDL and it releases the bottleneck when fast computations of the minimum of N values are required. This block is the heart of the computation of the overall minimum in the overall_min sub-block. Also, this block is critical for the computation of the minima of the far-indexed path costs, for each disparity i. In terms of resources, the number of comparators grows quadratically with the number of inputs N. However, the number of comparators can be drastically reduced by observing that only the comparisons above the diagonal of the rectangular structure are required (red dots in Figure 4), while the comparisons on and below the diagonal can be eliminated. The overall number of comparators for a comparison of N inputs becomes:

(6) \(\textit{number of comparators} = (N^2 - N)/2\)

Still, for large N values, this block can exhaust the available resources of small and medium FPGA chips. For this reason, we opt to use it in a two-tier design, as in Figure 5, where k blocks determine minima in groups of N/k inputs and a second layer of comparisons delivers the final result with a k-input parallel_min block. In this way, we can compute minima for a range of d values 0 to 63 using eight 8-input blocks and produce the final minimum with a second tier of comparisons using one more 8-input parallel_min block. This reduces the total required comparators from 2016, as derived from Equation (6) for N = 64, to 252. The cost of the two-tier structure is a reduction of Fmax; however, the maximum frequency remains far greater than with simple cascading comparison trees. The number k of blocks is an additional design parameter and determines resource utilization and Fmax. The total number of comparators in the design of Figure 5 now becomes

Fig. 5. Two-tier parallel_min structure for the computation of the minimum of a large array of cost values, as it is the case for depth maps with large disparity range (0 to dmax). The index of the minimum is also produced as min_tag.

(7) \(\textit{number of comparators} = (N(k + 1)/2k) \cdot (N/k - 1)\)
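A behavioral numpy model of the parallel_min block and its two-tier composition follows. This is a sketch under the assumption that N is divisible by k; ties resolve to the first row, as in the AND-gate description above:

```python
import numpy as np

def parallel_min(a):
    """parallel_min block: a(i) is the minimum iff a(i) <= a(j) for all j
    (one row of comparators, ANDed); the first winning row gives min_tag."""
    a = np.asarray(a)
    le = a[:, None] <= a[None, :]          # comparator matrix a(i) <= a(j)
    row_and = le.all(axis=1)               # AND over each comparator row
    tag = int(np.argmax(row_and))          # first row whose AND is 1
    return a[tag], tag

def two_tier_min(a, k):
    """Two-tier composition of Figure 5: k group minima, then one
    k-input parallel_min over the group results."""
    groups = np.asarray(a).reshape(k, -1)  # k groups of N/k values each
    mins, tags = zip(*(parallel_min(g) for g in groups))
    m, g = parallel_min(mins)              # second-tier comparison
    return m, g * groups.shape[1] + tags[g]   # global minimum and its index
```

For dmax = 64 and k = 8, two_tier_min mirrors the eight 8-input first-tier blocks plus the single 8-input second-tier block discussed above.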

4.4 Computation of Minima among Far L Path Costs

In this subsection, a solution to the problem of the computation of the minima among far-indexed path costs is proposed. The minima in question enter Equation (4) as min{L(p-1, dfar)} and they are computed for each displacement d. dfar in Equation (4) stands for path cost indices far from d. As explained in Section 2, computing a large number of different minima among large path cost arrays can become a bottleneck in the pipeline. The computation of the array of far-indexed minima is performed by the far/overall_min block of Figure 1. For this block, one can follow a number of hardware design principles, which result in different implementations. We assessed various implementations in terms of resource usage and frequency of operation. Ultimately, the following hardware-friendly algorithm results in a simple and fast implementation, while preserving the same lossless quality of the output depth map as the full software implementation. The computation of the min_far array, which is the output of the block min_of_far_L_paths (see Figure 1), is based on the following observations:

For each index d of the cost values L(p-1, d), we have already computed the minimum among the path costs of the close disparities d - 1, d, d + 1, plus a penalty value P = P1, 0, P1, respectively. The far path costs correspond to all other disparities, excluding the above. However, among all path costs there is an overall minimum, which is independent of the index d and which we denote as overall_min. To this minimum corresponds an index min_tag in the range [0, dmax + 1]. When an index d of the cost array L(p-1, d) is far away from min_tag, the minimum among the far-indexed costs obviously coincides with overall_min. When the index d is close to min_tag, overall_min coincides with one of the path costs of the close disparities; therefore, it has to be excluded from the computation of the far minimum for this d value, along with the path costs of the other close disparities. Hence, in a second stage of computations, we exclude overall_min and its close neighboring costs from the computation of the minimum and find a new minimum value, denoted as far_min, away from overall_min. The min_far array for each index d is then computed as follows:

(8) \(min\_far(d) = \begin{cases} overall\_min, & \text{when } d \text{ is away from } min\_tag \\ far\_min, & \text{when } d \text{ is close to } min\_tag \end{cases}\)

In order to use the above algorithm, two minimum values among all path costs have to be computed. First, the overall_min value is computed applying the two-tier scheme of Figure 5; overall_min is one of the k partial minimum values produced in groups of N/k path cost inputs. Then, overall_min is excluded and the new far_min value is computed among the remaining k - 1 partial minima. The final min_far array for all indices d is produced following Equation (8). “Closeness” to the min_tag value is interpreted here as

(9) \(|d - min\_tag| < 3\)

The HDL implementation of the above algorithm is straightforward and is based on the parallel_min structure presented in Section 4.3, Figure 5. The validity of the above algorithm has been tested in practice against a strict software procedure. The software version computed the min_far array using an exact procedure with min() array functions and was found to produce almost identical results to the proposed hardware system. An example comparison between the software implementation and a hardware implementation including the proposed hardware-friendly algorithm is shown in Section 6, Figure 9.
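The following numpy sketch follows Equations (8) and (9) and the two-tier partial minima described above. It assumes dmax is divisible by k, and for simplicity the indices run over 0 to dmax - 1; the function name is ours:

```python
import numpy as np

def min_far_array(L_prev, k):
    """Hardware-friendly min_far of Equation (8) for all d at once."""
    dmax = len(L_prev)
    groups = np.asarray(L_prev).reshape(k, -1)   # k groups of N/k inputs
    group_mins = groups.min(axis=1)              # first-tier partial minima
    g = int(group_mins.argmin())
    overall_min = group_mins[g]                  # global minimum ...
    min_tag = g * groups.shape[1] + int(groups[g].argmin())   # ... and its index
    far_min = np.delete(group_mins, g).min()     # exclude overall_min's group
    d = np.arange(dmax)
    return np.where(np.abs(d - min_tag) < 3, far_min, overall_min)   # Eq. (9)
```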

5 IMPLEMENTATION OF THE VERTICAL AND DIAGONAL DIRECTIONS

The notion of direction in the above semi-global algorithm refers to the image line where the previous pixel p-1 belongs. For example, if this direction coincides with the horizontal scanline, then the pixel previous to the current pixel p = (i, j) is the pixel p-1 = (i, j-1) and it belongs to the same scanline. We considered this case in the description of the system in the previous sections. In hardware, pixels flow in one dimension from left to right, with each successive scanline following the previous scanline. Therefore, a single register can pipeline the value of one pixel for one clock cycle. The output of the register at clock cycle t is the pixel value at clock cycle t-1. The same principle holds for every computation performed on a pixel, for example its path costs L or its disparity value d. A register at the output of the recursive_path block produces the L(p-1, d) value. This configuration is shown in Figure 6(a). The recursive_path block is the second block in the pipeline of Figure 1.

Fig. 6. Registering schemes used to optimize cost along the four main image directions: (a) along the horizontal direction (0°), (b) along the vertical direction (90°), (c) along the diagonal (45°), and (d) along the second diagonal (135°).

In the case of the vertical direction, the previous pixel to the current pixel p = (i, j) is the pixel p-1 = (i - 1, j). Now the previous pixel belongs to the previous scanline and is found exactly above the current pixel. Therefore, the two pixels are separated by a full-length scanline. In this case, an image-line buffer should be used at the output of the recursive_path block. When the output of the recursive_path block is the current pixel cost L(p, d), then the output of the line-buffer is the cost L(p-1, d) of the previous pixel on the vertical direction. This configuration is shown in Figure 6(b).

In a similar manner, in the case of the diagonal direction, the pixel previous to the current pixel p = (i, j) is the pixel p-1 = (i - 1, j - 1). The previous pixel belongs to the previous scanline and is delayed by one step. Therefore, the two pixels are separated by a full-length scanline plus one pixel. In this case, a buffer with depth lw + 1 should be used at the output of the recursive_path block, where lw stands for line-width. When the output of the recursive_path block is the current pixel cost L(p, d), then the output of the buffer is the cost of the previous pixel on the diagonal direction. This configuration is shown in Figure 6(c). Finally, in Figure 6(d) the shift-registers delay the pixels for lw - 1 cycles, corresponding to the pixel on the opposite diagonal, at 135° relative to the horizontal direction.

The memory buffers detailed above store the whole array of path costs, as it is produced at the output of the recursive_path block. Therefore, the registers are dmax + 1 wide, in order to host path costs for all displacement values d. The depth of the registers varies from 1 to lw + 1, depending on the direction of cost computation.

Figure 7 shows the grid of pixels around the current pixel. Pixel O arrives at the input stage of the pipeline at time t. Measuring time in clock cycles, pixel A arrived one clock cycle ago and pixel C arrived lw clock cycles ago. Values A and C can be stored using a single register and a line buffer, respectively. Values B and D need a buffer lw + 1 and lw - 1 deep, respectively. The corresponding image directions are shown with blue arrows.

Fig. 7. Pixel grid showing the current pixel at time t and the previous pixels in all eight linear directions. The blue arrows correspond to the implemented directions.
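The four registering schemes of Figure 6 differ only in the depth of the cost delay line. A behavioral sketch follows, where lw is the line width and each stream element stands for a whole cost array L(p, ·):

```python
from collections import deque

# Delay, in clock cycles, from the previous pixel p-1 to the current
# pixel p, for each implemented path direction (lw = line width).
def delay_depth(direction_deg, lw):
    return {0: 1,           # horizontal:    p-1 = (i, j-1)
            90: lw,         # vertical:      p-1 = (i-1, j)
            45: lw + 1,     # diagonal:      p-1 = (i-1, j-1)
            135: lw - 1}[direction_deg]   # anti-diagonal: p-1 = (i-1, j+1)

# A FIFO of that depth turns the stream L(p, d) into L(p-1, d).
def delay_line(stream, depth, fill=None):
    buf = deque([fill] * depth, maxlen=depth)
    for x in stream:
        yield buf[-1]       # the value that entered `depth` cycles ago
        buf.appendleft(x)
```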

In order to implement all four directions, namely the horizontal, the vertical and the two diagonal directions, four different datapaths are needed, each one comprising the recursive_path block, the pixel/line shift registers and the blocks in the computation loop, as shown in Figure 1. The intro_block, where the net correspondence cost C(p, d) is computed, is common to all four datapaths. The overall concept is shown in schematic form in Figure 8. The final block adds the partial costs Lh, Lv, Ld, Ld’ and recovers the index d of the minimum total cost. This index corresponds to the disparity value. This final stage is implemented again with the rectangular structure parallel_min of Figure 5, with an output for the index of the minimum value.

Fig. 8. Computation of the minimum cost path along four basic directions, using four parallel datapaths for the implementation of the recursive loop.

Obviously, for the computation of the eight linear image directions proposed in the original publication of the SGM algorithm, eight parallel datapaths are needed. However, one has to consider that the native flow of pixels from left to right and from the upper to the lower row forbids the immediate computation of more than four linear directions. In the circumstances of a real stereo pair, four directions are enough for the production of a smooth dense depth map. This is also how most competing architectures [15–19] have treated this problem. In order to compute all eight directions, more elaborate memory schemes need to be implemented. The whole image frame has to be stored, in order to reverse the flow of scanlines and let rows flow from the last to the first. The efficient implementation of such frame buffers is an interesting extension but lies beyond the scope of the present article.

6 SIMULATION OF THE HARDWARE SYSTEM

In order to establish a reference implementation, we first designed the SGM system in software, using Python. Array structures similar to the shift-register structures in VHDL were used. Integer numbers were used for arithmetic operations. For the computation of the minima of cost arrays, array functions from the numpy mathematical library were used. Emphasis was given to the arithmetic and algorithmic accuracy of the results, according to Equation (4), without any simplification. As a first step, only the horizontal direction was used for the optimization of the depth map. Net correspondence cost C(p, d) was computed using both the naïve absolute difference measure given in Equation (5) and the cost according to Birchfield and Tomasi [21]. A more elaborate cost function would certainly produce better results in terms of the final depth map; however, it is necessary to recall that here we focus on the hardware implementation of the SGM optimization step rather than on the best overall stereo system. Once the SGM optimization step has been successfully mapped in hardware, one can apply any suitable cost aggregation scheme [20, 22], as well as post-processing, in order to enhance the final disparity map.

In order to assess the results, the well-known Tsukuba stereo pair [4] was used. In Figures 9(a) and (b), the left reference image of the Tsukuba stereo pair and the ground truth depth map are presented. The depth map resulting from the software implementation is shown in Figure 9(c). In Figure 9(d), the depth map produced by the Modelsim simulation of the proposed hardware system is shown. The differences between the two images are negligible and can be attributed to small differences in the manipulation of cost arrays in the software and hardware implementations. To measure the difference between the disparity maps of the software and hardware implementations, the rms error metric was used, given by the equation [4]:

(10) \(\textit{rms error} = \left(\frac{1}{N}\sum_{i,j} |d_H(i,j) - d_S(i,j)|^2\right)^{1/2}\)

In the above equation, dH and dS are the disparities produced by the hardware accelerator and the software implementation, respectively. The rms error is calculated over the whole image, and N is the total number of pixels in each image. The rms error is measured in disparity units and in our experiments it was found to be 0.38 disparity units over the whole image. Therefore, the software disparities and the hardware disparities, as produced by Modelsim, are almost identical. In this 1D horizontal optimization, the scanline streaking problem is present, as is the case whenever vertical smoothness interactions are absent.
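Equation (10) is a one-liner in numpy, with the two disparity maps given as 2-D arrays:

```python
import numpy as np

def rms_error(d_hw, d_sw):
    """rms error of Equation (10), in disparity units."""
    diff = d_hw.astype(float) - d_sw.astype(float)
    return np.sqrt(np.mean(diff ** 2))
```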

Fig. 9. (a) Left image of the Tsukuba stereo pair and (b) ground truth depth map [4]. (c) Dense disparity map, as produced by a reference software version of the SGM algorithm. (d) Dense map produced by the Modelsim simulation of the hardware system. Both systems are designed to minimize path-costs along the horizontal direction (dmax = 31 pix, net cost C(p, d) according to BT [21]).

As a next step, we produced results using parallel systems for the computation of path costs along the four main image directions, namely the horizontal (0°), the vertical (90°), the diagonal (45°), and the opposite diagonal (135°). The result is shown in Figure 10 (left). In Figure 10 (right), the result of the purely software SGM implementation is also shown for comparison. Finally, the system was adjusted to produce results for a number of different reference stereo pairs obtained from the Middlebury data set [23]. The main parameters to be adjusted were the maximum disparity value dmax and the width of the image scanline. Results of this experiment are shown in Figure 11, for the “cones” [24] and the “books” [25] stereo pairs.

Fig. 10. (Left) Dense map produced by the Modelsim simulation of the hardware system. (Right) Dense disparity map, as produced by a pure software implementation of SGM. Both systems are designed to minimize path costs along the same four directions (dmax = 31 pix, net cost according to BT).

Fig. 11. First row from left to right: Left image of the Cones stereo pair, ground truth depth map [24] and dense map produced by the Modelsim simulation of the proposed hardware system. Second row: Left image of the Books stereo pair, ground truth [25] and dense map produced by the Modelsim simulation of the proposed hardware system.

7 IMPLEMENTATION WITH STRATIX V AND ZYNQ FPGAS

The proposed system was implemented with the Stratix V 5SEEBF45I2L FPGA by Intel, using the Quartus Prime synthesis tool. It was also implemented for a Zynq xczu7ev-ffvc1156-2-e UltraScale+ MPSoC device, using the Vivado Design Suite by Xilinx/AMD for synthesis. In the following paragraphs, implementation details are given in terms of resources and timing characteristics, for both chips. The two families use similar, though not identical, Look-Up Tables (LUTs) in their logic cells; therefore, an approximate comparison stands between the 8-input Adaptive LUTs (ALUTs) in Intel's Adaptive Logic Modules (ALMs) and the 6-input LUTs in Xilinx Configurable Logic Blocks (CLBs). Eight ALMs form a LAB in Stratix technology, while eight LUTs plus storage and arithmetic logic form the Xilinx CLB in UltraScale+ technology [26].

7.1 Implementation with Stratix V Technology

The results obtained from the implementation in the Stratix V 5SEEBF45I2L device are presented in Table 1. This FPGA has in total 359,200 available ALMs and 54,067,200 memory bits. The implemented system consists of four parallel computation stages, optimizing along four directions, as in Figure 8. The parameters that determine the size of the system are the number of disparity levels dmax and the image line-width. Another parameter is the number k of rectangular blocks used in the fast computation of the minimum value, as described in Section 4.3 (see Figure 5). Fmax is the maximum frequency produced by the Quartus Prime timing analysis tool. With increasing dmax, the resources increase proportionally, while the maximum frequency decreases. Block memory bits are engaged for the line-buffers that are necessary for the implementation of the vertical and diagonal directions. Naturally, the required memory bits vary proportionally to the width of the scanline. The most resource-demanding implementation is the one with support for 128 levels of disparity and a line width of 1280 pixels. This system requires 20% of the overall ALMs and approximately 3% of the memory resources of the 5SEEBF45I2L FPGA chip.

Table 1. Total Required Resources in Stratix V Technology, for Different Parameter Settings*

| dmax | Line-width | k | Total ALMs | ALUTs | Logic Registers | Block memory bits | Fmax (MHz) |
|------|------------|---|------------|-------|-----------------|-------------------|------------|
| 32 | 640 | 8 | 14.141 | 23.350 | 3.410 | 631.740 | 76,3 |
| 64 | 384 | 8 | 31.597 | 51.935 | 6.464 | 745.020 | 65 |
| 64 | 640 | 8 | 31.685 | 51.957 | 6.482 | 1.244.220 | 63,5 |
| 64 | 1280 | 8 | 31.692 | 52.064 | 6.500 | 2.492.220 | 64 |
| 96 | 640 | 8 | 53.637 | 86.345 | 9.554 | 1.856.700 | 61 |
| 96 | 1024 | 8 | 53.685 | 86.340 | 9.556 | 2.974.140 | 58,36 |
| 96 | 1280 | 8 | 53.703 | 86.362 | 9.572 | 3.719.100 | 57 |
| 128 | 640 | 16 | 69.709 | 111.320 | 12.626 | 2.469.180 | 53 |
| 128 | 1280 | 16 | 69.719 | 111.332 | 12.644 | 4.945.980 | 55,1 |

* Four parallel computation stages (horizontal, vertical and two diagonal), cost according to BT.

Table 2 presents the partitioning of resources by the main entities, namely: the introductory block for the computation of the Birchfield and Tomasi cost (intro_block); the block for the computation of the array of minima among the costs at disparities d − 1, d, d + 1 (close disparities) for all d (compute_close_min); and the block for the computation of the overall minimum among the L(p−1, d) costs over all d (far/overall_min). In the same far/overall_min block, the min_far array of minima among the far-indexed disparity costs is computed for all d values (see Section 4.4). Finally, the resources required for the implementation of the parallel_min structure of Figure 5 are also given for different dmax values (parallel_min). The resources required for a single instance of each entity are given; they are multiplied by 4 for the four stages, where applicable.

Table 2. Partitioning of Resources by Entity with Increasing dmax (Stratix V)*

| dmax | intro_block ALUTs/Regs | compute_close_min ALUTs/Regs | far/overall_min ALUTs/Regs | parallel_min ALUTs/Regs | k |
|------|------------------------|------------------------------|----------------------------|-------------------------|---|
| 32 | 4.360/874 | 1.114 × 4/0 | 2.124 × 4/0 | 588/0 | 8 |
| 64 | 9.111/1.706 | 2.686 × 4/0 | 4.790 × 4/0 | 2.600/0 | 8 |
| 96 | 13.575/2.538 | 4.030 × 4/0 | 8.546 × 4/0 | 4.161/0 | 8 |
| 128 | 18.023/3.370 | 5.374 × 4/0 | 11.706 × 4/0 | 5.933/0 | 16 |

* Four parallel computation stages (horizontal, vertical and two diagonal), cost according to BT, lw = 640.
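For reference, the cooperation of these entities can be modeled in software. The following minimal sketch assumes the standard SGM path-cost recursion with smoothness penalties P1 and P2; the penalty values and the function name are illustrative and not taken from the paper.

```python
import numpy as np

# Illustrative smoothness penalties; the actual P1/P2 values of the
# implementation are not restated in this section.
P1, P2 = 10, 120

def path_cost_step(L_prev, C_cur):
    """One step of the recursive SGM path-cost computation along one
    direction, decomposed as in the hardware entities of Table 2:
    a 'close' minimum over disparities d-1, d, d+1 (compute_close_min)
    and a minimum over the far-indexed disparities together with the
    overall minimum (far/overall_min)."""
    dmax = len(L_prev)
    overall_min = L_prev.min()                 # far/overall_min block
    L_cur = np.empty_like(L_prev)
    for d in range(dmax):
        # compute_close_min block: minimum among d-1, d, d+1 (P1 penalty)
        close = min(L_prev[d],
                    L_prev[d - 1] + P1 if d > 0 else np.inf,
                    L_prev[d + 1] + P1 if d < dmax - 1 else np.inf)
        # minimum over far-indexed costs, penalized by P2; computed
        # naively here, derived from the overall minimum in hardware
        far = min((L_prev[k] for k in range(dmax) if abs(k - d) > 1),
                  default=np.inf) + P2
        # normalization by the overall minimum keeps the costs bounded
        L_cur[d] = C_cur[d] + min(close, far) - overall_min
    return L_cur
```

In hardware, the close and far minima are produced by dedicated parallel blocks within a single clock cycle, whereas the sketch above computes them serially.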

Migrating to a Stratix 10 device results in about 20% higher frequencies, with slightly increased memory resources. The system was compiled for the 1SG085HN1F43E1VG Stratix 10 GX device, featuring 14 nm technology. The results are summarized in Table 3. The total resources in this device are 284.960 ALMs and 71.208.960 block memory bits. Again, k is the number of rectangular blocks used in the fast computation of the minimum value, as described in Section 4.3, Figure 5. The total on-chip power is computed by the Quartus power analysis tool and is reported to be below 0,7 W. Although the Stratix V implementation, presented in Tables 1 and 2, corresponds to a lower-end device, it can be meaningfully compared with other systems in the literature.

Table 3. Total Required Resources in Stratix 10 Technology, for Different Parameter Settings*

| dmax | k | Line-width | ALMs | ALUTs | Logic Registers | Block memory bits | Fmax (MHz) | Total Power (W) |
|------|---|------------|------|-------|-----------------|-------------------|------------|-----------------|
| 32 | 8 | 640 | 14.561 | 26.808 | 5.420 | 1.013.760 | 86 | 0,36 |
| 64 | 8 | 640 | 31.100 | 56.874 | 10.412 | 1.996.800 | 73 | 0,46 |
| 128 | 16 | 640 | 69.527 | 126.741 | 20.396 | 3.962.880 | 63 | 0,698 |

* Four parallel computation stages (horizontal, vertical and two diagonal), cost according to BT, lw = 640.

7.2 Implementation with Zynq UltraScale+ Technology

Next, the system was implemented using the Xilinx Zynq UltraScale+ MPSoC xczu7ev-ffvc1156-2-e device. This is a heterogeneous device, featuring FPGA programmable logic and an ARM Cortex-A53 processing system, and it can support hardware/software co-design for the implementation of a practical stereo system. In this work, only the programmable logic part of the chip is used. The total available programmable logic resources are 28.800 CLBs, equivalent to 230.400 CLB LUTs and 460.800 CLB registers.

Table 4 reports resources and other critical implementation characteristics with an increasing number of disparity levels, starting from 32 and ending at 128. Image resolution is standard VGA 640 × 480 and all four stages corresponding to the four optimizing directions are implemented. The most resource-demanding implementation, supporting 128 levels of disparity, requires 71% of the total resources of the chip, as this device is more limited than the Stratix V.

Table 4. Total Required Resources in Zynq UltraScale+ Technology, with Increasing Maximum Disparities*

| dmax | LUTs as Logic | CLB Registers | LUTs as Memory | WNS / Clock period (ns) | Fmax = 1000/(T − WNS) (MHz) | k | Total Power (W) |
|------|---------------|---------------|----------------|-------------------------|-----------------------------|---|-----------------|
| 32 | 20.569 | 3.973 | 19.832 | 9,817/25 | 66 | 8 | 0,99 |
| 64 | 43.433 | 7.135 | 39.032 | 7,018/30 | 43,5 | 8 | 1,209 |
| 96 | 68.141 | 10.286 | 58.232 | 9/33 | 41 | 8 | 1,65 |
| 128 | 87.535 | 13.432 | 77.432 | 11,5/37 | 39 | 16 | 1,95 |

* Four parallel optimization stages, line width = 640 pixels.

We note that UltraScale+ technology can implement shift registers with LUTs used as memory, each LUT corresponding to 32 memory bits. We observe the same trend of diminishing maximum frequency with increasing disparity levels, as in Stratix V technology. We also observe that the clocking frequencies are lower than those reported for Stratix V, especially at large disparity ranges. The Vivado timing analysis reports the Worst Negative Slack (WNS) for the specific clock period constraining the implementation. Both values appear in the fifth column of Table 4, while the resulting maximum frequency appears in the sixth column. Again, k is the number of subcircuits implementing the computation of the minimum value, as in Figure 5. Total on-chip power is also reported, as derived by the Vivado power analysis.
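As a quick sanity check of the Table 4 figures, the maximum frequency follows directly from the clock constraint and the reported slack. The sketch below, with values taken from the dmax = 64 row, is only a restatement of the formula in the table header.

```python
def fmax_mhz(period_ns: float, wns_ns: float) -> float:
    """Maximum clock frequency (MHz) from the constraint period and the
    worst negative slack reported by timing analysis (both in ns). A
    positive WNS means timing is met with margin, so the achievable
    period is the constraint minus the slack."""
    return 1000.0 / (period_ns - wns_ns)

# dmax = 64 row of Table 4: 30 ns constraint, WNS = 7,018 ns
print(f"{fmax_mhz(30.0, 7.018):.2f} MHz")  # -> 43.51 MHz, matching the table
```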

In the above investigation with both FPGA architectures, it was found that resources grow linearly as dmax is increased (up to 128) and as the number of optimization directions is increased (up to 4). No exhaustion of logic or congestion of routing resources was observed at any scale of the proposed design. The linear growth of resources with increasing dmax, against the quadratic form of Equation (7), is attributed to the proposed optimization of Figure 5 (see also Equation (8)) and to the proposed hardware-friendly algorithm for the computation of the minima of far-indexed path costs, described in Section 4.4.

7.3 Implementation on an Actual FPGA Board

The proposed stereo accelerator was implemented and tested using the PYNQ-Z1 board [27] by Digilent, which features a Xilinx Zynq-7020 SoC. This implementation serves as a practical verification of the proposed system. The PYNQ-Z1 board is equipped with HDMI-in and HDMI-out ports for video input/output. The Zynq-7020 device features an on-chip ARM processor and Artix-technology programmable logic. Only the FPGA part of the device was used for the implementation of the stereo accelerator.

A StereoPi camera was used for capturing and rectifying the stereo pair. The StereoPi [28] is an open-source project, based on a small, low-cost expansion board for the Raspberry Pi (RPi) Compute Module (CM). The board features two Camera Serial Interface (CSI) connectors, to which two Raspberry Pi V1 cameras (OV5647) are attached. The OpenCV library for C++, running on the RPi Compute Module, was used for camera calibration and stereo pair rectification. Experiments were performed with two stereo pair resolutions: 320 × 240 and 640 × 480. In the first case, OpenCV on the Compute Module rectified the stereo images at a frame rate of 97 fps; in the second case, stereo pairs of resolution 640 × 480 were rectified at a rate of 38 fps. The pair of rectified images was united in a single frame, which was sent to the Compute Module HDMI output. The refresh rate of the HDMI signal was 60 Hz in all experiments.
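In the actual setup, calibration and rectification ran as C++ OpenCV code on the Compute Module. The Python sketch below illustrates an equivalent OpenCV pipeline for the 640 × 480 case; the calibration matrices, the baseline and the function names are placeholders, not values from the paper.

```python
import cv2
import numpy as np

# Placeholder calibration results; in the described setup they would come
# from a prior cv2.stereoCalibrate() run with a calibration pattern.
K1 = K2 = np.array([[500.0, 0.0, 320.0],
                    [0.0, 500.0, 240.0],
                    [0.0, 0.0, 1.0]])
D1 = D2 = np.zeros(5)
R = np.eye(3)
T = np.array([-60.0, 0.0, 0.0])   # illustrative 60 mm baseline
size = (640, 480)                 # one of the two tested resolutions

R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)

# Precompute the per-pixel remapping once; applying it per frame is cheap.
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_16SC2)
map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_16SC2)

def rectify_and_unite(left, right):
    """Rectify a captured stereo pair and unite it side by side in a single
    1280x720 frame; the lower part stays blank, as in the experiments."""
    left_r = cv2.remap(left, map1x, map1y, cv2.INTER_LINEAR)
    right_r = cv2.remap(right, map2x, map2y, cv2.INTER_LINEAR)
    top = np.hstack((left_r, right_r))              # 1280 x 480
    frame = np.zeros((720,) + top.shape[1:], dtype=top.dtype)
    frame[:top.shape[0]] = top
    return frame
```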

The HDMI signal from the RPi CM was given as input to the hardware system in the Zynq FPGA, on the PYNQ-Z1 board. The TMDS input signals were translated to native video using a Vivado IP block. As a first step, the two stereo images were separated; they were synchronized using suitable buffers and fed to the stereo accelerator. Finally, the disparity output was translated to an HDMI source signal and projected on a monitor screen. Figure 12 presents a diagram of the connections between the different parts of the system. Figure 13 shows the input pair of rectified images and the resulting depth map at the output of the PYNQ board; in the inset, a picture of the setup is shown.

Fig. 12. The setup used for the practical verification of the proposed system.

Fig. 13. Upper image: A rectified stereo pair captured in video, using the StereoPi, as it is projected on a monitor screen. Lower image: a depth map of the scene, processed by the proposed system and projected on screen in real-time. In the inset, the StereoPi and the PYNQ-Z1 board are shown.

In the first experiment, the pair of 320 × 240 images was united in a single frame with resolution 640 × 480, where the lower part of the frame remained blank. In this case, the pixel-clock frequency was 24 MHz. Similarly, in the second experiment, the pair of 640 × 480 images was united in a single frame with resolution 1280 × 720; in this case, the pixel-clock was 74 MHz. In both cases, the depth maps were refreshed at a rate of 60 Hz. For the images with resolution 640 × 480, only 38 distinct frames per second were produced, due to the limitation of the software rectification process running on the CM. In all cases, the video streaming was smooth and flawless. The maximum disparity range was 64 pixels; larger designs could not fit in the Zynq-7020 device, which has limited resources compared to the more expensive UltraScale+ devices.

7.4 Evaluation and Comparison with Other Architectures

Both the Stratix and the Zynq devices can host the proposed system, implementing the design with a fraction of their resources. The evaluated maximum frequencies of both the Stratix V and the Zynq UltraScale+ implementations allow high refresh rates, which depend on the resolution of the stereo images. The most resource-demanding implementation, with support for 128 levels of disparity and line-width 1280 pixels, allows a pixel-clock frequency of 55,1 MHz on the Stratix V device. This frequency corresponds to a nominal throughput of 59,78 frames per second for image resolution 1280 × 720. The Zynq device reports lower maximum frequencies, which, however, still correspond to frame rates above 40 fps. Taking also into account the blanking intervals and the standard 60 Hz refresh rate of typical sensors, it follows from the reported pixel-clock frequencies that stereo images of standard resolution 800 × 600 can be processed at 60 fps, with support for 128 pixels of disparity.
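The nominal throughput figures quoted above follow directly from the pixel-clock frequency and the frame size, ignoring blanking; a minimal check:

```python
def nominal_fps(pixel_clock_hz: float, width: int, height: int) -> float:
    """Nominal frame rate when one disparity value is produced per pixel
    clock, ignoring blanking intervals."""
    return pixel_clock_hz / (width * height)

print(f"{nominal_fps(55.1e6, 1280, 720):.2f} fps")  # -> 59.78 fps, as reported
```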

The key features of the proposed hardware implementation can be summarized as follows. First, the net cost C(p, d) is computed in the introductory block according to the Birchfield and Tomasi dissimilarity measure. This measure produces smooth cost functions and results in much better depth maps than plain absolute differences, with little additional computational overhead. Second, in the recursive computation loop, the minima among the path costs L(p−1, d) are computed in one clock cycle. This is achieved using the massive structure of parallel comparators shown in Figures 4 and 5. In addition, a hardware-friendly simplification is proposed that reduces the parallel computation blocks to a bare minimum, without loss of quality in the final depth map, as described in Section 4.4.
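In software terms, the one-cycle minimum of Figures 4 and 5 corresponds to a two-tier reduction: the cost array is split into k blocks, a first tier finds each block's minimum (all blocks operating in parallel in hardware), and a second tier reduces the k block minima. The sketch below models this structure; the function name and test values are illustrative.

```python
import random

def two_tier_min(costs, k):
    """Model of the two-tier parallel minimum structure: tier 1 reduces
    each of the k blocks to its local minimum, tier 2 reduces the k
    block minima to the overall minimum."""
    n = len(costs)
    block = (n + k - 1) // k
    tier1 = [min(costs[i:i + block]) for i in range(0, n, block)]
    return min(tier1)

# e.g. dmax = 128 path costs reduced with k = 16 blocks, as in Table 1
costs = [random.randint(0, 4095) for _ in range(128)]
assert two_tier_min(costs, 16) == min(costs)
```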

As a result of the above improvements, the proposed system drastically shortens the processing time inside the recursive computation loop. A comparison with other systems in the literature shows that the proposed system adopts a different approach to solving the data dependency problem and increasing the throughput of the accelerator. Instead of using multiple passing schemes with free cycles and row-level parallelism, as in [15], or serial row processing, as in [18], our system employs parallelism in the computation of the minimum costs and processes path costs with lower complexity. The dataflow is simple and there is no need for complex alignment of multiple rows. The required buffers are also minimal, since they only produce the path-cost delay required by the four directions of optimization. The cost of our approach is a moderate drop in the overall clocking frequency; however, the system achieves high frame rates while utilizing substantially lower resources and consuming less power than other state-of-the-art implementations.

A comparison in terms of resources and throughput with competitive architectures, namely [14], [15], [17] and [18], is given in Table 5. The systems in [15] and [17] employ row-level parallelism and in this respect differ from the proposed design. The system in [18] uses serial processing of image rows, employing FIFO buffers and a pixel-aligning control scheme. The implementations in [17] and [18] used Stratix Intel devices, while the implementations in [14] and [15] used a Virtex-4 and a Virtex-5 device, respectively. Where possible, the parameters of all systems were chosen to be close to each other, to keep the comparison as fair as possible. Note, however, that the systems in Table 5 implement their own matching-cost computation units and pre- or post-processing units, besides the common SGM optimization step. The metric million disparity estimations per second (MDEs/s) is equal to the product of image size, disparity levels and frame rate [15], while the normalized MDEs/s/KLC divides the MDEs/s by the number of LUTs in thousands (Kilo LUTs, KLC).

Table 5. Comparison of Resource Usage and Performance with Other SGM Systems

| System | Image resolution | dmax | LUTs | Registers | RAM bits | Frame rate | MDEs/s (b) | MDEs/s/KLC (c) | Device |
|--------|------------------|------|------|-----------|----------|------------|------------|----------------|--------|
| proposed | 640 × 480 | 64 | 51.957 | 6.482 | 1.244.220 | 206,7 fps | 4.064 | 78,2 | Stratix V |
| Gehrig et al. [14] | 320 × 200 | 64 | 60.000 | n.a. (a) | 2.457.600 | 27 fps | 110,6 | 1,84 | Virtex 4 |
| Wang et al. [17] | 640 × 480 | 64 | 92.249 | 52.537 | 4.156.903 | 122,1 fps | 2.400 | 26 | Stratix V |
| Banz et al. [15] | 640 × 480 | 64 | 96.298 | 7.466 | 4.847.616 | 167,0 fps | 3.283 | 34,1 | Virtex 5 |
| proposed | 1024 × 768 | 96 | 86.340 | 9.556 | 2.974.140 | 74,2 fps | 5.602 | 64,9 | Stratix V |
| Wang et al. [17] | 1024 × 768 | 96 | 125.255 | 81.092 | 9.282.494 | 67,82 fps | 5.120 | 40,88 | Stratix V |
| proposed | 1280 × 720 | 128 | 111.332 | 12.644 | 4.945.980 | 59,78 fps | 7.053 | 63,35 | Stratix V |
| Wang et al. [17] | 1600 × 1200 | 128 | 222.034 | 149.288 | 16.604.247 | 42,61 fps | 10.472 | 47,2 | Stratix V |
| Banz et al. [15] | 640 × 480 | 128 | 96.298 | 7.466 | 4.847.616 | 103 fps | 4.050 | 42,1 | Virtex 4 |
| Cambuim et al. [18] | 1024 × 768 | 128 | 141.000 | 107.192 | 6.000.000 | 25 fps | 2.516 | 17,84 | Stratix IV |

(a) n.a.: not available.
(b) MDEs/s: Million disparity estimations per second.
(c) KLC: Kilo LUTs.
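The MDEs/s entries of Table 5 can be reproduced directly from the definition above; for example, for the first "proposed" row:

```python
def mdes_per_s(width: int, height: int, dmax: int, fps: float) -> float:
    """Million disparity estimations per second: image size x disparity
    levels x frame rate [15]."""
    return width * height * dmax * fps / 1e6

m = mdes_per_s(640, 480, 64, 206.7)
print(f"{m:.0f} MDEs/s, {m / 51.957:.1f} MDEs/s/KLC")  # -> 4064 MDEs/s, 78.2
```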

Depending on the evaluation method, the performance results listed in Table 5 may not reflect a system's maximum throughput. For example, Gehrig et al. [14] and Cambuim et al. [18] report experimental values possibly affected by real-world limitations, such as a software bottleneck or the camera refresh rate. Cambuim et al. report in [29] a theoretical throughput of 127 fps for their system.

The comparisons indicate that the proposed system can be implemented with fewer resources, especially in terms of register and memory usage, and can achieve better or equivalent frame rates than other systems in the literature. The on-chip power consumption reported in Tables 3 and 4 is also lower than that reported for all other systems; the power consumption of other FPGA implementations ranges from 3 to 6,5 W, as reported comparatively in [18].

8 CONCLUSIONS

A novel design is proposed for the hardware acceleration of the semi-global matching stereo algorithm. The proposed architecture tackles an inherent problem in the acceleration of the SGM algorithm, namely the data dependency problem: the computation of the current pixel path costs L(p, d) is recursive and receives as feedback the minima of the adjacent pixel cost array L(p−1, d), as defined in Equation (4). The cost arrays are of the size of the disparity range dmax, which in the proposed system extends up to 128 pixels. In order to preserve computation at pixel-clock rate, this problem is approached using a parallel comparator structure for the fast computation of the minimum of a large set of values. This block is then applied to the computation of the required partial minima in one clock cycle. By using a two-tier arrangement and implementing only the necessary comparators, the overall resource requirements of the parallel structure are minimized. A hardware-friendly solution is proposed for the efficient computation of the array of the minima of far-indexed disparities for each d value: this problem is reduced to the computation of the overall minimum among path costs and of the minimum next to the overall minimum. With these innovations introduced in the datapath, the system is able to process the whole range of disparities in parallel, at pixel-clock rate. Four parallel stages are employed for optimization along the horizontal, the vertical and the two diagonal directions. The dense depth maps produced by Modelsim simulation, using reference stereo pairs, are almost identical to the output of the software version of the original algorithm.
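Since Section 4.4 is not reproduced here, the following sketch is only a plausible software reconstruction of the far-minimum reduction summarized above, not the paper's exact circuit: the cost array yields just two quantities, the overall minimum and the minimum next to it, from which the far minimum for every d is selected.

```python
def min_far_array(L_prev):
    """Plausible reconstruction of the hardware-friendly far-minimum
    computation (Section 4.4): only the overall minimum m1 (at index i1)
    and the minimum next to it, m2 (the smallest cost outside the
    immediate neighborhood of i1), are extracted. For disparity d the
    far minimum is m1, unless the overall minimum falls inside the close
    window d-1, d, d+1, in which case m2 is used instead."""
    dmax = len(L_prev)
    i1 = min(range(dmax), key=lambda i: L_prev[i])
    m1 = L_prev[i1]
    outside = [L_prev[i] for i in range(dmax) if abs(i - i1) > 1]
    m2 = min(outside) if outside else m1
    return [m1 if abs(d - i1) > 1 else m2 for d in range(dmax)]
```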

The system is implemented using an Intel Stratix V FPGA and a Zynq UltraScale+ device by Xilinx/AMD. Resource requirements and maximum clock frequencies are investigated for disparity ranges up to 128 pixels and image resolutions up to 1280 × 720. A practical implementation of the system on an actual FPGA board is also presented and serves as a real-world verification of the proposed design. The system is less elaborate than other state-of-the-art implementations and compares favorably to them.

REFERENCES

[1] Hicks Stephen L., Wilson Iain, Muhammed Louwai, Worsfold John, Downes Susan M., and Kennard Christopher. 2013. A depth-based head-mounted visual display to aid navigation in partially sighted individuals. PLoS ONE 8, 7 (July 2013), e67695.
[2] Golodetz Stuart, Cavallari Tommaso, Lord Nicholas A., Prisacariu Victor A., Murray David W., and Torr Philip H. S. 2018. Collaborative large-scale dense 3D reconstruction with online inter-agent pose optimisation. IEEE Transactions on Visualization and Computer Graphics 24, 11 (Nov. 2018), 2895–2905.
[3] Hartley Richard and Zisserman Andrew. 2004. Multiple View Geometry in Computer Vision (2nd ed.). Cambridge University Press, New York, NY, USA.
[4] Scharstein Daniel and Szeliski Richard. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47, 1-3 (April-June 2002), 7–42.
[5] Bleyer Michael and Breiteneder Christian. 2013. Stereo matching—state-of-the-art and research challenges. In G. Farinella, S. Battiato, and R. Cipolla (Eds.), Advanced Topics in Computer Vision. Advances in Computer Vision and Pattern Recognition (ACVPR). Springer, London, 143–179.
[6] Sun Jian, Zheng Nan-Ning, and Shum Heung-Yeung. 2003. Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 7 (July 2003), 787–800.
[7] Boykov Yuri, Veksler Olga, and Zabih Ramin. 2001. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 11 (Nov. 2001), 1222–1239.
[8] Pérez-Patricio Madain and Aguilar-González Abiel. 2019. FPGA implementation of an efficient similarity-based adaptive window algorithm for real-time stereo matching. Journal of Real-Time Image Processing 16, 2 (Apr. 2019), 271–287.
[9] Aguilar-González Abiel and Arias-Estrada Miguel. 2016. An FPGA stereo matching processor based on the sum of Hamming distances. In V. Bonato, C. Bouganis, and M. Gorgon (Eds.), Applied Reconfigurable Computing (ARC 2016). Lecture Notes in Computer Science, Vol. 9625. Springer, Cham.
[10] Daolu Zha, Xi Jin, and Tian Xiang. 2016. A real-time global stereo-matching on FPGA. Microprocessors and Microsystems 47, Part B (Nov. 2016), 419–428.
[11] Hernandez-Juarez D., Chacón A., Espinosa A., Vázquez D., Moure J. C., and López A. M. 2016. Embedded real-time stereo estimation via semi-global matching on the GPU. Procedia Computer Science 80 (2016), 143–153.
[12] Hirschmüller Heiko. 2005. Accurate and efficient stereo processing by semi-global matching and mutual information. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), June 20-26, 2005, San Diego, CA, USA. IEEE, Vol. 2, 807–814.
[13] Hirschmüller Heiko. 2008. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 2 (Feb. 2008), 328–341.
[14] Gehrig Stefan K., Eberli Felix, and Meyer Thomas. 2009. A real-time low-power stereo vision engine using semi-global matching. In M. Fritz, B. Schiele, and J. H. Piater (Eds.), Computer Vision Systems (ICVS 2009). Lecture Notes in Computer Science, Vol. 5815. Springer, Berlin.
[15] Banz Christian, Hesselbarth Sebastian, Flatt Holger, Blume Holger, and Pirsch Peter. 2010. Real-time stereo vision system using semi-global matching disparity estimation: Architecture and FPGA implementation. In International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, July 19-22, 2010. IEEE, 93–101.
[16] Wang Wenqiang, Yan Jing, Xu Ningyi, Wang Yu, and Hsu Feng-Hsiung. 2013. Real-time high-quality stereo vision system in FPGA. In International Conference on Field-Programmable Technology (FPT), Sept. 9-11, 2013, Kyoto, Japan. IEEE, 358–361.
[17] Wang Wenqiang, Yan Jing, Xu Ningyi, Wang Yu, and Hsu Feng-Hsiung. 2015. Real-time high-quality stereo vision system in FPGA. IEEE Transactions on Circuits and Systems for Video Technology 25, 10 (Oct. 2015), 1696–1708.
[18] Cambuim Lucas, Oliveira Luiz, Barros Edna, and Ferreira Antonyus. 2020. An FPGA-based real-time occlusion robust stereo vision system using semi-global matching. Journal of Real-Time Image Processing 17, 4 (July 2019), 1447–1468.
[19] Brown Myron Z., Burschka Darius, and Hager Gregory D. 2003. Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 8 (Aug. 2003), 993–1008.
[20] Viola Paul and Wells William M. 1997. Alignment by maximization of mutual information. International Journal of Computer Vision 24 (Sept. 1997), 137–154.
[21] Birchfield S. and Tomasi C. 1998. A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 4 (April 1998), 401–406.
[22] Hirschmüller Heiko and Scharstein Daniel. 2007. Evaluation of cost functions for stereo matching. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), June 18-23, 2007, Minneapolis, MN, USA. IEEE.
[23] Middlebury stereo data set. Retrieved February 7, 2023 from https://vision.middlebury.edu/stereo/data/
[24] Scharstein Daniel and Szeliski Richard. 2003. High-accuracy stereo depth maps using structured light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), June 18-20, 2003, Madison, WI, USA.
[25] Scharstein Daniel and Pal C. 2007. Learning conditional random fields for stereo. In IEEE Conference on Computer Vision and Pattern Recognition, June 17-22, 2007, Minneapolis, MN, USA, 1–8.
[26] UltraScale Architecture Configurable Logic Block: User Guide, UG574 (v1.5). Xilinx, February 2017.
[27] PYNQ-Z1: Python Productivity for Zynq-7000 ARM/FPGA SoC. Retrieved June 3, 2023 from https://digilent.com/shop/pynq-z1-python-productivity-for-zynq-7000-arm-fpga-soc/
[28] StereoPi. Retrieved June 3, 2023 from https://stereopi.com/
[29] Cambuim Lucas F. S., Barbosa João P. F., and Barros Edna N. S. 2017. Hardware module for low-resource and real-time stereo vision engine using semi-global matching approach. In 30th Symposium on Integrated Circuits and Systems Design (SBCCI), Aug. 28-Sept. 1, 2017, Fortaleza, Brazil. IEEE, 53–58.
