1 Introduction

3D image processing is an important research area that has gained a lot of attention in the past decades. It is essential for the spatial reconstruction of a scene and therefore for many applications in the fields of robotics, autonomous driving and scene understanding. Such 3D reconstruction techniques have even reached medicine, human pose estimation and human action recognition.

Fig. 1

Search space and parametrization of image points. A point of interest is annotated with the triangle in (a) and with a filled circle in (b). Furthermore, the triangle can also be found in (b) as a yellow triangle to show the different locations more clearly. Given an image point on the left omnidirectional image, the corresponding right point has to be searched along a so-called epipolar curve for 3D reconstruction purposes. These image points can be parameterized by the azimuth \(\phi \in \{\phi _\text {l}, \phi _\text {r}\}\) and the elevation \(\theta \in \{\theta _\text {l}, \theta _\text {r}\}\) (See Sect. 3.1.2). The images in this figure rely on a sample of the THEOStereo dataset [8]

One prominent 3D reconstruction technique is stereo vision. Inspired by human 3D perception, stereo vision aims to recover the depth of a scene by aggregating the information of multiple cameras. Matching algorithms constitute the core of stereo vision. These algorithms look for corresponding points in the different images, that is, points that result from the projection of the same point in the real world. Finally, the difference in the position of these points allows stereo methods to retrieve their distance.

Most of the research in this area is based on images that follow the perspective camera model and are largely free of distortion artifacts. For performance reasons, these approaches rectify the input images. Rectified images correspond to multi-camera setups with aligned viewing directions and collinear x-axes. As a result, in all rectified images, corresponding points are found on horizontal and collinear lines called epipolar lines. This is an advantage for the matching algorithms, whose search space is reduced to one dimension along the epipolar line.

Perspective images usually have a field of view (FOV) of less than \(65^\circ \). In many applications, this limited FOV is a major drawback, as it requires multiple calibrated cameras to cover the region of interest. For this reason, omnidirectional images from so-called fisheye cameras have gained a lot of attention in recent years. These cameras exhibit a much larger FOV of around \(180^\circ \) and therefore capture much more of the scene with only one sensor. However, the larger FOV of these cameras comes with strong distortion in the omnidirectional images. For stereo vision, this distortion results in a new and more complex search space for corresponding points. The same holds true for so-called normalized omnidirectional images, which are, broadly speaking, a scaled version of their non-normalized counterparts. As shown in Fig. 1, the search space for stereo correspondences on (normalized) omnidirectional images is no longer a horizontal line, as in the case of perspective images, but a curve. This curve is known as the epipolar curve. In order to be able to use perspective matching algorithms, many stereo methods for omnidirectional images unwarp the images to cylindrical images [1,2,3] to recover the epipolar lines. However, these transformations reduce the FOV as well as the efficiency of the methods.

Neural networks have also reached stereo vision. Networks for perspective images, like AnyNet [4], achieve excellent results and allow more than 30 fps on high-resolution images. For omnidirectional stereo vision, different networks have been proposed by Won et al. [5,6,7]. These networks use spherical sweeping, in which the omnidirectional images are first projected onto spherical images with a high FOV, also known as equirectangular projection (ERP) images, and then mapped onto concentric global spheres surrounding the cameras’ rig center. Although this strategy yields highly precise results, its structure and transformations make it computationally intensive and prevent it from achieving real-time performance.

In this work, we aim to bridge the gap between accuracy and fast processing time in omnidirectional stereo vision. This is accomplished with the following contributions:

  • We propose OmniGlasses, a set of look up tables (LUTs) carefully designed for fast and incremental stereo correspondence search on omnidirectional images.

  • We integrate OmniGlasses into AnyNet as part of our new network Omni-AnyNet. This demonstrates how fast networks can be modified to process omnidirectional images.

  • We prove the efficiency of Omni-AnyNet, and therefore of OmniGlasses, experimentally. We show that the integration of OmniGlasses comes at only a low cost in terms of throughput while producing accurate scene reconstructions.

  • All results are compared to the state-of-the-art network OmniMVS \(^+\).

2 Related work

Depth estimation from omnidirectional images using neural networks was pioneered by Won et al. [5,6,7]. The input images of these networks come from a wide-baseline multi-view (four cameras) omnidirectional setup. The first of these works introduces SweepNet [5], a CNN that computes the matching costs of ERP image pairs warped from the omnidirectional images. The resulting cost volume is refined by applying a semi-global matching (SGM) algorithm [9] and finally the depth map is estimated. SweepNet struggles to handle occlusions, which are typical for the proposed wide-baseline omnidirectional setup. To overcome this problem, Won et al. propose OmniMVS [6], an end-to-end deep neural network consisting of three blocks: feature extractor, spherical sweeping and cost volume computation. OmniMVS was then extended to OmniMVS \(^+\) [7] by incorporating an entropy boundary loss for a better regularization of the cost volume computation. Furthermore, OmniMVS \(^+\) improves the efficiency of its predecessor in terms of memory consumption and run time. This is achieved by merging opposite camera views into the same ERP image.

Almost in parallel with OmniMVS, Wang et al. [10] developed 360SD-Net. This network takes as input a pair of ERP images from two cameras aligned in a top-bottom manner. The features extracted from the images are concatenated with those extracted from a polar angle map in order to introduce geometry information into the model. Atrous Spatial Pyramid Pooling (ASPP) modules are proposed to enlarge the receptive field, followed by a learnable cost volume (LCV). The final disparity is regressed using a Stacked-Hourglass module.

Komatsu et al. [11] present IcoSweepNet for depth estimation from four omnidirectional images. IcoSweepNet is based on an icosahedron representation, spherical sweeping and a 2D/3D CNN architecture called CrownConv, which is specially designed for extracting features from icosahedral representations. By considering the extrinsic camera parameters, this network achieves more robust results against camera misalignments than OmniMVS.

Córdova-Espaza et al. [12] combine a deep learning matching algorithm and stereo epipolar constraints to reconstruct 3D scenes from a stereo catadioptric system. After converting the images to panoramic ones, the matching point pairs proposed by a DeepMatching algorithm [13] are filtered according to the defined epipolar constraints. This method requires a tradeoff between 3D-point sparsity and reconstruction error, which is achieved by adjusting a threshold on the distance between the proposed points and their corresponding epipolar curves.

In [14], Lee et al. propose a semi-supervised learning method by expanding OmniMVS \(^+\) with a second loss function. The pixel-level loss selects a supervised loss or an unsupervised re-projection loss depending on the availability of ground truth information. This combination of loss functions makes it possible to account for the sparsity that is common in real depth ground truth generated by LIDAR. This work achieves better results than OmniMVS \(^+\) in the presence of sparsity and calibration errors, making the network more robust for real data.

Li et al. [15] introduce the Spherical Convolution Residual Network (SCRN) for omnidirectional depth estimation. This network processes ERP images as inputs, which are sampled into spherical meshes. In this way, the non-linear epipolar constraints in the plane are converted to linear constraints on the sphere. The SCRN is then followed by a planar refinement network (PRN) to return to a 2D representation. The full architecture is called Cascade Spherical Depth Network (CSDNet).

While the architectures designed to estimate depth maps from perspective images have reached real-time performance [4], the analogous architectures developed for omnidirectional images are still far from these levels of efficiency. Outside the field of neural networks and machine learning, Meuleman et al. [16] present a deterministic real-time sphere-sweeping stereo method. This work was developed for a \(360^\circ \) field-of-view setup consisting of four omnidirectional cameras. The proposed adaptive spherical matching runs directly on the input images but considers only the best camera pairs for each correspondence, which reduces the computation time. A fast inter-scale bilateral cost volume filtering allows the method to reach 29 fps. This method performs better and faster than OmniMVS and CrownConv; however, it lacks the generalization power of learning methods, which makes results robust to changes in the input data. Our work adapts the sweeping method from [16] and makes it part of the learning process of a neural network. As part of this integration, we also present an optimization process in order to save computational time without losing depth resolution. Moreover, as explained in Sect. 3, the matching process in our approach is performed on features instead of intensity values, and the considered setup is a stereo one.

Fig. 2

Epipolar Geometry. In (a), the projection points \(\tilde{\varvec{P}}_\text {l}\) and \(\tilde{\varvec{P}}_\text {r}\) of \(\varvec{P}\) on the image hemispheres are located in the so-called epipolar plane spanned by the camera centers \(\varvec{O}_\text {l}\) and \(\varvec{O}_\text {r}\) and the world point \(\varvec{P}\). (b) shows the parametrization of the epipolar plane by the angle \(\alpha \). (c) depicts \(\delta _\text {omni}\) and the yaw angles \(\beta _\text {l}\) and \(\beta _\text {r}\) of \(\varvec{v}_\text {l}\) and \(\varvec{v}_\text {r}\). These are used together with \(\alpha \) to triangulate \(\varvec{P}\). Unlike (a) and (c), the system in (b) is not camera-specific and applies to any camera coordinate system at \(\varvec{O}\in \{\varvec{O}_\text {l}, \varvec{O}_\text {r}\}\)

3 Omnidirectional stereo vision

In perspective stereo vision, one common way to retrieve the depth (z-distance) of a scene is to determine the so-called disparity map between two images \(\varvec{I}_\text {l}\) and \(\varvec{I}_\text {r}\) captured by two cameras at different positions. In the case of horizontally aligned cameras, the subindexes l and r denote left and right, respectively. For performance reasons, the images \(\varvec{I}_\text {l}\) and \(\varvec{I}_\text {r}\) are usually rectified. This means they do not present any distortion and their corresponding x- and y-axes are respectively parallel. Moreover, the x-axes are collinear. Under these conditions, a real-world point \(\varvec{P}\) captured by both cameras is projected on the rectified images at pixels with the same y-coordinate on collinear horizontal lines, called epipolar lines. The difference between the x-coordinates \(x_\text {l}\) and \(x_\text {r}\) of this projected point on the left and the right image gives the disparity value \(d_\text {persp}(x_\text {l},y) = x_\text {l}- x_\text {r}\). The \(x_\text {l}\)- and \(x_\text {r}\)-coordinates of corresponding points are extracted along the epipolar lines using stereo matching techniques, e.g., Block Matching [17]. Finally, a disparity map \(D_\text {persp}(x_\text {l},y)=x_\text {l}-x_\text {r}\) is generated, where each disparity value is inversely proportional to the searched depth value \(z(x_\text {l},y)\).
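For reference, with the focal length f given in pixels and the baseline b (the distance between the two camera centers), this inverse proportionality takes the well-known form \(z(x_\text {l},y) = f \cdot b \, / \, d_\text {persp}(x_\text {l},y)\).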

Nowadays, stereo vision networks for perspective images [4, 18,19,20] perform stereo matching on feature maps rather than on the original input images. These maps are computed from both rectified input images using a feature extractor, e.g., U-Net [21]. Then one of the feature maps is horizontally shifted and compared with the other camera’s feature map for each shift. This shift in x-direction (along the epipolar line) by a value \(d_\text {persp} \in [0, d_\text {max}]\) generates a pixel-wise cost volume of size \(H\times W\times D\), where H and W describe the height and width of the feature map and D the number of considered disparity values between 0 and \(d_\text {max}\). This way, networks like AnyNet [4] summarize the costs for all possible disparity assumptions. Finally, from this cost volume, a regression module retrieves the optimal disparity value for each pixel.
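To make the shift-based construction concrete, the following minimal sketch (PyTorch; the function name and the zero cost assigned to pixels without a matching partner are our own assumptions, not taken from the AnyNet implementation) builds such an \(L_1\) cost volume:

import torch

def perspective_cost_volume(feat_l, feat_r, d_max):
    # feat_l, feat_r: feature maps of shape (B, C, H, W) from the rectified images.
    # Returns an L1 cost volume of shape (B, D, H, W) with D = d_max + 1.
    B, C, H, W = feat_l.shape
    costs = []
    for d in range(d_max + 1):
        cost = torch.zeros(B, H, W, device=feat_l.device)
        if d == 0:
            cost = (feat_l - feat_r).abs().sum(dim=1)
        else:
            # compare F_l(x, y) with F_r(x - d, y); columns x < d have no partner
            cost[:, :, d:] = (feat_l[:, :, :, d:] - feat_r[:, :, :, :-d]).abs().sum(dim=1)
        costs.append(cost)
    return torch.stack(costs, dim=1)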

However, this method of disparity estimation by horizontally shifting feature maps along the epipolar lines is not valid for omnidirectional images. When omnidirectional cameras are used, the world point \(\varvec{P}\) is projected onto an image hemisphere rather than an image plane [1,2,3], as shown in Fig. 2a. As a result, corresponding points on omnidirectional images are located along a so-called epipolar curve instead of along a line, as in the case of perspective images (See Fig. 1). Therefore, the disparity can no longer be considered an offset in x-direction, since two corresponding pixels in the left and right images may also differ in the y-coordinate. In this case, the correspondence search and the subsequent disparity calculation have to be based on an epipolar geometry for canonical stereo configurations that follows an omnidirectional camera model. As described in the next subsection, our work describes omnidirectional cameras using the equiangular camera model.

3.1 Epipolar geometry for omnidirectional stereo vision

We first describe, in Sect. 3.1.1, the search space for stereo correspondences along the epipolar curves on the image hemispheres (See Fig. 2a). Then we link this search space to its corresponding search space on the omnidirectional images in Sect. 3.1.2. Finally, in Sect. 3.1.3, we propose OmniGlasses, a set of LUTs designed for searching stereo correspondences in omnidirectional image pairs.

3.1.1 Relationship of world points and their projection on the image hemisphere

Let \(\tilde{\varvec{P}}_\text {l}\) and \(\tilde{\varvec{P}}_\text {r}\) be the projection points of a world point \(\varvec{P}\) on the left and right image hemisphere of a canonical stereo setup, as shown in Fig. 2a. After bringing them into a joint coordinate system, both projection points, the camera centers \(\varvec{O}_\text {l}\) and \(\varvec{O}_\text {r}\) and the world point \(\varvec{P}\) itself lie on a so-called epipolar plane. As a consequence, given a reference point \(\tilde{\varvec{P}}_\text {l}\) on the left image hemisphere, the orientation of the epipolar plane determines the valid search space for \(\tilde{\varvec{P}}_\text {r}\) on the right image hemisphere. The orientation of the epipolar plane can be described through the pitch angle \(\alpha \) between the epipolar plane and the plane spanned by the camera’s x- and z-axes (See Fig. 2b):

$$\begin{aligned} \alpha = \arctan 2(\tilde{p}_y, \tilde{p}_z), \end{aligned}$$
(1)

where \(\tilde{p}_y\) and \(\tilde{p}_z\) are the components in y- and z-direction of a point \(\tilde{\varvec{P}}\in \{\tilde{\varvec{P}}_\text {l}, \tilde{\varvec{P}}_\text {r}\}\). Furthermore, the position of \(\tilde{\varvec{P}}_\text {l}\) and \(\tilde{\varvec{P}}_\text {r}\) is determined by their corresponding light rays. Each light ray can be described with the help of the unit vector \(\varvec{v}= \tilde{\varvec{P}}/ \Vert \tilde{\varvec{P}}\Vert \), where \(\varvec{v}\in \{\varvec{v}_\text {l},\varvec{v}_\text {r}\}\) (See Fig. 2c) points in the opposite direction to the given light ray. Finally, scaling \(\varvec{v}\) by the value of the focal length f results in the projection point \(\tilde{\varvec{P}}=f \cdot \varvec{v}\).

The angle \(\delta _\text {omni}\) between the light rays described by \(\varvec{v}_\text {l}\) and \(\varvec{v}_\text {r}\) is denoted as normalized disparity by Li et al. [22, 23]. This angle can also be defined in relation to the angles \(\beta _\text {l}\) and \(\beta _\text {r}\) between the vectors \(\varvec{v}_\text {l}\) and \(\varvec{v}_\text {r}\) and the unit vector \(-\varvec{e}_x\) pointing in the negative direction of the x-axis (See Fig. 2c):

$$\begin{aligned} \delta _\text {omni}= \beta _\text {l}- \beta _\text {r}\end{aligned}$$
(2)

Finally, by expressing the vector \(\varvec{v}\) in terms of the angles \(\beta \in \{\beta _\text {l}, \beta _\text {r}\}\) and \(\alpha \), a projection point \(\tilde{\varvec{P}}\) can be defined as follows:

$$\begin{aligned} \tilde{\varvec{P}}(\alpha ,\beta )&= f \cdot \varvec{R}_x(-\alpha ) \cdot \varvec{R}_y(\beta ) \cdot (-\varvec{e}_x) \nonumber \\&= f \cdot \begin{pmatrix} -\cos \beta \\ \sin \alpha \sin \beta \\ \cos \alpha \sin \beta \end{pmatrix} = f \cdot \varvec{v}, \end{aligned}$$
(3)

where \(\varvec{R}_x\) and \(\varvec{R}_y\) denote rotation matrices around the x- and y-axis, respectively. Equation 3 describes the rotation of \(-\varvec{e}_x\) around the camera’s y-axis (on the xz-plane) to account for the yaw angle \(\beta \). The pitch \(\alpha \) is then introduced by rotating around the x-axis by \(-\alpha \). The result is the unit vector \(\varvec{v}\), which is finally scaled by f to obtain the projection point \(\tilde{\varvec{P}}\).
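As a small numerical illustration of Eq. 3, consider the following sketch (Python/NumPy; the function name is ours):

import numpy as np

def projection_point(alpha, beta, f=1.0):
    # Eq. 3: rotate -e_x by the yaw beta around the y-axis and by -alpha around the
    # x-axis, then scale the resulting unit vector v by the focal length f.
    v = np.array([-np.cos(beta),
                  np.sin(alpha) * np.sin(beta),
                  np.cos(alpha) * np.sin(beta)])
    return f * v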

The goal of stereo matching is to find the projection point \(\tilde{\varvec{P}}_\text {r}\) in the right image hemisphere that corresponds to a given projection point \(\tilde{\varvec{P}}_\text {l}\) in the left one. As both points belong to the same epipolar plane, the value of \(\alpha \) can be calculated from \(\tilde{\varvec{P}}_\text {l}\) (cf. Eq. 1). According to Eq. 3, given \(\alpha \), the search space for \(\tilde{\varvec{P}}_\text {r}\) is defined by all possible values of \(\beta _\text {r}\) or, equivalently, by all possible values of \(\delta _\text {omni}\) (cf. Eq. 2). Iterating over all possible disparity values \(\hat{\delta }_\text {omni}\in \left[ 0, \hat{\delta }_\text {omni,max}\right] \) results in a set of angles \(\hat{\beta }_\text {r}(\hat{\delta }_\text {omni}) = \beta _\text {l}- \hat{\delta }_\text {omni}\) (Eq. 2), which together with \(\alpha \) define all possible correspondence points \(\hat{\varvec{P}}_\text {r}(\hat{\delta }_\text {omni})\) for \(\tilde{\varvec{P}}_\text {l}\) (Eq. 3).

The purpose of stereo matching is then to determine which of all candidates \(\hat{\varvec{P}}_\text {r} (\hat{\delta }_\text {omni})\) best represents the observed projection \(\tilde{\varvec{P}}_\text {r}\) and, with it, the best value of \(\hat{\delta }_\text {omni}\). This disparity value is then the last key to spatially reconstructing the scene, as described in the next section.

3.1.2 Relationship of world points and their projection on the omnidirectional image

A projection point \(\tilde{\varvec{P}}\in \{\tilde{\varvec{P}}_\text {l}, \tilde{\varvec{P}}_\text {r}\}\) on an image hemisphere corresponds to a point on the resulting omnidirectional image. In order to avoid the conversion between omnidirectional image and image hemisphere during runtime, the search of stereo correspondences is performed directly on the omnidirectional images according to the restrictions derived in Sect. 3.1.1. Cameras following the equiangular projection model project an incoming light ray onto the omnidirectional image depending on the elevation \(\theta \) and azimuth \(\phi \) angles of the corresponding vector \(\varvec{v}\) [24]. The elevation angle \(\theta \) describes the angle between \(\varvec{v}\) and the optical axis \(\varvec{e}_z\):

$$\begin{aligned} \theta = \arccos (v_\text {z}) = \arccos (\cos \alpha \sin \beta ) \end{aligned}$$
(4)

The azimuth angle \(\phi \) stands for the angle between the x-axis and the projection of \(\varvec{v}\) onto the xy-plane:

$$\begin{aligned} \phi&= \arctan 2(v_\text {y},v_\text {x}) \text { mod } 2\pi \nonumber \\&= \arctan 2(\sin \alpha \sin \beta , -\cos \beta ) \text { mod } 2\pi \end{aligned}$$
(5)

The modulo operator ensures that \(\phi \in [0, 2\pi [\) and avoids negative values. Under equiangular projection, \(r=\theta \), where r is the distance between the projected point and the image distortion center, as shown in Fig. 1. With the help of these polar coordinates, the light ray can be projected onto the normalized omnidirectional image at

$$\begin{aligned} \begin{pmatrix} x_\text {norm}\\ y_\text {norm}\end{pmatrix} = \theta \cdot \begin{pmatrix} \cos \phi \\ \sin \phi \end{pmatrix}= r \cdot \begin{pmatrix} \cos \phi \\ \sin \phi \end{pmatrix} \end{aligned}$$
(6)

and on the omnidirectional image itself at:

$$\begin{aligned} \begin{pmatrix} x \\ y \end{pmatrix}= f \cdot \begin{pmatrix} x_\text {norm}\\ y_\text {norm}\end{pmatrix} + \begin{pmatrix} c_\text {x} \\ c_\text {y} \end{pmatrix} \end{aligned}$$
(7)

Here we assume equal focal length f for x- and y-direction. The vector \(\left( c_\text {x}, c_\text {y}\right) ^T\) describes the coordinates of the image distortion center.
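The chain of Eqs. 4–7 can be summarized in a short sketch (Python/NumPy; f, \(c_\text {x}\) and \(c_\text {y}\) are assumed to be known calibration values, and the function name is illustrative):

import numpy as np

def ray_to_pixel(v, f, cx, cy):
    # v = (v_x, v_y, v_z): unit vector pointing opposite to the incoming light ray.
    theta = np.arccos(v[2])                               # Eq. 4: elevation
    phi = np.arctan2(v[1], v[0]) % (2.0 * np.pi)          # Eq. 5: azimuth in [0, 2*pi)
    x_norm = theta * np.cos(phi)                          # Eq. 6 with r = theta
    y_norm = theta * np.sin(phi)
    return f * x_norm + cx, f * y_norm + cy               # Eq. 7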

Now \(\theta \), \(\phi \) and finally \(\varvec{v}\) can be restored from the pixel locations in the normalized image itself. The elevation \(\theta \) is calculated as:

$$\begin{aligned} \theta = r= \left\Vert \begin{pmatrix} x_\text {norm}\\ y_\text {norm}\end{pmatrix} \right\Vert \end{aligned}$$
(8)

Note that an explicit conversion of radians to pixels and vice versa is not necessary in Eqs. 6 and 8, as both pixels and radians are dimensionless. The azimuth \(\phi \) can be retrieved from the normalized image as:

$$\begin{aligned} \phi = \arctan 2(y_\text {norm}, x_\text {norm}) \text { mod } 2\pi \end{aligned}$$
(9)

The relationship between image points on normalized image pairs and their parameters \(\phi \in \{\phi _\text {l}, \phi _\text {r}\}\) and \(r \in \{r_\text {l}, r_\text {r}\}\) is visualized in Fig. 1 for both the left and the right image.

Finally, the vector \(\varvec{v}\) that links the omnidirectional images with the search space on the image hemispheres can be restored:

$$\begin{aligned} \varvec{v}= \varvec{R}_z(\phi )\varvec{R}_y(\theta )\varvec{e}_z= \begin{pmatrix} \cos \phi \sin \theta \\ \sin \phi \sin \theta \\ \cos \theta \end{pmatrix} \end{aligned}$$
(10)
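Conversely, the back-projection of a pixel location to its ray direction via Eqs. 8–10 can be sketched as follows (again assuming known f, \(c_\text {x}\) and \(c_\text {y}\)):

import numpy as np

def pixel_to_ray(x, y, f, cx, cy):
    # Invert Eq. 7 to obtain normalized coordinates, then apply Eqs. 8-10.
    x_norm, y_norm = (x - cx) / f, (y - cy) / f
    theta = np.hypot(x_norm, y_norm)                      # Eq. 8: r = theta
    phi = np.arctan2(y_norm, x_norm) % (2.0 * np.pi)      # Eq. 9
    v = np.array([np.cos(phi) * np.sin(theta),
                  np.sin(phi) * np.sin(theta),
                  np.cos(theta)])                         # Eq. 10
    return theta, phi, v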

Note that the equations in this section are based on the equiangular (equidistant) camera model and may differ for real-world omnidirectional dioptric lenses. However, adapting these equations to other similar camera models is straightforward.

3.1.3 Searching strategy on omnidirectional images

By applying Eqs. 8 and 9, the polar coordinates \((r_\text {l}, \phi _\text {l})^T\) for each pixel in the left omnidirectional image \(\varvec{I}_\text {l}\) can be calculated. Then the vector \(\varvec{v}_\text {l}\) and the angles \(\alpha \) and \(\beta _\text {l}\) can be derived for each pixel in \(\varvec{I}_\text {l}\) by using Eqs. 10, 1 and 3.

As mentioned before, during the correspondence search, a set of plausible disparity values \(\hat{\delta }_\text {omni}\in \left[ 0, \hat{\delta }_\text {omni,max}\right] \) and the corresponding angles \(\hat{\beta }_\text {r}\) are assumed for each projection point \(\tilde{\varvec{P}}_\text {l}\). The omnidirectional disparity \(\hat{\delta }_\text {omni}\) of Li et al. [22, 23] describes an angle represented by a floating-point number. For the sake of implementation, \(\hat{\delta }_\text {omni}\) needs to be redefined as a discrete variable by sampling its valid range with step size \(S=\hat{\delta }_\text {omni,max}/D\). This results in \(\hat{\delta }_\text {omni} \in \{\hat{\delta }_0,\dots ,\hat{\delta }_{s},\dots ,\hat{\delta }_{D-1}\}\), with \(0\le s \le D-1\), where D is the number of considered disparity values.

Each value of \(\hat{\delta }_{s}\) results in an angle \(\hat{\beta }_\text {r}(\hat{\delta }_{s})\) (cf. Eq. 2). Given a pixel \((\theta _\text {l},\phi _\text {l})^{T}\) in the left image, its correspondence candidates \((\hat{\theta }_\text {r}, \hat{\phi }_\text {r})^T\) in the right image can then be found by substituting \(\beta \) with \(\hat{\beta }_\text {r}(\hat{\delta }_{s})\) in Eqs. 4 and 5. Finally, the coordinates of the correspondence candidates \(\left( \hat{x}_\text {r}, \hat{y}_\text {r}\right) ^T\) on the right image can be obtained by using Eqs. 6 and 7 for each disparity hypothesis \(\hat{\delta }_{s}\).

In order to implement a stereo matching process, the coordinates \(\left( \hat{x}_\text {r}, \hat{y}_\text {r}\right) ^T\) derived from each hypothetical disparity value \(\hat{\delta }_{s}\) are used to project the right image onto the left one. The resulting transformed image \(\hat{I}^{\hat{\delta }_{s}}_\text {r}(x_\text {l},y_\text {l})\) can be defined as follows:

$$\begin{aligned} \hat{I}^{\hat{\delta }_{s}}_\text {r}\left( x_\text {l},y_\text {l}\right) =I_\text {r}\left( \hat{x}_\text {r}, \hat{y}_\text {r}\right) \end{aligned}$$
(11)

This view transformation can be easily implemented as a backward projection with look up tables (LUTs). These LUTs store, for each pixel location \((x_\text {l}, y_\text {l})^T\) in the left image and each value of \(\hat{\delta }_{s}\), the resulting location of the correspondence point \(\left( \hat{x}_\text {r}, \hat{y}_\text {r}\right) ^T\) in the right image. We name the volume comprising the LUTs for all disparity hypotheses Full OmniGlasses. These LUTs have the size \(D \times H \times W \times 2\), where D is the number of hypothetical disparity values, \(H \times W\) is given by the image size and the 2 refers to the two coordinates \(\hat{x}_\text {r}\) and \(\hat{y}_\text {r}\). A sparse version of OmniGlasses will be introduced in Sect. 3.2.
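Putting Sects. 3.1.1 and 3.1.2 together, a Full OmniGlasses LUT could be precomputed as sketched below (Python/NumPy, reusing the hypothetical helpers pixel_to_ray, projection_point and ray_to_pixel from the previous sketches; a practical implementation would be vectorized rather than looping over pixels):

import numpy as np

def build_full_omniglasses(H, W, f, cx, cy, D, delta_max):
    # Returns a LUT of shape (D, H, W, 2) storing (x_r_hat, y_r_hat) for every left
    # pixel (x_l, y_l) and every disparity hypothesis index s.
    lut = np.zeros((D, H, W, 2), dtype=np.float32)
    step = delta_max / D                                     # sampling step size S
    for y_l in range(H):
        for x_l in range(W):
            _, _, v_l = pixel_to_ray(x_l, y_l, f, cx, cy)    # Eqs. 8-10
            alpha = np.arctan2(v_l[1], v_l[2])               # Eq. 1 (since P_l = f * v_l)
            beta_l = np.arccos(np.clip(-v_l[0], -1.0, 1.0))  # angle between v_l and -e_x
            for s in range(D):
                beta_r = beta_l - s * step                   # Eq. 2
                v_r = projection_point(alpha, beta_r, f=1.0) # Eq. 3 as unit ray
                lut[s, y_l, x_l] = ray_to_pixel(v_r, f, cx, cy)  # Eqs. 4-7
    return lut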

By applying these LUTs to the right image, D transformed images \(\hat{I}^{\hat{\delta }_{s}}_\text {r}(x_\text {l},y_\text {l})\) are obtained. The optimal value of \(\hat{\delta }_{s}\) for each coordinate \((x_\text {l},y_\text {l})\) in the left image is the one that maximizes the similarity between the intensities \(\hat{I}^{\hat{\delta }_{s}}_\text {r}(x_\text {l},y_\text {l})\) and \(I_\text {l}(x_\text {l}, y_\text {l})\). In order to determine this value, a measurement of the similarity between both images, the left one and the transformed right one, is performed. In our work, we use the \(L_1\) norm as similarity measure, which is the cost metric used by AnyNet, as explained in the next section. A cost volume \(\varvec{C}\) of size \(D \times H \times W\) stores all resulting \(C^{s}(x_\text {l}, y_\text {l})\) with:

$$\begin{aligned} C^s(x_\text {l}, y_\text {l}) = \left|I_\text {l}(x_\text {l}, y_\text {l}) - \hat{I}^{\hat{\delta }_{s}}_\text {r}(x_\text {l},y_\text {l}) \right|\end{aligned}$$
(12)
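A sketch of the view transformation (Eq. 11) and the cost computation (Eq. 12) using PyTorch's grid_sample follows; the LUT layout is the one assumed above, and for multi-channel inputs the \(L_1\) norm is taken over the channel dimension:

import torch
import torch.nn.functional as F

def omni_cost_volume(img_l, img_r, lut):
    # img_l, img_r: tensors of shape (1, C, H, W); lut: tensor of shape (D, H, W, 2)
    # holding pixel coordinates (x_r_hat, y_r_hat). Returns costs of shape (1, D, H, W).
    _, C, H, W = img_l.shape
    D = lut.shape[0]
    grid = lut.clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0     # grid_sample expects [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    costs = []
    for s in range(D):
        warped = F.grid_sample(img_r, grid[s:s + 1], align_corners=True)   # Eq. 11
        costs.append((img_l - warped).abs().sum(dim=1))                    # Eq. 12
    return torch.stack(costs, dim=1)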

The final disparity value can be determined with the help of the softargmin on the cost values for each pixel separately [19] and refined by a disparity refinement module [4]. The softargmin function gives the index of the optimal disparity value. Moreover, this function allows obtaining sub-index precision by weighting and integrating the cost volume results. This local oversampling results in a subindex \(s'\) between two given indexes \(s\le s'\le s+1\) and a final estimated disparity \(\hat{\delta }'=s'\cdot S\), with \(\hat{\delta }_{\lfloor s'\rfloor }\le \hat{\delta }'\le \hat{\delta }_{\lceil s'\rceil }\). Finally, following [22, 23], the Euclidean distance \(\hat{\rho }_\text {l}\) between the world point \(\varvec{P}\) and the left camera \(\varvec{O}_\text {l}\) is given by

$$\begin{aligned} \hat{\rho }_\text {l} = b \cdot \frac{\sin \hat{\beta }_\text {r}}{\sin \hat{\delta }'} = b \cdot \frac{\sin (\beta _\text {l}- \hat{\delta }') }{\sin \hat{\delta }'}\text {,} \end{aligned}$$
(13)

with b being the baseline of the stereo camera, i.e., the distance between \(\varvec{O}_\text {l}\) and \(\varvec{O}_\text {r}\).
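The disparity regression and the triangulation of Eq. 13 can be sketched as follows (PyTorch; cost is a volume as above, beta_l holds the per-pixel yaw angles of the left rays, step is the sampling step size S and baseline is b; the function name is illustrative):

import torch

def regress_distance(cost, beta_l, step, baseline):
    # cost: (1, D, H, W); beta_l: (H, W). Returns the estimated disparity (radians)
    # and the Euclidean distance to the left camera for every pixel.
    D = cost.shape[1]
    weights = torch.softmax(-cost, dim=1)                  # softargmin: low cost -> high weight
    indexes = torch.arange(D, device=cost.device, dtype=cost.dtype).view(1, D, 1, 1)
    s_prime = (weights * indexes).sum(dim=1).squeeze(0)    # sub-index s'
    delta = s_prime * step                                 # estimated disparity delta'
    rho = baseline * torch.sin(beta_l - delta) / torch.sin(delta.clamp(min=1e-6))  # Eq. 13
    return delta, rho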

3.2 Integration of OmniGlasses into AnyNet

AnyNet [4] is a network for disparity estimation with state-of-the-art results on perspective images. Unlike what is described in the previous sections, AnyNet does not perform stereo matching on the input images but rather on feature maps extracted from them. Designed to achieve a good computing time, AnyNet estimates the disparity in a hierarchical way. The network is organized into four stages, where each stage increases the resolution of the disparity map generated in the previous one. Stage 1 takes feature maps at 1/16 of the full image resolution. Stages 2 and 3 increase this resolution to 1/8 and 1/4 of the original resolution, respectively. Finally, the last stage estimates the full-resolution disparity map.

In the first stage, \(D=12\) values are considered for the disparity estimation. In stages 2 and 3, AnyNet takes the disparity estimation of the preceding stage (rounded to integers) as initial value and predicts a residual disparity instead of undertaking a full estimation. Here it is assumed that the disparity \(\hat{d}^i_\text {persp}\) estimated for a pixel \((x_\text {l},y)\) in stage \(i \in \{2, 3\}\) does not differ by more than two pixels from the previous prediction, i.e., \(\hat{d}^{i}_\text {persp} \in [2\hat{d}^{i-1}_\text {persp} -2, 2\hat{d}^{i-1}_\text {persp} +2]\). This means that the new stage needs to consider only five disparity values at most: the one received from the previous stage, two values higher and two lower. This incremental improvement per stage saves much computational time and enables the real-time capability of AnyNet. Throughout this work, i always refers to a stage of AnyNet or its adaptations. However, its range can be refined individually to make more specific statements.

AnyNet is designed for perspective images captured from a canonical camera setup. Therefore, it takes advantage of the parallelism and collinearity of the epipolar lines on the left and right rectified images. Hence, the disparity \(d_\text {persp}(x_\text {l},y) = x_\text {l} - x_\text {r}\) results from an offset only in x-direction. Therefore, the cost volume \(C^{\hat{d}_\text {persp}}(x_\text {l}, y_\text {l})\) can be generated by overlapping the right feature map on the left one and horizontally shifting it for each considered value \(\hat{d}^{i}_\text {persp}\). For this purpose, the \(L_1\) norm serves as cost measure to evaluate each assumed disparity value. The final disparities for stages 1–3 are then estimated as a weighted softargmin on the cost volume and refined through a network. Stage 4 has the task of refining the disparity map of stage 3 and upsampling it to the original resolution. For this, it uses an SPNet module [25] together with the left image as a guide.

As shown in [8], the displacement in x-direction cannot adequately model perspective disparity arranged in an omnidirectional manner. In this case, the subsequent disparity refinement module can only partially correct the disparity estimates. In this work, the original AnyNet process for generating the cost volume is replaced by the proposed OmniGlasses.

By considering the 2D displacement in the x- and y-directions, OmniGlasses can follow epipolar curves (as shown in Fig. 1) during the stereo matching process. Modelling this 2D displacement, which describes the distortion in omnidirectional images, is the main challenge for depth estimation. We will show in Sect. 5 that OmniGlasses successfully overcome this challenge and significantly reduce the error compared to applying stereo matching along epipolar lines.

In this way, an appropriate omnidirectional view synthesis, as described in Sects. 3.1.1, 3.1.2, 3.1.3, is incorporated into AnyNet. Each stage from 1 to 3 has its own parameters \(D_i\), \(S_i\) and therefore its own valid set of assumed disparities \(\hat{\delta }_\text {omni} \in \{\hat{\delta }_0,\dots ,\hat{\delta }_{s_i},\dots ,\hat{\delta }_{D_i-1}\}\), with \(0\le s_i \le D_i-1\) and \(i\in \{1,2,3\}\).

Analogous to the original AnyNet, \(D_1 = 12\) is selected for the first stage. The resulting disparity map is then refined in each successive stage by estimating a residual value for the upsampled version. In the same way as with perspective images, the residual calculation reduces the number of view synthesis transformations to the predefined number of residual values. We retain the original number of residual values by considering a disparity range between two disparity indexes lower and two higher than the one received from the previous stage. This results in five transformations for each feature vector given by \(\hat{\delta }^{i}_\text {omni}\in \{\hat{\delta }^{i}_0,\cdots ,\hat{\delta }^{i}_2,\cdots , \hat{\delta }^{i}_4\}\), with \(i\in \{2,3\}\), where \(\hat{\delta }^{i}_2\) is the estimated disparity from the previous stage and \(\hat{\delta }^{i}_{s_i}=\hat{\delta }^{i}_2+S_i(s_i-2)\).

As a consequence, we first generate three OmniGlasses of shape \(D_i \times H_i \times W_i \times 2\), referred to as Full OmniGlasses, for the first three stages before runtime and then reduce the shape of the LUT of stage \(i \in \{2, 3\}\) to \(5 \times H_i \times W_i \times 2\) during runtime. The reduced versions of OmniGlasses are hereinafter denoted as Sparse OmniGlasses. The values inside the Sparse OmniGlasses depend on the predictions of the previous stage for each feature vector independently, as shown in Fig. 3.

Analogously to AnyNet, the resulting disparity values from one stage are rounded to integers and upscaled in order to incorporate them into the next stage. The shapes of all OmniGlasses are documented in Table 1. We hereinafter refer to this version of AnyNet leveraging OmniGlasses as Omni-AnyNet.
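A possible implementation of this shape reduction (Fig. 3) is sketched below (PyTorch; prev_idx is the rounded and upsampled disparity index map of the previous stage; the function name is illustrative):

import torch

def sparse_omniglasses(full_lut, prev_idx):
    # full_lut: (D, H, W, 2) Full OmniGlasses of the current stage (float tensor);
    # prev_idx: (H, W) long tensor of disparity indexes. Returns a (5, H, W, 2) LUT.
    D, H, W, _ = full_lut.shape
    offsets = torch.arange(-2, 3, device=prev_idx.device).view(5, 1, 1)   # residual range +-2
    idx = (prev_idx.unsqueeze(0) + offsets).clamp(0, D - 1)               # (5, H, W)
    idx = idx.unsqueeze(-1).expand(5, H, W, 2)                            # one index per tuple entry
    return torch.gather(full_lut, 0, idx)                                 # five look-up tuples per pixel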

Fig. 3

Shape reduction of OmniGlasses. The disparity index maps of the stages 1 and 2 are upsampled to \(H_{i} \times W_{i}\), with \(i \in \{2,3\}\). Each disparity index is used to determine five residual disparity indexes \(s_{i}\). These indexes constitute the color-coded positions of five small cubes per pixel along the \(s_{i}\)-axis in a Full OmniGlasses LUT. Note that the center cube of each sequence in Full OmniGlasses has the same color as the disparity map value of the corresponding pixel. Each small cube denotes a look up tuple \((\hat{x}_\text {r}, \hat{y}_\text {r})\). In Sparse OmniGlasses, only these five cubes / look up tuples are stored for each pixel

Table 1 Shapes of OmniGlasses and disparity index mapping for stages 1–3 in Omni-AnyNet

4 Experiments

We propose three groups of experiments to demonstrate the effectiveness of our approach. First, we show qualitative results of OmniGlasses as a standalone module. This experiment aims to prove the correctness of the proposed LUTs in the generation of transformed omnidirectional images for different disparity values. Furthermore, we determine a set of measurements of different error metrics to show the accuracy of OmniGlasses as part of Omni-AnyNet. The second group of experiments presents an ablation study to compare the performance of AnyNet with and without the proposed adaptation. Moreover, this study shows the importance of choosing the right disparity metric. Finally, we compare Omni-AnyNet with the state-of-the-art network OmniMVS \(^+\) [7].

There are few datasets for omnidirectional stereo vision. Won et al. [6] published the datasets OmniThings and OmniHouse for training the inverse distance of a scene. These datasets present images from a system of four cameras with non-aligned viewing directions. In contrast, our system is based on a canonical stereo setup (aligned viewing directions). To the best of our knowledge, THEOStereo [8] is the only dataset with rendered samples for depth estimation with a canonical omnidirectional stereo setup. Therefore, all evaluated networks in this work are trained and tested with THEOStereo. THEOStereo comprises \(31,\!250\) stereo image pairs together with their ground truth depth maps (distance in z-direction). Images and depth maps were rendered using the Unity3D game engine. As Unity3D does not provide shaders for omnidirectional camera models, the authors of [8] used the handcrafted shaders of [26], which merge four perspective images or depth maps according to the fusion method of Bourke et al. [27, 28]. In addition to RGB images, the shaders generated relative depth values between zero and one, which were then scaled to the given absolute distance (z-direction). For the experiments on (Omni-)AnyNet and OmniMVS \(^+\), these depth maps are used as ground truth by first converting them to point clouds and then to Euclidean distance and disparity maps. Training, validation and testing subsets are partitioned in a ratio of \(80\%\)/\(10\%\)/\(10\%\). We downsampled THEOStereo’s images and ground truth to \(H \times W = 1024 \times 1024\) pixels.

For a proof of concept of the LUT as a standalone module (without CNN layers), we first built a Full OmniGlasses LUT of shape \(201 \times H \times W \times 2\), with \(H\times W\) given by the full image resolution. We reduced the shape to \(1 \times H \times W \times 2\) by using the ground truth disparity. The transformations for each pixel of the right image given by this optimal version of the LUT are thus based on the correct disparity value. The right image is then transformed with the help of this LUT and compared with the left image. A high agreement in this comparison indicates that the transformation proposed by Full OmniGlasses as described in Sect. 3 is correct.

The quantitative evaluation involves error measurements on both disparity and Euclidean distance. All the output maps (Li’s disparity, perspective disparity or inverse distance) of the considered approaches were converted into Euclidean distance maps to facilitate a comparison between them. The mean absolute error (MAE) is calculated by averaging the \(L_1\) norm of each error. For perspective images, the bad-e error (abbreviated by \(\Delta > e\)) describes the ratio of disparity errors greater than e pixels along the epipolar line [29]. This error metric is, however, not directly applicable to omnidirectional images. In this case, we define the bad-e error in relation to the disparity index. For omnidirectional images, \(\Delta > e\) describes the ratio of disparity errors \(\epsilon _i(x_\text {l},y_\text {l})=\frac{1}{S_i}\Vert \hat{\delta }'^i(x_\text {l},y_\text {l}) - \delta (x_\text {l},y_\text {l})\Vert \) that exceed e disparity indices, where \(\hat{\delta }'^i(x_\text {l},y_\text {l})\) is the final estimated disparity of stage i (cf. Sect. 3.1.3) and \(\delta (x_\text {l},y_\text {l})\) is the ground truth disparity. Here, we upsampled \(\hat{\delta }'^i(x_\text {l},y_\text {l})\) to match the resolution of \(\delta (x_\text {l},y_\text {l})\). The use of the disparity index (given by dividing the disparity values by the sampling step size \(S_i\)) in the error calculation allows evaluating the performance of the network regardless of the predefined sampling rate given by \(S_i\). The three pixel error (3PE) adds a new constraint to the bad-3 error, namely that the relative error between the estimated disparity and the ground truth must also exceed a certain threshold, in this case 5%. We use a 3PE analogous to Wang et al. [4], which is similar to that of the KITTI Stereo 2015 dataset [30, 31]. Our 3PE for omnidirectional images is defined as follows:

$$\begin{aligned} E_{\text {3PE}_i}=\frac{1}{N_i}\sum _{x_\text {l},y_\text {l}} f_i(x_\text {l}, y_\text {l})\text {,} \end{aligned}$$
(14)

with:

$$\begin{aligned} f_i(x_\text {l}, y_\text {l}) = {\left\{ \begin{array}{ll} 1, &{} \text {if } \epsilon _i(x_\text {l},y_\text {l})> 3 \text { and } \frac{S_i \cdot \epsilon _i(x_\text {l},y_\text {l})}{\delta (x_\text {l},y_\text {l})} > 0.05\\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(15)

For all considered metrics, 3PE, bad-e and MAE, only valid regions are considered, which reduces the pixel number from \(H_i \cdot W_i\) to \(N_i\). For the sake of comparability between approaches and stages of Omni-AnyNet, some pixels of the results are discarded in the evaluation of disparity maps. This is the case, for example, for those pixels whose final disparity exceeds the index \(H_i / H_1 \cdot (D_1-1)\). This allows ignoring those estimated disparity values that, because of the residual strategy of AnyNet, exceed the maximum disparity value \(\hat{\delta }_\text {omni,max}\) defined for the setup. Unlike the evaluation strategy followed by the original version of AnyNet, those pixels that do not belong to regions captured by both cameras (and therefore do not present valid values for stereo methods) are also discarded in the whole quantitative evaluation. This causes pixels to be ignored, mostly along the border of the FOV, but avoids including monocularly estimated values in the evaluation.
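As an illustration, the masked metrics could be computed as in the following sketch (PyTorch; delta_est and delta_gt are disparity maps in radians at the same resolution, step is \(S_i\), mask marks the valid pixels; the explicit form of Eq. 15 reflects our reading of the 3PE definition above):

import torch

def omni_metrics(delta_est, delta_gt, step, mask, e=3, rel=0.05):
    # eps: disparity error expressed in disparity indexes (the epsilon_i defined above)
    eps = (delta_est - delta_gt).abs() / step
    bad_e = (eps[mask] > e).float().mean()                          # bad-e ratio
    rel_err = (delta_est - delta_gt).abs() / delta_gt.clamp(min=1e-9)
    three_pe = ((eps > e) & (rel_err > rel))[mask].float().mean()   # Eqs. 14 and 15
    mae_index = eps[mask].mean()                                    # MAE of the disparity index
    return bad_e, three_pe, mae_index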

Fig. 4

Proof of concept: View synthesis on RGB images via OmniGlasses. The right image of a sample of THEOStereo is transformed using the proposed LUT and the disparity values of the ground truth. A superposition of the original left (a) and right image (b) shows the presence of disparity as a blurring effect (d). In contrast, by superposing the left image (a) and the transformed right image (c), the locations of the objects in both images coincide and the result looks sharper (e). This demonstrates that the coordinate transformations for the given disparities in the proposed LUTs are correct. As shown in (f), occlusion artifacts cannot be avoided. Here, part of the person’s shape appears twice. The images in this figure rely on a sample of the THEOStereo dataset [8]

Table 2 Full evaluation of Omni-AnyNet. Metrics relying on the Euclidean distance are given in arbitrary units of the THEOStereo dataset, where 1 AU \( \approx 50\) cm. Disparity values are given in radians

We integrate OmniGlasses into AnyNet as described in Sect. 3.2 and train Omni-AnyNet for 300 epochs on the training set of THEOStereo. As THEOStereo provides depth maps as ground truth, we convert them to disparity maps. Adam [32] with default parameters (\(\beta _1 = 0.9\) and \(\beta _2 = 0.999\)) is chosen as an optimizer. The training is carried out with a learning rate of \(1 \cdot 10^{-3}\) that decays to zero at the end of the training following the cosine annealing strategy [33] with a single decay period (no warm restarts). A smooth L1 loss with a threshold of 2.0 serves as training loss function for each stage. The final loss constitutes a weighted sum of the loss values of all stages with the same weights used for the original AnyNet implementation: 0.25 for the first, 0.5 for the second and 1 for the third and fourth stage.
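The described training configuration corresponds roughly to the following sketch (PyTorch; model, train_loader and resize_like are placeholders for the Omni-AnyNet model, the THEOStereo data loader and a ground-truth resizing helper, none of which are specified here):

import torch

def train_omni_anynet(model, train_loader, resize_like, epochs=300):
    # Adam with default betas, cosine annealing to zero, smooth L1 loss per stage and
    # the stage weights of the original AnyNet implementation.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0.0)
    criterion = torch.nn.SmoothL1Loss(beta=2.0)
    stage_weights = [0.25, 0.5, 1.0, 1.0]
    for _ in range(epochs):
        for img_l, img_r, disp_gt in train_loader:
            preds = model(img_l, img_r)              # one disparity map per stage
            loss = sum(w * criterion(p, resize_like(disp_gt, p))
                       for w, p in zip(stage_weights, preds))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()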

For the ablation study, we train AnyNet without OmniGlasses using the same hyperparameters described above. Here, we distinguish between AnyNet trained on omnidirectional disparities after [22, 23] and AnyNet trained on perspective disparities arranged in an omnidirectional manner analogous to [8]. Both experiments use the original architecture of AnyNet, but with different ground truth arrangements. We refer to the first model as AnyNet(Li) and the second one as AnyNet(Persp.). The number of evaluated pixels \(N_i\) may vary slightly for the different approaches: AnyNet(Persp.), AnyNet(Li) and Omni-AnyNet. For the sake of comparability, a joint mask is applied during the evaluation to calculate the error maps only on those regions that are valid for all three approaches.

Fig. 5

Comparison of Omni-AnyNet with AnyNet(Li), AnyNet(Persp.) and OmniMVS \(^+\) showing the absolute error maps of the Euclidean distance for a sample of THEOStereo [8] (a). Figures (b–d) show qualitative results of our ablation study. AnyNet(Li) produces better results on omnidirectional disparity values (c) than AnyNet(Persp.) (d). However, Omni-AnyNet (b) produces more promising results. On the other hand, the high throughput of Omni-AnyNet comes at the cost of accuracy when compared with OmniMVS \(^+\) (cf. (e) and (f))

OmniMVS \(^+\) was designed to reconstruct a 3D scene from four cameras with four different viewing directions. The number of cameras can hardly be changed in the official implementation without changing the architecture significantly. Therefore, we fed the stereo image pair from THEOStereo into OmniMVS \(^+\) and kept the images of the remaining two cameras black. As the training routine was not provided by the authors, we built up a standard training pipeline in PyTorch. We excluded the entropy loss as no such training code was provided by the authors. However, it turned out that OmniMVS \(^+\) still produces accurate results without this optimization, as demonstrated in the next section. We split the network into two parts after Layer conv4-11 of the Unary Feature Extractor (See Table 1 of [7]) and ran the network on two GPUs in sequence. The first part was executed on an NVIDIA GeForce GTX 1080, the remaining part was processed on an NVIDIA Quadro P6000. The high GPU-RAM utilization (29132 MiB / 32768 MiB) did not allow us to train the network on only one of our training GPUs nor to increase the batch size or the resolution of the inverse distance maps. Hence, we kept a batch size of one and the original resolution of \(160 \times 640\) for the distance maps. However, the same input images with resolution \(1024 \times 1024\) were fed into OmniMVS \(^+\). We used OmniMVS \(^+\) with interleaved spheres, which was essential to reduce the memory consumption and allowed training on the mentioned GPUs. Furthermore, we chose the default number of channels, i.e., 32. We used the weights obtained from the pretraining on OmniThings [7] to initialize the network. The minimum distance parameter of OmniMVS \(^+\) was adjusted to the minimal observable distance in THEOStereo, i.e., 0.78 AU. The inference and training throughputs obtained for OmniMVS \(^+\) are approx. \(0.8-2.0\) fps (See Table 5) and 0.3 fps, respectively. As a result, it was not feasible to train the network for 300 epochs as for Omni-AnyNet. Our learning rate schedule for OmniMVS \(^+\) therefore imitates the original schedule depending on the number of processed samples rather than the number of processed epochs. In [7], OmniMVS \(^+\) was trained for 20 epochs on OmniThings with a learning rate of \(3 \cdot 10^{-3}\) and for a further 10 epochs with a learning rate of \(3 \cdot 10^{-4}\). A training for 20 or 30 epochs on OmniThings processes roughly as many training samples as seven or 11 epochs on THEOStereo. Hence, we trained OmniMVS \(^+\) for seven epochs with the initial learning rate of \(3 \cdot 10^{-3}\) and for a further four epochs with the reduced learning rate of \(3 \cdot 10^{-4}\). In order to compare the results of Omni-AnyNet and OmniMVS \(^+\), the error maps should coincide in their projection model as well as in their resolution. With this objective, the error maps of Omni-AnyNet were converted to the ERP model with the same resolution as in OmniMVS \(^+\). Analogous to our ablation study, we masked out regions in the error maps of Omni-AnyNet and OmniMVS \(^+\) that are not valid in both approaches. Due to the high memory footprint, it was not feasible to train both networks, Omni-AnyNet and OmniMVS \(^+\), under equal conditions. Hence, the juxtaposition of both approaches can only be seen as a coarse comparison.

Table 3 Comparison of Omni-AnyNet and AnyNet. Absolute metrics are given in arbitrary units of THEOStereo (1 AU \( \approx 50\) cm)
Table 4 Comparison of Omni-AnyNet and OmniMVS \(^+\). Absolute metrics are given in arbitrary units of THEOStereo (1 AU \( \approx 50\) cm)

5 Results

Figure 4 depicts our proof of concept. In contrast to a simple superposition of the left and right image (See Fig. 4d), the superposition of the left and the transformed right image (See Fig. 4e) appears significantly sharper. This indicates that the left and the transformed right image largely coincide, which shows that the view synthesis was successful. However, some occlusion artifacts are visible in the transformed right image, which cannot be diminished by OmniGlasses as a standalone module without CNN layers. Figure 4f zooms in on one of these artifacts, where a part of the person’s shape appears a second time on the left side of the person. For this particular image area, the left camera captures a part of the floor and wall shelves. However, this part of the background is occluded by the person for the right camera. This occluding texture is then copied from the right image to the transformed version instead of the floor or wall shelf texture, where mainly the floor texture would have been expected. As a consequence, the person partially appears a second time.

Table 5 Throughput measurements on THEOStereo during inference (batch size 1). All measurements are given in frames per second

Table 2 shows the results of Omni-AnyNet on the testing partition of THEOStereo. The bad-e error is remarkably low. The MAE for the disparity index as well as the MAE for the Euclidean distance give very satisfying results that lie within the tolerance ranges of many applications. It can be seen that the estimated disparity index does not differ by more than one step from the ground truth on average. An MAE for the Euclidean distance of 0.25 AU represents an error of around 12.5 cm in THEOStereo.

Table 3 summarizes our ablation study. As the disparity metrics differ between the compared algorithms, only the absolute and the relative MAE of the Euclidean distance have been chosen for comparison. The errors are averaged over the dataset. As aforementioned, in each output sample we mask out regions that are not valid for all three approaches. It can be seen that the incorporation of Li’s disparity into AnyNet (AnyNet(Li)) considerably increases the performance in comparison with AnyNet using perspective disparity values arranged in an omnidirectional geometry (AnyNet(Persp.)). Moreover, Omni-AnyNet, which replaces the original AnyNet’s look up tables with the proposed OmniGlasses, significantly reduces the absolute MAE by around 0.08 AU to 0.25 AU. In order to visualize these results, Fig. 5b–d present maps of the absolute errors of the Euclidean distance, where the brighter the color, the higher the error. It can be seen that Omni-AnyNet produces an error map with only a few bright spots indicating high errors (See Fig. 5b). Moreover, these bright spots are located near edges or fine objects like the wheeled walker, which might be explained by (self-)occlusion artifacts. AnyNet(Li) has a lower performance than Omni-AnyNet, which is especially visible around the standing human’s shape in the image. Finally, AnyNet(Persp.) results in larger high-error spots.

To complete our comparison, we present the results of OmniMVS \(^+\). As this network uses the ERP model to present its results, Figs. 5e and 5f show the outputs of Omni-AnyNet and OmniMVS \(^+\) under ERP, respectively, in order to facilitate the comparison. OmniMVS \(^+\) produces more accurate Euclidean distance maps than Omni-AnyNet (See Table 4), however, at the price of much more computational time (See Table 5). The pixelated borders of the estimation of Omni-AnyNet in Figs. 5b and 5e stem from intentionally deactivating monocular disparity estimation, as mentioned in Sect. 4.

The throughput measurements for all discussed networks are documented in Table 5. Omni-AnyNet exhibits high frame rates of up to 48.4 fps. In contrast, OmniMVS \(^+\), with at most 2 fps, is an order of magnitude slower than Omni-AnyNet.

Experiments on the NVIDIA GTX 1080 as well as the NVIDIA Quadro P6000 were conducted on a deep learning machine with an Intel® Core i7-6900K CPU @ 3.20GHz (8 cores, 16 threads). The experiments on the NVIDIA TITAN X were performed on a second deep learning workstation with an Intel® Core i9-9960X CPU @ 3.10GHz (16 cores, 32 threads). Both machines have 128 GiB of RAM. It should be noted that the original version of AnyNet [4] achieved 10 fps on images with a resolution of \(1242 \times 375\) on an NVIDIA Jetson TX2. This, together with the moderate RAM, GPU RAM and CPU utilization of Omni-AnyNet, indicates that it is also suitable for real-time inference in embedded systems while delivering high-quality results.

6 Conclusion and future work

In this work, we derive a search space for stereo correspondences in omnidirectional image pairs captured by canonical stereo cameras. Moreover, a corresponding search strategy, similar to Meuleman et al. [16], is proposed. These derivations result in a set of LUTs named OmniGlasses, which can be easily combined with machine learning methods, like neural networks. In contrast to [7] and [16], OmniGlasses search for the disparity instead of a distance or inverse distance. Our approach is therefore, to some extent, reminiscent of classical stereo vision, retrieving disparity values rather than estimating properties of the scene (the distance of 3D points to the camera) directly. We concentrated on a canonical camera system. This system maximizes the area of the scene that is visible to both cameras and is therefore optimal for a binocular stereo setup. We integrated OmniGlasses into AnyNet and proved the efficiency of our approach. We call the resulting network Omni-AnyNet. It achieves remarkable reconstruction results with a low MAE of around 13 cm (Euclidean distance) at up to 48.4 fps and outperforms OmniMVS \(^+\), a state-of-the-art CNN for depth reconstruction from omnidirectional images, in terms of speed. As a consequence, OmniGlasses successfully diminish the gap between reconstruction accuracy and high throughput rates. As a large number of networks for perspective stereo vision like AnyNet exist, we believe that OmniGlasses can open up many opportunities to develop fast networks for omnidirectional vision. We derived OmniGlasses for the equiangular projection model. In the future, we plan to extend OmniGlasses to other projection models that are more suitable for real-world lenses by refining the constraints of the epipolar geometry, in particular Eq. 8.