Abstract
One can simulate low-precision floating-point arithmetic via software by executing each arithmetic operation in hardware and then rounding the result to the desired number of significant bits. For IEEE-compliant formats, rounding requires only standard mathematical library functions, but handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly in finite arithmetic. Moreover, the ensuing implementations are not necessarily efficient, as the library functions these techniques build upon are typically designed to handle a broad range of cases and may not be optimized for the specific needs of rounding algorithms. CPFloat is a C library for simulating low-precision arithmetics. It offers efficient routines for rounding, performing mathematical computations, and querying properties of the simulated low-precision format. The software exploits the bit-level floating-point representation of the format in which the numbers are stored and replaces costly library calls with low-level bit manipulations and integer arithmetic. In numerical experiments, the new techniques bring a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To our knowledge, CPFloat is currently the most efficient and complete library for experimenting with custom low-precision floating-point arithmetic.
1 A PLETHORA OF FLOATING-POINT FORMATS AND ROUNDING MODES
The 2019 revision of the IEEE 754 standard for floating-point arithmetic [32] specifies three basic binary formats for computation: binary32, binary64, and binary128. Most 64-bit CPUs equipped with a floating-point unit natively support binary32 and binary64 arithmetic, and 32-bit CPUs can simulate binary64 arithmetic very efficiently by relying on highly optimized software libraries. The binary128 format, introduced in the 2008 revision [31] of the original IEEE 754 standard [30], has not gained much popularity among hardware manufacturers, and over 10 years after having been standardized, it is supported only by two IBM processor families: the POWER9, which implements version 3.0 of the Power Instruction Set Architecture [36], and the z/Architecture [39].
In fact, the low-precision requirements of artificial intelligence applications have steered the hardware market in the opposite direction, and a number of fewer-than-32-bit formats have been proposed in recent years. The first widely available 16-bit floating-point format was binary16. Despite having been defined in the last two revisions of the IEEE 754 standard only as an interchange format, binary16 has been supported as an arithmetic format by all NVIDIA microarchitectures since Pascal [45] and all AMD architectures since Vega [51]. Google has recently introduced the bfloat16 data type [34], a 16-bit format with approximately the same dynamic range as binary32.
The latest Armv8 CPUs support a wide variety of floating-point formats, including binary32, binary64, bfloat16 [1, Section A1.4.5], binary16, and an alternative custom half-precision format [1, Section A1.4.2]. The latter is a 16-bit format based on binary16 that extends the dynamic range from \([-65{,}504,\: +65{,}504]\) to \([-131{,}008, \: +131{,}008]\) by reclaiming the \(2{,}048\) bit patterns (about 3%) used for infinities and NaNs (Not a Number).
The latest NVIDIA graphics card microarchitectures offer new low-precision formats, aimed at machine learning, that break with the 16-bit tradition. The A100 GPUs, which are based on the Ampere microarchitecture [46], support a 19-bit floating-point format, called TensorFloat-32, which combines the precision of binary16 with the exponent range of binary32. The latest H100 GPUs, based on the Hopper microarchitecture [47], provide two 8-bit formats: E4M3, with 4 bits of precision and dynamic range \([-240, 240]\), and E5M2, which trades one bit of precision for the much larger dynamic range \([-57{,}344, 57{,}344]\).
In Section 4, we discuss the general framework that underlies all these floating-point formats. This framework is a straightforward extension of the IEEE 754 standard [30, 31, 32] and is well established in the literature [27, 44].
Such a broad range of floating-point formats poses a major challenge to those trying to develop mixed-precision algorithms for scientific computing, because studying the numerical behavior of an algorithm in different working precisions may require access to a number of high-end hardware platforms. To alleviate this issue, two families of techniques for simulating low-precision floating-point arithmetic via software have been proposed over the years.
In principle, one can use (signed) integers to represent the exponent and significand of the low-precision numbers. All mathematical functions can then be evaluated by explicitly operating on the integers that make up these representations. This solution allows for maximum flexibility, and it is typically used to simulate arbitrary-precision arithmetics, where exponent and significand can have arbitrarily many digits.
If only low-precision arithmetic is of interest, however, then relying on the floating-point arithmetics available in hardware will typically lead to more efficient software simulators. In fact, one can perform each arithmetic operation using a floating-point format available in hardware, check that the result is representable in the low-precision format of interest, and then round the significand to the desired number of significant bits. In this context, simulating low-precision floating-point arithmetic only requires the ability to round a high-precision number to lower precision, and users can leverage existing hardware and mathematical libraries. This method is more convenient: It leads to software that is easier to maintain, since less code is necessary, and is typically more efficient, as most of the computation is performed using thoroughly optimized high-precision mathematical routines that are already available. Here, we follow this second approach, which is the one most often encountered in the literature, as the survey of existing software for simulating low-precision floating-point arithmetic in Section 2 illustrates.
Our contribution is two-fold. First, we discuss how the operations underlying the rounding of a floating-point number x to lower precision can be performed directly on the binary floating-point representation of x. We present the new algorithms in Section 5, and in Section 6, we explain how to implement them efficiently using bitwise operations and integer arithmetic. Second, we introduce CPFloat, a C library that implements these algorithms.
The floating-point representation in the simulated (target) format exists only implicitly: In practice, all numbers are represented in a (storage) format that is natively supported by the compiler. The storage formats currently available in CPFloat are binary32 and binary64.
We provide a MEX interface that allows users to call CPFloat from MATLAB and Octave.
The library can be used to prototype and test mixed-precision algorithms, as well as to simulate custom-precision hardware.
2 RELATED WORK
A number of packages to simulate low-precision floating-point arithmetic via software are available. These are summarized in Table 1, where we report the main programming language in which the software is written and detail what storage formats, target formats, and rounding modes are supported.
The first three columns report the name of the package, the main programming language in which the software is written, and what storage formats are supported. The following three columns describe the parameters of the target formats: whether the number of bits of precision in the significand is arbitrary (A) or limited to the number of bits in the significand of the storage format (R); whether the exponent range can be arbitrary (A), must be a—possibly restricted—sub-range of the exponent range of the storage format (S), or can be a sub-range only for built-in types (B); whether the target format supports subnormal numbers (Y), does not support them (N), supports them only for built-in types (B), supports them by default but allows the user to switch the functionality off (F), or does not support them by default but allows the user to turn the functionality on (O). The following column lists the floating-point formats that are built into the system, if any. The last seven columns indicate what rounding modes the software supports: round-to-nearest with ties-to-even (RNE), ties-to-zero (RNZ), or ties-to-away (RNA), round-toward-zero (RZ), round-to-\(+\infty\) and round-to-\(-\infty\) (RUD), round-to-odd (RO), and the two variants of stochastic rounding discussed in Section 4 (SR). Fully supported rounding modes are denoted by \(\checkmark\), while ! is used for rounding modes that can be simulated at a higher computational cost. The abbreviations bf16, tf32, fp16, fp32, and fp64 denote the formats bfloat16, TensorFloat-32, binary16, binary32, and binary64, respectively.
The most comprehensive software package for simulating arbitrary-precision floating-point arithmetic is the GNU MPFR library [19], which extends the formats in the IEEE 754 standard to any precision and to an arbitrarily large exponent range. The library is written in C, but interfaces for most programming languages are available. GNU MPFR is a de facto standard for working with arbitrary precision and is typically used to perform computations that require high accuracy, rather than to simulate low-precision arithmetics.
Sipe is a header-only C mini-library designed to simulate low precision efficiently. This software supports round-to-nearest with ties-to-even and round-to-zero, and numbers can be stored either as a pair of signed integers representing the significand and the exponent of the floating-point value [37] or, in the latest version of the library [38], as a value in a native floating-point format.
Dawson and Düben [11] recently developed a Fortran library, called rpe, for simulating reduced-precision floating-point arithmetics in large numerical simulations. In rpe, the reduced-precision floating-point values are stored as binary64 numbers, a solution that provides a very efficient technique for simulating floating-point formats with the same exponent range as binary64 and no more than 53 bits of precision.
FloatX [16] is a header-only C++ library for simulating low-precision arithmetic that supports binary32 and binary64 as storage formats. This software is more flexible than rpe, in that it allows the user to choose not only the number of significant digits, but also the number of bits used to represent the exponent in the target format.
The only rounding mode currently implemented by both rpe and FloatX is round-to-nearest with ties-to-even, which may be too restrictive when one wants to simulate hardware units where truncation or stochastic rounding is available.
Higham and Pranesh [28] proposed chop, a MATLAB function that rounds binary32 or binary64 arrays to a lower-precision target format and supports six rounding modes, including two variants of stochastic rounding.
FLOATP_Toolbox [41] is a MATLAB toolbox for simulating reduced-precision fixed-point and floating-point arithmetics. This library uses binary64 as storage format and supports the same six rounding modes as chop.
The QPyTorch library is a low-precision arithmetic simulator, written in PyTorch, whose primary goal is to facilitate the training of neural networks in low precision without the overhead of building low-precision hardware [60]. The core design principles of this library are analogous to those of chop.
The packages SERP [56] and FASE [49] utilize dynamic binary translation to seamlessly change the precision of an executable without requiring any alteration of the source code. A downside of this approach is that the user cannot control which rounding mode a given operation will use, and the target format appears to be a global setting that cannot be changed on a per-operation basis—this significantly limits the simulation environment and notably does not allow for the simulation of mixed-precision operations, which are becoming more and more common in hardware devices.
The algorithms presented here complement existing software by proposing efficient techniques for implementing rounding using low-level bitwise instructions. Our library is intended as a software package that enables the use of these rounding functionalities in low-level languages such as C and C++, but high-level languages that allow the user to call C routines, either directly or indirectly, can also benefit from it.
3 FEATURES AND EXAMPLES
At the core of CPFloat are the rounding routines, which convert arrays of numbers held in the storage format to the target format.
The target numerical format is specified by a C structure of type optstruct, which can be allocated with the function init_optstruct, as done on line 8. This function allocates the memory underlying the fields of the optstruct.
In Listing 1, we do not showcase all the functionalities of CPFloat.
The parameters in an optstruct variable determine the target format and the rounding mode to be used. The convention of appending an “f” to the names of the functions that operate on binary32 rather than binary64 data follows the C mathematical library.
The function free_optstruct should be used as shown on line 49 to free the memory correctly.
The functions that make up the library can be divided into three groups: functions for rounding, functions for computing in low-precision arithmetic, and functions for probing the low-precision floating-point format. We discuss these assuming arrays of binary64 values—for binary32 arrays it suffices to change all occurrences of double to float and append an “f” to the names of the functions.
The first group comprises the rounding functions, which convert the entries of an array in the storage format to the target format.
The second group contains functions for simulating the four elementary arithmetic operations and many mathematical functions in low precision. The computation is first performed in the storage format, using the corresponding C operators or C mathematical library (libm) functions, and the result is then rounded to the target format.
The third and last group comprises functions that work on the implicit target-format representation. These can be used to extract the exponent and fraction of the number rounded to low precision, to compute the number that in the target format is closest to a given value in the storage format, and to check whether a value in the storage format would become subnormal, normal, infinite, or NaN once rounded to the target format.
We list all functions available in CPFloat for double arrays. The names of the corresponding functions for float are obtained by appending the suffix “f”: for example, the function cpfloat_mulf can be used to multiply two float arrays elementwise. All mathematical functions return as output numbers in the target format.
The example in Listing 1 is available in the CPFloat repository.
The example in Listing 1 can be compiled with any C compiler, linking against CPFloat and the C mathematical library, which produces the binary for the example.
4 FLOATING-POINT STORAGE FORMATS AND ROUNDING
A family of binary floating-point numbers \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\) is a finite set that includes a subset of the real line and a handful of values with a special meaning. In our notation, the three integer parameters p, \({e_{\smash{\min }}}\), and \({e_{\smash{\max }}}\) represent the number of binary digits of precision, the minimum exponent, and the maximum exponent, respectively, and the Boolean flag \({\mathfrak {s}_{\mathrm{n}}}\) indicates whether subnormals are supported. A real number \(x := (s,m,e)\) in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\) can be written as (4.1) \(\begin{equation} x = (-1)^s\cdot m \cdot 2^{e-p+1}, \end{equation}\) where s is the sign bit, set to 0 if x is positive and to 1 otherwise, the integral significand m is a natural number not greater than \(2^p-1\), and the exponent e is an integer between \({e_{\smash{\min }}}\) and \({e_{\smash{\max }}}\) inclusive.
To ensure a unique representation for all numbers in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle \setminus \lbrace 0\rbrace\), it is customary to normalize the system by assuming that if \(\vert x \vert \ge 2^{{e_{\smash{\min }}}}\), then \(2^{p-1} \le m \le 2^p-1\), that is, the number is represented using the smallest possible exponent and the largest possible significand. In such systems, the number \((s,m,e) \in \mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle \setminus \lbrace 0\rbrace\) is normal if \(m \ge 2^{p-1}\), and subnormal otherwise. The exponent of subnormals is always \({e_{\smash{\min }}}\), and in a normalized system any number \(x = (s,m,e) \ne 0\) has a unique p-digit binary expansion \((-1)^s \cdot \widetilde{m} \cdot 2^{e}\), where (4.2) \(\begin{equation} \widetilde{m} = m \cdot 2^{1-p} = d_0 + \frac{d_1}{2} + \dots + \frac{d_{p-1}}{2^{p-1}} = d_0.d_1\dots d_{p-1}, \end{equation}\) for some \(d_0, d_1, \dots , d_{p-1} \in \lbrace 0,1\rbrace\), is called the normal significand of x. One can verify that if x is normal, then \(d_0 = 1\) and \({1 \le \widetilde{m} \lt 2}\), whereas if x is subnormal, then \(d_0 = 0\) and \({0 \lt \widetilde{m} \lt 1}\). Conventionally, the signed zero values \(+0\) and \(-0\) are considered neither normal nor subnormal. In a normalized system, the smallest positive subnormal is \({x_{\mathrm{minsub}}}:= 2^{{e_{\smash{\min }}}-p+1}\), whereas the smallest and largest positive normal numbers are \({x_{\min }}:= 2^{{e_{\smash{\min }}}}\) and \({x_{\max \vphantom{\min }}}:= 2^{{e_{\smash{\max }}}}(2-2^{1-p})\), respectively. Their negative counterparts can be obtained by observing that a floating-point number system is symmetric with respect to 0.
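As a quick sanity check of these definitions, the following C fragment computes \({x_{\mathrm{minsub}}}\), \({x_{\min }}\), and \({x_{\max \vphantom{\min }}}\) from the format parameters. The function names are ours; ldexp(x, n) computes \(x \cdot 2^n\) exactly for the values used here.

```c
#include <math.h>

/* Extreme values of F<p, emin, emax, sn>, following (4.1) and the
   normalization convention above. A sketch for illustration only. */
double f_minsub(int p, int emin) { return ldexp(1.0, emin - p + 1); }
double f_min(int emin)           { return ldexp(1.0, emin); }
double f_max(int p, int emax)    { return ldexp(2.0 - ldexp(1.0, 1 - p), emax); }
```

For binary16 (\(p = 11\), \({e_{\smash{\min }}}= -14\), \({e_{\smash{\max }}}= 15\)) these yield \(2^{-24}\), \(2^{-14}\), and 65,504, matching the values quoted in Section 1.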
In our notation, subnormals are kept if \({\mathfrak {s}_{\mathrm{n}}}= \mathbf {true}\) and rounded to either the closest smallest floating-point number or the zero of appropriate sign if \({\mathfrak {s}_{\mathrm{n}}}= \mathbf {false}\).
The floating-point numbers in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\) can be represented efficiently as binary strings. The sign is stored in the leftmost bit of the representation, and the following \(b_e\) bits are used to store the exponent. Under the IEEE 754 assumption that \({e_{\smash{\min }}}=1-{e_{\smash{\max }}}\), the most efficient way of representing the exponent is obtained by setting \({e_{\smash{\max }}}= 2^{b_e-1}-1\) and using a representation biased by \({e_{\smash{\max }}}\), so \(00\cdots 01_2\) represents the smallest allowed exponent and \(11\cdots 10_2\) represents the largest. The trailing \(p-1\) bits are used to store the significand of x. It is not necessary to store the value of \(d_0\) explicitly, as the IEEE standard uses an encoding often referred to as “implicit bit”: \(d_0\) is assumed to be 1 unless the binary encoding of the exponent is the all-zero string, in which case \(d_0 = 0\) and x represents either a signed 0, if \(m = 0\), or a subnormal number, otherwise. If the exponent field contains the reserved all-one string, then the number represents \(+\infty\) or \(-\infty\) if the fraction field is set to 0, and a NaN otherwise. Infinities are needed to express values whose magnitude exceeds that of the largest positive and negative numbers that can be represented in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\), whereas NaNs represent the result of invalid operations, such as taking the square root of a negative number, dividing a zero by a zero, or multiplying an infinity by a zero. These are needed to ensure that the semantics of all floating-point operations is well specified and that the resulting floating-point number system is closed.
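The encoding just described can be inspected directly in C. The sketch below (helper names ours) extracts the three fields of a binary64 value, for which \(b_e = 11\) and the bias is 1,023; memcpy is the portable way to view the bits of a double as an integer.

```c
#include <stdint.h>
#include <string.h>

/* View the bits of a binary64 value as a 64-bit unsigned integer. */
uint64_t bits_of(double x) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

int sign_field(double x)      { return (int)(bits_of(x) >> 63); }          /* 1 bit  */
int exp_field(double x)       { return (int)((bits_of(x) >> 52) & 0x7FF); }/* 11 bits, biased */
uint64_t frac_field(double x) { return bits_of(x) & 0xFFFFFFFFFFFFFULL; }  /* 52 bits */
```

For example, 1.0 has biased exponent 1,023 and an all-zero fraction, while 0.0 has an all-zero exponent field, signalling that the implicit bit \(d_0\) is 0.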
A rounding is an operator that maps a real number x to one of the floating-point numbers closest to x in a given floating-point family. The original IEEE 754 standard [30, Section 4] defines four rounding modes for binary formats: the default round-to-nearest with ties-to-even, and three directed rounding modes, round-toward-\(+\infty\), round-toward-\(-\infty\), and round-toward-zero. The 2008 revision of the standard [31] introduces a fifth rounding mode, round-to-nearest with ties-to-away, but states that it is not necessary for an implementation of a binary format to be IEEE compliant. Finally, the latest revision of the standard [32] recommends the use of round-to-nearest with ties-to-zero for augmented operations [32, Section 9.5]. Here, we also consider two less common rounding strategies, round-to-odd [3] and stochastic rounding [9].
With round-to-odd, the number is truncated and the least significant bit of the result is set to 1 unless the infinitely precise number is exactly representable in the target format. This rounding has applications in several domains. In computer arithmetic, for instance, it can be used to emulate a correctly rounded fused multiply–add (FMA) operator when a hardware FMA unit is not available [3]. In general, round-to-odd helps avoid issues associated with double rounding and is used for this purpose in GCC, for example. The fact that it can be implemented at low cost in hardware has recently prompted interest from the machine learning community [7].
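A bit-level sketch of round-to-odd for binary64 values makes the low hardware cost evident: truncation is a mask, and the inexactness test is a comparison with zero. The function name is ours, and the sketch assumes x is finite and normal with \(1 \le t \le 53\).

```c
#include <stdint.h>
#include <string.h>

/* Round the binary64 value x to t significant bits with round-to-odd:
   drop the trailing 53 - t fraction bits and, if any of them was
   nonzero, force the least significant kept bit to 1. */
double round_to_odd(double x, int t) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    uint64_t mask = (1ULL << (53 - t)) - 1;   /* discarded fraction bits */
    if (u & mask) {                           /* value is inexact in t bits */
        u &= ~mask;                           /* truncate toward zero */
        u |= 1ULL << (53 - t);                /* set last kept bit to 1 */
    }
    memcpy(&x, &u, sizeof x);
    return x;
}
```

With t = 3, for instance, 1.0625 rounds to 1.25 (binary 1.01), whereas the exactly representable 1.25 is returned unchanged.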
Unlike all rounding modes discussed so far, stochastic rounding is non-deterministic, in the sense that the direction in which to round is chosen randomly and repeating the rounding may yield different results. The simplest choice is to round an infinitely precise number that is not representable in the target format to either of the two closest representable floating-point numbers with equal probability. A more interesting flavor is obtained by rounding a non-representable number x to either of the closest floating-point numbers with probability proportional to the distance between x and the two rounding candidates. This rounding mode dates back to the ’50s [17, 18] and has recently gained prominence owing to the surge of interest in low-precision floating-point formats. It is particularly effective at alleviating swamping in long sums [8] and ordinary differential equation solvers [15, 29] and at counteracting the loss of accuracy observed when the precision used to train neural networks is reduced [25, 58]. Stochastic rounding is not widely available in general-purpose hardware but has started to appear in some specialized processors, such as the Intel Loihi neuromorphic chips [10] and the Graphcore IPU, an accelerator for machine learning [22]. We refer the reader to the recent survey by Croci et al. [9] for more details on this rounding mode.
5 EFFICIENT ROUNDING TO A LOWER-PRECISION FORMAT
We simulate low-precision arithmetic by first performing each operation in some higher precision available in hardware and then rounding the result to the desired target format. In this section, we discuss how to exploit the binary representation in the storage format to develop efficient algorithms for rounding \(x \in \mathcal {F}^{ (h) }\) to \(\widetilde{x} \in \mathcal {F}^{ (\ell) }\), where
Whether this technique can be used to simulate low precision accurately depends on the floating-point parameters of the two formats under consideration. First, we need to ensure that any number in \(\mathcal {F}^{ (\ell) }\) is representable in \(\mathcal {F}^{ (h) }\). In most cases, a necessary and sufficient condition for \(\mathcal {F}^{ (\ell) }\subset \mathcal {F}^{ (h) }\) is that \({{p}}\le {{p}}^{(h)}\) and \({e_{\smash{\max }}}^{(\ell)}\le {e_{\smash{\max }}}^{(h)}\). When \({\mathfrak {s}_{\mathrm{n}}}^{(\ell)}= \mathbf {true}\) but \({\mathfrak {s}_{\mathrm{n}}}^{(h)}= \mathbf {false}\), however, this is not sufficient, because a number that is subnormal in \(\mathcal {F}^{ (\ell) }\) is not necessarily normal in \(\mathcal {F}^{ (h) }\). In this case, we need to require that \({e_{\smash{\min }}}^{(\ell)}\ge {e_{\smash{\min }}}^{(h)}+ p - 1\) or equivalently that \({e_{\smash{\max }}}^{(\ell)}\le {e_{\smash{\max }}}^{(h)}- p + 1\).
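The inclusion conditions above can be collected into a small predicate. The function name is ours, and both formats are assumed to follow the IEEE convention \({e_{\smash{\min }}}= 1-{e_{\smash{\max }}}\), so each format is described by its precision, maximum exponent, and subnormal flag.

```c
#include <stdbool.h>

/* Return true when every number in the target format F(l), described by
   (p, emax, sn), is representable in the storage format F(h), described
   by (ph, emaxh, snh). */
bool target_fits_in_storage(int p, int emax, bool sn,
                            int ph, int emaxh, bool snh) {
    if (p > ph || emax > emaxh)
        return false;
    /* If F(l) has subnormals but F(h) does not, the smallest subnormals
       of F(l) must still be normal in F(h): emax <= emaxh - p + 1. */
    if (sn && !snh && emax > emaxh - p + 1)
        return false;
    return true;
}
```

For example, binary16 (\(p = 11\), \({e_{\smash{\max }}}= 15\), subnormals on) fits comfortably inside binary64 storage, but it would not fit inside a hypothetical 53-bit storage format without subnormals whose exponent range were only marginally wider.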
But this may not be enough when it comes to simulating arithmetic operations. Let \(y \in \mathbb {R}\) be the exact result of an arithmetic operation \(f(x_{1}, \ldots , x_{N})\), where \(x_{1}\), ..., \(x_{N} \in \mathcal {F}^{ (h) }\), and let \(\mathrm{fl}_{h}: \mathbb {R}\rightarrow \mathcal {F}^{ (h) }\) and \(\mathrm{fl}_{\ell }: \mathbb {R}\rightarrow \mathcal {F}^{ (\ell) }\) be the functions that round to the high- and low-precision formats, respectively. A correctly rounded floating-point implementation of f will return \(\mathrm{fl}_{h}(y)\), and a low-precision simulation should return \(\mathrm{fl}_{\ell }(y)\).
Is \(\mathrm{fl}_{\ell }(y) = \mathrm{fl}_{\ell }(\mathrm{fl}_{h}(y))\) for any value of \(x_{1}\), ..., \(x_{N}\)? This is always true for directed rounding, but when round-to-nearest is considered, one may have that \(\mathrm{fl}_{\ell }(y) \ne \mathrm{fl}_{\ell }(\mathrm{fl}_{h}(y))\) if the intermediate format \(\mathcal {F}^{ (h) }\) does not have enough extra bits of precision.
If f is one of the operations for which the IEEE 754 standard mandates correct rounding [32, Section 5.4.1], then it is known that \({{p}}\le {{p}}^{(h)}/2-1\) guarantees that double rounding is innocuous for the four elementary arithmetic operations (\(+\), \(-\), \(\times\), and \(\div\)) and for the square root [53, 55].
If f is a transcendental function, then we cannot guarantee that harmful double rounding will not occur. This is not a major concern, however, as the IEEE 754 standard [32, Section 9.2] only recommends that these operations be correctly rounded, and most general-purpose mathematical libraries do not provide correctly rounded implementations. This is the case, for example, for the GNU C library [40, Section 19.7], for the OpenCL C++ 1.0 Specification [24, Section 7.4] and OpenCL C 2.0 Specification [23, Section 4.4], which the Intel oneAPI math kernel library uses to determine the accuracy of mathematical functions, and for the Radeon open compute math library. We refer the interested reader to recent work by Innocente and Zimmermann [33] for an experimental investigation of the issue.
We now list the high-level functions we need to operate on floating-point numbers. In the description, x denotes a floating-point number in \(\mathcal {F}^{ (h) }\), n denotes a positive integer, and i is an integer index between 0 and \({{p}}^{(h)}-1\) inclusive. Unless otherwise stated, these functions are not defined for infinities and NaNs.
\(\rm {\rm\small ABS}(x)\) returns the absolute value of x. For infinities, we use the definition \(\rm {\rm\small ABS}(\pm \infty) = +\infty\).
\(\rm {\rm\small DIGIT}(x,i)\) returns the ith digit of the significand of x from the left. The indexing starts from 0, so \(\rm {\rm\small DIGIT}(x,i)\) is \(d_i\) in (4.2).
\(\rm {\rm\small EXPONENT}(x)\) returns the exponent of x. This is the signed integer e in (4.1) if x is normal, and an integer \(\gamma \lt {e_{\smash{\min }}}\) otherwise.
\(\rm {\rm\small SIGNIFICAND}(x)\) returns the integral significand of x, that is, the positive integer m in Equation (4.1).
\(\rm {\rm\small RAND}(n)\) returns a string of n randomly generated bits.
\(\rm {\rm\small SIGN}(x)\) returns \(-1\) if the floating-point number x is negative and \(+1\) otherwise. This function behaves as expected for (signed) zeros and infinities, i.e., \(\rm {\rm\small SIGN}(\pm 0) = \pm 1\) and \(\rm {\rm\small SIGN}(\pm \infty) = \pm 1\).
\(\rm {\rm\small TAIL}(x,i)\) returns the trailing \({{p}}^{(h)}-i\) bits of the significand of x as an unsigned integer. For infinities, this function returns 0 by convention, thus \(\rm {\rm\small TAIL}(\pm \infty ,i) = 0\) for any i.
\(\rm {\rm\small TRUNC}(x,i)\) returns the number x with the last \({{p}}^{(h)}-i\) bits of the significand set to zero. Truncating an infinity leaves it unchanged, thus \(\rm {\rm\small TRUNC}(\pm \infty , i)= \pm \infty\) for any i.
\(\rm {\rm\small ULP}(x,i)\) returns the number \(2^{\rm {\rm\small EXPONENT}(x)-i+1}\), that is, the gap between x and its successor in a floating-point number system with i bits of precision. As noted by Muller [43], this function corresponds to the unit in the last place as defined by Overton [50, p. 14] and Goldberg [20].
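For concreteness, here is one possible bit-level realization of \(\rm {\rm\small TAIL}\), \(\rm {\rm\small TRUNC}\), and \(\rm {\rm\small ULP}\) for binary64 storage (\({{p}}^{(h)}=53\)), restricted to finite normal x with \(1 \le t \le 53\). The function names are ours; this is a sketch of the idea, not the implementation actually used in the library.

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* TAIL(x, t): the trailing 53 - t bits of the significand, as an integer. */
uint64_t tail(double x, int t) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u & ((1ULL << (53 - t)) - 1);
}

/* TRUNC(x, t): x with the trailing 53 - t significand bits cleared. */
double trunc_to(double x, int t) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    u &= ~((1ULL << (53 - t)) - 1);
    memcpy(&x, &u, sizeof x);
    return x;
}

/* ULP(x, t) = 2^(EXPONENT(x) - t + 1), read off the biased exponent. */
double ulp_at(double x, int t) {
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    int e = (int)((u >> 52) & 0x7FF) - 1023;  /* unbiased exponent */
    return ldexp(1.0, e - t + 1);
}
```

For x = 1.4375 (binary 1.0111) and t = 3, truncation gives 1.25, the tail holds the two discarded 1 bits, and the unit in the last place is 0.25.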
How to implement these efficiently will be discussed in detail in Section 6.
Our rounding strategy is summarized in Algorithm 5.1. The function CPFloat computes the representation of the floating-point number \(x \in \mathcal {F}^{ (h) }{}\) in a lower-precision format \(\mathcal {F}^{ (\ell) }\). In the pseudocode, \(\mathbb {N}^+\) denotes the set of positive integers. The input parameter \(\rm {\rm\small ROUNDFUN}\) is a pointer to one of the functions in Algorithms 5.2, 5.4, or 5.5. A call to \(\rm {\rm\small ROUNDFUN}(x,t,\zeta)\) returns the floating-point number \(\widetilde{x} \in \mathcal {F}^{ (h) }\), that is, x rounded to \(\mathcal {F}^{ (\ell) }\). The function starts by setting the underflow threshold \(\zeta\), which corresponds to the smallest subnormal number if \({\mathfrak {s}_{\mathrm{n}}}^{(\ell)}= \mathbf {true}\) and to the smallest normal number if \({\mathfrak {s}_{\mathrm{n}}}^{(\ell)}= \mathbf {false}\). This value will be used by \(\rm {\rm\small ROUNDFUN}\) to round numbers that are too small to be represented. Then the algorithm computes the number t of significant digits in the significand of the binary representation of \(\widetilde{x}\). If x falls within the normal range of \(\mathcal {F}^{ (\ell) }\), then its significand has \({{p}}\) significant digits and the algorithm sets \(t= {{p}}\). If, however, \(\vert x \vert\) is between \({x_{\mathrm{minsub}}}^{(\ell)}\) and \({x_{\min }}^{(\ell)}\), then the exponent of x is smaller than \({e_{\smash{\min }}}^{(\ell)}\) and the number t of significant binary digits may have to be reduced. If \({e_{\smash{\min }}}^{(\ell)}= {e_{\smash{\min }}}^{(h)}\), then x is subnormal and has the same number of leading zeros in both storage and target formats. Otherwise, the value of t is given by the difference between \({{p}}\) and the number of leading zeros in the representation of \(\widetilde{x}\), including the zero to the left of the binary point.
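The computation of t in the subnormal range reduces to a small integer formula, sketched below for the case \({e_{\smash{\min }}}^{(\ell)}\ne {e_{\smash{\min }}}^{(h)}\); the function name is ours. Each unit by which the exponent e of x falls below \({e_{\smash{\min }}}^{(\ell)}\) adds one leading zero to the target-format significand and so removes one significant digit.

```c
/* Number of significant digits kept when rounding a value with unbiased
   exponent e to a target format with precision p and minimum exponent
   emin_l. A sketch of the precision computation in Algorithm 5.1. */
int significant_digits(int e, int p, int emin_l) {
    if (e >= emin_l)
        return p;                /* normal range: full precision */
    return p - (emin_l - e);     /* one digit lost per leading zero */
}
```

For binary16 as target (\({{p}}= 11\), \({e_{\smash{\min }}}^{(\ell)}= -14\)), a value with exponent \(-15\) keeps 10 digits, and the smallest subnormal, with exponent \(-24\), keeps a single digit.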
In the coming sections, we discuss how the function \({\scriptstyle\text{ROUND}}\text{F}\scriptstyle\text{UN}\) can be implemented for the six rounding strategies we consider.
5.1 Round-to-nearest
Our algorithm for rounding a floating-point number in \(\mathcal {F}^{ (h) }\) to the closest floating-point number in the lower-precision format \(\mathcal {F}^{ (\ell) }\) is given in Algorithm 5.2. The pseudocode describes, in particular, a variant of round-to-nearest known as ties-to-even, whereby a number exactly in-between two floating-point numbers is rounded to the rounding candidate with even significand, that is, a number that has a 0 in the \(t-1\) position to the right of the binary point. Two other variants, ties-to-zero and ties-to-away, will be briefly examined at the end of this section.
Initially, the function checks if the number to be rounded is too small to be represented in \(\mathcal {F}^{ (\ell) }\). If subnormals are supported, then the smallest representable number \({x_{\mathrm{minsub}}}^{(\ell)}\) has an odd fraction, and a tie value \(x \in \mathcal {F}^{ (h) }\) such that \(\vert x \vert = \zeta /2\) is rounded towards zero. If subnormals are not supported, then x is equidistant from two numbers with even significands, since the significand of the smallest normal number \({x_{\min }}^{(\ell)}\) in \(\mathcal {F}^{ (\ell) }\) is even. We still round x to zero, as this behavior is consistent with that of the GNU MPFR library.
Next, a number \(x \in \mathcal {F}^{ (h) }\) that is too large to be rounded to zero but has absolute value below the threshold \(\zeta\) is rounded to \(\operatorname{sign}(x) \cdot \zeta\). If \(\vert x \vert\) is larger than the threshold, then the algorithm checks whether x is within the dynamic range of \(\mathcal {F}^{ (\ell) }\): Following the IEEE 754 standard [32, Section 4.3.1], x will overflow to infinity, with its sign preserved, if \(\vert x \vert \ge 2^{{e_{\smash{\max }}}^{(\ell)}}(2-2^{-{{p}}})\).
If x is within the range of numbers representable in \(\mathcal {F}^{ (\ell) }\), then the algorithm truncates \(x \in \mathcal {F}^{ (h) }\) to t significant digits to obtain \(\widetilde{x} \in \mathcal {F}^{ (\ell) }\), the largest number (in absolute value) that satisfies \(\operatorname{sign}(\widetilde{x}) = \operatorname{sign}(x)\) and \(\vert \widetilde{x} \vert \le \vert x \vert\). In general, \(\widetilde{x}\) is one of the two floating-point numbers in \(\mathcal {F}^{ (\ell) }\) closest to x, the other candidate being \(\widetilde{x} + \operatorname{sign}(x) \cdot \rm {\rm\small ULP}(\widetilde{x}, t)\). To choose a rounding direction, we examine the value of the discarded bits. The unsigned integer \(d:=\rm {\rm\small TAIL}(x,t)\) represents the trailing \({{p}}^{(h)}-t\) digits of the significand of x. Thus, \(0 \le d \le 2^{{{p}}^{(h)}-t}-1\), and if \(d\lt \gamma := 2^{{{p}}^{(h)}-t-1}\), then \(\widetilde{x}\) is the result to be returned, whereas if \(d \gt \gamma\), then it is necessary to add (or subtract, if \(\widetilde{x}\) is negative) \(\rm {\rm\small ULP}(\widetilde{x}, t)\) to obtain the correctly rounded value. If \(d = \gamma\), then we have a tie and we need to round to the nearest even number. Therefore, we add \(\operatorname{sign}(x) \cdot \rm {\rm\small ULP}(\widetilde{x}, t)\) if the bit in position \(t-1\) of the significand of \(\widetilde{x}\) is a 1, and we leave \(\widetilde{x}\) unchanged otherwise.
The latest revision of the IEEE 754 standard mentions two other tie-breaking rules for round-to-nearest: ties-to-zero, to be used for the recommended augmented operations, and ties-to-away, which is required for decimal formats. These can be implemented by changing the conditions of the if statements on lines 3 and 11 in Algorithm 5.2 to
Note that this implementation preserves the sign of zero, maps infinities to infinities, and does not change the encoding of quiet and signaling \({\text{N}\scriptstyle\text{A}}\text{N}\) values. The same observation is true for the rounding functions in Algorithms 5.3, 5.4, and 5.5.
5.2 Round-to-odd
The function \({{\scriptstyle\text{ROUND}}\text{T}\scriptstyle\text{O}}{\text{O}\scriptstyle\text{DD}}\) in Algorithm 5.3 implements round-to-odd according to the definition given by Boldo and Melquiond [3, Section 3.1], as this rounding mode is not part of the IEEE standard. Informally speaking, if x is exactly representable in \(\mathcal {F}^{ (\ell) }\), then the function returns it unchanged, otherwise it returns the number closest to x with an odd significand, that is, a significand with a trailing 1. In the spirit of the algorithms discussed so far, one could obtain \(\widetilde{x}\) by truncating \(\vert x \vert\) to the first t significant digits, checking the parity of the significand of \(\widetilde{x}\), and adding \(\rm {\rm\small ULP}(\widetilde{x}, t)\) to the result if \(\widetilde{x}\) is even. However, in this case, we know that the result of the truncation requires a correction only if the least significant digit of \(\widetilde{x}\) is a 0, in which case adding \(\rm {\rm\small ULP}(\widetilde{x}, t)\) amounts to setting that bit to 1. Therefore, we check the bits obliterated by the truncation, and if \(\rm {\rm\small TAIL}(x,t) \ne 0\), then we conclude that x is not exactly representable in \(\mathcal {F}^{ (\ell) }\) and that the significand of the rounded \(\widetilde{x}\) must be odd. We can ensure this by setting the digit in position \(t-1\) of the significand of \(\widetilde{x}\) to 1. In practice, this operation will have an effect only if that digit is not already set to 1, and in particular will have no effect when \(t= 1\), as the leading digit of the significand is not stored explicitly. The core idea of this algorithm is the same as that of the second of the two methods for round-to-odd discussed by Boldo and Melquiond [3, Section 3.4].
The algorithm must round x to the closest number in \(\mathcal {F}^{ (\ell) }\) with odd significand if it falls within the underflow or the overflow range. Since 0 is even, rounding to 0 is not an option when this rounding mode is used, and numbers too small to be represented must be rounded to the smallest floating-point number with an odd significand of corresponding sign. When subnormals are supported, the significand of the smallest representable number \({x_{\mathrm{minsub}}}^{(\ell)}\) is odd, and numbers smaller than \({x_{\mathrm{minsub}}}^{(\ell)}\) in absolute value are simply rounded to \(\operatorname{sign}(x) \cdot {x_{\mathrm{minsub}}}^{(\ell)}\). If subnormals are not supported, however, then the significand of the smallest representable number \({x_{\min }}^{(\ell)}\) is even, and we are faced with a choice. We could round any x such that \(\vert x \vert \lt {x_{\min }}^{(\ell)}\) to \(\operatorname{sign}(x) \cdot ({x_{\min }}^{(\ell)} + \rm {\rm\small ULP}({x_{\min }}^{(\ell)}, t))\), which is the closest number with odd significand. This choice, however, feels unnatural: The operator thus defined would not round to either of the floating-point numbers closest to x. In our pseudocode, we prefer to round x to the rounding candidate largest in magnitude, that is, \(\operatorname{sign}(x) \cdot {x_{\min }}^{(\ell)}\).
The definition given by Boldo and Melquiond cannot be applied directly to values in the overflow range, as in principle the significand of \(\pm \infty\) is neither odd nor even. Since \(-{x_{\max \vphantom{\min }}}^{(\ell)}\) and \({x_{\max \vphantom{\min }}}^{(\ell)}\) are necessarily odd, we prefer to round values outside the finite range of \(\mathcal {F}^{ (\ell) }\) to the closest finite number. Being exactly representable, infinities themselves represent an exception to this rule, and the algorithm leaves them unchanged.
5.3 Directed Rounding
The functions in Algorithm 5.4 show how to implement the three directed rounding modes prescribed by the IEEE 754 standard. The idea underlying the three functions is similar to that discussed for the function \({{\scriptstyle\text{ROUND}}\text{T}\scriptstyle\text{O}}{\text{N}\scriptstyle\text{EAREST}}\) in Algorithm 5.2, the main differences being (1) the use of the sign, which is relevant when the rounding direction is not symmetric with respect to 0, and (2) the conditions under which a unit in the last place has to be added.
We start by discussing the function \({{\scriptstyle\text{ROUND}}\text{T}\scriptstyle\text{OWARD}}{\text{P}\scriptstyle\text{LUS}}{\text{I}\scriptstyle\text{NFINITY}}\). First, we check whether x is within the range of numbers that are representable in \(\mathcal {F}^{ (\ell) }\). Numbers that are too small to be represented are rounded up to 0 if negative and to \(\zeta\) if positive. Finite positive numbers larger than the largest representable number \({x_{\max \vphantom{\min }}}^{(\ell)}\) overflow to \(+\infty\), whereas negative numbers smaller than the smallest representable number \(-{x_{\max \vphantom{\min }}}^{(\ell)}\) are rounded up to \(-{x_{\max \vphantom{\min }}}^{(\ell)}\), with the only exception of \(-\infty\), which is handled below using the fact that \(\rm {\rm\small TRUNC}(-\infty ,t) = -\infty\) and \(\rm {\rm\small TAIL}(-\infty ,t) = 0\).
Next, the function computes \(\widetilde{x}\), that is, the number x with significand truncated to t significant digits, and checks whether \(\widetilde{x}\) is smaller than the smallest number representable in \(\mathcal {F}^{ (\ell) }\). The rounding can be easily performed by noting that the truncation \(\widetilde{x}\) is the correct result if x is negative or exactly representable in \(\mathcal {F}^{ (\ell) }\). Otherwise, \(\widetilde{x}\) is incremented by \(\rm {\rm\small ULP}(\widetilde{x}, t)\).
The function \({{\scriptstyle\text{ROUND}}\text{T}\scriptstyle\text{OWARD}}{\text{M}\scriptstyle\text{INUS}}{\text{I}\scriptstyle\text{NFINITY}}\) is identical, modulo some sign adjustments to take the opposite rounding direction into account. The algorithm starts by checking that x is within the range of numbers representable in \(\mathcal {F}^{ (\ell) }\). Numbers between \(-\zeta\) and 0 are rounded down to \(-\zeta\), whereas those between 0 and \(\zeta\) underflow and are flushed to 0. Numbers that are smaller than the smallest number representable in \(\mathcal {F}^{ (\ell) }\) overflow to \(-\infty\), whereas finite numbers greater than the largest number representable in \(\mathcal {F}^{ (\ell) }\) are rounded to \({x_{\max \vphantom{\min }}}^{(\ell)}\), the only exception in this case being \(+\infty\). To round a number that falls within the range of \(\mathcal {F}^{ (\ell) }\), we first compute \(\widetilde{x}\) by truncating the significand of x to t significant digits and then subtract \(\rm {\rm\small ULP}(\widetilde{x}, t)\) from \(\widetilde{x}\) if x is negative and not exactly representable with a t-digit significand.
The function \({{\scriptstyle\text{ROUND}}\text{T}\scriptstyle\text{OWARD}}{\text{Z}\scriptstyle\text{ERO}}\) is simpler than the other two considered in this section, as truncation is sufficient to correctly round the significand of x to t significant digits. Underflow and overflow are also easier to handle: Finite numbers that are smaller than the smallest representable number in absolute value are flushed to 0, whereas numbers too large to be represented are rounded to the closest representable finite number.
5.4 Stochastic Rounding
The functions in Algorithm 5.5 describe how to implement the two variants of stochastic rounding we are concerned with.
The function \({{\scriptstyle\text{ROUND}}\text{T}\scriptstyle\text{OWARD}}{\text{S}\scriptstyle\text{TOCHASTIC}}\) implements the strategy that rounds \(x \in \mathcal {F}^{ (h) }\) to one of the two closest floating-point numbers with probability proportional to the distance. First, the algorithm considers numbers in the underflow range, whose rounding candidates are 0 and the threshold value \(\zeta\), which equals \({x_{\mathrm{minsub}}}\) if subnormals are supported and \({x_{\min }}\) if they are not. The distance between \(\vert x \vert\) and 0 depends not only on the significand but also on the magnitude of x, thus the algorithm starts by computing the two values \(e_x\) and \(m_x\), which represent the exponent and the integral significand of x, respectively. Being the exponent of a floating-point number in \(\mathcal {F}^{ (h) }\), \(e_x\) can be much smaller than \({e_{\smash{\min }}}^{(\ell)}\), in which case it may be necessary to rescale \(m_x\) to align its exponent to \({e_{\smash{\min }}}^{(\ell)}\). This is achieved by multiplying \(m_x\) by \(2^{e_x + 1 - {e_{\smash{\min }}}^{(\ell)}}\). In the pseudocode, we take the floor of the result to keep it integer, although this is not strictly necessary: We prefer to work with integer arithmetic here so the integers generated by the random number generator can be used without any further post-processing. This is desirable not only from a performance point of view, but also because drawing floating-point numbers from the uniform distribution over an interval is not a trivial task, even when a good integer pseudo-random number generator is available [21]. Finally, the algorithm generates a \({{p}}^{(h)}\)-digit random integer \(\gamma\), which is used as a threshold to choose the rounding direction: If the discarded bits, interpreted as an unsigned integer, are larger than \(\gamma\), then x is rounded away from zero, otherwise it is rounded towards zero.
The procedure for numbers in the representable range is easier. In this case, it suffices to compute \(\widetilde{x}\), the value of x truncated to t significant bits, and then generate a random integer r between 0 and \(2^{{{p}}^{(h)}- t}-1\). Since \(\rm {\rm\small TAIL}(x,t)\) represents the distance between x and \(\widetilde{x}\), we increment the absolute value of \(\widetilde{x}\) by \(\rm {\rm\small ULP}(\widetilde{x}, t)\) if \(\rm {\rm\small TAIL}(x,t) \gt r\) and leave it unchanged otherwise. For overflow, we use the threshold value that the IEEE 754 standard recommends for round-to-nearest and round numbers whose absolute value after rounding is larger than the threshold \(2^{{e_{\smash{\max }}}^{(\ell)}}(2-2^{-{{p}}})\) to infinity, leaving the sign unchanged.
The function \({{\scriptstyle\text{ROUND}}\text{S}\scriptstyle\text{TOCHASTIC}}{\text{E}\scriptstyle\text{QUAL}}\) deals with the simpler strategy that rounds x up or down with equal probability. Depending on the interval in which x falls, the function selects the two closest representable numbers in \(\mathcal {F}^{ (\ell) }\) and calls the function \({{\scriptstyle\text{RAND}}\text{S}\scriptstyle\text{ELECT}}\) to select one of them with equal probability. In the pseudocode, we use a single bit generated randomly to discriminate between the two rounding directions.
6 EFFICIENT IMPLEMENTATION FOR IEEE-LIKE REPRESENTATION FORMATS
The subroutines used in Section 5 can be implemented efficiently if we assume that the numbers are represented using the floating-point format described in Section 4. First, we need to define the semantics of the operators for bit manipulation that we will rely on. These are available in most programming languages, although the notation varies greatly from language to language. For clarity, we use a prefix notation for all the operators.
Let \(\mathbf {\textsf {a}}\) and \(\mathbf {\textsf {b}}\) be strings of n bits. The bits are indexed from left to right, so \(\mathbf {\textsf {a}}_{0}\) and \(\mathbf {\textsf {a}}_{n-1}\) denote the leftmost and the rightmost bit of \(\mathbf {\textsf {a}}\), respectively. For \(i\in \mathbb {N}\), we define the following operators:
Conjunction: \(\mathbf {\textsf {c}}=\rm {\rm\small AND}(\mathbf {\textsf {a}},\mathbf {\textsf {b}})\) is an n-bit string such that \(\mathbf {\textsf {c}}_k = 1\) if \(\mathbf {\textsf {a}}_k\) and \(\mathbf {\textsf {b}}_k\) are both set to 1, and \(\mathbf {\textsf {c}}_k = 0\) otherwise.
Disjunction: \(\mathbf {\textsf {c}}=\rm {\rm\small OR}(\mathbf {\textsf {a}},\mathbf {\textsf {b}})\) is an n-bit string such that \(\mathbf {\textsf {c}}_k=1\) if at least one of \(\mathbf {\textsf {a}}_k\) and \(\mathbf {\textsf {b}}_k\) is set to 1, and \(\mathbf {\textsf {c}}_k=0\) otherwise.
Negation: \(\mathbf {\textsf {c}}=\rm {\rm\small NOT}(\mathbf {\textsf {a}})\) is an n-bit string such that \(\mathbf {\textsf {c}}_k = 1\) if \(\mathbf {\textsf {a}}_k=0\), and \(\mathbf {\textsf {c}}_k=0\) otherwise.
Logical shift left: \(\mathbf {\textsf {c}}=\rm {\rm\small LSL}(\mathbf {\textsf {a}},i)\) is an n-bit string such that \(\mathbf {\textsf {c}}_k=0\) if \(k \gt (n-1) - i\), and \(\mathbf {\textsf {c}}_k = \mathbf {\textsf {a}}_{k+i}\) otherwise.
Logical shift right: \(\mathbf {\textsf {c}}=\rm {\rm\small LSR}(\mathbf {\textsf {a}},i)\) is an n-bit string such that \(\mathbf {\textsf {c}}_k=0\) if \(k \lt i\), and \(\mathbf {\textsf {c}}_k=\mathbf {\textsf {a}}_{k-i}\) otherwise.
Most of the operations used in Section 5 require extracting a certain subset of the bits in the binary representation of the floating-point number \(x \in \mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\) while zeroing out the remaining ones. This can be achieved by taking the bitwise conjunction between the binary string that represents x and a bitmask, that is, a string as long as the binary representation of x that has ones in the positions corresponding to the bits to be extracted and zeros everywhere else. More generally, the functions in Section 5 can be implemented using the operators above as follows. In the descriptions, \(\mathbf {\textsf {x}}\) denotes the binary floating-point representation of x, n denotes a positive integer, and i denotes an integer index between 0 and \(p-1\).
\(\rm {\rm\small ABS}(x)\) can be implemented as \(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {abs}})\), where \(\mathbf {\textsf {m}}_{\textrm {abs}}\) consists of a single leading 0 followed by ones.
\(\rm {\rm\small DIGIT}(x,i)\) can be implemented as \(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {digit}})\), where \(\mathbf {\textsf {m}}_{\textrm {digit}}\) has a 1 in the position corresponding to the digit to be extracted and 0 everywhere else. We note that checking whether this digit is 0 or 1 does not require any additional operations in programming languages such as C where 0 is interpreted as \(\mathbf {false}\) and any other integer is interpreted as \(\mathbf {true}\).
\(\rm {\rm\small EXPONENT}(x)\) can be implemented as a sequence of logic and arithmetic operations. The raw bits of the exponent can be extracted with \(\mathbf {\textsf {c}}:=\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {exp}})\), where \(\mathbf {\textsf {m}}_{\textrm {exp}}\) has 1 in the positions corresponding to the exponent bits of the binary representation of x. This can be converted into the unsigned integer \(\rm {\rm\small LSR}(\mathbf {\textsf {c}},p-1)\), and the signed exponent can be obtained by subtracting the bias of the storage floating-point format. If x is subnormal in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\), then the value computed in this way is \(-{e_{\smash{\max }}}= {e_{\smash{\min }}}-1\), and the correct value to return in this case is \({e_{\smash{\min }}}- \lambda\), where \(\lambda\) is the number of leading zeros in the significand of x, including the implicit bit.
\(\rm {\rm\small SIGNIFICAND}(x)\) can be implemented leveraging the function \(\rm {\rm\small EXPONENT}\). The digits to the right of the radix point can be obtained as \(\mathbf {\textsf {c}}:=\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {frac}})\) where \(\mathbf {\textsf {m}}_{\textrm {frac}}\) is the bitmask that has the \(p-1\) trailing bits set to 1 and the remaining bits set to 0. If \({{x_{\min }}\le \vert x \vert \le {x_{\max \vphantom{\min }}}}\), then \(\rm {\rm\small EXPONENT}(x) \ge {e_{\smash{\min }}}\) and the implicit bit must be set to 1. This can be achieved by using \(\mbox{$\rm {\rm\small OR}(\mathbf {\textsf {c}},\rm {\rm\small LSL}(1,p-1))$}\), for instance.
\(\rm {\rm\small RAND}(n)\) can be implemented by concatenating numbers produced by a pseudo-random number generator. Two m-bit strings \(\mathbf {\textsf {a}}\) and \(\mathbf {\textsf {b}}\) can be joined together by \(\rm {\rm\small OR}(\rm {\rm\small LSL}(\mathbf {\textsf {a}},m),\mathbf {\textsf {b}})\), and the unnecessary bits can be set to zero using a suitable bitmask.
\(\rm {\rm\small SIGN}(x)\) is relatively expensive to implement by means of bit manipulation. However, note that we only need to compute the product \(\operatorname{sign}(x) \cdot y,\) where y is a positive floating-point number. This operation can be implemented as \(\rm {\rm\small OR}(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {sign}}),\mathbf {\textsf {y}})\), where \(\mathbf {\textsf {m}}_{\textrm {sign}}\) is the bitmask with a leading 1 followed by zeros, and the string \(\mathbf {\textsf {y}}\) denotes the floating-point representation of y.
\(\rm {\rm\small TAIL}(x,i)\) can be implemented as \(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {tail}})\), where the trailing \(p-i\) bits of \(\mathbf {\textsf {m}}_{\textrm {tail}}\) are set to 1 and the remaining bits are set to 0. This way, bits i to \(p-1\) of the significand of \(\mathbf {\textsf {x}}\) are preserved while the rest of the bits, including those representing the sign and exponent, are set to zero.
\(\rm {\rm\small TRUNC}(x,i)\) can be implemented as \(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}^{\prime })\), where \(\mathbf {\textsf {m}}^{\prime }=\rm {\rm\small NOT}(\mathbf {\textsf {m}}_{\textrm {tail}})\). This way, bits i to \(p-1\) of the significand of \(\mathbf {\textsf {x}}\) are set to zero while the rest of the bits of \(\mathbf {\textsf {x}}\), including the exponent and sign bits, are preserved.
\(\rm {\rm\small ULP}(x,p)\) is a rather expensive function to implement, because it requires extracting the exponent from the binary representation of x and then performing arithmetic operations on it. Increasing or decreasing x by \(\rm {\rm\small ULP}(x, p)\), on the contrary, can be achieved efficiently using only one bit shift and one integer arithmetic operation. In particular, it suffices to add to the binary representation of x, seen as an unsigned integer, a number that has 0 everywhere but in the position corresponding to the pth digit of x, that is, the digit in position \(p-1\) of the significand. We note that this technique could fail if \(x = \pm \infty\), since adding \(\rm {\rm\small ULP}(x, p)\) in this fashion would turn infinities into \({\text{N}\scriptstyle\text{A}}\text{N}\scriptstyle\text{S}\). It is easy to check that this is not a problem in our setting, as we only add or subtract \(\rm {\rm\small ULP}(x, p)\) when x is finite and nonzero.
It is possible to implement some of the rounding routines even more efficiently by extending to other rounding modes the technique for round-to-nearest with ties-to-even used in the function
As a running example, consider the rounding of a binary32 number y to a lower-precision target format. We denote the 32-bit string containing the floating-point representation of y by \(\mathbf {\textsf {y}}\) and use the uppercase Latin letters X and Y to denote the unsigned integers that can be obtained by interpreting \(\mathbf {\textsf {x}}\) and \(\mathbf {\textsf {y}}\) as unsigned integers in radix 2. All the usual underflow and overflow checks are not included here—the aim is to demonstrate the core ideas for performing each type of rounding efficiently. We recall that the sign of a floating-point number can be determined by checking the leftmost bit, and that x (respectively, y) is negative if \(\mathbf {\textsf {x}}_{0}\) (respectively, \(\mathbf {\textsf {y}}_{0}\)) is set and positive otherwise. The rounding modes amenable to this approach can be implemented as follows:
Round-to-nearest with ties-to-even: isolate the bit in position \({{p}}-1\) of the significand of x, and then compute \(\rm {\rm\small TRUNC}(X+ \rm {\rm\small LSR}(\mathbf {\textsf {m}}_{\textrm {tail}},1) + \rm {\rm\small DIGIT}(x, {{p}}- 1), {{p}})\). When rounding a binary32 number to bfloat16, the formula becomes \(\rm {\rm\small TRUNC}(Y+ \text{0x7FFF} + \mathbf {\textsf {y}}_{15}, 16)\).
Round-to-nearest with ties-to-away: return \(\rm {\rm\small TRUNC}(X+ \rm {\rm\small LSL}(\text{0x1},{{p}}^{(h)}-{{p}}-1), {{p}})\), which in our example becomes \(\rm {\rm\small TRUNC}(Y+ \text{0x8000}, 16)\).
Round-to-nearest with ties-to-zero: return \(\rm {\rm\small TRUNC}(X+ \rm {\rm\small LSR}(\mathbf {\textsf {m}}_{\textrm {tail}},1), {{p}})\), which in our example becomes \(\rm {\rm\small TRUNC}(Y+ \text{0x7FFF}, 16)\).
Round-toward-\(+\infty\): return \(\rm {\rm\small TRUNC}(X,{{p}})\) if \(\mathbf {\textsf {x}}_{0}\) is set and \(\rm {\rm\small TRUNC}(X+ \mathbf {\textsf {m}}_{\textrm {tail}},{{p}})\) otherwise. For our example, return \(\rm {\rm\small TRUNC}(Y, 16)\) if \(\mathbf {\textsf {y}}_{0}\) is set and \(\rm {\rm\small TRUNC}(Y+ \text{0xFFFF}, 16)\) otherwise.
Round-toward-\(-\infty\): return \(\rm {\rm\small TRUNC}(X,{{p}})\) if \(\mathbf {\textsf {x}}_{0}\) is not set and \(\rm {\rm\small TRUNC}(X+ \mathbf {\textsf {m}}_{\textrm {tail}},{{p}})\) otherwise. For our example, return \(\rm {\rm\small TRUNC}(Y, 16)\) if \(\mathbf {\textsf {y}}_{0}\) is not set and \(\rm {\rm\small TRUNC}(Y+ \text{0xFFFF}, 16)\) otherwise.
Round-toward-zero: return \(\rm {\rm\small TRUNC}(X,{{p}})\). For our example, return \(\rm {\rm\small TRUNC}(Y, 16)\).
Stochastic rounding with proportional probabilities: return \(\rm {\rm\small TRUNC}(X+\rm {\rm\small RAND}({{p}}^{(h)}- t),{{p}})\). For our example, return \(\rm {\rm\small TRUNC}(Y+\rm {\rm\small RAND}(16), 16)\).
7 IMPLEMENTATION AND VALIDATION OF THE CODE
Our C implementation of the algorithms discussed in Sections 5 and 6 is available on GitHub. The code can be compiled as a static or dynamic library, but we also provide the option to use CPFloat as a header-only library.
A header-only library allows the user to take advantage of the inlining feature of the C language for maximum efficiency, and it also enhances the portability of the code, as packaging of the binaries and installation of the library are not required. To alleviate the main drawback of this approach, that is, a longer compilation time, we divided the library into two separate units, one for each supported storage format.
To achieve better performance on large data, our functions work directly on C arrays. All the algorithms discussed in Section 5 are embarrassingly parallel, as each element of an array can be rounded independently of all the others. Therefore, our code can leverage OpenMP if it is available on the system in use.
In general, OpenMP brings significant gains in terms of performance but greatly increases the execution time for arrays with just a few elements. This well-known phenomenon is caused by the additional overhead of synchronization and loop scheduling [5], which is negligible for large arrays but can be significant when only a small amount of work is allocated to each OpenMP thread. The impact of this overhead is hard to quantify in general, as it depends on the hardware platform as well as the number of OpenMP threads and the compiler used [6]. Our library contains both a parallel and a sequential version of each function, but we were unable to provide a single threshold that would allow the code to switch automatically from one variant to the other for optimal performance. Thus, we devised a simple auto-tuning strategy that tries to determine the optimal threshold for the system in use by timing the rounding function on several arrays of different lengths and performing a binary search.
For generating the pseudo-random numbers required for stochastic rounding, we rely on algorithms from the family of permuted congruential generators developed by O’Neill [48], who provides a pure C implementation available on GitHub.7 In our code, we use the functions
To validate our code experimentally, we wrote a suite of extensive unit tests. We describe in detail how we tested the rounding routines—all other functions in the library are tested by relying on them. We considered two storage formats, binary32 and binary64, which are available in C via the native data types float and double.
First, we checked that all the numbers that can be represented exactly in the target format, including subnormals and special values such as infinities and \({\text{N}\scriptstyle\text{A}}\text{N}\scriptstyle\text{S}\), are not altered by any of the rounding routines. As the target formats we consider do not have an unduly large cardinality, we can test that this property is true for all representable numbers.
To check the correctness of the code when rounding is necessary, exhaustive testing is not an option, as the storage formats contain too many distinct values. In this case, we opted for testing only a set of representative values. For deterministic rounding, the correctness of the function can be assessed by checking that the output of the rounding routine matches the value predicted by the definition. For each pair of numbers \(x_1, x_2 \in \mathcal {F}^{ (\ell) }\) such that \(x_1\) and \(x_2\) are consecutive in \(\mathcal {F}^{ (\ell) }\) and \(x_1 \lt x_2\), we considered five values in \(\mathcal {F}^{ (h) }\): \(\mathrm{nextafter}(x_1, +\infty)\), \(\mathrm{nextafter}(x_m,-\infty)\), \(x_m\), \(\mathrm{nextafter}(x_m,+\infty)\), and \(\mathrm{nextafter}(x_2,-\infty)\). Here, \(\mathrm{nextafter}(x,y)\) denotes the next number in \(\mathcal {F}^{ (h) }\) after x in the direction of y, and \(x_m\) denotes the midpoint between \(x_1\) and \(x_2\). We used the same technique for numbers in the underflow range, whereas for testing the correctness of overflow, we used the values \(\mathrm{nextafter}(\pm {x_{\max \vphantom{\min }}}^{(\ell)},\pm \infty)\), \(\mathrm{nextafter}(\pm {x_{\mathrm{bnd}}}^{(\ell)},\mp \infty)\), \(\pm {x_{\mathrm{bnd}}}^{(\ell)}\), \(\mathrm{nextafter}(\pm {x_{\mathrm{bnd}}}^{(\ell)}, \pm \infty)\), where \({x_{\mathrm{bnd}}}^{(\ell)} := 2^{{e_{\smash{\max }}}^{(\ell)}}(2-2^{-{{p}}})\) is the IEEE 754 threshold for overflow in round-to-nearest.
This technique would not work for stochastic rounding, as each value that is not representable in \(\mathcal {F}^{ (\ell) }\) can be rounded to two different values. We produced a test set by taking, for each pair of numbers \(x_1, x_2 \in \mathcal {F}^{ (\ell) }\) such that \(x_1\) and \(x_2\) are consecutive in \(\mathcal {F}^{ (\ell) }\) and \(x_1 \lt x_2\), the numbers \((3x_1+x_2)/4\), \((x_1+x_2)/2\), and \((x_1+3x_2)/4\). We rounded each number 1,000 times and confirmed that the rounding routines always return either \(x_1\) or \(x_2\) and that the empirical probability distribution matches the expected one. We validated the correctness of the implementation for values in the underflow range by using the same technique, whereas for inputs in the overflow range, we repeated the test on the three values: \((3{x_{\max \vphantom{\min }}}^{(\ell)}+{x_{\mathrm{bnd}}}^{(\ell)})/4\), \(({x_{\max \vphantom{\min }}}^{(\ell)}+{x_{\mathrm{bnd}}}^{(\ell)})/2\), and \(({x_{\max \vphantom{\min }}}^{(\ell)}+3{x_{\mathrm{bnd}}}^{(\ell)})/4\).
The Makefile target
runs the test suite for the C implementations.
We designed a MEX interface for MATLAB and Octave that is in charge of parsing and checking the input, allocating the output, and calling our library to perform the rounding. To show that the interface is fully compatible with
in MATLAB and with
in Octave.
8 PERFORMANCE EVALUATION
The experiments were run on a machine equipped with two 12-core Intel Xeon E5-2690 v3 CPUs running at 2.60 GHz. Exclusive node access was used to avoid timing artifacts and ensure that all 24 CPU threads were available for parallel runs. The C code was compiled with version 9.3.0 of the GNU Compiler Collection (GCC) with the optimization flag
We compare three versions of our codes.
- cpfloat_seq denotes the sequential C code in our library. Rounding is performed using the algorithms in Section 6.
- cpfloat denotes the C code in our library that leverages OpenMP and employs the auto-tuning technique discussed in Section 7 to switch between the sequential and parallel implementations. Rounding is performed using the algorithms in Section 6.
- cpfloat_ml denotes our MEX interface to cpfloat compiled in MATLAB. For large matrices, this function relies on the parallel version of our C code.
As the numerical validation of the code has already been discussed in Section 7, here we focus on timings. We time the C or C++ code by comparing the values returned by a timer function immediately before and after each call.
8.1 Performance of the C Interface
In this section, we compare the performance of our C code with that of the following alternative implementations.
- chop_mpfr denotes the codes that rely on the GNU MPFR library.10 As storage format, we use the GNU MPFR custom data type mpfr_t. For rounding, our implementation sets the precision and exponent range of MPFR to those of the target format and then converts the binary64 input using the function mpfr_set_d. Arithmetic operations use mpfr_t arrays for both input and output.
- floatx denotes the codes that use the floatx class from the FloatX library.11 Arrays of binary64 numbers are converted to a lower-precision target format by invoking the constructor of the floatx class. This code requires that the parameters of the target format be specified at compile time, as the floatx class uses C++ templates that are instantiated only for the low-precision formats declared in the source code. For arithmetic operations, we use arrays of floatx objects as input and output.
- floatxr denotes the codes that rely on the floatxr class in the FloatX library. Conversion is performed using the class constructor. This class is more flexible than floatx in that it allows the number of bits of precision and the maximum exponent of the target format to be specified at runtime. As with the other implementations, for arithmetic operations, we assume that input and output are arrays of floatxr objects.
Figure 1 compares the time required by the implementations described above.
As storage and target formats, we use binary64 and binary16, respectively. We repeated the experiment simulating bfloat16 and TensorFloat-32 arithmetic—we do not reproduce these results here, as they are indistinguishable from those for binary16. We use only round-to-nearest with ties-to-even, as the FloatX library currently does not support any other rounding modes. We observe, however, that
The plots in the top row report the median timing of 100 runs for each algorithm. The plots in the bottom row present the same data as slowdown with respect to the timings of cpfloat.
We remark that
8.2 Performance of the MATLAB Interface
In this section, we compare the performance of our MEX interface with that of existing MATLAB alternatives.
First, we consider the MATLAB function chop.
The input data is obtained by manipulating the entries of an \(n \times n\) matrix X of pseudo-random numbers uniformly distributed in \((0,1)\). To generate a matrix of normal numbers, we add to each entry of X the constant value \(x_{\min}^{(\ell)}\), which guarantees that \(x_{ij}\), for \(i,j = 1, \dots, n\), is distributed uniformly in the interval \(\big(x_{\min}^{(\ell)},\, 1+x_{\min}^{(\ell)}\big)\).
In all cases, the speedup is greater than one and increases with the size of the input matrix. In other words,
We repeated the experiment using binary32 instead of binary64 as storage format and considering matrices whose entries are subnormal, rather than normal, in the target format. The results of these experiments are not included here, as they are not noticeably different from those in Figure 2.
The timings used to generate Figure 2 were obtained by allowing MATLAB to use all 24 computational threads available on the system on which the experiment was run. To assess how the performance depends on the degree of parallelism, we repeated the experiment with a restricted number of computational threads.
We note that the execution time of both functions increases as the number of threads is reduced, which confirms that they can both exploit parallelism and take advantage of additional computational threads available. On such large matrices,
The data shows that
We compared the performance of
Finally, we ran a similar comparison with INTLAB’s
We remark that
8.3 Overhead of the MATLAB Interface
As a final test, we consider the overhead introduced by MATLAB when calling the underlying C implementation of the rounding algorithms. In Figure 5, we compare the execution time required to convert a matrix to binary16 with a direct call to the C code (first column) or with a call to the MEX interface in MATLAB (second column). As the performance of the two algorithms is very similar and the data in the two series is hard to compare directly, we provide the speedup in the third column. We repeat the experiment for both binary32 (top row) and binary64 (bottom row). We remark that the C functions were tuned by using the
The raw timings show that, in our implementations, stochastic rounding is the slowest rounding mode. The performance of the other rounding modes is so similar that the corresponding lines are hard to distinguish for both the C and the MEX interface. The data in the rightmost column shows that, for both storage formats we consider, the overhead of the MEX interface is significant for small matrices but becomes negligible for matrices of order 3,000 or more.
Our results seem to suggest that MATLAB code that requires the functionalities of
9 SUMMARY AND FUTURE WORK
Motivated by the growing number of tools and libraries for simulating low-precision arithmetic, we considered the problem of rounding floating-point numbers to low precision in software. We developed low-level algorithms for a number of rounding modes, explained how to implement them efficiently using bit manipulation, and discussed how to validate their behavior experimentally by means of exhaustive testing. We developed CPFloat, a C library that implements these algorithms.
Traditionally, floating-point arithmetic has been the most widely adopted technique for working with non-integer numbers in high-performance scientific computing, but alternative methods have recently begun to gain popularity. In particular, we believe that the techniques we developed here could be adapted to posit arithmetic [12, 26], a generalization of the IEEE 754 floating-point number format, and to fixed-point arithmetic, a de facto standard technique for working with reals on systems that are not equipped with a floating-point unit. This will be the subject of future work.
ACKNOWLEDGMENTS
The authors are grateful to Nicholas J. Higham for discussions about the MATLAB function chop.
Footnotes
1 https://github.com/north-numerical-computing/cpfloat.
2 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21718#c25.
3 https://www.gnu.org/software/libc/.
4 https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html.
7 https://github.com/imneme/pcg-c.
8 https://github.com/higham/chop/blob/master/test_chop.m.
9 https://github.com/north-numerical-computing/cpfloat_experiments.
10 https://www.mpfr.org/.
11 https://github.com/oprecomp/FloatX.
12 https://github.com/higham/chop.
13 https://gerard-meurant.pagesperso-orange.fr/floatp.zip.
14 https://www.tuhh.de/ti3/rump/intlab/.
REFERENCES
- [1] 2020. Arm Architecture Reference Manual. Technical Report ARM DDI 0487F.c (ID072120). Retrieved from https://developer.arm.com/documentation/ddi0487/fc/.
- [2] 2016. Emergent Computation: A Festschrift for Selim G. Akl, Andrew Adamatzky (Ed.). Emergence, Complexity and Computation (ECC), Vol. 24. Springer, Cham.
- [3] 2008. Emulation of a FMA and correctly rounded sums: Proved algorithms using rounding to odd. IEEE Trans. Comput. 57, 4 (Feb. 2008), 462–471.
- [4] 2021. A study of the effects and benefits of custom-precision mathematical libraries for HPC codes. IEEE Trans. Emerg. Topics Comput. 9, 3 (July 2021), 1467–1478.
- [5] 1999. Measuring synchronisation and scheduling overheads in OpenMP. In Proceedings of the 1st European Workshop on OpenMP. Wiley, 99–105.
- [6] 2012. A microbenchmark suite for OpenMP tasks. In Proceedings of the 8th International Workshop on OpenMP (Lecture Notes in Computer Science, Vol. 7312). Springer-Verlag, Berlin, 271–274.
- [7] 2019. Bfloat16 processing for neural networks. In Proceedings of the 26th IEEE Symposium on Computer Arithmetic. Institute of Electrical and Electronics Engineers, 88–91.
- [8] 2021. Stochastic rounding and its probabilistic backward error analysis. SIAM J. Sci. Comput. 43, 1 (Jan. 2021), A566–A585.
- [9] 2022. Stochastic rounding: Implementation, error analysis and applications. Roy. Soc. Open Sci. 9, 3 (Mar. 2022).
- [10] 2018. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 1 (Jan. 2018), 82–99.
- [11] 2017. rpe v5: An emulator for reduced floating-point precision in large numerical simulations. Geosci. Model Dev. 10, 6 (June 2017), 2221–2230.
- [12] 2019. Posits: The good, the bad and the ugly. In Proceedings of the Conference for Next Generation Arithmetic. ACM Press, 1–10.
- [13] 2023. Matrix multiplication in multiword arithmetic: Error analysis and application to GPU tensor cores. SIAM J. Sci. Comput. 45, 1 (Feb. 2023), C1–C19.
- [14] 2021. Numerical behavior of NVIDIA tensor cores. PeerJ Comput. Sci. 7 (Feb. 2021), e330.
- [15] 2021. Algorithms for stochastically rounded elementary arithmetic operations in IEEE 754 floating-point arithmetic. IEEE Trans. Emerg. Topics Comput. 9, 3 (July 2021), 1451–1466.
- [16] 2019. FloatX: A C++ library for customized floating-point arithmetic. ACM Trans. Math. Softw. 45, 4 (Dec. 2019), 1–23.
- [17] 1950. Round-off errors in numerical integration on automatic machinery. Bull. Amer. Math. Soc. 56 (Nov. 1950), 55–64.
- [18] 1959. Reprint of a note on rounding-off errors. SIAM Rev. 1, 1 (Jan. 1959), 66–67.
- [19] 2007. MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. 33, 2 (June 2007), 13:1–13:15.
- [20] 1991. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23, 1 (Mar. 1991), 5–48.
- [21] 2020. Generating random floating-point numbers by dividing integers: A case study. In Proceedings of the International Conference on Computational Science. Springer-Verlag, Cham, Switzerland, 15–28.
- [22] 2020. IPU Programmer’s Guide. Retrieved from https://www.graphcore.ai/docs/ipu-programmers-guide.
- [23] 2019. The OpenCL C++ 1.0 Specification. The Khronos Group. Retrieved from https://registry.khronos.org/OpenCL/specs/2.2/pdf/OpenCL_Cxx.pdf.
- [24] 2019. The OpenCL C 2.0 Specification. The Khronos Group. Retrieved from https://registry.khronos.org/OpenCL/specs/2.2/pdf/OpenCL_C.pdf.
- [25] 2015. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 37). PMLR, 1737–1746. Retrieved from http://proceedings.mlr.press/v37/gupta15.html.
- [26] 2017. Beating floating point at its own game: Posit arithmetic. Supercomput. Front. Innov. 4, 2 (June 2017), 71–86.
- [27] 2002. Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA.
- [28] 2019. Simulating low precision floating-point arithmetic. SIAM J. Sci. Comput. 41, 5 (Oct. 2019), C585–C602.
- [29] 2020. Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations. Philos. Trans. R. Soc. A 378, 2166 (Jan. 2020), 22.
- [30] 1985. IEEE Standard for Binary Floating-point Arithmetic, ANSI/IEEE Standard 754-1985. Institute of Electrical and Electronics Engineers, Piscataway, NJ.
- [31] 2008. IEEE Standard for Floating-point Arithmetic, IEEE Std 754-2008 (revision of IEEE Std 754-1985). Institute of Electrical and Electronics Engineers, Piscataway, NJ.
- [32] 2019. IEEE Standard for Floating-point Arithmetic, IEEE Std 754-2019 (revision of IEEE Std 754-2008). Institute of Electrical and Electronics Engineers, Piscataway, NJ.
- [33] 2022. Accuracy of Mathematical Functions in Single, Double, Extended Double and Quadruple Precision. Technical Report hal-03141101, version 2. Inria. Retrieved from https://hal.inria.fr/hal-03141101v2.
- [34] 2018. BFLOAT16—Hardware Numerics Definition. Retrieved from https://software.intel.com/en-us/download/bfloat16-hardware-numerics-defnition.
- [35] 2020. Intel Architecture Instruction Set Extensions and Future Features Programming Reference. Technical Report 319433-038 (March 2020). Retrieved from https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf.
- [36] 2017. Power ISA Version 3.0 B. Retrieved from https://ibm.ent.box.com/s/1hzcwkwf8rbju5h9iyf44wm94amnlcrv.
- [37] 2013. SIPE: Small integer plus exponent. In Proceedings of the 21st IEEE Symposium on Computer Arithmetic. Institute of Electrical and Electronics Engineers, 99–106.
- [38] 2013. Sipe: A Mini-library for Very Low Precision Computations with Correct Rounding. Technical Report hal-00864580, version 1. Inria. Retrieved from https://hal.inria.fr/hal-00864580.
- [39] 2016. Quad precision floating point on the IBM z13. In Proceedings of the 23rd IEEE Symposium on Computer Arithmetic. Institute of Electrical and Electronics Engineers, 87–94.
- [40] 2022. The GNU C Library Reference Manual (for version 2.36). Free Software Foundation. Retrieved from https://www.gnu.org/software/libc/manual/pdf/libc.pdf.
- [41] 2020. FLOATP_Toolbox. MATLAB software, variable precision floating-point arithmetic. Retrieved from https://gerard-meurant.pagesperso-orange.fr/floatp.zip.
- [42] 2017. “Half Precision” 16-Bit Floating Point Arithmetic. Blog post. Retrieved from https://blogs.mathworks.com/cleve/2017/05/08/half-precision-16-bit-floating-point-arithmetic.
- [43] 2005. On the Definition of \(\mathrm{ulp}(x)\). Technical Report RR-5504, LIP RR-2005-09. Inria, LIP. Retrieved from https://hal.inria.fr/inria-00070503.
- [44] 2018. Handbook of Floating-point Arithmetic (2nd ed.). Birkhäuser.
- [45] 2016. NVIDIA Tesla P100 Architecture. Technical Report WP-08019-001_v01.1. Retrieved from https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf.
- [46] 2020. NVIDIA A100 Tensor Core GPU Architecture. Technical Report V1.0. Retrieved from https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
- [47] 2022. NVIDIA H100 Tensor Core GPU Architecture. Technical Report V1.03. Retrieved from https://resources.nvidia.com/en-us-tensor-core.
- [48] 2014. PCG: A Family of Simple Fast Space-efficient Statistically Good Algorithms for Random Number Generation. Technical Report HMC-CS-2014-0905. Harvey Mudd College. Retrieved from https://www.cs.hmc.edu/tr/hmc-cs-2014-0905.pdf.
- [49] 2022. FASE: A fast, accurate and seamless emulator for custom numerical formats. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 144–146.
- [50] 2001. Numerical Computing with IEEE Floating Point Arithmetic. Society for Industrial and Applied Mathematics, Philadelphia, PA.
- [51] 2017. Radeon’s Next-generation Vega Architecture. Technical Report. Advanced Micro Devices. Retrieved from https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf.
- [52] 2018. Influence of a-posteriori subcell limiting on fault frequency in higher-order DG schemes. In Proceedings of the 8th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS). Institute of Electrical and Electronics Engineers, Piscataway, NJ, 79–86.
- [53] 2014. Innocuous double rounding of basic arithmetic operations. J. Formaliz. Reason. 7, 1 (July 2014), 131–142.
- [54] 1999. INTLAB—INTerval LABoratory. In Developments in Reliable Computing. Springer-Verlag, Dordrecht, Netherlands, 77–104.
- [55] 2017. IEEE754 precision-\(k\) base-\(\beta\) arithmetic inherited by precision-\(m\) base-\(\beta\) arithmetic for \(k \lt m\). ACM Trans. Math. Softw. 43, 3 (Jan. 2017), 1–15.
- [56] 2021. Dynamically adapting floating-point precision to accelerate deep neural network training. In Proceedings of the 20th IEEE International Conference on Machine Learning and Applications. 980–987.
- [57] 2021. Doubt and redundancy kill soft errors—Towards detection and correction of silent data corruption in task-based numerical software. In Proceedings of the 11th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale. Institute of Electrical and Electronics Engineers, Piscataway, NJ.
- [58] 2018. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, Vol. 31. Curran Associates, 7675–7684. Retrieved from http://papers.nips.cc/paper/7994-training-deep-neural-networks-with-8-bit-floating-point-numbers.pdf.
- [59] 1963. Rounding Errors in Algebraic Processes. Notes on Applied Science No. 32. Her Majesty’s Stationery Office.
- [60] 2019. QPyTorch: A low-precision arithmetic simulation framework. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing—NeurIPS Edition. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 11–13.