
CPFloat: A C Library for Simulating Low-precision Arithmetic

Published: 17 June 2023


Abstract

One can simulate low-precision floating-point arithmetic via software by executing each arithmetic operation in hardware and then rounding the result to the desired number of significant bits. For IEEE-compliant formats, rounding requires only standard mathematical library functions, but handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly in finite arithmetic. Moreover, the ensuing implementations are not necessarily efficient, as the library functions these techniques build upon are typically designed to handle a broad range of cases and may not be optimized for the specific needs of rounding algorithms. CPFloat is a C library for simulating low-precision arithmetics. It offers efficient routines for rounding, performing mathematical computations, and querying properties of the simulated low-precision format. The software exploits the bit-level floating-point representation of the format in which the numbers are stored and replaces costly library calls with low-level bit manipulations and integer arithmetic. In numerical experiments, the new techniques bring a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To our knowledge, CPFloat is currently the most efficient and complete library for experimenting with custom low-precision floating-point arithmetic.


1 A PLETHORA OF FLOATING-POINT FORMATS AND ROUNDING MODES

The 2019 revision of the IEEE 754 standard for floating-point arithmetic [32] specifies three basic binary formats for computation: binary32, binary64, and binary128. Most 64-bit CPUs equipped with a floating-point unit natively support binary32 and binary64 arithmetic, and 32-bit CPUs can simulate binary64 arithmetic very efficiently by relying on highly optimized software libraries. The binary128 format, introduced in the 2008 revision [31] of the original IEEE 754 standard [30], has not gained much popularity among hardware manufacturers, and over 10 years after having been standardized, it is supported only by two IBM processor families: the POWER9, which implements version 3.0 of the Power Instruction Set Architecture [36], and the z/Architecture [39].

In fact, the low-precision requirements of artificial intelligence applications have steered the hardware market in the opposite direction, and a number of fewer-than-32-bit formats have been proposed in recent years. The first widely available 16-bit floating-point format was binary16. Despite having been defined in the last two revisions of the IEEE 754 standard only as an interchange format, binary16 has been supported as an arithmetic format by all NVIDIA microarchitectures since Pascal [45] and all AMD architectures since Vega [51]. Google has recently introduced the bfloat16 data type [34], a 16-bit format with approximately the same dynamic range as binary32.

The latest Armv8 CPUs support a wide variety of floating-point formats, including binary32, binary64, bfloat16 [1, Section A1.4.5], binary16, and an alternative custom half-precision format [1, Section A1.4.2]. The latter is a 16-bit format based on binary16 that extends the dynamic range from \([-65{,}504,\: +65{,}504]\) to \([-131{,}008, \: +131{,}008]\) by reclaiming the \(2{,}048\) bit patterns (about 3%) used for infinities and NaNs (Not a Number).

The latest NVIDIA graphics card microarchitectures offer new low-precision formats, aimed at machine learning, that break with the 16-bit tradition. The A100 GPUs, which are based on the Ampere microarchitecture [46], support a 19-bit floating-point format, called TensorFloat-32, which combines the precision of binary16 with the exponent range of binary32. The latest H100 GPUs, based on the Hopper microarchitecture [47], provide two 8-bit formats: E4M3, with 4 bits of precision and dynamic range \([-240, 240]\), and E5M2, which trades one bit of precision for the much larger dynamic range \([-57{,}344, 57{,}344]\).

In Section 4, we discuss the general framework that underlies all these floating-point formats. This framework is a straightforward extension of the IEEE 754 standard [30, 31, 32] and is well established in the literature [27, 44].

Such a broad range of floating-point formats poses a major challenge to those trying to develop mixed-precision algorithms for scientific computing, because studying the numerical behavior of an algorithm in different working precisions may require access to a number of high-end hardware platforms. To alleviate this issue, two families of techniques for simulating low-precision floating-point arithmetic via software have been proposed over the years.

In principle, one can use (signed) integers to represent the exponent and significand of the low-precision numbers. All mathematical functions can then be evaluated by explicitly operating on the integers that make up these representations. This solution allows for maximum flexibility, and it is typically used to simulate arbitrary-precision arithmetics, where exponent and significand can have arbitrarily many digits.

If only low-precision arithmetic is of interest, however, then relying on the floating-point arithmetics available in hardware will typically lead to more efficient software simulators. In fact, one can perform each arithmetic operation using a floating-point format available in hardware, check that the result is representable in the low-precision format of interest, and then round the significand to the desired number of significant bits. In this context, simulating low-precision floating-point arithmetic only requires the ability to round a high-precision number to lower precision, and users can leverage existing hardware and mathematical libraries. This method is more convenient: It leads to software that is easier to maintain, since less code is necessary, and is typically more efficient, as most of the computation is performed using thoroughly optimized high-precision mathematical routines that are already available. Here, we follow this second approach, which is the one most often encountered in the literature, as the survey of existing software for simulating low-precision floating-point arithmetic in Section 2 illustrates.

Our contribution is two-fold. First, we discuss how the operations underlying the rounding of a floating-point number x to lower precision can be performed directly on the binary floating-point representation of x. We present the new algorithms in Section 5, and in Section 6, we explain how to implement them efficiently using bitwise operations and integer arithmetic. Second, we introduce CPFloat, a header-only C library that implements our algorithms and can be used to simulate custom low-precision binary floating-point arithmetic. The name of the library is a shorthand for Custom-Precision Floats. Section 3 lists the functions available in the library and provides a minimal, self-contained code snippet showing how CPFloat can be used as a full arithmetic library with custom precision floating-point formats. Section 7 discusses some implementation details and describes how the library was tested for correctness.

The floating-point representation in the simulated (target) format exists only implicitly: In practice, all numbers are represented in a (storage) format that is natively supported by the compiler. The storage formats currently available in CPFloat are binary32 and binary64.

CPFloat is not the first library for simulating low-precision arithmetic in C: The GNU MPFR library [19], for example, allows the programmer to work with arbitrarily few, as well as arbitrarily many, bits of precision. Unlike MPFR, CPFloat is intended only as a simulator for low-precision floating-point arithmetics. Internally, floating-point values are represented using binary32 or binary64 numbers, thus only formats with at most 53 bits of precision and an exponent range no broader than that of binary64 can be simulated. This narrower aim provides scope for the wide range of optimizations we discuss, which in turn yield more efficient implementations.

We provide a MEX interface that allows users to call CPFloat directly from their MATLAB and Octave codes. We use this to compare the performance of the new library with that of the MATLAB function chop [28] and of the MATLAB toolboxes FLOATP_Toolbox [41] and INTLAB [54]. Our experimental results, discussed in Section 8, show that the new codes bring a considerable speedup over these existing alternatives, as long as the matrices being rounded are large enough to offset the overhead of calling C code from MATLAB.

The library can be used to prototype and test mixed-precision algorithms, as well as to simulate custom-precision hardware. Recently, for example, we have used CPFloat to develop an algorithm for computing matrix–matrix products on NVIDIA GPUs equipped with tensor cores [13]. These mixed-precision matrix multiply–accumulate units internally use different precisions and rounding modes [14]. Using a software simulation, we could evaluate the numerical behavior of custom variations of the tensor cores, which we obtained by changing the rounding mode used in some key operations.


2 RELATED WORK

A number of packages to simulate low-precision floating-point arithmetic via software are available. These are summarized in Table 1, where we report the main programming language in which the software is written and detail what storage formats, target formats, and rounding modes are supported.

Table 1. Synoptic Table of Available Software Packages for Simulating Low-precision Floating-point Arithmetic

Package name | Primary language | Storage format | p | e | s | Built-in | RNE RNZ RNA RZ RUD RO SR
GNU MPFR | C | custom | A | A | O | | \(\checkmark\) ! \(\checkmark\) \(\checkmark\)
Sipe | C | multiple | R | S | Y | | \(\checkmark\) \(\checkmark\)
rpe | Fortran | fp64 | R | B | B | fp16 | \(\checkmark\)
FloatX | C++ | fp32/fp64 | R | S | Y | | \(\checkmark\)
INTLAB | MATLAB | fp64 | R | S | Y | | \(\checkmark\) \(\checkmark\) \(\checkmark\)
chop | MATLAB | fp32/fp64 | R | S | F | fp16/bf16 | \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\)
FLOATP | MATLAB | fp64 | R | A | N | | \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\)
QPyTorch | Python | fp32 | R | S | N | | \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\)
CPFloat | C | fp32/fp64 | R | S | F | fp16/bf16/tf32 | \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\) \(\checkmark\)

  • The first three columns report the name of the package, the main programming language in which the software is written, and what storage formats are supported. The following three columns describe the parameters of the target formats: whether the number of bits of precision in the significand is arbitrary (A) or limited to the number of bits in the significand of the storage format (R); whether the exponent range can be arbitrary (A), must be a—possibly restricted—sub-range of the exponent range of the storage format (S), or can be a sub-range only for built-in types (B); whether the target format supports subnormal numbers (Y), does not support them (N), supports them only for built-in types (B), supports them by default but allows the user to switch the functionality off (F), or does not support them by default but allows the user to turn the functionality on (O). The following column lists the floating-point formats that are built into the system, if any. The last seven columns indicate what rounding modes the software supports: round-to-nearest with ties-to-even (RNE), ties-to-zero (RNZ), or ties-to-away (RNA), round-toward-zero (RZ), round-to-\(+\infty\) and round-to-\(-\infty\) (RUD), round-to-odd (RO), and the two variants of stochastic rounding discussed in Section 4 (SR). Fully supported rounding modes are denoted by \(\checkmark\), while ! is used for rounding modes that can be simulated at a higher computational cost. The abbreviations bf16, tf32, fp16, fp32, and fp64 denote the formats bfloat16, TensorFloat-32, binary16, binary32, and binary64, respectively.

The most comprehensive software package for simulating arbitrary-precision floating-point arithmetic is the GNU MPFR library [19], which extends the formats in the IEEE 754 standard to any precision and to an arbitrarily large exponent range. The library is written in C, but interfaces for most programming languages are available. GNU MPFR is a de facto standard for working with arbitrary precision and is typically used to perform computations that require high accuracy, rather than to simulate low-precision arithmetics.

Sipe is a header-only C mini-library designed to simulate low precision efficiently. This software supports round-to-nearest with ties-to-even and round-to-zero, and numbers can be stored either as a pair of signed integers representing the significand and the exponent of the floating-point value [37] or as a value in a native floating-point format. The latest version of the library [38] supports float, double, long double, or __float128 (from the GCC libquadmath library).

Dawson and Düben [11] recently developed a Fortran library, called rpe, for simulating reduced-precision floating-point arithmetics in large numerical simulations. In rpe, the reduced-precision floating-point values are stored as binary64 numbers, a solution that provides a very efficient technique for simulating floating-point formats with the same exponent range as binary64 and no more than 53 bits of precision.

FloatX [16] is a header-only C++ library for simulating low-precision arithmetic that supports binary32 and binary64 as storage formats. This software is more flexible than rpe, in that it allows the user to choose not only the number of significant digits, but also the number of bits used to represent the exponent in the target format.

The only rounding mode currently implemented by both rpe and FloatX is round-to-nearest with ties-to-even, which may be too restrictive when one wants to simulate hardware units where truncation or stochastic rounding are available.

The flround function in the INTLAB toolbox for MATLAB and Octave [54] allows the user to round binary64 values to lower-precision formats. The routine implements the binary version of the algorithm by Rump [55, Alg. 3] and rounds a number x by first computing \(x^{\prime }=x+C\), where C is a suitably chosen constant, and then returning \(y=x^{\prime }-\,C\). The addition and subtraction are carried out in the format in which x is stored and are therefore fast when that format is supported in hardware. This strategy requires that the exponent range of x be limited, to avoid overflow in the computation of C: In binary64, for example, the maximum available exponent is 1,023, but the exponent of x cannot exceed 970 when the flround function is used. The toolbox supports round-to-nearest with ties-to-even and the three directed rounding modes prescribed by the IEEE 754 standard. Furthermore, the full arithmetic library fl operates on fl-type objects, which are binary64 values rounded to a lower custom-precision representation.

Higham and Pranesh [28] proposed chop, a MATLAB function for rounding arrays of binary32 or binary64 numbers to lower precision. This solution is more efficient and flexible than the fp16 and vfp16 MATLAB data types proposed by Moler [42], as the user can specify not only the boundaries of the dynamic exponent range and the number of bits in the significand, but also the rounding mode to be used and whether subnormals are supported. chop supports six rounding modes: the four default rounding modes prescribed by the IEEE 754-2019 for single operations and two variants of stochastic rounding. This function can be used only from within the MATLAB programming environment, and the underlying algorithms, which rely on mathematical operations involving the exponent and significand of the represented floating-point numbers, are not necessarily suitable for efficient implementations in a low-level language such as C. For example, chop uses the built-in MATLAB functions abs, sign, ceil, floor, log2, and pow2, which may not be optimized for the narrow range of inputs required to disassemble and reassemble floating-point numbers.

FLOATP_Toolbox [41] is a MATLAB toolbox for simulating reduced-precision fixed-point and floating-point arithmetics. This library uses binary64 as storage format, supports the same six rounding modes as chop [28], and implements a number of mathematical functions, such as log, exp, and sin. The functionalities of the FLOATP_Toolbox can be used in two ways: either by calling the routines that work directly on the binary64 data structure or by relying on the methods of the floatp class, which override a number of built-in MATLAB functions. A similar library for simulating posit arithmetic has been developed by the same author [12, 26].

The QPyTorch library is a low-precision arithmetic simulator, written in PyTorch, whose primary goal is to facilitate the training of neural networks in low precision without the overhead of building low-precision hardware [60]. The core design principles of this library are analogous to those of chop, in that numbers are stored in binary32 before as well as after rounding. QPyTorch supports custom floating-point formats that can fit into the binary32 format, arbitrary-precision fixed-point formats no wider than 24 bits, and block floating-point formats [59]. Infinities, NaNs, and subnormals are not supported for efficiency’s sake and because, as the authors point out, these special values do not appear when training neural networks and thus may not be supported by the underlying low-precision hardware. The supported rounding modes are stochastic rounding and round-to-nearest with ties-to-even.

VPREC-libm [4] is a related tool developed to evaluate the accuracy/performance tradeoffs of numerical code. The library instruments the code so calls to mathematical functions are intercepted at runtime and performed in simulated low precision. Each function is evaluated in binary128 arithmetic by using the GCC libquadmath library and the result is then rounded to the target precision.

The packages SERP [56] and FASE [49] use dynamic binary translation to change the precision of an executable seamlessly, without requiring any alteration of the source code. A downside of this approach is that the user cannot control which rounding mode a given operation will use, and the target format appears to be a global setting that cannot be changed on a per-operation basis. This significantly limits the simulation environment and, in particular, rules out the simulation of mixed-precision operations, which are becoming increasingly common in hardware devices.

The algorithms presented here complement existing software by proposing efficient techniques for implementing rounding using low-level bitwise instructions. Our library is intended as a software package that enables the use of these rounding functionalities in low-level languages such as C and C++, but high-level languages that allow the user to call C routines, either directly or indirectly, can also benefit from it.


3 FEATURES AND EXAMPLES

Listing 1. CPFloat example.

At the core of CPFloat are the efficient rounding routines described in Section 5. These convert a number from one floating-point format (the storage format, which will be either binary32 or binary64) to a second, custom format (the target format). The representation in the target format is implicit, as the rounded numbers are still stored as either binary32 or binary64 values. Relying on these rounding routines, CPFloat provides functions to simulate custom low-precision arithmetic when operating on arrays of scalars. We now describe its user interface, using Listing 1 as a reference. In the example, we use double as the storage format and thus include the header file cpfloat_binary64.h; the header file cpfloat_binary32.h should be included when using float arrays.

The target numerical format is specified by a C structure of type optstruct. The structure should be initialized by calling the allocation function provided by the library, as done on line 8 of Listing 1. This function allocates the memory underlying the optstruct correctly, but the target format that the structure represents is still undefined. The parameters of the target format should be initialized explicitly, as shown on lines 11–17.
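
The following minimal sketch illustrates this workflow. The allocation and deallocation helpers (here called init_optstruct and free_optstruct), the field names other than flip and p, and the constants other than CPFLOAT_NO_SOFTERR are our reading of the package interface and should be checked against the documentation; only cpfloat_validate_optstruct, cpfloat, and cpf_fpround are names listed in Table 2. The line numbers mentioned in the text refer to Listing 1, not to this sketch.

  #include <stdio.h>
  #include "cpfloat_binary64.h"  /* binary64 used as the storage format */

  int main(void) {
    /* Allocate the structure that describes the target format
       (init_optstruct/free_optstruct are assumed helper names). */
    optstruct *fpopts = init_optstruct();

    /* Describe a binary16-like target format (field names assumed). */
    fpopts->precision = 11;                  /* bits of precision in the significand */
    fpopts->emin      = -14;                 /* minimum exponent                      */
    fpopts->emax      = 15;                  /* maximum exponent                      */
    fpopts->subnormal = CPFLOAT_SUBN_USE;    /* keep subnormal numbers                */
    fpopts->round     = CPFLOAT_RND_NE;      /* round-to-nearest, ties-to-even        */
    fpopts->flip      = CPFLOAT_NO_SOFTERR;  /* no soft-error injection (see below)   */

    /* Validate the parameters: negative return values signal an invalid format. */
    if (cpfloat_validate_optstruct(fpopts) < 0)
      return 1;

    /* Round an array of binary64 values to the target format. */
    double A[3] = {1.0 / 3.0, 2.0 / 3.0, 4.0 / 3.0};
    double X[3];
    cpf_fpround(X, A, 3, fpopts);
    for (int i = 0; i < 3; i++)
      printf("%.17g -> %.17g\n", A[i], X[i]);

    free_optstruct(fpopts);
    return 0;
  }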

In Listing 1, we do not showcase all the functionalities of CPFloat, and, in particular, we do not inject soft errors in the low-precision results. The library supports two injection modes, which can be selected by changing the value of the flip field of the optstruct structure. If flip is set to CPFLOAT_FRAC_SOFTERR, then with probability p a single bit flip can strike the fraction of the low-precision rounded number. This is the injection mode used in chop [28]. If flip is set to CPFLOAT_FP_SOFTERR, however, then the single bit flip can strike any bit of the low-precision floating-point representation. Soft errors are typically modelled in this way in high-performance computing studies [2] as well as in numerical simulations [52, Section V]. For reproducibility, errors are sometimes introduced by hand at specific points of the execution rather than at random [57]; in CPFloat, this can be achieved by setting p to 1.0 and switching the value of the flip field between CPFLOAT_FP_SOFTERR (soft errors on) and CPFLOAT_NO_SOFTERR (soft errors off).
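
For example, with the fpopts variable from the sketch above (and assuming, as the discussion suggests, that the probability is stored in a field named p), deterministic injection on the full floating-point representation can be switched on and off as follows:

  /* Inject exactly one bit flip anywhere in the rounded representation. */
  fpopts->flip = CPFLOAT_FP_SOFTERR;
  fpopts->p    = 1.0;                /* strike with probability 1            */
  /* ... rounding or arithmetic calls placed here are hit by a bit flip ... */
  fpopts->flip = CPFLOAT_NO_SOFTERR; /* switch soft errors off again         */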

The parameters in an optstruct structure can be validated with cpfloat_validate_optstruct if the storage format is binary64, or with cpfloat_validate_optstructf if the storage format is binary32. The convention of appending an f to the function name when the storage format is binary32 is common in C and is used throughout the library whenever binary64 and binary32 variants are offered. The two functions return a negative integer if the fields of the input do not represent a valid floating-point format, zero if they do, and a positive integer if they represent a valid format that may yield unexpected behavior such as potentially harmful double rounding (see Section 5). Some of the fields of an optstruct are pointers, and the deallocation function provided by the library should be used, as shown on line 49, to free the memory correctly.

The functions that make up the library can be divided into three groups: functions for rounding, functions for computing in low-precision arithmetic, and functions for probing the low-precision floating-point format. We discuss these assuming arrays of binary64 values—for binary32 arrays it suffices to change all double to float and add an f at the end of the function name.

A double array can be rounded to low precision using the functions cpfloat and cpf_fpround, which convert the n elements in the array A to the format specified by the parameters in fpopts and store the result in the array X. We note that cpfloat is just an alias for cpf_fpround that ensures backward compatibility with previous versions of the library—the two functions rely on the same implementation. We also remark that X and A may point to the same memory location, in which case the rounding is performed in place.

The second group contains functions for simulating the four elementary arithmetic operations and many mathematical functions in low precision. The computation is first performed in the storage format, using the corresponding C operators or C mathematical library (math.h) functions, and the result is then rounded to the specified low-precision format.
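
For instance, assuming the fpopts variable configured in the sketch above, an elementwise addition rounded to the target format might look as follows; the argument order (output array, input arrays, number of elements, options) mirrors that of the rounding functions but is our assumption, so the package documentation should be consulted for the exact signatures.

  /* c[i] = fl(a[i] + b[i]), where fl denotes rounding to the target format.
     Hypothetical argument order: output, first input, second input, length, options. */
  double a[2] = {1.0, 1.0}, b[2] = {0x1p-12, 0x1p-24}, c[2];
  cpf_add(c, a, b, 2, fpopts);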

The third and last group comprises functions that work on the implicit target-format representation. These can be used to extract the exponent and fraction of the number rounded to low precision, to compute the number that in the target format is closest to a given value in the storage format, and to check whether a value in the storage format would become subnormal, normal, infinite, or NaN once rounded to the target format.

We list all functions available in CPFloat in Table 2 and refer the reader to the package documentation for additional details.

Table 2. CPFloat Functions if double is Used as Storage Format

Function | Description
cpfloat_validate_optstruct | Validate fields of optstruct struct variable.
cpfloat | Round double array to target format (legacy function name).
cpf_fpround | Round double array to target format.
cpf_add | Sum in target format.
cpf_sub | Difference in target format.
cpf_mul | Product in target format.
cpf_div | Ratio in target format.
cpf_cos | Trigonometric cosine.
cpf_sin | Trigonometric sine.
cpf_tan | Trigonometric tangent.
cpf_acos | Inverse trigonometric cosine.
cpf_asin | Inverse trigonometric sine.
cpf_atan | Inverse trigonometric tangent.
cpf_atan2 | Two-argument arctangent.
cpf_cosh | Hyperbolic cosine.
cpf_sinh | Hyperbolic sine.
cpf_tanh | Hyperbolic tangent.
cpf_acosh | Inverse hyperbolic cosine.
cpf_asinh | Inverse hyperbolic sine.
cpf_atanh | Inverse hyperbolic tangent.
cpf_exp | Exponential.
cpf_frexp | Exponent and normalized fraction in target format.
cpf_ldexp | Scale number by power of 2.
cpf_log | Natural logarithm.
cpf_log10 | Base-10 logarithm.
cpf_modf | Integral and fractional part.
cpf_exp2 | Base-2 exponential.
cpf_expm1 | \(\mathrm{exp}(x) - 1\).
cpf_ilogb | Integral part of logarithm of absolute value.
cpf_log1p | Natural logarithm of number shifted by one.
cpf_log2 | Base-2 logarithm.
cpf_scalbn | Scale number by power of FLT_RADIX.
cpf_scalbln | Scale number by power of FLT_RADIX.
cpf_pow | Real powers.
cpf_sqrt | Square root.
cpf_cbrt | Cube root.
cpf_hypot | Hypotenuse of a right-angle triangle.
cpf_erf | Error function.
cpf_erfc | Complementary error function.
cpf_tgamma | Gamma function.
cpf_lgamma | Natural logarithm of absolute value of gamma function.
cpf_ceil | Ceiling function.
cpf_floor | Floor function.
cpf_trunc | Integer truncation.
cpf_round | Closest integer (round-to-nearest).
cpf_lround | Closest integer (round-to-nearest).
cpf_llround | Closest integer (round-to-nearest).
cpf_rint | Closest integer with specified rounding mode.
cpf_lrint | Closest integer with specified rounding mode.
cpf_llrint | Closest integer with specified rounding mode.
cpf_nearbyint | Closest integer with specified rounding mode.
cpf_remainder | Remainder of floating point division.
cpf_remquo | Remainder and quotient of rounded numbers.
cpf_copysign | Compose number from magnitude and sign.
cpf_nextafter | Next number in specified direction in target format.
cpf_nexttoward | Next number in specified direction in target format.
cpf_fdim | Positive difference.
cpf_fmax | Element-wise maximum.
cpf_fmin | Element-wise minimum.
cpf_fpclassify | Categorize floating-point values.
cpf_isfinite | Check whether value is finite in target format.
cpf_isinf | Check whether value is infinite in target format.
cpf_isnan | Check if value is not a number in target format.
cpf_isnormal | Check whether value is normal in target format.
cpf_fabs | Absolute value.
cpf_fma | Fused multiply-add.

  • The names of the corresponding functions for float are obtained by appending the suffix “f”: for example, the function cpfloat_mulf can be used to multiply two float arrays elementwise. All mathematical functions return as output numbers in the target format.

The example in Listing 1 is available in the CPFloat source code repository on GitHub.1 As we explain in more detail in Section 7, CPFloat can leverage OpenMP to run multiple threads in parallel, but the runtime overhead causes the parallel version of the code to be slower than the sequential one on arrays with very few elements. To alleviate this issue, we use a sequential version of the code for small arrays and switch to the parallel variant only when the number of elements exceeds a machine-dependent threshold that we determine with an auto-tuning procedure. For the C library, the auto-tuning can be triggered with a dedicated command described in the package documentation.

The example in Listing 1 can be compiled as described in the repository, producing the binary bin/example; when executed, this program writes its results to standard output.


4 FLOATING-POINT STORAGE FORMATS AND ROUNDING

A family of binary floating-point numbers \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\) is a finite set that includes a subset of the real line and a handful of values with a special meaning. In our notation, the three integer parameters p, \({e_{\smash{\min }}}\), and \({e_{\smash{\max }}}\) represent the number of binary digits of precision, the minimum exponent, and the maximum exponent, respectively, and the Boolean flag \({\mathfrak {s}_{\mathrm{n}}}\) indicates whether subnormals are supported. A real number \(x := (s,m,e)\) in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\) can be written as (4.1) \(\begin{equation} x = (-1)^s\cdot m \cdot 2^{e-p+1}, \end{equation}\) where s is the sign bit, set to 0 if x is positive and to 1 otherwise, the integral significand m is a natural number not greater than \(2^p-1\), and the exponent e is an integer between \({e_{\smash{\min }}}\) and \({e_{\smash{\max }}}\) inclusive.
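
As a concrete illustration (our own example), in a format with \({{p}}= 11\) the number 1.5 has sign bit \(s = 0\), exponent \(e = 0\), and integral significand \(m = 1536\), since \((-1)^0\cdot 1536\cdot 2^{0-11+1} = 1536\cdot 2^{-10} = 1.5\) and \(2^{p-1} = 1024 \le 1536 \le 2047 = 2^p-1\).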

To ensure a unique representation for all numbers in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle \setminus \lbrace 0\rbrace\), it is customary to normalize the system by assuming that if \(\vert x \vert \ge 2^{{e_{\smash{\min }}}}\), then \(2^{p-1} \le m \le 2^p-1\), that is, the number is represented using the smallest possible exponent and the largest possible significand. In such systems, the number \((s,m,e) \in \mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle \setminus \lbrace 0\rbrace\) is normal if \(m \ge 2^{p-1}\), and subnormal otherwise. The exponent of subnormals is always \({e_{\smash{\min }}}\), and in a normalized system any number \(x = (s,m,e) \ne 0\) has a unique p-digit binary expansion \((-1)^s \cdot \widetilde{m} \cdot 2^{e}\), where (4.2) \(\begin{equation} \widetilde{m} = m \cdot 2^{1-p} = d_0 + \frac{d_1}{2} + \dots + \frac{d_{p-1}}{2^{p-1}} = d_0.d_1\dots d_{p-1}, \end{equation}\) for some \(d_0, d_1, \dots , d_{p-1} \in \lbrace 0,1\rbrace\), is called the normal significand of x. One can verify that if x is normal, then \(d_0 = 1\) and \({1 \le \widetilde{m} \lt 2}\), whereas if x is subnormal, then \(d_0 = 0\) and \({0 \lt \widetilde{m} \lt 1}\). Conventionally, the signed zero values \(+0\) and \(-0\) are considered neither normal nor subnormal. In a normalized system, the smallest subnormal is \({x_{\mathrm{minsub}}}:= 2^{{e_{\smash{\min }}}-p+1}\), whereas the smallest and largest positive normal numbers are \({x_{\min }}:= 2^{{e_{\smash{\min }}}}\) and \({x_{\max \vphantom{\min }}}:= 2^{{e_{\smash{\max }}}}(2-2^{1-p})\), respectively. Their negative counterparts can be obtained by observing that a floating-point number system is symmetric with respect to 0. In our notation, subnormals are kept if \({\mathfrak {s}_{\mathrm{n}}}= \mathbf {true}\) and rounded either to the smallest normal floating-point number of the appropriate sign or to the correspondingly signed zero if \({\mathfrak {s}_{\mathrm{n}}}= \mathbf {false}\).
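
For instance (our own numbers), for the binary16 parameters \({{p}}= 11\), \({e_{\smash{\min }}}= -14\), and \({e_{\smash{\max }}}= 15\), these formulae give \({x_{\mathrm{minsub}}}= 2^{-24} \approx 5.96\times 10^{-8}\), \({x_{\min }}= 2^{-14} \approx 6.10\times 10^{-5}\), and \({x_{\max \vphantom{\min }}}= 2^{15}(2-2^{-10}) = 65{,}504\).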

The floating-point numbers in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\) can be represented efficiently as binary strings. The sign is stored in the leftmost bit of the representation, and the following \(b_e\) bits are used to store the exponent. Under the IEEE 754 assumption that \({e_{\smash{\min }}}=1-{e_{\smash{\max }}}\), the most efficient way of representing the exponent is obtained by setting \({e_{\smash{\max }}}= 2^{b_e-1}-1\) and using a representation biased by \({e_{\smash{\max }}}\), so \(00\cdots 01_2\) represents the smallest allowed exponent and \(11\cdots 10_2\) represents the largest. The trailing \(p-1\) bits are used to store the significand of x. It is not necessary to store the value of \(d_0\) explicitly, as the IEEE standard uses an encoding often referred to as “implicit bit”: \(d_0\) is assumed to be 1 unless the binary encoding of the exponent is the all-zero string, in which case \(d_0 = 0\) and x represents either a signed 0, if \(m = 0\), or a subnormal number, otherwise. If the exponent field contains the reserved all-one string, then the number represents \(+\infty\) or \(-\infty\) if the fraction field is set to 0, and a NaN otherwise. Infinities are needed to express values whose magnitude exceeds that of the largest positive and negative numbers that can be represented in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\), whereas NaNs represent the result of invalid operations, such as taking the square root of a negative number, dividing a zero by a zero, or multiplying an infinity by a zero. These are needed to ensure that the semantics of all floating-point operations is well specified and that the resulting floating-point number system is closed.
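
As an illustration of this encoding, the following C fragment (a generic sketch, not part of CPFloat) splits a binary64 value into its sign, biased exponent, and trailing significand fields:

  #include <stdint.h>
  #include <string.h>

  /* IEEE binary64 layout: 1 sign bit, 11 exponent bits (biased by 1023),
     and 52 explicitly stored significand bits. */
  typedef struct { uint64_t sign, biased_exp, trailing_sig; } fp64_fields;

  static fp64_fields split_binary64(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);              /* reinterpret the bit pattern          */
    fp64_fields f;
    f.sign         = bits >> 63;                 /* leftmost bit                         */
    f.biased_exp   = (bits >> 52) & 0x7FF;       /* 0: zero/subnormal, 0x7FF: Inf or NaN */
    f.trailing_sig = bits & 0xFFFFFFFFFFFFFULL;  /* low 52 bits; implicit bit not stored */
    return f;
  }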

A rounding is an operator that maps a real number x to one of the floating-point numbers closest to x in a given floating-point family. The original IEEE 754 standard [30, Section 4] defines four rounding modes for binary formats: the default round-to-nearest with ties-to-even, and three directed rounding modes, round-toward-\(+\infty\), round-toward-\(-\infty\), and round-toward-zero. The 2008 revision of the standard [31] introduces a fifth rounding mode, round-to-nearest with ties-to-away, but states that it is not necessary for an implementation of a binary format to be IEEE compliant. Finally, the latest revision of the standard [32] recommends the use of round-to-nearest with ties-to-zero for augmented operations [32, Section 9.5]. Here, we also consider two less common rounding strategies, round-to-odd [3] and stochastic rounding [9].

With round-to-odd, we have to set the least significant bit of the rounded number to 1 unless the infinitely-precise number is exactly representable in the target format. This rounding has applications in several domains. In computer arithmetic, for instance, it can be used to emulate a correctly rounded fused multiply–add (FMA) operator when a hardware FMA unit is not available [3]. In general, round-to-odd helps avoid issues associated with double rounding and is used for this purpose in GCC, 2 for example. The fact that it can be implemented at low cost in hardware has recently prompted interest from the machine learning community [7].

Unlike all rounding modes discussed so far, stochastic rounding is non-deterministic, in the sense that the direction in which to round is chosen randomly and repeating the rounding may yield different results. The simplest choice is to round an infinitely precise number that is not representable in the target format to either of the two closest representable floating-point numbers with equal probability. A more interesting flavor is obtained by rounding a non-representable number x to either of the closest floating-point numbers with probability proportional to the distance between x and the two rounding candidates. This rounding mode dates back to the ’50s [17, 18] and has recently gained prominence owing to the surge of interest in low-precision floating-point formats. It is particularly effective at alleviating swamping in long sums [8] and ordinary differential equation solvers [15, 29] and at counteracting the loss of accuracy observed when the precision used to train neural networks is reduced [25, 58]. Stochastic rounding is not widely available in general-purpose hardware but has started to appear in some specialized processors, such as the Intel Loihi neuromorphic chips [10] and the Graphcore IPU, an accelerator for machine learning [22]. We refer the reader to the recent survey by Croci et al. [9] for more details on this rounding mode.


5 EFFICIENT ROUNDING TO A LOWER-PRECISION FORMAT

We simulate low-precision arithmetic by first performing each operation in some higher precision available in hardware and then rounding the result to the desired target format. In this section, we discuss how to exploit the binary representation in the storage format to develop efficient algorithms for rounding \(x \in \mathcal {F}^{ (h) }\) to \(\widetilde{x} \in \mathcal {F}^{ (\ell) }\), where (5.1) \(\begin{equation} \mathcal {F}^{ (h) }= \mathcal {F}\langle {{p}}^{(h)},{e_{\smash{\min }}}^{(h)},{e_{\smash{\max }}}^{(h)},{\mathfrak {s}_{\mathrm{n}}}^{(h)}\rangle \quad \text{and}\quad \mathcal {F}^{ (\ell) }= \mathcal {F}\langle {{p}},{e_{\smash{\min }}}^{(\ell)},{e_{\smash{\max }}}^{(\ell)},{\mathfrak {s}_{\mathrm{n}}}^{(\ell)}\rangle , \end{equation}\) and the same superscript notation is used for \({x_{\mathrm{minsub}}}\), \({x_{\min }}\), and \({x_{\max \vphantom{\min }}}\).

Whether this technique can be used to simulate low precision accurately depends on the floating-point parameters of the two formats under consideration. First, we need to ensure that any number in \(\mathcal {F}^{ (\ell) }\) is representable in \(\mathcal {F}^{ (h) }\). In most cases, a necessary and sufficient condition for \(\mathcal {F}^{ (\ell) }\subset \mathcal {F}^{ (h) }\) is that \({{p}}\le {{p}}^{(h)}\) and \({e_{\smash{\max }}}^{(\ell)}\le {e_{\smash{\max }}}^{(h)}\). When \({\mathfrak {s}_{\mathrm{n}}}^{(\ell)}= \mathbf {true}\) but \({\mathfrak {s}_{\mathrm{n}}}^{(h)}= \mathbf {false}\), however, this is not sufficient, because a number that is subnormal in \(\mathcal {F}^{ (\ell) }\) is not necessarily normal in \(\mathcal {F}^{ (h) }\). In this case, we need to require that \({e_{\smash{\min }}}^{(\ell)}\ge {e_{\smash{\min }}}^{(h)}+ p - 1\) or equivalently that \({e_{\smash{\max }}}^{(\ell)}\le {e_{\smash{\max }}}^{(h)}- p + 1\).

But this may not be enough when it comes to simulating arithmetic operations. Let \(y \in \mathbb {R}\) be the exact result of an arithmetic operation \(f(x_{1}, \ldots , x_{N})\), where \(x_{1}\), ..., \(x_{N} \in \mathcal {F}^{ (h) }\), and let \(\mathrm{fl}_{h}: \mathbb {R}\rightarrow \mathcal {F}^{ (h) }\) and \(\mathrm{fl}_{\ell }: \mathbb {R}\rightarrow \mathcal {F}^{ (\ell) }\) be the functions that round to the high- and low-precision formats, respectively. A correctly rounded floating-point implementation of f will return \(\mathrm{fl}_{h}(y)\), and a low-precision simulation should return \(\mathrm{fl}_{\ell }(y)\).

Is \(\mathrm{fl}_{\ell }(y) = \mathrm{fl}_{\ell }(\mathrm{fl}_{h}(y))\) for any value of \(x_{1}\), ..., \(x_{N}\)? This is always true for directed rounding, but when round-to-nearest is considered, one may have that \(\mathrm{fl}_{\ell }(y) \ne \mathrm{fl}_{\ell }(\mathrm{fl}_{h}(y))\) if the intermediate format \(\mathcal {F}^{ (h) }\) does not have enough extra bits of precision.

If f is one of the arithmetic operations for which the IEEE 754 standard mandates correct rounding [32, Section 5.4.1], then it is known that \({{p}}\le {{p}}^{(h)}/2-1\) guarantees that double rounding is innocuous for the four elementary arithmetic operations (\(+\), \(-\), \(\times\), and \(\div\)) and for the square root [53, 55]. With binary64 as the storage format (\({{p}}^{(h)} = 53\)), for example, this condition is satisfied by any target format with at most 25 bits of precision.

If f is a transcendental function, then we cannot guarantee that harmful double rounding will not occur, but this is not a problem, as the IEEE-754 standard [32, Section 9.2] only recommends that these operations be correctly rounded, and many general-purpose mathematical libraries do not provide correct rounding. This is the case, for example, for the GNU C library3 [40, Section 19.7], for the OpenCL C++ 1.0 Specification [24, Section 7.4] and OpenCL C 2.0 Specification [23, Section 4.4], which the Intel oneAPI math kernel library4 uses to determine the accuracy of mathematical functions,5 and for the Radeon open compute math library.6 Most mathematical libraries do not provide correctly rounded implementations of transcendental functions, and we refer the interested reader to recent work by Innocente and Zimmermann [33] for an experimental investigation of the issue.

We now list the high-level functions we need to operate on floating-point numbers. In the description, x denotes a floating-point number in \(\mathcal {F}^{ (h) }\), n denotes a positive integer, and i is an integer index between 0 and \({{p}}^{(h)}-1\) inclusive. Unless otherwise stated, these functions are not defined for infinities and NaNs.

  • \(\rm {\rm\small ABS}(x)\) returns the absolute value of x. For infinities, we use the definition \(\rm {\rm\small ABS}(\pm \infty) = +\infty\).

  • \(\rm {\rm\small DIGIT}(x,i)\) returns the ith digit of the significand of x from the left. The indexing starts from 0, so \(\rm {\rm\small DIGIT}(x,i)\) is \(d_i\) in (4.2).

  • \(\rm {\rm\small EXPONENT}(x)\) returns the exponent of x. This is the signed integer e in (4.1) if x is normal, and an integer \(\gamma \lt {e_{\smash{\min }}}\) otherwise.

  • \(\rm {\rm\small SIGNIFICAND}(x)\) returns the integral significand of x, that is, the positive integer m in Equation (4.1).

  • \(\rm {\rm\small RAND}(n)\) returns a string of n randomly generated bits.

  • \(\rm {\rm\small SIGN}(x)\) returns \(-1\) if the floating-point number x is negative and \(+1\) otherwise. This function behaves as expected for (signed) zeros and infinities, i.e., \(\rm {\rm\small SIGN}(\pm 0) = \pm 1\) and \(\rm {\rm\small SIGN}(\pm \infty) = \pm 1\).

  • \(\rm {\rm\small TAIL}(x,i)\) returns the trailing \({{p}}^{(h)}-i\) bits of the significand of x as an unsigned integer. For infinities, this function returns 0 by convention, thus \(\rm {\rm\small TAIL}(\pm \infty ,i) = 0\) for any i.

  • \(\rm {\rm\small TRUNC}(x,i)\) returns the number x with the last \({{p}}^{(h)}-i\) bits of the significand set to zero. Truncating an infinity leaves it unchanged, thus \(\rm {\rm\small TRUNC}(\pm \infty , i)= \pm \infty\) for any i.

  • \(\rm {\rm\small ULP}(x,i)\) returns the number \(2^{\rm {\rm\small EXPONENT}(x)-i+1}\), that is, the gap between x and its successor in a floating-point number system with i bits of precision. As noted by Muller [43], this function corresponds to the unit in the last place as defined by Overton [50, p. 14] and Goldberg [20].

How to implement these efficiently will be discussed in detail in Section 6.
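
To make these primitives concrete, the sketch below shows one possible bit-level realization of TAIL, TRUNC, and ULP for a binary64 storage format (\({{p}}^{(h)} = 53\)). It is only an illustration in the spirit of what Section 6 describes, not CPFloat's implementation; it assumes \(1 \le t\le 53\) and ignores zeros, storage-format subnormals, and special values.

  #include <math.h>
  #include <stdint.h>
  #include <string.h>

  static uint64_t to_bits(double x)     { uint64_t b; memcpy(&b, &x, sizeof b); return b; }
  static double   to_double(uint64_t b) { double x;   memcpy(&x, &b, sizeof x); return x; }

  /* TAIL(x, t): trailing 53 - t bits of the significand of x, as an unsigned integer. */
  static uint64_t tail(double x, int t) {
    return to_bits(x) & ((UINT64_C(1) << (53 - t)) - 1);
  }

  /* TRUNC(x, t): x with the trailing 53 - t bits of the significand set to zero. */
  static double trunc_sig(double x, int t) {
    return to_double(to_bits(x) & ~((UINT64_C(1) << (53 - t)) - 1));
  }

  /* ULP(x, t): 2^(EXPONENT(x) - t + 1), the gap between x and its successor
     in a format with t bits of precision. */
  static double ulp(double x, int t) {
    int e;
    frexp(x, &e);             /* x = f * 2^e with 0.5 <= |f| < 1, so EXPONENT(x) = e - 1 */
    return ldexp(1.0, e - t); /* 2^((e - 1) - t + 1)                                     */
  }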

Our rounding strategy is summarized in Algorithm 5.1. The function CPFloat computes the representation of the floating-point number \(x \in \mathcal {F}^{ (h) }{}\) in a lower-precision format \(\mathcal {F}^{ (\ell) }\). In the pseudocode, \(\mathbb {N}^+\) denotes the set of positive integers. The input parameter RoundFun is a pointer to one of the functions in Algorithms 5.2, 5.4, or 5.5. A call to RoundFun\((x,t,\zeta)\) returns the floating-point number \(\widetilde{x} \in \mathcal {F}^{ (h) }\), that is, x rounded to \(\mathcal {F}^{ (\ell) }\). The function starts by setting the underflow threshold \(\zeta\), which corresponds to the smallest subnormal number if \({\mathfrak {s}_{\mathrm{n}}}^{(\ell)}= \mathbf {true}\) and to the smallest normal number if \({\mathfrak {s}_{\mathrm{n}}}^{(\ell)}= \mathbf {false}\). This value will be used by RoundFun to round numbers that are too small to be represented. Then the algorithm computes the number t of significant digits in the significand of the binary representation of \(\widetilde{x}\). If x falls within the normal range of \(\mathcal {F}^{ (\ell) }\), then its significand has \({{p}}\) significant digits and the algorithm sets \(t= {{p}}\). If, however, \(\vert x \vert\) is between \({x_{\mathrm{minsub}}}^{(\ell)}\) and \({x_{\min }}^{(\ell)}\), then the exponent of x is smaller than \({e_{\smash{\min }}}^{(\ell)}\) and the number t of significant binary digits may have to be reduced. If \({e_{\smash{\min }}}^{(\ell)}= {e_{\smash{\min }}}^{(h)}\), then x is subnormal and has the same number of leading zeros in both storage and target formats. Otherwise, the value of t is given by the difference between \({{p}}\) and the number of leading zeros in the representation of \(\widetilde{x}\), including the zero to the left of the binary point.

In the coming sections, we discuss how the function RoundFun can be implemented for the six rounding strategies we consider.

5.1 Round-to-nearest

Our algorithm for rounding a floating-point number in \(\mathcal {F}^{ (h) }\) to the closest floating-point number in the lower-precision format \(\mathcal {F}^{ (\ell) }\) is given in Algorithm 5.2. The pseudocode describes, in particular, a variant of round-to-nearest known as ties-to-even, whereby a number exactly in-between two floating-point numbers is rounded to the rounding candidate with even significand, that is, a number that has a 0 in the \(t-1\) position to the right of the binary point. Two other variants, ties-to-zero and ties-to-away, will be briefly examined at the end of this section.

Initially, the function checks if the number to be rounded is too small to be represented in \(\mathcal {F}^{ (\ell) }\). If subnormals are supported, then the smallest representable number \({x_{\mathrm{minsub}}}^{(\ell)}\) has an odd fraction, and a tie value \(x \in \mathcal {F}^{ (h) }\) such that \(\vert x \vert = \zeta /2\) is rounded towards zero. If subnormals are not supported, then x is equidistant from two numbers with even significands, since the significand of the smallest normal number \({x_{\min }}^{(\ell)}\) in \(\mathcal {F}^{ (\ell) }\) is even. We still round x to zero, as this behavior is consistent with that of the GNU MPFR library.

Next, a number \(x \in \mathcal {F}^{ (h) }\) that is too large to be rounded to zero but has absolute value below the threshold \(\zeta\) is rounded to \(\operatorname{sign}(x) \cdot \zeta\). If \(\vert x \vert\) is larger than the threshold, then the algorithm checks whether x is within the dynamic range of \(\mathcal {F}^{ (\ell) }\): Following the IEEE 754 standard [32, Section 4.3.1], x will overflow to infinity without changes in sign if \(\vert x \vert \ge 2^{{e_{\smash{\max }}}^{(\ell)}}(2-2^{-{{p}}})\).

If x is within the range of numbers representable in \(\mathcal {F}^{ (\ell) }\), then the algorithm truncates \(x \in \mathcal {F}^{ (h) }\) to t significant digits to obtain \(\widetilde{x} \in \mathcal {F}^{ (\ell) }\), the largest number (in absolute value) that satisfies \(\operatorname{sign}(\widetilde{x}) = \operatorname{sign}(x)\) and \(\vert \widetilde{x} \vert \le \vert x \vert\). In general, \(\widetilde{x}\) is one of the two floating-point numbers in \(\mathcal {F}^{ (\ell) }\) closest to x, the other candidate being \(\widetilde{x} + \operatorname{sign}(x) \cdot \rm {\rm\small ULP}(\widetilde{x}, t)\). To choose a rounding direction, we examine the value of the discarded bits. The unsigned integer \(d:=\rm {\rm\small TAIL}(x,t)\) represents the trailing \({{p}}^{(h)}-t\) digits of the significand of x. Thus, \(0 \le d \le 2^{{{p}}^{(h)}-t}-1\), and if \(d\lt \gamma := 2^{{{p}}^{(h)}-t-1}\), then \(\widetilde{x}\) is the result to be returned, whereas if \(d \gt \gamma\), then it is necessary to add (or subtract, if \(\widetilde{x}\) is negative) \(\rm {\rm\small ULP}(\widetilde{x}, t)\) to obtain the correctly rounded value. If \(d = \gamma\), then we have a tie and we need to round to the nearest even number. Therefore, we add \(\operatorname{sign}(x) \cdot \rm {\rm\small ULP}(\widetilde{x}, t)\) if the bit in position \(t-1\) of the significand of \(\widetilde{x}\) is a 1, and we leave \(\widetilde{x}\) unchanged otherwise.
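
A bit-level sketch of the in-range branch of this procedure, reusing the to_bits, trunc_sig, tail, and ulp helpers sketched after the list of primitives above (illustrative only, valid for \(2 \le t\le 52\), and ignoring underflow and overflow), might read:

  /* Round x, stored in binary64, to t bits of precision with ties-to-even. */
  static double round_to_nearest_even(double x, int t) {
    double   xt    = trunc_sig(x, t);               /* truncation: candidate closest to zero      */
    uint64_t d     = tail(x, t);                    /* discarded bits                             */
    uint64_t gamma = UINT64_C(1) << (53 - t - 1);   /* half a unit in the last target-format place */
    int      odd   = (to_bits(xt) >> (53 - t)) & 1; /* digit in position t - 1 of the significand  */
    if (d > gamma || (d == gamma && odd))
      xt += copysign(ulp(xt, t), x);                /* move away from zero by one target ulp       */
    return xt;
  }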

The latest revision of the IEEE 754 standard mentions two other tie-breaking rules for round-to-nearest: ties-to-zero, to be used for the recommended augmented operations, and ties-to-away, which is required for decimal formats. These can be implemented by changing the conditions of the if statements on lines 3 and 11 in Algorithm 5.2 so that, on a tie, the truncated value is never incremented by a unit in the last place (ties-to-zero) or always incremented (ties-to-away).

Note that this implementation preserves the sign of zero, maps infinities to infinities, and does not change the encoding of quiet and signaling NaN values. The same observation is true for the rounding functions in Algorithms 5.3, 5.4, and 5.5.

5.2 Round-to-odd

The function RoundToOdd in Algorithm 5.3 implements round-to-odd according to the definition given by Boldo and Melquiond [3, Section 3.1], as this rounding mode is not part of the IEEE standard. Informally speaking, if x is exactly representable in \(\mathcal {F}^{ (\ell) }\), then the function returns it unchanged; otherwise, it returns the number closest to x with an odd significand, that is, a significand with a trailing 1. In the spirit of the algorithms discussed so far, one could obtain \(\widetilde{x}\) by truncating \(\vert x \vert\) to the first t significant digits, checking the parity of the significand of \(\widetilde{x}\), and adding \(\rm {\rm\small ULP}(\widetilde{x}, t)\) to the result if \(\widetilde{x}\) is even. However, in this case, we know that the result of the truncation requires a correction only if the least significant digit of \(\widetilde{x}\) is a 0, in which case adding \(\rm {\rm\small ULP}(\widetilde{x}, t)\) amounts to setting that bit to 1. Therefore, we check the bits obliterated by the truncation, and if \(\rm {\rm\small TAIL}(x,t) \ne 0\), then we conclude that x is not exactly representable in \(\mathcal {F}^{ (\ell) }\) and that the significand of the rounded \(\widetilde{x}\) must be odd. We can ensure this by setting the digit in position \(t-1\) of the significand of \(\widetilde{x}\) to 1. In practice, this operation will have an effect only if that digit is not already set to 1, and in particular will have no effect when \(t= 1\), as the leading digit of the significand is not stored explicitly. The core idea of this algorithm is the same as that of the second of the two methods for round-to-odd discussed by Boldo and Melquiond [3, Section 3.4].
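
In terms of the same helpers as before, the in-range branch can be sketched as follows (illustrative only; valid for \(2 \le t\le 53\), since for \(t= 1\) the digit to be set is the implicit bit, as noted above):

  /* Round x, stored in binary64, to t bits of precision with round-to-odd. */
  static double round_to_odd(double x, int t) {
    double xt = trunc_sig(x, t);
    if (tail(x, t) != 0)  /* x is not exactly representable: force digit t - 1 to 1 */
      xt = to_double(to_bits(xt) | (UINT64_C(1) << (53 - t)));
    return xt;
  }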

The algorithm must round \(\widetilde{x}\) to the closest odd number in \(\mathcal {F}^{ (\ell) }\) if it falls within the underflow or the overflow range. Since 0 is even, rounding to 0 is not an option when this rounding mode is used, and numbers too small to be represented must be rounded to the smallest floating-point number with an odd significand of corresponding sign. When subnormals are supported, the smallest representable number \({x_{\mathrm{minsub}}}^{(\ell)}\) is odd, thus numbers smaller than \({x_{\mathrm{minsub}}}^{(\ell)}\) in absolute value are simply rounded to \(\operatorname{sign}(x) \cdot {x_{\mathrm{minsub}}}^{(\ell)}\). If subnormals are not supported, however, then the smallest representable number \({x_{\min }}^{(\ell)}\) is even, and we are faced with a choice. We could round \(\widetilde{x}\) such that \(\vert \widetilde{x} \vert \lt {x_{\min }}^{(\ell)}\) to \(\operatorname{sign}(x) \cdot ({x_{\min }}^{(\ell)} + \rm {\rm\small ULP}({x_{\min }}^{(\ell)}, t))\), which is the closest number with odd significand. This choice, however, feels unnatural: The operator thus defined would not be rounding to either of the floating-point numbers closest to \(\widetilde{x}\). In our pseudocode, we prefer to round \(\widetilde{x}\) to the rounding candidate largest in magnitude, that is, \(\operatorname{sign}(x) \cdot {x_{\min }}^{(\ell)}\).

The definition given by Boldo and Melquiond cannot be applied directly to values in the overflow range, as in principle the significand of \(\pm \infty\) is neither odd nor even. Since \(-{x_{\max \vphantom{\min }}}^{(\ell)}\) and \({x_{\max \vphantom{\min }}}^{(\ell)}\) are necessarily odd, we prefer to round values outside the finite range of \(\mathcal {F}^{ (\ell) }\) to the closest finite number. Being exactly representable, infinities themselves represent an exception to this rule, and the algorithm leaves them unchanged.

5.3 Directed Rounding

The functions in Algorithm 5.4 show how to implement the three directed rounding modes prescribed by the IEEE 754 standard. The idea underlying the three functions is similar to that discussed for the function RoundToNearest in Algorithm 5.2, the main differences being (1) the use of the sign, which is relevant when the rounding direction is not symmetric with respect to 0, and (2) the conditions under which a unit in the last place has to be added.

We start by discussing the function RoundTowardPlusInfinity. First, we check whether x is within the range of numbers that are representable in \(\mathcal {F}^{ (\ell) }\). Numbers that are too small to be represented are rounded up to 0 if negative and to \(\zeta\) if positive. Finite positive numbers larger than the largest representable number \({x_{\max \vphantom{\min }}}^{(\ell)}\) overflow to \(+\infty\), whereas negative numbers smaller than the smallest representable number \(-{x_{\max \vphantom{\min }}}^{(\ell)}\) are rounded up to \(-{x_{\max \vphantom{\min }}}^{(\ell)}\), with the only exception of \(-\infty\), which is handled below using the fact that \(\rm {\rm\small TRUNC}(-\infty ,t) = -\infty\) and \(\rm {\rm\small TAIL}(-\infty ,t) = 0\).

Next, the function computes \(\widetilde{x}\), that is, the number x with significand truncated to t significant digits, and checks whether \(\widetilde{x}\) is smaller than the smallest number representable in \(\mathcal {F}^{ (\ell) }\). The rounding can be easily performed by noting that the truncation \(\widetilde{x}\) is the correct result if x is negative or exactly representable in \(\mathcal {F}^{ (\ell) }\). Otherwise, \(\widetilde{x}\) is incremented by \(\rm {\rm\small ULP}(\widetilde{x}, t)\).

The function RoundTowardMinusInfinity is identical, modulo some sign adjustments to take the opposite rounding direction into account. The algorithm starts by checking that x is within the range of numbers representable in \(\mathcal {F}^{ (\ell) }\). Numbers between \(-\zeta\) and 0 are rounded down to \(-\zeta\), whereas those between 0 and \(\zeta\) underflow and are flushed to 0. Numbers that are smaller than the smallest number representable in \(\mathcal {F}^{ (\ell) }\) overflow to \(-\infty\), whereas finite numbers greater than the largest number representable in \(\mathcal {F}^{ (\ell) }\) are rounded to \({x_{\max \vphantom{\min }}}^{(\ell)}\), the only exception in this case being \(+\infty\). To round a number that falls within the range of \(\mathcal {F}^{ (\ell) }\), we first compute \(\widetilde{x}\) by truncating the significand of x to t significant digits and then subtract \(\rm {\rm\small ULP}(\widetilde{x}, t)\) from \(\widetilde{x}\) if x is negative and not exactly representable with a t-digit significand.

The function \({{\scriptstyle\text{ROUND}}\text{T}\scriptstyle\text{OWARD}}{\text{Z}\scriptstyle\text{ERO}}\) is simpler than the other two considered in this section, as truncation is sufficient to correctly round the significand of x to t significant digits. Underflow and overflow are also easier to handle: Finite numbers that are smaller than the smallest representable number in absolute value are flushed to 0, whereas numbers too large to be represented are rounded to the closest representable finite number.
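
For illustration, the directed rounding of the significand to t significant digits can also be expressed with standard C mathematical library calls, which is the straightforward (if slower) alternative to the bit-level implementation developed in Section 6. The following is a minimal sketch under simplifying assumptions: the input is finite and nonzero, and the underflow and overflow handling described above is omitted. It is not the CPFloat implementation, and the function name is illustrative.

    #include <math.h>

    /* Sketch only: round a finite, nonzero double x to t significant bits
       in one of the three directed rounding modes using frexp/ldexp from
       the standard mathematical library.  Underflow, overflow, and special
       values are not handled. */
    static double round_directed(double x, int t, int dir /* +1, -1, or 0 */)
    {
        int e;
        double m = frexp(x, &e);      /* x = m * 2^e, with |m| in [0.5, 1)         */
        double s = ldexp(m, t);       /* bring the t leading bits before the point */
        double r = (dir > 0) ? ceil(s)    /* round toward +infinity                */
                 : (dir < 0) ? floor(s)   /* round toward -infinity                */
                 :             trunc(s);  /* round toward zero                     */
        return ldexp(r, e - t);       /* restore the scale                         */
    }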

5.4 Stochastic Rounding

The functions in Algorithm 5.5 describe how to implement the two variants of stochastic rounding we are concerned with.

The function \({{\scriptstyle\text{ROUND}}\text{T}\scriptstyle\text{OWARD}}{\text{S}\scriptstyle\text{TOCHASTIC}}\) implements the strategy that rounds \(x \in \mathcal {F}^{ (h) }\) to one of the two closest floating-point numbers with probability proportional to the distance. First, the algorithm considers numbers in the underflow range, whose rounding candidates are 0 and the threshold value \(\zeta\), which equals \({x_{\mathrm{minsub}}}\) if subnormals are supported and \({x_{\min }}\) if they are not. The distance between \(\vert x \vert\) and 0 depends not only on the significand but also on the magnitude of x, thus the algorithm starts by computing the two values \(e_x\) and \(m_x\), which represent the exponent and the integral significand of x, respectively. Being the exponent of a floating-point number in \(\mathcal {F}^{ (h) }\), \(e_x\) can be much smaller than \({e_{\smash{\min }}}^{(\ell)}\), in which case it may be necessary to rescale \(m_x\) to align its exponent to \({e_{\smash{\min }}}^{(\ell)}\). This is achieved by multiplying \(m_x\) by \(2^{e_x + 1 - {e_{\smash{\min }}}^{(\ell)}}\). In the pseudocode, we take the floor of the result to keep it integer, although this is not strictly necessary: We prefer to work with integer arithmetic here so the integers generated by the random number generator can be used without any further post-processing. This is desirable not only from a performance point of view, but also because drawing floating-point numbers from the uniform distribution over an interval is not a trivial task, even when a good integer pseudo-random number generator is available [21]. Finally, the algorithm generates a \({{p}}^{(h)}\)-digit random integer \(\gamma\), which is used as a threshold to choose the rounding direction: If the discarded bits, interpreted as an unsigned integer, are larger than \(\gamma\), then x is rounded away from zero, otherwise it is rounded towards zero.

The procedure for numbers in the representable range is easier. In this case, it suffices to compute \(\widetilde{x}\), the value of x truncated to t significant bits, and then generate a random integer r between 0 and \(2^{{{p}}^{(h)}- t}\). Since \(\rm {\rm\small TAIL}(x,t)\) represents the distance between x and \(\widetilde{x}\), we increment the absolute value of \(\widetilde{x}\) by \(\rm {\rm\small ULP}(\widetilde{x}, t)\) if \(\rm {\rm\small TAIL}(x,t) \gt r\) and leave it unchanged otherwise. For overflow, we adopt the threshold value that the IEEE 754 standard recommends for round-to-nearest: numbers whose absolute value after rounding is larger than \(2^{{e_{\smash{\max }}}^{(\ell)}}(2-2^{-{{p}}})\) are rounded to infinity, with the sign left unchanged.
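
As an illustration of the representable-range case just described, the sketch below applies stochastic rounding with proportional probabilities to a binary64 value by truncating to t significant bits and comparing the discarded tail with a random integer. The call to rand() is only a crude stand-in for the pseudo-random number generator discussed in Section 7, the function name is illustrative, and underflow, overflow, and special values are ignored.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch only: stochastic rounding of a finite, normal binary64 value to
       t significant bits (2 <= t <= 52), with probability proportional to the
       distance from the two rounding candidates.  rand() is a placeholder for
       a proper pseudo-random number generator. */
    static double sr_proportional(double x, int t)
    {
        uint64_t bits, tail_mask, tail, r;
        memcpy(&bits, &x, sizeof bits);              /* bit pattern of x            */
        tail_mask = (UINT64_C(1) << (53 - t)) - 1;   /* the 53 - t discarded bits   */
        tail = bits & tail_mask;                     /* TAIL(x, t)                  */
        r = (((uint64_t)rand() << 31) | (uint64_t)rand()) & tail_mask;
        bits &= ~tail_mask;                          /* TRUNC(x, t)                 */
        if (tail > r)
            bits += tail_mask + 1;                   /* add ULP(x~, t) in magnitude */
        memcpy(&x, &bits, sizeof x);
        return x;
    }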

The function \({{\scriptstyle\text{ROUND}}\text{S}\scriptstyle\text{TOCHASTIC}}{\text{E}\scriptstyle\text{QUAL}}\) deals with the simpler strategy that rounds x up or down with equal probability. Depending on the interval in which x falls, the function selects the two closest representable numbers in \(\mathcal {F}^{ (\ell) }\) and calls the function \({{\scriptstyle\text{RAND}}\text{S}\scriptstyle\text{ELECT}}\) to select one of them with equal probability. In the pseudocode, we use a single bit generated randomly to discriminate between the two rounding directions.

6 EFFICIENT IMPLEMENTATION FOR IEEE-LIKE REPRESENTATION FORMATS

The subroutines used in Section 5 can be implemented efficiently if we assume that the numbers are represented using the floating-point format described in Section 4. First, we need to define the semantics of the operators for bit manipulation that we will rely on. These are available in most programming languages, although the notation varies greatly from language to language. For clarity, we use a prefix notation for all the operators.

Let \(\mathbf {\textsf {a}}\) and \(\mathbf {\textsf {b}}\) be strings of n bits. The bits are indexed from left to right, so \(\mathbf {\textsf {a}}_{0}\) and \(\mathbf {\textsf {a}}_{n-1}\) denote the leftmost and the rightmost bit of \(\mathbf {\textsf {a}}\), respectively. For \(i\in \mathbb {N}\), we define the following operators:

  • Conjunction: \(\mathbf {\textsf {c}}=\rm {\rm\small AND}(\mathbf {\textsf {a}},\mathbf {\textsf {b}})\) is an n-bit string such that \(\mathbf {\textsf {c}}_k = 1\) if \(\mathbf {\textsf {a}}_k\) and \(\mathbf {\textsf {b}}_k\) are both set to 1, and \(\mathbf {\textsf {c}}_k = 0\) otherwise.

  • Disjunction: \(\mathbf {\textsf {c}}=\rm {\rm\small OR}(\mathbf {\textsf {a}},\mathbf {\textsf {b}})\) is an n-bit string such that \(\mathbf {\textsf {c}}_k=1\) if at least one of \(\mathbf {\textsf {a}}_k\) and \(\mathbf {\textsf {b}}_k\) is set to 1, and \(\mathbf {\textsf {c}}_k=0\) otherwise.

  • Negation: \(\mathbf {\textsf {c}}=\rm {\rm\small NOT}(\mathbf {\textsf {a}})\) is an n-bit string such that \(\mathbf {\textsf {c}}_k = 1\) if \(\mathbf {\textsf {a}}_k=0\), and \(\mathbf {\textsf {c}}_k=0\) otherwise.

  • Logical shift left: \(\mathbf {\textsf {c}}=\rm {\rm\small LSL}(\mathbf {\textsf {a}},i)\) is an n-bit string such that \(\mathbf {\textsf {c}}_k=0\) if \(k \gt (n-1) - i\), and \(\mathbf {\textsf {c}}_k = \mathbf {\textsf {a}}_{k+i}\) otherwise.

  • Logical shift right: \(\mathbf {\textsf {c}}=\rm {\rm\small LSR}(\mathbf {\textsf {a}},i)\) is an n-bit string such that \(\mathbf {\textsf {c}}_k=0\) if \(k \lt i\), and \(\mathbf {\textsf {c}}_k=\mathbf {\textsf {a}}_{k-i}\) otherwise.

Most of the operations used in Section 5 require extracting a certain subset of the bits in the binary representation of the floating-point number \(x \in \mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\) while zeroing out the remaining ones. This can be achieved by taking the bitwise conjunction between the binary string that represents x and a bitmask, that is, a string as long as the binary representation of x that has ones in the positions corresponding to the bits to be extracted and zeros everywhere else. More generally, the functions in Section 5 can be implemented using the operators above as follows. In the descriptions, \(\mathbf {\textsf {x}}\) denotes the binary floating-point representation of x, n denotes a positive integer, and i denotes an integer index between 0 and \(p-1\).

  • \(\rm {\rm\small ABS}(x)\) can be implemented as \(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {abs}})\), where \(\mathbf {\textsf {m}}_{\textrm {abs}}\) is constituted by a single leading 0 followed by ones.

  • \(\rm {\rm\small DIGIT}(x,i)\) can be implemented as \(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {digit}})\), where \(\mathbf {\textsf {m}}_{\textrm {digit}}\) has a 1 in the position corresponding to the digit to be extracted and 0 everywhere else. We note that checking whether this digit is 0 or 1 does not require any additional operations in programming languages such as C where 0 is interpreted as \(\mathbf {false}\) and any other integer is interpreted as \(\mathbf {true}\).

  • \(\rm {\rm\small EXPONENT}(x)\) can be implemented as a sequence of logic and arithmetic operations. The raw bits of the exponents can be extracted with \(\mathbf {\textsf {c}}:=\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {exp}})\), where \(\mathbf {\textsf {m}}_{\textrm {exp}}\) has 1 in the positions corresponding to the exponent bits of the binary representation of x. This can be converted into the unsigned integer \(\rm {\rm\small LSR}(\mathbf {\textsf {c}},p-1)\), and the signed exponent can be obtained by subtracting the bias of the storage floating-point format. If x is subnormal in \(\mathcal {F}\langle {{p}},{e_{\smash{\min }}},{e_{\smash{\max }}},{\mathfrak {s}_{\mathrm{n}}}\rangle\), then the value computed in this way is \(-{e_{\smash{\max }}}= {e_{\smash{\min }}}-1\), and the correct value to return in this case is \({e_{\smash{\min }}}+ \lambda\), where \(\lambda\) is the number of trailing zeros in the significand of x, including the implicit bit.

  • \(\rm {\rm\small SIGNIFICAND}(x)\) can be implemented leveraging the function \(\rm {\rm\small EXPONENT}\). The digits to the right of the radix point can be obtained as \(\mathbf {\textsf {c}}:=\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {frac}})\) where \(\mathbf {\textsf {m}}_{\textrm {frac}}\) is the bitmask that has the \(p-1\) trailing bits set to 1 and the remaining bits set to 0. If \({{x_{\min }}\le \vert x \vert \le {x_{\max \vphantom{\min }}}}\), then \(\rm {\rm\small EXPONENT}(x) \gt {e_{\smash{\min }}}\) and the implicit bit must be set to 1. This can be achieved by using \(\mbox{$\rm {\rm\small OR}(\mathbf {\textsf {c}},\rm {\rm\small LSL}(1,p-1))$}\), for instance.

  • \(\rm {\rm\small RAND}(n)\) can be implemented by concatenating numbers produced by a pseudo-random number generator. Two m-bit strings \(\mathbf {\textsf {a}}\) and \(\mathbf {\textsf {b}}\) can be joined together by \(\rm {\rm\small OR}(\rm {\rm\small LSL}(\mathbf {\textsf {a}},m),\mathbf {\textsf {b}})\), and the unnecessary bits can be set to zero using a suitable bitmask.

  • \(\rm {\rm\small SIGN}(x)\) is relatively expensive to implement by means of bit manipulation. However, note that we only need to compute the product \(\operatorname{sign}(x) \cdot y,\) where y is a positive floating-point number. This operation can be implemented as \(\rm {\rm\small OR}(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {sign}}),\mathbf {\textsf {y}})\), where \(\mathbf {\textsf {m}}_{\textrm {sign}}\) is the bitmask with a leading 1 followed by zeros, and the string \(\mathbf {\textsf {y}}\) denotes the floating-point representation of y.

  • \(\rm {\rm\small TAIL}(x,i)\) can be implemented as \(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}_{\textrm {tail}})\), where the trailing \(p-i\) bits of \(\mathbf {\textsf {m}}_{\textrm {tail}}\) are set to 1 and the remaining bits are set to 0. This way, bits i to \(p-1\) of the significand of \(\mathbf {\textsf {x}}\) are preserved while the rest of the bits, including those representing the sign and exponent, are set to zero.

  • \(\rm {\rm\small TRUNC}(x,i)\) can be implemented as \(\rm {\rm\small AND}(\mathbf {\textsf {x}},\mathbf {\textsf {m}}^{\prime })\), where \(\mathbf {\textsf {m}}^{\prime }=\rm {\rm\small NOT}(\mathbf {\textsf {m}}_{\textrm {tail}})\). This way, bits i to \(p-1\) of the significand of \(\mathbf {\textsf {x}}\) are set to zero while the rest of the bits of \(\mathbf {\textsf {x}}\), including the exponent and sign bits, are preserved.

  • \(\rm {\rm\small ULP}(x,p)\) is a rather expensive function to implement, because it requires extracting the exponent from the binary representation of x and then performing arithmetic operations on it. Increasing or decreasing x by \(\rm {\rm\small ULP}(x, p)\), on the contrary, can be achieved efficiently using only one bit shift and one integer arithmetic operation. In particular, it suffices to add to the binary representation of x, seen as an unsigned integer, a number that has 0 everywhere but in the position corresponding to the pth digit of x, that is, the digit in position \(p-1\) of the significand. We note that this technique could fail if \(x = \pm \infty\), since adding \(\rm {\rm\small ULP}(x, p)\) in this fashion would turn infinities into \({\text{N}\scriptstyle\text{A}}\text{N}\scriptstyle\text{S}\). It is easy to check that this is not a problem in our setting, as we only add or subtract \(\rm {\rm\small ULP}(x, p)\) when x is finite and nonzero.
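
To make the descriptions above concrete, the sketch below spells out a few of these primitives for a binary32 storage format (one sign bit, eight exponent bits, p = 24), with the number already reinterpreted as a 32-bit unsigned integer. The macro and mask names are illustrative and are not the identifiers used in CPFloat.

    #include <stdint.h>

    /* Sketch only: bitmask-based primitives for a binary32 storage format.
       x is the bit pattern of the number, t is the target precision, and
       y is the bit pattern of a positive number. */
    #define MASK_SIGN  UINT32_C(0x80000000)          /* leading 1 followed by zeros */
    #define MASK_ABS   UINT32_C(0x7FFFFFFF)          /* leading 0 followed by ones  */

    #define ABS(x)           ((x) & MASK_ABS)
    #define TAIL(x, t)       ((x) & ((UINT32_C(1) << (24 - (t))) - 1))
    #define TRUNC(x, t)      ((x) & ~((UINT32_C(1) << (24 - (t))) - 1))
    #define SIGN_TIMES(x, y) (((x) & MASK_SIGN) | (y))    /* sign(x) * y, y > 0     */
    /* Adding ULP(x~, t) to the magnitude of the truncated value x~:                */
    #define ADD_ULP(x, t)    ((x) + (UINT32_C(1) << (24 - (t))))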

It is possible to implement some of the rounding routines even more efficiently by extending to other rounding modes the technique for round-to-nearest with ties-to-even used in the function convert_fp32_to_bfloat16 [35, p. 2–17], which manipulates the binary representation of the floating-point number by using only integer arithmetic. As a demonstration, here we show the concrete values of the bitmasks, expressed in hexadecimal notation, that one would use to round a binary32 number y to bfloat16. These methods generalize easily to other combinations of storage and target formats, and we describe this in general for the conversion of a floating-point number \(x \in \mathcal {F}\langle {{p}}^{(h)},{e_{\smash{\min }}}^{(h)},{e_{\smash{\max }}}^{(h)},{\mathfrak {s}_{\mathrm{n}}}^{(h)}\rangle\) to \({\mathcal {F}\langle {{p}},{e_{\smash{\min }}}^{(\ell)},{e_{\smash{\max }}}^{(\ell)},{\mathfrak {s}_{\mathrm{n}}}^{(\ell)}\rangle }\).

We denote the 32-bit string containing the floating-point representation of y by \(\mathbf {\textsf {y}}\) and use the uppercase Latin letters X and Y to denote the values obtained by interpreting \(\mathbf {\textsf {x}}\) and \(\mathbf {\textsf {y}}\) as unsigned integers in radix 2. The usual underflow and overflow checks are omitted here; the aim is only to demonstrate the core idea behind each type of rounding. We recall that the sign of a floating-point number can be determined by checking the leftmost bit: x (respectively, y) is negative if \(\mathbf {\textsf {x}}_{0}\) (respectively, \(\mathbf {\textsf {y}}_{0}\)) is set and positive otherwise. The rounding modes amenable to this approach can be implemented as follows:

  • Round-to-nearest with ties-to-even: isolate the bit in position \({{p}}-1\) of the significand of x, and then compute \(\rm {\rm\small TRUNC}(X+ \rm {\rm\small LSR}(\mathbf {\textsf {m}}_{\textrm {tail}},1) + \rm {\rm\small DIGIT}(x, {{p}}- 1), {{p}})\). When rounding a binary32 number to bfloat16, the formula becomes \(\rm {\rm\small TRUNC}(Y+ \text{0x7FFF} + \mathbf {\textsf {y}}_{15}, 16)\); a C sketch of this case is given after this list.

  • Round-to-nearest with ties-to-away: return \(\rm {\rm\small TRUNC}(X+ \rm {\rm\small LSL}(\text{0x1},{{p}}^{(h)}-{{p}}-1), {{p}})\), which in our example becomes \(\rm {\rm\small TRUNC}(Y+ \text{0x8000}, 16)\).

  • Round-to-nearest with ties-to-zero: return \(\rm {\rm\small TRUNC}(X+ \rm {\rm\small LSR}(\mathbf {\textsf {m}}_{\textrm {tail}},1), {{p}})\), which in our example becomes \(\rm {\rm\small TRUNC}(Y+ \text{0x7FFF}, 16)\).

  • Round-toward-\(+\infty\): return \(\rm {\rm\small TRUNC}(X,{{p}})\) if \(\mathbf {\textsf {x}}_{0}\) is set and \(\rm {\rm\small TRUNC}(X+ \mathbf {\textsf {m}}_{\textrm {tail}},{{p}})\) otherwise. For our example, return \(\rm {\rm\small TRUNC}(Y, 16)\) if \(\mathbf {\textsf {y}}_{0}\) is set and \(\rm {\rm\small TRUNC}(Y+ \text{0xFFFF}, 16)\) otherwise.

  • Round-toward-\(-\infty\): return \(\rm {\rm\small TRUNC}(X,{{p}})\) if \(\mathbf {\textsf {x}}_{0}\) is not set and \(\rm {\rm\small TRUNC}(X+ \mathbf {\textsf {m}}_{\textrm {tail}},{{p}})\) otherwise. For our example, return \(\rm {\rm\small TRUNC}(Y, 16)\) if \(\mathbf {\textsf {y}}_{0}\) is not set and \(\rm {\rm\small TRUNC}(Y+ \text{0xFFFF}, 16)\) otherwise.

  • Round-toward-zero: return \(\rm {\rm\small TRUNC}(X,{{p}})\). For our example, return \(\rm {\rm\small TRUNC}(Y, 16)\).

  • Stochastic rounding with proportional probabilities: return \(\rm {\rm\small TRUNC}(X+\rm {\rm\small RAND}({{p}}^{(h)}- t),{{p}})\). For our example, return \(\rm {\rm\small TRUNC}(Y+\rm {\rm\small RAND}(16), 16)\).
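
The first of these formulas translates almost directly into C. The sketch below rounds a binary32 value to bfloat16 precision (the result is returned still stored in a binary32 variable) with round-to-nearest and ties-to-even; NaN inputs and the usual overflow checks are deliberately omitted, so this is an illustration rather than a complete conversion routine, and the function name is not taken from CPFloat.

    #include <stdint.h>
    #include <string.h>

    /* Sketch only: round-to-nearest with ties-to-even from binary32 to
       bfloat16 precision using integer arithmetic on the bit pattern.
       NaNs and the usual overflow checks are not handled. */
    static float round_to_bfloat16(float y)
    {
        uint32_t Y, lsb;
        memcpy(&Y, &y, sizeof Y);     /* reinterpret the bits                          */
        lsb = (Y >> 16) & 1u;         /* DIGIT(y, p - 1): the last retained bit        */
        Y += 0x7FFFu + lsb;           /* add half an ulp minus one, plus the tie-break */
        Y &= 0xFFFF0000u;             /* TRUNC: clear the 16 discarded bits            */
        memcpy(&y, &Y, sizeof y);
        return y;
    }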

7 IMPLEMENTATION AND VALIDATION OF THE CODE

Our C implementation of the algorithms discussed in Sections 5 and 6 is available on GitHub. The code can be compiled as a static or dynamic library, but we also provide the option to use CPFloat as a header-only library.

A header-only library allows the user to take advantage of the inlining feature of the C language for maximum efficiency, and it also enhances the portability of the code, as packaging of the binaries and installation of the library are not required. To alleviate the main drawback of this approach, that is, a longer compilation time, we divided the library into two separate units, one for each supported storage format.

To achieve better performance on large data, our functions work directly on C arrays. All the algorithms discussed in Section 5 are embarrassingly parallel, and each element of an array can be rounded independently from all the others. Therefore, our code can leverage the OpenMP library if available on the system in use.

In general, OpenMP brings significant gains in terms of performance but greatly increases the execution time for arrays with just a few elements. This well-known phenomenon is caused by the additional overhead of synchronization and loop scheduling [5], which is negligible for large arrays but can be significant when only a small amount of work is allocated to each OpenMP thread. The impact of this overhead is hard to quantify in general, as it depends on the hardware platform as well as the number of OpenMP threads and the compiler used [6]. Our library contains both a parallel and a sequential version of each function, but we were unable to provide a single threshold that would allow the code to switch automatically from one variant to the other for optimal performance. Thus, we devised a simple auto-tuning strategy that tries to determine the optimal threshold for the system in use by timing the rounding function on several arrays of different lengths and performing a binary search.
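
The sequential/parallel switch can be pictured as in the sketch below. The names round_one and THRESHOLD are placeholders: CPFloat determines the actual crossover point at run time with the auto-tuning procedure just described.

    #include <stddef.h>

    #define THRESHOLD 10000              /* illustrative value; found by auto-tuning   */

    extern double round_one(double x);   /* placeholder for a scalar rounding routine  */

    /* Sketch only: round the n entries of x into y, sequentially for small
       arrays and with an OpenMP parallel loop for large ones. */
    void round_array(double *y, const double *x, size_t n)
    {
        if (n < THRESHOLD) {
            for (size_t i = 0; i < n; i++)
                y[i] = round_one(x[i]);
        } else {
            #pragma omp parallel for
            for (size_t i = 0; i < n; i++)
                y[i] = round_one(x[i]);
        }
    }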

For generating the pseudo-random numbers required for stochastic rounding, we rely on algorithms from the family of permuted congruential generators developed by O’Neill [48], who provides a pure C implementation available on GitHub.7 In our code, we use the functions pcg32_random_r and pcg64_random_r to generate 32-bit and 64-bit random numbers, respectively; we initialize the random number generators with pcg32_srandom_r and pcg64_srandom_r and advance them with pcg32_advance_r and pcg64_advance_r, respectively. As initial state, we use the current time as returned by time(NULL). Use of the—considerably slower—default C pseudo-random number generator is also supported.
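
For reference, the 32-bit PCG generator can be seeded and queried as in the sketch below. The header name is the one shipped with the pcg-c repository, and the seeding policy shown (time-based initial state, arbitrary sequence constant) is illustrative rather than the exact one used in CPFloat.

    #include <stdint.h>
    #include <time.h>
    #include "pcg_variants.h"      /* from the pcg-c repository                    */

    /* Sketch only: seed a 32-bit PCG generator with the current time and
       draw one batch of random bits. */
    uint32_t draw_random_bits(void)
    {
        static pcg32_random_t rng;
        static int seeded = 0;
        if (!seeded) {
            pcg32_srandom_r(&rng, (uint64_t)time(NULL), UINT64_C(54));
            seeded = 1;
        }
        return pcg32_random_r(&rng);   /* 32 uniformly distributed bits             */
    }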

To validate our code experimentally, we wrote a suite of extensive unit tests. We describe in detail how we tested the rounding routines—all other functions in the library are tested by relying on them. We considered two storage formats, binary32 and binary64, which are available in C via the native data types float and double, respectively, and two target formats, binary16 and bfloat16. For each combination of storage and target formats, we performed three types of tests.

First, we checked that all the numbers that can be represented exactly in the target format, including subnormals and special values such as infinities and \({\text{N}\scriptstyle\text{A}}\text{N}\scriptstyle\text{S}\), are not altered by any of the rounding routines. As the target formats we consider do not have an unduly large cardinality, we can test that this property is true for all representable numbers.

To check the correctness of the code when rounding is necessary, exhaustive testing is not an option, as the storage formats contain too many distinct values. In this case, we opted for testing only a set of representative values. For deterministic rounding, the correctness of the function can be assessed by checking that the output of the rounding routine matches the value predicted by the definition. For each pair of numbers \(x_1, x_2 \in \mathcal {F}^{ (\ell) }\) such that \(x_1\) and \(x_2\) are consecutive in \(\mathcal {F}^{ (\ell) }\) and \(x_1 \lt x_2\), we considered five values in \(\mathcal {F}^{ (h) }\): \(\mathrm{nextafter}(x_1, +\infty)\), \(\mathrm{nextafter}(x_m,-\infty)\), \(x_m\), \(\mathrm{nextafter}(x_m,+\infty)\), and \(\mathrm{nextafter}(x_2,-\infty)\). Here, \(\mathrm{nextafter}(x,y)\) denotes the next number in \(\mathcal {F}^{ (h) }\) after x in the direction of y, and \(x_m\) denotes the mid point between \(x_1\) and \(x_2\). We used the same technique for numbers in the underflow range, whereas for testing the correctness of overflow, we used the values \(\mathrm{nextafter}(\pm {x_{\max \vphantom{\min }}}^{(\ell)},\pm \infty)\), \(\mathrm{nextafter}(\pm {x_{\mathrm{bnd}}}^{(\ell)},\mp \infty)\), \(\pm {x_{\mathrm{bnd}}}^{(\ell)}\), \(\mathrm{nextafter}(\pm {x_{\mathrm{bnd}}}^{(\ell)}, \pm \infty)\), where \({x_{\mathrm{bnd}}}^{(\ell)} := 2^{{e_{\smash{\max }}}^{(\ell)}}(2-2^{-{{p}}})\) is the IEEE 754 threshold for overflow in round-to-nearest.
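
As an illustration, the five test values associated with a pair of consecutive target-format numbers can be generated with the standard nextafter function, as in the sketch below; x1 and x2 are assumed to be finite, consecutive elements of the target format held exactly in binary64, and the function name is illustrative.

    #include <math.h>

    /* Sketch only: build the five binary64 test values used for a pair of
       consecutive target-format numbers x1 < x2. */
    void test_points(double x1, double x2, double v[5])
    {
        double xm = x1 + 0.5 * (x2 - x1);    /* midpoint, exact in binary64 */
        v[0] = nextafter(x1, INFINITY);      /* just above x1               */
        v[1] = nextafter(xm, -INFINITY);     /* just below the midpoint     */
        v[2] = xm;                           /* the midpoint itself         */
        v[3] = nextafter(xm, INFINITY);      /* just above the midpoint     */
        v[4] = nextafter(x2, -INFINITY);     /* just below x2               */
    }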

This technique would not work for stochastic rounding, as each value that is not representable in \(\mathcal {F}^{ (\ell) }\) can be rounded to two different values. We produced a test set by taking, for each pair of numbers \(x_1, x_2 \in \mathcal {F}^{ (\ell) }\) such that \(x_1\) and \(x_2\) are consecutive in \(\mathcal {F}^{ (\ell) }\) and \(x_1 \lt x_2\), the numbers \((3x_1+x_2)/4\), \((x_1+x_2)/2\), and \((x_1+3x_2)/4\). We rounded each number 1,000 times and confirmed that the rounding routines always return either \(x_1\) or \(x_2\) and that the empirical probability distribution matches the expected one. We validated the correctness of the implementation for values in the underflow range by using the same technique, whereas for inputs in the overflow range, we repeated the test on the three values: \((3{x_{\max \vphantom{\min }}}^{(\ell)}+{x_{\mathrm{bnd}}}^{(\ell)})/4\), \(({x_{\max \vphantom{\min }}}^{(\ell)}+{x_{\mathrm{bnd}}}^{(\ell)})/2\), and \(({x_{\max \vphantom{\min }}}^{(\ell)}+3{x_{\mathrm{bnd}}}^{(\ell)})/4\).

The Makefile target

runs the test suite for the C implementations.

We designed a MEX interface for MATLAB and Octave that is in charge of parsing and checking the input, allocating the output, and calling our library to perform the rounding. To show that the interface is fully compatible with chop, we designed a set of tests by modifying the default test suite for chop. 8 These tests can be run with

in MATLAB and with

in Octave.

8 PERFORMANCE EVALUATION

The experiments were run on a machine equipped with two 12-core Intel Xeon E5-2690 v3 CPUs running at 2.60 GHz. Exclusive node access was used to avoid timing artifacts and ensure that all 24 CPU threads were available for parallel runs. The C code was compiled with version 9.3.0 of the GNU Compiler Collection (GCC) with the optimization flag -O3 and the architecture option -march=native. The MATLAB experiments were run using the 64-bit GNU/Linux version of MATLAB 9.10 (R2021a). For our parallel implementations, we used version 4.5 of the OpenMP library. Source code and scripts to reproduce the experiments discussed in this section are available on GitHub. 9

We compare three versions of our codes.

  • cpfloat_seq denotes the sequential C codes in our library. Rounding is performed using the algorithms in Section 6.

  • cpfloat denotes the C codes in our library that leverage OpenMP and employ the auto-tuning technique discussed in Section 7 to switch between sequential and parallel implementations. Rounding is performed using the algorithms in Section 6.

  • cpfloat_ml denotes our MEX interface to cpfloat compiled in MATLAB. For large matrices, this function relies on the parallel version of our C codes.

As the numerical validation of the code has already been discussed in Section 7, here, we focus on timings. We time the C or C++ code by comparing the value returned by the function clock_gettime with CLOCK_MONOTONIC before and after the execution, and we take the median of 1,000 repetitions to reduce the influence of possible outliers. For the MATLAB code, we rely on the function timeit, which runs a portion of code several times and returns the median of the measurements.
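
The timing harness for the C code amounts to a few lines; the sketch below shows the pattern, with work() standing in for the operation being measured, and the function name is illustrative.

    #include <time.h>

    /* Sketch only: wall-clock time, in seconds, of one call to work(),
       measured with CLOCK_MONOTONIC as described above. */
    double time_region(void (*work)(void))
    {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        work();
        clock_gettime(CLOCK_MONOTONIC, &end);
        return (double)(end.tv_sec - start.tv_sec)
               + 1e-9 * (double)(end.tv_nsec - start.tv_nsec);
    }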

8.1 Performance of the C Interface

In this section, we compare the performance of CPFloat, GNU MPFR, and FloatX by considering the following implementations.

  • chop_mpfr denotes the codes that rely on the GNU MPFR library.10 As storage format, we use the GNU MPFR custom data type mpfr_t. For rounding, our implementation sets the precision and exponent range of MPFR to the precision and exponent range of the target format and then converts the binary64 input using the function mpfr_set_d. Arithmetic operations use mpfr_t arrays for both input and output.

  • floatx denotes the codes that use the floatx class from the FloatX library.11 Arrays of binary64 numbers are converted to a lower-precision target format by invoking the constructor of the floatx class. This code requires that the parameters of the target format be specified at compile time, as the floatx class uses C++ templates that are instantiated only for the low-precision formats declared in the source code. For arithmetic operations, we use arrays of floatx objects as input and output.

  • floatxr denotes the codes that rely on the floatxr class in the FloatX library. Conversion is performed using the class constructor. This class is more flexible than floatx in that it allows the number of bits of precision and the maximum exponent of the target format to be specified at runtime. As with the other implementations, for arithmetic operations, we assume that input and output are arrays of floatxr objects.
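
For context, the kind of MPFR-based conversion performed by the chop_mpfr baseline can be sketched as follows for a binary64 input and a binary16 target. The exponent-range values and the use of mpfr_subnormalize follow the standard MPFR recipe for emulating a small format with gradual underflow; the actual benchmark code may differ in its details, and the function name is illustrative.

    #include <mpfr.h>

    /* Sketch only: round the binary64 value x to binary16 (p = 11) with MPFR,
       honouring subnormals, and return the result as a binary64. */
    double mpfr_to_binary16(double x)
    {
        mpfr_t r;
        mpfr_set_emin(-23);                     /* smallest subnormal: 2^(-24) = 0.5 * 2^(-23) */
        mpfr_set_emax(16);                      /* largest finite: (2 - 2^(-10)) * 2^15        */
        mpfr_init2(r, 11);                      /* 11 bits of precision                        */
        int t = mpfr_set_d(r, x, MPFR_RNDN);    /* round to the target precision               */
        t = mpfr_subnormalize(r, t, MPFR_RNDN); /* re-round in the subnormal range             */
        double y = mpfr_get_d(r, MPFR_RNDN);
        mpfr_clear(r);
        return y;
    }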

Figure 1 compares the time required by chop_mpfr, floatxr, floatx, and cpfloat_seq to perform two operations: converting a square matrix from the storage format to the target format and computing the sum of two matrices in the target format. For format conversion, we report timings both including (left) and excluding (middle) the time needed to allocate the output vector. For the sum, we report only the time needed to execute the operation (right), assuming that the memory to store the result has already been allocated.

Fig. 1.

Fig. 1. Top: execution time, in seconds, of chop_mpfr, floatxr, floatx, and cpfloat_seq to convert matrices of order n from binary64 to binary16 (left and middle) and to compute the sum of two matrices of order n in binary16 (right) using round-to-nearest with ties-to-even. Bottom: corresponding slowdown plots, with cpfloat_seq taken as the baseline. The timings include the allocation of the output vector only in the left-most panel.

As storage and target formats, we use binary64 and binary16, respectively. We repeated the experiment simulating bfloat16 and TensorFloat-32 arithmetic—we do not reproduce these results here, as they are indistinguishable from those for binary16. We use only round-to-nearest with ties-to-even, as the FloatX library currently does not support any other rounding modes. We observe, however, that chop_mpfr also supports directed rounding as prescribed by the IEEE 754 floating-point standard.

The plots in the top row report the median timing of 100 runs for each algorithm. The plots in the bottom row present the same data as slowdown with respect to the timings of cpfloat_seq. Unsurprisingly, the execution time of the four algorithms grows quadratically with the order of the matrix to be converted and thus linearly with the number of entries in the matrix. For matrices of small size, the performances of floatx and cpfloat_seq are essentially indistinguishable, and for floatxr there is only a difference when the time to allocate the memory is considered. For matrices with more than 100 elements (\(n \gt 10\)), cpfloat_seq is the fastest of the four implementations we consider. In this regime, floatx is typically two to three times slower than cpfloat_seq, whereas for the two other algorithms the performance varies, depending on the operation being considered. The slowdown factor of floatxr can get well over 10 for conversion if the time needed to allocate the output is factored in, but it goes below 5 when allocations are not considered. The performance of chop_mpfr, by contrast, depends mostly on the operation performed: The slowdown factor is mostly over 20 for data conversion but is generally below 10 for sums.

We remark that chop_mpfr, floatxr, and cpfloat_seq are more flexible than floatx, as the latter requires the parameters of the target format be known at compile time for the compiler to instantiate the templates appropriately.

8.2 Performance of the MATLAB Interface

In this section, we compare the performance of cpfloat_ml with that of existing MATLAB alternatives.

First, we consider the MATLAB function chop, which is available on GitHub.12 Figure 2 reports the speedup of cpfloat_ml over chop. In each plot, we consider the conversion of square matrices of size n between 1 and \(10{,}000\) using the six rounding modes implemented in chop. We consider binary64 as storage format, and binary16 (left), bfloat16 (center), and TensorFloat-32 (right) as target formats.

Fig. 2.

Fig. 2. Ratio of the execution times of chop to that of cpfloat_ml on \(n \times n\) matrices of normal floating-point numbers stored in binary64. Target formats are binary16 (left), bfloat16 (center), and TensorFloat-32 (right).

The input data is obtained by manipulating the entries of an \(n \times n\) matrix X of pseudo-random numbers uniformly distributed in \((0,1)\). To generate a matrix of normal numbers, we add to each entry of X the constant value \({x_{\min }}^{(\ell)}\), which guarantees that \(x_{ij}\), for \(i,j = 1, \dots , n\), is distributed uniformly in the interval \(\big ({x_{\min }}^{(\ell)}, 1+{x_{\min }}^{(\ell)}\big)\).

In all cases, the speedup is greater than one and increases with the size of the input matrix. In other words, cpfloat_ml is always faster than chop, and particularly so for larger matrices. The two rounding modes for which the new algorithms bring the most significant gains are the two flavors of stochastic rounding. This is expected: Generating pseudo-random numbers accounts for a large fraction of the execution time for these rounding modes, and by using a more efficient pseudo-random number generator the new algorithms have a great advantage over chop. The remaining four rounding modes show a very similar speedup, although the curves for round-to-nearest are generally slightly more favorable.

We repeated the experiment using binary32 instead of binary64 as storage format and considering matrices whose entries are subnormal, rather than normal, in the target format. The results of these experiments are not included here, as they are not noticeably different from those in Figure 2.

The timings used to generate Figure 2 were obtained allowing MATLAB to use all 24 computational threads available on the system on which the experiment was run. To assess how the performance advantage of cpfloat_ml depends on the number of threads used, we ran a strong-scaling experiment. We took matrices of size \(n = 10{,}000\) (the largest value considered in Figure 2) and measured the speedup as the number t of computational threads increases. We did this by setting the maximum number of computational threads that MATLAB is allowed to use by means of the function maxNumCompThreads. The ratios between the execution times of chop and cpfloat_ml for binary16, bfloat16, and TensorFloat-32 are reported in Figure 3.

Fig. 3.

Fig. 3. Ratio of the execution times of chop to that of cpfloat_ml on a \(10{,}000 \times 10{,}000\) matrix of normal floating-point numbers stored in binary64 as the number of threads increases. Target formats are binary16 (left), bfloat16 (center), and TensorFloat-32 (right).

We note that the execution time of both functions increases as the number of threads is reduced, which confirms that they can both exploit parallelism and take advantage of additional computational threads available. On such large matrices, cpfloat_ml is always at least one order of magnitude faster than chop. For directed rounding modes, the speedup is just over \(20\times\) if a single thread is used; it increases up to \(t = 12\), which is the number of cores of a single CPU, and then settles just below \(100\times\). The speedup for round-to-nearest oscillates between \(50\times\) and \(120\times\) but follows a similar trend. For the two stochastic rounding modes, however, the speedup does not stop increasing at \(t = 12\) and reaches just below \(300\times\) for \(t = 24\).

The data shows that chop and cpfloat_ml can both take advantage of additional computational resources, but the latter is more efficient than the former at doing so.

We compared the performance of cpfloat_ml and of the MATLAB function f_d_dec2floatp from the FLOATP_Toolbox13 in a similar way. This library is less efficient than chop at the task we examine: It is typically over 100 times slower than cpfloat_ml at simulating binary16 and bfloat16 arithmetic and always over 1,000 times slower at simulating TensorFloat-32 arithmetic.

Finally, we ran a similar comparison with the flround function from INTLAB's fl custom-precision arithmetic library [55].14 The results are shown in Figure 4. This rounding function supports only binary64 as a storage format and does not support round-to-odd or the stochastic rounding modes. These experiments show that cpfloat_ml is faster than INTLAB's flround by up to \(9\times\).

Fig. 4.

Fig. 4. Ratio of the execution times of flround from INTLAB V12 to that of cpfloat_ml on \(n \times n\) matrices of normal floating-point numbers stored in binary64.

We remark that chop, the FLOATP_Toolbox, and INTLAB are entirely written in MATLAB, whereas CPFloat relies on optimized C code. Therefore, the performance observations in this section should be understood as a comparison of the implementations available in MATLAB, rather than as a comparison of the underlying algorithms.

8.3 Overhead of the MATLAB Interface

As a final test, we consider the overhead introduced by MATLAB when calling the underlying C implementation of the rounding algorithms. In Figure 5, we compare the execution time required to convert a matrix to binary16 with a direct call to the C code (first column) or with a call to the MEX interface in MATLAB (second column). As the performance of the two algorithms is very similar and the data in the two series is hard to compare directly, we provide the speedup in the third column. We repeat the experiment for both binary32 (top row) and binary64 (bottom row). We remark that the C functions were tuned by using the make autotune command, whereas for the MEX interface, we used cpfloat_autotune, a MATLAB function included in the software package.

Fig. 5.

Fig. 5. Execution time, in seconds, of cpfloat (first column) and cpfloat_ml (second column) on matrices of increasing order n and target format binary16. The third column represents the ratio of the execution time in the first column to that in the second.

The raw timings show that in our implementations stochastic rounding is the slowest rounding mode. The performance of the other rounding modes is so similar that the lines are hard to distinguish for both the C and the MEX interface. The data in the right-most column shows that for both storage formats we consider, the overhead of the MEX interface is significant for small matrices, but becomes negligible for matrices of order 3,000 or more.

Our results seem to suggest that MATLAB code that requires the functionalities of cpfloat_ml should be translated into C to obtain the maximum efficiency. We would like to stress, however, that this translation might bring only a minor performance gain and in fact not be worth the effort, unless the cpfloat function is used extensively on matrices of small size. In fact, the overhead of the MEX interface is modest in absolute terms and is noticeable only in cases when the overall execution time of both cpfloat and cpfloat_ml is below 5 milliseconds. This suggests that, in most cases, switching to a pure C implementation would bring only a marginal benefit, if any.

9 SUMMARY AND FUTURE WORK

Motivated by the growing number of tools and libraries for simulating low-precision arithmetic, we considered the problem of rounding floating-point numbers to low precision in software. We developed low-level algorithms for a number of rounding modes, explained how to implement them efficiently using bit manipulation, and discussed how to validate their behavior experimentally by means of exhaustive testing. We developed CPFloat, an efficient C library that implements all the algorithms discussed here and can be used from within MATLAB and Octave by means of a MEX interface we provide. When used in C, CPFloat can act as a full custom-precision floating-point arithmetic library, as it supports elementary arithmetic operations and mathematical functions such as those available in math.h for binary32 and binary64. Our experimental results show that the new implementations outperform existing alternatives in C, C++, and MATLAB.

Traditionally, floating-point arithmetic has been the most widely adopted technique for working with non-integer numbers in high-performance scientific computing, but alternative methods have recently begun to gain popularity. In particular, we believe that the techniques we developed here could be adapted to posit arithmetic [12, 26], a generalization of the IEEE 754 floating-point number format, and to fixed-point arithmetic, a de facto standard technique for working with reals on systems that are not equipped with a floating-point unit. This will be the subject of future work.

ACKNOWLEDGMENTS

The authors are grateful to Nicholas J. Higham, for discussions about the MATLAB function chop and for insightful observations on early drafts of this document, to Laura Morgenstern and Anne Reinarz, for feedback on the manuscript, and to Theo Mary, for testing the MEX interface to CPFloat and for reporting bugs affecting the software. The authors thank the anonymous referees of ACM Transactions on Mathematical Software for their comments that substantially improved the quality of the article and led to a change in the scope of the library.

Footnotes

  1. https://github.com/north-numerical-computing/cpfloat.
  2. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21718#c25.
  3. https://www.gnu.org/software/libc/.
  4. https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html.
  5. https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top/optimization-and-programming/intel-oneapi-dpc-c-compiler-math-library.html.
  6. https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/doc/OCML.md#supported-functions.
  7. https://github.com/imneme/pcg-c.
  8. https://github.com/higham/chop/blob/master/test_chop.m.
  9. https://github.com/north-numerical-computing/cpfloat_experiments.
  10. https://www.mpfr.org/.
  11. https://github.com/oprecomp/FloatX.
  12. https://github.com/higham/chop.
  13. https://gerard-meurant.pagesperso-orange.fr/floatp.zip.
  14. https://www.tuhh.de/ti3/rump/intlab/.

REFERENCES

  [1] Arm Limited. 2020. Arm Architecture Reference Manual. Technical Report ARM DDI 0487F.c (ID072120). Retrieved from https://developer.arm.com/documentation/ddi0487/fc/.
  [2] Aupy Guillaume, Benoit Anne, Cavelan Aurélien, Fasi Massimiliano, Robert Yves, Sun Hongyang, and Uçar Bora. 2016. In Emergent Computation: A Festschrift for Selim G. Akl, Andrew Adamatzky (Ed.). Emergence, Complexity and Computation (ECC), Vol. 24. Springer, Cham.
  [3] Boldo Sylvie and Melquiond Guillaume. 2008. Emulation of a FMA and correctly rounded sums: Proved algorithms using rounding to odd. IEEE Trans. Comput. 57, 4 (Feb. 2008), 462–471.
  [4] Brun Emeric, Defour David, Castro Pablo de Oliveira, Istoan Matei, Mancusi Davide, Petit Eric, and Vaquet Alan. 2021. A study of the effects and benefits of custom-precision mathematical libraries for HPC codes. IEEE Trans. Emerg. Topics Comput. 9, 3 (July 2021), 1467–1478.
  [5] Bull J. Mark. 1999. Measuring synchronisation and scheduling overheads in OpenMP. In Proceedings of the 1st European Workshop on OpenMP. Wiley, 99–105.
  [6] Bull J. Mark, Reid Fiona, and McDonnell Nicola. 2012. A microbenchmark suite for OpenMP tasks. In Proceedings of the 8th International Workshop on OpenMP (Lecture Notes in Computer Science, Vol. 7312). Springer-Verlag, Berlin, 271–274.
  [7] Burgess Neil, Milanovic Jelena, Stephens Nigel, Monachopoulos Konstantinos, and Mansell David. 2019. Bfloat16 processing for neural networks. In Proceedings of the 26th IEEE Symposium on Computer Arithmetic. Institute of Electrical and Electronics Engineers, 88–91.
  [8] Connolly Michael P., Higham Nicholas J., and Mary Theo. 2021. Stochastic rounding and its probabilistic backward error analysis. SIAM J. Sci. Comput. 43, 1 (Jan. 2021), A566–A585.
  [9] Croci Matteo, Fasi Massimiliano, Higham Nicholas J., Mary Theo, and Mikaitis Mantas. 2022. Stochastic rounding: Implementation, error analysis and applications. R. Soc. Open Sci. 9, 3 (Mar. 2022).
  [10] Davies Mike, Srinivasa Narayan, Lin Tsung-Han, Chinya Gautham, Cao Yongqiang, Choday Sri Harsha, Dimou Georgios, Joshi Prasad, Imam Nabil, Jain Shweta, Liao Yuyun, Lin Chit-Kwan, Lines Andrew, Liu Ruokun, Mathaikutty Deepak, McCoy Steve, Paul Arnab, Tse Jonathan, Venkataramanan Guruguhanathan, Weng Yi-Hsin, Wild Andreas, Yang Yoonseok, and Wang Hong. 2018. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 1 (Jan. 2018), 82–99.
  [11] Dawson Andrew and Düben Peter D. 2017. rpe v5: An emulator for reduced floating-point precision in large numerical simulations. Geosci. Model Dev. 10, 6 (June 2017), 2221–2230.
  [12] Dinechin Florent de, Forget Luc, Muller Jean-Michel, and Uguen Yohann. 2019. Posits: The good, the bad and the ugly. In Proceedings of the Conference for Next Generation Arithmetic. ACM Press, 1–10.
  [13] Fasi Massimiliano, Higham Nicholas J., Lopez Florent, Mary Theo, and Mikaitis Mantas. 2023. Matrix multiplication in multiword arithmetic: Error analysis and application to GPU tensor cores. SIAM J. Sci. Comput. 45, 1 (Feb. 2023), C1–C19.
  [14] Fasi Massimiliano, Higham Nicholas J., Mikaitis Mantas, and Pranesh Srikara. 2021. Numerical behavior of NVIDIA tensor cores. PeerJ Comput. Sci. 7 (Feb. 2021), e330(1–19).
  [15] Fasi Massimiliano and Mikaitis Mantas. 2021. Algorithms for stochastically rounded elementary arithmetic operations in IEEE 754 floating-point arithmetic. IEEE Trans. Emerg. Topics Comput. 9, 3 (July 2021), 1451–1466.
  [16] Flegar Goran, Scheidegger Florian, Novaković Vedran, Mariani Giovani, Tomás Andrés E., Malossi A. Cristiano I., and Quintana-Ortí Enrique S. 2019. FloatX: A C++ library for customized floating-point arithmetic. ACM Trans. Math. Softw. 45, 4 (Dec. 2019), 1–23.
  [17] Forsythe George E. 1950. Round-off errors in numerical integration on automatic machinery. Bull. Amer. Math. Soc. 56 (Nov. 1950), 55–64.
  [18] Forsythe George E. 1959. Reprint of a note on rounding-off errors. SIAM Rev. 1, 1 (Jan. 1959), 66–67.
  [19] Fousse Laurent, Hanrot Guillaume, Lefèvre Vincent, Pélissier Patrick, and Zimmermann Paul. 2007. MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. 33, 2 (June 2007), 13:1–13:15.
  [20] Goldberg David. 1991. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23, 1 (Mar. 1991), 5–48.
  [21] Goualard Frédéric. 2020. Generating random floating-point numbers by dividing integers: A case study. In Proceedings of the International Conference on Computational Science. Springer-Verlag, Cham, Switzerland, 15–28.
  [22] Graphcore Limited. 2020. IPU Programmer’s Guide. Retrieved from https://www.graphcore.ai/docs/ipu-programmers-guide.
  [23] Khronos OpenCL Working Group. 2019. The OpenCL C++ 1.0 Specification. The Khronos Group. Retrieved from https://registry.khronos.org/OpenCL/specs/2.2/pdf/OpenCL_Cxx.pdf.
  [24] Khronos OpenCL Working Group. 2019. The OpenCL C 2.0 Specification. The Khronos Group. Retrieved from https://registry.khronos.org/OpenCL/specs/2.2/pdf/OpenCL_C.pdf.
  [25] Gupta Suyog, Agrawal Ankur, Gopalakrishnan Kailash, and Narayanan Pritish. 2015. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 37). PMLR, 1737–1746. Retrieved from http://proceedings.mlr.press/v37/gupta15.html.
  [26] Gustafson John L. and Yonemoto Isaac T. 2017. Beating floating point at its own game: Posit arithmetic. Supercomput. Front. Innov. 4, 2 (June 2017), 71–86.
  [27] Higham Nicholas J. 2002. Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA.
  [28] Higham Nicholas J. and Pranesh Srikara. 2019. Simulating low precision floating-point arithmetic. SIAM J. Sci. Comput. 41, 5 (Oct. 2019), C585–C602.
  [29] Hopkins Michael, Mikaitis Mantas, Lester Dave R., and Furber Steve. 2020. Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations. Philos. Trans. R. Soc. A 378, 2166 (Jan. 2020), 22 pages.
  [30] IEEE. 1985. IEEE Standard for Binary Floating-point Arithmetic, ANSI/IEEE Standard 754-1985. Institute of Electrical and Electronics Engineers, Piscataway, NJ.
  [31] IEEE. 2008. IEEE Standard for Floating-point Arithmetic, IEEE Std 754-2008 (revision of IEEE Std 754-1985). Institute of Electrical and Electronics Engineers, Piscataway, NJ.
  [32] IEEE. 2019. IEEE Standard for Floating-point Arithmetic, IEEE Std 754-2019 (revision of IEEE Std 754-2008). Institute of Electrical and Electronics Engineers, Piscataway, NJ.
  [33] Innocente Vincenzo and Zimmermann Paul. 2022. Accuracy of Mathematical Functions in Single, Double, Extended Double and Quadruple Precision. Technical Report hal-03141101, version 2. Inria. Retrieved from https://hal.inria.fr/hal-03141101v2.
  [34] Intel Corporation. 2018. BFLOAT16—Hardware Numerics Definition. Retrieved from https://software.intel.com/en-us/download/bfloat16-hardware-numerics-defnition.
  [35] Intel Corporation. 2020. Intel Architecture Instruction Set Extensions and Future Features Programming Reference. Technical Report 319433-038 (March 2020). Retrieved from https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf.
  [36] International Business Machines Corporation. 2017. Power ISA Version 3.0 B. Retrieved from https://ibm.ent.box.com/s/1hzcwkwf8rbju5h9iyf44wm94amnlcrv.
  [37] Lefèvre Vincent. 2013. SIPE: Small integer plus exponent. In Proceedings of the 21st IEEE Symposium on Computer Arithmetic. Institute of Electrical and Electronics Engineers, 99–106.
  [38] Lefèvre Vincent. 2013. Sipe: A Mini-library for Very Low Precision Computations with Correct Rounding. Technical Report hal-00864580, version 1. Inria. Retrieved from https://hal.inria.fr/hal-00864580.
  [39] Lichtenau Cedric, Carlough Steven, and Mueller Silvia Melitta. 2016. Quad precision floating point on the IBM z13. In Proceedings of the 23rd IEEE Symposium on Computer Arithmetic. Institute of Electrical and Electronics Engineers, 87–94.
  [40] Loosemore Sandra, Stallman Richard M., McGrath Roland, Oram Andrew, and Drepper Ulrich. 2022. The GNU C Library Reference Manual (for version 2.36). Free Software Foundation. Retrieved from https://www.gnu.org/software/libc/manual/pdf/libc.pdf.
  [41] Meurant Gérard. 2020. FLOATP_Toolbox: MATLAB software for variable precision floating-point arithmetic. Retrieved from https://gerard-meurant.pagesperso-orange.fr/floatp.zip.
  [42] Moler Cleve B. 2017. “Half Precision” 16-Bit Floating Point Arithmetic. Blog post. Retrieved from https://blogs.mathworks.com/cleve/2017/05/08/half-precision-16-bit-floating-point-arithmetic.
  [43] Muller Jean-Michel. 2005. On the Definition of \(\mathrm{ulp}(X)\). Technical Report RR-5504, LIP RR-2005-09. Inria, LIP. Retrieved from https://hal.inria.fr/inria-00070503.
  [44] Muller Jean-Michel, Brunie Nicolas, Dinechin Florent de, Jeannerod Claude-Pierre, Joldes Mioara, Lefèvre Vincent, Melquiond Guillaume, Revol Nathalie, and Torres Serge. 2018. Handbook of Floating-point Arithmetic (2nd ed.). Birkhäuser.
  [45] NVIDIA Corporation. 2016. NVIDIA Tesla P100 Architecture. Technical Report WP-08019-001_v01.1. Retrieved from https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf.
  [46] NVIDIA Corporation. 2020. NVIDIA A100 Tensor Core GPU Architecture. Technical Report V1.0. Retrieved from https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
  [47] NVIDIA Corporation. 2022. NVIDIA H100 Tensor Core GPU Architecture. Technical Report V1.03. Retrieved from https://resources.nvidia.com/en-us-tensor-core.
  [48] O’Neill Melissa E. 2014. PCG: A Family of Simple Fast Space-efficient Statistically Good Algorithms for Random Number Generation. Technical Report HMC-CS-2014-0905. Harvey Mudd College. Retrieved from https://www.cs.hmc.edu/tr/hmc-cs-2014-0905.pdf.
  [49] Osorio John, Armejach Adrià, Petit Eric, Henry Greg, and Casas Marc. 2022. FASE: A fast, accurate and seamless emulator for custom numerical formats. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 144–146.
  [50] Overton Michael L. 2001. Numerical Computing with IEEE Floating Point Arithmetic. Society for Industrial and Applied Mathematics, Philadelphia, PA.
  [51] Radeon Technologies Group. 2017. Radeon’s Next-generation Vega Architecture. Technical Report. Advanced Micro Devices. Retrieved from https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf.
  [52] Reinarz Anne, Gallard Jean-Mathieu, and Bader Michael. 2018. Influence of a-posteriori subcell limiting on fault frequency in higher-order DG schemes. In Proceedings of the 8th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS). Institute of Electrical and Electronics Engineers, Piscataway, NJ, 79–86.
  [53] Roux Pierre. 2014. Innocuous double rounding of basic arithmetic operations. J. Formaliz. Reason. 7, 1 (July 2014), 131–142.
  [54] Rump Siegfried M. 1999. INTLAB — INTerval LABoratory. In Developments in Reliable Computing. Springer-Verlag, Dordrecht, Netherlands, 77–104.
  [55] Rump Siegfried M. 2017. IEEE754 precision-k base-\(\beta\) arithmetic inherited by precision-m base-\(\beta\) arithmetic for \(k \lt m\). ACM Trans. Math. Softw. 43, 3 (Jan. 2017), 1–15.
  [56] Ríos John Osorio, Armejach Adrià, Petit Eric, Henry Greg, and Casas Marc. 2021. Dynamically adapting floating-point precision to accelerate deep neural network training. In Proceedings of the 20th IEEE International Conference on Machine Learning and Applications. 980–987.
  [57] Samfass Philipp, Weinzierl Tobias, Reinarz Anne, and Bader Michael. 2021. Doubt and redundancy kill soft errors—Towards detection and correction of silent data corruption in task-based numerical software. In Proceedings of the 11th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale. Institute of Electrical and Electronics Engineers, Piscataway, NJ.
  [58] Wang Naigang, Choi Jungwook, Brand Daniel, Chen Chia-Yu, and Gopalakrishnan Kailash. 2018. Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, Vol. 31. Curran Associates, 7675–7684. Retrieved from http://papers.nips.cc/paper/7994-training-deep-neural-networks-with-8-bit-floating-point-numbers.pdf.
  [59] Wilkinson James Hardy. 1963. Rounding Errors in Algebraic Processes. Notes on Applied Science No. 32. Her Majesty’s Stationery Office.
  [60] Zhang Tianyi, Lin Zhiqiu, Yang Guandao, and Sa Christopher De. 2019. QPyTorch: A low-precision arithmetic simulation framework. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition. Institute of Electrical and Electronics Engineers, Piscataway, NJ, 11–13.
