Floating point number



Digital signal processing

Release date:2022/4/16         

Prerequisite knowledge
Binary number


■What is a floating point number?

Values with many digits, such as 0.00000000123 or 12300000000, can be written compactly as a short number multiplied by a power of ten: $0.00000000123 = 1.23 \times 10^{-9}$ and $12300000000 = 1.23 \times 10^{10}$. Written this way, the same notation covers everything from very small to very large values. It is called a floating point number (Float type) because the position of the decimal point moves.
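The same rewriting can be seen with Python's exponent formatting. This is only a minimal sketch, and the variable names are illustrative:

```python
# Format the two example values in exponent (scientific) notation.
small = 0.00000000123
large = 12300000000.0

small_text = f"{small:.2e}"  # two digits after the decimal point
large_text = f"{large:.2e}"

print(small_text)
print(large_text)
```

In both cases the digits 1.23 stay the same and only the exponent changes, which is the "moving decimal point" idea.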



A floating point number consists of a sign, an exponent part, and a mantissa part. There are two common formats: single-precision floating point (Single type, 32 bits) and double-precision floating point (Double type, 64 bits). Double precision can represent more digits, but each value takes twice as much memory, so keep memory use in mind when programming. Depending on the programming language, the Single type may be called the float type.
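The size difference and the loss of digits can be checked with Python's standard `struct` module. This is a rough sketch: the helper `to_single` is a name introduced here for illustration, and Python's own `float` is assumed to be double precision, as it is on common platforms.

```python
import struct

# Storage size of each format in bytes (standard sizes, little-endian).
single_bytes = struct.calcsize("<f")
double_bytes = struct.calcsize("<d")

def to_single(x):
    """Round a Python float (double) to single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 1.23456789012345
as_single = to_single(value)

print(single_bytes * 8, double_bytes * 8)  # bits per value
print(value)       # double precision keeps all the digits written above
print(as_single)   # single precision keeps only about 7 of them
```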



■Floating point arithmetic concept

Floating point numbers are actually stored in binary, but to understand the idea, let's start with decimal. Take the number -252.025 and write it in floating point style: $-252.025 = -2.52025 \times 10^{2}$.



In this form, the value is expressed using a sign that indicates plus or minus, a mantissa, and an exponent. The same value can be written in binary in the same way, but that is not yet floating point notation.



In single precision, the floating point representation packs the value into 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits.



The details are as follows. The sign part is 1 if the number is negative and 0 if it is positive. The mantissa part is stored with its leading 1 omitted. In binary, the leading digit of a normalized number is always 1, so leaving it out frees one more bit for information. The exponent part is stored with an offset of 127: the stored value is the actual exponent plus 127, so stored values from 00000001 to 11111110 express exponents from -126 to 127. The stored value 00000000 is reserved for zero and for very small (denormalized) numbers, and 11111111 is reserved for special cases such as infinity and NaN.
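The three fields described above can be pulled out of a real single-precision value with Python's `struct` module. This is a minimal sketch; the function names `decode_single` and `rebuild_normal` are introduced here for illustration, and the rebuild step only handles normalized values (stored exponent from 1 to 254).

```python
import struct

def decode_single(x):
    """Round x to single precision and split it into its three bit fields."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31                      # 1 bit
    stored_exponent = (bits >> 23) & 0xFF  # 8 bits, actual exponent + 127
    fraction = bits & 0x7FFFFF             # 23 bits, leading 1 omitted
    return sign, stored_exponent, fraction

def rebuild_normal(sign, stored_exponent, fraction):
    """Rebuild a normalized value from its fields."""
    mantissa = 1 + fraction / 2**23        # put the omitted leading 1 back
    return (-1) ** sign * mantissa * 2.0 ** (stored_exponent - 127)

sign, stored_exponent, fraction = decode_single(-252.025)
print(sign, f"{stored_exponent:08b}", f"{fraction:023b}")
print(stored_exponent - 127)               # actual exponent
print(rebuild_normal(sign, stored_exponent, fraction))
```

The rebuilt value is close to -252.025 but not exactly equal to it, because 0.025 cannot be written as a finite binary fraction.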



■Floating point range and number of significant digits

<Single type Floating point>

(1) Positive minimum
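The exponent part is 00000001 and the stored mantissa bits are all 0. Remember that the mantissa has an implicit leading 1, so the value is $1.0 \times 2^{-126} \approx 1.175494 \times 10^{-38}$.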
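As a quick check, the bit pattern for the smallest normalized value can be built by hand and read back as a single-precision number. This is a sketch that assumes normalized numbers only; denormalized numbers are not covered.

```python
import struct

# sign 0, exponent 00000001, stored mantissa bits all 0
min_bits = 0x00800000
(min_normal,) = struct.unpack("<f", struct.pack("<I", min_bits))

print(min_normal)
print(min_normal == 2.0 ** -126)
```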



The most negative value is simply the maximum value shown below multiplied by -1, so its explanation is omitted.

(2) Maximum value
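The exponent part is 11111110 and the stored mantissa bits are all 1. The mantissa is a binary fraction, so its value is the sum $2^{0} + 2^{-1} + 2^{-2} + \cdots + 2^{-23}$. Multiplying this sum by $2^{127}$ gives the maximum value, about $3.402823 \times 10^{38}$.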
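The sum of powers of two can be evaluated directly and compared with the largest finite single-precision bit pattern. This is a minimal sketch with illustrative variable names:

```python
import struct

# Mantissa with every bit set: 2^0 + 2^-1 + ... + 2^-23
mantissa = sum(2.0 ** -k for k in range(24))
max_from_formula = mantissa * 2.0 ** 127

# sign 0, exponent 11111110, stored mantissa bits all 1
(max_from_bits,) = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))

print(mantissa)
print(max_from_formula)
print(max_from_formula == max_from_bits)
```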



(3) Number of significant digits
The number of significant digits is determined by the mantissa. Including the implicit leading 1, the mantissa has 24 bits, and $2^{24} = 16777216$ is an 8-digit number. However, not every 8-digit number can be expressed, so the number of significant digits is 7. Decimal fractions often become recurring fractions in binary (explained in a separate article), and even in floating point the recurring part beyond the significant digits is cut off. This is why it is said in programming that floating point values should not be compared for exact equality.
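Both effects can be reproduced in a few lines. This is a sketch under two assumptions: `to_single` is a helper name introduced here, and the second half uses Python's built-in `float`, which is double precision, so it shows the same rounding problem at a finer scale. Comparing with a tolerance via `math.isclose` is one common alternative to exact comparison, not the only one.

```python
import math
import struct

def to_single(x):
    """Round a Python float (double) to single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 2^24 + 1 needs 25 bits, so single precision cannot hold it.
print(to_single(16777216.0))
print(to_single(16777217.0))

# 0.1 and 0.2 are recurring fractions in binary, so the sum is slightly off.
total = 0.1 + 0.2
print(total == 0.3)              # exact comparison fails
print(math.isclose(total, 0.3))  # comparison with a tolerance succeeds
```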



You can also check the number of significant digits with a logarithm, because the logarithm expresses the number of digits: $\log_{10} 2^{24} \approx 7.22$, which gives 7 digits.
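The logarithm check is easy to run for both mantissa widths. This is a minimal sketch with illustrative variable names:

```python
import math

# Number of decimal digits covered by a 24-bit and a 53-bit mantissa.
single_digits = math.log10(2 ** 24)
double_digits = math.log10(2 ** 53)

print(round(single_digits, 2), math.floor(single_digits))
print(round(double_digits, 2), math.floor(double_digits))
```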


<Double type Floating point>

Apply the same calculation as for single precision. The Double type has an 11-bit exponent part and a 52-bit stored mantissa part.

 (1) The minimum positive value is $2.225074 \times 10^{-308}$
 (2) The maximum value is $1.797693 \times 10^{308}$
 (3) The number of significant digits is 15 ($\log_{10} 2^{53} = 15.95$)
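In Python these three values can be read from `sys.float_info`, assuming the platform stores `float` as a 64-bit double, which is the usual case:

```python
import sys

info = sys.float_info

min_text = f"{info.min:.6e}"  # minimum positive normalized value
max_text = f"{info.max:.6e}"  # maximum value

print(min_text)
print(max_text)
print(info.mant_dig)  # mantissa bits, including the omitted leading 1
print(info.dig)       # decimal digits that are always preserved
```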








