Chapter 6
FPU

Initial version: 2025-03-12
Last update: 2025-04-01

From a human's perspective numbers are just numbers and it does not matter whether you add 1 to 2 or 1.0 to 2.0. From a computer's perspective these are two completely different objects handled by two completely different "modules" – either a separate part of the CPU or a software emulation of one. Yes, real numbers (floating-point numbers) are different from integers and require a separate way of processing. In this chapter you will learn how to operate on such numbers.

Table of contents


Know the place where you are


In the contemporary x86 technology portfolio there are a lot of names like FPU, MMX, SSE, or AVX. Each brings to this noble architecture some novelty in data processing, allowing you to write better, in most cases faster, code. Below I will briefly characterize each of them so you will know what you can use them for.

  • FPU The x87 Floating-Point Unit (FPU), also known as the co-processor, used to be an option when the first PCs came on the market. It provides high-performance floating-point processing capabilities and supports mostly floating-point but also integer and packed BCD integer data types, together with the floating-point processing algorithms and exception handling architecture defined in the IEEE Standard 754 for Binary Floating-Point Arithmetic. Modern PCs all come with a co-processor built in. It is worth noting that although the original PC-XT architecture (especially the CPU) has evolved considerably over the years, the FPU itself has hardly changed during that same period. The entire set of assembler instructions for the FPU is relatively small and the main difficulty is to avoid some of the pitfalls peculiar to the FPU.

    The answer to the question why the FPU is stuck in ancient times has two parts:
    • To keep backward compatibility. Whether you like it or not, this is something that characterizes the x86 architecture. Even now, more than 40 years after the first PC XT launch (on March 8, 1983), you can execute every program ever made for this architecture on any of its newest descendants. This gives you great flexibility but also limits further evolution.
    • Intel applies something that I call inherited evolution. Because of the previous argument, instead of changing something that exists they rather extend the architecture by adding new components (and preserving the existing ones in untouched form). In consequence a new architecture inherits everything from its predecessor and extends it with some new features. So if you are searching for a "new FPU" you should look not at the FPU itself but rather at other technologies. Keep reading and you will find them.


    The FPU executes instructions from the processor’s normal instruction stream. The state of the FPU is independent from the state of the basic execution environment and from the state of SSE/SSE2/SSE3 extensions. However, the FPU and MMX instructions share state because the MMX registers are aliased to the x87 FPU data registers. Therefore, when writing code that uses FPU and MMX instructions, the programmer must explicitly manage the x87 FPU and MMX state.

    How fast was (is) the FPU? Well, I will give the floor to one of the computer users from those times.

    The overall speed-up from using an 8087 instead of software floating routines is likely a good deal better than 100x. Back in the day when I was still in college, the class had an assignment to write a FORTRAN program that did a Fourier analysis. Over the holidays, I wrote the program on a COMPAQ Portable with the standard 4.77 MHz 8088 and no math coprocessor. When I ran the program to test it, I thought I had an infinite loop somewhere after the program did not complete after five or so minutes. After hours of debugging and finding no errors, someone suggested to me to let the program run and take a long break. Much to my surprise, the program completed, successfully and with the correct results, after about half an hour.

    When I went to submit the program for grading, the teaching assistants in the lab would run the program with another dataset they had. The computers in the lab were standard IBM PCs with 4.77 MHz 8088 CPUs but they also had the 8087 installed. I was fully prepared to wait 30 minutes but, being that the machines had 8087s, I had guessed the program would complete execution in five or ten minutes.

    To my utter shock, the program completed in one or two seconds. The MS-DOS cursor reappeared almost immediately after the TA had pressed the Enter key to run my program.

    Of course, this is just one anecdote and I was very inexperienced at writing programs at the time so it is entirely possible that the program I wrote was unusually bad. But that is one of the war stories I've accumulated over the years from developing software where a very surprising result made an indelible impression on me. [anon_01]

    Below I present execution times for selected 8087 numerical instructions and corresponding 8086 emulations [doc_01]:

    
    Comparison of 8087 and 8086 Clock Times
    
                                | Approximate execution time (in us)
                                +-------------+---------------------
    Instruction                 | 8087        | 8086 Emulation
                                | 8 MHz clock |
    ----------------------------+-------------+---------------------            
    Add/Subtract                | 10          | 1000
    Multiply (single precision) | 11.9        | 1000
    Multiply (double precision) | 16.0        | 1312
    Divide (single precision)   | 24.4        | 2000
    Compare                     |  5.6        | 812
    Load (double precision)     |  6.3        | 1062
    Store (double precision)    | 13.1        | 750
    Square root                 | 22.5        | 12250
    Tangent                     | 56.3        | 8125
    Exponentiation              | 62.5        | 10687
    
  • MMX I will make a step backward into the distant past and say a few words about a crucial data revolution which changed the market and the industry.

    There was a time, around the '90s, when my HDD had a capacity of 120 MiB, while the most typical was 80 MiB and many users had 60 MiB of storage (yes, mebibytes MiB, not gibibytes GiB as today). And then a game changer appeared.

    In 1980 Sony and Philips finished their work on the Compact Disc Digital Audio (CD-DA) – a new data medium capable of storing digital audio data, offering the best sound quality and spacious enough to hold an hour of recording. And that was a challenge, because the amount of data that had to be saved was, for those times, huge – close to 600 MiB. This potential, due to the digital nature of the data, which could actually be anything, was noticed relatively quickly.

    The CD-DA was adapted to hold any form of digital data, with an initial storage capacity of 553 MiB – this "extended" form of CD-DA was named CD-ROM (Compact Disc Read Only Memory). Sony and Philips created the technical standard that defines the format of a CD-ROM in 1983, in what came to be called the Yellow Book. The CD-ROM was announced in 1984 and introduced by Denon and Sony at the first Japanese COMDEX computer show in 1985. In November 1985, several computer industry participants, including Microsoft, Philips, Sony, Apple and Digital Equipment Corporation, met to create a specification to define a file system format for CD-ROMs. The resulting specification, called the High Sierra format, was published in May 1986. It was eventually standardized, with a few changes, as the ISO 9660 standard in 1988. One of the first products to be made available to the public on CD-ROM was the Grolier Academic Encyclopedia, presented at the Microsoft CD-ROM Conference in March 1986.

    So having the CD-ROM, creators could provide considerably more information and data. Especially, they could finally enrich the content with an abundance of multimedia materials like audio, images and video. This way the problem of storing a huge amount of data was transformed into another one: how to process it efficiently. Remember, this story takes place in times when no multicore and multithreaded CPUs were available. The amount of data is huge, but it is a very specific kind of data. Consider for example an image and its brightness. If you want to change it, you have to add the same value to every pixel. A lot of operations, but all of them are exactly the same. The same instruction on different data – single instruction, multiple data, you would say. All of this was an impulse for Intel to design and later introduce (on January 8, 1997), with its Pentium microprocessors named "Pentium with MMX Technology", an extension of the basic instruction set known as MMX (Multimedia Extensions or sometimes also Matrix Math Extensions) – 57 new instructions of the SIMD type. Technically Intel engineers faced, as always, a big challenge. They had to introduce a revolutionary approach without breaking compatibility with previous CPUs – new programs had to run faster, while old programs still had to execute "as before".

    MMX defines eight processor registers, named MM0 through MM7, and operations that operate on them. To avoid compatibility problems with the context switch mechanisms in existing operating systems, the MMX registers are aliases for the existing x87 floating-point unit (FPU) registers, which context switches would already save and restore. However, unlike the x87 registers, which behave like a stack, the MMX registers are each directly addressable (random access). Each 64-bit MMX register corresponds to the mantissa part of an 80-bit x87 register. The upper 16 bits of the x87 registers thus go unused in MMX, and these bits are all set to ones, making them Not a Number (NaN) data types, or infinities in the floating-point representation.

    MMX introduced new data types called packed, but a much better term to describe their nature is vector. This "packing" involves treating 64-bit data (a sequence of 64 bits) as consisting of a number of separate items (subsequences), all of the same size:

    • 1 times 64 bits (quad word),
    • 2 times 32 bits (packed dword),
    • 4 times 16 bits (packed word),
    • 8 times 8 bits (packed byte).
    When operations are performed on vector types ("packed"), the same operation is performed for all cells simultaneously. For example, if two packed byte (8 times 8 bits) vectors are added, a single instruction performs eight 8-bit addition operations at once, and eight 8-bit results are stored.
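    To make the idea of "packing" concrete, here is a minimal Python sketch that models eight 8-bit additions carried out on two 64-bit operands at once. The function and helper names (paddb, after the MMX PADDB instruction, unpack_bytes, pack_bytes) are mine, chosen for illustration only:

    ```python
    def unpack_bytes(value):
        """Split a 64-bit value into eight 8-bit lanes (packed byte),
        lowest byte first."""
        return [(value >> (8 * i)) & 0xFF for i in range(8)]

    def pack_bytes(lanes):
        """Combine eight 8-bit lanes back into one 64-bit value."""
        result = 0
        for i, lane in enumerate(lanes):
            result |= (lane & 0xFF) << (8 * i)
        return result

    def paddb(a, b):
        """Toy model of a packed-byte add (wraparound): eight independent
        8-bit additions performed on two 64-bit operands."""
        return pack_bytes([(x + y) & 0xFF
                           for x, y in zip(unpack_bytes(a), unpack_bytes(b))])
    ```

    A single call to paddb thus stands for what MMX performs in one instruction instead of eight separate byte additions.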

    MMX provides only integer operations and distinguishes between two types of arithmetic, in which the response to exceeding the range of numbers is different:

    • Modulo arithmetic (wraparound), in which exceeding the range is not signaled in any way and only the least significant bits that fit in the result word are stored:

      
        8 bit case:
          
        11001000_(2) = 200_(10)
        11001000_(2) = 200_(10)
       ------------------------ ADD
       110010000_(2) = 400_(10) exact result
                  |
                  | wraparound by truncation
                  V
       X10010000_(2) = 144_(10) = 400 mod 256
       |
       truncate
      
    • Saturation arithmetic, in which a result exceeding the range that a given data type can hold is clamped to the extreme value:

      
        8 bit case:
          
        11001000_(2) = 200_(10)
        11001000_(2) = 200_(10)
       ------------------------ ADD
       110010000_(2) = 400_(10) exact result
                  |
                  | maximum value on 8 bits
                  | is 255
                  V
        11111111_(2) = 255_(10)
      
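    The two kinds of arithmetic can be sketched in Python as follows (a toy model for unsigned 8-bit values; the function names are mine):

    ```python
    def add8_wraparound(a, b):
        """Modulo (wraparound) arithmetic: keep only the least
        significant 8 bits of the exact result."""
        return (a + b) & 0xFF

    def add8_saturate(a, b):
        """Saturation arithmetic (unsigned): results above 255 are
        clamped to 255."""
        return min(a + b, 255)

    # The example from the diagrams above: 200 + 200
    wrapped = add8_wraparound(200, 200)   # 144, i.e. 400 mod 256
    clamped = add8_saturate(200, 200)     # 255, the maximum 8-bit value
    ```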


    Typical MMX applications include:

    • decoding images like JPEG or PNG;
    • decoding and encoding MPEG movies;
    • displaying two-dimensional graphics (blue box, masking, transparency);
    • displaying three-dimensional graphics: geometric transformations, shading, texturing;
    • filtering signals in the form of static images, movies, sound;
    • determining transforms like Haar or FFT.
    It is not surprising that programs using MMX instructions were much faster than analogous programs using ordinary processor instructions. In non-multimedia-centered computer programs, the benefit of using MMX is practically none. MMX and the idea behind it have subsequently been extended by Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX).
  • SSE (Streaming SIMD Extensions) is a family of successively introduced instruction sets and architecture changes (SSE – 1999, SSE2 – 2000, SSE3 – 2004, SSE4 – 2007 and SSE5 – 2009). All of them focus on applying the MMX approach to real (floating-point) numbers.

    SSE floating-point instructions operate on a new independent register set, the XMM registers; SSE also adds a few integer instructions that work on the MMX registers. SSE originally added eight new 128-bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD (originally called x86-64) added a further eight registers XMM8 through XMM15, and this extension is duplicated in the Intel 64 architecture. There is also a new 32-bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64-bit operating mode.

    SSE used only a single data type for XMM registers: four 32-bit single-precision floating-point numbers.

    SSE2 would later expand the usage of the XMM registers to include:
    • two 64-bit double-precision floating-point numbers or
    • two 64-bit integers or
    • four 32-bit integers or
    • eight 16-bit short integers or
    • sixteen 8-bit bytes or characters.
    The addition of integer support in SSE2 made MMX largely redundant.

    Because these 128-bit registers are additional machine states that the operating system must preserve across task switches, they are disabled by default until the operating system explicitly enables them. This means that the OS must know how to use the FXSAVE and FXRSTOR instructions, which is the extended pair of instructions that can save all x86 and SSE register states at once. This support was quickly added to all major IA-32 operating systems.
  • AVX AVX (Advanced Vector Extensions) is another instruction set of the SIMD type, introduced in 2011. It seems to be an evolutionary step from SSE. Why did Intel decide to change the name to AVX? I don't know. In its initial shape this set uses sixteen YMM registers, which are the XMM registers extended from 128 bits to 256 bits and renamed from XMM0–XMM7 to YMM0–YMM7 (in x86-64 mode, from XMM0–XMM15 to YMM0–YMM15). Each YMM register can hold and do simultaneous operations on:
    • eight 32-bit single-precision floating point numbers or
    • four 64-bit double-precision floating point numbers.
    AVX introduces a three-operand SIMD instruction format (the VEX coding scheme), where the destination register is distinct from the two source operands. For example, an instruction that so far used the conventional two-operand form a := a + b can now use a non-destructive three-operand form c := a + b, preserving both source operands. Such a form of instruction is very typical for RISC architectures.

    Major changes introduced by further versions of this set of instructions can be summarized as follows:
    • AVX2 (2013):
      • expansion of most vector integer SSE and AVX instructions to 256 bits.
    • AVX-512 (2017):
      • extend register width from the 256-bit to the 512-bit;
      • introduce various new operations, such as new data conversions, scatter operations, and permutations;
      • the number of AVX registers is increased from 16 to 32 (ZMM0–ZMM31). These registers can be addressed as 256-bit YMM registers (from the AVX extensions) and as 128-bit XMM registers (from the Streaming SIMD Extensions). Legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16–XMM31 and YMM16–YMM31;
      • designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty.
    • AVX10 (announced in July 2023):
      • addresses several issues of AVX-512, in particular that it is split into too many parts; it simplifies detection of supported instructions by introducing a version of the instruction set, where each subsequent version includes all instructions from the previous one;
      • later AVX10 versions will introduce new features.


    I have to warn you that not all AVX-512 CPUs offer the same instruction sets. Unlike its previous iterations, this vector instruction set consists of 19 subsets: a core foundation, AVX-512F, that has to be offered to be compliant, and then a raft of very specific ones. These extra sets cover operations such as reciprocal math, integer FMA, or convolutional neural network algorithms. What is more, since in practice the use of AVX, in any form, results in the clock frequency being automatically decreased, the use of AVX-512 on such platforms would almost certainly be worse than using any of its predecessors, as it is even more demanding of power when running. [eva_20]


Final words


Now you have some background related to the FPU and I can move on to discussing its internal structure and the way you can use it.

Before that I want to make a few remarks:
  • Reading the preceding part, you may come to the conclusion that today the FPU is seldom used because you have much more efficient alternatives. Even if you are right, for theoretical, conceptual and purely didactic reasons it is good to spend some time writing a few, even simple, programs in the old-fashioned FPU style.
  • If you think the FPU is useless, keep in mind that vectorized instructions are good but they are not as useful as they seem to be, and there are a lot of voices saying that this is a road to nowhere.
    • MMX, (S)SSE, and AVX aims to help with working with multiple smaller numbers or decimal points all at once. However, less than 10% of real software uses SIMD, and this software only uses SIMD to speed up less than 1% of its code (yes, there are exceptions, but even 3D games usually defer their number-crunching to the GPU and sometimes omit SIMD entirely). SIMD seems great on paper, but nearly zero software gets a significant speed-up from it. Yes, yes, individual sections get sped up. What I'm saying is that the software still spends a significant amount of time doing non-simidizable operations (especially branching) which account for most of the execution time. Additionally, it often takes a lot of time and testing for the developer to convert scalar code to use SIMD. [reddit_01]
    • I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on.

      I hope Intel gets back to basics: gets their process working again, and concentrate more on regular code that isn't HPC or some other pointless special case.

      I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota.

      Because absolutely nobody cares outside of benchmarks.

      The same is largely true of AVX512 now - and in the future. Yes, you can find things that care. No, those things don't sell machines in the big picture.

      And AVX512 has real downsides. I'd much rather see that transistor budget used on other things that are much more relevant. Even if it's still FP math (in the GPU, rather than AVX512). Or just give me more cores (with good single-thread performance, but without the garbage like AVX512) like AMD did.

      I want my power limits to be reached with regular integer code, not with some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!) and takes away cores (because those useless garbage units take up space).

      Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market.

      Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough. [linus]
  • As always, a must-read document about the FPU, as for any other aspect of the Intel architecture, is the official documentation [intel_doc_01].


Reverse Polish Notation – a brilliant idea for performing calculations


To understand why the x87 is organized the way it is, you have to understand Reverse Polish Notation (RPN) – a brilliant idea for performing calculations.

We commonly use infix notation, in which operators are placed between operands:


OPERAND   OPERAND
  | OPERATOR  |
  |     |     |
  123   +   321 
You may ask: "Are there any alternatives?" Yes, there are. You may place the operator so it precedes its operands, or the operator may follow its operands:


OPERATOR OPERAND OPERAND
   |        |       |
   +       123     321
   
   
OPERAND OPERAND OPERATOR 
   |       |       |
  123     321      +
In the first case, when the operator precedes its operands, you have prefix notation. In the latter, when the operator follows its operands, you have postfix notation.

Because prefix notation was invented by the Polish mathematician Jan Łukasiewicz (in the 1920s), its alternative name is Polish Notation. In the late 1950s, Australian philosopher and computer scientist Charles L. Hamblin suggested placing the operator after the operands and hence created the widely used Reverse Polish Notation.

RPN has the property that brackets are not required to represent the order of evaluation or grouping of the terms. RPN expressions are simply evaluated from left to right and this greatly simplifies the computation of the expression within computer programs. In practice RPN can be conveniently evaluated using a stack structure.

To evaluate an expression in RPN, a stack-based algorithm is commonly used. Here is how it works:
  1. Start with an empty stack.
  2. Scan the expression from left to right.
  3. If the current token is an operand, push it onto the stack.
  4. If the current token is an operator, pop its operands from the top of the stack, apply the operator to them, and push the result back onto the stack.
  5. Continue this process (steps 2-4) until all tokens have been processed.
  6. The final result is the value remaining on the stack.



Infix expression: (6-2)*5
RPN expression  : 6 2 - 5 *

 1. token := next_token(expression) // token = 6
 2. Because token is a number, push it onto the stack  //  STACK: 6
 3. token := next_token(expression) // token = 2
 4. Because token is a number, push it onto the stack  //  STACK: 6 2
 5. token := next_token(expression) // token = -
 6. Because token is a two-operand operator, take two numbers
    from the stack and subtract the first from the second:
    op1 := pop()  // op1 = 2, STACK: 6
    op2 := pop()  // op2 = 6, STACK: (empty)
    tmp := op2 - op1  // tmp := 6 - 2
    push(tmp)     // STACK: 4
 7. token := next_token(expression) // token = 5
 8. Because token is a number, push it onto the stack  //  STACK: 4 5
 9. token := next_token(expression) // token = *
10. Because token is a two-operand operator, take two numbers
    from the stack and multiply the second by the first:
    op1 := pop()  // op1 = 5, STACK: 4
    op2 := pop()  // op2 = 4, STACK: (empty)
    tmp := op2 * op1  // tmp := 4 * 5
    push(tmp)     // STACK: 20
11. token := next_token(expression) // token = NULL
    End of calculation.
12. Result is on the top of the stack
    result := pop()
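The whole procedure can be condensed into a short Python function. This is a sketch: token handling is deliberately minimal, only the four basic operators are supported, and tokens are expected as a pre-split list:

```python
def evaluate_rpn(tokens):
    """Evaluate a list of RPN tokens using a stack."""
    ops = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
        "/": lambda x, y: x / y,
    }
    stack = []
    for token in tokens:
        if token in ops:
            op1 = stack.pop()            # first pop: top of stack
            op2 = stack.pop()            # second pop: the earlier operand
            stack.append(ops[token](op2, op1))
        else:                            # operand: push it onto the stack
            stack.append(float(token))
    return stack.pop()                   # result is on top of the stack
```

For example, evaluate_rpn("6 2 - 5 *".split()) reproduces the trace above and yields 20.0.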
As you can see, this is a very elegant and efficient calculation method. Any standard infix arithmetic expression can be easily converted to an RPN expression. A well-known algorithm for converting from infix to postfix notation is Dijkstra's Shunting Yard Algorithm. This algorithm uses a queue and a stack to do the conversion and could provide you with some good programming practice for the study of data structures.

To carry out the process manually, you can use the 'fully parenthesising' method.

Consider the infix expression: 5 + 2 * 4

Using the rules relating to operator precedence, add brackets to the whole expression so that the inner set of brackets denotes the first part of the expression to be evaluated: (5 + (2 * 4)).

Now move each infix operator to the end of its corresponding set of brackets (just before the right-hand bracket): (5 (2 4 *) +).

Finally remove the brackets: 5 2 4 * +

If you want to practice, here you have some more examples:


infix: (3 + 5) / (9 - 5)
RPN:   3 5 + 9 5 - /

infix: 28 / (6 + 2 * 4)
RPN:   28 6 2 4 * + /

infix: (5 + 7) / ((8 - 6) * 3)
RPN:   5 7 + 8 6 - 3 * /
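For completeness, here is a minimal Python sketch of the Shunting Yard algorithm mentioned above. It handles only the left-associative operators +, -, *, / and parentheses, and expects tokens separated by spaces:

```python
def infix_to_rpn(tokens):
    """Dijkstra's Shunting Yard: convert infix tokens to an RPN string."""
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}   # operator precedence
    output, stack = [], []
    for token in tokens:
        if token in prec:
            # pop operators of greater or equal precedence first
            while stack and stack[-1] in prec and prec[stack[-1]] >= prec[token]:
                output.append(stack.pop())
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack[-1] != "(":           # pop until the matching "("
                output.append(stack.pop())
            stack.pop()                       # discard the "("
        else:                                 # operand goes straight to output
            output.append(token)
    while stack:                              # flush the remaining operators
        output.append(stack.pop())
    return " ".join(output)
```

You can check it against the practice examples, e.g. infix_to_rpn("28 / ( 6 + 2 * 4 )".split()).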


FPU internals


The FPU executes instructions from the processor’s normal instruction stream. The state of the FPU is independent from the state of the basic execution environment and from the state of SSE extensions. However, the FPU and MMX instructions share state because the MMX registers are aliased to the x87 FPU data registers. Therefore, when writing code that uses FPU and MMX instructions, you must explicitly manage the x87 FPU and MMX state.

The FPU execution environment consists of eight 80-bit data registers (internally named R0 to R7; note that the names R0-R7 cannot be used by the programmer – instead the names ST(0)-ST(7) are used, as will be clarified below) and the following special-purpose registers:
  • status register (16-bit),
  • control register (16-bit),
  • tag word register (16-bit),
  • last instruction pointer register (48-bit),
  • last data (operand) pointer register (48-bit),
  • opcode register (11-bit).


FPU status register
The 16-bit FPU status register indicates the current state of the floating-point unit. The FPU sets the flags in this register to show the results of operations.

  • Exception flags (bits 0-6)
    • IE, Invalid Operation, bit 0
    • DE, Denormalized Operand, bit 1
    • ZE, Zero Divide, bit 2
    • OE, Overflow, bit 3
    • UE, Underflow, bit 4
    • PE, Precision, bit 5
    • SF, Stack Fault Flag, bit 6
    The stack fault flag indicates that stack overflow or stack underflow has occurred. The FPU explicitly sets the SF flag when it detects a stack overflow or underflow condition, but it does not explicitly clear the flag when it detects an invalid-arithmetic-operand condition. When this flag is set, the condition code flag C1 indicates the nature of the fault: overflow (C1 = 1) or underflow (C1 = 0). The SF flag is a "sticky" flag, meaning that after it is set, the processor does not clear it until it is explicitly instructed to do so (for example, by an FINIT/FNINIT instruction).
  • ES, Error Summary Status (bit 7). The x87 FPU detects six classes of exception conditions:
    • Invalid operation (#I), with two subclasses:
      • Stack overflow or underflow (#IS)
      • Invalid arithmetic operation (#IA)
    • Denormalized operand (#D)
    • Divide-by-zero (#Z)
    • Numeric overflow (#O)
    • Numeric underflow (#U)
    • Inexact result (precision) (#P)
    The exception summary status flag is set when any of the (unmasked) exception flags are set. The exception flags are "sticky" bits (once set, they remain set until explicitly cleared).
  • C0-C3, Condition Code, bits 8, 9, 10 and 14. The four condition code flags indicate the results of floating-point comparison and arithmetic operations. These condition code bits are used principally for conditional branching and for storage of information used in exception handling.
  • TOP, Top of Stack Pointer, bits 11 through 13. TOP is a pointer to the FPU data register that is currently at the top of the FPU register stack. This pointer is a binary value from 0 to 7.
  • B, FPU busy, bit 15.


FPU Control Register
The 16-bit control word controls the precision of the x87 FPU and the rounding method used, and also contains the FPU floating-point exception mask bits.
  • Bits 0 through 5 are exception mask bits.
  • The precision-control (PC) field (bits 8 and 9 of the FPU control word) determines the precision (64, 53, or 24 bits of significand) of floating-point calculations made by the FPU. By default double extended precision, which uses the full 64-bit significand, is selected.
  • The rounding-control (RC) field of the FPU control register (bits 10 and 11) controls how the results of FPU floating-point instructions are rounded (see table Rounding Modes and Encoding of Rounding Control (RC) Field).

        
    Table: Rounding Modes and Encoding of Rounding Control (RC) Field
    Rounding Mode    | RC Field Setting | Description
                     | (binary)         |
    -----------------+------------------+-------------
    Round to nearest | 00               | Rounded result is the closest to the infinitely
    (even)           |                  | precise result. If two values are equally close,
                     |                  | the result is the even value (that is, the one with
                     |                  | the least-significant bit of zero). Default mode.
                     |                  |
    Round down       | 01               | Rounded result is closest to but no greater than
                     |                  | the infinitely precise result.
                     |                  |
    Round up         | 10               | Rounded result is closest to but no less than the
                     |                  | infinitely precise result.
                     |                  |
    Round toward zero| 11               | Rounded result is closest to but no greater in absolute
    (Truncate)       |                  | value than the infinitely precise result.
    
  • Bits 6, 7 and 13-15 are not used.
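As a quick way to get a feel for the two most important modes, note that Python's built-in round() happens to follow the same round-half-to-even rule as the FPU's default mode, while math.trunc() corresponds to round toward zero:

```python
import math

# Round to nearest (even): when a value lies exactly halfway between two
# representable results, the even one is chosen ("banker's rounding").
halfway = [0.5, 1.5, 2.5, 3.5]
rounded = [round(x) for x in halfway]      # ties go to the even neighbour

# Round toward zero (truncate): the fractional part is simply discarded.
truncated = [math.trunc(x) for x in (1.7, -1.7)]
```

Note how 0.5 and 2.5 both round to the even neighbour, while truncation moves both 1.7 and -1.7 toward zero.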


FPU Tag Word Register


The 16-bit tag word indicates the contents of each of the 8 registers in the FPU data-register stack (one 2-bit tag per register). The tag codes indicate whether a register contains a valid number (00), zero (01), a special floating-point number such as NaN, infinity, denormal, or an unsupported format (10), or whether it is empty (11).

FPU data registers


The FPU data registers consist of eight 80-bit registers. Values are stored in these registers in the double extended-precision floating-point format:


79 78          64 63                          0
+-+--------------+----------------------------+
|S|   exponent   |         significand        |
+-+--------------+----------------------------+
 |       |                     |
 |       |                     significand or coefficient (64 bits)
 |       exponent (15 bits)
 sign (1 bit)

When floating-point, integer, or packed BCD integer values are loaded from memory into any of the FPU data registers, the values are automatically converted into double extended-precision floating-point format (if they are not already in that format). When computation results are subsequently transferred back into memory from any of the x87 FPU registers, the results can be left in the double extended-precision floating-point format or converted back into a shorter floating-point format, an integer format, or the packed BCD integer format.
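Python cannot manipulate the 80-bit format directly, but the analogous field layout of the ordinary 64-bit double (1-bit sign, 11-bit biased exponent, 52-bit fraction) can be inspected with a short sketch; the function name decode_double is mine. The 80-bit FPU format differs mainly in its 15-bit exponent and the explicit integer bit of its 64-bit significand:

```python
import struct

def decode_double(x):
    """Split a 64-bit IEEE 754 double into its sign, biased exponent
    and fraction fields."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]  # raw 64 bits
    sign = bits >> 63                       # bit 63
    exponent = (bits >> 52) & 0x7FF         # bits 62-52, biased by 1023
    fraction = bits & ((1 << 52) - 1)       # bits 51-0
    return sign, exponent, fraction
```

For example, decode_double(1.0) gives sign 0, biased exponent 1023 (i.e. a true exponent of 0) and a zero fraction.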

The eight FPU data registers R0-R7 are treated as a register stack where R7 is the base and the stack grows towards R0. All addressing of the data registers is relative to the register on the top of the stack. The register number of the current top-of-stack register is stored in the TOP field of the FPU status word. The current TOP register is always named ST(0) or simply ST, and ST(i) is used to specify the i-th register from TOP in the stack, where i = 0, ..., 7:

      
      FPU Data Register Stack
      
         7 xxx ST(3)
         6 xxx ST(2) 
         5 xxx ST(1) 
         4 xxx ST(0) <--- TOP = 100_(2)
         3 xxx
         2 xxx
         1 xxx
         0 xxx
         
      Stack growth: the stack grows from the higher register (R7) towards the lower one (R0).
      
      The register-stack organization of the FPU reflects the way arithmetic operations, and in particular the evaluation of mathematical expressions, are performed. Unlike the "ordinary" part of the code, which is processed by the general-purpose execution unit, numerical calculations are most conveniently carried out according to the RPN (Reverse Polish Notation) scheme explained earlier, and it is to this scheme that the design of the floating-point unit is adapted.

      Load operations decrement TOP by one and load a value into the new top-of-stack register; store operations store the value from the current TOP register in memory and then increment TOP by one (note that there are also load and store operations that do not move the top of the stack). You can think of a load operation as a push and of a store operation as a pop.

      The stack is logically organized as a circular queue. If a load operation is performed when TOP is at 0, register wraparound occurs and the new value of TOP is set to 7. The floating-point stack-overflow exception indicates when wraparound might cause an unsaved value to be overwritten. Many floating-point instructions have several addressing modes that permit the programmer to implicitly operate on the top of the stack, or to explicitly operate on specific registers relative to TOP.

      FPU addressing modes


      • Stack mode. In this mode an instruction is written without any arguments -- by default registers ST(0) and ST(1) are used, and the two operands are replaced by a single result on the top of the stack:
        
        FADD --> FADDP ST(1), ST(0) --> ST(1) + ST(0) -> ST(1) and free ST(0)
        
      • Register mode. In this mode two arguments are used: ST(0) and ST(i):
        
        FADD ST(0), ST(i) --> ST(0) + ST(i) -> ST(0)
        FADD ST(i), ST(0) --> ST(i) + ST(0) -> ST(i)
        
      • Register mode with stack pop. In this mode the source operand is on the top of the stack and the destination is register ST(i). When the instruction completes, the source operand is popped from the stack:
        
        FADDP ST(i), ST(0) --> ST(i) + ST(0) -> ST(i) and free ST(0)
        
      • Mode with memory argument. In this mode the source operand is taken from memory and the destination is ST(0):
        
        FADD memory --> ST(0) + memory -> ST(0)
        


      FPU stack usage example
      Typically the stack structure of the FPU registers and instructions is used in the following way. Assume that you want to calculate the simple dot product of two vectors, $v_1 = [1.2, 3.4]$ and $v_2=[5.6, 7.8]$, and that TOP contains 100 (binary), which means that register R4 is the top of the stack. You can do this with the code:

      
      FLD  qword [vec1]
      FMUL qword [vec2]
      FLD  qword [vec1 + 8]
      FMUL qword [vec2 + 8]
      FADD ST(1)
      


      • FLD qword [vec1] This instruction decrements the stack register pointer (TOP) and loads the value 1.2 from memory into ST(0) (physical register R3).
      • FMUL qword [vec2] The second instruction multiplies the value in ST(0) by the value 5.6 from memory and stores the result in ST(0). At this moment register R3 contains the value 6.72.
      • FLD qword [vec1 + 8] The third instruction decrements TOP and loads the value 3.4 into ST(0). At this moment register R3 contains the value 6.72 while R2 contains 3.4.
      • FMUL qword [vec2 + 8] The fourth instruction multiplies the value in ST(0) by the value 7.8 from memory and stores the result in ST(0). At this moment register R3 contains the value 6.72 and R2 contains the value 26.52.
      • FADD ST(1) The fifth instruction adds the value in ST(0) to the value in ST(1) and stores the result in ST(0). At this moment register R3 still contains the value 6.72 while R2 (the top of the stack) contains the final result 33.24.


      Usage examples


      Instructions related to the FPU internals


      In the example below I use 32-bit code, as it is much easier to prepare in a loop all the data required by the printf function than it would be with the 64-bit calling convention. Writing the corresponding 64-bit code is left to you as an exercise.

      
      section .data
      
      fmt: db 10,"exception: %d",10,"top: %d",10,"R7 %d",10,"R6 %d",10,"R5 %d"
           db 10,"R4 %d",10,"R3 %d",10,"R2 %d",10,"R1 %d",10,"R0 %d",10,0
      
      section .bss 
          
      env: resd 7                 ; You need 28 bytes for saving the current
                                  ; FPU operating environment
      
      section .text
      
      extern printf
      
      global main
      
      main:
      
        finit                     ; Initialize FPU
        fld1                      ; Push +1.0 onto the FPU register stack.
        fld1
        fld1
        fld1
        call aux_print            ; Call auxiliary print code
        faddp st3, st0            ; Add ST(0) to ST(i) (in this case i=3),
                                  ; store result in ST(i), and pop the
                                  ; register stack. 
        call aux_print
      
      ; Exit
        mov   eax, 0              ; Exit code, 0=normal
        ret                       ; Main returns to operating system
      
      ; Auxiliary print code
      aux_print:
        fstenv [env]              ; Saves the current FPU operating environment
                                  ; at the memory location specified with
                                  ; the destination operand
        xor eax, eax
        mov ax, [env+8]           ; Copy to AX contents of the FPU tag word
      
        mov ecx, 0                ; Set counter as 0
      
        loop:                     ; do-while loop begin
          mov ebx, eax            ; At the beginning EAX = Tag Word
          and ebx, 3              ; Extract bits 0 and 1
          shr eax, 2              ; Shift right to extract next two bits
                                  ; in next iteration
          push ebx                ; Save extracted two bits on the stack
        
          inc ecx                 ; Increase value of the counter
          cmp ecx, 8              ; While condition test
        jne loop                  ; do-while loop end
      
      
        xor eax, eax              ; Clear eax register
        fstsw ax                  ; Save status word
        mov ebx, eax
        shr bx, 11                ; Shift BX right by 11 to get top-of-stack
                                  ; (TOP) pointer value
        and bx, 7                 ; A bit-wise AND of the two operands:
                                  ; BX and binary pattern 111 
        push ebx                  ; Save TOP on the stack
      
        mov ebx, eax              ; Prepare to extract some exception flags
               ;xxxxxxxxx1xxxxx1    bit 6 - Stack Fault         (64 decimal)
                                  ; bit 0 - Invalid Operation   ( 1 decimal)
        and bx, 0000000001000001b ; A bit-wise AND of the two operands:
                                  ; BX and binary pattern 1000001
        push ebx                  ; Save some exceptions flags's bits on the stack
        
        push  fmt                 ; Address of format string
        call  printf              ; Call C function
        add   esp, 44             ; Pop stack 11*4 bytes
        ret
      
      ; End of the code
      
      Expected output is:

      
      fulmanp@fulmanp-ThinkPad-T540p:~$ nasm -f elf fpu_test_01_32.asm -o fpu_test_01_32.o
      fulmanp@fulmanp-ThinkPad-T540p:~$ gcc -m32 fpu_test_01_32.o -o fpu_test_01_32 -no-pie
      /usr/bin/ld: warning: fpu_test_01_32.o: missing .note.GNU-stack section implies executable stack
      /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
      fulmanp@fulmanp-ThinkPad-T540p:~$ ./fpu_test_01_32 
      
      exception: 0
      top: 4
      R7 0
      R6 0
      R5 0
      R4 0
      R3 3
      R2 3
      R1 3
      R0 3
      
      exception: 0
      top: 5
      R7 0
      R6 0
      R5 0
      R4 3
      R3 3
      R2 3
      R1 3
      R0 3
      


      The code should be easy to understand thanks to the comments. Below I give extended information about some parts:
      • finit
        FINIT sets the FPU control, status, tag, instruction pointer, and data pointer registers to their default states. The FPU control word is set to 037FH (round to nearest, all exceptions masked, 64-bit precision). The status word is cleared (no exception flags set, TOP is set to 0). The data registers in the register stack are left unchanged, but they are all tagged as empty (11B). Both the instruction and data pointers are cleared.
      • fld1
        FLDx, where x is one of 1, L2T, L2E, PI, LG2, LN2, or Z, pushes one of seven commonly used constants (in double extended-precision floating-point format) onto the FPU register stack. The constants that can be loaded with these instructions are +1.0 (1), +0.0 (Z), $\log_{2}10$ (L2T), $\log_{2}e$ (L2E), $\pi$ (PI), $\log_{10}2$ (LG2), and $\log_{e}2$ (LN2).
      • faddp st3, st0
        Adds the destination and source operands and stores the sum in the destination location. In this case FADDP ST(i), ST(0) (for i=3) adds ST(0) to ST(i), stores the result in ST(i), and pops the register stack.

        The destination operand is always an FPU register; the source operand can be a register or a memory location. Source operands in memory can be in single-precision or double-precision floating-point format or in word or doubleword integer format. Please check [intel_doc_01] for reference to the other floating-point add instructions (FADD/FADDP/FIADD).
      • fstenv
        Instruction FSTENV saves the current FPU operating environment at the memory location specified with the destination operand, and then masks all floating-point exceptions. The FPU operating environment consists of the FPU: control word, status word, tag word, instruction pointer, data pointer, and last opcode.

        Figures 8-9 through 8-12 in [intel_doc_01] show the layout in memory of the stored environment, depending on the operating mode of the processor (protected or real) and the current operand-size attribute (16-bit or 32-bit). In virtual-8086 mode, the real-mode layouts are used. According to these figures, 14 or 28 bytes are needed to save all values.

        The 32-bit protected-mode layout is:
        
        31          16 15           0
        |xxxxxxxxxxxx|Control Word | B3  - B0
        |xxxxxxxxxxxx|Status Word  | B7  - B4
        |xxxxxxxxxxxx|Tag Word     | B11 - B8
                                   | B15 - B12
                                   | B19 - B16
                                   | B23 - B20
                                   | B27 - B24
        
        x - not used
        Contents of bytes B12-B27 depend on the mode
        and are not relevant to this example.
        
      • mov ax, [env+8]
        Copies the contents of the FPU tag word into AX. Next you extract each 2-bit pair and associate it with a floating-point register.


      FPU control word usage


      This code should also be easy to follow. It shows how to control the rounding mode. Please experiment with the rounding modes and work through some examples of how they behave.

      The code in 32-bit version:
      
      section .data
      
      fmt: db "result is %d", 10, 0
      a:  dq  2.5
      b:  dq  3.0
      
      section .bss
          
      tmp: resq 1
      buf: resw 1
      
      section .text
      
      extern printf
      
      global main
      
      main:
      
        finit             ; Initialize FPU
        fstcw [buf]       ; Save control word
                      ;xxxx11xxxxxxxxxx
        or word [buf], 0000110000000000b ; Bits 11-10 controls rounding:
                          ; 00 round to nearest (default),
                          ; 01 round down, [0,1) -> 0 [1,2) -> 1 [-1,0) -> -1 [-2,-1) -> -2
                          ; 10 round up,   (0,1] -> 1 (1,2] -> 2 (-1,0] ->  0 (-2,-1] -> -1
                          ; 11 round toward zero [0,1) -> 0 [1,2) -> 1 (-1,0] ->  0 (-2,-1] -> -1
                          ;     for positives behaves like round down
                          ;     for negatives behaves like round up
        fldcw [buf]       ; Load updated control word
          
        fld  qword [a]    ; Load a to FPU
        fmul qword [b]    ; Multiply by b
        fist dword [tmp]  ; Cast result to int
      
        push  dword [tmp]
        push  fmt
        call  printf
        add   esp, 8
      
      ; Exit
        mov   eax, 0      ; Exit code, 0=normal
        ret               ; Main returns to operating system
      ; End of the code
      


      Expected output is:

      
      fulmanp@fulmanp-ThinkPad-T540p:~$ nasm -f elf fpu_test_02_32.asm -o fpu_test_02_32.o
      fulmanp@fulmanp-ThinkPad-T540p:~$ gcc -m32 fpu_test_02_32.o -o fpu_test_02_32 -no-pie
      /usr/bin/ld: warning: fpu_test_02_32.o: missing .note.GNU-stack section implies executable stack
      /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
      fulmanp@fulmanp-ThinkPad-T540p:~$ ./fpu_test_02_32 
      result is 7
      


      FPU status word usage


      
      section .data
      
      fmt: db "status word value %d", 10, 0
      a:  dq  2.5
      b:  dq  0.0
      
      section .bss
          
      tmp: resq 1
      buf: resw 1
      
      section .text
      
      extern printf
      
      global main
      
      main:
      
        finit             ; Initialize FPU
        fld  qword [a]    ; Load a to FPU
        fdiv qword [b]    ; Divide by b
        
        xor eax, eax
        fstsw ax          ; Stores the current value of the FPU status word
                          ; in the destination location. The destination
                          ; operand can be either a two-byte memory location
                          ; or the AX register.
      
        push  eax
        push  fmt
        call  printf
        add   esp, 8
      
      ; Exit
        mov   eax, 0      ; Exit code, 0=normal
        ret               ; Main returns to operating system
      ; End of the code
      


      Expected output is:

      
      fulmanp@fulmanp-ThinkPad-T540p:~$ nasm -f elf fpu_test_03_32.asm -o fpu_test_03_32.o
      fulmanp@fulmanp-ThinkPad-T540p:~$ gcc -m32 fpu_test_03_32.o -o fpu_test_03_32 -no-pie
      /usr/bin/ld: warning: fpu_test_03_32.o: missing .note.GNU-stack section implies executable stack
      /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
      fulmanp@fulmanp-ThinkPad-T540p:~$ ./fpu_test_03_32 
      status word value 14340
      
      Decimal value 14340 is equal to binary 11100000000100, which means that the ZE (Zero Divide) flag (bit 2) was set. We can also see that TOP has the decimal value 7 (binary 111, bits 13-11).

      Problems you can try to solve


      Problem 1: Write equivalent 64-bit codes for given 32-bit codes.


      Problem 2: FPU stack overflow


      Write code to test FPU stack overflow. In other words, push more than eight values onto the stack.