Chapter 6
FPU

Initial version: 2025-03-12
Last update: 2025-04-01

From a human's perspective numbers are just numbers and it does not matter whether you add 1 to 2 or 1.0 to 2.0. From a computer's perspective these are two completely different objects handled by two completely different "modules" – either a separate part of the CPU or a software emulation of one. Yes, real numbers (floating-point numbers) are different from integers and require a separate way of processing. In this chapter you will learn how to operate on such numbers.

Table of contents


Know the place where you are


In the contemporary x86 technology portfolio there are a lot of names like FPU, MMX, SSE, or AVX. Each brings to this noble architecture some novelty in data processing, allowing you to write better, in most cases faster, code. Below I will briefly characterize each of them so you will know what you can use them for.

  • FPU The x87 Floating-Point Unit (FPU), also known as the co-processor, used to be an option when the first PCs came on the market. It provides high-performance floating-point processing capabilities and supports mostly floating-point but also integer and packed BCD integer data types, together with the floating-point processing algorithms and exception handling architecture defined in the IEEE Standard 754 for Binary Floating-Point Arithmetic. Modern PCs all come with a co-processor built in. It is worth noting that although the original PC-XT architecture (especially the CPU) has evolved considerably over the years, the FPU itself has hardly changed during that same period. The entire set of assembler instructions for the FPU is relatively small and the main difficulty is to avoid some of the pitfalls peculiar to the FPU.

    The answer to the question why the FPU is stuck in ancient times has two parts:
    • To keep backward compatibility. Whether you like it or not, this is something that characterizes the x86 architecture. Even now, more than 40 years after the first PC XT launch (on March 8, 1983), you can execute every program ever made for this architecture on any of its newest descendants. This gives you great flexibility but also limits further evolution.
    • Intel applies something that I call inherited evolution. Because of the previous argument, instead of changing something that exists they rather extend the architecture by adding new components (and preserving the existing ones in untouched form). In consequence a new architecture inherits everything from its predecessor and extends it with some new features. So if you are searching for a "new FPU" you should look not at the FPU itself but rather at other technologies. Keep reading and you will find them.


    The FPU executes instructions from the processor’s normal instruction stream. The state of the FPU is independent from the state of the basic execution environment and from the state of SSE/SSE2/SSE3 extensions. However, the FPU and MMX instructions share state because the MMX registers are aliased to the x87 FPU data registers. Therefore, when writing code that uses FPU and MMX instructions, the programmer must explicitly manage the x87 FPU and MMX state.

    How fast was (is) the FPU? Well, I will give the floor to one of the computer users from those times.

    The overall speed-up from using an 8087 instead of software floating routines is likely a good deal better than 100x. Back in the day when I was still in college, the class had an assignment to write a FORTRAN program that did a Fourier analysis. Over the holidays, I wrote the program on a COMPAQ Portable with the standard 4.77 MHz 8088 and no math coprocessor. When I ran the program to test it, I thought I had an infinite loop somewhere after the program did not complete after five or so minutes. After hours of debugging and finding no errors, someone suggested to me to let the program run and take a long break. Much to my surprise, the program completed, successfully and with the correct results, after about half an hour.

    When I went to submit the program for grading, the teaching assistants in the lab would run the program with another dataset they had. The computers in the lab were standard IBM PCs with 4.77 MHz 8088 CPUs but they also had the 8087 installed. I was fully prepared to wait 30 minutes but, being that the machines had 8087s, I had guessed the program would complete execution in five or ten minutes.

    To my utter shock, the program completed in one or two seconds. The MS-DOS cursor reappeared almost immediately after the TA had pressed the Enter key to run my program.

    Of course, this is just one anecdote and I was very inexperienced at writing programs at the time so it is entirely possible that the program I wrote was unusually bad. But that is one of the war stories I've accumulated over the years from developing software where a very surprising result made an indelible impression on me. [anon_01]

    Below I present execution times for selected 8087 numerical instructions and corresponding 8086 emulations [doc_01]:

    
    Comparison of 8087 and 8086 Clock Times
    
                                | Approximate execution time (in us)
                                +-------------+---------------------
    Instruction                 | 8087        | 8086 Emulation
                                | 8 MHz clock |
    ----------------------------+-------------+---------------------            
    Add/Subtract                | 10          | 1000
    Multiply (single precision) | 11.9        | 1000
    Multiply (double precision) | 16.0        | 1312
    Divide (single precision)   | 24.4        | 2000
    Compare                     |  5.6        | 812
    Load (double precision)     |  6.3        | 1062
    Store (double precision)    | 13.1        | 750
    Square root                 | 22.5        | 12250
    Tangent                     | 56.3        | 8125
    Exponentiation              | 62.5        | 10687
    
  • MMX I will make a step backward into the distant past and say a few words about a crucial data revolution which changed the market and the industry.

    There was a time, around the '90s, when my HDD had a capacity of 120 MiB, while the most typical was 80 MiB and many users had 60 MiB of storage (yes, mebibytes MiB, not gibibytes GiB as today). And then a game changer appeared.

    In 1980 Sony and Philips finished their work on the Compact Disc Digital Audio (CD-DA) – a new data medium capable of storing digital audio data, offering the best sound quality and spacious enough to hold an hour of recording. And that was a challenge, because the amount of data that had to be saved was, for those times, huge – close to 600 MiB. This potential, due to the digital nature of the data, which could actually be anything, was noticed relatively quickly.

    The CD-DA was adapted to hold any form of digital data, with an initial storage capacity of 553 MiB – this "extended" form of CD-DA was named CD-ROM (Compact Disc Read Only Memory). Sony and Philips created the technical standard that defines the format of a CD-ROM in 1983, in what came to be called the Yellow Book. The CD-ROM was announced in 1984 and introduced by Denon and Sony at the first Japanese COMDEX computer show in 1985. In November 1985, several computer industry participants, including Microsoft, Philips, Sony, Apple and Digital Equipment Corporation, met to create a specification to define a file system format for CD-ROMs. The resulting specification, called the High Sierra format, was published in May 1986. It was eventually standardized, with a few changes, as the ISO 9660 standard in 1988. One of the first products to be made available to the public on CD-ROM was the Grolier Academic Encyclopedia, presented at the Microsoft CD-ROM Conference in March 1986.

    So having the CD-ROM, creators could provide considerably more information and data. Especially, they could finally enrich the content with an abundance of multimedia materials like audio, images and video. This way the problem of storing a huge amount of data was transformed into another one: how to process it efficiently. Remember, this story takes place in times when no multicore and multithreaded CPUs were available. The amount of data is huge, but it is a very specific kind of data. Consider for example an image and its brightness. If you want to change it, you have to add the same value to every pixel. A lot of operations, but all of them are exactly the same. The same instruction on different data – single instruction, multiple data, you would say. All of this was an impulse for Intel to design and later introduce (on January 8, 1997), with its Pentium microprocessors named "Pentium with MMX Technology", an extension of the basic instruction set known as MMX (Multimedia Extensions or sometimes also Matrix Math Extensions) – 57 new instructions of the SIMD type. Technically Intel engineers faced, as always, a big challenge. They had to introduce a revolutionary approach without breaking compatibility with previous CPUs – new programs had to run faster, while old programs still had to execute "as before".

    MMX defines eight processor registers, named MM0 through MM7, and operations that operate on them. To avoid compatibility problems with the context switch mechanisms in existing operating systems, the MMX registers are aliases for the existing x87 floating-point unit (FPU) registers, which context switches would already save and restore. However, unlike the x87 registers, which behave like a stack, the MMX registers are each directly addressable (random access). Each 64-bit MMX register corresponds to the mantissa part of an 80-bit x87 register. The upper 16 bits of the x87 registers thus go unused in MMX, and these bits are all set to ones, making them Not a Number (NaN) data types, or infinities in the floating-point representation.

    MMX introduced new data types called packed, but a much better term to describe their nature is vector. This "packing" involves treating 64-bit data (a sequence of 64 bits) as consisting of a number of separate items (subsequences), all of the same size:

    • 1 times 64 bits (quad word),
    • 2 times 32 bits (packed dword),
    • 4 times 16 bits (packed word),
    • 8 times 8 bits (packed byte).
    When operations are performed on vector types ("packed"), the same operation is performed for all cells simultaneously. For example, if two packed byte (8 times 8 bits) vectors are added, a single instruction performs eight 8-bit addition operations at once, and eight 8-bit results are stored.
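    To make the idea of "packing" concrete, here is a minimal Python sketch that models eight 8-bit additions carried out on two 64-bit operands at once. The function and helper names (paddb, after the MMX PADDB instruction, unpack_bytes, pack_bytes) are mine, chosen for illustration only:

    ```python
    def unpack_bytes(value):
        """Split a 64-bit value into eight 8-bit lanes (packed byte),
        lowest byte first."""
        return [(value >> (8 * i)) & 0xFF for i in range(8)]

    def pack_bytes(lanes):
        """Combine eight 8-bit lanes back into one 64-bit value."""
        result = 0
        for i, lane in enumerate(lanes):
            result |= (lane & 0xFF) << (8 * i)
        return result

    def paddb(a, b):
        """Toy model of a packed-byte add (wraparound): eight independent
        8-bit additions performed on two 64-bit operands."""
        return pack_bytes([(x + y) & 0xFF
                           for x, y in zip(unpack_bytes(a), unpack_bytes(b))])
    ```

    A single call to paddb thus stands for what MMX performs in one instruction instead of eight separate byte additions.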

    MMX provides only integer operations and distinguishes between two types of arithmetic, in which the response to exceeding the range of numbers is different:

    • Modulo arithmetic (wraparound), in which exceeding the range is not signaled in any way and only the least significant bits that fit in the result word are stored:

      
        8 bit case:
          
        11001000_(2) = 200_(10)
        11001000_(2) = 200_(10)
       ------------------------ ADD
       110010000_(2) = 400_(10) exact result
                  |
                  | wraparound by truncation
                  V
       X10010000_(2) = 144_(10) = 400 mod 256
       |
       truncate
      
    • Saturation arithmetic, in which a result exceeding the range that a given data type can hold is clamped to the extreme value:

      
        8 bit case:
          
        11001000_(2) = 200_(10)
        11001000_(2) = 200_(10)
       ------------------------ ADD
       110010000_(2) = 400_(10) exact result
                  |
                  | maximum value on 8 bits
                  | is 255
                  V
        11111111_(2) = 255_(10)
      
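    The two kinds of arithmetic can be sketched in Python as follows (a toy model for unsigned 8-bit values; the function names are mine):

    ```python
    def add8_wraparound(a, b):
        """Modulo (wraparound) arithmetic: keep only the least
        significant 8 bits of the exact result."""
        return (a + b) & 0xFF

    def add8_saturate(a, b):
        """Saturation arithmetic (unsigned): results above 255 are
        clamped to 255."""
        return min(a + b, 255)

    # The example from the diagrams above: 200 + 200
    wrapped = add8_wraparound(200, 200)   # 144, i.e. 400 mod 256
    clamped = add8_saturate(200, 200)     # 255, the maximum 8-bit value
    ```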


    Typical MMX applications include:

    • decoding images like JPEG or PNG;
    • decoding and encoding MPEG movies;
    • displaying two-dimensional graphics (blue box, masking, transparency);
    • displaying three-dimensional graphics: geometric transformations, shading, texturing;
    • filtering signals in the form of static images, movies, sound;
    • determining transforms like Haar or FFT.
    It is not surprising that programs using MMX instructions were much faster than analogous programs using ordinary processor instructions. In non-multimedia-centered computer programs, the benefit of using MMX is practically none. MMX and the idea behind it have subsequently been extended by Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX).
  • SSE (Streaming SIMD Extensions) is a family of successively introduced instruction sets and architecture changes (SSE – 1999, SSE2 – 2000, SSE3 – 2004, SSE4 – 2007 and SSE5 – 2009). All of them focus on applying the MMX approach to real (floating-point) numbers.

    SSE floating-point instructions operate on a new independent register set, the XMM registers; SSE also adds a few integer instructions that work on the MMX registers. SSE originally added eight new 128-bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD (originally called x86-64) added a further eight registers XMM8 through XMM15, and this extension is duplicated in the Intel 64 architecture. There is also a new 32-bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64-bit operating mode.

    SSE used only a single data type for XMM registers: four 32-bit single-precision floating-point numbers.

    SSE2 would later expand the usage of the XMM registers to include:
    • two 64-bit double-precision floating-point numbers or
    • two 64-bit integers or
    • four 32-bit integers or
    • eight 16-bit short integers or
    • sixteen 8-bit bytes or characters.
    The addition of integer support in SSE2 made MMX largely redundant.

    Because these 128-bit registers are additional machine states that the operating system must preserve across task switches, they are disabled by default until the operating system explicitly enables them. This means that the OS must know how to use the FXSAVE and FXRSTOR instructions, which is the extended pair of instructions that can save all x86 and SSE register states at once. This support was quickly added to all major IA-32 operating systems.
  • AVX AVX (Advanced Vector Extensions) is another instruction set of the SIMD type, introduced in 2011. It seems to be an evolutionary step from SSE. Why did Intel decide to change the name to AVX? I don't know. In its initial shape this set uses sixteen YMM registers, which are the XMM registers extended from 128 bits to 256 bits and renamed from XMM0–XMM7 to YMM0–YMM7 (in x86-64 mode, from XMM0–XMM15 to YMM0–YMM15). Each YMM register can hold and do simultaneous operations on:
    • eight 32-bit single-precision floating point numbers or
    • four 64-bit double-precision floating point numbers.
    AVX introduces a three-operand SIMD instruction format (the VEX coding scheme), where the destination register is distinct from the two source operands. For example, an instruction that so far used the conventional two-operand form a := a + b can now use a non-destructive three-operand form c := a + b, preserving both source operands. Such a form of instruction is very typical for RISC architectures.

    Major changes introduced by further versions of this set of instructions can be summarized as follows:
    • AVX2 (2013):
      • expansion of most vector integer SSE and AVX instructions to 256 bits.
    • AVX-512 (2017):
      • extend register width from the 256-bit to the 512-bit;
      • introduce various new operations, such as new data conversions, scatter operations, and permutations;
      • the number of AVX registers is increased from 16 to 32 (ZMM0–ZMM31). These registers can be addressed as 256-bit YMM registers (from the AVX extensions) and as 128-bit XMM registers (from the Streaming SIMD Extensions). Legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16–XMM31 and YMM16–YMM31;
      • designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty.
    • AVX10 (announced in July 2023):
      • addresses several issues of AVX-512, in particular that it is split into too many parts; it simplifies detection of supported instructions by introducing a version of the instruction set, where each subsequent version includes all instructions from the previous one;
      • later AVX10 versions will introduce new features.


    I have to warn you that not all AVX-512 CPUs offer the same instruction sets. Unlike its previous iterations, this vector instruction set consists of 19 subsets: a core foundation, AVX-512F, that has to be offered to be compliant, and then a raft of very specific ones. These extra sets cover operations such as reciprocal math, integer FMA, or convolutional neural network algorithms. What is more, since in practice the use of AVX, in any form, results in the clock frequency being automatically decreased, the use of AVX-512 on such platforms would almost certainly be worse than using any of its predecessors, as it is even more demanding of power when running. [eva_20]


Final words


Now you have some background related to the FPU and I can move on to discussing its internal structure and the way you can use it.

Before that I want to make a few remarks:
  • Reading the preceding part, you may come to the conclusion that today the FPU is seldom used because you have much more efficient alternatives. Even if you are right, for theoretical, conceptual and purely didactic reasons it is good to spend some time writing a few, even simple, programs in the old-fashioned FPU style.
  • If you think the FPU is useless, keep in mind that vectorized instructions are good but they are not as useful as they seem to be, and there are a lot of voices saying that this is a road to nowhere.
    • MMX, (S)SSE, and AVX aims to help with working with multiple smaller numbers or decimal points all at once. However, less than 10% of real software uses SIMD, and this software only uses SIMD to speed up less than 1% of its code (yes, there are exceptions, but even 3D games usually defer their number-crunching to the GPU and sometimes omit SIMD entirely). SIMD seems great on paper, but nearly zero software gets a significant speed-up from it. Yes, yes, individual sections get sped up. What I'm saying is that the software still spends a significant amount of time doing non-simidizable operations (especially branching) which account for most of the execution time. Additionally, it often takes a lot of time and testing for the developer to convert scalar code to use SIMD. [reddit_01]
    • I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on.

      I hope Intel gets back to basics: gets their process working again, and concentrate more on regular code that isn't HPC or some other pointless special case.

      I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota.

      Because absolutely nobody cares outside of benchmarks.

      The same is largely true of AVX512 now - and in the future. Yes, you can find things that care. No, those things don't sell machines in the big picture.

      And AVX512 has real downsides. I'd much rather see that transistor budget used on other things that are much more relevant. Even if it's still FP math (in the GPU, rather than AVX512). Or just give me more cores (with good single-thread performance, but without the garbage like AVX512) like AMD did.

      I want my power limits to be reached with regular integer code, not with some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!) and takes away cores (because those useless garbage units take up space).

      Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market.

      Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough. [linus]
  • As always, a must-read document about the FPU, as for any other aspect of the Intel architecture, is the official documentation [intel_doc_01].


Reverse Polish Notation – a brilliant idea for performing calculations


To understand why the x87 is organized the way it is, you have to understand Reverse Polish Notation (RPN) – a brilliant idea for performing calculations.

We commonly use infix notation, in which operators are placed between operands:


OPERAND   OPERAND
  | OPERATOR  |
  |     |     |
  123   +   321 
You may ask: "Are there any alternatives?" Yes, there are. You may place the operator so it precedes its operands, or the operator may follow its operands:


OPERATOR OPERAND OPERAND
   |        |       |
   +       123     321
   
   
OPERAND OPERAND OPERATOR 
   |       |       |
  123     321      +
In the first case, when the operator precedes its operands, you have prefix notation. In the latter, when the operator follows its operands, you have postfix notation.

Because prefix notation was invented by the Polish mathematician Jan Łukasiewicz (in the 1920s), its alternative name is Polish Notation. In the late 1950s, Australian philosopher and computer scientist Charles L. Hamblin suggested placing the operator after the operands and hence created the widely used Reverse Polish Notation.

RPN has the property that brackets are not required to represent the order of evaluation or grouping of the terms. RPN expressions are simply evaluated from left to right and this greatly simplifies the computation of the expression within computer programs. In practice RPN can be conveniently evaluated using a stack structure.

To evaluate an expression in RPN, a stack-based algorithm is commonly used. Here is how it works:
  1. Start with an empty stack.
  2. Scan the expression from left to right.
  3. If the current token is an operand, push it onto the stack.
  4. If the current token is an operator, pop its operands from the top of the stack, apply the operator to them, and push the result back onto the stack.
  5. Continue this process (steps 2-4) until all tokens have been processed.
  6. The final result is the value remaining on the stack.



Infix expression: (6-2)*5
RPN expression  : 6 2 - 5 *

 1. token := next_token(expression) // token = 6
 2. Because token is a number, push it onto the stack  //  STACK: 6
 3. token := next_token(expression) // token = 2
 4. Because token is a number, push it onto the stack  //  STACK: 6 2
 5. token := next_token(expression) // token = -
 6. Because token is a two-operand operator, take two numbers
    from the stack and subtract the first from the second:
    op1 := pop()  // op1 = 2, STACK: 6
    op2 := pop()  // op2 = 6, STACK: (empty)
    tmp := op2 - op1  // tmp := 6 - 2
    push(tmp)     // STACK: 4
 7. token := next_token(expression) // token = 5
 8. Because token is a number, push it onto the stack  //  STACK: 4 5
 9. token := next_token(expression) // token = *
10. Because token is a two-operand operator, take two numbers
    from the stack and multiply the second by the first:
    op1 := pop()  // op1 = 5, STACK: 4
    op2 := pop()  // op2 = 4, STACK: (empty)
    tmp := op2 * op1  // tmp := 4 * 5
    push(tmp)     // STACK: 20
11. token := next_token(expression) // token = NULL
    End of calculation.
12. Result is on the top of the stack
    result := pop()
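The whole procedure can be condensed into a short Python function. This is a sketch: token handling is deliberately minimal, only the four basic operators are supported, and tokens are expected as a pre-split list:

```python
def evaluate_rpn(tokens):
    """Evaluate a list of RPN tokens using a stack."""
    ops = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
        "/": lambda x, y: x / y,
    }
    stack = []
    for token in tokens:
        if token in ops:
            op1 = stack.pop()            # first pop: top of stack
            op2 = stack.pop()            # second pop: the earlier operand
            stack.append(ops[token](op2, op1))
        else:                            # operand: push it onto the stack
            stack.append(float(token))
    return stack.pop()                   # result is on top of the stack
```

For example, evaluate_rpn("6 2 - 5 *".split()) reproduces the trace above and yields 20.0.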
As you can see, this is a very elegant and efficient calculation method. Any standard infix arithmetic expression can be easily converted to an RPN expression. A well-known algorithm for converting from infix to postfix notation is Dijkstra's Shunting Yard Algorithm. This algorithm uses a queue and a stack to do the conversion and could provide you with some good programming practice for the study of data structures.

To carry out the process manually, you can use the 'fully parenthesising' method.

Consider the infix expression: 5 + 2 * 4

Using the rules relating to operator precedence, add brackets to the whole expression so that the inner set of brackets denotes the first part of the expression to be evaluated: (5 + (2 * 4)).

Now move each infix operator to the end of its corresponding set of brackets (just before the right-hand bracket): (5 (2 4 *) +).

Finally remove the brackets: 5 2 4 * +

If you want to practice, here you have some more examples:


infix: (3 + 5) / (9 - 5)
RPN:   3 5 + 9 5 - /

infix: 28 / (6 + 2 * 4)
RPN:   28 6 2 4 * + /

infix: (5 + 7) / ((8 - 6) * 3)
RPN:   5 7 + 8 6 - 3 * /
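For completeness, here is a minimal Python sketch of the Shunting Yard algorithm mentioned above. It handles only the left-associative operators +, -, *, / and parentheses, and expects tokens separated by spaces:

```python
def infix_to_rpn(tokens):
    """Dijkstra's Shunting Yard: convert infix tokens to an RPN string."""
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}   # operator precedence
    output, stack = [], []
    for token in tokens:
        if token in prec:
            # pop operators of greater or equal precedence first
            while stack and stack[-1] in prec and prec[stack[-1]] >= prec[token]:
                output.append(stack.pop())
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack[-1] != "(":           # pop until the matching "("
                output.append(stack.pop())
            stack.pop()                       # discard the "("
        else:                                 # operand goes straight to output
            output.append(token)
    while stack:                              # flush the remaining operators
        output.append(stack.pop())
    return " ".join(output)
```

You can check it against the practice examples, e.g. infix_to_rpn("28 / ( 6 + 2 * 4 )".split()).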


FPU internals


The FPU executes instructions from the processor’s normal instruction stream. The state of the FPU is independent from the state of the basic execution environment and from the state of SSE extensions. However, the FPU and MMX instructions share state because the MMX registers are aliased to the x87 FPU data registers. Therefore, when writing code that uses FPU and MMX instructions, you must explicitly manage the x87 FPU and MMX state.

The FPU execution environment consists of eight 80-bit data registers (internally named R0 to R7; note that the names R0-R7 cannot be used by the programmer – instead the names ST(0)-ST(7) are used, as will be clarified below) and the following special-purpose registers:
  • status register (16-bit),
  • control register (16-bit),
  • tag word register (16-bit),
  • last instruction pointer register (48-bit),
  • last data (operand) pointer register (48-bit),
  • opcode register (11-bit).


FPU status register
The 16-bit FPU status register indicates the current state of the floating-point unit. The FPU sets the flags in this register to show the results of operations.

  • Exception flags (bits 0-6)
    • IE, Invalid Operation, bit 0
    • DE, Denormalized Operand, bit 1
    • ZE, Zero Divide, bit 2
    • OE, Overflow, bit 3
    • UE, Underflow, bit 4
    • PE, Precision, bit 5
    • SF, Stack Fault Flag, bit 6
    The stack fault flag indicates that stack overflow or stack underflow has occurred. The FPU explicitly sets the SF flag when it detects a stack overflow or underflow condition, but it does not explicitly clear the flag when it detects an invalid-arithmetic-operand condition. When this flag is set, the condition code flag C1 indicates the nature of the fault: overflow (C1 = 1) or underflow (C1 = 0). The SF flag is a "sticky" flag, meaning that after it is set, the processor does not clear it until it is explicitly instructed to do so (for example, by an FINIT/FNINIT instruction).
  • ES, Error Summary Status (bit 7). The x87 FPU detects six classes of exception conditions:
    • Invalid operation (#I), with two subclasses:
      • Stack overflow or underflow (#IS)
      • Invalid arithmetic operation (#IA)
    • Denormalized operand (#D)
    • Divide-by-zero (#Z)
    • Numeric overflow (#O)
    • Numeric underflow (#U)
    • Inexact result (precision) (#P)
    The exception summary status flag is set when any of the (unmasked) exception flags are set. The exception flags are "sticky" bits (once set, they remain set until explicitly cleared).
  • C0-C3, Condition Code, bits 8, 9, 10 and 14. The four condition code flags indicate the results of floating-point comparison and arithmetic operations. These condition code bits are used principally for conditional branching and for storage of information used in exception handling.
  • TOP, Top of Stack Pointer, bits 11 through 13. TOP is a pointer to the FPU data register that is currently at the top of the FPU register stack. This pointer is a binary value from 0 to 7.
  • B, FPU busy, bit 15.


FPU Control Register
The 16-bit control word controls the precision of the x87 FPU and the rounding method used, and also contains the FPU floating-point exception mask bits.
  • Bits 0 through 5 are exception mask bits.
  • The precision-control (PC) field (bits 8 and 9 of the FPU control word) determines the precision (64, 53, or 24 bits of significand) of floating-point calculations made by the FPU. By default double extended precision, which uses the full 64-bit significand, is selected.
  • The rounding-control (RC) field of the FPU control register (bits 10 and 11) controls how the results of FPU floating-point instructions are rounded (see table Rounding Modes and Encoding of Rounding Control (RC) Field).

        
    Table: Rounding Modes and Encoding of Rounding Control (RC) Field
    Rounding Mode    | RC Field Setting | Description
                     | (binary)         |
    -----------------+------------------+-------------
    Round to nearest | 00               | Rounded result is the closest to the infinitely
    (even)           |                  | precise result. If two values are equally close,
                     |                  | the result is the even value (that is, the one with
                     |                  | the least-significant bit of zero). Default mode.
                     |                  |
    Round down       | 01               | Rounded result is closest to but no greater than
                     |                  | the infinitely precise result.
                     |                  |
    Round up         | 10               | Rounded result is closest to but no less than the
                     |                  | infinitely precise result.
                     |                  |
    Round toward zero| 11               | Rounded result is closest to but no greater in absolute
    (Truncate)       |                  | value than the infinitely precise result.
    
  • Bits 6, 7 and 13-15 are not used.
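As a quick way to get a feel for the two most important modes, note that Python's built-in round() happens to follow the same round-half-to-even rule as the FPU's default mode, while math.trunc() corresponds to round toward zero:

```python
import math

# Round to nearest (even): when a value lies exactly halfway between two
# representable results, the even one is chosen ("banker's rounding").
halfway = [0.5, 1.5, 2.5, 3.5]
rounded = [round(x) for x in halfway]      # ties go to the even neighbour

# Round toward zero (truncate): the fractional part is simply discarded.
truncated = [math.trunc(x) for x in (1.7, -1.7)]
```

Note how 0.5 and 2.5 both round to the even neighbour, while truncation moves both 1.7 and -1.7 toward zero.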


FPU Tag Word Register


The 16-bit tag word indicates the contents of each of the 8 registers in the FPU data-register stack (one 2-bit tag per register). The tag codes indicate whether a register contains a valid number (00), zero (01), a special floating-point number such as NaN, infinity, denormal, or an unsupported format (10), or whether it is empty (11).

FPU data registers


The FPU data registers consist of eight 80-bit registers. Values are stored in these registers in the double extended-precision floating-point format:


79 78          64 63                          0
+-+--------------+----------------------------+
|S|   exponent   |         significand        |
+-+--------------+----------------------------+
 |       |                     |
 |       |                     significand or coefficient (64 bits)
 |       exponent (15 bits)
 sign (1 bit)

When floating-point, integer, or packed BCD integer values are loaded from memory into any of the FPU data registers, the values are automatically converted into double extended-precision floating-point format (if they are not already in that format). When computation results are subsequently transferred back into memory from any of the x87 FPU registers, the results can be left in the double extended-precision floating-point format or converted back into a shorter floating-point format, an integer format, or the packed BCD integer format.
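Python cannot manipulate the 80-bit format directly, but the analogous field layout of the ordinary 64-bit double (1-bit sign, 11-bit biased exponent, 52-bit fraction) can be inspected with a short sketch; the function name decode_double is mine. The 80-bit FPU format differs mainly in its 15-bit exponent and the explicit integer bit of its 64-bit significand:

```python
import struct

def decode_double(x):
    """Split a 64-bit IEEE 754 double into its sign, biased exponent
    and fraction fields."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]  # raw 64 bits
    sign = bits >> 63                       # bit 63
    exponent = (bits >> 52) & 0x7FF         # bits 62-52, biased by 1023
    fraction = bits & ((1 << 52) - 1)       # bits 51-0
    return sign, exponent, fraction
```

For example, decode_double(1.0) gives sign 0, biased exponent 1023 (i.e. a true exponent of 0) and a zero fraction.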

The eight FPU data registers R0-R7 are treated as a register stack where R7 is the base and the stack grows towards R0. All addressing of the data registers is relative to the register on the top of the stack. The register number of the current top-of-stack register is stored in the TOP field of the FPU status word. The current TOP register is always named ST(0) or simply ST, and ST(i) is used to specify the i-th register from TOP in the stack, where i = 0, ..., 7:

      
      FPU Data Register Stack
      
         7 xxx ST(3)
         6 xxx ST(2) 
         5 xxx ST(1) 
         4 xxx ST(0) <--- TOP = 100_(2)
         3 xxx
         2 xxx
         1 xxx
         0 xxx
         
      Stack growth: the stack grows from the higher register (R7) towards the lower one (R0).
      
      The register-stack organization of the FPU reflects the way arithmetic operations, and in particular the evaluation of mathematical expressions, are performed. Unlike the "ordinary" part of the code, which is processed by the general-purpose execution unit, numerical calculations are most conveniently carried out according to the RPN (Reverse Polish Notation) scheme explained earlier, and it is to this scheme that the design of the floating-point unit is adapted.

      Load operations decrement TOP by one and load a value into the new top-of-stack register; store operations store the value from the current TOP register in memory and then increment TOP by one (note that there are also load and store operations that do not move the top of the stack). You can think of a load operation as a push and of a store operation as a pop.

      The stack is logically organized as a circular queue. If a load operation is performed when TOP is at 0, register wraparound occurs and the new value of TOP is set to 7. The floating-point stack-overflow exception indicates when wraparound might cause an unsaved value to be overwritten. Many floating-point instructions have several addressing modes that permit the programmer to implicitly operate on the top of the stack, or to explicitly operate on specific registers relative to TOP.

      FPU addressing modes


      • Stack mode. In this mode an instruction is written without any arguments -- by default registers ST(0) and ST(1) are used, and the two operands are replaced by a single result on the top of the stack:
        
        FADD --> FADDP ST(1), ST(0) --> ST(1) + ST(0) -> ST(1) and free ST(0)
        
      • Register mode. In this mode two arguments are used: ST(0) and ST(i):
        
        FADD ST(0), ST(i) --> ST(0) + ST(i) -> ST(0)
        FADD ST(i), ST(0) --> ST(i) + ST(0) -> ST(i)
        
      • Register mode with stack pop. In this mode the source operand is on the top of the stack and the destination is register ST(i). When the instruction completes, the source operand is popped from the stack:
        
        FADDP ST(i), ST(0) --> ST(i) + ST(0) -> ST(i) and free ST(0)
        
      • Mode with memory argument. In this mode the source operand is taken from memory and the destination is ST(0):
        
        FADD memory --> ST(0) + memory -> ST(0)
        


      FPU stack usage example
      Typically the stack structure of the FPU registers and instructions is used in the following way. Assume that you want to calculate the simple dot product of two vectors, $v_1 = [1.2, 3.4]$ and $v_2=[5.6, 7.8]$, and that TOP contains 100 (binary), which means that register R4 is the top of the stack. You can do this with the code:

      
      FLD  qword [vec1]
      FMUL qword [vec2]
      FLD  qword [vec1 + 8]
      FMUL qword [vec2 + 8]
      FADD ST(1)
      


      • FLD qword [vec1] This instruction decrements the stack register pointer (TOP) and loads the value 1.2 from memory into ST(0) (physical register R3).
      • FMUL qword [vec2] The second instruction multiplies the value in ST(0) by the value 5.6 from memory and stores the result in ST(0). At this moment register R3 contains the value 6.72.
      • FLD qword [vec1 + 8] The third instruction decrements TOP and loads the value 3.4 into ST(0). At this moment register R3 contains the value 6.72 while R2 contains 3.4.
      • FMUL qword [vec2 + 8] The fourth instruction multiplies the value in ST(0) by the value 7.8 from memory and stores the result in ST(0). At this moment register R3 contains the value 6.72 and R2 contains the value 26.52.
      • FADD ST(1) The fifth instruction adds the value in ST(0) to the value in ST(1) and stores the result in ST(0). At this moment register R3 still contains the value 6.72 while R2 (the top of the stack) contains the final result 33.24.


      Usage examples


      Instructions related to the FPU internals


      In the example below I use 32-bit code, as it is much easier to prepare in a loop all the data required by the printf function than it would be with the 64-bit calling convention. Writing the corresponding 64-bit code is left to you as an exercise.

      
      section .data
      
      fmt: db 10,"exception: %d",10,"top: %d",10,"R7 %d",10,"R6 %d",10,"R5 %d"
           db 10,"R4 %d",10,"R3 %d",10,"R2 %d",10,"R1 %d",10,"R0 %d",10,0
      
      section .bss 
          
      env: resd 7                 ; You need 28 bytes for saving the current
                                  ; FPU operating environment
      
      section .text
      
      extern printf
      
      global main
      
      main:
      
        finit                     ; Initialize FPU
        fld1                      ; Push +1.0 onto the FPU register stack.
        fld1
        fld1
        fld1
        call aux_print            ; Call auxiliary print code
        faddp st3, st0            ; Add ST(0) to ST(i) (in this case i=3),
                                  ; store result in ST(i), and pop the
                                  ; register stack. 
        call aux_print
      
      ; Exit
        mov   eax, 0              ; Exit code, 0=normal
        ret                       ; Main returns to operating system
      
      ; Auxiliary print code
      aux_print:
        fstenv [env]              ; Saves the current FPU operating environment
                                  ; at the memory location specified with
                                  ; the destination operand
        xor eax, eax
        mov ax, [env+8]           ; Copy to AX contents of the FPU tag word
      
        mov ecx, 0                ; Set counter as 0
      
        loop:                     ; do-while loop begin
          mov ebx, eax            ; At the beginning EAX = Tag Word
          and ebx, 3              ; Extract bits 0 and 1
          shr eax, 2              ; Shift right to extract next two bits
                                  ; in next iteration
          push ebx                ; Save extracted two bits on the stack
        
          inc ecx                 ; Increase value of the counter
          cmp ecx, 8              ; While condition test
        jne loop                  ; do-while loop end
      
      
        xor eax, eax              ; Clear eax register
        fstsw ax                  ; Save status word
        mov ebx, eax
        shr bx, 11                ; Shift BX right by 11 to get top-of-stack
                                  ; (TOP) pointer value
        and bx, 7                 ; A bit-wise AND of the two operands:
                                  ; BX and binary pattern 111 
        push ebx                  ; Save TOP on the stack
      
        mov ebx, eax              ; Prepare to extract some exception flags
               ;xxxxxxxxx1xxxxx1    bit 6 - Stack Fault         (64 decimal)
                                  ; bit 0 - Invalid Operation   ( 1 decimal)
        and bx, 0000000001000001b ; A bit-wise AND of the two operands:
                                  ; BX and binary pattern 1000001
        push ebx                  ; Save some exceptions flags's bits on the stack
        
        push  fmt                 ; Address of format string
        call  printf              ; Call C function
        add   esp, 44             ; Pop stack 11*4 bytes
        ret
      
      ; End of the code
      
      Expected output is:

      
      fulmanp@fulmanp-ThinkPad-T540p:~$ nasm -f elf fpu_test_01_32.asm -o fpu_test_01_32.o
      fulmanp@fulmanp-ThinkPad-T540p:~$ gcc -m32 fpu_test_01_32.o -o fpu_test_01_32 -no-pie
      /usr/bin/ld: warning: fpu_test_01_32.o: missing .note.GNU-stack section implies executable stack
      /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
      fulmanp@fulmanp-ThinkPad-T540p:~$ ./fpu_test_01_32 
      
      exception: 0
      top: 4
      R7 0
      R6 0
      R5 0
      R4 0
      R3 3
      R2 3
      R1 3
      R0 3
      
      exception: 0
      top: 5
      R7 0
      R6 0
      R5 0
      R4 3
      R3 3
      R2 3
      R1 3
      R0 3
      


      The code should be easy to understand thanks to the comments. Below I give extended information about some parts:
      • finit
        FINIT sets the FPU control, status, tag, instruction pointer, and data pointer registers to their default states. The FPU control word is set to 037FH (round to nearest, all exceptions masked, 64-bit precision). The status word is cleared (no exception flags set, TOP is set to 0). The data registers in the register stack are left unchanged, but they are all tagged as empty (11B). Both the instruction and data pointers are cleared.
      • fld1
        FLDx, where x is one of 1, L2T, L2E, PI, LG2, LN2, or Z, pushes one of seven commonly used constants (in double extended-precision floating-point format) onto the FPU register stack. The constants that can be loaded with these instructions are +1.0 (1), +0.0 (Z), $\log_{2}10$ (L2T), $\log_{2}e$ (L2E), $\pi$ (PI), $\log_{10}2$ (LG2), and $\log_{e}2$ (LN2).
      • faddp st3, st0
        Adds the destination and source operands and stores the sum in the destination location. In this case FADDP ST(i), ST(0) (for i=3) adds ST(0) to ST(i), stores the result in ST(i), and pops the register stack.

        The destination operand is always an FPU register; the source operand can be a register or a memory location. Source operands in memory can be in single-precision or double-precision floating-point format or in word or doubleword integer format. Please check [intel_doc_01] for reference to the other floating-point add instructions (FADD/FADDP/FIADD).
      • fstenv
        Instruction FSTENV saves the current FPU operating environment at the memory location specified with the destination operand, and then masks all floating-point exceptions. The FPU operating environment consists of the FPU: control word, status word, tag word, instruction pointer, data pointer, and last opcode.

        Figures 8-9 through 8-12 in [intel_doc_01] show the layout in memory of the stored environment, depending on the operating mode of the processor (protected or real) and the current operand-size attribute (16-bit or 32-bit). In virtual-8086 mode, the real-mode layouts are used. According to these figures, 14 or 28 bytes are needed to save all values.

        The 32-bit protected-mode layout is:
        
        31          16 15           0
        |xxxxxxxxxxxx|Control Word | B3  - B0
        |xxxxxxxxxxxx|Status Word  | B7  - B4
        |xxxxxxxxxxxx|Tag Word     | B11 - B8
                                   | B15 - B12
                                   | B19 - B16
                                   | B23 - B20
                                   | B27 - B24
        
        x - not used
        Contents of bytes B12-B27 depend on the mode
        and are not relevant to this example.
        
      • mov ax, [env+8]
        Copies the contents of the FPU tag word into AX. Next you extract each 2-bit pair and associate it with a floating-point register.


      FPU control word usage


      This code should also be easy to follow. It shows how to control the rounding mode. Please experiment with the rounding modes and work through some examples of how they behave.

      The code in 32-bit version:
      
      section .data
      
      fmt: db "result is %d", 10, 0
      a:  dq  2.5
      b:  dq  3.0
      
      section .bss
          
      tmp: resq 1
      buf: resw 1
      
      section .text
      
      extern printf
      
      global main
      
      main:
      
        finit             ; Initialize FPU
        fstcw [buf]       ; Save control word
                      ;xxxx11xxxxxxxxxx
        or word [buf], 0000110000000000b ; Bits 11-10 controls rounding:
                          ; 00 round to nearest (default),
                          ; 01 round down, [0,1) -> 0 [1,2) -> 1 [-1,0) -> -1 [-2,-1) -> -2
                          ; 10 round up,   (0,1] -> 1 (1,2] -> 2 (-1,0] ->  0 (-2,-1] -> -1
                          ; 11 round toward zero [0,1) -> 0 [1,2) -> 1 (-1,0] ->  0 (-2,-1] -> -1
                          ;     for positives behaves like round down
                          ;     for negatives behaves like round up
        fldcw [buf]       ; Load updated control word
          
        fld  qword [a]    ; Load a to FPU
        fmul qword [b]    ; Multiply by b
        fist dword [tmp]  ; Cast result to int
      
        push  dword [tmp]
        push  fmt
        call  printf
        add   esp, 8
      
      ; Exit
        mov   eax, 0      ; Exit code, 0=normal
        ret               ; Main returns to operating system
      ; End of the code
      


      Expected output is:

      
      fulmanp@fulmanp-ThinkPad-T540p:~$ nasm -f elf fpu_test_02_32.asm -o fpu_test_02_32.o
      fulmanp@fulmanp-ThinkPad-T540p:~$ gcc -m32 fpu_test_02_32.o -o fpu_test_02_32 -no-pie
      /usr/bin/ld: warning: fpu_test_02_32.o: missing .note.GNU-stack section implies executable stack
      /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
      fulmanp@fulmanp-ThinkPad-T540p:~$ ./fpu_test_02_32 
      result is 7
      


      FPU status word usage


      
      section .data
      
      fmt: db "status word value %d", 10, 0
      a:  dq  2.5
      b:  dq  0.0
      
      section .bss
          
      tmp: resq 1
      buf: resw 1
      
      section .text
      
      extern printf
      
      global main
      
      main:
      
        finit             ; Initialize FPU
        fld  qword [a]    ; Load a to FPU
        fdiv qword [b]    ; Divide by b
        
        xor eax, eax
        fstsw ax          ; Stores the current value of the FPU status word
                          ; in the destination location. The destination
                          ; operand can be either a two-byte memory location
                          ; or the AX register.
      
        push  eax
        push  fmt
        call  printf
        add   esp, 8
      
      ; Exit
        mov   eax, 0      ; Exit code, 0=normal
        ret               ; Main returns to operating system
      ; End of the code
      


      Expected output is:

      
      fulmanp@fulmanp-ThinkPad-T540p:~$ nasm -f elf fpu_test_03_32.asm -o fpu_test_03_32.o
      fulmanp@fulmanp-ThinkPad-T540p:~$ gcc -m32 fpu_test_03_32.o -o fpu_test_03_32 -no-pie
      /usr/bin/ld: warning: fpu_test_03_32.o: missing .note.GNU-stack section implies executable stack
      /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
      fulmanp@fulmanp-ThinkPad-T540p:~$ ./fpu_test_03_32 
      status word value 14340
      
      Decimal value 14340 is equal to binary 11100000000100, which means that the ZE (Zero Divide) flag (bit 2) was set. We can also see that TOP has the decimal value 7 (binary 111, bits 13-11).

      Problems you can try to solve


      Problem 1: Write equivalent 64-bit codes for given 32-bit codes.


      Problem 2: FPU stack overflow


      Write code to test FPU stack overflow. In other words, push more than eight values onto the stack.