Differences

This shows you the differences between two versions of the page.

apollo_core:fpu [2020/01/04 02:30]
muaddib [FPU Benchs]
apollo_core:fpu [2020/08/02 12:37] (current)
Line 3: Line 3:
 {{:cpu.png?nolink&64|}}
  
 +===== Overview ===== 
  
-Last update 2018FEBRUARY.+Since the **GOLD2.7 Core**, the Vampire boards embed a brand new __FPU Core in the FPGA__.
  
 +This new FPU Core is:
  
- +  ​* mostly **hardware-implemented**, ​
-===== DISCLAIMER =====  +
- +
-This article is a **WORK IN PROGRESS**. +
- +
-Some of the following specifications might change in the final **GOLD 2.7 release**. +
- +
-If so, this article will be updated to take them into account. +
- +
- +
- +
-===== OVERVIEW =====  +
- +
-Since the **GOLD 2.7 core**, the Vampire 500/600 accelerator boards embed a brand __new FPU core in the FPGA__. +
- +
-This new FPU core is +
- +
-  ​* mostly **hardware implemented**, ​+
   * very close to an **FPU060**, technically speaking,
-  * **100% ​Pipelined** (parallel CPU/FPU coding support),  +  * **100% ​pipelined** (parallel CPU/FPU coding support),  
-  * **Fast** (around 35/40 MFLOPS in SysInfo), ​+  * **fast** (around 35-40 MFLOPS in SysInfo), ​
   * able to use a dedicated Floating Point Software Package (FPSP080).
  
 The whole set of instructions from the __MC68040 and MC68060__ is available, as well as the whole __MC68881 and MC68882__ subset.
  
-From an end-user or end-coder perspective, the AC68080 FPU does not provide new floating point instructions, nor any big changes: the FPU instruction set is the legacy one. More or less, as it will be described in the article.+From an end-user or end-coder perspective, the 68080 FPU does not provide any new floating point instructions or any other big changes: the FPU instruction set is the legacy one.
  
-The APOLLO-Team experimented with different options, based on the work initiated by Jari Eskelinen in **FEMU** (kudos to you, Jari).+The APOLLO-Team experimented with different options, based on the work initiated by Jari Eskelinen in **FEMU** (kudos to you, Jari). This work offered the Team an awesome test-bed to incrementally improve the FPU in hardware, and to figure out how to efficiently handle the emulated FPU instructions that are NOT implemented in hardware. During the investigations,​ it was clear that the usual TRAP'​ing mechanism is costly, wasting precious cycles.
  
-This work offered ​the team an awesome test-bed ​to incrementally improve the FPU in hardware, ​and to figure how to efficiently handle the emulated ​FPU instructions that are NOT implemented in hardware.+Since the 68080 is aiming at MC68040 compatibility,​ __the Team decided ​to implement a full FPU040/​FPU060 Core in hardware__ ​and to offer an efficient ​FPU interface for the instructions that are not implemented in hardware.  As a consequence,​ all the legacy MC68881 and MC68882 instructions are handled with new optimized mechanics.
  
-During ​the investigations,​ it was clear that the usual TRAP'​ing mechanism is costlywasting precious cycles.+This new approach intends to neatly solve all the constraints ​the Team has to deal with, such as compatibility,​ speedand room in FPGA.
  
-Since the AC68080 is aiming at MC68040 compatibility, __the team decided to implement a full FPU040/FPU060 core in hardware__ and to offer an efficient FPU interface for the instructions that are not implemented in hardware.+The result of this work consists of an autonomous FPU, not relying on any third-party tool / library anymore (as opposed to what FEMU does, or, on other machines, to what 68040.library/68060.library does).
  
-As a consequence,​ all the legacy MC68881 and MC68882 instructions are handled with new optimized mechanics.+===== Architecture =====
  
-This new approach intends to neatly solve all the constraints the Team has to deal with such as Compatibility,​ Speed, and room in FPGA. +Below is a simplified diagram of the 68080 FPU architecture,​ featuring a hardware-implemented **FPU Core** offering all the materials to operate floating-point calculations at the hardware level, and **two subsets** of instructions (040/060 and 881/882 subsets).
- +
-The result of this work consists in an autonomous FPU, not relying anymore on any third-party tool / library (as opposite to what FEMU does, or on other machines, to what 040/​060.library does). +
- +
- +
- +
-===== ARCHITECTURE ===== +
- +
-Below is a simplified diagram of the AC68080 FPU architecture, featuring a hardware implemented **FPU core** offering all the materials to operate floating-point calculations, at the hardware level, and **two subsets** of instructions (040/060 and 881/882 subsets).+
  
 {{:ac68080fpu_overview.png?direct|}}
  
- +===== Registers =====
- +
- +
-===== FPU Registers =====+
  
   * FP0 to FP7
Line 67: Line 41:
   * FPU Vector
  
-All FPU registers are implemented in hardware and fully operational.+All FPU registers are implemented in hardware and are fully operational.
  
 The FPU Vector is the entry point to the FPSP interface. It can be modified only in Supervisor mode.
  
-In some extrem usecases, the end-coder can also access the whole AC68080 64bits registers (E00 to E23) in addition to the usual FPn registers.+In some extreme use cases, the end-coder can also access all 68080 64-bit registers (E00 to E23), in addition to the usual FPn registers.
  
- +===== Data types =====
- +
-===== FPU Datatypes ​=====+
  
   * (.b) BYTE
Line 85: Line 57:
   * (#imm) IMMEDIATES
  
-All INTEGER from/to FLOAT castings ​are handled in hardware+All INTEGER from/to FLOAT casts are handled in hardware.
- +
-All IMMEDIATES are handled in hardware, which was not the case anymore on the MC68060. +
- +
-Most of the work is done in hardware, before calling the emulation vector when dealing with 881/2 instructions.+
  
-PACKED datatype (.p) is handled ​through ​the emulation vector.+All IMMEDIATES are handled ​in hardware, which was not the case on the MC68060.
  
-===== FPU EA modes =====+Most of the work is done in hardware, before calling the emulation vector when dealing with 881/882 instructions.
  
-All existing legacy FPU Effective Address modes (EA) are computed by the AC68080 FPU core, at the hardware level.+PACKED data type (.p) is handled through the emulation vector.
  
-Motorola did give up on integrating all the EA modes in hardware in the MC68060 CPU, the AC68080 FPU core brings them all back again.+===== EA modes =====
  
-The EA computation, even for emulated instructions is done in hardware before calling the emulation vector.+All existing legacy FPU Effective Address (EA) modes are computed by the 68080 FPU Core, at the hardware level.
  
 +Motorola gave up on integrating all the EA modes in hardware in the MC68060 CPU. (See "​IMMEDIATES"​ in the "Data types" section above.) The 68080 FPU Core brings them all back again.
  
 +Even for emulated instructions,​ the EA computation is done in hardware before calling the emulation vector.
  
-===== FPU Hardware instructions =====+===== Hardware instructions =====
  
-  * FMOVE, ​ FDMOVE, FSMOVE +  * FMOVE, FDMOVE, FSMOVE, 
-  * FMOVEM, FMOVECR +  * FMOVEM, FMOVECR, 
-  * FABS,   ​FDABS, ​ FSABS,  +  * FABS, FDABS, FSABS,  
-  * FNEG,   ​FDNEG, ​ FSNEG,  +  * FNEG, FDNEG, FSNEG,  
-  * FADD,   ​FDADD, ​ FSADD,  +  * FADD, FDADD, FSADD,  
-  * FSUB,   ​FDSUB, ​ FSSUB,  +  * FSUB, FDSUB, FSSUB,  
-  * FMUL,   ​FDMUL, ​ FSMUL, FSGLMUL +  * FMUL, FDMUL, FSMUL, FSGLMUL, 
-  * FDIV,   ​FDDIV, ​ FSDIV, FSGLDIV +  * FDIV, FDDIV, FSDIV, FSGLDIV, 
-  * FCMP,   ​FBCC,   ​FSCC,  FTST,  FINTRZ,  +  * FCMP, FBCC, FSCC, FTST, FINTRZ, 
-  * FSAVE, ​ FRESTORE+  * FSAVE, FRESTORE,
   * FNOP
  
-Instructions embedded in hardware offer a large floating-point ​instructions ​set, very close to the 040/060 FPUs. Most of the time, a coder can easily avoid the non-implemented ones since all the necessary primitives are available. For example, ​Amiga Quake can run 100% hardware instructions.+Instructions embedded in hardware offer a large floating-point ​instruction ​set, very close to the 040/060 FPUs. Most of the time, a coder can easily avoid the non-implemented ones since all the necessary primitives are available. For example, Quake can run 100% on hardware instructions.
  
-All those instructions run more or less in **1 cycle, so they are very fast**. It all depends on how well the ASM/​Coder ​makes smart use of the superscalar,​ and cache hits. +All those instructions run more or less in **1 cycle, so they are very fast**. It all depends on how well the ASM coder makes smart use of the superscalar,​ and on cache hits. 
  
-All those instructions can be used in the emulated instructions.+All those hardware ​instructions can be used in the emulated instructions.
  
 They are largely used in the embedded FPSP to accelerate the emulation.
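
As an illustration of how an emulated instruction such as FSQRT could, in principle, be built from the hardware primitives listed above (add, multiply, divide), here is a minimal Python sketch of a Newton-Raphson square root. This is not the FPSP code: the function name ''newton_sqrt'', the initial guess and the iteration cap are assumptions made for this example only.

```python
import math

def newton_sqrt(a, max_iter=64):
    """Approximate sqrt(a) for finite a >= 0 using only +, * and /."""
    if a < 0.0:
        raise ValueError("square root of a negative number")
    if a == 0.0:
        return 0.0
    # Initial guess: halve the binary exponent of a, so the guess is
    # within a small constant factor of the true root.
    _, exponent = math.frexp(a)
    x = math.ldexp(1.0, exponent // 2)
    for _ in range(max_iter):
        nxt = 0.5 * (x + a / x)   # one divide, one add, one multiply
        if nxt == x:              # no further change at working precision
            break
        x = nxt
    return x
```

Each loop iteration maps onto operations of the FDIV, FADD and FMUL kind, which is the sort of composition the paragraph above alludes to.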
  
 +===== Emulated instructions =====
  
- +  ​* FSQRT, FDSQRT, FSSQRT, 
-===== FPU Emulated instructions ===== +  * FACOS, FCOS, FCOSH, 
- +  * FASIN, FSIN, FSINH, FSINCOS, 
-  ​* FSQRT, ​  ​FDSQRT, ​ FSSQRT +  * FATAN, FATANH, FTAN, FTANH, 
-  * FACOS, ​  ​FCOS,    FCOSH +  * FETOX, FETOXM1, FTENTOX, FTWOTOX, 
-  * FASIN, ​  ​FSIN,    FSINH, ​  ​FSINCOS +  * FGETEXP, FGETMAN, FINT, 
-  * FATAN, ​  ​FATANH, ​ FTAN,    FTANH +  * FMOD, FREM, FSCALE, 
-  * FETOX, ​  ​FETOXM1, FTENTOX, FTWOTOX +  * FLOG10, FLOG2, FLOGN, FLOGN1P
-  * FGETEXP, FGETMAN, FINT +
-  * FMOD,    FREM,    FSCALE +
-  * FLOG10, ​ FLOG2, ​  ​FLOGN, FLOGN1P+
  
 Those instructions are handled using some **optimized FPSP code**.
Line 142: Line 110:
 The FPSP code is instantiated by using a new dedicated FPU Vector.
  
-It takes full advantages ​of the hardware FPU core, such as the precomputed ​EA modes, ​Datatypes ​casting, and makes uses of the implemented primitives. +It takes full advantage ​of the hardware FPU Core, such as the pre-computed ​EA modes, ​data type casting, and implemented primitives.
- +
-Depending on the FPGA size constraints,​ the APOLLO-Core is able to propose different implementations. +
- +
-Some instructions can be emulated in the FPSP or embedded into the FPU core, such as the **SQRT** instruction. +
  
 +Depending on the FPGA size constraints,​ the APOLLO Core is able to propose different implementations.
  
-===== FPU Precision =====+Some instructions can be emulated in the FPSP or embedded into the FPU Core, such as the **SQRT** instruction.
  
-By design, the APOLLO-Core FPU core can provide all calculations in 80-bits, like the original 68k FPUs do.+===== Precision =====
  
-But to make the AC68080 FPU core fit into the not-that-fat **C3 40KLE FPGA** used in the Vampire 500/600 generation, the width of the FPU was reduced to allow fitting and to still run games and demos.+The original **APOLLO FPU Core** was designed to perform all calculations in **80-bit** "extended precision", like the original 68k FPUs. It was a very, very large Core, much bigger than the whole TG68 Core itself!
  
-Reduction is in discussion and depends on the remaining room in the FPGA, __either 64bits or little less__.+When integrating the **APOLLO FPU Core** into the current Vampire generation, the precision was reduced to **64-bit** "double precision", and the Core was renamed to **68080 FPU Core**. This decision was made after careful consideration of the following:
  
-People shall **__use the FPU for what it is intended to on the Vampire__**, that is to run most of the Amiga Demos and Games requiring an FPU.+  * "Extended precision" wastes too much space on the FPGA.
 +  * In real-world scenarios, a precision higher than 64 bits is almost never needed. 
 +  * In rare situations where a higher precision is needed, it can be simulated using other techniques. 
 +  * CPU manufacturers no longer think that "​extended precision"​ is a good idea.  Modern CPUs such as PowerPC and ARM do not support "​extended precision"​. ​ Even Motorola dropped support ​for it in their ColdFire processor, which is derived from the 68k architecture. 
 +  * Other than a couple of old fractal explorers, there is no Amiga software that is known to require "extended precision".
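
One well-known family of techniques for simulating higher precision (not necessarily the one the authors have in mind) is "double-double" arithmetic, where a value is carried as an unevaluated sum of two doubles. Below is a minimal Python sketch of its basic building block, Knuth's error-free TwoSum. The function name ''two_sum'' is an assumption for this example.

```python
def two_sum(a, b):
    """Return (s, err): s is the rounded sum a + b, err the rounding error.

    Barring overflow, s + err equals a + b exactly.
    """
    s = a + b
    b_virtual = s - a            # the part of b that made it into s
    a_virtual = s - b_virtual    # the part of a that made it into s
    err = (a - a_virtual) + (b - b_virtual)
    return s, err
```

For example, ''two_sum(1.0, 2.0**-60)'' returns a sum of 1.0 together with the 2**-60 that plain double-precision addition would have dropped.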
  
-Next Vampire generation, where the FPGA is bigger, will increase this precision to either 64bits or more, depending on which compatibility level will be considered realistically useful in Amiga software-land.+The Vampire Standalone has an **Altera Cyclone V** FPGA, which has enough space to accommodate the full 64-bit FPU Core.  However, Vampire accelerator boards connected to a classic Amiga have an **Altera Cyclone III** FPGA, which does not have enough space for the full 64-bit FPU Core.  Therefore, on these boards, the FPU Core had to be reduced to **52-bit** precision, to make it fit.  This precision should be enough to run most apps, games and demos requiring an FPU.  There are very few programs that require the full 64 bits of precision and so do not work on the 52-bit FPU Core.  The APOLLO-Team is ready to help the authors of those programs to adapt their software to work with 52-bit precision.  If adapting the software is not possible, then you can try the following:
  
 +  - Turn off the 52-bit FPU Core using ''​[[:​system_tools:​vcontrol|VControl FPU]]''​.
 +  - Check if the program you want to use detects the missing FPU and falls back to non-FPU routines. ​ If the detection is not automatic:
 +    * You might need to manually switch to non-FPU routines in the program'​s settings.
 +    * You might need to load a separate program file that contains the non-FPU version.
 +    * You might need to reinstall the program so that it installs the non-FPU version.
 +  - If it turns out that the program absolutely requires an FPU, you can employ a full-precision FPU emulator in software.
 +    * If you are using a Macintosh emulator, you can emulate an extended-precision FPU using [[https://​www.macintoshrepository.org/​2639-softwarefpu-3-x|SoftwareFPU]].
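
To get a rough feel for whether a calculation tolerates reduced precision before trying it on real hardware, one can round intermediate results to fewer significand bits on any host machine. The Python sketch below does this. The helper name ''round_significand'' and the 40-bit figure used in the usage note are illustrative assumptions, not the actual internal format of the 52-bit FPU Core, which this page does not specify.

```python
import math

def round_significand(x, bits):
    """Round x to `bits` significant binary digits (ties to even)."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    mantissa, exponent = math.frexp(x)       # x == mantissa * 2**exponent
    scaled = round(mantissa * 2.0 ** bits)   # keep `bits` bits as an integer
    return math.ldexp(float(scaled), exponent - bits)
```

For instance, ''round_significand(1.0 + 2.0**-50, 40)'' gives exactly 1.0, showing that a difference that small is lost at the assumed width, while ''round_significand(0.1, 53)'' leaves a double unchanged.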
  
-===== FPU Performance =====+===== Performance =====
  
-Overall, the performance of the GOLD 2.7 FPU is very acceptable+Overall, the performance of the GOLD2.7 FPU is very acceptable:
  
-  * showing a nice 35/40 MFLOPS ​on SysInfo. ​+  * showing a nice 35-40 MFLOPS ​in SysInfo. ​
   * showing incredible floating point results in AIBB.
   * able to execute FPU instructions in parallel to CPU code.
   * able to run most of the Amiga RTG+FPU demos/compos at full-speed.
-  * able to render scenes in 3D software ​modelers ​as fast as the fastest existing 060.  +  * able to render scenes in 3D modeling ​software as fast as the fastest existing 060.  
-  * able to run Amiga Quake at a decent frame rate (25 fps in Low-res, more than 15 fps in High-res).+  * able to run Quake at a decent frame rate (25 fps in Low Res, more than 15 fps in High Res).
  
-The following scores are produced by a small program written specifically to measure the cycle count per FPU instruction,​ giving a welcomed ​overview of the actual speed.+The following scores are produced by a small program written specifically to measure the cycle count per FPU instruction,​ giving a useful ​overview of the actual speed.
  
-**RAW FPU cycles ​scores ​(pre-GOLD2.7 ​release) :**+**Raw FPU cycles (GOLD2.7):​**
   ​   ​
   * FABS       :        1 cycles
Line 243: Line 218:
   * FNOP       :        1 cycles
  
-The cycles count clearly shows what is handled in Hardware ​or through FPSP. +The cycles count clearly shows what is handled in hardware ​or through FPSP. 
  
 +===== Benchmarks =====
  
-===== FPU Benchs =====+Below are some well-known Amiga benchmark measurements,​ from GOLD2.7 Core.
  
-Below are some well-known Amiga benchs measurements,​ from last pre-GOLD2.7 core. +**SysInfo ​4.0**
- +
-**SysInfo ​V4.0**+
  
 {{::gold27_sysinfo.jpg?direct&400|}}
Line 274: Line 248:
 {{:gold27_aibb_fmatrix.jpg?direct&400|}}
  
-**Amiga Quake in 320x200**+**Quake in 320x200**
  
 {{::gold27_quake.jpg?direct&400|}}
Line 280: Line 254:
 {{:gold27_quakecli.jpg?direct&400|}}
  
-**Amiga Cinema 4D FPU** (Colourtext,​ C4D 080fpu@77, 640x480, AA 4x4: 15m28s)+**Cinema 4D FPU** (Colourtext,​ C4D 080fpu@77, 640x480, AA 4x4: 15m 28s)
  
 {{::gold27_c4d.png?direct&400|}}
  
-**Amiga POVRay ​3.1** (Teapot rendering)+**POV-Ray ​3.1** (Teapot rendering)
  
 {{::gold27_povray31.png?direct&400|}}
  
-**Amiga LightWave 3D 3.5 (FP version)** (Benchmark scene, rendered on A600 ECS screen)+**LightWave 3D 3.5 (FP version)** (Benchmark scene, rendered on A600 ECS screen: 1h 2m 15s)
- +
-Vampire 500/600 GOLD 2.7 x11: 1h 2m 15s+
  
 {{::gold27_lw35fpu_ecs.png?direct&400|}}
  
-Compared ​to Amiga 1200 ACA 1233/40MHz FPU 50MHz: 11h 22m 24s+(...compared ​to Amiga 1200 ACA 1233/40MHz FPU 50MHz: 11h 22m 24s)
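
For reference, the ratio between the two LightWave render times quoted above can be worked out as follows (a small Python sketch; the helper and variable names are arbitrary):

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration to a number of seconds."""
    return hours * 3600 + minutes * 60 + seconds

vampire = to_seconds(1, 2, 15)     # Vampire with the GOLD2.7 FPU
aca1233 = to_seconds(11, 22, 24)   # Amiga 1200 ACA 1233/40MHz, FPU 50MHz
speedup = aca1233 / vampire
print(f"{speedup:.1f}x faster")
```

That is 3735 seconds against 40944 seconds, so the Vampire render is roughly 11 times faster on this scene.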
  
-https://​youtu.be/​q-yuF-A73Ks+{{youtube>​q-yuF-A73Ks?​size=400x225}}
  
 +===== Videos =====
  
-===== FPU Videos =====+Below are some videos recorded by the beta-testers.
  
-Below, some videos recorded by the beta-testers.+  * [[https://​youtu.be/​qSrnbSjsSnE|FPU happy tests]] 
 +  * [[https://​www.youtube.com/​watch?​v=6fy7HFNp178&​t=37s|Imagine 4.0 FPU version]] 
 +  * [[https://​www.youtube.com/​watch?​v=IrYxMsm_Xm8|CineMorph FPU version]] 
 +  * [[https://​www.youtube.com/​watch?​v=BqvtjONlsD0&​t=3s|FPU RTG demo]] 
 +  * [[https://​www.youtube.com/​watch?​v=FM7F3VLw_lk|Mini Metal Slug FPU in an RTG Workbench window]] 
 +  * [[https://​www.youtube.com/​watch?​v=w2S4suHf-l8&​t=173s|Quake FPU]] 
 +  * [[https://​www.youtube.com/​watch?​v=I3XycCr5KRM|VistaPro Makepath FPU]]
  
-FPU happy testings +===== Deprecated: FEMU by Jari =====
-https://​youtu.be/​qSrnbSjsSnE+
  
-Imagine 4.0 FPU version +The current FPU Core is NOT based on the FEMU program anymore. It is a full rewrite, aiming at a cleaner, faster, and safer emulation code. In fact, FEMU __was calling the OS math libs to emulate the missing instructions__, which is NOT the case anymore.
-https://www.youtube.com/​watch?​v=6fy7HFNp178&​t=37s+
  
-CineMorph FPU version +As reference for comparison, here is a sample of FEMU scores on a GOLD2.5 Core:
-https://​www.youtube.com/​watch?​v=IrYxMsm_Xm8 +
- +
-FPU RTG DEMO +
-https://​www.youtube.com/​watch?​v=BqvtjONlsD0&​t=3s +
- +
-Mini METAL SLUG FPU in RTG Workbench Window +
-https://​www.youtube.com/​watch?​v=FM7F3VLw_lk +
- +
-Quake FPU +
-https://​www.youtube.com/​watch?​v=w2S4suHf-l8&​t=173s +
- +
-Vista Pro Makepath FPU +
-https://​www.youtube.com/​watch?​v=I3XycCr5KRM +
- +
-===== Additional notes ===== +
- +
-Below some free notes. +
- +
-==== FPU integration on current generation ==== +
- +
-The original **APOLLO-Core 80bits FPU** and the associated VHDL code, written long ago, is big, very big. +
- +
-It is much bigger than the whole TG68 core itself ! +
- +
-The integration of the **APOLLO-Core FPU** into the current Vampire generation, is called the **AC68080 FPU core**. +
- +
-It obviously needed many work to make it fit in the **Altera Cyclone III 40KLE FPGA** used in the Vampire 500/600. +
- +
-People need to understand that it was done at some costs, and needed smart moves. +
- +
-However, we, in the team, are proud of how it was integrated. +
- +
-Of course, it might (or even, it surely) **have some remaining bugs**. +
- +
-This is normal for first release, and we would be interested in all kind of feedback about it, once released into the wild. +
- +
- +
- +
-==== FEMU by Jari ==== +
- +
-The current FPU core is NOT based anymore on the FEMU program. +
- +
-It is a full rewrite, aiming at a cleaner, faster, and safier emulation code. +
- +
-In fact, FEMU __did call the OS MATHLIBS to emulate the missing instructions__,​ which is NOT the case anymore. +
- +
-As a comparison reference,​ +
- +
-FEMU scores ​sample ​on a GOLD2.5-based core :+
   ​   ​
   * FATAN      :     1327 cycles
Line 372: Line 300:
   * FTENTOX    :     4258 cycles
   * FTWOTOX    :     2565 cycles
- 
- 
  
 ----
  
 <php>tpl_youarehere();</php>
Last modified: 2020/08/02 12:37