APOLLO 68080 FPU Core article

Last update: February 2018.

This article is a WORK IN PROGRESS.

Some of the following specifications might change in the final GOLD 2.7 release.

If so, this article will be updated to take them into account.

Since the GOLD 2.7 core, the Vampire V500/V600 accelerator boards embed a brand new FPU core in the FPGA.

This new FPU core is

  • mostly implemented in hardware,
  • very close to an FPU060, technically speaking,
  • 100% pipelined (CPU and FPU instructions can be coded to run in parallel),
  • fast (around 35 to 40 MFLOPS in SysInfo),
  • able to use a dedicated Floating Point Software Package (FPSP080).

The whole set of instructions from the MC68040 and MC68060 is available, as well as the whole MC68881 and MC68882 subset.

From an end-user or coder perspective, the AC68080 FPU provides neither new floating-point instructions nor any big changes: the FPU instruction set is the legacy one, more or less, as described in this article.

The APOLLO-Team experimented with different options, based on the work initiated by Jari Eskelinen in FEMU (kudos to you, Jari).

This work gave the team an excellent test bed to incrementally improve the FPU in hardware, and to figure out how to efficiently handle the emulated FPU instructions that are NOT implemented in hardware.

During the investigations, it became clear that the usual trap mechanism is costly and wastes precious cycles.

Since the AC68080 is aiming at MC68040 compatibility, the team decided to implement a full FPU040/FPU060 core in hardware and to offer an efficient FPU interface for the instructions that are not implemented in hardware.

As a consequence, all the legacy MC68881 and MC68882 instructions are handled with new optimized mechanics.

This new approach is intended to neatly address all the constraints the team has to deal with, such as compatibility, speed, and room in the FPGA.

The result of this work is an autonomous FPU that no longer relies on any third-party tool or library (as opposed to what FEMU does or, on other machines, what 040/060.library does).

Below is a simplified diagram of the AC68080 FPU architecture. It features a hardware FPU core that provides everything needed to perform floating-point calculations at the hardware level, and two subsets of instructions (the 040/060 and 881/882 subsets).

  • FP0 to FP7
  • FPSR
  • FPCR
  • FPU Vector

All FPU registers are implemented in hardware and fully operational.

The FPU Vector is the entry point to the FPSP interface. It can be modified only in Supervisor mode.

In some extreme use cases, the coder can also access the full set of AC68080 64-bit registers (E00 to E23) in addition to the usual FPn registers.

  • (.b) BYTE
  • (.w) WORD
  • (.l) LONG
  • (.s) SINGLE
  • (.d) DOUBLE
  • (.x) EXTENDED
  • (#imm) IMMEDIATES

All conversions between INTEGER and FLOAT are handled in hardware.

All IMMEDIATES are handled in hardware, which was no longer the case on the MC68060.

When dealing with 881/882 instructions, most of the work is done in hardware before the emulation vector is called.

The PACKED datatype (.p) is handled through the emulation vector.

All existing legacy FPU Effective Address modes (EA) are computed by the AC68080 FPU core, at the hardware level.

Motorola gave up on integrating all the EA modes in hardware in the MC68060 CPU; the AC68080 FPU core brings them all back.

The EA computation, even for emulated instructions, is done in hardware before the emulation vector is called.

  • FNOP

The instructions embedded in hardware form a large floating-point instruction set, very close to that of the 040/060 FPUs. Most of the time, a coder can easily avoid the non-implemented ones, since all the necessary primitives are available. For example, Amiga Quake can run entirely on hardware instructions.

All those instructions execute in roughly 1 cycle, so they are very fast. The actual throughput depends on how well the coder exploits the superscalar design and cache hits.

All those instructions can be used by the code that emulates the other instructions.

They are widely used in the embedded FPSP to accelerate the emulation.


The instructions that are not implemented in hardware are handled using optimized FPSP code.

The FPSP code is invoked through a new dedicated FPU Vector.

It takes full advantage of the hardware FPU core, such as the precomputed EA modes and the datatype conversions, and makes use of the implemented primitives.

Depending on the FPGA size constraints, the APOLLO-Core can offer different implementations.

Some instructions, such as SQRT, can be either emulated in the FPSP or embedded into the FPU core.

By design, the APOLLO-Core FPU can perform all calculations in 80 bits, like the original 68k FPUs do.

But to make the AC68080 FPU core fit into the not-that-fat C3 40KLE FPGA used in the Vampire V500/V600 generation, the width of the FPU was reduced, while still allowing games and demos to run.

The exact reduction is still under discussion and depends on the remaining room in the FPGA: either 64 bits or a little less.
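To give a feel for what a narrower FPU means numerically, here is a minimal Python sketch that rounds a value to a chosen number of significand bits. It only models the significand width, not the AC68080 datapath. The function name `round_to_significand_bits` and the widths tried are illustrative assumptions, since the final reduced width was not decided at the time of writing. Python floats are 64-bit doubles (53 significand bits), so wider formats cannot be shown this way.

```python
import math

def round_to_significand_bits(x: float, bits: int) -> float:
    """Round x to `bits` significand bits (hypothetical helper, 1 <= bits <= 53)."""
    if x == 0.0 or not math.isfinite(x):
        return x
    # x == mantissa * 2**exponent, with 0.5 <= |mantissa| < 1
    mantissa, exponent = math.frexp(x)
    scale = 1 << bits
    rounded = round(mantissa * scale) / scale
    return math.ldexp(rounded, exponent)

value = 1.0 / 3.0
for bits in (53, 40, 24):  # illustrative widths only, not the actual FPU widths
    approx = round_to_significand_bits(value, bits)
    print(f"{bits:2d} bits: {approx!r}  error = {abs(approx - value):.3e}")
```

The error grows as the significand shrinks, which is the trade-off accepted here in exchange for fitting the core into the FPGA.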

On the Vampire, the FPU should be used for what it is intended for, that is, running most of the Amiga demos and games that require an FPU.

The next Vampire generation, where the FPGA is bigger, will increase this precision to 64 bits or more, depending on which compatibility level is considered realistically useful in Amiga software-land.

Overall, the performance of the GOLD 2.7 FPU is very acceptable:

  • it scores a nice 35 to 40 MFLOPS in SysInfo,
  • it shows incredible floating-point results in AIBB,
  • it can execute FPU instructions in parallel with CPU code,
  • it can run most of the Amiga RTG+FPU demos/compos at full speed,
  • it can render scenes in 3D software modelers as fast as the fastest existing 060,
  • it can run Amiga Quake at a decent frame rate (25 fps in low-res, more than 15 fps in high-res).

The following scores were produced by a small program written specifically to measure the cycle count per FPU instruction, giving a welcome overview of the actual speed.

Raw FPU cycle counts (pre-GOLD 2.7 release):

* FABS       :        1 cycles
* FACOS      :      182 cycles
* FADD       :        1 cycles
* FASIN      :      185 cycles
* FATAN      :      231 cycles
* FATANH     :      180 cycles
* FCMP       :        1 cycles
* FCOS       :      253 cycles
* FCOSH      :      333 cycles
* FDABS      :        1 cycles
* FDADD      :        1 cycles
* FDDIV      :        2 cycles
* FDIV       :        2 cycles
* FDMOVE     :        1 cycles
* FDMUL      :        1 cycles
* FDNEG      :        1 cycles
* FDSQRT     :      221 cycles
* FDSUB      :        1 cycles
* FETOX      :      296 cycles
* FETOXM1    :      289 cycles
* FGETEXP    :       91 cycles
* FGETMAN    :       91 cycles
* FINT       :      117 cycles
* FINTRZ     :        2 cycles
* FLOG10     :      281 cycles
* FLOG2      :      291 cycles
* FLOGN      :      266 cycles
* FLOGN1P    :      266 cycles
* FMOD       :      134 cycles
* FMOVERm    :        1 cycles (Read from memory)
* FMOVEWm    :        1 cycles (Write to memory)
* FMOVERi    :        1 cycles (Read from register)
* FMOVEWi    :        1 cycles (Write to register)
* FMOVECR    :        1 cycles (Read constants)
* FMOVECTRL  :        8 cycles (Fmovem fpsr/fpcr/fpiar,(An))
* FMOVEMR    :        8 cycles (Fmovem (An),fp0-fp7)
* FMOVEMW    :       19 cycles (Fmovem fp0-fp7,(An))
* FMUL       :        1 cycles
* FNEG       :        1 cycles
* FREM       :      155 cycles
* FSABS      :        1 cycles
* FSADD      :        1 cycles
* FSCALE     :      121 cycles
* FSDIV      :        2 cycles
* FSGLDIV    :        2 cycles
* FSGLMUL    :        1 cycles
* FSIN       :      266 cycles
* FSINCOS    :      336 cycles
* FSINH      :      354 cycles
* FSMOVE     :        1 cycles
* FSMUL      :        1 cycles
* FSNEG      :        1 cycles
* FSQRT      :      223 cycles
* FSSQRT     :      221 cycles
* FSSUB      :        1 cycles
* FSUB       :        1 cycles
* FTAN       :      244 cycles
* FTANH      :      343 cycles
* FTENTOX    :      289 cycles
* FTST       :        1 cycles
* FTWOTOX    :      340 cycles
* FBCC       :        1 cycles
* FSCC       :        1 cycles
* FNOP       :        1 cycles

The cycle counts clearly show which instructions are handled in hardware and which go through the FPSP.

Below are some measurements from well-known Amiga benchmarks, taken on the latest pre-GOLD 2.7 core.
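This split can be made explicit with a small Python sketch over a few rows of the table above. The 20-cycle threshold and the names `RAW_CYCLES`, `HW_THRESHOLD` and `classify` are assumptions chosen for illustration; the table itself does not state where the boundary lies.

```python
# A few rows copied from the raw cycle table (pre-GOLD 2.7 core).
RAW_CYCLES = {
    "FADD": 1, "FMUL": 1, "FDIV": 2, "FINTRZ": 2, "FMOVEMW": 19,
    "FGETEXP": 91, "FINT": 117, "FSQRT": 223, "FSIN": 266, "FSINH": 354,
}

HW_THRESHOLD = 20  # assumption: anything slower is taken to go through the FPSP

def classify(cycles: int) -> str:
    """Guess the execution path from a measured cycle count (heuristic only)."""
    return "hardware" if cycles < HW_THRESHOLD else "FPSP"

# List the instructions from fastest to slowest with their guessed path.
for name, cycles in sorted(RAW_CYCLES.items(), key=lambda item: item[1]):
    print(f"{name:8s} {cycles:4d} cycles -> {classify(cycles)}")
```

Under this heuristic, FINT (117 cycles) lands on the FPSP side while FINTRZ (2 cycles) lands on the hardware side, which matches the gap visible in the table.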

SysInfo V4.0

WhichAmiga 1.3.25

AIBB 6.5 BeachBall bench (Rendered on PAL: High Res 4 Colors)

AIBB 6.5 BeachBall bench (Rendered on PAL: High Res 16 Colors)

AIBB 6.5 FMath bench

AIBB 6.5 FMatrix bench

Amiga Quake in 320×200

Amiga Cinema 4D FPU (Colourtext, C4D 080fpu@77, 640×480, AA 4×4: 15m28s)

Amiga POVRay 3.1 (Teapot rendering)

Amiga LightWave 3D 3.5 (FP version) (Benchmark scene, rendered on A600 ECS screen)

Vampire 500/600 GOLD 2.7 x11 : 1h 2m 15s

Compared to Amiga 1200 ACA 1233/40MHz FPU 50MHz : 11h 22m 24s
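As a quick check of what these two render times mean, the following Python sketch converts both durations to seconds and computes their ratio. The helper name `to_seconds` is an assumption made for illustration; the durations are the two figures quoted above.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s duration to seconds (hypothetical helper)."""
    return hours * 3600 + minutes * 60 + seconds

vampire = to_seconds(1, 2, 15)     # Vampire 500/600 GOLD 2.7 x11
aca1233 = to_seconds(11, 22, 24)   # Amiga 1200 ACA 1233/40MHz, FPU 50MHz

print(f"Vampire:  {vampire} s")
print(f"ACA 1233: {aca1233} s")
print(f"Ratio:    {aca1233 / vampire:.1f}x")
```

On this scene, the Vampire finishes roughly eleven times faster.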


Below are some videos recorded by the beta-testers.

FPU happy testings https://youtu.be/qSrnbSjsSnE

Imagine 4.0 FPU version https://www.youtube.com/watch?v=6fy7HFNp178&t=37s

CineMorph FPU version https://www.youtube.com/watch?v=IrYxMsm_Xm8

FPU RTG DEMO https://www.youtube.com/watch?v=BqvtjONlsD0&t=3s

Mini METAL SLUG FPU in a RTG Workbench Window https://www.youtube.com/watch?v=FM7F3VLw_lk

Quake FPU https://www.youtube.com/watch?v=w2S4suHf-l8&t=173s

Vista Pro Makepath FPU https://www.youtube.com/watch?v=I3XycCr5KRM

Below are some free-form notes.

FPU integration on current generation

The original APOLLO-Core 80-bit FPU, with its associated VHDL code written long ago, is big, very big.

It is much bigger than the whole TG68 core itself!

The integration of the APOLLO-Core FPU into the current Vampire generation is called the AC68080 FPU core.

It obviously took a lot of work to make it fit in the Altera Cyclone 3 40KLE FPGA used in the V500/V600.

People need to understand that this came at some cost and required smart moves.

However, we, in the team, are proud of how it was integrated.

Of course, it may (or rather, it surely does) have some remaining bugs.

This is normal for a first release, and we would be interested in all kinds of feedback about it once it is released into the wild.

FEMU by Jari

The current FPU core is no longer based on the FEMU program.

It is a full rewrite, aiming at cleaner, faster, and safer emulation code.

In fact, FEMU called the OS MATHLIBS to emulate the missing instructions, which is no longer the case.

For comparison, here is a sample of FEMU scores on a GOLD 2.5-based core:

* FATAN      :     1327 cycles
* FCOS       :     2522 cycles
* FLOG10     :     3030 cycles
* FLOG2      :     3561 cycles
* FLOGN      :     2909 cycles
* FSIN       :     1558 cycles
* FSINH      :     1219 cycles
* FSINCOS    :     4972 cycles
* FTAN       :     1531 cycles
* FTANH      :     2460 cycles
* FTENTOX    :     4258 cycles
* FTWOTOX    :     2565 cycles
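To put the two listings side by side, here is a small Python sketch that divides each FEMU cycle count by the corresponding count from the raw cycle table earlier in this article. The names `FEMU_CYCLES`, `FPSP_CYCLES` and `speedup` are assumptions made for illustration. The two measurements come from different cores (GOLD 2.5 versus pre-GOLD 2.7), so the ratios are indicative only.

```python
# Cycle counts copied from the two listings in this article.
FEMU_CYCLES = {  # FEMU on a GOLD 2.5-based core
    "FATAN": 1327, "FCOS": 2522, "FLOG10": 3030, "FLOG2": 3561,
    "FLOGN": 2909, "FSIN": 1558, "FSINH": 1219, "FSINCOS": 4972,
    "FTAN": 1531, "FTANH": 2460, "FTENTOX": 4258, "FTWOTOX": 2565,
}
FPSP_CYCLES = {  # same instructions on the pre-GOLD 2.7 core
    "FATAN": 231, "FCOS": 253, "FLOG10": 281, "FLOG2": 291,
    "FLOGN": 266, "FSIN": 266, "FSINH": 354, "FSINCOS": 336,
    "FTAN": 244, "FTANH": 343, "FTENTOX": 289, "FTWOTOX": 340,
}

# Ratio of old to new cycle counts for each instruction.
speedup = {name: FEMU_CYCLES[name] / FPSP_CYCLES[name] for name in FEMU_CYCLES}

# Print the instructions from largest to smallest gain.
for name, ratio in sorted(speedup.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:8s} {FEMU_CYCLES[name]:5d} -> {FPSP_CYCLES[name]:4d} cycles  ({ratio:4.1f}x)")
```

With these figures, the ratios range from about 3.4x (FSINH) to about 14.8x (FSINCOS).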
