* * * * *

            The speed of Microsoft's BASIC floating point routines

I was curious about how fast Microsoft's BASIC (Beginners' All-purpose
Symbolic Instruction Code) floating point [1] routines were. This is easy
enough to test, now that I can time assembly code inside the assembler [2].
The code calculates -(2π)^3/3! three ways: with the Color BASIC routines, and
with IEEE-754 (Institute of Electrical and Electronics Engineers) single
precision and double precision.

First, Color BASIC:

-----[ Assembly ]-----
       .tron   timing
ms_fp           ldx     #.tau
               jsr     CB.FP0fx        ; FP0 = .tau
               ldx     #.tau
               jsr     CB.FMULx        ; FP0 = FP0 * .tau
               ldx     #.tau
               jsr     CB.FMULx        ; FP0 = FP0 * .tau
               jsr     CB.FP1f0        ; FP1 = FP0
               ldx     #.fact3
               jsr     CB.FP0fx        ; FP0 = 3!
               jsr     CB.FDIV         ; FP0 = FP1 / FP0
               neg     CB.fp0sgn       ; FP0 = -FP0
               ldx     #.answer
               jsr     CB.xfFP0        ; .answer = FP0
       .troff
               rts

.tau            fcb     $83,$49,$0F,$DA,$A2
.fact3          fcb     $83,$40,$00,$00,$00
.answer         rmb     5
               fcb     $86,$A5,$5D,$E7,$30     ; precalculated result
-----[ END OF LINE ]-----

I can't use the .FLOAT directive here since that only supports either the
Microsoft format or IEEE-754 but not both. So for this test, I have to define
the individual bytes per float. The last line is what the result should be
(I confirmed it by checking a memory dump of the VM (Virtual Machine) after
running). Also, tau is 2π [3], just in case that wasn't clear. This ran in
8,742 cycles, taking 2,124 instructions and 4.12 cycles per instruction (I
modified the assembler to record this additional information).
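
As a cross-check on the hand-assembled bytes, here is a minimal Python sketch
that decodes the 5-byte packed format back into a number. It assumes the
layout described in [1] (an excess-128 exponent byte, then a 32-bit mantissa
whose top bit holds the sign in place of the implied leading 1). The function
name mbf5_to_float is made up for this example; it is not part of Color BASIC
or the assembler.

```python
def mbf5_to_float(raw: bytes) -> float:
    """Decode a 5-byte Microsoft packed float (assumed layout, see lead-in)."""
    if len(raw) != 5:
        raise ValueError("expected exactly 5 bytes")
    exponent = raw[0]
    if exponent == 0:
        # An exponent byte of zero is how this format represents 0.
        return 0.0
    sign = -1.0 if raw[1] & 0x80 else 1.0
    # Put back the implied leading 1 bit that the sign bit replaced.
    mantissa = int.from_bytes(bytes([raw[1] | 0x80]) + raw[2:], "big")
    # The mantissa is a fraction in [0.5, 1); the exponent is excess-128.
    return sign * (mantissa / 2**32) * 2.0 ** (exponent - 128)


# The three byte sequences from the listing above.
tau = mbf5_to_float(bytes([0x83, 0x49, 0x0F, 0xDA, 0xA2]))
fact3 = mbf5_to_float(bytes([0x83, 0x40, 0x00, 0x00, 0x00]))
answer = mbf5_to_float(bytes([0x86, 0xA5, 0x5D, 0xE7, 0x30]))

print(tau, fact3, answer)
print(-(tau ** 3) / fact3)   # same expression, computed in double precision
```

If the layout assumption holds, the decoded answer and the value computed on
the last line should agree to roughly the precision of a 32-bit mantissa.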

Next up, IEEE-754 single precision:

-----[ Assembly ]-----
       .tron   timing
ieee_single     ldu     #.tau
               ldy     #.tau
               ldx     #.answer
               ldd     #.fpcb
               jsr     REG
               fcb     FMUL    ; .answer = .tau * .tau

               ldu     #.tau
               ldy     #.answer
               ldx     #.answer
               ldd     #.fpcb
               jsr     REG
               fcb     FMUL    ; .answer = .answer * .tau

               ldu     #.answer
               ldy     #.fact3
               ldx     #.answer
               ldd     #.fpcb
               jsr     REG
               fcb     FDIV    ; .answer = .answer / 3!

               ldy     #.answer
               ldx     #.answer
               ldd     #.fpcb
               jsr     REG
               fcb     FNEG    ; .answer = -.answer
       .troff
               rts

.fpcb           fcb     FPCTL.single | FPCTL.rn | FPCTL.proj
               fcb     0
               fcb     0
               fcb     0
               fdb     0

.tau            .float  6.283185307
.fact3          .float  3!
.answer         .float  0
               .float  -(6.283185307 ** 3 / 3!)
-----[ END OF LINE ]-----

The floating point control block (.fpcb) configures the MC6839 to use single
precision, normal rounding and projective closure (not sure what that is, but
it's the default value). And it does calculate the correct result. It's
amazing that code written 42 years ago for an 8-bit CPU (Central Processing
Unit) works flawlessly. What it isn't is fast. This code took 14,204 cycles
over 2,932 instructions (average 4.84 cycles per instruction).
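
The single precision result can also be sanity-checked on a modern machine.
The sketch below is not the MC6839 code, just the same four steps in Python
with every intermediate rounded to an IEEE-754 single through the standard
struct module. It assumes FPCTL.rn means round to nearest, which is what
struct does; to_single is a helper name invented for this example.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE-754 single."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


tau = to_single(6.283185307)
fact3 = to_single(6.0)

answer = to_single(tau * tau)        # FMUL: .answer = .tau * .tau
answer = to_single(answer * tau)     # FMUL: .answer = .answer * .tau
answer = to_single(answer / fact3)   # FDIV: .answer = .answer / 3!
answer = -answer                     # FNEG: .answer = -.answer

# Big-endian byte order, matching how the 6809 stores multi-byte values.
print(answer, struct.pack(">f", answer).hex())
```

The hex string it prints can be compared against the four bytes at .answer
in a memory dump after the test runs.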

The higher average cycle count per instruction could be due to position
independent addressing modes, but I'm not entirely sure what it's doing to
take over 60% longer than Color BASIC. The ROM (Read Only Memory) does use
the IEEE-754 extended format (10 bytes) internally, with more bit shifts to
extract the exponent and mantissa, but that much longer?

Perhaps it's code to deal with ±∞ and NaNs (Not a Number).

The IEEE-754 double precision is the same, except for the floating point
control block configuring double precision and the use of .FLOATD instead of
.FLOAT; otherwise the code is identical. The timing, however, isn't. It took
31,613 cycles over 6,865 instructions (average 4.60 cycles per instruction).
With operands twice the size, it took a little over twice the time of single
precision, which is expected.
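
To put the three runs side by side, here is a small Python sketch that
recomputes the cycles per instruction and the relative cost from the figures
quoted above. Nothing here is measured; the numbers are copied from the text
and the dictionary layout is just for this example.

```python
# (cycles, instructions) as reported for each run in the text above.
runs = {
    "Color BASIC": (8742, 2124),
    "IEEE-754 single": (14204, 2932),
    "IEEE-754 double": (31613, 6865),
}

base_cycles = runs["Color BASIC"][0]

for name, (cycles, instructions) in runs.items():
    cpi = cycles / instructions          # average cycles per instruction
    relative = cycles / base_cycles      # cost relative to Color BASIC
    print(f"{name:16} {cycles:6} cycles  {cpi:.2f} cycles/instruction  "
          f"{relative:.2f}x Color BASIC")

single_cycles = runs["IEEE-754 single"][0]
double_cycles = runs["IEEE-754 double"][0]
print(f"double vs single: {double_cycles / single_cycles:.2f}x")
```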

The final bit of code just loads the ROMs into memory, and calls each
function to get the timing:

-----[ Assembly ]-----
               org     $2000
               incbin  "mc6839.rom"
REG             equ     $203D   ; register-based entry point

               org     $A000
               incbin  "bas12.rom"

       .opt    test    prot    rw,$00,$FF      ; Direct Page for BASIC
       .opt    test    prot    rx,$2000,$2000+8192 ; MC6839 ROM
       .opt    test    prot    rx,$A000,$A000+8192 ; BASIC ROM

       .test   "BASIC"
               lbsr    ms_fp
               rts
       .endtst

       .test   "IEEE-SINGLE"
               lbsr    ieee_single
               rts
       .endtst

       .test   "IEEE-DOUBLE"
               lbsr    ieee_double
               rts
       .endtst
-----[ END OF LINE ]-----

Really, the only surprising thing here was just how fast Microsoft BASIC was
at floating point.

[1] https://en.wikipedia.org/wiki/Microsoft_Binary_Format
[2] gopher://gopher.conman.org/0Phlog:2023/12/19.3
[3] https://tauday.com/tau-manifesto
