* * * * *

   Notes on optimizing an O(n)+C algorithm where the C matters quite a bit

[Note: all hexadecimal values are preceeded with a “$”, which is old-school.
I know these days it's hip to use “0x” to designate a hexadecimal value, but
I hope this is just a passing fad. Also, the “K” here stands for “kilobyte,”
which is 1024 bytes. This too, is old fasioned but I don't want to use a word
that sounds like pet food. Geeze, kids these days!] [Here's an onion—put it
on your belt, you geezer! —The kids] [Hey! I resemble that remark! Get off my
lawn! —Sean]

I was doing a bit of retro computing over the weekend, writing 6809 code and
running it on a Color Computer emulator [1] (because the Color Computer was
my first computer and the 6809 is a criminially underrated 8-bit CPU (Central
Processing Unit) in my opinion). Part of the coding was enabling all 64K of
RAM (Random Access Memory) in the machine. Normally, the Color Computer only
sees the lower 32K of RAM, with the upper 32K being ROM (Read-Only Memory)
(the BASIC (Beginners All-Purpose Symbolic Instruction Code) interpreter). To
enable all 64K of RAM, all that's needed is to stuff any value into memory
location $FFDF, which can be done with “POKE &HFFDF,0”. The problem with that
is once the ROM goes away, so does BASIC, and the CPU starts executing Lord
knows what since the RAM isn't initialized. So the actual procedure is to
copy the ROM contents into RAM, which is simple enough:

-----[ Assembly ]-----
       orcc    #$50    ; disable interrupts
       ldx     #$8000  ; start of ROM
loop    sta     $FFDE   ; enable ROM mapping
       lda     ,x      ; read byte
       sta     $FFDF   ; enable RAM
       sta     ,x+     ; write byte back, increment pointer
       cmpx    #$FF00  ; are we done?
       bne     loop
       andcc   #$AF    ; enable interrupts
-----[ END OF LINE ]-----

We don't actually want to copy the full 32K of ROM, since the upper 256 bytes
of the memory map is for I/O (Input/Output) devices. We also disable
interrupts since the default interrupt handlers are in ROM and if an
interrupt happens when we have RAM mapped, there may not be an interrupt
handler to handle it.

The code is straightforward, simple, and unfortunately, slow. Here's the main
loop again, this time with the number of cycles and bytes each instruction
takes:

-----[ Assembly ]-----
                       ; cycles        bytes
loop1   sta     $FFDE   ; 5             3
       lda     ,x      ; 4             2
       sta     $FFDF   ; 5             3
       sta     ,x+     ; 6             2
       cmpx    #$FF00  ; 4             3
       bne     loop1   ; 3             2
-----[ END OF LINE ]-----

The loop is 15 bytes, taking 27 cycles for each iteration. Since we copy one
byte per iteration and we have 35,512 iterations, it takes 877,824 cycles to
run (there are no pipe lines or caches to worry about, so this will always
take 877,824 cycles to run). Given the Color Computer runs at .89 MHz
(Megahertz) (yes, it's not a fast computer) this is nearly a second to copy
32K of RAM.

Can we do better?

Well, yes. Should we is another matter. I mean, this code is typically run
once, so does it matter if it takes a second to run? Meh. It'll take longer
to load the code from disk, but hey, I'm doing recreational retro programming
and want to have a bit of fun.

The first and easiest optimization is to read and write 16 bits at a time
instead of 8. And that's easy enough to do—the 6809 does have a 16-bit
accumulator so it's an easy change:

-----[ Assembly ]-----
                       ; cycles        bytes
loop2   sta     $FFDE   ; 5             3
       ldd     ,x      ; 5             2
       sta     $FFDF   ; 5             3
       std     ,x++    ; 8             2
       cmpx    #$FF00  ; 4             3
       bne     loop2   ; 3             2
-----[ END OF LINE ]-----

The loop now takes 30 cycles per iteration, but it's doing 16-bits per
iteration instead of 8. The code now takes 487,680 cycles, which is almost
half the time, and we've cut the cycles per byte to 15 from 27, and the size
of the code hasn't changed. Not bad for a simple optimization.

But doing 4 bytes per iteration should be better, right? It'll take another
register (which we have) and some additional bytes, but yes, it is better:

-----[ Assembly ]-----
                       ; cycles        bytes
loop3   sta     $FFDE   ; 5             3
       ldd     ,x      ; 5             2
       ldu     2,x     ; 6             2
       sta     $FFDF   ; 5             3
       std     ,x++    ; 8             2
       stu     ,x++    ; 8             2
       cmpx    #$FF00  ; 4             3
       bne     loop3   ; 3             2
-----[ END OF LINE ]-----

Each iteration now takes 44 cycles, but we're moving 4 bytes per iteration,
giving us 11 cycles per byte, and it takes a total of 357,632 cycles. Again
an improvement but we're starting to hit diminishing improvements. Doing 6
bytes per iteration (still easy to add) takes us down to 9.833 cycles per
byte, and 8 bytes per iteration takes us down to 9.24 cycles per byte, but we
require the use of the S register (the system stack pointer, which needs to
be saved and restored) to do so. It's not worth trying to use the last
register available to us (DP) because you can't load it directly. The loop
also increases in size, from 19 bytes (shown above) to 23 bytes for the 8-
bytes-per-iteration version. Also, the upper address will have to change for
the 6 byte version, since 6 doesn't cleanly divide 32,512. It's not a show
stopper, since not all the memory in the upper 32K contains useful code, so a
few bytes not copied won't crash BASIC.

But we can still do better!

Taking inspiration from “A Great Old-Timey Game-Programming Hack [2]” we can
copy 8 bytes per iteration (first, the commented code, then the code with the
cycles/bytes columns):

-----[ Assembly ]-----
       orcc    #$50    ; disable interrupts
       sts     savesp  ; we need the S register
       lds     #$FF00-8; and because we're using the stack,
                       ; we need to start at the top of ROM
                       ; and work our way down
loop4   sta     $FFDE   ; enable ROM mapping
       puls    u,x,y,d ; pull 4 2-byte registers from memory (read memory)
       sta     $FFDF   ; enable RAM
       pshs    u,x,y,d ; push 4 2-byte registers to memory (write memory)
       leas    -8,s    ; point to next 8-byte block to transfer
       cmps    #$8000-8; are we done?
       bne     loop4
       lds     savesp  ; restore S register
       andcc   #$AF    ; enable interrupts
-----[ END OF LINE ]-----

The PULS instruction pulls the listed registers off the stack, and the PSHS
instruction pushes the listed registers onto the stack. The order of
registers in the instructions doesn't matter; they get pushed and pulled such
that it all works out.

-----[ Assembly ]-----
                       ; cycles        bytes
loop4   sta     $FFDE   ; 5             3
       puls    u,x,y,d ; 13            2
       sta     $FFDF   ; 5             3
       pshs    u,x,y,d ; 13            2
       leas    -8,s    ; 5             2
       cmps    #$8000-8; 5             4
       bne     loop4   ; 3             2
-----[ END OF LINE ]-----

The main loop is now 18 bytes, so it's on par with the third version. But
each iteration now takes 49 cycles, but given it moves 8 bytes per iteration,
we get an effective rate of 6.125 cycles per byte, and a total time of
199,136 cycles. We could do nine bytes per iteration (as the PSHS and PULS
instructions support the DP register) to get us down to 5.666 cycles per byte
(184,235 cycles total). We could also add in the CC register for a total of
10 bytes per iteration (5.3 cycles per byte; 172,314 cycles total) but we run
the risk of enabling interrupts at the wrong time, unless we disable
interrupts at the hardware level (which we can do, but it's more code and
more invasive of system state). You can see we'd be getting into diminishing
returns again, so I'm happy with just doing 8 bytes per iteration.

Table: Summary of results of copying 32,512 bytes from ROM to RAM
loop    cycles­/­iteration    cycles­/­byte cycles total    bytes­/­iteration     code size
------------------------------
loop1   27      27.000  877,824 1       15
loop2   30      15.000  487,680 2       15
loop3   44      11.000  357,632 4       19
loop4   49      6.125   199,136 8       18
(theoretical)   53      5.300   172,314 10      18

Is this important these days? Not really. Is it fun? Yes, it certainly beats
Enterprise Agile any day of the week.

[1] http://www.6809.org.uk/xroar/
[2] https://blog.moertel.com/posts/2013-12-14-great-old-timey-game-programming-hack.html
---

Discussions about this page

Notes on optimizing an O(n)+C algorithm where the C matters quite a bit | Lobsters
 https://lobste.rs/s/ojiuck

Notes on optimizing an O(n)+C algorithm where the C matters quite a bit | Hacker News
 https://news.ycombinator.com/item?id=35145162

Email author at [email protected]