Notes on optimizing an O(n)+C algorithm where the C matters quite a bit

* * * * *

Notes on optimizing an O(n)+C algorithm where the C matters quite a bit

[Note: all hexadecimal values are preceeded with a “$”, which is old-school.
I know these days it's hip to use “0x” to designate a hexadecimal value, but
I hope this is just a passing fad. Also, the “K” here stands for “kilobyte,”
which is 1024 bytes. This too, is old fasioned but I don't want to use a word
that sounds like pet food. Geeze, kids these days!] [Here's an onion—put it
on your belt, you geezer! —The kids] [Hey! I resemble that remark! Get off my
lawn! —Sean]

I was doing a bit of retro computing over the weekend, writing 6809 code and
running it on a Color Computer emulator [1] (because the Color Computer was
my first computer and the 6809 is a criminially underrated 8-bit CPU (Central
Processing Unit) in my opinion). Part of the coding was enabling all 64K of
RAM (Random Access Memory) in the machine. Normally, the Color Computer only
sees the lower 32K of RAM, with the upper 32K being ROM (Read-Only Memory)
(the BASIC (Beginners All-Purpose Symbolic Instruction Code) interpreter). To
enable all 64K of RAM, all that's needed is to stuff any value into memory
location $FFDF, which can be done with “POKE &HFFDF,0”. The problem with that
is once the ROM goes away, so does BASIC, and the CPU starts executing Lord
knows what since the RAM isn't initialized. So the actual procedure is to
copy the ROM contents into RAM, which is simple enough:

-----[ Assembly ]-----
orcc #$50 ; disable interrupts
ldx #$8000 ; start of ROM
loop sta $FFDE ; enable ROM mapping
lda ,x ; read byte
sta $FFDF ; enable RAM
sta ,x+ ; write byte back, increment pointer
cmpx #$FF00 ; are we done?
bne loop
andcc #$AF ; enable interrupts
-----[ END OF LINE ]-----

We don't actually want to copy the full 32K of ROM, since the upper 256 bytes
of the memory map is for I/O (Input/Output) devices. We also disable
interrupts since the default interrupt handlers are in ROM and if an
interrupt happens when we have RAM mapped, there may not be an interrupt
handler to handle it.

The code is straightforward, simple, and unfortunately, slow. Here's the main
loop again, this time with the number of cycles and bytes each instruction
takes:

-----[ Assembly ]-----
; cycles bytes
loop1 sta $FFDE ; 5 3
lda ,x ; 4 2
sta $FFDF ; 5 3
sta ,x+ ; 6 2
cmpx #$FF00 ; 4 3
bne loop1 ; 3 2
-----[ END OF LINE ]-----

The loop is 15 bytes, taking 27 cycles for each iteration. Since we copy one
byte per iteration and we have 35,512 iterations, it takes 877,824 cycles to
run (there are no pipe lines or caches to worry about, so this will always
take 877,824 cycles to run). Given the Color Computer runs at .89 MHz
(Megahertz) (yes, it's not a fast computer) this is nearly a second to copy
32K of RAM.

Can we do better?

Well, yes. Should we is another matter. I mean, this code is typically run
once, so does it matter if it takes a second to run? Meh. It'll take longer
to load the code from disk, but hey, I'm doing recreational retro programming
and want to have a bit of fun.

The first and easiest optimization is to read and write 16 bits at a time
instead of 8. And that's easy enough to do—the 6809 does have a 16-bit
accumulator so it's an easy change:

-----[ Assembly ]-----
; cycles bytes
loop2 sta $FFDE ; 5 3
ldd ,x ; 5 2
sta $FFDF ; 5 3
std ,x++ ; 8 2
cmpx #$FF00 ; 4 3
bne loop2 ; 3 2
-----[ END OF LINE ]-----

The loop now takes 30 cycles per iteration, but it's doing 16-bits per
iteration instead of 8. The code now takes 487,680 cycles, which is almost
half the time, and we've cut the cycles per byte to 15 from 27, and the size
of the code hasn't changed. Not bad for a simple optimization.

But doing 4 bytes per iteration should be better, right? It'll take another
register (which we have) and some additional bytes, but yes, it is better:

-----[ Assembly ]-----
; cycles bytes
loop3 sta $FFDE ; 5 3
ldd ,x ; 5 2
ldu 2,x ; 6 2
sta $FFDF ; 5 3
std ,x++ ; 8 2
stu ,x++ ; 8 2
cmpx #$FF00 ; 4 3
bne loop3 ; 3 2
-----[ END OF LINE ]-----

Each iteration now takes 44 cycles, but we're moving 4 bytes per iteration,
giving us 11 cycles per byte, and it takes a total of 357,632 cycles. Again
an improvement but we're starting to hit diminishing improvements. Doing 6
bytes per iteration (still easy to add) takes us down to 9.833 cycles per
byte, and 8 bytes per iteration takes us down to 9.24 cycles per byte, but we
require the use of the S register (the system stack pointer, which needs to
be saved and restored) to do so. It's not worth trying to use the last
register available to us (DP) because you can't load it directly. The loop
also increases in size, from 19 bytes (shown above) to 23 bytes for the 8-
bytes-per-iteration version. Also, the upper address will have to change for
the 6 byte version, since 6 doesn't cleanly divide 32,512. It's not a show
stopper, since not all the memory in the upper 32K contains useful code, so a
few bytes not copied won't crash BASIC.

But we can still do better!

Taking inspiration from “A Great Old-Timey Game-Programming Hack [2]” we can
copy 8 bytes per iteration (first, the commented code, then the code with the
cycles/bytes columns):

-----[ Assembly ]-----
orcc #$50 ; disable interrupts
sts savesp ; we need the S register
lds #$FF00-8; and because we're using the stack,
; we need to start at the top of ROM
; and work our way down
loop4 sta $FFDE ; enable ROM mapping
puls u,x,y,d ; pull 4 2-byte registers from memory (read memory)
sta $FFDF ; enable RAM
pshs u,x,y,d ; push 4 2-byte registers to memory (write memory)
leas -8,s ; point to next 8-byte block to transfer
cmps #$8000-8; are we done?
bne loop4
lds savesp ; restore S register
andcc #$AF ; enable interrupts
-----[ END OF LINE ]-----

The PULS instruction pulls the listed registers off the stack, and the PSHS
instruction pushes the listed registers onto the stack. The order of
registers in the instructions doesn't matter; they get pushed and pulled such
that it all works out.

-----[ Assembly ]-----
; cycles bytes
loop4 sta $FFDE ; 5 3
puls u,x,y,d ; 13 2
sta $FFDF ; 5 3
pshs u,x,y,d ; 13 2
leas -8,s ; 5 2
cmps #$8000-8; 5 4
bne loop4 ; 3 2
-----[ END OF LINE ]-----

The main loop is now 18 bytes, so it's on par with the third version. But
each iteration now takes 49 cycles, but given it moves 8 bytes per iteration,
we get an effective rate of 6.125 cycles per byte, and a total time of
199,136 cycles. We could do nine bytes per iteration (as the PSHS and PULS
instructions support the DP register) to get us down to 5.666 cycles per byte
(184,235 cycles total). We could also add in the CC register for a total of
10 bytes per iteration (5.3 cycles per byte; 172,314 cycles total) but we run
the risk of enabling interrupts at the wrong time, unless we disable
interrupts at the hardware level (which we can do, but it's more code and
more invasive of system state). You can see we'd be getting into diminishing
returns again, so I'm happy with just doing 8 bytes per iteration.

Table: Summary of results of copying 32,512 bytes from ROM to RAM
loop cycles/iteration cycles/byte cycles total bytes/iteration code size
------------------------------
loop1 27 27.000 877,824 1 15
loop2 30 15.000 487,680 2 15
loop3 44 11.000 357,632 4 19
loop4 49 6.125 199,136 8 18
(theoretical) 53 5.300 172,314 10 18

Is this important these days? Not really. Is it fun? Yes, it certainly beats
Enterprise Agile any day of the week.

[1] http://www.6809.org.uk/xroar/
[2] https://blog.moertel.com/posts/2013-12-14-great-old-timey-game-programming-hack.html
---

Discussions about this page

Notes on optimizing an O(n)+C algorithm where the C matters quite a bit | Lobsters
https://lobste.rs/s/ojiuck

Notes on optimizing an O(n)+C algorithm where the C matters quite a bit | Hacker News
https://news.ycombinator.com/item?id=35145162

Email author at [email protected]