Skip to content

Michaelangel007/apple2_count_million

Repository files navigation

Count to 1,000,000 on Apple 2 in 6502 assembly

The January 1984 issue of Nibble Magazine

Cover

Had this article HYPERCOUNTER by Ron Macken and Bill Consoli on page 62:

Page 62

Hypercounter.s:

 ORG $300 SCRN EQU $5B8 ;FIRST LOCATION ON THE SCREEN HOME EQU $FC58 300:20 58 FC JSR HOME ;CLEAR SCREEN 303:A9 B0 LDA #$B0 ;ASC (IN HEX) OF THE DIGIT 0 305:A2 08 LDX #8 307:9D B8 05 NUMCL STA SCRN,X ;SET XTH DIGIT TO ZERO 30A:CA DEX 30B:D0 FA BNE NUMCL ;GOTO NUMCL IF ALL 8 DIGITS AREN'T 0 30D:A2 08 COUNT LDX #8 30F:A9 B9 CHECK LDA #$B9 ; WILL THIS DIGIT BE GREATER 311:DD B8 05 CMP SCRN,X ;THAN 9 IF WE INCREMENT IT? 314:F0 06 BEQ KICK ;YES? THEN IT NEEDS TO BE KICKED OVER 316:FE B8 05 INC SCRN,X ;ACTUAL COUNTING IS DONE HERE 319:4C 0D 03 JMP COUNT ;START OVER 31C:A9 B0 KICK LDA #$B0 ;KICKS DIGIT OVER FROM 9 TO 0 31E:9D B8 05 STA SCRN,X ;PUTS THE DIGIT ON THE SCREEN 321:CA DEX 222:D0 EB BNE CHECK ;GO BACK AND DO IT AGAIN 324:60 RTS ;WE'VE REACHED 99,999,999 ! 

Which generates this binary:

300:20 58 FC A9 B0 A2 08 9D 308:B8 05 CA D0 FA A2 08 A9 310:B9 DD B8 05 F0 06 FE B8 318:05 4C 0D 03 A9 B0 9D B8 320:05 CA D0 EB 60 

Unfortunately it is S-L-O-W and buggy. :-/

  • The Y-Reg is unused. We could cache two constants in registers:
    • A = $B9 (9) and
    • Y = $B0 (0) instead of constantly reloading A on $30F and $31C
  • It uses the slow 7 cycle INC $abs,X. This is done 1,000,000 times. Faster is the 4 cycle INC $abs
  • It has an off-by-one bug -- it doesn't actually display 100,000,000. The last number two numbers output are:
    • 99,999,999
    • 00,000,000

Other problems include:

  • It includes your typical "No Shit, Sherlock!" crappy verbose commenting.

We are going to change the range above (0 - 99,999,999) to (1 - 1,000,000) since on c.s.a2 this similar problem was asked:

Can anyone provide the quickest native machine language routine equivalent to FOR I = 1 TO 1E6: PRINT I: NEXT 

Now, the fastest way to count from 1 to 1,000,000 is to only change the bytes that actually change from x to x+1.

This means we can remove ALL compares and branching!

We can write a C program to spit out 6502 assembly for us:

 for( i = 0; i < end; i++ ) { int x = NUM-1; char before[ NUM+1 ]; char after [ NUM+1 ]; sprintf( before, "%0*d", NUM, i+0 ); sprintf( after , "%0*d", NUM, i+1 ); while( x > 0 ) { int delta = after[x] - before[x]; if( delta == 0 ) ; // skip digit else if( delta == 1 ) { if( (before[x] == '0') && (x != 1) ) one( x ); else if( (before[x] == '1') && (x != 1) ) two( x ); else inc( x ); } else if( delta == -9 ) zer( x ); x--; }

You'll notice there are 4 variations of x++. Why 4? Consider what happens to each digit:

  • 0 -> 1
  • 1 -> 2
  • 9 -> 0
  • remaining 2,3,4,5,6,7,8

Since the 6502 has 3 registers, A, X, and Y we have 3 specializations where we can cache the constant values in registers:

  • X = 0
  • Y = 1
  • A = 2

To also keep things focused on a pure benchmark we also remove that clear screen.

Memory Size

Depending on how much we want to loop-unroll we get this tradeoff:

End Memory Size
1000 $1568 < 1 KB
10000 $8A98 ~ 33 KB
100000 $51E75 >325 KB
1000000 $32DCB7 > 3 MB (3,333,330 bytes)

We'll unroll the first 10,000 numbers so it fits into the 48KB of RAM.

With a little bit of setup code ...

digit = $401 ; VTAB 1:HTAB 2  ORG $800 Prologue  LDA #'2' + $80 ; +2 = 2  LDX #'0' + $80 ; +2 = 4  LDY #'1' + $80 ; A,BCD,EFG ; +2 = 6  STX digit - 1 ; 0,BCD,EFG ; +4 = 10  STX digit + 0 ; 0,0CD,EFG ; +4 = 14  STX digit + 1 ; 0,00D,EFG ; +4 = 18  STX digit + 2 ; 0,000,EFG ; +4 = 22  STX digit + 3 ; 0,000,0FG ; +4 = 26  STX digit + 4 ; 0,000,00G ; +4 = 30 ; STX digit + 5 ; 0,000,000 ; +4 = 34 ; <-- not needed since Count_10_000 sets to '1' Work  JSR Count_100_000 ; 100000  JSR Count_100_000 ; 200000  JSR Count_100_000 ; 300000  JSR Count_100_000 ; 400000  JSR Count_100_000 ; 500000  JSR Count_100_000 ; 600000  JSR Count_100_000 ; 700000  JSR Count_100_000 ; 800000  JSR Count_100_000 ; 900000  JSR Count_100_000 ; 000000 Epilogue ;==10  STX digit + 0 ; 000000  STY digit - 1 ;1000000  RTS Count_100_000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 00000  STX digit + 1 ; 000000  INC digit + 0 ; 100000  RTS Count_10_000

... Bob's your uncle.

Is that the best we can do?

No, we can continue unrolling even more!

; 100000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 0:0000  STX digit + 1 ; 000000  INC digit + 0 ; 100000 ; 200000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 1:0000  STX digit + 1 ; 100000  INC digit + 0 ; 200000 ; 300000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 2:0000  STX digit + 1 ; 200000  INC digit + 0 ; 300000 ; 400000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 3:0000  STX digit + 1 ; 400000  INC digit + 0 ; 400000 ; 500000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 4:0000  STX digit + 1 ; 400000  INC digit + 0 ; 500000 ; 600000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 5:0000  STX digit + 1 ; 500000  INC digit + 0 ; 600000 ; 700000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 6:0000  STX digit + 1 ; 600000  INC digit + 0 ; 700000 ; 800000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 7:0000  STX digit + 1 ; 700000  INC digit + 0 ; 800000 ; 900000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 8:0000  STX digit + 1 ; 800000  INC digit + 0 ; 900000 ;1000000  JSR Count_10_000 ; 010000  JSR Count_10_000 ; 020000  JSR Count_10_000 ; 030000  JSR Count_10_000 ; 040000  JSR Count_10_000 ; 050000  JSR Count_10_000 ; 060000  JSR Count_10_000 ; 070000  JSR Count_10_000 ; 080000  JSR Count_10_000 ; 090000  JSR Count_10_000 ; 9:0000  STX digit + 1 ; 900000 ; +4 = 64  STX digit + 0 ; 000000 ; +4 = 68 Epilogue ;==10  STY digit - 1 ;1000000 ; +4  RTS ; +6

Total Time

Prologue = +30 Work = +730 Count_10_000 = +6,000,600 Epilogue = +10 ================== 6,001,370 cycles 

Theoritical Best Case?

So what is the theoritical fastest case if we display every number?

Loop unrolling all 1,000,000 we have this timing:

A) We would have 30 cycles of prologue code:

digit = $401 ; VTAB 1:HTAB 2 ORG $0 Prologue LDA #'2' + $80 ; +2 = 2 LDX #'0' + $80 ; +2 = 4 LDY #'1' + $80 ; A,BCD,EFG ; +2 = 6 STX digit - 1 ; 0,BCD,EFG ; +4 = 10 STX digit + 0 ; 0,0CD,EFG ; +4 = 14 STX digit + 1 ; 0,00D,EFG ; +4 = 18 STX digit + 2 ; 0,000,EFG ; +4 = 22 STX digit + 3 ; 0,000,0FG ; +4 = 26 STX digit + 4 ; 0,000,00G ; +4 = 30 

B) 4 cycles epilogue:

 STY digit - 1 ;1000000 

C) And 5,999,996 of counting from 1 to 1,000,000 with these stats:

  • SIZE = $32DCB7 (3,333,330 bytes)
  • CYCLES = 5,999,996

For a best case of:

= 30 + 4 + 5,999,996 = 6,000,030 cycles.

You can verify yourself with unrolled.1000000.s

Results

How well did we do?

= 6,000,030 / 6,001,370 * 100 = 99.97% of peak performance.

Not bad!

Who knew counting could be so complicated! :-)

QED.

About

One of the fastest ways to count to 1,000,000 on the Apple 2

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors