Porting mruby/c for the SNES (Super Famicom)
RubyKaigi 2024, May 17, 2024 - Ryota Egusa

Ryota Egusa (@gedorinku)
Software engineer at Wantedly, Inc.

Overview
● Running mruby/c on an actual SNES console
  ○ Developing SNES games using mruby/c
  ○ The mruby/c porting process

The SNES
● Known as Super Famicom in Japan
● CPU: 65C816
  ○ 16 bit processor
  ○ 1.79 MHz, 2.68 MHz, 3.58 MHz
    ■ depending on the memory speed
  ○ Multiplication and division are handled either by the coprocessor or implemented in software
● RAM (W-RAM): 128KB
● VRAM: 64KB

Why run mruby/c on SNES?
● Inspired by Yuji Yokoo's presentation at RubyKaigi 2022 about porting mruby/c to the Sega Mega Drive
● I had already been programming for the SNES as a hobby before that
● Since 2023, OSS C compiler development for the 65C816 seems to have become more active

The hardware
● The PPU (Picture Processing Unit) acts as a fixed pipeline
● Writing values to PPU registers or VRAM causes the PPU to output the display in sync with NTSC (or PAL) signal timing

BG and Sprite
[Figure: screen layers labeled BG2, BG3, and Sprite]

BG
● Tile Maps
  ○ Created by combining references to 8x8 images and color palettes
● The number of available BGs (1 to 4) and the number of colors per BG tile (4 to 256) vary depending on the "BG Mode"
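As a rough illustration of how a tile map entry combines a reference to an 8x8 image with a color palette, the sketch below packs the 16 bit entry layout documented on the SNESdev Wiki (tile number, palette number, priority, and flip bits). The helper name `bg_tile_entry` is hypothetical and is not part of mruby/c or PVSnesLib.

```c
#include <stdint.h>

/* Hypothetical helper: packs one BG tile map entry.
 * Layout (per the SNESdev Wiki): vhopppcc cccccccc
 *   bits 0-9   tile (8x8 image) number
 *   bits 10-12 palette number
 *   bit  13    priority
 *   bit  14    horizontal flip
 *   bit  15    vertical flip */
uint16_t bg_tile_entry(uint16_t tile, uint8_t palette, int priority,
                       int hflip, int vflip)
{
    return (uint16_t)((tile & 0x03FFu)
                    | ((uint16_t)(palette & 0x07u) << 10)
                    | ((priority ? 1u : 0u) << 13)
                    | ((hflip ? 1u : 0u) << 14)
                    | ((vflip ? 1u : 0u) << 15));
}
```

A full tile map is then an array of such entries written to VRAM, one per 8x8 cell.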

Sprites
● You can set display positions and other settings for each sprite
● Characters in games are generally rendered using this feature

Video and timing
[Figure: frame diagram with regions labeled Screen, Horizontal blanking interval (HBlank), and Vertical blanking interval (VBlank)]
● For NTSC:
  ○ 262 scanlines / frame
  ○ Of those, 37 scanlines are VBlank
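The scanline counts translate into a time budget for VBlank work. The sketch below works through that arithmetic; the frame rate passed in is an assumption (NTSC is roughly 60 frames per second, which the slide does not state), and the function names are hypothetical.

```c
enum { SCANLINES_PER_FRAME = 262, VBLANK_SCANLINES = 37 };

/* Scanlines per frame that are not part of VBlank. */
int active_scanlines(void)
{
    return SCANLINES_PER_FRAME - VBLANK_SCANLINES;
}

/* Approximate VBlank duration in microseconds, assuming every scanline
 * takes the same time and the given frame rate (an assumption). */
double vblank_usec(double frames_per_second)
{
    double frame_usec = 1e6 / frames_per_second;
    return frame_usec * VBLANK_SCANLINES / SCANLINES_PER_FRAME;
}
```

At about 60 frames per second this comes to roughly 2.35 ms per frame, or about 14% of the frame time.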

The Game implementation
  while true
    SNES::Pad.wait_for_scan
    pad = SNES::Pad.current(0)
    # (Game routine)
    SNES.wait_for_vblank  # Wait for the NTSC (or PAL) vertical blanking interval
  end

The Game implementation
  SNES::Bg.scroll(1, camera_x, camera_y)
  SNES::OAM.set(0, x, y, priority, 0, 0, frame, 0)

C compilers
● PVSnesLib
  ○ Includes the compiler, linker and wrappers for the SNES I/O
● WDC Tools
  ○ The official tools by The Western Design Center, Inc.
  ○ Includes the C compiler and linker
  ○ The source code is not publicly available
  ○ Does not support C99

Address and C pointer
● CPU registers are 16 bit, address space is 24 bit
● Example: in the 24 bit address $7e8000, $7e is the bank address (8 bit)
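To make the bank and offset split concrete, here is a minimal sketch that takes a 24 bit address such as $7e8000 apart and puts it back together. The function names are hypothetical, and the code is host-side C for illustration, not what the 65C816 compiler emits.

```c
#include <stdint.h>

/* Bank address: the top 8 bits of a 24 bit address. */
uint8_t addr_bank(uint32_t addr)
{
    return (uint8_t)((addr >> 16) & 0xFFu);
}

/* Offset within the bank: the low 16 bits, which fit in a CPU register. */
uint16_t addr_offset(uint32_t addr)
{
    return (uint16_t)(addr & 0xFFFFu);
}

/* Rebuilds the 24 bit address from its two parts. */
uint32_t addr_make(uint8_t bank, uint16_t offset)
{
    return ((uint32_t)bank << 16) | offset;
}
```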

Address and C pointer
  lda.w $8000    ; Reads using the Data Bank Register (DB) as the bank address
  lda.l $7e8000  ; Reads from address $7e8000

Address and C pointer
● Pointer Type:
  ○ 32 bit (only 24 bits are used)
● Global Variables:
  ○ All placed in the $7e bank and addressed with 16 bit addressing
● Function Calls:
  ○ All use 24 bit addressing (jsr.l/rtl)
  ○ The return instruction must match the call, because the address pulled from the stack on return is either 16 bit or 24 bit

mruby/c HAL Implementation
● Remove the implementation related to the scheduler (rrt0.c, rrt0.h)
● Only one function needs to be implemented:
  ○ int hal_write(int fd, const void *buf, int nbytes)

mruby/c HAL Implementation
  #define HAL_BUF_SIZE (1024)
  static char hal_write_buf[HAL_BUF_SIZE];

  int hal_write(int fd, const void *buf, int nbytes)
  {
    // (Write to hal_write_buf)
  }
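The slide leaves the body of hal_write as a placeholder. One possible way to fill it in, shown here only as a sketch, is to treat hal_write_buf as a ring buffer so the most recent output is always visible in a debugger's memory view. The wrap-around policy and the `hal_write_pos` variable are assumptions, not the talk's actual implementation.

```c
#include <stddef.h>

#define HAL_BUF_SIZE (1024)

static char hal_write_buf[HAL_BUF_SIZE];
static size_t hal_write_pos = 0;  /* hypothetical: index of the next byte to write */

/* Copies nbytes from buf into hal_write_buf, wrapping to the start of the
 * buffer when the end is reached. fd is ignored because there is only one
 * output sink. Returns the number of bytes consumed. */
int hal_write(int fd, const void *buf, int nbytes)
{
    const char *src = (const char *)buf;
    int i;

    (void)fd;
    if (nbytes <= 0) {
        return 0;
    }
    for (i = 0; i < nbytes; i++) {
        hal_write_buf[hal_write_pos] = src[i];
        hal_write_pos = (hal_write_pos + 1) % HAL_BUF_SIZE;
    }
    return nbytes;
}
```

With this policy old output is overwritten once 1024 bytes have been written, which is a trade-off in favor of never blocking the VM.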

Debug
● There is no console available for outputting text
● Even attempting to display on the screen may fail due to bugs
  ○ Use hal_write_buf for debugging output
● Debugging is primarily done using an emulator
● Bugs that only reproduce on actual hardware can be difficult to fix

Mesen2 - emulator / debugger

Debug
  struct RObject {  // mrbc_value
    mrbc_vtype tt : 8;
    union {
      mrbc_int_t i;
      ...
      struct RClass *cls;
      struct RInstance *instance;

  // Object#object_id
  SET_INT_RETURN( v[0].i );

Debug
● Problems difficult to reproduce in emulators:
  ○ Incorrect ROM formatting
  ○ Timing issues involving hardware
    ■ Example: Reading the Pad register immediately after VBlank starts, which should not be possible
● Solutions:
  ○ Use multiple emulators
  ○ Use the Programmable I/O pin
    ■ (I have never used this for debugging)

Performance Improvement
● Initially, scrolling just one BG layer ran at about 8 fps
● Improved this to nearly 3 times faster
● Actions taken:
  ○ Using an enhancement chip
  ○ C compiler optimizations

Enhancement chips
● Chips embedded within the cartridge
● Perform tasks such as graphics processing on behalf of the console
● Examples
  ○ Super FX chip
    ■ For 2D and 3D graphics
  ○ ST018
    ■ ARMv3 32 bit processor
    ■ Used in "Hayazashi Nidan Morita Shogi 2" for Shogi AI

SA-1
● Uses the same 65C816 architecture
  ○ Not binary compatible, but porting is relatively easy
● Additional memory (depends on the cartridge):
  ○ I-RAM: 2KB
  ○ BW-RAM: 128KB
● Differences from the S-CPU (CPU on SNES):
  ○ Cannot directly access registers such as the PPU
  ○ Different memory mapping

SA-1
[Figure: block diagram with the S-CPU (65C816), W-RAM, and PPU on the console side, and a Game Cartridge containing the SA-1 (65C816), I-RAM, BW-RAM, ROM, …]

SA-1 Memory mapping
[Figure: SA-1 memory map. Bank labels: $00, $40, $50, $60, $70, $80, $C0, $100. Offsets within a bank: $0000, $0800, $2000, $3000, $3800, $8000, $10000. Regions: I-RAM, Registers, ROM, BW-RAM]
● No mapping for W-RAM
● No registers such as PPU
● I-RAM: twice as fast as BW-RAM. Used for the stack.
● I-RAM at $3000: mapped to the same location in the S-CPU. Convenient for memory sharing.

SA-1: ROM Header
● Describes metadata about the cartridge, such as the size of the ROM
● $FFD6 = $35
  ○ $30: SA-1
  ○ $05: ROM + coprocessor + RAM + battery
● $FFD8 = $07
  ○ RAM size
  ○ 1 << 7 = 128KB
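The two header bytes above can be decoded with a few bit operations. The sketch below assumes only what the slide states: the high nibble of the byte at $FFD6 selects the coprocessor, the low nibble describes the cartridge hardware, and the byte at $FFD8 is a power-of-two exponent for the RAM size in KB. The function names are hypothetical.

```c
#include <stdint.h>

/* High nibble of the byte at $FFD6: coprocessor type ($30 is SA-1). */
uint8_t header_coprocessor(uint8_t chipset)
{
    return (uint8_t)(chipset & 0xF0u);
}

/* Low nibble of the byte at $FFD6: cartridge hardware
 * ($05 is ROM + coprocessor + RAM + battery). */
uint8_t header_hardware(uint8_t chipset)
{
    return (uint8_t)(chipset & 0x0Fu);
}

/* Byte at $FFD8: RAM size in KB is 1 << value.
 * Returns 0 for exponents that would not fit in 32 bits. */
uint32_t header_ram_kb(uint8_t ram_size)
{
    if (ram_size > 31) {
        return 0;
    }
    return (uint32_t)1 << ram_size;
}
```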

SA-1: Calling the S-CPU
  void call_s_cpu(void (*target_func)(), size_t args_size, ...);
  call_s_cpu(bg_set_scroll, sizeof(int) * 3, 1, x, y);
● Writes to shared memory. The S-CPU simply polls this memory.
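The write-then-poll idea can be sketched on a host machine with a small mailbox structure. Everything here is hypothetical: the talk's call_s_cpu is variadic and passes arguments through the SA-1 stack, whereas this simplified version uses a fixed-size int array, a different callback signature, and a single `pending` flag standing in for the shared memory that the S-CPU polls.

```c
#include <stddef.h>

#define MAILBOX_MAX_ARGS 4

typedef void (*mailbox_func)(const int *args, size_t nargs);

struct mailbox {
    volatile int pending;        /* 1 while a request waits for the S-CPU */
    mailbox_func func;           /* function the S-CPU should call */
    size_t nargs;
    int args[MAILBOX_MAX_ARGS];
};

/* Stands in for memory visible to both CPUs (for example I-RAM). */
static struct mailbox shared_mailbox;

/* SA-1 side: stores the request. Returns 0 on success, -1 if a request
 * is still pending or there are too many arguments. */
int mailbox_post(mailbox_func func, const int *args, size_t nargs)
{
    size_t i;

    if (shared_mailbox.pending || nargs > MAILBOX_MAX_ARGS) {
        return -1;
    }
    for (i = 0; i < nargs; i++) {
        shared_mailbox.args[i] = args[i];
    }
    shared_mailbox.func = func;
    shared_mailbox.nargs = nargs;
    shared_mailbox.pending = 1;  /* publish last, after the payload */
    return 0;
}

/* S-CPU side: called repeatedly from its main loop.
 * Returns 1 if a request was executed, 0 if there was nothing to do. */
int mailbox_poll(void)
{
    if (!shared_mailbox.pending) {
        return 0;
    }
    shared_mailbox.func(shared_mailbox.args, shared_mailbox.nargs);
    shared_mailbox.pending = 0;
    return 1;
}

/* Example target: records the arguments instead of touching PPU registers. */
static int last_scroll[3];

void demo_bg_set_scroll(const int *args, size_t nargs)
{
    size_t i;

    for (i = 0; i < nargs && i < 3; i++) {
        last_scroll[i] = args[i];
    }
}
```

On real hardware the two sides run on different CPUs, so the order of writes matters: the flag is set only after the function pointer and arguments are in place.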

SA-1: Calling the S-CPU
[Figure: memory layout with offsets $0000, $2000, $3000, $3800. Labels: "SA-1 stack mapped in S-CPU" holding "args of target_func" and "call_s_cpu_target_func's frame"; "S-CPU stack" holding "args of target_func"; arrow labeled "Copy and call target_func"]

Running mruby/c on SA-1
● The S-CPU and SA-1 operate in parallel
● When the SNES is reset, the S-CPU starts executing from the address in the Reset vector
  ○ At this point, the SA-1 is not yet active

Running mruby/c on SA-1
  lda #__start_sa1  ; Set Reset vector
  sta $2203
  sep #$20          ; Set A register to 8 bit
  stz $2200         ; Run SA-1

Running mruby/c on SA-1
  __start_sa1:
    (Initialize memory and registers here)
    jsr.l sa1_main

  int sa1_main(void)
  {
    (Run mruby/c VM)
  }

Demo

Future work
● Performance Improvement
  ○ Further optimize the C compiler
  ○ Optimize memory usage (use I-RAM as much as possible)
  ○ Support DMA using an Array-like object in mruby/c
● Allow running without the SA-1

Conclusion
● There is still a lot of potential to improve the performance and stability of the C compiler for the 65C816
● For now, running mruby/c on the SNES requires an enhancement chip
