What Not to Use
Starting a new project from scratch is hard. Although conceptualizing a new game - the mechanics, the story, the art - is a blast, getting down to the brass tacks of building and coding the game can be somewhat of a chore. Starting development is even more arduous when the intended platform is a retro console.
It’s also difficult when the seemingly popular hardware has no real homebrew support. The SNES is such a platform. Despite being one of the most popular consoles for retro homebrew games, the tools available for the SNES are somewhat immature and limited.
Having a lot of experience with the 6510 (which is 99.5% compatible with the 6502), I started researching the available tools. Since the NES has very good cc65 support, I was surprised that the suite's C compiler does not support the SNES or the 65816. However, its assembler, ca65, does support the 65816, which is the CPU of the console.
The only C compiler I found was a proof-of-concept 65816 version of TCC (Tiny C Compiler), which is known to have some bugs. As if that wasn't enough, the accompanying assembler, called WLA-DX, was also somewhat clunky.
This is not as bad as it sounds, since it is totally possible to create a working game with it. However, I had neither the courage nor the time to hunt bugs within the tools themselves. I shall leave that task to some intrepid soul with the patience and fortitude to sift through these retro development tools.
Instead, I chose ca65, which is a proven and stable assembler for the 6502, and I hoped its 65816 support would be similarly well made.
The SNES is a very well designed console, even by today’s standards. However, coming from the NES, I found some of its limits tighter than I expected. With these tools in hand, what I needed was a framework for my excursion into SNES development.
Found a framework or it found me?
There are a few frameworks around, but so far I have found only one that uses ca65. It is called LibSFX.
Basically, it is a collection of macros plus boilerplate to bootstrap development for the SNES. It also includes some examples, which are supposed to ease development.
LibSFX is great for me, because:
- I am familiar with 6510 assembly
- It hides most of the complexities of the SNES
- It’s small
- It also comes with tools for graphics
It has its downsides too:
- It does require attention; unlike C, it doesn’t reduce code complexity
- The examples and the documentation are bare-bones, and only the essentials are covered
- ca65 is not compatible with WLA-DX, so examples found on the net have to be modified
Once the first few examples compiled and ran, I was good to go.
One of the limits is the VRAM, which is large by NES standards: 64K. However, the VRAM usable for sprites is only two tables of 16x16 tiles. This is the same layout as on the NES, although with twice the number of tiles.
The sprites are also larger: two sizes can be selected from 16x16, 32x32 and 64x64. The combinations are fixed, and some variations are not supported. This means that one of the 256-tile tables is used for one sprite size and the other for the second size. This is not set in stone; in fact, it can even be changed on the fly, but even this can’t help with the next issue.
The first problem appears when you have more animation frames than actually fit into this VRAM segment. What do you do? There are no mappers for the SNES in the NES sense; also, the VRAM similarly cannot be modified during rendering.
Luckily the SNES has something the NES lacks: DMA channels, 8 of them actually. DMA is much faster than updating the VRAM with the CPU. Even CPU updates are faster than on the NES, but note that the sprites are also larger, so more data has to be modified.
So the idea is to stream the sprite data during vblank. The DMA controller is so fast that about 5.5K of data can be transferred in one vblank.
LibSFX provides a macro, “VRAM_memcpy”, which does exactly this. It can even be dropped into the VBL handler, which can also be set up by the framework. All it does is set up DMA channel 7 to transfer data from the given memory area to a designated VRAM address.
It can also take constants or registers, which is very convenient.
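For reference, this is roughly what such a transfer looks like when the registers are set up by hand. It is a sketch of the kind of work a VRAM copy has to do, not LibSFX's actual implementation. The names VRAM_DEST, FRAME_SIZE, sprite_frame and copy_frame_to_vram are made up for the example, and the code assumes that A is 8-bit and X/Y are 16-bit on entry, that the data bank register points at a bank where the hardware registers are visible, and that it runs during vblank or forced blank.

```asm
.p816                           ; ca65: assemble for the 65816

VRAM_DEST  = $6000              ; hypothetical VRAM word address for the sprite tiles
FRAME_SIZE = 512                ; hypothetical size of one animation frame in bytes

.segment "RODATA"
sprite_frame:
        .res FRAME_SIZE, $00    ; placeholder for the real tile data

.segment "CODE"
.a8                             ; assumption: the caller has A 8-bit...
.i16                            ; ...and X/Y 16-bit
copy_frame_to_vram:
        lda #$80
        sta $2115               ; VMAIN: step the VRAM address after each write to $2119
        ldx #VRAM_DEST
        stx $2116               ; VMADDL/VMADDH: destination word address
        lda #$01
        sta $4370               ; DMAP7: CPU to PPU, alternating between two registers
        lda #$18
        sta $4371               ; BBAD7: target $2118/$2119 (VMDATAL/VMDATAH)
        ldx #.loword(sprite_frame)
        stx $4372               ; A1T7L/A1T7H: source address
        lda #^sprite_frame
        sta $4374               ; A1B7: source bank
        ldx #FRAME_SIZE
        stx $4375               ; DAS7L/DAS7H: number of bytes to copy
        lda #$80
        sta $420B               ; MDMAEN: start the transfer on channel 7
        rts
```

Swapping the source label every few frames is all it takes to animate a sprite whose frames do not fit into the sprite tables at once.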
The friendly nature of the 65816
It looks good at first sight. You get your humble 6502 CPU with extras. Not only is it faster, but you can also switch your A and/or X,Y registers to 16 bits. This is necessary in order to address more than 64K of data.
Yes, that’s right, you can use addresses of up to 24 bits. The A register can hold 16 bits too, and all operations work on 16 bits, so there is no need for separate 16-bit arithmetic workarounds. What isn’t apparent is that this is where all hell breaks loose.
First, you have to enable 16-bit mode in the processor status register. You can do it for A and X/Y separately. The problem is that most mnemonics look the same in both modes. An 8-bit immediate load is written exactly as on the good old 6502, but in 16-bit mode the same mnemonic takes a 16-bit operand: LDA #$1000
The mnemonic is the same, and so is the opcode byte, but the instruction now takes up 3 bytes instead of 2 because the operand is one byte longer.
What it actually means: after enabling 16-bit mode, your assembler will also generate 16-bit code. Or not. The best way to do it is to TELL your assembler to generate 16-bit code from here on, since you have to tell both the CPU and the assembler!
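A minimal sketch in plain ca65, without any framework, shows both halves of the switch side by side. The byte values in the comments follow from the 65816 encoding of an immediate LDA, where the opcode byte is $A9 in either mode.

```asm
.p816                   ; ca65: assemble for the 65816

        sep #$20        ; CPU: set the M flag, A becomes 8 bits wide
.a8                     ; assembler: emit 8-bit immediates for A
        lda #$10        ; assembles to A9 10 (2 bytes)

        rep #$20        ; CPU: clear the M flag, A becomes 16 bits wide
.a16                    ; assembler: emit 16-bit immediates for A
        lda #$1000      ; assembles to A9 00 10 (3 bytes, low byte first)
```

The same pattern applies to X/Y with bit $10 of the status register and the .i8/.i16 directives.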
Now, what if a subroutine leaves A in 8-bit mode, and after it returns you expect 16-bit mode? Your code simply will not run. In a debugger it won’t even look like your code anymore, because in 8-bit mode the CPU takes only one operand byte and treats the second byte of a 16-bit operand as the next opcode. That’s why you have to track what mode you’re in all the time.
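One common way to keep a subroutine from leaking its mode into the caller is to save and restore the status register around it. The sketch below uses plain ca65; set_brightness and caller are hypothetical names, and the write to $2100 assumes the data bank register points at a bank where the PPU registers are visible.

```asm
.p816                           ; ca65: assemble for the 65816

.a16
.i16
caller:
        rep #$30                ; CPU: A and X/Y are 16 bits wide
        jsr set_brightness
        lda #$1234              ; safe: the subroutine restored the caller's widths
        rts

set_brightness:
        php                     ; save the status register, including the M and X flags
        sep #$20                ; CPU: A is 8 bits wide inside this routine
.a8
        lda #$0f
        sta $2100               ; INIDISP: screen on, full brightness
        plp                     ; restore whatever widths the caller had
        rts
.a16                            ; whatever follows in the source is 16-bit code again
```

Without the php/plp pair, the CPU would return with A still 8 bits wide, execute the caller's load as LDA #$34, and then try to run the byte $12 as an instruction.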
To add to the surprise, the VBL handler may strike at any time, since it is tied to the vblank and not synchronized with your code. You may also need 8-bit code in your VBL handler (I do) without knowing what mode you need to return to, which adds a bit of difficulty as well.
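For a hand-written interrupt handler (outside of LibSFX, whose own handler setup may differ), the 65816 helps a little: the CPU pushes the status register when the interrupt is taken, and rti pulls it back, so the interrupted code gets its register widths back. The register contents still have to be saved at full width. A minimal sketch, with nmi_handler as a hypothetical label and the assumption that the data bank register points at a bank where $4210 is visible:

```asm
.p816                           ; ca65: assemble for the 65816

nmi_handler:
        rep #$30                ; CPU: 16-bit A and X/Y, so the full registers get saved
.a16
.i16
        pha
        phx
        phy                     ; a complete handler would also save the data bank and direct page

        sep #$20                ; CPU: 8-bit A for the register access below
.a8
        lda $4210               ; RDNMI: reading it acknowledges the NMI
        ; ... 8-bit vblank work goes here ...

        rep #$30                ; back to 16-bit before pulling the saved registers
.a16
.i16
        ply
        plx
        pla
        rti                     ; pulls the status register, restoring the interrupted M/X flags
```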
LibSFX helps somewhat by providing macros for tracking the processor state. You can set it explicitly to 16-bit mode, or use it as a stack: push a mode (like a8i16), then pop it at the end. Very handy for VBL handlers.
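A sketch of the stack-style usage, assuming the framework is pulled in with an include file named libSFX.i and that RW_push and RW_pull take the arguments shown; the routine name is made up for the example.

```asm
.include "libSFX.i"             ; assumption: LibSFX's main include file

set_full_brightness:            ; hypothetical subroutine
        RW_push set:a8i16       ; remember the current widths, then switch to A 8-bit, X/Y 16-bit
        lda #$0f
        sta $2100               ; INIDISP: screen on, full brightness
        RW_pull                 ; go back to the widths remembered by RW_push
        rts
```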
By the way, you still don’t get multiply or divide instructions of any kind - at least not from the 65816. More on this later.
The ca65 does not know what you’re doing
Setting the register width this way is very helpful except for a small problem: the assembler is not aware of any of this. It will often complain that this or that value does not fit into 8 bits. The assembler does not know your code flow. It assembles the code sequentially.
To avoid this you need to tell the assembler that this code is 16-bit (with .A16 or .I16) and that code is 8-bit (with .A8 and/or .I8 respectively).
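The following plain ca65 sketch shows why source order and execution order can disagree. The labels are hypothetical; the point is only that the directive in effect at continue is the last one the assembler saw in the file, not the state the CPU arrives with.

```asm
.p816                           ; ca65: assemble for the 65816

start:
        rep #$20                ; CPU: A is 16 bits wide
.a16
        bra continue            ; execution jumps over the helper below

helper:                         ; 8-bit routine that happens to sit in between
        sep #$20
.a8
        lda #$01
        rts

continue:
        ; The CPU gets here through the branch with A still 16 bits wide, but the
        ; last directive in source order was .a8, so the width is stated again.
.a16
        lda #$1234
```

Without the second .a16, ca65 would try to assemble the final load with an 8-bit operand and report that the value is out of range.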
In LibSFX there is a macro called RW_assume which does something similar. It tries to follow the register states and tells the assembler that this code, although it doesn’t look like it, is in fact in the right mode.
There is an example in their documentation, that sums it up really well:
“Assemble nicely” is an understatement. I started all my subroutines with “RW_push set:a16i16” to keep track of the states, but putting a JSR anywhere simply reset the state back to a8, which was very annoying. I was also surprised that the running code either was total garbage or produced 8-bit numbers, which means it also treated all of my .word variables as .byte.
The example above saved me a lot of headaches.
They have documentation online which is included with the framework too:
What I found useful is the LZ4 package, which is a lightweight compressor. It’s not fast enough to run during vblank (although small datasets might work), but you can decompress map or character data to EXRAM and then use VRAM_memcpy.
Speaking of VRAM_memcpy, there are CGRAM_memcpy and WRAM_memcpy as well.
The ca65 still doesn’t know what you’re doing
To be fair, this can be attributed to both LibSFX and ca65 itself. The beauty of the situation is that not only the CPU has to be switched between modes but the assembler too, preferably at the very same time, otherwise bad things can happen. How bad?
Well, I already talked about the friendly nature of the 65816, so let me write a few thoughts about how friendly LibSFX can be as well. It all starts with good intent, giving you the macros I talked about earlier, that is RW_push, RW_assume, RW_pull, and so on.
However, while a stack-based approach for storing CPU and assembler states sounds very convenient, there are a few problems. Usually one finds out the hard way, when the code simply produces a black screen after a change that did not seem important.
The problems are the following:
- The assembler has no idea of the code flow. It assembles the source from top to bottom. It also switches between 8- and 16-bit code generation while doing this, and it can only do that by finding explicit .A8, .I8, .A16 or .I16 directives, or by tracking the REP and SEP instructions in the code itself (ca65's .smart mode).
- The macros LibSFX provides do *NOT* emit any assembler directives, although in my opinion they should.
- The stack-based approach breaks with a .proc or .endproc declaration. It simply won’t work inside one. Period.
- It also breaks with the code flow. You may set it to 8 or 16 bit before calling a subroutine, but the generated code may not match. Also, after calling the subroutine the stack “sometimes” resets back to 8-bit, even if the assembler gets the idea.
This gives a nice 3-fold issue that can happen anytime, anywhere in the code. Even if you use the RW_print macro to see what the assembler is about to generate, it doesn’t mean the CPU has been switched.
I think the explanation lies in LibSFX itself. The manual says: “The switching code is only generated if it’s necessary.” But how does it know if it’s necessary? The answer is simple: it doesn’t.
The correct solution is within LibSFX itself: don’t overuse the RW_set and RW_pull macros. This convenience is there only to help you in really difficult circumstances.
Oh, did I mention these macros do not set the assembler state? Yes, you need to track that yourself. This alone is a good reason why the SNES would crash with a black screen: there is no way of telling what your 8-bit code means in 16-bit mode or vice versa (though usually it is a loop or a BRK).
The solution: map your code explicitly into 8- or 16-bit segments. You can switch back and forth, but be clear about which section is 8-bit and which is 16-bit.
There is a macro that helps you, so why not use that, or you can make your own. The real key here is to set the assembler state IMMEDIATELY after setting the CPU state.
The macro is called RW; it can’t be simpler. You can also leave X,Y at 16-bit and only switch A around.
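A sketch of what such explicit segments can look like, assuming the framework is included through libSFX.i and that RW accepts a single-register width such as a16 in addition to the combined forms:

```asm
.include "libSFX.i"             ; assumption: LibSFX's main include file

        RW a8i16                ; CPU: A 8-bit, X/Y 16-bit
.a8                             ; assembler: same widths, stated right away
.i16
        lda #$12                ; 8-bit section

        RW a16                  ; CPU: only A changes, X/Y stay 16-bit
.a16                            ; assembler follows immediately
        lda #$1234              ; 16-bit section

        RW a8
.a8
        lda #$12                ; back in an 8-bit section
```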
No guessing, no stacking, no .proc, just straight and explicit segments. You’ll thank yourself later.
If you need to switch X,Y around, you can use RW a16i16 or RW a8i8 followed by .a16 and .i16 or .a8 and .i8 respectively.
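If you would rather roll your own, a few plain ca65 macros are enough to keep the CPU and the assembler in lockstep. The macro names below are made up for this sketch and are not part of LibSFX.

```asm
.p816                           ; ca65: assemble for the 65816

.macro SetA8                    ; A to 8-bit, on the CPU and in the assembler
        sep #$20
        .a8
.endmacro

.macro SetA16                   ; A to 16-bit, on the CPU and in the assembler
        rep #$20
        .a16
.endmacro

.macro SetXY16                  ; X/Y to 16-bit, on the CPU and in the assembler
        rep #$10
        .i16
.endmacro

        SetXY16
        SetA16
        lda #$1234              ; assembled as a 3-byte instruction
        SetA8
        lda #$12                ; assembled as a 2-byte instruction
```

Because each macro emits the instruction and the directive together, the two can never drift apart within a straight run of code. Code reached by a jump or call still needs its width stated at the entry point.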
Even with a considerable amount of retro development experience, programming for the SNES was something of a challenge for me. However, with patience, the right tools, and meticulous attention to code, developing for this beloved retro console can be a joy.