A toy RTOS inside Super Mario Bros. using emulator save states
180 notorious_pgb 46 5/28/2025, 8:15:43 PM prettygoodblog.com ↗
This started as a throwaway metaphor in a blog post, but is now fully runnable: a toy RTOS with preemptive multitasking inside of Super Mario Bros. on the NES.
Essentially, this is:
- A rudimentary preemptive RTOS
- Using an unmodified NES emulator (FCEUX) as the CPU
- "Unmodified" depending on how you define terms
- With emulator save states as the thread contexts
- With support for (very basic) mutexes, interrupt masking, and condition variables
- Demonstrated using Super Mario Bros. 1-1 with sections of the map dedicated to various synchronization primitives
There are many simplifications and shortcuts taken (doesn't even have task priorities), and it doesn't map 1:1 to true multithreading (e.g., emulator save states represent the state of the entire machine including RAM, whereas thread contexts represent a much more minimal slice), but I think it's A) pretty interesting and B) a unique visceral explanation of threads.
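The core mechanism described above (save states as thread contexts, round-robin preemption) can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical emulator API; the real project drives FCEUX through its Lua bindings, not this interface:

```python
from collections import deque

class Emulator:
    """Stand-in for a real emulator (hypothetical interface). The entire
    machine state is collapsed into one number for illustration."""
    def __init__(self):
        self.state = 0
    def run_frames(self, n):
        self.state += n
    def save_state(self):
        return self.state
    def load_state(self, snapshot):
        self.state = snapshot

def schedule(emu, thread_states, quantum, ticks):
    """Round-robin preemption where each 'thread context' is a full
    save state: restore, run one time slice, snapshot, move on."""
    ready = deque(thread_states)
    for _ in range(ticks):
        current = ready.popleft()
        emu.load_state(current)          # context switch in: restore the machine
        emu.run_frames(quantum)          # let this thread run for its quantum
        ready.append(emu.save_state())   # context switch out: snapshot everything
    return list(ready)
```

With two threads starting from the same save state and a quantum of one frame, four scheduler ticks give each thread two frames of progress, which is exactly the "the snapshot IS the thread" idea.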
From your previous post
> The first thing to understand about threads is that processors do not support them.
Yes, but I would say that by changing the emulator in the outside world (augmenting the emulator software with your other software, sure) rather than doing something in the inside world, you are implementing something closer to hardware concurrency support, or to processes.
> It doesn't map 1:1 to true multithreading (e.g., emulator save states represent the state of the entire machine including RAM, whereas thread contexts represent a much more minimal slice)
Right, but I think if we're going to make a threads vs process distinction (vs concurrency in general) whether we have only a small opt-in set of communication channels / synchronization primitives (the mutex and barrier), or a wide open one (shared memory, be careful!) is the operative distinction.
The metaphor I usually go for when it comes to threads is having several widgets that need to be assembled. To begin working on Widget B, you have to pause the work on Widget A. Adding SMP is like adding a second person to help assemble widgets, but you're still using the same toolbox, so if you both need the same screwdriver, there's no performance benefit to having a second person. Multicore is having multiple workbenches each with their own toolbox so they can operate completely in parallel.
A mutex is like having a specialized tool that can only be used by one widget assembly at a time. Like, each workbench might have a full set of screwdrivers, hammers, wrenches, sockets, etc., but you only have 1 welder.
A semaphore is a tool that can be used by a limited number of widgets, like an oven that can fit up to 4 widgets.
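The welder/oven metaphor maps directly onto `threading.Lock` and `threading.Semaphore`. A small sketch (the widget counts and sleep times are arbitrary, just there to make the contention visible):

```python
import threading
import time

welder = threading.Lock()        # mutex: only one bench can use the welder at a time
oven = threading.Semaphore(4)    # semaphore: the oven fits up to 4 widgets

in_oven = 0
peak_in_oven = 0
counter_lock = threading.Lock()  # protects the two counters above

def assemble(widget):
    global in_oven, peak_in_oven
    with welder:                 # exclusive access to the single welder
        time.sleep(0.001)        # weld
    with oven:                   # blocks while 4 widgets are already baking
        with counter_lock:
            in_oven += 1
            peak_in_oven = max(peak_in_oven, in_oven)
        time.sleep(0.01)         # bake
        with counter_lock:
            in_oven -= 1

workers = [threading.Thread(target=assemble, args=(i,)) for i in range(10)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

However the ten workers interleave, `peak_in_oven` can never exceed 4, which is the whole point of the oven being a semaphore rather than a mutex.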
What drove me to write this post (as someone who didn't attend college where I might've learned exactly what I'm about to say -- big grain of salt) is that it took me being forced by circumstance to become a firmware engineer for a time to actually understand not just how to multithread well, but what the hell this all even was.
And I guess I saw one too many Reddit comments (mistake #1) which downplayed the idea that there's any benefit to knowing the answer to such things.
That all said, this blog post / metaphor is definitely aimed at someone who's already been using threads effectively for some time -- I don't think it'd fare very well as a first introduction.
STM is pretty similar to how some databases implement transactions.
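To make that comparison concrete: both STM and optimistic database transactions follow a read/compute/validate/commit-or-retry loop. A toy single-variable version (real STM tracks whole read and write sets, and the validate-and-commit step would need to be atomic, e.g. a compare-and-swap; this sketch elides that):

```python
class TVar:
    """A toy transactional variable: a value plus a version counter,
    loosely analogous to an MVCC row version in a database."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def atomically(tvar, update):
    """Optimistic transaction: read, compute, then commit only if no
    other writer bumped the version in the meantime; otherwise retry.
    NOTE: in a real implementation the version check and the commit
    must happen atomically -- here they are plain Python statements."""
    while True:
        seen_version = tvar.version
        new_value = update(tvar.value)
        if tvar.version == seen_version:   # validation step
            tvar.value = new_value
            tvar.version += 1
            return new_value
        # conflict: someone committed first; re-run the transaction
```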
This sounds like hyperthreading – or, maybe I'm missing the screwdriver in the metaphor :)
EDIT: And it should have been SMT anyways!
From now on, I’ll simply send out a link to this. Thank you!
I worried that I'd come across as a high-horse dickhead in the mystification section, when that's the exact opposite of my intention and attitude here. I'm glad to hear that it resonated with someone else.
Although I guess we could both be high-horse dickheads.
I took a real-time operating systems course in university as an elective. One of the hardest courses I took the whole four years, but also one of the most interesting. Had a great professor, who gave really demanding, but very instructive, project-based assignments.
I need to find a toy project to play around with this domain again.
I'm curious how effective you feel this specific example might've been if it were delivered during your course. I suspect I've stumbled across a really helpful teaching tool, but having not gone to university, I don't actually know how this stuff is being taught :v
It's the sickest possible outcome for me, so: worth a lot!
https://youtu.be/sYSP_elDdZw
- Run the emulator one frame normally, using real polled input. Somehow snapshot the state.
- Then, run the emulator n frames (usually just one) with the same input. Present the video + audio from the last frame.
- Synchronize. (Some emulators can get very fancy; instead of just waiting for vsync, they'll also delay until the end of the window minus estimated processing time, to poll input at the last possible moment.)
- Roll back to the saved snapshot. (I believe you can also optimize if you know the inputs really didn't change, avoiding a lot of rollbacks at the cost of less predictable frame times.)
The main reason this even works is that most games have some of their own processing latency by design, so jumping a frame or two ahead usually doesn't have any noticeable side effects. It's a pretty cool idea, since modern computers with LCD screens have a lot more latency basically everywhere than older, simpler machines connected to CRTs.
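One iteration of the run-ahead loop from the steps above, sketched against a hypothetical minimal emulator interface (real implementations, e.g. RetroArch's run-ahead, wrap a libretro core and do the vsync timing this sketch elides):

```python
class Emu:
    """Hypothetical emulator: state is just a frame counter, and
    step() returns the A/V output it would present."""
    def __init__(self):
        self.frame = 0
    def step(self, inputs):
        self.frame += 1
        return f"video+audio for frame {self.frame} given {inputs!r}"
    def save(self):
        return self.frame
    def load(self, snapshot):
        self.frame = snapshot

def run_ahead(emu, inputs, n=1):
    emu.step(inputs)          # 1. run one real frame with polled input...
    snapshot = emu.save()     #    ...and snapshot the state
    output = None
    for _ in range(n):        # 2. run n frames ahead with the same input
        output = emu.step(inputs)
    # 3. present `output`, then synchronize with vsync (elided here)
    emu.load(snapshot)        # 4. roll back to the saved snapshot
    return output
```

After the call, the emulator is back at the frame it actually simulated, but the player was shown output from `n` frames in the future, which is where the perceived latency win comes from.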
Unfortunately, this sort of approach only works when your emulator's state is small enough and fast enough to restore.
I actually have been dying to experiment with designing an emulator to have fast incremental snapshots from the ground up, to see whether you could make this feasible for more modern consoles. You could, for example, track dirty memory pages with userfaultfd/MEM_WRITE_WATCH, and design structures like JIT caches to handle rewinding without dropping the entire cache. (I'm actually not sure whether all emulators clear their caches upon loading state.) More generally, I'd like to know how fast and small save states could be if you designed for that from the ground up.
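The dirty-page idea can be illustrated without the OS machinery. This is a software-level analogue only: the real mechanisms are userfaultfd (Linux) or GetWriteWatch/MEM_WRITE_WATCH (Windows), where the hardware MMU records the writes; here a write method records them instead:

```python
PAGE = 4096  # page size in bytes (illustrative)

class WatchedMemory:
    """Toy analogue of hardware write-watching: all writes go through
    write(), which records which pages were touched."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.dirty = set()
    def write(self, addr, data):
        self.mem[addr:addr + len(data)] = data
        first = addr // PAGE
        last = (addr + len(data) - 1) // PAGE
        self.dirty.update(range(first, last + 1))

def incremental_snapshot(wm):
    """Copy only the pages written since the last snapshot."""
    delta = {p: bytes(wm.mem[p * PAGE:(p + 1) * PAGE]) for p in sorted(wm.dirty)}
    wm.dirty.clear()
    return delta

def restore(wm, delta):
    """Roll the watched pages back to a previous snapshot."""
    for p, data in delta.items():
        wm.mem[p * PAGE:(p + 1) * PAGE] = data
```

A snapshot's size is now proportional to how much memory changed since the last one, rather than to total RAM, which is the property that might make rollback cheap enough for bigger consoles.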
https://archive.ph/msHkO
I think the solution requires that the game code is what's executing threading syscalls, not the Lua around the game code.
It'd definitely be a super sick evolution to add true shared in-game memory.
The way I've been looking at it, if I want to take this further, processes would maybe be _NES ROMs_, and each process can have multiple threads. Swapping between processes would inherently be more expensive than between threads within a process, which I think would be a really instructive aspect. Also, the whole idea of "sharing memory" between entirely different games is obviously ridiculous, which could be yet another teaching tool.
Then if we want to expand it even further, maybe multicore support would look like multiple emulator instances, connected via sockets to support cross-core IPC. You'd probably want to institute an "all the threads of a process have to be on the same core" rule (so they can locally share primitives in the same Lua script), which would be a great way to demonstrate the principle of building systems which adapt to their realistic constraints, as opposed to building systems which exactly model a textbook.
I've gotten ahead of myself, though.
For a game like Mario, you can split up the memory and decide what you want to be shared and what you want to be per-thread. E.g. starting small, you can rig up just a few fixed variables like scores or whatever to be persisted from one game to the next. Trying to push that further without causing endless corruption should be fun :)
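One way to sketch that: keep the full-save-state context switch, but patch a designated "shared" byte range forward from the outgoing thread, so those variables behave like shared memory while everything else stays per-thread. The address range below is purely hypothetical, not SMB's actual score location:

```python
# Hypothetical shared region within the NES's 2 KiB work RAM.
SHARED = slice(0x100, 0x120)

def context_switch(outgoing_ram, incoming_ram):
    """Restore the incoming thread's full save state, then carry the
    shared region over from the outgoing thread, so bytes in SHARED
    persist across threads while the rest is private per-thread."""
    restored = bytearray(incoming_ram)        # full save-state restore
    restored[SHARED] = outgoing_ram[SHARED]   # shared bytes survive the switch
    return restored
```

Corruption is then just a matter of picking a range that the game also uses for something you didn't intend, which is arguably the lesson.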
I struggled with how to convey this; the ultimate goal is to viscerally demonstrate what a thread IS, so I'm comfortable with it being... something strange... but I do want it to resemble actual concurrent systems to a useful degree.
Like, the thread scheduler code is userland code -- similar to, say, FreeRTOS -- but the "threads" themselves aren't threads in an exact sense, since their context packages are entire save states for a console, not just its processor.
Also, the game code (the true userland?) has no ability to access the threading API whatsoever, which really harms the analogy.
So I went with RTOS because it's a preemptive scheduler with synchronous reschedule-on-block, where the scheduler is implemented in the same codebase as the code using it. But to be honest, nothing I know of fits exactly.
But congrats. This is worth being included in a uni course.
One would have to get really clever...