A toy RTOS inside Super Mario Bros. using emulator save states
180 notorious_pgb 46 5/28/2025, 8:15:43 PM prettygoodblog.com ↗
This started as a throwaway metaphor in a blog post, but is now fully runnable: a toy RTOS with preemptive multitasking inside of Super Mario Bros. on the NES.
Essentially, this is:
- A rudimentary preemptive RTOS
- Using an unmodified NES emulator (FCEUX) as the CPU
- "Unmodified" depending on how you define terms
- With emulator save states as the thread contexts
- With support for (very basic) mutexes, interrupt masking, and condition variables
- Demonstrated using Super Mario Bros. 1-1 with sections of the map dedicated to various synchronization primitives
There are many simplifications and shortcuts taken (doesn't even have task priorities), and it doesn't map 1:1 to true multithreading (e.g., emulator save states represent the state of the entire machine including RAM, whereas thread contexts represent a much more minimal slice), but I think it's A) pretty interesting and B) a unique visceral explanation of threads.
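The core mechanism described above (save states as thread contexts, round-robin preemption) can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical emulator API; the real project drives FCEUX through its Lua bindings, not this interface:

```python
from collections import deque

class Emulator:
    """Stand-in for a real emulator (hypothetical interface). The entire
    machine state is collapsed into one number for illustration."""
    def __init__(self):
        self.state = 0
    def run_frames(self, n):
        self.state += n
    def save_state(self):
        return self.state
    def load_state(self, snapshot):
        self.state = snapshot

def schedule(emu, thread_states, quantum, ticks):
    """Round-robin preemption where each 'thread context' is a full
    save state: restore, run one time slice, snapshot, move on."""
    ready = deque(thread_states)
    for _ in range(ticks):
        current = ready.popleft()
        emu.load_state(current)          # context switch in: restore the machine
        emu.run_frames(quantum)          # let this thread run for its quantum
        ready.append(emu.save_state())   # context switch out: snapshot everything
    return list(ready)
```

With two threads starting from the same save state and a quantum of one frame, four scheduler ticks give each thread two frames of progress, which is exactly the "the snapshot IS the thread" idea.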
From your previous post
> The first thing to understand about threads is that processors do not support them.
Yes, but I would say that by changing the emulator in the outside world (augmenting the emulator software with your other software, sure) rather than doing something in the inside world, you are implementing something closer to hardware concurrency support, or to processes.
> It doesn't map 1:1 to true multithreading (e.g., emulator save states represent the state of the entire machine including RAM, whereas thread contexts represent a much more minimal slice)
Right, but I think if we're going to make a threads vs process distinction (vs concurrency in general) whether we have only a small opt-in set of communication channels / synchronization primitives (the mutex and barrier), or a wide open one (shared memory, be careful!) is the operative distinction.
The metaphor I usually go for when it comes to threads is having several widgets that need to be assembled. To begin working on Widget B, you have to pause the work on Widget A. Adding SMP is like adding a second person to help assemble widgets, but you're still using the same toolbox, so if you both need the same screwdriver, there's no performance benefit to having a second person. Multicore is having multiple workbenches each with their own toolbox so they can operate completely in parallel.
A mutex is like having a specialized tool that can only be used by one widget assembly at a time. Like, each workbench might have a full set of screwdrivers, hammers, wrenches, sockets, etc., but you only have 1 welder.
A semaphore is a tool that can be used by a limited number of widgets, like an oven that can fit up to 4 widgets.
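The welder/oven metaphor maps directly onto `threading.Lock` and `threading.Semaphore`. A small sketch (the widget counts and sleep times are arbitrary, just there to make the contention visible):

```python
import threading
import time

welder = threading.Lock()        # mutex: only one bench can use the welder at a time
oven = threading.Semaphore(4)    # semaphore: the oven fits up to 4 widgets

in_oven = 0
peak_in_oven = 0
counter_lock = threading.Lock()  # protects the two counters above

def assemble(widget):
    global in_oven, peak_in_oven
    with welder:                 # exclusive access to the single welder
        time.sleep(0.001)        # weld
    with oven:                   # blocks while 4 widgets are already baking
        with counter_lock:
            in_oven += 1
            peak_in_oven = max(peak_in_oven, in_oven)
        time.sleep(0.01)         # bake
        with counter_lock:
            in_oven -= 1

workers = [threading.Thread(target=assemble, args=(i,)) for i in range(10)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

However the ten workers interleave, `peak_in_oven` can never exceed 4, which is the whole point of the oven being a semaphore rather than a mutex.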
What drove me to write this post (as someone who didn't attend college where I might've learned exactly what I'm about to say -- big grain of salt) is that it took me being forced by circumstance to become a firmware engineer for a time to actually understand not just how to multithread well, but what the hell this all even was.
And I guess I saw one too many Reddit comments (mistake #1) which downplayed the idea that there's any benefit to knowing the answer to such things.
That all said, this blog post / metaphor is definitely aimed at someone who's already been using threads effectively for some time -- I don't think it'd fare very well as a first introduction.
STM is pretty similar to how some databases implement transactions.
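To make that comparison concrete: both STM and optimistic database transactions follow a read/compute/validate/commit-or-retry loop. A toy single-variable version (real STM tracks whole read and write sets, and the validate-and-commit step would need to be atomic, e.g. a compare-and-swap; this sketch elides that):

```python
class TVar:
    """A toy transactional variable: a value plus a version counter,
    loosely analogous to an MVCC row version in a database."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def atomically(tvar, update):
    """Optimistic transaction: read, compute, then commit only if no
    other writer bumped the version in the meantime; otherwise retry.
    NOTE: in a real implementation the version check and the commit
    must happen atomically -- here they are plain Python statements."""
    while True:
        seen_version = tvar.version
        new_value = update(tvar.value)
        if tvar.version == seen_version:   # validation step
            tvar.value = new_value
            tvar.version += 1
            return new_value
        # conflict: someone committed first; re-run the transaction
```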
This sounds like hyperthreading – or, maybe I'm missing the screwdriver in the metaphor :)
EDIT: And it should have been SMT anyways!
From now on, I’ll simply send out a link to this. Thank you!
I worried that I'd come across as a high-horse dickhead in the mystification section, when that's the exact opposite of my intention and attitude here. I'm glad to hear that it resonated with someone else.
Although I guess we could both be high-horse dickheads.
I took a real-time operating systems course in university as an elective. One of the hardest courses I took the whole four years, but also one of the most interesting. Had a great professor, who gave really demanding, but very instructive, project-based assignments.
I need to find a toy project to play around with this domain again.
I'm curious how effective you feel this specific example might've been if it were delivered during your course. I suspect I've stumbled across a really helpful teaching tool, but having not gone to university, I don't actually know how this stuff is being taught :v
It's the sickest possible outcome for me, so: worth a lot!
https://youtu.be/sYSP_elDdZw
- Run the emulator one frame normally, using real polled input. Somehow snapshot the state.
- Then, run the emulator n frames (usually just one) with the same input. Present the video + audio from the last frame.
- Synchronize. (Some emulators can get very fancy; instead of just waiting for vsync, they'll also delay until the end of the window minus estimated processing time, to poll input at the last possible moment.)
- Roll back to the saved snapshot. (I believe you can also optimize if you know the inputs really didn't change, avoiding a lot of rollbacks at the cost of less predictable frame times.)
The main reason this even works is that most games have some of their own processing latency by design, so jumping a frame or two ahead usually doesn't have any noticeable side effects. It's a pretty cool idea, since modern computers with LCD screens have a lot more latency basically everywhere than older, simpler machines connected to CRTs.
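One iteration of the run-ahead loop from the steps above, sketched against a hypothetical minimal emulator interface (real implementations, e.g. RetroArch's run-ahead, wrap a libretro core and do the vsync timing this sketch elides):

```python
class Emu:
    """Hypothetical emulator: state is just a frame counter, and
    step() returns the A/V output it would present."""
    def __init__(self):
        self.frame = 0
    def step(self, inputs):
        self.frame += 1
        return f"video+audio for frame {self.frame} given {inputs!r}"
    def save(self):
        return self.frame
    def load(self, snapshot):
        self.frame = snapshot

def run_ahead(emu, inputs, n=1):
    emu.step(inputs)          # 1. run one real frame with polled input...
    snapshot = emu.save()     #    ...and snapshot the state
    output = None
    for _ in range(n):        # 2. run n frames ahead with the same input
        output = emu.step(inputs)
    # 3. present `output`, then synchronize with vsync (elided here)
    emu.load(snapshot)        # 4. roll back to the saved snapshot
    return output
```

After the call, the emulator is back at the frame it actually simulated, but the player was shown output from `n` frames in the future, which is where the perceived latency win comes from.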
Unfortunately, this sort of approach only works when your emulator's state is small enough and fast enough to restore.
I actually have been dying to experiment with designing an emulator to have fast incremental snapshots from the ground up, to see whether you could make this feasible for more modern consoles. You could, for example, track dirty memory pages with userfaultfd/MEM_WRITE_WATCH, and design structures like JIT caches to handle rewinding without dropping the entire cache. (I'm actually not sure whether all emulators clear their caches upon loading state.) More generally, I'd like to know how fast and small save states could be if you designed for that from the ground up.
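The dirty-page idea can be illustrated without the OS machinery. This is a software-level analogue only: the real mechanisms are userfaultfd (Linux) or GetWriteWatch/MEM_WRITE_WATCH (Windows), where the hardware MMU records the writes; here a write method records them instead:

```python
PAGE = 4096  # page size in bytes (illustrative)

class WatchedMemory:
    """Toy analogue of hardware write-watching: all writes go through
    write(), which records which pages were touched."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.dirty = set()
    def write(self, addr, data):
        self.mem[addr:addr + len(data)] = data
        first = addr // PAGE
        last = (addr + len(data) - 1) // PAGE
        self.dirty.update(range(first, last + 1))

def incremental_snapshot(wm):
    """Copy only the pages written since the last snapshot."""
    delta = {p: bytes(wm.mem[p * PAGE:(p + 1) * PAGE]) for p in sorted(wm.dirty)}
    wm.dirty.clear()
    return delta

def restore(wm, delta):
    """Roll the watched pages back to a previous snapshot."""
    for p, data in delta.items():
        wm.mem[p * PAGE:(p + 1) * PAGE] = data
```

A snapshot's size is now proportional to how much memory changed since the last one, rather than to total RAM, which is the property that might make rollback cheap enough for bigger consoles.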
https://archive.ph/msHkO
I think the solution requires that the game code is what's executing threading syscalls, not the Lua around the game code.
It'd definitely be a super sick evolution to add true shared in-game memory.
The way I've been looking at it, if I want to take this further, processes would maybe be _NES ROMs_, and each process can have multiple threads. Swapping between processes would inherently be more expensive than between threads within a process, which I think would be a really instructive aspect. Also, the whole idea of "sharing memory" between entirely different games is obviously ridiculous, which could be yet another teaching tool.
Then if we want to expand it even further, maybe multicore support would look like multiple emulator instances, connected via sockets to support cross-core IPC. You'd probably want to institute an "all the threads of a process have to be on the same core" rule (so they can locally share primitives in the same Lua script), which would be a great way to demonstrate the principle of building systems which adapt to their realistic constraints, as opposed to building systems which exactly model a textbook.
I've gotten ahead of myself, though.
For a game like Mario, you can split up the memory and decide what you want to be shared and what you want to be per-thread. E.g. starting small, you can rig up just a few fixed variables like scores or whatever to be persisted from one game to the next. Trying to push that further without causing endless corruption should be fun :)
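One way to sketch that: keep the full-save-state context switch, but patch a designated "shared" byte range forward from the outgoing thread, so those variables behave like shared memory while everything else stays per-thread. The address range below is purely hypothetical, not SMB's actual score location:

```python
# Hypothetical shared region within the NES's 2 KiB work RAM.
SHARED = slice(0x100, 0x120)

def context_switch(outgoing_ram, incoming_ram):
    """Restore the incoming thread's full save state, then carry the
    shared region over from the outgoing thread, so bytes in SHARED
    persist across threads while the rest is private per-thread."""
    restored = bytearray(incoming_ram)        # full save-state restore
    restored[SHARED] = outgoing_ram[SHARED]   # shared bytes survive the switch
    return restored
```

Corruption is then just a matter of picking a range that the game also uses for something you didn't intend, which is arguably the lesson.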
I struggled with how to convey this; the ultimate goal is to viscerally demonstrate what a thread IS, so I'm comfortable with it being... something strange... but I do want it to resemble actual concurrent systems to a useful degree.
Like, the thread scheduler code is userland code -- similar to, say, FreeRTOS -- but the "threads" themselves aren't threads in an exact sense, since their context packages are entire save states for a console, not just its processor.
Also, the game code (the true userland?) has no ability to access the threading API whatsoever, which really harms the analogy.
So I went with RTOS because it's a preemptive scheduler with synchronous reschedule-on-block, where the scheduler is implemented in the same codebase as the code using it. But to be honest, nothing I know of fits exactly.
But congrats. This is worth being included in a uni course.
One would have to get really clever...