Great article. But too short. I was just getting excited about it and it ended. I look forward to reading the other parts.
Animats · 18h ago
Tune in next week for the next exciting episode, where we will see a command taken off the queue and executed in the GPU!
The abstraction level discussed here is just where data gets passed across the user/kernel boundary. It's mostly queue and buffer management, which is why there are so few operations. The real action happens as queued commands are executed.
There's another stream of command completions coming back from the GPU. Looking forward to seeing how that works. All this asynchrony is mostly not the driver's problem. That's kicked up to the user code level, as the driver delivers completions.
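A minimal sketch of that two-queue shape, with a plain thread standing in for the GPU. This is not the driver's code: `Command`, `Completion`, `spawn_fake_gpu` and `run_batch` are hypothetical names, and the "work" is just doubling a number. It only illustrates that submission and completion are separate streams and that the submitter does not block per command.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical command: an id plus an opaque payload.
#[derive(Debug)]
pub struct Command {
    pub id: u64,
    pub payload: u32,
}

/// Hypothetical completion record reported back by the "device".
#[derive(Debug, PartialEq, Eq)]
pub struct Completion {
    pub id: u64,
    pub result: u32,
}

/// Spawns a worker thread standing in for the GPU. It drains the command
/// queue and posts one completion per command on a separate queue.
pub fn spawn_fake_gpu() -> (mpsc::Sender<Command>, mpsc::Receiver<Completion>) {
    let (cmd_tx, cmd_rx) = mpsc::channel::<Command>();
    let (done_tx, done_rx) = mpsc::channel::<Completion>();
    thread::spawn(move || {
        // The loop ends once every command sender has been dropped.
        for cmd in cmd_rx {
            let result = cmd.payload * 2; // stand-in for real work
            if done_tx.send(Completion { id: cmd.id, result }).is_err() {
                break; // nobody is listening for completions any more
            }
        }
    });
    (cmd_tx, done_rx)
}

/// Submits a whole batch without waiting, then collects completions.
pub fn run_batch(payloads: &[u32]) -> Vec<Completion> {
    let (cmd_tx, done_rx) = spawn_fake_gpu();
    for (i, &payload) in payloads.iter().enumerate() {
        cmd_tx
            .send(Command { id: i as u64, payload })
            .expect("worker thread should still be running");
    }
    drop(cmd_tx); // close the queue so the worker exits after draining it
    done_rx.iter().collect()
}
```

Calling `run_batch(&[1, 2, 3])` queues three commands up front and only then reads the completion stream. Deciding what to do with each completion is left to the caller, which is the part that gets kicked up to user code.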
Muromec · 21h ago
Oh, that's cool. I use one of the rk3588 things with panfrost as a desktop and it sometimes bugs out with black or transparent patches in firefox. Weird thing.
rjsw · 19h ago
The RK3588 uses the panthor driver that is the subject of the article, not panfrost.
skavi · 18h ago
Curious as to whether uring_cmd was considered instead of ioctls since this looks green field. Would the benefits have been negligible to nonexistent? If so, why?
kimixa · 14h ago
As GPUs are already asynchronous devices with their own command queue, and the ioctls generally just abstract a relatively cheap write into that command queue, I suspect there's limited utility in making another asynchronous command queue on the CPU to schedule those writes.
Unless you mean to make the GPU command queue itself the uring and map that into userspace, but that would likely require significant firmware changes to support the specifics of the io_uring API, if even possible at all due to hardware specifics.
rjsw · 18h ago
The driver described in the article uses the API that the userspace Mesa libraries expect.
skavi · 15h ago
ah thanks for the clarification. should have read more carefully.
taminka · 19h ago
very interesting, is there a second part to this? or logical continuation...
steveklabnik · 19h ago
It came out today, so I am assuming more will come later.
TZubiri · 20h ago
I know that "Rust GPU driver" in the title gets you more clicks than "Arm Mali CSF Based GPU Driver". But isn't this an Arm Mali CSF-based GPU driver?
I hate focusing on the metatools (tools for building tools). It really sounds like the objective here was to build something in Rust. In the article it is even described as "a gpu driver kernel supporting arm mali.." instead of just an arm mali driver
It is a misunderstanding of what the job of writing a driver is, you are connecting some wires between the OS api and the manufacturer api, you are not to build a framework that adds an additional layer of abstraction, sorry to put it so bluntly, but you are not that guy.
Sorry for being rough.
dralley · 20h ago
It's somewhat relevant given that this is one of the first Rust-based GPU drivers for Linux.
GeekyBear · 20h ago
The Asahi Linux team has previously blogged pretty extensively about developing the GPU driver for the Apple M series SOCs in Rust.
I'm not sorry for being rough, you sound like someone who has no idea what a modern GPU driver is like. I haven't written any in about 15 years, and I know it's only gotten worse since then.
Go look in the Linux kernel source code -- GPU drivers are, by lines of code, the single biggest component. Also, lots of drivers support multiple cards. Do you think it would be sensible to have a separate driver, completely independent, for every single GPU card?
GPU drivers aren't about "connecting some wires" between two APIs, because those two APIs turn out to be quite different.
Of course, feel free to prove me wrong. Show me a GPU driver you've written, go link some wires together.
Cieric · 16h ago
While I won't endorse what the GP said, I wouldn't say that it's only gotten worse. I work for a modern GPU company (you can probably figure out which one from my comment history) on one of the modern APIs, and they much more closely represent what the GPU does. It's not like how OpenGL used to be, as the GPUs hold much less state for you than they used to. However, with the new features being added now, it is starting to drift apart again and once again become more complex.
CJefferson · 16h ago
That's interesting to know! I keep meaning to try digging into the AMD stuff (mainly as it seems like the more open source one), but need to find the time to deep dive!
Cieric · 14h ago
Yeah, we also have a gaming and a developer discord where I hang around. So feel free to join and ask questions there.
Animats · 18h ago
> it's only gotten worse since then.
It's worse all the way up. Modern GPUs support a huge amount of asynchronous operations. Applications put commands on queues, and completions come back later. The driver and Vulkan mostly pass those completions upward, until they reach the renderer, which has to figure out what it's allowed to do next.
How well that's done has a huge impact on performance.
(See my previous grumbling about the Rust renderer performance situation. All the great things Vulkan can do for performance are thrown away, because the easy way to do this doesn't scale.)
shmerl · 11h ago
Why would Rust rendering be worse than any other rendering? Rust claims to be well suited for handling parallelism.
MindSpunk · 10h ago
It is fantastic for CPU parallelism. The problem is the CPU/GPU boundary is difficult to deal with and exposing an API that is both fast and safe and flexible is almost impossible.
I don't believe it's possible to make an efficient API at a similar level of abstraction to Vulkan or D3D12 that is safe (as in, not marked unsafe in rust). To do so requires recreating all the complexity of D3D11 and OpenGL style APIs to handle resource access synchronization.
The value proposition of D3D12 and Vulkan is that the guardrails are gone and it's up to the user to do the synchronization work themselves. The advantage is that you can make the synchronization decisions at a higher level of abstraction where more assumptions can be made and enforced by a higher-level API. Generally this is more efficient because you can use much simpler algorithms to decide when to emit your barriers, rather than having the driver reverse engineer that high-level knowledge from the low-level command stream.
Rust is just not capable of representing the complex interwoven ownership and synchronization rules for using these APIs without mountains of runtime checks that suck away all the benefit of using these APIs. Lots of Vulkan map quite well to Rust's ownership rules, the memory allocation API surface maps very well. But anything that's happening on the GPU timeline is pretty much impossible to do safely. Rust's type system is not sufficiently capable of modeling this stuff without tons of runtime checks, or making the API so awful to use nobody will bother.
I've seen GP around a lot and afaik they're using WGPU which is, among other things, Firefox's WebGPU implementation. The abstraction that WebGPU provides is entirely the wrong level to most efficiently use Vulkan and D3D12 style APIs. WebGPU must be safe because it's meant to get exposed to JS in a browser, so it spends a boat load of CPU time to do all the runtime checks and work out the synchronization requirements.
Rust can be more challenging here because if you want a safe API you have to be very careful in where you set the boundary between the unsafe internals and the safe API. And Rust's safety rails will be of limited use for the real difficult parts. I'm writing my own abstraction over Vulkan/D3D12/Metal and I've intentionally decided not to make my API safe and to leave it to a higher layer to construct a safe API.
simonask · 4h ago
I'm currently writing a Vulkan renderer in Rust, and I decided against wgpu for this reason - its synchronization story is too blunt. But I don't necessarily agree that this style of programming is very much at odds with Rust's safety model, which is fundamentally an API design tool.
The key insight with Rust is to not try to use borrowing semantics unless the model actually matches, which it doesn't for GPU resources and command submission.
I'm modeling things using render graphs. Nodes in the graph declare what resources they use and how, such that pipeline barriers can be inserted between nodes. Resources may be owned by the render graph itself ("transient"), or externally by an asset system.
Barriers for transient resources can be statically computed when the render graph is built (no per-frame overhead, and often barriers can be elided completely). Barriers for shared resources (assets) must be computed based on some runtime state at submission time that indicates the GPU-side state of each resource (queue ownership etc.), and I don't see how any renderer that supports mutable assets or asset streaming can avoid that.
I don't think there's anything special about Rust here. Any high-level rendering API must decide on some convenient semantics, and map those to Vulkan API semantics. Nothing in Rust forces you to choose Rust's own borrowing model as those semantics, and consequently does not force you to do any more runtime validation than you would anywhere else.
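A minimal sketch of the statically computed part, assuming passes run in declaration order and collapsing access types to just read and write. `Pass`, `Access`, `Barrier` and `compute_barriers` are hypothetical names, not taken from any real renderer, and a real implementation would also track pipeline stages, image layouts and queue ownership.

```rust
use std::collections::HashMap;

/// How a pass touches a resource (deliberately coarse; real APIs
/// distinguish many more access types and image layouts).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Access {
    Read,
    Write,
}

/// A graph node: a name plus the resources it declares up front.
pub struct Pass {
    pub name: &'static str,
    pub uses: Vec<(&'static str, Access)>,
}

/// A barrier to record before pass number `before_pass` for `resource`.
#[derive(Debug, PartialEq, Eq)]
pub struct Barrier {
    pub before_pass: usize,
    pub resource: &'static str,
    pub from: Access,
    pub to: Access,
}

/// Walks passes in order and emits a barrier whenever a resource is used
/// after a write, or written after a read. Read-after-read is elided.
pub fn compute_barriers(passes: &[Pass]) -> Vec<Barrier> {
    // Last declared access per resource, keyed by resource name.
    let mut last: HashMap<&'static str, Access> = HashMap::new();
    let mut barriers = Vec::new();
    for (index, pass) in passes.iter().enumerate() {
        for &(resource, access) in &pass.uses {
            if let Some(&prev) = last.get(resource) {
                if prev == Access::Write || access == Access::Write {
                    barriers.push(Barrier {
                        before_pass: index,
                        resource,
                        from: prev,
                        to: access,
                    });
                }
            }
            last.insert(resource, access);
        }
    }
    barriers
}
```

Because the result depends only on the declared usage, it can be computed once when the graph is built and replayed every frame. None of it leans on borrowing: the graph declaration, not the borrow checker, is the source of truth for synchronization.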
exDM69 · 5h ago
> Lots of Vulkan map quite well to Rust's ownership rules, the memory allocation API surface maps very well. But anything that's happening on the GPU timeline is pretty much impossible to do safely.
I agree with this, having been dabbling with Vulkan and Rust for a few years now. Destructors and ownership can make a pretty ergonomic interface to the cpu side of gpu programming. It's "safe" as long as you don't screw up your gpu synchronization which is not perfect but it's an improvement over "raw" graphic api calls (with little to no overhead).
As for the GPU timeline, I've been experimenting with timeline semaphores. E.g. all the images (and image views) in descriptor set D must be live as long as semaphore S has value less than X. This coupled with some kind of deletion queue could accurately track lifetimes of resources on the GPU timeline.
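A rough sketch of such a deletion queue, with no Vulkan calls in it. `DeletionQueue` and its methods are hypothetical names, and `T` stands in for whatever owns the image, view or buffer. The `completed` value is assumed to come from reading the timeline semaphore's counter (for example via `vkGetSemaphoreCounterValue`), which the sketch leaves to the caller.

```rust
use std::collections::VecDeque;

/// Holds resources until the GPU timeline has reached the value at which
/// they were last used. `T` stands in for an image, view, buffer, etc.
pub struct DeletionQueue<T> {
    // Entries are pushed with non-decreasing retire values, so the front
    // of the queue is always the next entry to become free.
    pending: VecDeque<(u64, T)>,
}

impl<T> DeletionQueue<T> {
    pub fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// Defers dropping `resource` until the semaphore reaches `retire_at`.
    pub fn retire(&mut self, retire_at: u64, resource: T) {
        debug_assert!(self
            .pending
            .back()
            .map_or(true, |(value, _)| *value <= retire_at));
        self.pending.push_back((retire_at, resource));
    }

    /// Drops every entry whose retire value is <= `completed` and returns
    /// how many resources were released.
    pub fn collect(&mut self, completed: u64) -> usize {
        let mut freed = 0;
        while let Some((value, _)) = self.pending.front() {
            if *value > completed {
                break;
            }
            self.pending.pop_front(); // dropping the entry runs T's destructor
            freed += 1;
        }
        freed
    }

    /// Number of resources still waiting on the GPU.
    pub fn len(&self) -> usize {
        self.pending.len()
    }
}
```

The idea is to call `collect` once per frame with the latest counter value, so destructors only run for resources the GPU can no longer be reading. The single ordered queue assumes one timeline; several queues or semaphores would need one such structure each, or a different data structure.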
On the other hand, basic applications and "small world" game engines have a simpler way out. Most resources have a pre-defined lifetime, either it lives as long as the application, or the "loaded level" or the current frame. You might even use Rust lifetimes to track this (but I don't). This model is not applicable when streaming textures and geometry in and out of the GPU.
What I would really like to experiment with is using async Rust for GPU programming. Instead of using `epoll/kqueue/WaitForMultipleObjects` in the async runtime for switching between "green threads" the runtime could do `vkWaitSemaphores(VK_SEMAPHORE_WAIT_ANY_BIT)` (sadly this function does not return which semaphore(s) were signaled). Each green thread would need its own semaphore, command pools, etc.
Unfortunately this would be a 6-12 month research project and I don't have that much free time at hand. It would also be quite an alien model for most graphics programmers so I don't think it would catch on. But it would be a fun research experiment to try.
Animats · 9h ago
> The abstraction that WebGPU provides is entirely the wrong level to most efficiently use Vulkan and D3D12 style APIs.
I agree with this, although the WGPU people disagree.
There could be a Vulkan API for "modern Vulkan" - bindless only, dynamic rendering only, asset loading on a transfer queue only, multithreaded transfers. That would simplify things and potentially improve performance. But it would break code that's already running and would not work on some mobile devices.
We'll probably get that in the 2027-2030 period, as WebGPU devices catch up.
WGPU suffers from having to support the feature set supported by all its back ends - DX12, Vulkan, Metal, WebGPU, and even OpenGL. It's amazing that it works, but a price was paid.
shmerl · 9h ago
I indeed wouldn't expect a safe approach here to be necessarily efficient. But you aren't forced to make everything safe even if it's nicer. I've seen some Vulkan Rust wrappers before which tried to do that, but as you say it comes at some cost.
So I'd guess you can always use raw Vulkan bindings and deal with related unsafety and leave some areas that aren't tied to synchronization for safer logic.
Dealing with hardware in general is unsafe, and GPUs are so complex that it's sort of expected.
UK-AL · 20h ago
Rust is important here because it's one of the first (if not the first) to use the Rust infrastructure for a GPU.
monocasa · 19h ago
The Asahi folks were probably first in this regard.
It's also an informative read.
> Paving the Road to Vulkan on Asahi Linux
https://asahilinux.org/2023/03/road-to-vulkan/