Io_uring, kTLS and Rust for zero syscall HTTPS server

247 guntars 40 8/22/2025, 3:51:44 AM blog.habets.se ↗

Comments (40)

Seattle3503 · 5h ago
> For example when submitting a write operation, the memory location of those bytes must not be deallocated or overwritten.

> The io-uring crate doesn’t help much with this. The API doesn’t allow the borrow checker to protect you at compile time, and I don’t see it doing any runtime checks either.

I've seen comments like this before[1], and I get the impression that building a a safe async Rust library around io_uring is actually quite difficult. Which is sort of a bummer.

IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".

[1] https://boats.gitlab.io/blog/post/io-uring/

newpavlov · 46m ago
This actually one of my many gripes about Rust async and why I consider it a bad addition to the language in the long term. The fundamental problem is that rust async was developed when epoll was dominant (and almost no one in the Rust circles cared about IOCP) and it has heavily influenced the async design (sometimes indirectly through other languages).

Think about it for a second. Why do we not have this problem with "synchronous" syscalls? When you call `read` you also "pass mutable borrow" of the buffer to the kernel, but it maps well into the Rust ownership/borrow model since the syscall blocks execution of the thread and there are no ways to prevent it in user code. With poll-based async model you side-step this issues since you use the same "sync" syscalls, but which are guaranteed to return without blocking.

For a completion-based IO to work properly with the ownership/borrow model we have to guarantee that the task code will not continue execution until it receives a completion event. You simply can not do it with state machines polled in user code. But the threading model fits here perfectly! If we are to replace threads with "green" threads, user Rust code will look indistinguishable from "synchronous" code. And no, the green threads model can work properly on embedded systems as demonstrated by many RTOSes.

There are several ways of how we could've done it without making the async runtime mandatory for all targets (the main reason why green threads were removed from Rust 1.0). My personal favorite is introduction of separate "async" targets.

Unfortunately, the Rust language developers made a bet on the unproved polling stackless model because of the promised efficiency and we are in the process of finding out whether the bet plays of or not.

jcranmer · 5h ago
There is, I think, an ownership model that Rust's borrow checker very poorly supports, and for lack of a better name, I've called it hot potato ownership. The basic idea is that you have a buffer which you can give out as ownership in the expectation that the person you gave it to will (eventually) give it back to you. It's a sort of non-lexical borrowing problem, and I very quickly discovered when trying to implement it myself in purely safe Rust that the "giving the buffer back" is just really gnarly to write.
pornel · 2h ago
This can be done with exclusively owned objects. That's how io_uring abstractions work in Rust – you give your (heap allocated) buffer to a buffer pool, and get it back when the operation is done.

&mut references are exclusive and non-copyable, so the hot potato approach can even be used within their scope.

But the problem in Rust is that threads can unwind/exit at any time, invalidating buffers living on the stack, and io_uring may use the buffer for longer than the thread lives.

The borrow checker only checks what code is doing, but doesn't have power to alter runtime behavior (it's not a GC after all), so it only can prevent io_uring abstractions from getting any on-stack buffers, but has no power to prevent threads from unwinding to make on-stack buffer safe instead.

stouset · 5h ago
Maybe I’m misunderstanding, but why is that not possible with a

    Fn(_: T) -> T
iknowstuff · 4h ago
dwattttt · 4h ago
As sibling notes, it is. It's very rarely seen though.

One place you might see something like it is if an API takes ownership, but returns it on error; you see the error side carry the resource you gave it, so you could try again.

IshKebab · 1h ago
How is that different to

  Fn(_: &mut T)

?
tayo42 · 4h ago
Refcel didn't work? Or rc?
rfoo · 3h ago
Slapping Rc<T> over something that could be clearly uniquely owned is a sign of very poorly designed lifetime rules / system.

And yes, for now async Rust is full of unnecessary Arc<T> and is very poorly made.

zozbot234 · 1h ago
If the thread can be dropped while the buffer is "owned" by the kernel io-uring facilities (to be given back when the operation completes) that's not "unique" ownership. The existing Rc/Arc<T> may be overkill for that case, but something very much like it will still be needed.
JoshTriplett · 5h ago
I think the right way to build a safe interface around io_uring would be to use ring-owned buffers, ask the ring for a buffer when you want one, and give the buffer back to the ring when initiating a write.
pingiun · 5h ago
This is something that Amos Wenger (fasterthanlime) has worked on: https://github.com/bearcove/loona/blob/main/crates/buffet/RE...
Tuna-Fish · 2h ago
This works perfectly well, and allows using the type system to handle safety. But it also really limits how you handle memory, and makes it impossible to do things like filling out parts of existing objects, so a lot of people are reluctant to take the plunge.
ozgrakkurt · 3h ago
You don’t have to represent everything with borrows. You can just use data structures like Slab to make it cancel safe.

As an example this library I wrote before is cancel safe and doesn’t use lifetimes etc. for it.

https://github.com/steelcake/io2

ozgrakkurt · 4m ago
Just realised my code isn’t cancel safe either. It is invalid if the user just drops a read future and the buffer itself while the operation is in the kernel.

It is just a PITA to get it fully right.

Probably need the buffer to come from the async library so user allocates the buffers using the async library like a sibling comment says.

It is just much easier to not use Rust and say futures should run fully always and can’t be just dropped and make some actual progress. So I’m just doing it in zig now

bmcahren · 5h ago
This was a good read and great work. Can't wait to see the performance tests.

Your write up connected some early knowledge from when I was 11 where I was trying to set up a database/backend and was finding lots of cgi-bin online. I realize now those were spinning up new processes with each request https://en.wikipedia.org/wiki/Common_Gateway_Interface

I remember when sendfile became available for my large gaming forum with dozens of TB of demo downloads. That alone was huge for concurrency.

I thought I had swore off this type of engineering but between this, the Netflix case of extra 40ms and the GTA 5 70% load time reduction maybe there is a lot more impactful work to be done.

https://netflixtechblog.com/life-of-a-netflix-partner-engine...

https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

kev009 · 5h ago
It wasn't just CGI, every HTTP session was commonly a forked copy of the entire server in the CERN and Apache lineage! Apache gradually had better answers, but their API with common addons made it a bit difficult to transition so webservers like nginx took off which are built closer to the architecture in the article with event driven I/O from the beginning.
jabl · 11m ago
To nitpick at least as of Apache HTTPD 1.3 ages ago it wasn't forking for every request, but had a pool of already forked worker processes with each handling one connection at a time but could handle an unlimited number of connections sequentially, and it could spawn or kill worker processes depending on load.

The same model is possible in Apache httpd 2.x with the "prefork" mpm.

avar · 3h ago

    every HTTP session was commonly a forked
    copy of the entire server in the CERN
    and Apache lineage!
And there's nothing wrong with that for application workers. On *nix systems fork() is very fast, you can fork "the entire server" and the kernel will only COW your memory. As nginx etc. showed you can get better raw file serving performance with other models, but it's still a legitimate technique for application logic where business logic will drown out any process overhead.
tsimionescu · 2h ago
Forking for anything other than calling exec is still a horrible idea (with special exceptions like shells). Forking is a very unsafe operation (you can easily share locks and files with the child process unless both your code and every library you use is very careful - for example, it's easy to get into malloc deadlocks with forked processes), and its performance depends a lot on how you actually use it.
josephg · 2h ago
So long as you have something like nginx in front of your server. Otherwise your whole site can be taken down by a slowloris attack over a 33.6k modem.
Imustaskforhelp · 3h ago
Such a good read.

I am patient to wait for the benchmarks so take your time ,but I honestly love how the author doesn't care about benchmarks right now and wanted to clean the code first. Its kinda impressive that there are people who have such line of thinking in this world where benchmarks gets maxxed and whole project's sole existence is to satisfy benchmarks.

Really a breath of fresh air and honestly I admire the author so much for this. It was such a good read, loved it a lot thank you. Didn't know ktls existed or Io_uring could be used in such a way.

bullen · 59m ago
So far everything after epoll that I have compared with falls short.

So to reimplement my foundation (with all the bugs) will not be worth it.

I will however compare Javas NIO (epoll) with the new Virtual Threads IO (without pinning).

http://github.com/tinspin/rupy

sandeep-nambiar · 6h ago
This is really cool. I've been thinking about something similar for a long time and I'm glad someone has finally done it. GG!

I can recommend writing even the BPF side of things with rust using Aya[1].

[1] - https://github.com/aya-rs/aya

phrotoma · 1h ago
Anybody know what the state of kTLS is? I asked one of the Cilium devs about it a while ago'cause I'd seen Thomas Graf excitedly talking about it and he told me that kernel support in many distros was lacking so they aren't ready to enable it by default.
ValtteriL · 5h ago
Excellent read. I'd like to see DPDK style full kernel bypass next
spaintech · 4h ago
Not sure if you are aware of this, but LUNA does this already.

https://www.usenix.org/system/files/atc23-zhu-lingjun.pdf

6r17 · 5h ago
I really want to see the benchmarks on this ; tried it like 4 days ago and then built a standard epoll implementation ; I could not compete against nginx using uring but that's not the easiest task for an arrogant night so I really hope you get some deserved sweet numbers ; mine were a sad deception but I did not do most of your implementation - rather simply tried to "batch" calls. Wish you the best of luck and much fun
mgaunard · 3h ago
"zero syscall"

> In order to avoid busy looping, both the kernel and the web server will only busy-loop checking the queue for a little bit (configurable, but think milliseconds), and if there’s nothing new, the web server will do a syscall to “go to sleep” until something gets added to the queue.

thomashabets2 · 1h ago
Under load it's zero syscall (barring any rare allocations inside rustls for the handshake. I can't guarantee that it never does).

Without load the overhead of calling (effectively) sleep() is, while technically true, not relevant.

But sure, you can tweak the busyloop timers and burn 100% CPU on kernel and user side indefinitely if you want to avoid that sleep-when-idle syscall. It's just… not a good idea.

KolmogorovComp · 3h ago
It’s good to read an article until the end

> This means that a busy web server can serve all of its queries without even once (after setup is done) needing to do a syscall. As long as queues keep getting added to, strace will show nothing.

nly · 3h ago
Like all polling I/O models (that don't spin) it also means you have to wait milliseconds in the worst case to start servicing a request. That's a long time.

For comparison a read/write over a TCP socket on loopback between two process is a few microseconds using BSD sockets API.

klabb3 · 2h ago
> Like all polling I/O models (that don't spin) it also means you have to wait milliseconds in the worst case to start servicing a request. That's a long time.

No? What they're saying is the busy loop will spin until an event occurs, for at most x ms. And if it does park the thread (the only syscall required), it can be immediately woken up on the first event too. Only if multiple events occurred since the last call would you receive them together. This normally happens only under high load, when event processing takes enough time to have a buildup of new events in the background. Increased latency is the intended outcome on high loads.

To be fair, it was a while ago I read the io-uring paper. But I distinctly recall the mix of poll and park behavior, plus configurable wait conditions. Please correct me if I'm wrong (someone here certainly knows).

boredatoms · 6h ago
Whats the goto instead of strace, if you wanted to see what was going on?
fuy · 52m ago
perf and look at stack traces (or off-cpu events for waits/locks). also, ebpf
abrookewood · 5h ago
I think you have to use eBPF-based tools
LAC-Tech · 1h ago
I think rusts glacial compile times prevent it from being a useful platform for web apps. Yes it's a nice language, and very performant, but it's horrible devex to have to wait seconds for your server to recompile after a change.
maeln · 1h ago
> but it's horrible devex to have to wait seconds for your server to recompile after a change.

What a time to be alived that seconds to recompile is consider horrible devex.