QUIC for the kernel

213 points by Bogdanp on 7/31/2025, 3:57:32 PM | lwn.net | 160 comments

qwertox · 7h ago
I recently had to add `ssl_preread_server_name` to my NGINX configuration in order to `proxy_pass` requests for certain domains to another NGINX instance. In this setup, the first instance simply forwards the raw TLS stream (with `proxy_protocol` prepended), while the second instance handles the actual TLS termination.

This approach works well when implementing a failover mechanism: if the default path to a server goes down, you can update DNS A records to point to a fallback machine running NGINX. That fallback instance can then route requests for specific domains to the original backend over an alternate path without needing to replicate the full TLS configuration locally.

However, this method won't work with HTTP/3. Since HTTP/3 uses QUIC over UDP and encrypts the SNI during the handshake, `ssl_preread_server_name` can no longer be used to route based on domain name.

What alternatives exist to support this kind of SNI-based routing with HTTP/3? Is the recommended solution to continue using HTTP/1.1 or HTTP/2 over TLS for setups requiring this behavior?

dgl · 4h ago
Clients supporting QUIC usually also support HTTPS DNS records, so you can use a lower priority record as a failover, letting the client potentially take care of it. (See for example: host -t https dgl.cx.)

That's the theory anyway. You can't always rely on clients to do that (see how much of the HTTPS record Chromium actually supports[1]), but in general if QUIC fails for any reason clients will transparently fall back, as well as respecting the Alt-Svc[2] header. If this is a planned failover you could stop sending an Alt-Svc header and wait for the alternative to time out, although it isn't strictly necessary.

If you really do want to route QUIC, however, one nice property is that the SNI is always in the first packet, so you can route flows by inspecting the first packet. See Cloudflare's udpgrm[3] (this on its own isn't enough to proxy to another machine, but the building block is there).
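
To give an idea of how little is needed for connection-ID based routing, here is a rough sketch (my own, not taken from udpgrm) of pulling the Destination Connection ID out of a long-header packet using only the QUIC invariants (RFC 8999); that's enough to pin a flow to a backend, though getting at the SNI itself needs the decryption described below:

    /* Sketch: extract the Destination Connection ID from a QUIC
     * long-header packet (RFC 8999 invariants only), e.g. to hash
     * flows to backends. Error handling trimmed. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Returns 0 and fills dcid/dcid_len on success, -1 otherwise. */
    int quic_extract_dcid(const uint8_t *pkt, size_t len,
                          uint8_t dcid[255], size_t *dcid_len)
    {
        if (len < 6 || (pkt[0] & 0x80) == 0)
            return -1;                  /* runt or short-header packet */
        size_t cid_len = pkt[5];        /* bytes 1-4 are the version */
        if (len < 6 + cid_len)
            return -1;
        memcpy(dcid, pkt + 6, cid_len); /* DCID follows its length byte */
        *dcid_len = cid_len;
        return 0;
    }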

Without Encrypted Client Hello (ECH), the client hello (including the SNI) is encrypted with a known key (this is to stop middleboxes which don't know about that version of QUIC from breaking it), so it is possible to decrypt it; see the code in udpgrm[4]. With ECH, the "router" would need a key to decrypt the ECH, which it can then decrypt inline and make a decision on. That key is different from the TLS key, and the fallback HTTPS records can use a different ECH key than the non-fallback route (whether browsers currently support that is a different issue, but it is possible in the protocol). This is similar to how fallback with ECH could be supported with HTTP/2 and a TCP connection.

[1]: https://issues.chromium.org/issues/40257146

[2]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

[3]: https://blog.cloudflare.com/quic-restarts-slow-problems-udpg...

[4]: https://github.com/cloudflare/udpgrm/blob/main/ebpf/ebpf_qui...

johncolanduoni · 2h ago
For a failover circumstance, I wouldn't bother with failover for QUIC at all. If a browser can't make a QUIC connection (even if one is advertised in DNS), it will try HTTP/1.1 or HTTP/2 over TLS. Then you can use the same fallback mechanism you would if QUIC weren't in the picture.
jcgl · 6h ago
Hm, that’s a good question. I suppose the same would apply to TCP+TLS with Encrypted Client Hello as well, right? Presumably the answer would be the same/similar between the two.
MadnessASAP · 4h ago
Unfortunately I think that falls under the "Not a bug" category of bugs. Keeping the server name concealed all the way to the TLS endpoint is a feature* of HTTP/3.

* I do actually consider it a feature, but do acknowledge https://xkcd.com/1172/

PS. HAProxy can proxy raw TLS, but can't route based on hostname. Cloudflare Tunnel, I think, has some special sauce that can proxy by hostname without terminating TLS, but it requires using them as your DNS provider.

dgl · 3h ago
Unless you're using ECH (Encrypted Client Hello), the server name is obscured (encrypted with known keys), not concealed.

PS: HAProxy definitely can do this too, with something like this using req.ssl_sni:

   frontend tcp-https-plain
       mode tcp
       tcp-request inspect-delay 10s
       bind [::]:443 v4v6 tfo
       acl clienthello req.ssl_hello_type 1
       acl example.com req.ssl_sni,lower,word(-1,.,2) example.com
       tcp-request content accept if clienthello
       tcp-request content reject if !clienthello
       default_backend tcp-https-default-proxy
       use_backend tcp-https-example-proxy if example.com
Then tcp-https-example-proxy is a backend which forwards to a server listening for HTTPS (and using send-proxy-v2, so the client IP is kept). Cloudflare really isn't doing anything special here; there are also other tools like sniproxy[1] which can intercept based on SNI (a common thing commercial proxies do for filtering reasons).

[1]: https://github.com/ameshkov/sniproxy

bjourne · 22m ago
What is the need for mashing more and more stuff into the kernel? I thought the job of the kernel was to manage memory, hardware, and tasks. Shouldn't protocols built on top of IP be handled by userland?
leoh · 8m ago
Maybe. Getting stuff into the kernel means (in theory) it’s been hardened, it has a serious LTS, and benefits from… well, the performance of being part of the kernel.
WASDx · 10h ago
I recall this article on QUIC disadvantages: https://www.reddit.com/r/programming/comments/1g7vv66/quic_i...

Seems like this is a step in the right direction to resolve some of those issues. I suppose nothing is preventing it from getting hardware support in future network cards as well.

miohtama · 10h ago
QUIC does not work very well for use cases like machine-to-machine traffic. However, most of the traffic on the Internet today is from mobile phones to servers, and that is where QUIC and HTTP/3 shine.

For other use cases we can keep using TCP.

thickice · 9h ago
Why doesn't QUIC work well for machine-to-machine traffic? Is it due to the lack of the offloads/optimizations that TCP has, given that machine-to-machine traffic tends to be high volume/high rate?
yello_downunder · 9h ago
QUIC would work okay, but doesn't really have many advantages for machine-to-machine traffic. Machine-to-machine you tend to have long-lived connections over a pretty good network. In this situation TCP already works well and is currently handled better in the kernel. Eventually QUIC will probably be just as good as TCP in this use case, but we're not there yet.
jabart · 9h ago
You still have latency, legacy window sizes, and packet schedulers to deal with.
spwa4 · 7h ago
But that is the huge advantage of QUIC: it does NOT totally outcompete TCP traffic on links (we already have BitTorrent over UDP for that purpose). They redesigned the protocol five times or so to achieve that.
extropy · 9h ago
NAT firewalls do not like P2P UDP traffic. The majority of routers lack the smarts to pass QUIC through correctly; they need to treat it essentially the same as TCP.
johncolanduoni · 3h ago
QUIC isn’t generally P2P though. Browsers don’t support NAT traversal for it.
beeflet · 9h ago
NAT is the devil. bring on the IPoc4lypse
hdgvhicv · 8h ago
NAT is massively useful for all sorts of reasons which have nothing to do with IP limitations.
paulddraper · 6h ago
The NAT RFC talks purely about IP exhaustion.

What do you have in mind?

skissane · 4h ago
Why run your K8s cluster on IPv6 when IPv4 with 10.0.0.0/8 works perfectly well with less hassle? You can always support IPv6 at the perimeter for ingress/egress. If your cluster is so big it can't fit in 10.0.0.0/8, maybe the right answer is multiple smaller clusters; your service mesh (e.g. Istio) can route inter-cluster traffic just based on names, not IPs.

And if 10.0.0.0/8 is not enough, there is always the old Class E range, 240.0.0.0/4. It is likely never going to be acceptable for use on the public Internet, but it is seeing growing use as an additional private IPv4 address range, and it gives you over 200 million more IPv4 addresses.

Iwan-Zotow · 2h ago
Kubes
beeflet · 6h ago
Sounds great, but NAT fucks up P2P on residential connections, where it is mostly used due to IPv4 address conservation. You can still have NAT with IPv6, but hopefully I won't have to deal with it.
mightyham · 3h ago
In practice, P2P over IPv6 is totally screwed because there are no widely supported protocols for dynamic firewall pinholing (allowing inbound traffic) on home routers, whereas dynamic IPv4 NAT configuration via UPnP is very popular and used by many applications.
johncolanduoni · 2h ago
Most home routers do a form of stateful IPv6 firewalling (and IPv4 NAT, for that matter) that is compatible with STUN. UPnP is almost never necessary and has frequent security flaws in common implementations.
beeflet · 3h ago
just don't use a firewall
unethical_ban · 8h ago
Rather, NAT is a bandage for all sorts of reasons besides IP exhaustion.

Example: Janky way to get return routing for traffic when you don't control enterprise routes.

Source: FW engineer

dan-robertson · 8h ago
I think basically there is currently a lot of overhead, and when you control the network more and everything is more reliable, you can make TCP work better.
m00x · 9h ago
It's explained in the reddit thread. Most of it is because you have to handle a ton of what TCP does in userland.
exabrial · 8h ago
For starters, why encrypt something literally in the same datacenter, 6 feet away? It adds significant latency and processing overhead.
sleepydog · 5h ago
Encryption gets you data integrity "for free". If a bit is flipped by faulty hardware, the packet won't decrypt. TCP checksums are not good enough for catching corruption in many cases.
lll-o-lll · 6h ago
To stop or slow down an attacker who is inside your network and trying to move laterally? Isn't this the principle of defense in depth?
20k · 7h ago
Because the NSA actively intercepts that traffic. There's a reason why encryption is non-optional.
Karrot_Kream · 7h ago
To me this seems outlandish (e.g. if you're part of PRISM you know what's happening and you're forced to comply). But to think through this threat model: you're worried that the NSA will tap intra-DC traffic, but not that it will try to install software or hardware on your hosts to spy on traffic at the NIC level? I guess it would be harder to intercept and untangle traffic at the NIC level than intra-DC, but I'm not sure.
viraptor · 7h ago
> you're worried that the NSA will tap intra-DC traffic but not that it will try to install software or hardware on your hosts

It doesn't have to be one or the other. We've known for over a decade that the traffic between DCs was tapped https://www.theguardian.com/technology/2013/oct/30/google-re... Extending that to intra-DC wouldn't be surprising at all.

Meanwhile backdoored chips and firmware attacks are a constant worry and shouldn't be discounted regardless of the first point.

adgjlsfhk1 · 6h ago
The difference between tapping intra-DC links and on-host spying is that on-host spying is much more likely to get caught and much less easily able to get data out. There's a pretty big difference between software/hardware weaknesses that require specific targeting to exploit and passively scooping everything up and scanning it.
cherryteastain · 7h ago
If you are concerned about this, how do you think you could protect against AWS etc allowing NSA to snoop on you from the hypervisor level?
exabrial · 7h ago
Imaginary problems are the funnest to solve.
20k · 1h ago
It's a stone-cold fact that the NSA does this; it was part of the Snowden revelations. Don't spread FUD about security, it's important.
switchbak · 7h ago
Service meshes often encrypt traffic that may be running on the same physical host. Your security policy may simply require this.
mschuster91 · 5h ago
Because any random machine in the same datacenter and network segment might be compromised and do stuff like running ARP spoofing attacks. Cisco alone has had so many vendor-provided backdoors cropping up that I wouldn't trust anything in a data center with Cisco gear.
Ericson2314 · 9h ago
What will the socket API look like for multiple streams? I guess it is implied it is the same as multiple connections, with caching behind the scenes.

I would hope for something more explicit, where you get a connection object and then open streams from it, but I guess that is fine for now.

https://github.com/microsoft/msquic/discussions/4257 ah but look at this --- unless this is an extension, the server side can also create new streams, once a connection is established. The client creating new "connections" (actually streams) cannot abstract over this. Something fundamentally new is needed.

My guess is recvmsg to get a new file descriptor for new stream.

gte525u · 9h ago
I would look at the SCTP socket API; it supports multistreaming.
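
For a taste of what that API looks like with lksctp-tools (a sketch; error handling omitted, and filling in sinfo on receive assumes the data io event / SCTP_RECVRCVINFO subscription is enabled on the socket): the sender names a stream by ordinal, and the receiver learns which stream a message arrived on.

    /* Sketch of per-message stream handling with the lksctp API
     * (link with -lsctp). Not a complete program. */
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <string.h>

    void send_on_stream(int fd, const void *buf, size_t len, uint16_t stream_no)
    {
        /* The stream is chosen per message, by ordinal number. */
        sctp_sendmsg(fd, buf, len, NULL, 0,
                     0 /* ppid */, 0 /* flags */,
                     stream_no, 0 /* ttl */, 0 /* context */);
    }

    int recv_with_stream(int fd, void *buf, size_t len, uint16_t *stream_no)
    {
        struct sctp_sndrcvinfo sinfo;
        int flags = 0;

        memset(&sinfo, 0, sizeof(sinfo));
        int n = sctp_recvmsg(fd, buf, len, NULL, NULL, &sinfo, &flags);
        if (n >= 0)
            *stream_no = sinfo.sinfo_stream; /* which stream it arrived on */
        return n;
    }
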
wahern · 5h ago
signa11 · 1h ago
> API RFC is ...

still a draft though.

Ericson2314 · 5h ago
Ah fuck, it still has a stream_id notion

How are socket APIs always such garbage....

wahern · 2h ago
At least the SCTP API has sctp_peeloff, which gives you a new single-stream socket descriptor for the connection. Maybe QUIC will get something like that, eventually. Kind of a glaring omission, though, unless I'm misunderstanding.
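
For reference, a minimal sketch of sctp_peeloff() with lksctp-tools; it branches a single association off a one-to-many socket into its own one-to-one descriptor:

    /* Sketch: peel one association off a one-to-many SCTP socket so it
     * can be polled or handed to a worker on its own fd. */
    #include <netinet/in.h>
    #include <netinet/sctp.h>

    int peel_off(int one_to_many_fd, sctp_assoc_t assoc_id)
    {
        /* Returns a new one-to-one socket bound to just this association;
         * the original socket keeps serving the remaining associations. */
        return sctp_peeloff(one_to_many_fd, assoc_id);
    }
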
Ericson2314 · 2h ago
Yeah. Huge omission.
mananaysiempre · 4h ago
SCTP is very telecom-shaped; in particular, IIRC, the number of streams is fixed at the start of the connection, so (this sucks but also) GP’s problem does not appear.
Ericson2314 · 6h ago
I checked that out and....yuck!

- Send specifies which stream by ordinal number? (Can't have different parts of a concurrent app independently open new streams)

- Receive doesn't specify which stream at all?!

another_twist · 7h ago
I have a question: the bottleneck for TCP is said to be the handshake. But that can be solved by reusing connections and/or multiplexing. The current in-kernel QUIC implementation is 3-4x slower than the existing Linux TCP implementation, and the performance gap is expected to close.

If speed is touted as the advantage for QUIC and it is in fact slower, why bother with this protocol? The author of the PR attributes some of the speed issues to the protocol design itself. Are there other problems in TCP that need fixing?

jauntywundrkind · 7h ago
The article discusses many of the reasons QUIC is currently slower. Most of them seem to come down to "we haven't done any optimization for this yet".

> Long offers some potential reasons for this difference, including the lack of segmentation offload support on the QUIC side, an extra data copy in transmission path, and the encryption required for the QUIC headers.

All of these three reasons seem potentially very addressable.
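
For example, on the segmentation-offload point, userspace QUIC stacks can already claw some of the cost back with UDP GSO, where one send hands the kernel a buffer it splits into MTU-sized datagrams; a rough sketch, assuming Linux 4.18+ (constants guarded for older headers):

    /* Sketch: enable UDP generic segmentation offload so one write can
     * carry a burst of equally sized QUIC packets. Error handling trimmed. */
    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <sys/socket.h>

    #ifndef SOL_UDP
    #define SOL_UDP 17
    #endif
    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103        /* Linux >= 4.18 */
    #endif

    int make_gso_udp_socket(int segment_size)
    {
        int fd = socket(AF_INET6, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;
        /* The kernel (or a capable NIC) splits each large send into
         * segment_size-byte datagrams, amortizing per-packet overhead. */
        if (setsockopt(fd, SOL_UDP, UDP_SEGMENT,
                       &segment_size, sizeof(segment_size)) < 0) {
            /* Old kernel: fall back to one datagram per send. */
        }
        return fd;
    }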

It's worth noting that the benchmark here is on pristine network conditions, a drag race if you will. If you are on mobile, your network will have a lot more variability, and there TCP's design limits are going to become much more apparent.

TCP itself often has protocols run on top of it to do QUIC-like things; HTTP/2 is an example of this. So when you compare QUIC and TCP, it's kind of like comparing how fast a car goes with how fast an engine bolted to a frame with wheels goes. QUIC goes significantly up the OSI network stack, to layer 5+, whereas TCP+TLS is layer 3. That's less of the system design.

QUIC also has wins for connecting faster, and especially for reconnecting faster. It also has IP mobility: if you're on mobile and your IP address changes (happens!) QUIC can keep the session going without rebuilding it once the client sends the next packet.

It's a fantastically well-thought-out & awesome advancement, radically better in so many ways. The advantage of having multiple non-blocking streams (like SCTP) massively reduces the scope that higher-level protocol design has to take on. And all that multi-streaming stuff being in the kernel means it's deeply optimizable in a way TCP can never enjoy.

Time to stop driving the old rust bucket jalopy of TCP around everywhere, crafting weird elaborate handmade shit atop it. We need a somewhat better starting place for higher level protocols and man oh man is QUIC alluring.

redleader55 · 5h ago
> QUIC goes significantly up the OSI network stack, is layer 5+, where-as TCP+TLS is layer 3

IP is layer 3, network (ensures packets are routed to the correct host). TCP is layer 4, transport (some people argue that TCP has functions from layer 5, e.g. establishing sessions between apps), while TLS adds a few functions from layer 6 (e.g. encryption), which QUIC also has.

w3ll_w3ll_w3ll · 5h ago
TCP is layer 4 in the OSI model.
morning-coffee · 7h ago
That's just one bottleneck. The other issue is head-of-line blocking. When there is packet loss on a TCP connection, nothing sent after that is delivered until the loss is repaired.
anonymousiam · 7h ago
TCP windowing fixes the issue you are describing. Make the window big and TCP will keep sending when there is a packet loss. It will also retry and usually recover before the end of the window is reached.

https://en.wikipedia.org/wiki/TCP_window_scale_option

quietbritishjim · 7h ago
The statement in the comment you're replying to is still true. While waiting for those missed packets, the later packets will not be dropped if you have a large window size. But they won't be delivered either. They'll be cached in the kernel, even though the application might be able to make use of them before the earlier blocked packet arrives.
Twirrim · 5h ago
The queuing discipline used by default (pfifo_fast) is barely more than 3 FIFO queues bundled together. The 3 queues allow for a barest minimum semblance of prioritisation of traffic, where Queue 0 > 1 > 2, and you can tweak some tcp parameters to have your traffic land in certain queues. If there's something in queue 0 it must be processed first before anything in queue 1 gets touched etc.

Those queues operate on a purely head-of-queue basis. If what is at the head of queue 0 is blocked in any way, the whole queue behind it gets stuck, regardless of whether it is talking to the same destination or a completely different one.

I've seen situations where a glitching network card caused some serious knock on impacts across a whole cluster, because the card would hang or packets would drop, and that would end up blocking the qdisc on a completely healthy host that was in the middle of talking to it, which would have impacts on any other host that happened to be talking to that healthy host. A tiny glitch caused much wider impacts than you'd expect.

The same kind of effect would happen from a VM that went through live migration. The tiny, brief pause would cause a spike of latency all over the place.

There are alternatives like fq_codel that can mitigate some of this, but you do have to pay a small amount of processing overhead on every packet, because now you have a queuing discipline that actually needs to track some semblance of state.

morning-coffee · 6h ago
They are unrelated. Larger windows help achieve higher throughput over paths with high delay. You allude to selective acknowledgements as a way to repair loss before the window completely drains which is true, but my point is that no data can be delivered to the application until the loss is repaired (and that repair takes at least a round-trip time). (Then the follow-on effects from noticed loss on the congestion controller can limit subsequent in-flight data for a time, etc, etc.)
anonymousiam · 18m ago
The application will hang waiting for the stack, but the stack keeps working and once the drop is remedied, the application will get a flood of data at a higher rate than the max network rate. So the application may pause sometimes, but the average rate of throughput is not much affected by drops.
another_twist · 7h ago
What's the packet loss rate on modern networks? Curious.
deathanatos · 24m ago
… from 0% (a wired home LAN with nothing screwy going on) to 100% (e.g., cell reception at the San Antonio Caltrain station), depending on conditions…?

As it always has been, and always will be.

adgjlsfhk1 · 6h ago
~80% when you step out of wifi range on your cell phone.
wmf · 7h ago
It can be high on cellular.
reliablereason · 7h ago
That depends on how much data you are pushing. If you are pushing 200 Mbps down a 100 Mbps line, you will get 50% packet loss.
spwa4 · 7h ago
Well, yes, that's the idea behind TCP itself, but a "normal" rate of packet loss is something along the lines of 5/100k packets dropped on any given long-haul link. Let's say a random packet passes about 8 such links, so a "normal" end-to-end rate of packet loss is 0.04% or so.
positr0n · 51m ago
Once it makes it to the long-haul links. Measure starting at your cell phone and packet loss is much higher than that, and that's where QUIC shines.
bawolff · 2h ago
> the bottleneck for TCP is said to be the handshake. But that can be solved by reusing connections

You can't reuse a connection that doesn't exist yet. A lot of this is about reducing latency, not overall speed.

frmdstryr · 6h ago
The "advantage" is tracking via the server provided connection ID https://www.cse.wustl.edu/~jain/cse570-21/ftp/quic/index.htm...
bawolff · 2h ago
That's nonsensical. The connection ID doesn't allow any tracking that you couldn't do with TCP.
Bender · 9h ago
I don't know about using it in the kernel, but I would love to see OpenSSH support QUIC so that I get some of the benefits of Mosh [1] (fewer state-table and keep-alive issues, roaming support, etc.) while still having all the features of OpenSSH, including SFTP, SOCKS, and port forwarding... Could OpenSSH leverage the kernel support?

[1] - https://mosh.org/

wmf · 8h ago
SSH would need a lot of work to replace its crypto and mux layers with QUIC. It's probably worth starting from scratch to create a QUIC login protocol. There are a bunch of different approaches to this in various states of prototyping out there.
Bender · 5h ago
Fair points. I suppose Mosh would be the proper starting point then. I'm just selfish and want the benefits of QUIC without losing all the really useful features of OpenSSH.
bauruine · 8h ago
OpenSSH is an OpenBSD project, so I guess a Linux API isn't that interesting to them, but I could be wrong of course.
skissane · 1h ago
Once Linux implements it, I think odds are high that FreeBSD sooner or later does too. And maybe NetBSD and XNU/Darwin/macOS/iOS thereafter. And if they’ve all got it, that increases the odds that eventually OpenBSD also implements it. And if OpenBSD has the support in its kernel, then they might be willing to consider accepting code in OpenSSH which uses it. So OpenSSH supporting QUIC might eventually happen, but if it does, it is going to be some years away
Bender · 5h ago
That's a good point. At least it would not be an entirely new idea. [1] Curious what reactions he received.

[1] - https://papers.freebsd.org/2022/eurobsdcon/jones-making_free...

xgpyc2qp · 7h ago
Looks good. QUIC is a real game changer for many; the Internet should be a little faster with it. Probably we will not care because of 5G, but it's still valuable. I am wondering why there are two separate handshakes, though; I was thinking that QUIC embeds TLS, but it seems I am wrong.
kibwen · 9h ago
I'm confused, I thought the revolution of the past decade or so was in moving network stacks to userspace for better performance.
toast0 · 6h ago
Most QUIC stacks are built upon in-kernel UDP. You get significant performance benefits if you can avoid your traffic going through kernel and userspace and the context switches involved.

You can work that angle by moving networking into userspace: setting up the NIC queues so that userspace can access them directly, without needing to context switch into the kernel.

Or you can work the angle by moving networking into kernel space: things like sendfile, which lets a TCP application instruct the kernel to send a file to the peer without needing to copy the content into userspace and then back into kernel space and finally into the device memory. If you have in-kernel TLS with sendfile, you can continue to skip copying to userspace; if you have NIC-based TLS, the kernel doesn't need to read the data from the disk; if you have NIC-based TLS and the disk can DMA to the NIC buffers, the data doesn't even need to hit main memory. Etc.

But most QUIC stacks don't get a benefit from either side of that. They're reading and writing packets via syscalls, and they're doing all the packetization in user space. No chance to sendfile and skip a context switch and skip a copy. Batching I/O via io_uring or similar helps with context switches, but probably doesn't prevent copies.
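
To make the TCP-side path concrete, the kTLS-plus-sendfile flow being described looks roughly like this (a sketch only: the handshake still happens in userspace, the crypto state shown is assumed to have been extracted from the TLS library, and the kernel needs CONFIG_TLS):

    /* Sketch: hand TLS record framing/encryption to the kernel (kTLS),
     * then sendfile() straight from the page cache, skipping the copy
     * through userspace. Error handling trimmed. */
    #include <linux/tls.h>
    #include <netinet/tcp.h>
    #include <sys/sendfile.h>
    #include <sys/socket.h>

    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif
    #ifndef TCP_ULP
    #define TCP_ULP 31
    #endif

    /* ci: TLS 1.2 AES-128-GCM write key/IV/salt/sequence taken from the
     * userspace TLS library after the handshake has completed. */
    int send_file_with_ktls(int sock, int file_fd, size_t count,
                            const struct tls12_crypto_info_aes_gcm_128 *ci)
    {
        /* 1. Attach the TLS upper-layer protocol to the TCP socket. */
        if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;
        /* 2. Install the transmit crypto state negotiated in userspace. */
        if (setsockopt(sock, SOL_TLS, TLS_TX, ci, sizeof(*ci)) < 0)
            return -1;
        /* 3. The kernel now encrypts records itself, so the file contents
         *    never have to be copied through userspace. */
        return sendfile(sock, file_fd, NULL, count) < 0 ? -1 : 0;
    }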

shanemhansen · 8h ago
You are right but it's confusing because there are two different approaches. I guess you could say both approaches improve performance by eliminating context switches and system calls.

1. Kernel bypass combined with DMA and techniques like dedicating a CPU to packet processing improve performance.

2. What I think of as "removing userspace from the data plane" improves performance for things like sendfile and ktls.

To your point, QUIC in the kernel seems to have neither advantage.

FuriouslyAdrift · 7h ago
So... RDMA?
michaelsshaw · 7h ago
No, the first technique is the basic way they already operate, DMA, but with userspace given direct access to it because it's a zero-copy buffer. This is handled by the OS.

RDMA is directly from bus-to-bus, bypassing all the software.

Karrot_Kream · 7h ago
You still need to offload your bytes to a NIC buffer. Either you do something like DMA, where you get privileged space to write your bytes to that the NIC reads from, or you have to cross the syscall barrier and have the kernel write the bytes into the NIC's buffer. Crossing the syscall barrier adds a huge performance penalty due to the switch in memory space and privilege rings, so userspace networking only makes sense if you don't have to deal with the privilege changes or you have DMA.
Veserv · 6h ago
That is only a problem if you do one or more syscalls per packet, which is an utterly bone-headed design.

The copy itself is going at 200-400 Gbps so writing out a standard 1,500 byte (12,000 bit) packet takes 30-60 ns (in steady state with caches being prefetched). Of course you get slaughtered if you stupidly do a syscall (~100 ns hardware overhead) per packet since that is like 300% overhead. You just batch like 32 packets so the write time is ~1,000-2,000 ns then your overhead goes from 300% to 10%.

At a 1 Gbps throughput, that is ~80,000 packets per second, or one packet per ~12.5 us. So, waiting for a 32-packet batch only adds an additional ~400 us to your end-to-end latency in return for 4x efficiency (assuming that was your bottleneck, which it is not for these implementations, as they are nowhere near the actual limits). If you go up to 10 Gbps, that is only ~40 us of added latency, and at 100 Gbps you are only looking at ~4 us of added latency for a literal 4x efficiency improvement.
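
Linux already has a syscall for exactly that kind of batching; a minimal sketch with sendmmsg(), which hands the kernel a whole array of datagrams in one crossing (assumes a connected UDP socket and already-serialized packets):

    /* Sketch: one syscall for a batch of packets instead of one per
     * packet. Error handling trimmed. */
    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH 32

    int send_batch(int fd, unsigned char *pkts[BATCH], size_t lens[BATCH])
    {
        struct mmsghdr msgs[BATCH];
        struct iovec iov[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iov[i].iov_base = pkts[i];
            iov[i].iov_len  = lens[i];
            msgs[i].msg_hdr.msg_iov    = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }
        /* Returns how many of the BATCH messages were actually sent. */
        return sendmmsg(fd, msgs, BATCH, 0);
    }

recvmmsg() is the matching tool on the receive side.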

zamalek · 8h ago
What is done for that is userspace gets the network data directly without (I believe) involving syscalls. It's not something you'd do for end-user software, only the likes of MOFAANG need it.

In theory the likes of io_uring would bring these benefits across the board, but we haven't seen that delivered (yet, I remain optimistic).

phlip9 · 4h ago
I'm hoping we get there too with io_uring. It looks like the last few kernel releases have made a lot of progress with zero-copy TCP rx/tx, though NIC support is limited and you need some finicky network iface setup to get the flow steering working

https://docs.kernel.org/networking/iou-zcrx.html

michaelsshaw · 9h ago
The constant mode switching for hardware access is slow. TCP/IP remains in the kernel for windows and Linux.
0xbadcafebee · 7h ago
Networking is much faster in the kernel. Even faster on an ASIC.

Network stacks were moved to userspace because Google wanted to replace TCP itself (and upgrade TLS), but it only cared about the browser, so they just put the stack in the browser, and problem solved.

wmf · 8h ago
Performance comes from dedicating core(s) to polling, not from userspace.
jbritton · 7h ago
The article didn’t discuss ACK. I have often wondered if it makes sense for the protocol to not have ACKs, and to leave that up to the application layer. I feel like the application layer has to ensure this anyway, so I don’t know how much benefit it is to additionally support this at a lower layer.
dahfizz · 10h ago
> QUIC is meant to be fast, but the benchmark results included with the patch series do not show the proposed in-kernel implementation living up to that. A comparison of in-kernel QUIC with in-kernel TLS shows the latter achieving nearly three times the throughput in some tests. A comparison between QUIC with encryption disabled and plain TCP is even worse, with TCP winning by more than a factor of four in some cases.

Jesus, that's bad. Does anyone know if userspace QUIC implementations are also this slow?

dan-robertson · 10h ago
I think the ‘fast’ claims are just different. QUIC is meant to make things fast by:

- having a lower latency handshake

- avoiding some badly behaved ‘middleware’ boxes between users and servers

- avoiding resetting connections when user IP addresses change

- avoiding head of line blocking / the increased cost of many connections ramping up

- avoiding poor congestion control algorithms

- probably other things too

And those are all things about working better with the kind of network situations you tend to see between users (often on mobile devices) and servers. I don’t think QUIC was meant to be fast by reducing OS overhead on sending data, and one should generally expect it to be slower for a long time until operating systems become better optimised for this flow and hardware supports offloading more of the work. If you are Google then presumably you are willing to invest in specialised network cards/drivers/software for that.

dahfizz · 9h ago
Yeah I totally get that it optimizes for different things. But the trade offs seem way too severe. Does saving one round trip on the handshake mean anything at all if you're only getting one fourth of the throughput?
dan-robertson · 9h ago
Are you getting one fourth of the throughput? Aren’t you going to be limited by:

- bandwidth of the network

- how fast the nic on the server is

- how fast the nic on your device is

- whether the server response fits in the amount of data that can be sent given the client’s initial receive window or whether several round trips are required to scale the window up such that the server can use the available bandwidth

yello_downunder · 9h ago
It depends on the use case. If your server is able to handle 45k connections but 42k of them are stalled because of mobile users with too much packet loss, QUIC could look pretty attractive. QUIC is a solution to some of the problematic aspects of TCP that couldn't be fixed without breaking things.
drewg123 · 8h ago
The primary advantage of QUIC for things like congestion control is that companies like Google are free to innovate both sides of the protocol stack (server in prod, client in chrome) simultaneously. I believe that QUIC uses BBR for congestion control, and the major advantage that QUIC has is being able to get a bit more useful info from the client with respect to packet loss.

This could be achieved by encapsulating TCP in UDP and running a custom TCP stack in userspace on the client. That would allow protocol innovation without throwing away 3 decades of optimizations in TCP that make it 4x as efficient on the server side.

brokencode · 9h ago
Maybe it’s a fourth as fast in ideal situations with a fast LAN connection. Who knows what they meant by this.

It could still be faster in real world situations where the client is a mobile device with a high latency, lossy connection.

eptcyka · 9h ago
There are claims of 2x-3x operating costs on the server side to deliver better UX for phone users.
jeroenhd · 9h ago
> - avoiding some badly behaved ‘middleware’ boxes between users and servers

Surely badly behaving middleboxes won't just ignore UDP traffic? If anything, they'd get confused about udp/443 and act up, forcing clients to fall back to normal TCP.

zamadatix · 8h ago
Your average middlebox will just NAT UDP (unless it's outright blocked by security policy) and move on. It's TCP where many middleboxes think they can "help" the congestion signaling, latch more deeply into the session information, or worse. Unencrypted protocols can have further interference under either TCP or UDP beyond this note.

QUIC is basically about taking all of the information middleboxes like to fuck with in TCP, putting it under the encryption layer, and packaging it back up in a UDP packet precisely so it's either just dropped or forwarded. In practice this (i.e. QUIC either being just dropped or left alone) has actually worked quite well.

Veserv · 9h ago
Yes. msquic is one of the best performing implementations and only achieves ~7 Gbps [1]. The benchmarks for the Linux kernel implementation only get ~3 Gbps to ~5 Gbps with encryption disabled.

To be fair, the Linux kernel TCP implementation only gets ~4.5 Gbps at normal packet sizes and still only achieves ~24 Gbps with large segmentation offload [2]. Both of which are ridiculously slow. It is straightforward to achieve ~100 Gbps/core at normal packet sizes without segmentation offload with the same features as QUIC with a properly designed protocol and implementation.

[1] https://microsoft.github.io/msquic/

[2] https://lwn.net/ml/all/cover.1751743914.git.lucien.xin@gmail...

klabb3 · 10h ago
Yes, they are. Worse, I've seen them shrink down to nothing in the face of congestion with TCP traffic. If QUIC is indeed the future protocol, it's a good thing to move it into the kernel IMO. It's just madness to provide these massive userspace impls everywhere, on a packet-switched protocol no less, and expect it to beat good old TCP. Wouldn't surprise me if we need optimizations all the way down to the NIC layer, and maybe even middleboxes. Oh, and I haven't even mentioned the CPU cost of UDP.

OTOH, TCP is like a quiet guy at the gym who always wears baggy clothes but does 4 plates on the bench when nobody is looking. Don't underestimate. I wasted months to learn that lesson.

vladvasiliu · 10h ago
Why is QUIC being pushed, then?
klabb3 · 8h ago
From what I understand, the "killer app" initially was spotty mobile networks. TCP is interface- (and IP-) specific, so if you switch from WiFi to LTE the connection breaks (or worse, degrades/times out slowly). QUIC has a logical connection ID that continues to work even when a peer changes paths. Thus, your YouTube ads will not buffer.

Secondarily, you have the reduced RTT, multiple streams (preventing HOL blocking), datagrams (realtime video on the same conn), and the ability to scale buffers (in userspace) to avoid BDP limits imposed by the kernel. However, I think in practice those haven't gotten as much visibility and traction, so the original reason is still the main one from what I can tell.

wahern · 5h ago
MPTCP provides interface mobility. It's seen widespread deployment with the iPhone, so network support today is much better than one would assume. Unlike QUIC, the changes required by applications are minimal to none. And it's backward compatible; an application can request MPTCP, but if the other end doesn't support it, everything still works.
toast0 · 9h ago
It has good properties compared to TCP-in-TCP (HTTP/2), especially when connected to clients without access to modern congestion control on iffy networks. HTTP/2 was perhaps adopted too broadly; the binary protocol is useful, header compression is useful (but sometimes dangerous), but multiplexing over a single TCP connection is bad unless you have very low loss... it's not ideal for phones with inconsistent networking.
favflam · 10h ago
I know that in the P2P space, peers have to send lots of small pieces of data. QUIC stops streams from blocking on a single delayed packet.
fkarg · 9h ago
because it _does_ provide a number of benefits (potentially fewer initial round-trips, more dynamic routing control by using UDP instead of TCP, etc), and is a userspace software implementation compared with a hardware-accelerated option.

QUIC getting hardware acceleration should close this gap, and keep all the benefits. But a kernel (software) implementation is basically necessary before it can be properly hardware-accelerated in future hardware (is my current understanding)

01HNNWZ0MV43FF · 9h ago
To clarify, the userspace implementation is not a benefit, it's just that you can't have a brand new protocol dropped into a trillion dollars of existing hardware overnight, you have to do userspace first as PoC

It does save 2 round-trips during connection compared to TLS-over-TCP, if Wikipedia's diagram is accurate: https://en.wikipedia.org/wiki/QUIC#Characteristics That is a decent latency win on every single connection, and with 0-RTT you can go further, but 0-RTT is stateful and hard to deploy and I expect it will see very little use.

dan-robertson · 10h ago
The problem it is trying to solve is not overhead of the Linux kernel on a big server in a datacenter
userbinator · 2h ago
Google wants control.
eptcyka · 9h ago
QUIC performance requires careful use of batching. Using UDP sockets naively, i.e. sending one QUIC packet per syscall, will incur a lot of overhead: every time, the kernel has to figure out which interface to use, queue the packet up on a buffer, and all the rest. If one uses it like TCP, batching up lots of data and enqueuing packets in one "call" helps a ton. Similarly, the kernel WireGuard implementation can be slower than wireguard-go, since it doesn't batch traffic. At the speeds offered by modern hardware, we really need to use vectored I/O to be efficient.
userbinator · 2h ago
IMO being Google's proprietary crap is enough reason to stay away. It not actually being any better is an even more compelling reason.
0x457 · 6h ago
I would expect a protocol such as TCP to perform much better than QUIC in benchmarks. Now do a realistic benchmark over a roaming LTE connection and come back with the results.

Without seeing actual benchmark code, it's hard to tell if you should even care about that specific result.

If your goal is to pipe lots of bytes from A to B over an internal network or the public internet, there probably aren't many things, if any, that can outperform TCP. Decades were spent optimizing TCP for that. If HOL blocking isn't an issue for you, then you can keep using HTTP over TCP.

rayiner · 9h ago
It’s an interesting testament to how well designed TCP is.
adgjlsfhk1 · 8h ago
IMO, it's more a testament to how fast hardware designers can make things with 30 years to tune.
jeffbee · 10h ago
This seems to be a categorical error, for reasons that are contained in the article itself. The whole appeal of QUIC is being immune to ossification, being free to change parameters of the protocol without having to beg Linux maintainers to agree.
toast0 · 10h ago
IMHO, you likely want the server side to be in the kernel, so you can get to performance similar to in-kernel TCP, and ossification is less of a big deal, because it's "easy" to modify the kernel on the server side.

OTOH, you want to be in user land on the client, because modifying the kernel on clients is hard. If you were Google, maybe you could work towards a model where Android clients could get their in-kernel protocol handling to be something that could be updated regularly, but that doesn't seem to be something Google is willing or able to do; Apple and Microsoft can get priority kernel updates out to most of their users quickly; Apple also can influence networks to support things they want their clients to use (IPv6, MP-TCP). </rant>

If you were happy with congestion control on both sides of TCP, and were willing to open multiple TCP connections like http/1, instead of multiplexing requests on a single connection like http/2, (and maybe transfer a non-pessimistic bandwidth estimate between TCP connections to the same peer), QUIC still gives you control over retransmission that TCP doesn't, but I don't think that would be compelling enough by itself.

Yes, there's still ossification in middle boxes doing TCP optimization. My information may be old, but I was under the impression that nobody does that in IPv6, so the push for v6 is both a way to avoid NAT and especially CGNAT, but also a way to avoid optimizer boxes as a benefit for both network providers (less expense) and services (less frustration).

ComputerGuru · 8h ago
One thing is that congestion control choice is sort of cursed, in that it assumes your box/side is the one being switched while the majority of the rest of the internet continues with legacy limitations (aside from DCTCP, which is designed for intra-datacenter usage). That is an essential part of the question, given that resultant/emergent network behavior changes drastically depending on whether or not all sides are using the same algorithm. (Cubic is technically another sort-of exception, at least since it became the default Linux CC algorithm, but even then you're still dealing with all sorts of middleware with legacy and/or pathological stateful behavior you can't control.)
jeffbee · 10h ago
This is a perspective, but just one of many. The overwhelming majority of IP flows are within data centers, not over planet-scale networks between unrelated parties.
toast0 · 10h ago
I've never been convinced by an explanation of how QUIC applies for flows in the data center.

Ossification doesn't apply (or it shouldn't, IMHO, the point of Open Source software is that you can change it to fit your needs... if you don't like what upstream is doing, you should be running a local fork that does what you want... yeah, it's nicer if it's upstreamed, but try running a local fork of Windows or MacOS); you can make congestion control work for you when you control both sides; enterprise switches and routers aren't messing with tcp flows. If you're pushing enough traffic that this is an issue, the cost of QUIC seems way too high to justify, even if it helps with some issues.

jeffbee · 8h ago
I don't see why this exception to the end-to-end principle should exist. At the scale of single hosts today, with hundreds of CPUs and hundreds of tenants in a single system sharing a kernel, the kernel itself becomes an unwanted middlebox.
jeroenhd · 9h ago
Unless you're using QUIC as some kind of datacenter-to-datacenter protocol (basically as SCTP on steroids with TLS), I don't think QUIC in the datacenter makes much sense at all.

As very few server administrators bother turning on features like MPTCP, QUIC has an advantage on mobile phones with moderate to bad reception. That's not a huge issue for me most of the time, but billions of people are using mobile phones as their only access to the internet, especially in developing countries that are practically skipping widespread copper and fiber infrastructure and moving directly to 5G instead. Any service those people are using should probably consider implementing QUIC, and if they use it, they'd benefit from an in-kernel server.

All the data center operators can stick to (MP)TCP, the telco people can stick to SCTP, but the consumer facing side of the internet would do well to keep QUIC as an option.

mschuster91 · 5h ago
> That's not a huge issue for me most of the time, but billions of people are using mobile phones as their only access to the internet, especially in developing countries that are practically skipping widespread copper and fiber infrastructure and moving directly to 5G instead.

For what it's worth: Romania, one of the piss poorest countries of Europe, has a perfectly fine mobile phone network, and even outback small villages have XGPON fiber rollouts everywhere. Germany? As soon as you cross into the country from Austria, your phone signal instantly drops, barely any decent coverage outside of the cities. And forget about PON, much less GPON or even XGPON.

Germany should be considered a developing country when it comes to expectations around telecommunication.

corbet · 10h ago
Ossification does not come about from the decisions of "Linux maintainers". You need to look at the people who design, sell, and deploy middleboxes for that.
jeffbee · 10h ago
I disagree. There is plenty of ossification coming from inside the house. Just some examples off the top of my head are the stuck-in-1974 minimum RTO and ack delay time parameters, and the unwillingness to land microsecond timestamps.
otterley · 10h ago
Not a networking expert, but does TCP in IPv6 suffer the same maladies?
pumplekin · 10h ago
Yes.

Layer4 TCP is pretty much just slapped on top of Layer3 IPv4 or IPv6 in exactly the same way for both of them.

Outside of some little nitpicky things like details on how TCP MSS clamping works, it is basically the same.

ComputerGuru · 9h ago
…which is basically how it’s supposed to work (or how we teach that it’s supposed to work). (Not that you said anything to the contrary!)
0xbadcafebee · 7h ago
The "middleboxes" excuse for not improving (or replacing) protocols in the past was horseshit. If a big incumbent player in the networking world releases a new feature that everyone wants (but nobody else has), everyone else (including 'middlebox' vendors) will bend over backwards to support it, because if you don't your competitors will and then you lose business. It was never a technical or logistical issue, it was an economic and supply-demand issue.

To prove it:

1. Add a new OSI Layer 4 protocol called "QUIC" and give it a new protocol number, and just for fun, change the UDP frame header semantics so it can't be confused for UDP.

2. Then release kernel updates to support the new protocol.

Nobody's going to use it, right? Because internet routers, home wireless routers, servers, shared libraries, etc would all need their TCP/IP stacks updated to support the new protocol. If we can't ship it over a weekend, it takes too long!

But wait. What if ChatGPT/Claude/Gemini/etc only supported communication over that protocol? You know what would happen: every vendor in the world would backport firmware patches overnight, bending over backwards to support it. Because they can smell the money.

GuB-42 · 5h ago
The protocol itself is resistant to ossification, no matter how it is implemented.

It is mostly achieved by using encryption, and it is a reason why it is such an important and mandatory part of the protocol. The idea is to expose as little as possible of the protocol between the endpoints, the rest is encrypted, so that "middleboxes" can't look at the packet and do funny things based on their own interpretation of the protocol stack.

Endpoints can still do whatever they want, and ossification can still happen there, but it helps against ossification at the infrastructure level, which is the worst kind. Updating the Linux kernel on your server is easier than changing the proprietary hardware that makes up the network backbone.

The use of UDP instead of doing straight QUIC-over-IP is also an anti-ossification technique, as your app can just use UDP and a userland library regardless of the QUIC kernel implementation. In theory you could do that with raw sockets too, but that's much more problematic: because you don't have ports, you need the entire interface to yourself, and often root access.

Karrot_Kream · 7h ago
Do you think putting QUIC in the kernel will significantly ossify QUIC? If so, how do you want to deal with the performance penalty for the actual syscalls needed? Your concern makes sense to me as the Linux kernel moves slower than userspace software and middleboxes sometimes never update their kernels.
wosined · 9h ago
The general web is slowed down by bloated websites. But I guess this can make game latency lower.
fmbb · 9h ago
https://en.m.wikipedia.org/wiki/Jevons_paradox

The Jevons Paradox is applicable in a lot of contexts.

More efficient use of compute and communications resources will lead to higher demand.

In games this is fine. We want more, prettier, smoother, pixels.

In scientific computing this is fine. We need to know those simulation results.

On the web this is not great. We don’t want more ads, tracking, JavaScript.

01HNNWZ0MV43FF · 9h ago
No, the last 20 years of browser improvements have made my static site incredibly fast!

I'm benefiting from WebP, JS JITs, Flexbox, zstd, Wasm, QUIC, etc, etc

valorzard · 10h ago
Would this (eventually) include the unreliable datagram extension?
wosined · 9h ago
Don't know if it could get faster than UDP if it is on top of it.
valorzard · 9h ago
The use case for this would be running a multiplayer game server over QUIC
01HNNWZ0MV43FF · 9h ago
Other use cases include video / audio streaming, VPNs over QUIC, and QUIC-over-QUIC (you never know)
snvzz · 3h ago
Brace for unauthenticated remote execution exploits on network stack.
darksaints · 10h ago
For the love of god, can we please move to microkernel-based operating systems already? We're adding a million lines of code to the Linux kernel every year. That's so much attack surface. We're setting ourselves up for a Kessler syndrome of sorts with every system that we add to the kernel.
mdavid626 · 10h ago
Most of that code is not loaded into the kernel; it's only loaded when needed.
darksaints · 10h ago
True, but the last time I checked (several years ago), the size of the portion of code that is not drivers or kernel modules was still 7 million lines of code, and the average system still has to load a few million more via kernel modules and drivers. That is still a phenomenally large attack surface.

The SeL4 kernel is 10k lines of code. OKL4 is 13k. QNX is ~30k.

arp242 · 9h ago
Can I run Firefox or PostgreSQL with reasonable performance on SeL4, OKL4, or QNX?
doubled112 · 8h ago
Reasonable performance includes GPU acceleration for both rendering and decoding media, right?
0x457 · 6h ago
yes
regularfry · 9h ago
You've still got a combinatorial complexity problem though, because you never know what a specific user is going to load.
beeflet · 8h ago
Often you do know what a specific user is going to load
wosined · 9h ago
I might be wrong, but microkernels also need drivers, so the attack surface would be the same, or not?
kaoD · 9h ago
You're not wrong, but monolithic kernel drivers run at a privilege level that's even higher than root (ring 0), while microkernels run them in userspace, so they're only as dangerous as running a normal program.
pessimizer · 8h ago
"Just think of the power of ring-0, muhahaha! Think of the speed and simplicity of ring-0-only and identity-mapping. It can change tasks in half a microsecond because it doesn't mess with page tables or privilege levels. Inter-process communication is effortless because every task can access every other task's memory.

"It's fun having access to everything."

— Terry A. Davis

beeflet · 6h ago
> Inter-process communication is effortless because every task can access every other task's memory.

I think this would get messy quick in an OS designed by more than one person

01HNNWZ0MV43FF · 9h ago
Redox is a microkernel written in Rust
gethly · 6h ago
I've been hearing about QUIC for ages, yet it is still an obscure tech and will likely end up like IPv6.
rstuart4133 · 6h ago
> yet it is still an obscure tech and will likely end up like IPv6.

Probably. According to Google, IPv6 has a measly 46% of internet traffic now [0], and growing at about 5% per year. QUIC is 40% of Chrome traffic, and is growing at 5% every two years [1]. So yeah, their fates do look similar, which is to say both are headed for world domination in a couple of decades.

[0] https://dnsmadeeasy.com/resources/the-state-of-ipv6-adoption...

[1] https://www.cellstream.com/2025/02/14/an-update-on-quic-adop...

gethly · 5h ago
When you remove IoT, those numbers will look very different.
rstuart4133 · 4h ago
> When you remove IoT, those numbers will look very differently.

To paraphrase: "when you remove all the new stuff being added, you will see all the old stuff is still using the old protocols". Sounds reasonable, but I don't believe it. These IoT devices usually have the simplest stack imaginable, many of them implemented straight from the main loop. IPv6 isn't so bad, but QUIC, HTTP/2, and HTTP/3 are a long, long way from simple.

A major driver of IPv6 is phones, which I would not classify as IoT. Where I live they all receive an IPv6 address now. When I hotspot, they hand out a routable IPv6 address to the laptop / desktop. Modern Windows / Linux installations will use the IPv6 address in preference to the double-NAT'ed IPv4 address they also hand out. The funny thing is you don't even notice, or at least I didn't. I only twigged when I happened to be looking at a packet capture from my tethered laptop and saw all this IPv6 traffic, and wondered what the heck was going on. It could have been happening for years without me noticing. Maybe it was.

It wasn't a surprise I didn't notice. I set up WiFi access for a conference of hundreds of computing nerds and professionals many years ago. Partly for kicks, partly as a learning exercise, I made it IPv6-only. As a backup plan I had an IPv4 network (behind a NAT sadly, which the IPv6 wasn't) ready to go on a different SSID. To my utter disbelief there were no complaints, literally not a single one. Again, no one noticed.

adgjlsfhk1 · 3h ago
QUIC is really simple for most IoT: just import the library.
Jtsummers · 6h ago
QUIC is already out and in active use. Every major web browser supports it, and it's not like IPv6. There's no fundamental infrastructure change needed to support it since it's built on top of UDP. The end points obviously have to support it, but that's the same as any other protocol built on UDP or TCP (like HTTP, SNMP, etc.).
tralarpa · 6h ago
Your browser is using it when you watch a video on youtube (HTTP/3).
gfody · 6h ago
isn't it just http3 now?