I'm not certain... On one hand I agree that some characters are problematic (or invalid) - like unpaired surrogates. But the worst case scenario is imo when people designing data structures and protocols start to feel the need to disallow arbitrary classes of characters, even properly escaped.
In the example, username validation is a job for another layer. For example, I want to make sure a username is shorter than 60 characters, has no emojis or zalgo text, and yes, no null bytes, and return a proper error from the API. I don't want my JSON parsing to fail at a completely different layer, before that validation even runs.
And for usernames some classes are obviously bad - like explained. But what if I send text files that actually use those weird tabs? I expect things that work in my language's utf8 "string" type to be encodable. Even more importantly, I see plenty of use cases for the null byte, and it is in fact often seen in JSON in the wild.
On the other hand, if we have to use a restricted set of "normal" Unicode characters, having a standard feels useful - better than everyone creating their own mini standard. So I think I like the idea, just don't buy the argumentation or examples in the blog post.
csande17 · 1h ago
Yeah, I feel like the only really defensible choices you can make for string representation in a low-level wire protocol in 2025 are:
- "Potentially ill-formed UTF-8", aka "an array of bytes", aka "the Go string type"
- Any of the above, plus "no U+0000", if you have to interface with a language/library that was designed before people knew what buffer overflow exploits were
I thought WTF-8 was just "UTF-8, but without the restriction against encoding unpaired surrogates"? Windows and Java and JavaScript all use "possibly ill-formed UTF-16" as their string type, not WTF-8.
Can you elaborate more on this? I understood the Python string to be UTF-32, with optimizations where possible to reduce memory use.
dcrazy · 29m ago
Why didn’t you include “Unicode Scalars”, aka “well-formed UTF-8”, aka “the Swift string type?”
Either way, I think the bitter lesson is a parser really can’t rely on the well-formedness of a Unicode string over the wire. Practically speaking, all wire formats are potentially ill-formed until parsed into a non-wire format (or rejected by same parser).
csande17 · 14m ago
IMO if you care about surrogate code points being invalid, you're in "designing the system around UTF-16" territory conceptually -- even if you then send the bytes over the wire as UTF-8, or some more exotic/compressed format. Same as how "potentially ill-formed UTF-16" and WTF-8 have the same underlying model for what a string is.
stuartjohnson12 · 1h ago
> "WTF-8", aka "the JavaScript string type"
This sequence of characters is a work of art.
CharlesW · 1h ago
> I like the idea, just don't buy the argumentation or examples in the blog post.
Which ones, and why? Tim and Paul collectively have around 100,000X the experience with this than most people do, so it'd be interesting to read substantive criticism.
It seems like you think this standard is JSON-specific?
doug_durham · 22m ago
I thought the question was pretty substantive. What layer in the code stack should make the decision about which characters to allow? I had exactly the same question. If a library declares that it will filter out certain subsets, that lets me choose a different library if needed. I would hate to have this RFC blindly implemented everywhere just because it's a standard.
TheRealPomax · 8m ago
I think you missed the part where the RFC is about which Unicode is bad for protocols and data formats, and so which Unicode you should avoid when designing those from now on, with an RFC to consult to know which ones those are. It has nothing to do with "what if I have a file with X" or "what if I want Y in usernames", it's about "what should I do if I want a normal, well behaved, unicode-text-based protocol or data format".
It's not about JSON, or the web, those are just example vehicles for the discussion. The RFC is completely agnostic about what thing the protocols or data formats are intended for, as long as they're text based, and specifically unicode text based.
So it sounds like you misread the blog post, and what you should do now is read the RFC. It's short. You can cruise through https://www.rfc-editor.org/rfc/rfc9839.html in a few minutes and see it's not actually about what you're focussing on.
JimDabell · 2h ago
> PRECISion · You may find yourself wondering why the IETF waited until 2025 to provide help with Bad Unicode. It didn’t; here’s RFC 8264: PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols; the first PRECIS predecessor was published in 2002. 8264 is 43 pages long, containing a very thorough discussion of many more potential Bad Unicode issues than 9839 does.
I’d also suggest people check out the accompanying RFCs 8265 and 8266:
PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols:
— https://www.rfc-editor.org/rfc/rfc8264
Preparation, Enforcement, and Comparison of Internationalized Strings: Representing Usernames and Passwords
— https://www.rfc-editor.org/rfc/rfc8265
Preparation, Enforcement, and Comparison of Internationalized Strings Representing Nicknames:
— https://www.rfc-editor.org/rfc/rfc8266
Generally speaking, you don’t want usernames being displayed that can change the text direction, or passwords that have different byte representations depending on the device that was used to type it in. These RFCs have specific profiles to avoid that.
I think for these kinds of purposes, failing closed is more secure than failing open. I’d rather disallow whatever the latest emoji to hit the streets is from usernames than potentially allow it to screw up every page that displays usernames.
Waterluvian · 2h ago
I’m frustrated by things like Unicode where it’s “good” except… you need to know to exclude some of them. Unicode feels like a wild jungle of complexity. An understandable consequence of trying to formalize so many ways to write language. But it really sucks to have to reason about some characters being special compared to others.
The only sanity I’ve found is to treat Unicode strings as if they’re some proprietary data unit format. You can accept them, store them, render them, and compare them with each other for (data, not semantic) equality. But you just don’t ever try to reason about their content. Heck I’m not even comfortable trying to concatenate them or anything like that.
csande17 · 2h ago
Unicode really is an impossibly bottomless well of trivia and bad decisions. As another example, the article's RFC warns against allowing legacy ASCII control characters on the grounds that they can be confusing to display to humans, but says nothing about the Explicit Directional Overrides characters that https://www.unicode.org/reports/tr9/#Explicit_Directional_Ov... suggests should "be avoided wherever possible, because of security concerns".
weinzierl · 35m ago
I wouldn’t be so harsh. I think the Unicode Consortium not only started with good intentions but also did excellent work for the first decade or so.
I just think they got distracted when the problems got harder, and instead of tackling them head-on, they now waste a lot of their resources on busywork. Sure, it’s more fun standardizing sparkling disco balls than dealing with real-world pain points. That OpenType is a good and powerful standard which masks some of Unicode’s shortcomings doesn’t really help.
It’s not too late, and I hope they will find their way back to their original mission and be braver in solving long-standing issues.
estebank · 1h ago
The security concerns are those of "Trojan source", where the displayed text doesn't correspond to the bytes on the wire.[1]
I don't think a wire protocol should necessarily restrict them, for the sake of compatibility with the existing text corpus out there, but it's a fair observation.
1: https://trojansource.codes/
The enforcement is an app-level issue, depending on the semantics of the field. I agree it doesn't belong in the low-level transport protocol.
The rules for "username", "display name", "biography", "email address", "email body" and "contents of uploaded file with name foo.txt" are not all going to be the same.
arp242 · 49m ago
I always thought you kind of need those directional control characters to correctly render bidi text? e.g. if you write something in Hebrew but include a Latin word/name (or the reverse).
Of course, this is an “annex”, not part of the core Unicode spec. So in situations where you can’t rely on the presentation layer’s (correct) implementation of the Bidi algorithm, you can fall back to directional override/embedding characters.
eviks · 1h ago
Indeed, though a lot of that complexity like surrogates and control codes aren't due to attempts to write language, that's just awful designs preserved for posterity
Etheryte · 2h ago
As a simple example off the top of my head, if the first string ends in an orphaned emoji modifier and the second one starts with a modifiable emoji, you're already going to have trouble. It's only downhill from there with more exotic stuff.
kps · 1h ago
Unicode combining/modifying/joining characters should have been prefix rather than suffix/infix, in blocks by arity.
layer8 · 1h ago
One benefit of the suffix convention is that strings sort more usefully that way by default, without requiring special handling for those characters.
Unicode 1.0 also explains: “The convention used by the Unicode standard is consistent with the logical order of other non-spacing marks in Semitic and Indic scripts, the great majority of which follow the base characters with respect to which they are positioned. To avoid the complication of defining and implementing non-spacing marks on both sides of base characters, the Unicode standard specifies that all non-spacing marks must follow their base characters. This convention conforms to the way modern font technology handles the rendering of non-spacing graphical forms, so that mapping from character store to font rendering is simplified.”
kps · 1h ago
Sorting is a good point.
On the other hand, prefix combining characters would have vastly simplified keyboard handling, since that's exactly what typewriter dead keys are.
layer8 · 1h ago
Keyboard input handling at that level generally isn’t character-based, and instead requires looking at scancodes and modifier keys, and sometimes also distinguishing between keyup and keydown events.
You generally also don’t want to produce different Unicode sequences depending on whether you have an “é” key you can press or have to use a dead-key “’”.
kps · 50m ago
Depends on the system. X11/Wayland do it at a higher level where you have `<dead_acute> <e> : eacute` and keysyms are effectively a superset of Unicode with prefix combiners. (This can lead to weirdness since the choice of Compose rules is orthogonal to the choice of keyboard layout.)
layer8 · 2m ago
I guess your conception is that one could then define
<dead_acute> : <combining_acute_accent>
instead and use it for arbitrary letters. However, that would fail in locales using a non-Unicode encoding such as iso-8859-1 that only contain the combined character. Unless you have the input system post-process the mapped input again to normalize it to e.g. NFC before passing it on to the application, in which case the combination has to be reparsed anyway. So I don’t see what would be gained with regard to ease of parsing.
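(That post-processing step is ordinary NFC normalization; a minimal Python illustration, purely for reference:)

    import unicodedata

    decomposed = "e" + "\u0301"                      # 'e' + COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)
    print(composed, composed == "\u00e9")            # é True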
dcrazy · 23m ago
Not all input methods use dead keys to emit combining characters.
yencabulator · 6m ago
I am torn on one decision: Whether to control inputs, or to wrap untrusted input in a datatype that displays it safely (web+log+debug).
ninkendo · 32m ago
It seems like most of these are handled by just rejecting invalid UTF-8 byte sequences (ideally, erroring out altogether) when interpreting a string as UTF-8. I mean, unpaired surrogates, or any surrogates for that matter, are already illegal as UTF-8 byte sequences. Any competent language that uses UTF-8 for strings should already be returning errors when given such sequences.
The list of code points which are problematic (non-printing, etc) are IMO much more useful and nontrivial. But it’d be useful to treat those as a separate concept from plain-old illegal UTF-8 byte sequences.
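On the first point, a minimal Python sketch (any language with a strict UTF-8 codec behaves the same way): the surrogate range is rejected in both directions.

    # Strict UTF-8 codecs reject surrogates both when encoding and when decoding.
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError as e:
        print("encode rejected:", e)          # surrogates not allowed

    try:
        b"\xed\xa0\x80".decode("utf-8")       # the byte sequence that would encode U+D800
    except UnicodeDecodeError as e:
        print("decode rejected:", e)          # invalid continuation byte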
doug_durham · 20m ago
That seems reasonable. It should be up to the application implementer to make that choice, not a lower-level, more general-purpose library. I haven't run into any "JSON parser for usernames only" code.
arp242 · 53m ago
Excluding all of "legacy controls" not just as literals but also escaped strings (e.g. "\u0027") seems too much. C1 is essentially unused AFAIK and that's okay, but a number of C0 characters do see real-world use (escape, EOF, NUL). IMHO there are valid and reasonable use cases to use some of them.
ks2048 · 1h ago
It's worth noting that Unicode already defines a "General Category" for all code points that categorizes some of these types of "weird" characters.
https://en.wikipedia.org/wiki/Unicode_character_property#Gen...
e.g. in Python:
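    # a minimal reconstruction of the example this comment points at,
    # using the standard library's unicodedata module
    import unicodedata

    print(unicodedata.category("\u0007"))   # "Cc" - a C0 control
    print(unicodedata.category("\ud800"))   # "Cs" - a surrogate

Shows "Cc" (control) and "Cs" (surrogate).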
I think there should be a restriction in the standard on how many Unicode scalar values a graphical unit can have.
Last time I checked (a couple of years ago admittedly) there was no such restriction in the standard. There was however a recommendation to restrict a graphical unit to 128 bytes for "streaming applications".
Bringing this or at least a limit on the scalar units into the standard would make implementation and processing so much easier without restricting sensible applications.
What bad/dangerous characters does this code catch that `unicode.IsPrint` is not catching?
Or the other way around: what good/useful characters does `unicode.IsPrint` remove that this code keeps?
mort96 · 2m ago
I don't know all the details of the `unicode.IsPrint` function, but one major issue is: it's Go-specific. If you're defining a protocol, you probably don't want the spec to include text such as, "the username field must only contain Unicode code points which are considered printable by the Go programming language's 'unicode.IsPrint' function". You would rather want to write, "the username field must not contain Unicode code points which are considered problematic by RFC 9839".
o11c · 2h ago
I have had real-world programs broken by blind assumption of "does not deliberately contain controls" (form feed is particularly common for things intended to be paginated, escape is common for things designed for a terminal, etc.) and even "is fully UTF-8" (there are lots of old data files and logs that are never going away).
If you aren't doing something useful with the text, you're best off passing a byte-sequence through unchanged. Unfortunately, Microsoft Windows exists, so sometimes you have to pass `char16_t` sequences through instead.
The worst part about UTF-16 is that invalid UTF-16 is fundamentally different than invalid UTF-8. When translating between them (really: when transforming external data into an internal form for processing), the former can use WTF-8 whereas the latter can use Python-style surrogateescape, but you can't mix these.
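To make the incompatibility concrete, a small Python sketch: surrogateescape smuggles invalid bytes through as lone low surrogates, which is exactly the code-point range a WTF-8-style encoder would instead emit as literal three-byte sequences, so the same string means different things under the two conventions.

    # PEP 383: the invalid byte 0xFF becomes the lone surrogate U+DCFF...
    s = b"abc\xff".decode("utf-8", "surrogateescape")
    print(repr(s))                                   # 'abc\udcff'

    # ...and round-trips back to the original byte:
    print(s.encode("utf-8", "surrogateescape"))      # b'abc\xff'

    # A WTF-8-style encoder (approximated here by "surrogatepass") would instead
    # emit U+DCFF as its own three-byte sequence:
    print(s.encode("utf-8", "surrogatepass"))        # b'abc\xed\xb3\xbf'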
develatio · 1h ago
I was not able to understand why these code points are bad. The post states that they are bad, but why? Any examples? Any actual situations and PoC that might help me understand how will that break "my code"?
orangeboats · 1h ago
Sometimes it's not just "your code". Strings are often interchanged and sent to many other parties.
And some of the codepoints, such as the surrogate codepoints (which MUST come in pairs in properly encoded UTF-16), may not break your code but break poorly-written spaghetti-ridden UTF-16-based hellholes that do not expect unpaired surrogates.
Something like:
1. You send a UTF-8 string containing normal characters and an unpaired surrogate: "Hello \uDEADworld" to FooApp.
2. FooApp converts the UTF-8 string to UTF-16 and saves it in a file. All without validation, so no crashes will actually occur; worst case scenario, the unpaired surrogate is rendered by the frontend as "�".
3. Next time, when it reads the file again, this time it is expecting normal UTF-16, and it crashes because of the unpaired surrogate.
(A more fatal failure mode of (3) is out-of-bounds memory read if the unpaired surrogate happens at the end of string)
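A rough Python sketch of that failure sequence (FooApp's lenient write in step 2 is approximated with the "surrogatepass" error handler; the names are made up):

    s = "Hello \uDEADworld"                        # step 1: unpaired surrogate smuggled in
    data = s.encode("utf-16-le", "surrogatepass")  # step 2: written without validation

    try:
        data.decode("utf-16-le")                   # step 3: strict re-read of the same bytes
    except UnicodeDecodeError as e:
        print("crash on re-read:", e)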
JimDabell · 1h ago
Suppose, when you were registering your username `develatio`, you decided to put U+202E RIGHT-TO-LEFT OVERRIDE in there as well. Now when somebody is reading this page and their browser gets to your username, it switches the text direction to render it right-to-left.
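Roughly, from the template's side (the markup is made up; the point is that nothing ever emits U+202C POP DIRECTIONAL FORMATTING to end the override):

    username = "develatio\u202e"   # U+202E appended during signup
    print(f"<li>{username} commented 1 hour ago</li>")
    # In a bidi-aware renderer, the text following the username on that line
    # is displayed right-to-left, because the override is never popped.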
develatio · 1h ago
and "that's it"? I mean, it does sound like it might introduce unexpected UI behaviour, but are there any other more serious / dangerous consequences?
yencabulator · 9m ago
One of my pet peeves is when UIs don't clearly constrain and delineate the extent of user-controlled text. Plenty of phishing attacks have relied on having attacker-controlled input seem authoritative, e.g. getting gmail to repeat back something to the victim.
JimDabell · 53m ago
Making any page that mentions you – including admin pages that might be used to disable your account – become unreadable is bad enough.
Another comment linked to this: https://trojansource.codes
Seems like libraries that serialize to JSON should have an option to filter out these bad characters.
layer8 · 1h ago
No. As the RFC notes: “Silently deleting an ill-formed part of a string is a known security risk. Responding to that risk, Section 3.2 of [UNICODE] recommends dealing with ill-formed byte sequences by signaling an error or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER).”
I would almost always go for “signaling an error”.
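In Python terms, that's the difference between the "strict" and "replace" error handlers; the behaviour the RFC warns against is "ignore":

    bad = b"abc\xed\xa0\x80def"               # a UTF-8-encoded lone surrogate in the middle

    print(bad.decode("utf-8", "replace"))     # 'abc???def' with U+FFFD - the damage stays visible
    print(bad.decode("utf-8", "ignore"))      # 'abcdef' - silent deletion, the risky option

    try:
        bad.decode("utf-8")                   # default "strict": signal the error
    except UnicodeDecodeError as e:
        print("rejected:", e)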
CharlesW · 1h ago
This RFC and the Go-language reference library are designed to be used by existing libraries that do serialization/sanitization/validation. This is hot off the press, so I'm sure Tim would appreciate it if you'd let your favorite library know it exists.
My experience writing Unicode related libraries is that people don't use features when you have to explain why and when to use them. I assume that's why Tim puts the emphasis on "working on something new".
xdennis · 1h ago
How is Unicode in any way related to JSON? JSON should just encode whatever dumb data someone wants to transport.
Unicode validation/cleanup should be done separately because it's needed in multiple places, not just JSON.
layer8 · 1h ago
The contents of JSON strings don't admit random binary data. You need to use an encoding like Base64 for that purpose.
recursive · 52m ago
JSON is text. If you're not going to use unicode in the representation of your text, you'll need some other way.
dcrazy · 18m ago
The current JSON spec mandates UTF-8, but practically speaking encoding is a higher-level concept. I suspect there are many server implementations that will respect a charset declared in the Content-Type header of a POST request containing JSON.