UTF-8 is a brilliant design

93 points by vishnuharidas on 9/12/2025, 6:30:15 PM | iamvishnu.com

Comments (40)

quectophoton · 40m ago
Having the continuation bytes always start with the bits `10` also makes it possible to seek to any random byte and trivially know whether you're at the beginning of a character or at a continuation byte, like you mentioned, so you can easily find the beginning of the next or previous character.

If the characters were instead encoded like EBML's variable-size integers[1] (but with 1 and 0 inverted to keep ASCII compatibility for the single-byte case) and you did a random seek, it wouldn't be as easy (or maybe not even possible) to know whether you landed on the beginning of a character or in one of the `xxxx xxxx` bytes.

[1]: https://www.rfc-editor.org/rfc/rfc8794#section-4.4
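
A minimal sketch of that boundary scan (Python; the function names are just for illustration):

```python
def char_start_at_or_before(buf: bytes, i: int) -> int:
    """Back up from an arbitrary byte offset to the start of the
    codepoint containing it, by skipping 10xxxxxx continuation bytes."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

def next_char_start(buf: bytes, i: int) -> int:
    """Advance to the start of the next codepoint."""
    i += 1
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i

data = "héllo, wörld".encode("utf-8")
print(char_start_at_or_before(data, 2))  # 1: byte 2 is a continuation byte of 'é'
print(next_char_start(data, 1))          # 3: 'é' occupies bytes 1-2
```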

Animats · 12m ago
Right. That's one of the great features of UTF-8. You can move forwards and backwards through a UTF-8 string without having to start from the beginning.

Python has had troubles in this area. Because Python strings are indexable by character, CPython used wide characters. At one point you could pick 2-byte or 4-byte characters when building CPython. Then that switch was made automatic at run time. But it's still wide characters, not UTF-8. One emoji and your string size quadruples.

I would have been tempted to use UTF-8 internally. Indices into a string would be an opaque index type which behaved like an integer to the extent that you could add or subtract small integers, and that would move you through the string. Only if you actually converted the opaque type to a real integer, or tried to subscript the string with a plain integer, would a character index into the string have to be computed. That's an unusual case. All the standard operations, including regular expressions, can work on a UTF-8 representation with opaque index objects.
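
A rough sketch of what that opaque index could look like (hypothetical Python; `StrIndex` and its methods are my own invention, not anything CPython provides):

```python
class StrIndex:
    """Opaque index into a UTF-8 byte buffer (illustrative sketch)."""

    def __init__(self, buf: bytes, pos: int = 0):
        self._buf = buf
        self._pos = pos  # byte offset, kept on a codepoint boundary

    def __add__(self, n: int) -> "StrIndex":
        pos, step = self._pos, (1 if n >= 0 else -1)
        for _ in range(abs(n)):
            pos += step
            # skip continuation bytes (10xxxxxx) to stay on codepoint boundaries
            while 0 <= pos < len(self._buf) and (self._buf[pos] & 0xC0) == 0x80:
                pos += step
        return StrIndex(self._buf, pos)

    def __sub__(self, n: int) -> "StrIndex":
        return self + (-n)

    def char_index(self) -> int:
        """The expensive escape hatch: count codepoints up to this offset."""
        return sum((b & 0xC0) != 0x80 for b in self._buf[:self._pos])

s = "naïve".encode("utf-8")
i = StrIndex(s) + 3       # cheap: walks bytes, no counting from the start
print(i.char_index())     # 3 (only computed when a real integer is demanded)
```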

deepsun · 4m ago
That's assuming the text is not corrupted or maliciously modified. There were (are) _numerous_ vulnerabilities due to parsing/escaping of invalid UTF-8 sequences.

Quick googling (not all of them are on-topic tho):

https://www.rapid7.com/blog/post/2025/02/13/cve-2025-1094-po...

https://www.cve.org/CVERecord/SearchResults?query=utf-8

thesz · 1m ago
> so you can easily find the beginning of the next or previous character.

It is not true [1]. While it is not a UTF-8 problem per se, it is a problem of how UTF-8 is being used.

[1] https://paulbutler.org/2025/smuggling-arbitrary-data-through...

PaulHoule · 15m ago
It's not uncommon, when you want variable-length encodings, to write the number of extension bytes used in unary encoding

https://en.wikipedia.org/wiki/Unary_numeral_system

and also use whatever bits are left over after encoding the length (which could be in 8-bit blocks, so you would write 1111/1111 10xx/xxxx to code 8 extension bytes) to encode the number. This is covered in this CS classic

https://archive.org/details/managinggigabyte0000witt

together with other methods that let you compress a text plus a full-text index for the text into less room than the text alone, without even needing a stopword list. As you say, UTF-8 does something similar in spirit, but ASCII-compatible and capable of fast resynchronization if data is corrupted or truncated.
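
A toy sketch of that unary-length idea (my own Python illustration, not the book's exact scheme; here the extension bytes carry full 8-bit blocks):

```python
def encode_uint(n: int) -> bytes:
    """Toy varint: the lead byte holds the extension-byte count in unary
    (k ones then a zero); its leftover bits plus 8 bits per extension
    byte hold the value. Not UTF-8, just the same idea."""
    for k in range(8):                      # k = number of extension bytes
        payload_bits = (7 - k) + 8 * k      # free lead-byte bits + full bytes
        if n < (1 << payload_bits):
            lead = ((0xFF00 >> k) & 0xFF) | (n >> (8 * k))   # unary prefix + high bits
            ext = [(n >> (8 * i)) & 0xFF for i in range(k - 1, -1, -1)]
            return bytes([lead] + ext)
    raise ValueError("too large for this toy scheme")

print(encode_uint(0x41).hex())    # '41'   (single byte, ASCII-compatible)
print(encode_uint(0x1234).hex())  # '9234' (one extension byte)
```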

1oooqooq · 19m ago
so you replace one costly sweep with another costly sweep. i wouldn't call that an advantage in any way over jumping n bytes.

what you describe is the bare minimum so you even know what you are searching for, while you scan pretty much everything every time.

hk__2 · 6m ago
What do you mean? What would you suggest instead? Fixed-length encoding? It would take a looot of space given all the character variations you can have.
twoodfin · 9m ago
UTF-8 is indeed a genius design. But of course it’s crucially dependent on the choice of ASCII to use only 7 bits, which even in 1963 was kind of an odd choice.

Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.

mort96 · 3m ago
I don't know if this is the reason or if the causality goes the other way, but: it's worth noting that we didn't always have 8 general purpose bits. 7 bits + 1 parity bit or flag bit or something else was really common (enough so that e-mail to this day still uses quoted-printable [1] to encode octets as 7-bit bytes). A communication channel being able to transmit all 8 bits in a byte unchanged is called being 8-bit clean [2], and wasn't always a given.

In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...

[1] https://en.wikipedia.org/wiki/Quoted-printable

[2] https://en.wikipedia.org/wiki/8-bit_clean
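
For example, Python's standard quopri module shows what that 7-bit-safe escaping looks like in practice:

```python
import quopri

# The UTF-8 bytes for "héllo" include octets with the high bit set (0xC3 0xA9);
# quoted-printable escapes them so the data survives a 7-bit channel.
raw = "héllo".encode("utf-8")
print(quopri.encodestring(raw))                             # b'h=C3=A9llo'
print(quopri.decodestring(b"h=C3=A9llo").decode("utf-8"))   # héllo
```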

colejohnson66 · 5m ago
The idea was that the free bit would be repurposed, likely for parity.
hyperman1 · 30m ago
One thing I always wonder: it is possible to encode a Unicode codepoint with too many bytes. UTF-8 forbids these; only the shortest encoding is valid. E.g. 00000001 is the same as 11000000 10000001.

So why not make the alternatives impossible by adding the start of the last valid option? Then 11000000 10000001 would give codepoint 128+1, since values 0 to 127 are already covered by a 1-byte sequence.

The advantages are clear: No illegal codes, and a slightly shorter string for edge cases. I presume the designers thought about this, so what were the disadvantages? The required addition being an unacceptable hardware cost at the time?

nostrademons · 18m ago
I assume you mean "11000000 10000001" to preserve the property that all continuation bytes start with "10"? [Edit: looks like you edited that in]. Without that property, UTF-8 loses self-synchronization: the property that, given a truncated UTF-8 stream, you can always find the codepoint boundaries and will lose at most one codepoint's worth of data rather than having the whole stream be garbled.

In theory you could do it that way, but it comes at the cost of decoder performance. With UTF-8, you can reassemble a codepoint from a stream using only fast bitwise operations (&, |, and <<). If you declared that you had to subtract the legal codepoints represented by shorter sequences, you'd have to introduce additional arithmetic operations in encoding and decoding.
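
A small sketch of the difference for the 2-byte case (illustrative Python; the biased variant is hypothetical):

```python
def decode_2byte(b1: int, b2: int) -> int:
    """Standard UTF-8: pure shifts and masks."""
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

def decode_2byte_biased(b1: int, b2: int) -> int:
    """Hypothetical biased scheme: same bits, plus an offset of 0x80 so that
    2-byte sequences start right where 1-byte sequences end."""
    return (((b1 & 0x1F) << 6) | (b2 & 0x3F)) + 0x80

print(hex(decode_2byte(0xC2, 0xA9)))    # 0xa9: real UTF-8 for '©'
print(decode_2byte_biased(0xC0, 0x81))  # 129, i.e. codepoint 128+1 as proposed above
```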

rhet0rica · 22m ago
See quectophoton's comment—the requirement that continuation bytes are always tagged with a leading 10 is useful if a parser is jumping in at a random offset—or, more commonly, if the text stream gets fragmented. This was actually a major concern when UTF-8 was devised in the early 90s, as transmission was much less reliable than it is today.
umanwizard · 21m ago
Because then it would be impossible to tell from looking at a byte whether it is the beginning of a character or not, which is a useful property of UTF-8.
rightbyte · 21m ago
I think that would garble random access?
zamalek · 1m ago
Even for varints (you could probably drop the intermediate prefixes for that). There are many examples of using SIMD to decode UTF-8, whereas the more common protobuf scheme is known to be hostile to SIMD and the branch predictor.
happytoexplain · 48m ago
I have a love-hate relationship with backwards compatibility. I hate the mess - I love when an entity in a position of power is willing to break things in the name of advancement. But I also love the cleverness - UTF-8, UTF-16, EAN, etc. To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.
amluto · 29m ago
> To be fair, UTF-8 sacrifices almost nothing to achieve backwards compat though.

It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF.

I hope we don’t regret this limitation some day. I’m not aware of any other material reason to disallow larger UTF-8 code units.
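
For reference, the surrogate arithmetic behind that ceiling (standard UTF-16, sketched in Python):

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                         # 20 bits remain
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

# 10 bits per surrogate -> 2^20 code points above the BMP, so the largest
# representable code point is:
print(hex(0x10000 + (1 << 20) - 1))              # 0x10ffff
print([hex(x) for x in to_surrogates(0x1F600)])  # ['0xd83d', '0xde00'] (U+1F600)
```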

throw0101d · 16m ago
> It sacrifices the ability to encode more than 21 bits, which I believe was done for compatibility with UTF-16: UTF-16’s awful “surrogate” mechanism can only express code points up to U+10FFFF.

Yes, it is 'truncated' to the "UTF-16 accessible range":

* https://datatracker.ietf.org/doc/html/rfc3629#section-3

* https://en.wikipedia.org/wiki/UTF-8#History

Thompson's original design could handle up to six octets for each letter/symbol, with 31 bits of space:

* https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
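
The payload per sequence length in that original scheme works out like this (a quick sketch of the arithmetic):

```python
# A 1-byte sequence spends one marker bit (0); an n-byte sequence (n >= 2)
# spends n+1 marker bits in the lead byte (n ones then a zero), and each
# of its n-1 continuation bytes contributes 6 payload bits.
for n in range(1, 7):
    payload = 7 if n == 1 else (7 - n) + 6 * (n - 1)
    print(f"{n} byte(s): {payload} payload bits")
# 1 -> 7, 2 -> 11, 3 -> 16, 4 -> 21, 5 -> 26, 6 -> 31
```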

1oooqooq · 17m ago
the limitation tomorrow will be today's implementations, sadly.
mort96 · 37m ago
Yeah, I honestly don't know what I would change. Maybe replace some of the control characters with more common characters to save a tiny bit of space, if we were to go completely wild and break Unicode backward compatibility too. As a generic multi-byte character encoding format, it seems completely optimal even in isolation.
sheerun · 2m ago
I'll mention IPv6 as a bad design that could potentially have been a UTF-8-like success story.
3pt14159 · 47m ago
I remember a time before UTF-8's ubiquity. It was such a headache moving to i18n. I love UTF-8.
linguae · 39m ago
I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).

UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!

glxxyz · 46m ago
I worked on an email client. Many many character set headaches.
postalrat · 1m ago
Looks similar to MIDI.
nostrademons · 13m ago
For more on UTF-8's design, see Russ Cox's one-pager on it:

https://research.swtch.com/utf8

And Rob Pike's description of the history of how it was designed:

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

rmccue · 27m ago
Love the UTF-8 playground that's linked: https://utf8-playground.netlify.app/

Would be great if it were possible to enter codepoints directly; you can do it via the URL (`/F8FF`, e.g.), but not in the UI. (Edit: the future is now. https://github.com/vishnuharidas/utf8-playground/pull/6)

twbarr · 39m ago
It should be noted that the final design for UTF-8 was sketched out on a placemat by Rob Pike and Ken Thompson.
alberth · 26m ago
I've re-read Joel's article on Unicode so many times. It's also very helpful.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

librasteve · 15m ago
some insightful unicode regex examples...

https://dev.to/bbkr/utf-8-internal-design-5c8b

bruce511 · 44m ago
While the backward compatibility of UTF-8 is nice, and makes adoption much easier, the backward compatibility does not come at any cost to the elegance of the encoding.

In other words, yes, it's backward compatible, but UTF-8 is also compact and elegant even without that.

nextaccountic · 39m ago
UTF-8 also enables this mindblowing design for small string optimization - if the string has 24 bytes or less it is stored inline, otherwise it is stored on the heap (with a pointer, a length, and a capacity - also 24 bytes)

https://github.com/ParkMyCar/compact_str

How cool is that

(Discussed here https://news.ycombinator.com/item?id=41339224)

cyberax · 38m ago
UTF-8 is simply genius. It entirely obviated the need for clunky 2-byte encodings (and all the associated nonsense about byte order marks).

The only problem with UTF-8 is that Windows and Java were developed without knowledge about UTF-8 and ended up with 16-bit characters.

Oh yes, and Python 3 should have known better when it went through the string-bytes split.

wrs · 29m ago
UTF-16 made lots of sense at the time because Unicode thought "65,536 characters will be enough for anybody" and it retains the 1:1 relationship between string elements and characters that everyone had assumed for decades. I.e., you can treat a string as an array of characters and just index into it with an O(1) operation.

As Unicode (quickly) evolved, it turned out that not only are there WAY more than 65,000 characters, there's not even a 1:1 relationship between code points and characters, or even a single defined transformation between glyphs and code points, or even a simple relationship between glyphs and what's on the screen. So even UTF-32 isn't enough to let you act like it's 1980 and str[3] is the 4th "character" of a string.

So now we have very complex string APIs that reflect the actual complexity of how human language works...though lots of people (mostly English-speaking) still act like str[3] is the 4th "character" of a string.

UTF-8 was designed with the knowledge that there's no point in pretending that string indexing will work. Windows, MacOS, Java, JavaScript, etc. just missed the boat by a few years and went the wrong way.
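
A quick Python illustration of that mismatch (the "character" here is a single emoji built from several code points):

```python
# family emoji: man + ZWJ + woman + ZWJ + girl (5 code points, one visible glyph)
s = "hi \U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(s))                    # 8 code points, not the 4 "characters" you see
print(s[3])                      # only the first code point of the emoji (the man)
print(len(s.encode("utf-8")))    # 21 bytes in UTF-8
```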

rowls66 · 3m ago
I think more effort should have been made to live with 65,536 characters. My understanding is that codepoints beyond the first 65,536 are only used for languages that are no longer in use, and emojis. I think that adding emojis to Unicode is going to be seen as a big mistake. We already have enough network bandwidth to just send raster graphics for images in most cases. Cluttering the Unicode codespace with emojis is pointless.
wongarsu · 18m ago
Yeah, Java and Windows NT 3.1 had really bad timing. Both managed to include Unicode despite starting development before the Unicode 1.0 release, but both added Unicode back when Unicode was 16-bit and the need for something like UTF-8 was less clear.
burtekd · 26m ago
I'm just gonna leave this here too: https://www.youtube.com/watch?v=MijmeoH9LT4
quotemstr · 25m ago
Great example of a technology you get from a brilliant guy with a vision and that you'll never get out of a committee.