It's Not Wrong that " ".length == 7

86 points · program · 104 comments · 8/22/2025, 6:18:56 AM · hsivonen.fi

Comments (104)

DavidPiper · 2h ago
I think that string length is one of those things that people (including me) don't realise they never actually want. In a production system, I have never actually wanted string length. I have wanted:

- Number of bytes this will be stored as in the DB

- Number of monospaced font character blocks this string will take up on the screen

- Number of bytes that are actually being stored in memory

"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.

arcticbull · 1h ago
Taking this one step further -- there's no such thing as the context-free length of a string.

Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.

Refining your list, the things you usually want are:

- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).

- Number of code points when parsing.

- Number of grapheme clusters for advancing the cursor back and forth when editing.

- Bounding box in pixels or points for display with a given font.

Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.

It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
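
To make that concrete, a minimal Python sketch (illustrative only): even "number of bytes" only has a value once you pick an encoding.

  s = "héllo"
  print(len(s.encode("utf-8")))      # 6
  print(len(s.encode("utf-16-le")))  # 10
  print(len(s.encode("latin-1")))    # 5
  # Code point count, grapheme count, and pixel width are separate questions again.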

account42 · 1h ago
> Number of code points when parsing.

You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.

josephg · 1h ago
It’s a bit of a niche use case, but I use the codepoint counts in CRDTs for collaborative text editing.

Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.

Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
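
A hypothetical sketch of that validation difference (illustrative Python, not actual CRDT code): any integer up to the code point length is a valid insertion point, while a byte offset has to be checked against sequence boundaries.

  doc = "naïve"  # 5 code points, 6 UTF-8 bytes ("ï" takes 2 bytes)

  def codepoint_insert_ok(doc: str, pos: int) -> bool:
      # Any integer from 0..len(doc) is a valid insertion point.
      return 0 <= pos <= len(doc)

  def byte_insert_ok(data: bytes, pos: int) -> bool:
      # A byte offset must not land on a continuation byte (0b10xxxxxx).
      return 0 <= pos <= len(data) and (pos == len(data) or (data[pos] & 0xC0) != 0x80)

  print(codepoint_insert_ok(doc, 3))             # True
  print(byte_insert_ok(doc.encode("utf-8"), 3))  # False: inside the 2-byte "ï"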

account42 · 31m ago
> Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint.

You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.

This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
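
For illustration, the same hazard in Python: index 1 is a perfectly legal code point boundary, yet inserting there splits the user-perceived character.

  s = "e\u0308"                # "ë" spelled as e + COMBINING DIAERESIS
  broken = s[:1] + "a" + s[1:]
  print(broken)                # "eä" - the diaeresis now combines with the "a"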

tomsmeding · 5m ago
baq · 2h ago
ASCII is very convenient when it fits in the solution space (it’d better be, it was designed for a reason), but in the global international connected computing world it doesn’t fit at all. The problem is all the tutorials, especially low level ones, assume ASCII so 1) you can print something to the console and 2) to avoid mentioning that strings are hard so folks don’t get discouraged.

Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.

flohofwoe · 1m ago
ASCII is totally fine as an encoding for the first 128 Unicode code points. If you need to go beyond those 128 code points, use a different encoding like UTF-8.

Just never ever use Extended ASCII (8-bits with codepages).

account42 · 1h ago
> in the global international connected computing world it doesn’t fit at all.

I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code, and there are many downsides to allowing non-ASCII characters outside string constants or comments.

simonask · 47m ago
This is American imperialism at its worst. I'm serious.

Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.

Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?

It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.

jibal · 12m ago
It's neither American nor imperialism -- those are both category mistakes.

Andreas Rumpf, the designer of Nim, is Austrian. All the keywords of Nim are in English, the library function names are in English, the documentation is in English, Rumpf's book Mastering Nim is in English, the other major book for the language, Nim In Action (written by Dominik Picheta, nationality unknown but not American) is in English ... this is not "American imperialism" (which is a real thing that I don't defend), it's for easily understandable pragmatic reasons. And the language parser doesn't disallow non-ASCII characters but it doesn't treat them linguistically, and it has special rules for casefolding identifiers that only recognize ASCII letters, hobbling the use of non-ASCII identifiers because case distinguishes between types and other identifiers. The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster.

account42 · 35m ago
Actually, it would be great to have a lingua franca in every field that all participants can understand. Are you also going to complain that biologists and doctors are expected to learn some rudimentary Latin? English being dominant in computing is absolutely a strength and we gain nothing by trying to combat that. Having support for writing your code in other languages is not going to change that most libraries will use English and most documentation will be in English and most people you can ask for help will understand English. If you want to participate and refuse to learn English you are only shooting yourself in the foot - and if you are going to learn English you may as well do it from the beginning. Also, due to the dominance of English and ASCII in computing history, most languages already have ASCII alternatives for their writing, so even if you need to refer to non-English names you can do that using only ASCII.
simonask · 17m ago
Well, the problem is that what you are advocating amounts to making Latin a prerequisite for studying medicine, which it isn't anywhere. That's the equivalent. Doctors learn a (very limited) Latin vocabulary as they study and work.

You severely underestimate how far you can get without any real command of the English language. I agree that you can't become really good without it, just like you can't do haute cuisine without some French, but the English language is a huge and unnecessary barrier to entry that you would put in front of everyone in the world who isn't submerged in the language from an early age.

Imagine learning programming using only your high school Spanish. Good luck.

eru · 2h ago
Python 3 deals with this reasonably sensibly, too, I think. They use UTF-8 by default, but allow you to specify other encodings.
ynik · 1h ago
Python 3 internally uses UTF-32. When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8. "UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).
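
If you want to check what a given system actually does, a quick look from the standard library (output obviously varies by machine):

  import locale, sys

  print(locale.getpreferredencoding(False))  # e.g. "UTF-8", or "cp1252" on many Windows setups
  print(sys.flags.utf8_mode)                 # 1 if UTF-8 mode (PEP 540) is active, else 0
  # UTF-8 mode can be forced with `python -X utf8` or PYTHONUTF8=1.
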
arcticbull · 1h ago
Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer.

It uses Latin-1 for ASCII strings, UCS-2 for strings that contain code points in the BMP and UCS-4 only for strings that contain code points outside the BMP.

It would be pretty silly for them to explode all strings to 4-byte characters.
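
You can see CPython's flexible storage (PEP 393) indirectly with sys.getsizeof; the exact byte counts vary by version, but the per-character growth is the point.

  import sys

  ascii_s  = "a" * 100           # stored 1 byte per code point
  bmp_s    = "\u0394" * 100      # stored 2 bytes per code point (all code points <= U+FFFF)
  astral_s = "\U0001F600" * 100  # stored 4 bytes per code point

  print(sys.getsizeof(ascii_s), sys.getsizeof(bmp_s), sys.getsizeof(astral_s))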

jibal · 54m ago
You are correct. Discussions of this topic tend to be full of unvalidated but confidently stated assertions, like "Python 3 internally uses UTF-32." Also unjustified assertions, like the OP's claim that len(" ") == 5 is "rather useless" and that "Python 3’s approach is unambiguously the worst one". Unlike in many other languages, the code points in Python's strings are always directly O(1) indexable--which can be useful--and the subject string has 5 indexable code points. That may not be the semantics that someone is looking for in a particular application, but it certainly isn't useless. And given the Python implementation of strings, the only other number that would be useful would be the number of grapheme clusters, which in this case is 1, and that count can be obtained via the grapheme or regex modules.
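
Both points, sketched in Python (assuming the third-party grapheme module mentioned above; regex's \X works too, and the grapheme result depends on the module's Unicode data version):

  import grapheme  # third-party; the regex module's \X works as well

  s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # the article's facepalm emoji, 5 code points

  print(len(s))              # 5 - code points, each O(1) indexable
  print(s[2] == "\u200D")    # True - direct indexing into the middle of the cluster
  print(grapheme.length(s))  # 1 - extended grapheme clusters
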
account42 · 1h ago
It conceptually uses arrays of code points, which need up to 21 bits. Optimizing the storage to use smaller integers when possible is an implementation detail.
jibal · 51m ago
Python 3 is specified to use arrays of 8-, 16-, or 32-bit units, depending on the largest code point in the string. As a result, all code points in all strings are O(1) indexable. The claim that "Python 3 internally uses UTF-32" is simply false.
xigoi · 1h ago
I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.
afiori · 1h ago
I would like a UTF-8-optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid UTF-8 (for every edit of n bytes it should be enough to check about n+8 bytes to re-validate). Then UTF-8 encoding/decoding becomes a no-op, and UTF-8-specific APIs can quickly check whether the string is malformed.
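
A rough sketch of that re-validation idea in Python (a hypothetical helper, not an existing API; it assumes the buffer was valid UTF-8 before the edit replaced bytes [i, j) and leans on UTF-8 being self-synchronizing):

  def edit_window_valid(buf: bytes, i: int, j: int) -> bool:
      start = i
      while start > 0 and (buf[start - 1] & 0xC0) == 0x80:  # back up over continuation bytes
          start -= 1
      if start > 0:
          start -= 1                                         # include the preceding lead/ASCII byte
      end = j
      while end < len(buf) and (buf[end] & 0xC0) == 0x80:    # absorb orphaned continuation bytes
          end += 1
      try:
          buf[start:end].decode("utf-8")                     # only ~n+7 bytes are re-checked
          return True
      except UnicodeDecodeError:
          return False
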
account42 · 1h ago
But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string? Keep in mind that UTF-8 is self-synchronizing, so even if you encode strings into a larger text-based format without verifying them it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes, many programs can support arbitrary encodings or at least arbitrary ASCII supersets without any additional effort.
afiori · 14m ago
The main issue I can see is not garbage bytes in text but mixing of incompatible encoding eg splicing latin-1 bytes in a utf-8 string.

My understanding of the current "always and only utf-8/unicode" zeitgeist is that it comes mostly from encoding issues, chief among them the complexity of detecting the encoding.

I think that the current status quo is better than what came before, but I also think it could be improved.

bawolff · 56m ago
Me too.

The languages that I really don't get are those that force valid UTF-8 everywhere but don't enforce NFC. Which is most of them, but it seems like the worst of both worlds.

Non normalized unicode is just as problematic as non validated unicode imo.

jibal · 46m ago
Python has byte arrays that allow for that, in addition to strings consisting of arrays of Unicode code points.
account42 · 1h ago
Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway.
xelxebar · 1h ago
> Number of monospaced font character blocks this string will take up on the screen

Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.

But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.
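
For the monospace-cell case, the usual heuristic keys off the East Asian Width property; a rough Python sketch (a heuristic only, and as noted it can't capture things like Devanagari shaping):

  import unicodedata

  def cell_width(s: str) -> int:
      # Fullwidth ("F") and Wide ("W") characters take two monospace cells.
      return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in s)

  print(cell_width("abc"))     # 3
  print(cell_width("日本語"))  # 6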

Semaphor · 1h ago
FWIW, I frequently want the string length. Not for anything complicated, but our authors have ranges of characters they are supposed to stay in. Luckily no one uses emojis or weird unicode symbols, so in practice there’s no problem getting the right number by simply ignoring all the complexities.
xg15 · 2h ago
It gets more complicated if you do substring operations.

If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.

arcticbull · 1h ago
Substring operations (and more generally the universe of operations where there is more than one string involved) are a whole other kettle of fish. Unicode, being a byte code format more than what you think of as a logical 'string' format, has multiple ways of representing the same strings.

If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
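
The composed vs. decomposed "é" shows the difference; an illustrative Python snippet:

  import unicodedata

  a = "\u00E9"   # é as one code point
  b = "e\u0301"  # e + COMBINING ACUTE ACCENT

  print(a == b)  # False - bitwise (code point) comparison
  print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True - logical comparison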

jibal · 42m ago
"Unicode, being a byte code format"

UTF-8 is a byte code format; Unicode is not. In Python, where all strings are arrays of Unicode code points, substrings are likewise arrays of Unicode code points.

setr · 1h ago
I’m fairly positive the answer is trivially logical equivalence for pretty much any substring operation. I can’t imagine bitwise equivalence to ever be the “normal” use case, except to the implementer looking at it as a simpler/faster operation

I feel like if you’re looking for bitwise equivalence or similar, you should have to cast to some kind of byte array and access the corresponding operations accordingly

arcticbull · 1h ago
Yep for a substring against its parent or other substrings of the same parent that’s definitely true, but I think this question generalizes because the case where you’re comparing strings solely within themselves is an optimization path for the more general. I’m just thinking out loud.
account42 · 55m ago
> s.charAt(x) or s.codePointAt(x)

Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.

mseepgood · 1h ago
The values for x and y shouldn't come from your brain, though (with the exception of 0). They should come from previous index operations like s.indexOf(...) or s.search(regex), etc.
xg15 · 1h ago
Indeed. Or s.length, whatever that represents.
guappa · 2h ago
What if you need to find 5 letter words to play wordle? Why do you care how many bytes they occupy or how large they are on screen?
xigoi · 1h ago
In the case of Wordle, you know the exact set of letters you’re going to be using, which easily determines how to compute length.
guappa · 1h ago
No no, I want to create tomorrow's puzzle.
taneq · 2h ago
If you're playing at this level, you need to define:

- letter

- word

- 5 :P

guappa · 1h ago
Eh, in Macedonian they have some letters that in Russian are just 2 separate letters.
CorrectHorseBat · 1h ago
In German you have the same, only within one language. ß can be written as ss if it isn't available in a font, and only in 2017 was a capital version added. So depending on the font and the Unicode version, the number of letters can differ.
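
Easy to see in Python, and worth noting the capital ẞ has its own code point (U+1E9E), which German orthography officially adopted in 2017:

  s = "straße"
  print(len(s), len(s.upper()))  # 6 7 - upper() expands ß to "SS"
  print(s.casefold())            # "strasse" - casefolding expands it too
  print("\u1E9E")                # ẞ - LATIN CAPITAL LETTER SHARP S
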
taneq · 38m ago
Niße. ;)
bluecalm · 55m ago
What about implementing text algorithms like prefix search or a suffix tree to mention the simplest ones? Don't you need a string length at various points there?
account42 · 46m ago
With UTF-8 you can implement them on top of bytes.
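
A small illustration: because UTF-8 decodes deterministically left to right, a byte-level prefix match agrees with a code point prefix match (assuming both sides are in the same normalization form), so tries and suffix structures can be built directly over the bytes.

  words = ["héllo", "heap", "héron", "日本語"]
  prefix = "hé"

  hits_bytes = [w for w in words if w.encode("utf-8").startswith(prefix.encode("utf-8"))]
  hits_chars = [w for w in words if w.startswith(prefix)]
  print(hits_bytes == hits_chars, hits_bytes)  # True ['héllo', 'héron']
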
zwnow · 1h ago
I actually want string length. Just give me the length of a word. My human brain wants a human way to think about problems. While programming I never think about bytes.
jibal · 39m ago
The point is that those terms are ambiguous ... and if you mean the length in grapheme clusters, it can be quite expensive to calculate it, and isn't the right number if you're dealing with strings as objects that are chunks of memory.
dwb · 54m ago
The whole point is that string length doesn’t necessarily give you the “length” of a “word”, and both of those terms are not well enough defined.
thrdbndndn · 1h ago
I see where you're coming from, but I disagree on some specifics, especially regarding bytes.

Most people care about the length of a string in terms of the number of characters.

Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), especially if you're dealing with anything beyond ASCII (which you really should be, since East Asian users alone number in the billions).

Same goes for "string width".

Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.

account42 · 51m ago
It's not rare at all - multi-code point emojis are pretty standard these days.

And before that the only thing the relative rarity did for you was that bugs with code working on UTF-8 bytes got fixed while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.

sigmoid10 · 2h ago
I have wanted string length many times in production systems for language processing. And it is perfectly fine as long as whatever you are using is consistent. I rarely care how many bytes an emoji actually is unless I'm worried about extreme efficiency in storage, or how many monospace characters it uses unless I do very specific UI things. This blog is more of a cautionary tale about what can happen if you unconsciously mix standards, e.g. by using one in the backend and another in the frontend. But this is not a problem of string lengths per se, they are just one instance where modern implementations are all over the place.
bstsb · 3h ago
ironic that unicode is stripped out of the post's title here, making it very much wrong ;)

for context, the actual post features an emoji with multiple unicode codepoints in between the quotes

cmeacham98 · 2h ago
Funny enough I clicked on the post wondering how it could possibly be that a single space was length 7.
ale42 · 2h ago
Maybe it isn't a space, but a list of invisible Unicode chars...
yread · 2h ago
It could also be a byte length of a 3 byte UTF-8 BOM and then some stupid space character like f09d85b3
robin_reala · 1h ago
It’s U+0020, a standard space character.
c12 · 2h ago
I did exactly the same, thinking that maybe it was invisible unicode characters or something I didn't know about.
eastbound · 2h ago
It can be many Zero-Width Space, or a few Hair-Width Space.

You never know, when you don't know CSS and try to align your pixels with spaces. Some programmers should start a trend where 1 tab = 3 hairline-width spaces (smaller than 1 char width).

Next up: The <half-br/> tag.

Moru · 1h ago
You laugh but my typewriter could do half-br 40 years ago. Was used for typing super/subscript.
timeon · 1h ago
Unintentional click-bait.
osener · 46m ago
Python does an exceptionally bad job. After dragging the community through a 15-year transition to Python 3 in order to "fix" Unicode, we ended up with support that's worse than in languages that simply treat strings as raw bytes.

Some other fun examples: https://gist.github.com/ozanmakes/0624e805a13d2cebedfc81ea84...

mid-kid · 7m ago
Yeah I have no idea what is wrong with that. Python simply operates on arrays of codepoints, which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding. This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
jfoster · 13m ago
I run one of the many online word counting tools (WordCounts.com) which also does character counts. I have noticed that even Google Docs doesn't seem to use grapheme counts and will produce larger than expected counts for strings of emoji.

If you want to see a more interesting case than emoji, check out Thai language. In Thai, vowels could appear before, after, above, below, or on many sides of the associated consonants.

xg15 · 1h ago
The article both argues that the "real" length from a user perspective is Extended Grapheme Clusters - and makes a case against using it, because it requires you to store the entire character database and may also change from one Unicode version to the next.

Therefore, people should use codepoints for things like length limits or database indexes.

But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?

If a newer Unicode version suddenly defines some sequences to be a single grapheme cluster where there were several ones before and my database index now suddenly points to the middle of that cluster, what would I do?

Seems to me, the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?

re · 1h ago
What do you mean by "use codepoints for ... database indexes"? I feel like you are drawing conclusions that the essay does not propose or support. (It doesn't say that you should use codepoints for length limits.)

> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?

Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?

xg15 · 1h ago
I was referring to this part, in "Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?":

"For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into a Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.

You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."

You're right it doesn't say "codepoints" as an alternative solution. That was just my assumption as it would be the closest representation that does not depend on the character database.

But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.

> Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?

Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.

chrismorgan · 57m ago
> it doesn't say "codepoints" as an alternative solution. That was just my assumption …

On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8 and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)

> The problem will be the same if you have to reconstruct the grapheme clusters eventually.

In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing. Because that’s pretty much the only place EGCs are ever actually relevant.

> You don't want that if you e.g. have an index for fulltext search.

Your text search won’t be splitting by grapheme clusters, it’ll be doing word segmentation instead.

kazinator · 2h ago
Why would I want this to be 17, if I'm representing strings as array of code points, rather than UTF-8?

TXR Lisp:

  1> (len " ")
  5
  2> (coded-length " ")
  17
(Trust me when I say that the emoji was there when I edited the comment.)

The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.

chrismorgan · 1h ago
Previous discussions:

https://news.ycombinator.com/item?id=36159443 (June 2023, 280 points, 303 comments; title got reemojied!)

https://news.ycombinator.com/item?id=26591373 (March 2021, 116 points, 127 comments)

https://news.ycombinator.com/item?id=20914184 (September 2019, 230 points, 140 comments)

I’m guessing this got posted by one who saw my comment https://news.ycombinator.com/item?id=44976046 today, though coincidence is possible. (Previous mention of the URL was 7 months ago.)

tralarpa · 1h ago
Fascinating and annoying problem, indeed. In Java, the correct way to iterate over the characters (Unicode scalar values) of a string is to use the IntStream provided by String::codePoints (since Java 8), but I bet 99.9999% of the existing code uses 16-bit chars.
mrheosuper · 3h ago
>We’ve seen four different lengths so far:

- Number of UTF-8 code units (17 in this case)

- Number of UTF-16 code units (7 in this case)

- Number of UTF-32 code units or Unicode scalar values (5 in this case)

- Number of extended grapheme clusters (1 in this case)

We would not have this problem if we all agree to return number of bytes instead.

Edit: My mistake. There would still be inconsistency between different encodings. My point is, if we all decided to report the number of bytes that the string uses instead of the number of printable characters, we would not have the inconsistency between languages.
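
For reference, the article's four counts computed in Python for the facepalm emoji (the grapheme count needs a third-party module such as regex, and depends on its Unicode data version):

  import regex  # third-party

  s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # man facepalming, medium-light skin tone

  print(len(s.encode("utf-8")))           # 17 - UTF-8 code units (bytes)
  print(len(s.encode("utf-16-le")) // 2)  # 7  - UTF-16 code units
  print(len(s))                           # 5  - scalar values / UTF-32 code units
  print(len(regex.findall(r"\X", s)))     # 1  - extended grapheme clusters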

jibal · 3m ago
> if we all decided to report number of bytes that string used instead number of printable characters

But that isn't the same across all languages, or even across all implementations of the same language.

curtisf · 2h ago
"number of bytes" is dependent on the text encoding.

UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won

minebreaker · 2h ago
> We would not have this problem if we all agree to return number of bytes instead.

I don't understand. It depends on the encoding isn't it?

com2kid · 2h ago
How would that help? UTF-8, 16, and 32 languages would still report different numbers.
charcircuit · 2h ago
>Number of extended grapheme clusters (1 in this case)

Only if you are using a new enough version of unicode. If you were using an older version it is more than 1. As new unicode updates come out, the number of grapheme clusters a string has can change.

baq · 1h ago
when I'm reading text on a screen, I very much am not reading bytes. this is obvious when you actually think what 'text encoding' means.
account42 · 13m ago
You're not reading unicode code points either though. Your computer uses bytes, you read glyphs which roughly correspond to unicode extended grapheme clusters - anything between might look like the correct solution at first but is the wrong abstraction for almost everything.
baq · 12m ago
you are right, but this just drives the point.
Ultimatt · 1h ago
Worth giving Raku a shout out here... methods do what they say and you write what you mean. Really wish every other language would pinch the Str implementation from here, or at least the design.

    $ raku
    Welcome to Rakudo™ v2025.06.
    Implementing the Raku® Programming Language v6.d.
    Built on MoarVM version 2025.06.

    [0] > " ".chars
    1
    [1] > " ".codes
    5
    [2] > " ".encode('UTF-8').bytes
    17
    [3] > " ".NFD.map(*.chr.uniname)
    (FACE PALM EMOJI MODIFIER FITZPATRICK TYPE-3 ZERO WIDTH JOINER MALE SIGN VARIATION SELECTOR-16)
pwdisswordfishz · 1h ago
Call me naive, but I think the length of a space character ought to be one.
jibal · 1m ago
Read the article ... the character between the quote marks isn't a space, but HN apparently doesn't support emoji, or at least not that one.
Aissen · 2h ago
I'd disagree the number of unicode scalars is useless (in the case of python3), but it's a very interesting article nonetheless. Too bad unicode.org decided to break all the URLs in the table at the end.
umajho · 2h ago
If you want to get the grapheme length in JavaScript, JavaScript now has Intl.Segmenter[^1][^2].

  > [...(new Intl.Segmenter()).segment(THAT_FACEPALM_EMOJI)].length
  1
[^1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

[^2]: https://caniuse.com/mdn-javascript_builtins_intl_segmenter_s...

impure · 2h ago
I learned this recently when I encountered a bug due to cutting an emoji character in two making it unable to render.
darkwater · 2h ago
(2019) updated in (2022)
troupo · 2h ago
Obligatory, Emoji under the hood https://tonsky.me/blog/emoji/
Sniffnoy · 2h ago
Another little thing: The post mentions that tag sequences are only used for the flags of England, Scotland, and Wales. Those are the only ones that are standard (RGI), but because it's clear how the mechanism would work for other subnational entities, some systems support other ones, such as US state flags! I don't recommend using these if you want other people to be able to see them, but...
spyrja · 2h ago
I really hate to rant on about this. But the gymnastics required to parse UTF-8 correctly are truly insane. Besides that, we now see issues such as invisible glyph injection attacks etc. cropping up all over the place due to this crappy so-called "standard". Maybe we should just go back to the simplicity of ASCII until we can come up with something better?
danhau · 1h ago
Are you referring to Unicode? Because UTF-8 is simple and relatively straightforward to parse.

Unicode definitely has its faults, but on the whole it’s great. I’ll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.

Needless to say, Unicode is not a good fit for every scenario.

xg15 · 1h ago
I think GP is really talking about extended grapheme clusters (at least the mention of invisible glyph injection makes me think that)

Those really seem hellish to parse, because there seem to be several mutually independent schemes how characters are combined to clusters, depending on what you're dealing with.

E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.

So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.

spyrja · 1h ago
Just as an example of what I am talking about, this is my current UTF-8 parser which I have been using for a few years now.

  bool utf_append_plaintext(utf* result, const char* text) {
  #define msk(byte, mask, value) ((byte & mask) == value)
  #define cnt(byte) msk(byte, 0xc0, 0x80)
  #define shf(byte, mask, amount) ((byte & mask) << amount)
    utf_clear(result);
    if (text == NULL)
      return false;
    size_t siz = strlen(text);
    uint8_t* nxt = (uint8_t*)text;
    uint8_t* end = nxt + siz;
    if ((siz >= 3) && (nxt[0] == 0xef) && (nxt[1] == 0xbb) && (nxt[2] == 0xbf))
      nxt += 3;
    while (nxt < end) {
      bool aok = false;
      uint32_t cod = 0;
      uint8_t fir = nxt[0];
      if (msk(fir, 0x80, 0)) {
        cod = fir;
        nxt += 1;
        aok = true;
      } else if ((nxt + 1) < end) {
        uint8_t sec = nxt[1];
        if (msk(fir, 0xe0, 0xc0)) {
          if (cnt(sec)) {
            cod |= shf(fir, 0x1f, 6);
            cod |= shf(sec, 0x3f, 0);
            nxt += 2;
            aok = true;
          }
        } else if ((nxt + 2) < end) {
          uint8_t thi = nxt[2];
          if (msk(fir, 0xf0, 0xe0)) {
            if (cnt(sec) && cnt(thi)) {
              cod |= shf(fir, 0x0f, 12);
              cod |= shf(sec, 0x3f, 6);
              cod |= shf(thi, 0x3f, 0);
              nxt += 3;
              aok = true;
            }
          } else if ((nxt + 3) < end) {
            uint8_t fou = nxt[3];
            if (msk(fir, 0xf8, 0xf0)) {
              if (cnt(sec) && cnt(thi) && cnt(fou)) {
                cod |= shf(fir, 0x07, 18);
                cod |= shf(sec, 0x3f, 12);
                cod |= shf(thi, 0x3f, 6);
                cod |= shf(fou, 0x3f, 0);
                nxt += 4;
                aok = true;
              }
            }
          }
        }
      }
      if (aok)
        utf_push(result, cod);
      else
        return false;
    }
    return true;
  #undef cnt
  #undef msk
  #undef shf
  }
Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.
simonask · 29m ago
That's a reasonable implementation in my opinion. It's not that complicated. You're also apparently insisting on three-letter variable names, and are using a very primitive language to boot, so I don't think you're setting yourself up for "maintainability" here.

Here's the implementation in the Rust standard library: https://doc.rust-lang.org/stable/src/core/str/validations.rs...

It even includes an optimized fast path for ASCII, and it works at compile-time as well.

guappa · 2h ago
Sure, I'll just write my own language all weird and look like an illiterate so that you are not inconvenienced.
kalleboo · 1h ago
I think what you meant is we should all go back to the simplicity of Shift-JIS
eru · 2h ago
You could use a standard that always uses eg 4 bytes per character, that is much easier to parse than UTF-8.

UTF-8 is so complicated, because it wants to be backwards compatible with ASCII.

degamad · 1h ago
It's not just the variable byte length that causes an issue, in some ways that's the easiest part of the problem. You also have to deal with code points that modify other code points, rather than being characters themselves. That's a huge part of the problem.
bawolff · 49m ago
That goes all the way back to the beginning

Even ascii used to use "overstriking" where the backspace character was treated as a joiner character to put accents above letters.

amake · 1h ago
That has nothing to do with UTF-8; that's a Unicode issue, and one that's entirely unescapable if you are the Unicode Consortium and your goal is to be compatible with all legacy charsets.
spyrja · 1h ago
True. But then again, backward compatibility isn't really that hard to achieve with ASCII because the MSB is always zero. The problem I think is that the original motivation which ultimately led to the complications we now see with UTF-8 was based on a desire to save a few bits here and there rather than create a straightforward standard that was easy to parse. I am actually staring at 60+ lines of fairly pristine code I wrote a few years back that ostensibly passed all tests, only to find out that in fact it does not cover all corner cases. (Could have sworn I read the spec correctly, but apparently not!)
eru · 30m ago
Ekaros · 2h ago
Should have just gone with 32 bit characters and no combinations. Utter simplicity.
bawolff · 46m ago
I think combining characters are a lot simpler than having every single combination ever.

Especially when you start getting into non latin-based languages.

amake · 1h ago
What does "no combinations" mean?