How to modify Starlink Mini to run without the built-in WiFi router (olegkutkov.me)

What do HN people have to say about Unicode and UTF-{8,16,32}? Are there parts you've never really understood? Have you had unexpected bugs due to misunderstood properties of text?

Comments (7)

Rendello · 23h ago

I (OP) have been working on some Unicode visualization tooling for a while now. The idea started when I had some buggy string-matching code. I was matching case-insensitively, then using those ranges to highlight the original text.

Turns out, sometimes changing case changes not only the number of bytes (in UTF-8), but the number of encoded characters! This led to my post "UTF-8 characters that behave oddly when the case is changed" [1], which inspired a lot of conversation that taught me a lot. After that, I started reading Unicode documentation in earnest, and building up an idea of what a new tool should show. I'm trying to make clear things I didn't (and sometimes still don't) understand, so I'd love to know what causes pains in the wild / gaps in people's understanding.

1. https://news.ycombinator.com/item?id=42014045

solardev · 13h ago

I don't understand the difference between a character, a codepoint, a glyph, and whatever else makes up a single "thing" in unicode.

Rendello · 10h ago

That tripped me up too. The Unicode Core spec is quite good at explaining things and introduces some terminology you don't really hear outside the document. Chapter 2, General Structure, is worth reading in its entirety. I've linked some bits that might help:

> *2.2.3 Characters, Not Glyphs*

> The Unicode Standard draws a distinction between characters and glyphs. Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. [...] Letters in different scripts, even when they correspond either semantically or graphically, are represented in Unicode by distinct characters.

> Characters are represented by code points that reside only in a memory representation, as strings in memory, on disk, or in data transmission. The Unicode Standard deals only with character codes.

> *2.4 Code Points and Characters*

> The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.

> *2.5 Encoding Forms*

This deals with UTF-{8,16,32}, which is a tricky bit and tripped me up for a long time. If the document is too dense here, there's a lot of supplementary material online explaining the different forms, I'll link a Tom Scott video explaining UTF-8.

---

The long and short of it is: the atomic unit of Unicode is the character, or encoded character, which is a value that has been associated with a code point, which is an integer usually represented in hex for as U+XXXX. Unicode doesn't deal with glyphs or graphical representations, just characters and their properties (eg. what is the character name? what should this character do when uppercased?). As you probably know, many characters can combine with others to form grapheme clusters, which may look like a single (abstract) character, but underneath consist of multiple (encoded) characters. Every character is associated with an integer index (a codepoint), and those integers can be represented in three formats (this sort of happened by accident): UTF-32 (just represent the integer directly), UTF-16 (was originally supposed to represent the integer directly, but there were too many and it got extended), and UTF-8 (which has different byte lengths to encode different characters efficiently).

[spec] https://www.unicode.org/versions/Unicode16.0.0/core-spec/

[2.2.3] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[2.4] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[2.5] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[Tom Scott UTF-8] https://www.youtube.com/watch?v=MijmeoH9LT4

NoahZuniga · 10h ago

I guess its kind of annoying that letters with diacritics can be represented in multiple different ways

Rendello · 9h ago

That's true, and even with normalization, there's four normalized forms for strings. The -k- forms are mostly for searching, but that still leaves NFC and NFD.

The normalization forms are explained, in order of approachability (imo), in this random Youtube video, the Unicode Annex #15, and the Unicode Core Spec:

https://www.youtube.com/watch?v=ttLD4DiMpiQ

https://unicode.org/reports/tr15/

https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

bjourne · 6h ago

Comparing strings by bytecode equality is kinda dubious anyway.

0xCE0 · 5h ago

The original intent of Unicode was great: a standard that creates a mapping between a unique number==codepoint and specific character of language (and here character means only abstract non-visual symbol==meaning, not visually rendered glyph with stylistic font of any kind). The updates for Unicode versions added more languages, even dead ones. So basically it was a historical knowledge effort also.

Then came emojis, and now the Unicode Consortium's efforts for Unicode version updates seems to be about adding more different kinds of poop emojis and shades of skin colors. Well, maybe it projects accurately the language and culture of this modern time.

UTF-8 is great because it is a superset of ASCII, but because its byte-width varies, it has more complexity for decoding/encoding it (similar to constant/variable width ISA's in CPUs).

Different languages have different concepts, e.g. text direction==flow (left/right, up/down, characters/logograms, different kind of visual cues etc.). Humans create problems when they want to combine different languages at the same time. E.g. mathematical notation is in my opinion 2D graphics, and it cannot be (usually/always) inlined with text glyphs (to be aesthetically pleasing). Same kind of problems may come when trying to inline e.g. languages with different flow directions. Its like trying to combine native GUI widgets in Win32 and Cocoa/SwiftUI and GTK/Qt/WXwidgets - the (visual) languages doesn't have the same concepts or they are conflicting.

Datalog in Rust (github.com)

How to modify Starlink Mini to run without the built-in WiFi router (olegkutkov.me)

Show HN: Meow – An Image File Format I made because PNGs and JPEGs suck for AI (github.com)

Ruby on Rails Audit Complete (ostif.org)

1k year old 3 sisters crop farm found in Northern Michigan (smithsonianmag.com)

The Art of Lisp and Writing (dreamsongs.com)

An origin trial for a new HTML <permission> element (developer.chrome.com)

Q-learning is not yet scalable (seohong.me)

Tiny-diffusion: A minimal implementation of probabilistic diffusion models (github.com)

Infinite Grid of Resistors (mathpages.com)

CI/CD Observability with OpenTelemetry Step by Step Guide (signoz.io)

I have reimplemented Stable Diffusion 3.5 from scratch in pure PyTorch (github.com)

Text-to-LoRA: Hypernetwork that generates task-specific LLM adapters (LoRAs) (github.com)

Notes on the History of the Map Tile (placing.technology)

Waymo rides cost more than Uber or Lyft and people are paying anyway (techcrunch.com)

Breaking My Security Assignments (akpain.net)

Meta-analysis of three different notions of software complexity (typesanitizer.com)

AMD's AI Future Is Rack Scale 'Helios' (morethanmoore.substack.com)

The Algebra of an Infinite Grid of Resistors (mathpages.com)

Solar Orbiter gets world-first views of the Sun's poles (esa.int)

Chicken Eyeglasses (en.wikipedia.org)

The Talented Ms. Highsmith (yalereview.org)

Inside the Apollo “8-Ball” FDAI (Flight Director / Attitude Indicator) (righto.com)

Bioprospectors mine microbial genomes for antibiotic gold (cen.acs.org)

Large language models often know when they are being evaluated (arxiv.org)

How multiplication is defined in Peano arithmetic (devlinsangle.blogspot.com)

Last fifty years of integer linear programming: Recent practical advances (inria.hal.science)

Debunking HDR [video] (yedlin.net)

Cray versus Raspberry Pi (aardvark.co.nz)

Have a damaged painting? Restore it in just hours with an AI-generated “mask” (news.mit.edu)

Endometriosis is an interesting disease (owlposting.com)

Dance Captcha (dance-captcha.vercel.app)

SIMD-friendly algorithms for substring searching (2016) (0x80.pl)

The Many Sides of Erik Satie (thereader.mitpress.mit.edu)

Fixing the mechanics of my bullet chess (jacobbrazeal.wordpress.com)

TimeGuessr (timeguessr.com)

Unsupervised Elicitation of Language Models (arxiv.org)

How to Build Conscious Machines (osf.io)

Seven replies to the viral Apple reasoning paper and why they fall short (garymarcus.substack.com)

Solidroad (YC W25) Is Hiring (solidroad.com)

Slowing the flow of core-dump-related CVEs (lwn.net)

"Make in India" Relies on "Made in China" (hinrichfoundation.com)

Student discovers fungus predicted by Albert Hoffman (wvutoday.wvu.edu)

We investigated Amsterdam's attempt to build a 'fair' fraud detection model (lighthousereports.com)

Self-Adapting Language Models (arxiv.org)

Clinical knowledge in LLMs does not translate to human interactions (arxiv.org)

Root Cause of the June 12, 2025 Google Cloud Outage (twitter.com)

Implementing Logic Programming (btmc.substack.com)

How the Final Cartridge III Freezer Works (pagetable.com)

If the moon were only 1 pixel: A tediously accurate solar system model (2014) (joshworth.com)

Ask HN: What are your Unicode woes?

Comments (7)