The Turkish İ Problem and Why You Should Care (2012)

90 points by Rygian | 5/6/2025, 8:34:17 AM | haacked.com ↗

Comments (127)

boomlinde · 8h ago
German has an 'ß' problem of a similar nature. There is a corresponding capital "ẞ" in Unicode, and Germany has officially adopted 'ẞ' as an alternative since, but in Unicode's SpecialCasing.txt the upper of 'ß' is still 'SS'. The lower of 'S' of course being 's', there's no going back after folding to upper cases. Lower of 'ẞ' is however still 'ß'.

So by alternating case you end up with ß→SS→ss or ẞ→ß→SS. Certainly has potential to screw with naive attempts at case-insensitive comparison via case folding. Then again, Unicode adopting 'ẞ' as the upper of 'ß' in some future version would probably only increase that potential further.

I'm interested to hear from people dealing with a lot of German text how much of a problem this is in practice.
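For the comparison use case specifically, Unicode case folding (a separate operation from the upper/lower mappings) already folds all three forms together; a quick check in Python:

```python
# str.casefold() applies Unicode full case folding, which maps both
# 'ß' and 'ẞ' to "ss" in one step, avoiding the alternating-case trap:
assert "ß".casefold() == "ss"
assert "ẞ".casefold() == "ss"
assert "Straße".casefold() == "STRASSE".casefold() == "strasse"
```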

virtualritz · 7h ago
There is an esthetic issue here too.

The ‘ß’ is a ligature of the old ‘long s’ [1] which was written ‘ſ’ (because it’s common in old texts there is a Unicode code point for it).

This letter has no upper case version. Capitalized words starting with a long ‘ſ’ always used ‘S’.

Now in German language, to make this lowercase long ‘ſ’ a sharp ‘s’, ‘ſ’ followed by ‘z’ was written: ‘ſz’.

And these two were often typeset as a ligature, ‘ß’, for esthetic reasons.

That ligature then became the common case and eventually a letter recognized in German-speaking countries.

As a hypothetical analogy, imagine an ‘ll’ ligature, as in ‘fallacy’, becoming an English letter – by some twist of history.

As we saw, these were lowercase letters. And there is no uppercase version of ‘ſ’.

So the uppercase ‘ẞ’ that is now officially recognized and has a Unicode code point should not look like this.

It's an absolute eye saw because all that was done was somehow make the letter look a bit more like a capital.

But its nature of being, originally, two lowercase letters still makes it stand out like an eyesore to people with a background in typography, like myself.

IMHO It should look like ‘SZ’ (or ‘SS’), made into a ligature.

And as a type designer, I'd either refrain from filling that code point in a font I design, to protest this, or do the above: create a ligature of ‘SZ’ or ‘SS’ (alternative) and put that there.

[1] https://en.m.wikipedia.org/wiki/Long_s

beeforpork · 7h ago
I disagree with roughly all of this.

First of all, 'ß' was a ligature -- a long time ago. It is a letter today. Disassembling it according to its original construction makes no sense today for any kind of argument about typesetting or Unicode. Further, 'ſ' is not used today in German at all, except for meta discussions like this or to stress how things used to be spelled. It makes no sense to mention it unless you are talking about font design or historic use of German (and other languages, for that matter).

Also, if you do mention it for the sake of talking about font design: in Latin fonts, 'ſs' is actually the basis for the design of 'ß', not 'ſz' -- that was mainly done in Blackletter/Fraktur, when the 'z' looked different, maybe a bit like 'ʒ' (I used Unicode's 'ezh' here hoping it looks right), so that the old-style 'ß' looks like a ligature of 'ſʒ'. This can still be seen occasionally, e.g., on Berlin street name signs. It is obsolete for most fonts today (although I quite like it).

Moreover, there is an upper case letter for 'ß': 'ẞ'. And it has existed in fine typography way before being adopted into Unicode. Actually, its existence was probably the reason why it is now in Unicode. The official German rules are now: either use 'SS' or 'ẞ' for uppercase 'ß'. Most Germans probably do not even know that 'ẞ' exists as a choice today, although it was used on 'DER GROẞE DUDEN' even before Unicode existed.

And finally, how a glyph is designed is not necessarily decided on whether historic parts of an ancient ligature had upper case variants. So that 'ſ' has no upper case equivalent is irrelevant for both Unicode and type design.

But as a font designer or anything else, you can protest. No problem. Everyone has the right to protest. But please don't spill the Internet with wrong information, as there is enough of it already.

And I don't think 'SS'<->'ß' is similar to the Turkish 'I with/without dot' problem, because the default Unicode mapping for 'ß' is correct in all languages, while the Turkish (and also Azerbaijani) problem is correct or broken depending on language setting. This is way more problematic because an assumed universal equivalence does not hold. And you need to carefully distinguish whether a string is language-specific or not, e.g., path names or IDs in databases, etc.

alexey-salmin · 6h ago
> And I don't think 'SS'<->'ß' is similar to the Turkish 'I with/without dot' problem, because the default Unicode mapping for 'ß' is correct in all languages, while the Turkish (and also Azerbaijani) problem is correct or broken depending on language setting.

I don't know if this counts as "correct" but it's still very confusing.

  >>> "ß".upper()
  'SS'
  >>> "ß".upper().lower()
  'ss'

  >>> "ẞ".lower()
  'ß'
  >>> "ẞ".lower().upper()
  'SS'
  >>> "ẞ".lower().upper().lower()
  'ss'
froh · 3h ago
"tja".

now Unicode philosophers have to ponder a breaking change vs introducing a new, duplicate ß code point LATIN SMALL LETTER SHARP S WITH CAPITAL SHARP S which as upper case has encoded the proper ẞ

two red buttons meme here...

beeforpork · 6h ago
Yes, it's definitely weird. But it is independent of locale, so any programmer has a chance to notice this regardless of language setting, instead of their app failing only once it is used by someone from Turkey or Azerbaijan.
yorwba · 6h ago
Indeed, the original Unicode inclusion request justifies the need for an encoding for the character by referencing prior usage going back all the way to 1879: https://www.unicode.org/wg2/docs/n3227.pdf

It may be a typographical abomination, but it's an intentional representation of that particular typographical abomination, just as the ox head in "A" intentionally has its horns pointing down.

pimlottc · 5h ago
> It's an absolute eye saw

I assume you mean "eyesore"

r2_pilot · 4h ago
This is quite possibly a mondegreen.
froh · 3h ago
migraine?? lol TIL Lady Mondegreen. made my day. thank you.

https://en.wikipedia.org/wiki/Mondegreen

WesolyKubeczek · 5h ago
Eye saw is even gorier, I like it.
fweimer · 3h ago
U+1E9E probably should be rendered exactly like “SS” (not as a ligature, and as a double-width character in monospace fonts). Inventing a separate glyph for it seems a bit silly and only hinders adoption. Even if it's a ligature like §, that issue won't go away. And there are design choices that are even worse than § due to historic precedent.
ozgung · 4h ago
> So while we have two i’s (upper and lower), they have four.

No, we don’t have four i’s in Turkish. I(ı) and İ(i) are two separate letters. The Turkish alphabet has 29 letters, and each letter has its own key on a Turkish keyboard. We also have Öö, Üü, Çç, Şş and Ğğ. These are all individual letters. They are not dotted versions, accents or a writing convention. So the language is as simple as it gets. The complications come from mapping letters to the English alphabet.

froh · 3h ago
too bad that typography doesn't care.

other languages are hit by the same hard destiny.

German language has öäüÖÄÜ and ßẞ, yet typographically, the only "real" letter of these is ßẞ, the others are "Umlaut" aouAOU.

Linguistically they are "letters" of their own right. With obscure rules for sorting and capitalization, especially if the typeface doesn't have capital ÄÖÜẞ. then they become what they were AE, OE, UE, SZ...

and that's what the article is about: locale matters.

and in that context you have four i-like glyphs in tr-TR. and if you do anything locale sensitive in your code, like case folding, better set the locale explicitly...
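The default (locale-independent) mappings, which e.g. Python's str methods implement, show how the distinction collapses without a locale:

```python
# Default Unicode case mappings, with no Turkish locale in play:
assert "ı".upper() == "I"       # dotless ı uppercases to plain I...
assert "I".lower() == "i"       # ...which lowercases back to dotted i
assert len("İ".lower()) == 2    # İ lowers to 'i' + U+0307 COMBINING DOT ABOVE
```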

mrighele · 3h ago
It's not just the English alphabet, but all the languages that use a Latin alphabet. I think that reusing two symbols representing a specific vowel for two different vowels was a poor choice.

(I guess that when it was made, more than one hundred years ago, it didn't seem that bad)

tremon · 2h ago
It's not really a worse choice than reusing the same symbol to represent multiple vowels. The preceding sentence has three different vowel sounds associated with the letter e, for example (four if you include the silent -e like in choice).
ozgung · 3h ago
They are two different vowels. There are 8 vowels in Turkish. All letters correspond to a different phonetic sound in the language, so you can read it as it is written. Iı sounds like e in ‘the’, and ‘i’ sounds like ‘ee’ in deep. They are different sounds, different vowels, hence conveniently different letters. The difference changes the whole meaning. For instance, “sınır” means boundary and “sinir” means nerve. I think it is simple and brilliant design for Turkish. I respect the design choices of all the other languages but this is what works best for our own language. You can simply accept that it is a different (modified Latin) alphabet for a different language.
CemDK · 3h ago
> So while we have two i’s (upper and lower), they have four

1. i

2. I

> No, we don’t have four i’s in Turkish. I(ı) and İ(i) are two separate letters

1. ı

2. I

3. i

4. İ

--> 4

ccppurcell · 2h ago
This is the same as saying we have four o's in English: o, O, q, Q. Two with a tail, two without.
MengerSponge · 2h ago
When you remember to include (b, d, p, and P), we have more than two tailed-o characters in English!
ccppurcell · 1h ago
Agreed except that the capital B D and P are not easy to describe as modifications of the capital O (even lower case q is a stretch but the point stands)
RandallBrown · 1h ago
In the default HN font, those are more tailed Ds than Os.
nkrisc · 2h ago
But you're comparing two cases of one letter to two cases of two letters. Of course two letters have, in aggregate, more glyphs than one letter.
NooneAtAll3 · 2h ago
[translated from latin]

So while we have two i's, they have four

1. i

2. I

No, we don't have four i's in English, I(i) and J(j) are two separate letters

1. i

2. I

3. j

4. J

--> 4

jccalhoun · 3h ago
that's like saying l and I are the same letters because they look similar in some sans serif fonts.
jofla_net · 3h ago
mumble mumble forest ... mumble mumble trees
GoblinSlayer · 2h ago
Add to this decomposed variants.
JimDabell · 9h ago
Transliterating this character incorrectly resulted in a violent attack causing two deaths:

https://languagelog.ldc.upenn.edu/nll/?p=73

jeroenhd · 8h ago
Based on the murderous reaction from the entire family, I doubt the transliteration issue changed the outcome much. It's a weird consequence of a transliteration issue, but someone prepared to murder someone else over a rude text is a ticking time bomb regardless.
omeid2 · 8h ago
It might seem like an overreaction from a western point of view, but the accusation, in the context of Central Asian culture, is something so extremely sensitive that people from all walks of life, from nobility to the poor, kill and die over it. It is just a different frame of mind.
batuhanicoz · 7h ago
This is an overreaction. It's violence. Trying to justify it by claiming it's part of their culture is not healthy. I think we can have some universal values (don't stab people?) and it is perfectly reasonable to expect people to adopt those values. It's their culture? They can leave the violent parts of the culture behind and adapt to the expectations of modern society (not stabbing people).

I'm Turkish. I grew up in Turkey. These things happen, but let's not try to justify them. We should aim to get to a point where people share these "western values" (of not stabbing people).

dooglius · 3h ago
The point of the comment you're responding to is not to justify it, the point is to rebut GP's assertion that the violence would have occurred anyway.
4gotunameagain · 7h ago
I'm sorry, but stabbing someone over a single text message is not cultural difference, it's idiocy.
pjc50 · 7h ago
Honor culture makes people do weird and terrible things. The American cultural version would be the same thing but with a gun.
lblume · 6h ago
No matter how much I typically despise American culture, killing people (no matter by which means) over prostitution in an antecedent does not appear to be a part of it.
gowld · 59m ago
Please classify the idiotic and non-idiotic reasons for killing a former lover.
GoblinSlayer · 3h ago
At least you can defend yourself in that case. For comparison in USA you can go to jail for life if FBI drops a picture on your computer.
omeid2 · 7h ago
"A single text" is an absurd reductionism.

People suffer worse than death over words all the time, even in the West. Some folks adhere to honour, some to political groups and ideologies, some to religion, some to their social views; there are words that are treated as violence and responded to accordingly in every context.

g-b-r · 5h ago
Accordingly would be with (violent) words.

Could you tell us where you're from, anyhow?

readthenotes1 · 4h ago
The other replies seem to indicate that cultural diversity is fine as long as it's in accord with their culture.
tobyhinloopen · 8h ago
I think you're giving this character a bit too much credit here, I feel like the violent attack might have some causes unrelated to transliteration of some characters.
crabsand · 2h ago
No one here understands 'sikisinca' as 'sikişince', because as you can see the final vowel is different. There are cases for these to be mixed, though: "sikildim" may be "I'm fscked" but it's usually understood as "sıkıldım", "I'm bored."
ayhanfuat · 8h ago
That doesn't really make sense. "sıkışınca" would become "sikisinca". No one would read it as "sikişince" (the latter a is doing the heavy lifting there). A guy with the same name (I would say not a common name-surname combination considering the same region) was in jail for sexually assaulting a mentally challenged kid. I guess this was just an excuse of a psychopath. https://www.hurriyet.com.tr/gundem/parti-binasinda-ozurlu-ki...
eknkc · 8h ago
Haha yep. I'm Turkish and have been using US layout keyboards my entire life. Therefore, I do not use the Turkish characters online. I use S for Ş, G for Ğ, and it just works; nobody ever complained.

One word that causes issues is "to get bored":

sık - to bore; sik - to fuck

So if I write "sikildim" to say "I got bored", it actually becomes "I got fucked".

One way around it to capitalize. SIKILDIM is "I got bored" but now you are yelling. Typing "sıkıldım" is a hassle on a US keyboard though.

orphea · 8h ago

  The problem was that Emine's cell phone was not localized properly for Turkish and did not have the letter <ı>; when it displayed Ramazan's message, it replaced the <ı>s with <i>s.
Does it make sense? Could the phone arbitrarily replace characters? Or is it more likely that the guy typed dotted i's?
eknkc · 8h ago
I think the article is somewhat of a fabrication.

There might be some truth to it but it does not make much sense. Technically, ı would probably show up as □ instead of i if the phone had a hard time displaying it.

There is also the suffix not matching that change: sıkışınca vs sikişince. A becomes E in that suffix when you switch from ı to i. Even if the phone fucked up, "sikişinca" would look weird.

GoblinSlayer · 3h ago
I noticed that countries with Latin script use the Latin-1 encoding for SMS, because they never really needed Unicode. Then, when software converts text to Latin-1 or ASCII, there's an option to find the best-match character in the ASCII repertoire; I think in that case ı will be converted to i.
foobahhhhh · 7h ago
The family being utterly insane was a minor factor.

Shame there was no concept of self-defence.

donatj · 5h ago
I feel like Turkish should have been given a different entirely separate lowercase "i" character so the pairs could be consistent, like the Greek lookalikes. Considering how historically capital letters came before lowercase it seems like İ should have been considered an entirely separate letter from I.

Greek was given entirely separate characters even though many are indistinguishable from the Latin alphabet. In Greek, for instance, "Ν" lowercases to "ν" instead of "n". The Greek "Ν", however, is not a Latin "N" but an entirely separate character. This makes a lot more sense.
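That separation is visible at the code-point level; for instance, in Python:

```python
# GREEK CAPITAL LETTER NU (U+039D) renders like Latin N (U+004E) but is
# a distinct character with its own lowercase mapping:
assert "Ν" != "N"            # Greek nu vs Latin N: different code points
assert "Ν".lower() == "ν"    # lowercases to GREEK SMALL LETTER NU (U+03BD)
assert "N".lower() == "n"
```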

dhosek · 4h ago
One of the design goals of Unicode was lossless roundtrip conversions to and from legacy encodings. Legacy Turkish used the ASCII i for lowercase and a character in the 128–255 range for İ.

In contrast, Greek encoded its entire alphabet in the 128–255 range even though, e.g., A and Α have identical appearances (similarly with Cyrillic letters).
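Both legacy layouts survive as stdlib codecs in Python, so the asymmetry is easy to verify:

```python
# Legacy Turkish (ISO-8859-9) shares ASCII's byte for dotted i but puts
# İ and ı in the high range; legacy Greek (ISO-8859-7) encodes even
# look-alikes such as Alpha apart from ASCII A:
assert "i".encode("iso-8859-9") == b"i"      # same byte as ASCII
assert "İ".encode("iso-8859-9") == b"\xdd"
assert "ı".encode("iso-8859-9") == b"\xfd"
assert "Α".encode("iso-8859-7") == b"\xc1"   # GREEK CAPITAL LETTER ALPHA
```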

This legacy use is also why, e.g., Thai and Hindi handle their vowel setting differently (in Thai, vowels are separate characters input in display order, in Hindi they’re spacing marks input in phonetic order¹) although both have their origins in the Brahmi script.

1. Some vowels are written before the consonant that they follow phonetically, some come after, a handful come before and after and some are written above or below or modify the shape of the consonant (less sure about this last one—I have meagre Thai skills and almost no Hindi).

OJFord · 3h ago
I know nothing of Thai but in Hindi (devanagari) ignoring standalone vowels (अ a) they come after a consonant to modify its pronunciation, regardless of whether the mark is written predominantly before or after or below. E.g. for the consonant k क् (the mark below there indicates no vowel yet):

    क ka (kuh)
    का kā (kaa)
    कि ki (kih)
    की kī (kee)
    के ke (kay)
    कै kai (keh)
    को ko (koh)
    कौ kau (kaw)
    कु ku (koo)
    कू kū (kooo)
 
I'm not sure about the handwriting stroke order of कि, but digitally you write ki, not ik (because that would be इक - an independent i vowel followed by a ka - 'a' being implicit at the end).
dhosek · 3m ago
Thai, as a result of its legacy encoding, has you write, e.g., โ + อ = โอ even though the vowel โ is pronounced after the consonant อ. And in cases where vowel markings surround the consonant, โ◌ะ, it’s entered as three separate glyphs in display order.
BuyMyBitcoins · 4h ago
I suspect the Greek alphabet was given special treatment because of just how prominent Greek symbols are in math and science.

I posit that engineers and computer scientists gave extra time and attention towards accommodating Greek because they were so familiar with seeing and using those glyphs during their education. They knew that those symbols would be encountered in English, even before full internationalization efforts would take place. Whereas Turkish was merely an afterthought.

Edit: This post of mine is unfounded/inaccurate, thank you to dhosek for providing a proper explanation, see https://news.ycombinator.com/item?id=43905574

tgv · 4h ago
Turkish was explicitly based on the Latin script by Atatürk. If Erdogan gets his way, it'll be reverted.
groos · 1h ago
What a strange statement. Turkish (the language) existed before Atatürk's forced conversion to Romanized script and previously used the extended Arabic script. There are other languages which are written in multiple scripts, even outside of computing. In today's world, it's just as possible to type in Arabic as in Romanized Turkish, so whoever wants to do whatever, the capability exists.
dhosek · 3h ago
Nope, it’s because of how the legacy encodings were handled. See my sibling comment to yours.
BuyMyBitcoins · 3h ago
Thank you for providing the historical context. I have edited my post and forwarded people to your explanation.
donatj · 4h ago
Oh, I don't disagree. Some languages clearly got better treatment by the Unicode committee than others.
alkonaut · 5h ago
I think the key to doing text sanely in programming is separating "text" from "international text" or "user text". "Text" can be, e.g., the characters that make up my XML node names, or all the names of my DB columns, etc. You still have to worry about encodings and everything with this data, but you don't have to worry that there is a 10 byte emoji or a turkish upper case i. A key property of it is: you can, for example, run toUpper or toLower with a default culture. It has symmetric transforms. It can often be assumed to be the ASCII subset, regardless of encoding.

Then on the other end you have text that the user enters. It can be anything (so it may need validation and washing). You may not be able to run "to lower" on it (although I'd be tempted to do it on an email address, for example).

The key is just knowing what you have. It's unfortunate that "string" is usually used for everything from paths to user input to db column names etc.
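A minimal sketch of that split as a validated wrapper type (the name Ident and the ASCII-only rule here are illustrative choices, not a standard API):

```python
class Ident(str):
    """Programmatic 'text': ASCII-only, with symmetric case transforms."""

    def __new__(cls, s: str) -> "Ident":
        if not s.isascii():
            raise ValueError(f"not programmatic text: {s!r}")
        return super().__new__(cls, s)

    def to_upper(self) -> "Ident":
        # ASCII-only input makes upper/lower symmetric and locale-free
        return Ident(str.upper(self))

# user-entered text stays a plain str and never mixes in silently:
col = Ident("customer_id")
assert col.to_upper() == "CUSTOMER_ID"
assert Ident("CUSTOMER_ID").to_upper().lower() == "customer_id"
```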

pie_flavor · 44m ago
If you want a separate language type for programmatic identifiers upon which all operations are defined in a culture-independent way, (a) good luck getting people to use it reliably, and (b) good luck with the numerous places where you have to convert back and forth and thus need all the information you're trying to not need.

Your XML files start with `<?xml version="1.0" encoding="utf-8"?>`. If they then cannot actually support common UTF-8 sequences such as emoji or CJK characters, then your system is bugged, and you should fix it.

HelloNurse · 3h ago
> you don't have to worry that there is a 10 byte emoji or a turkish upper case i

Excluding all emoji is silly but feasible (except for actual thorough custom validation and error handling of all inputs), but excluding some uppercase and lowercase letters because you don't feel up to the task of processing them is demeaning lunacy.

alkonaut · 2h ago
Again, I'm now talking about "known" text in the programming context: text that is neither user input nor presented to a user (e.g. column names, builtin function names in my toy spreadsheet, whatever).

Obviously for text that is both user-input and then displayed back again to users, you are in the other category. Apart from protecting against rendering mishaps and security etc, you probably just want to preserve what they write.

But that was my point: 90% or more of the text you handle is likely in the first category. And very rarely do you even have to deal with text in the second category.

Karliss · 7h ago
This makes me wonder: is there a programming language which has separate data types for locale-aware and locale-independent strings? I know that Rust has OsString but that's a slightly different use case.

The problem with the current widely used approach of having a global application-wide locale setting is that most applications contain a mix of user-facing strings and technical code interacting with file formats or remote APIs. It doesn't matter whether you set it to the current language (or just let the operating system set it) or force it to a language-independent locale; sooner or later something is going to break.

If you are lucky, a programming language might provide some locale-independent string functions, but using them is often clunky and unlikely to be done consistently across a whole code base and all the third-party libraries. It's easier to do things correctly if you are forced to declare the intention from the start and any mixing of different contexts requires an explicit conversion.

GoblinSlayer · 2h ago
AFAIK ruby string has embedded charset.
dhosek · 3h ago
Rust also has an ASCII-specific casefolding function.
sam_lowry_ · 6h ago
C
pjc50 · 3h ago
C doesn't have a string data type, let alone a locale-aware one. No, the Microsoft LPCWSTR madness doesn't count.
NooneAtAll3 · 2h ago
because C doesn't make types aware, but functions?

https://en.cppreference.com/w/c/string/multibyte

pjc50 · 2h ago
Yes. And the C type system isn't rich enough to represent "do not pass this type of string into this function".
bob1029 · 9h ago
System.Globalization is quite the feat of engineering. Setting CultureInfo is like getting onto an actual airplane. I don't know of any other ecosystem with docs like:

https://learn.microsoft.com/en-us/windows/apps/design/global...

sam_lowry_ · 9h ago
It is called locale and has been for many years: https://en.wikipedia.org/wiki/Locale_(computer_software)
pjc50 · 8h ago
> 令和_令_Reiwa_R

twitch

A classic which breaks lots of applications is the difference between number format "1,234.5" and "1.234,5" (some European countries).

simiones · 7h ago
I was actually responsible, some 10 years ago, for introducing a bug like this in an official release of an industry-standard tool for a somewhat niche industry. Some SQL queries we were generating ended up saying `SELECT x FROM t WHERE x < 1,02` if run on any system with commas as the decimal separator. We found it and fixed it a few weeks later, and I don't think we've ever had a complaint from the field about this, but it was still pretty eye-opening about locales.

The extra irony is that my colleagues and I live in a country that actually has this kind of locale, but no one in the entire extended team was using it; everyone uses a US locale.

pmontra · 5h ago
I think that in my country it's “1'234,5”. That's how it was when I learned to write, many years ago, before computers were common.
Ylpertnodi · 8h ago
I've had to adjust to / accommodate the difference between 1. and 1, almost daily.

Very expensive if you fuck up. Very embarrassing if you fuck up, too.

prmph · 2h ago
I wish someone would write a book that distills all the knowledge contained in those "Falsehoods Programmers Believe About X" or "Things Programmers Should Know" topics, providing a resource for how to write real-world practically robust software that works reasonably well anywhere anytime.

The list of gotchas with any non-trivial software is long and frequently obscure.

rolandog · 1h ago
Huh. Trying to find the letter "i" in this page in Firefox for Android results in a 0-based index of results (starts at 0/-1); you get 999/-1 as the last result if you start from the end.
shultays · 3h ago

  const string input = "interesting";
  bool comparison = input.ToUpper() == "INTERESTING";
  Console.WriteLine("These things are equal: " + comparison);
  Console.ReadLine();
Is this a realistic scenario? Changing the case of a string and comparing it to something else? Running some kind of operations & logic on a string that is meant for a user?

If you are doing such things then it looks more like a code smell.

myflash13 · 3h ago
One use case I can think of: email string normalization during login. If your string localization is wrong, simple things like login can fail.

Edit: another use case: full text case insensitive searching of documents
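A hedged sketch of the login case (normalize_email is a hypothetical helper; note that RFC 5321 technically makes the local part case-sensitive, so only the domain is folded here):

```python
def normalize_email(addr: str) -> str:
    # Hypothetical normalizer: fold only the domain, which is always
    # case-insensitive; the local part is left untouched because it is
    # technically case-sensitive. casefold() is Unicode full case
    # folding, a stronger operation than lower().
    local, _, domain = addr.rpartition("@")
    return f"{local}@{domain.casefold()}"

assert normalize_email("Ali@EXAMPLE.COM") == "Ali@example.com"
```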

nemetroid · 2h ago
Case insensitivity is a code smell?
NooneAtAll3 · 2h ago
as the other commenter says - normalization

right now I'm tinkering with an old game that transforms all text inputs to uppercase ASCII

poulsbohemian · 3h ago
When I was in Turkey on a project, the i was absolutely a problem in the software I was trying to deploy. Glad to see this as it's one of those classic "Things Programmers Should Know" topics right up there with all the other classics like address formats and name formats not being the same across the globe.
the_mitsuhiko · 7h ago
Over the years this has shown up a few times because PHP internally was using a locale dependent function to normalize the class names, but it was also doing it inconsistently in a few places. The bug was active for years and has resurfaced more than once: https://bugs.php.net/bug.php?id=18556
dhosek · 3h ago
I was wondering if anyone else remembered this issue.
ndepoel · 8h ago
Ahh yes, been there, done that.

Several years ago we had issues with certification of our game on PS4 because the capitalization on Sony's Turkish translation for "wireless controller" was wrong. The problem being that Turkish dotless I. What was the cause? Some years prior we had had issues with internal system strings (read: stringified enums) breaking on certain international PCs because they were being upper/lowercased using locale-specific capitalization rules. As a quick fix, the choice was made then to change the culture info to invariant globally across the entire game. This of course meant that all strings were now being upper/lowercased according to English rules, including user-facing UI strings. Hence Turkish strings mixing up dotted and dotless I's in several places. The solution? We just pre-uppercased that one "wireless controller" term in our localization sheet, because that was the only bit of text Sony cared about. An ugly fix and we really should have gone through the code to properly separate system strings from UI texts, but it got the job done.

sebstefan · 9h ago
Boy it would sure be easier if the Turkish i was a different unicode character in lowercase too
elevatortrim · 8h ago
Not sure about this. For this to work, one of these would need to happen:

1. Have two "i" characters on Turkish keyboards, one to use when writing in English, one in Turkish. Sounds difficult to get used to. Always need to be conscious about whether writing an "English i", or a "Turkish i".

2. "i" key is interpreted as English "i" when in English locale, as a special unicode character when in Turkish locale. This would be a nightmare as you would then always have to be conscious of your locale. Writing in English? Switch to English locale. Writing code? Switch to English locale. Writing a Turkish string literal in code? Switch to Turkish, then switch back. It would need to be a constant switching between back and forth even though both are Latin alphabet.

alexey-salmin · 8h ago
> 1. Have two "i" characters on Turkish keyboards, one to use when writing in English, one in Turkish. Sounds difficult to get used to. Always need to be conscious about whether writing an "English i", or a "Turkish i".

But you have to do that anyway to be able to produce the correct capitalized version: an "English I" or a "Turkish İ".

daveliepmann · 8h ago
No: a Turkish keyboard has separate i/İ and ı/I keys, and Türkish-writing users with an American/international keyboard use a keyboard layout with modifier keys so that the i/I key can be altered to ı/İ. (I do the latter for idiosyncratic reasons.)

The person you're replying to is pointing out that differentiating English-i from Türkish-i requires some other unwieldy workaround. Would you expect manufacturers to add a third key for English i, or for people with Turkish keyboards to use a modifier key (or locale switching) to distinguish i from i? All workarounds seem extraordinarily unlikely.

elevatortrim · 7h ago
Hmm, you are kind of right but not exactly:

Yes, there are two keys, but their function is not to write the character as a "Turkish i" and an "English i". These keys are necessary because there are 4 variations, which need 2 keys to cover with Caps Lock on and off:

Key 1 - big and small Turkish "I". Caps Lock on: I; Caps Lock off: ı

Key 2 - big and small Turkish "İ". Caps Lock on: İ; Caps Lock off: i

For small "Turkish i" and "English i" to be different characters, there would need to be a third key.

JimDabell · 7h ago
> "i" key is interpreted as English "i" when in English locale, as a special unicode character when in Turkish locale. This would be a nightmare as you would then always have to be conscious of your locale.

Isn’t this already the case with other languages? For instance, the same key on the keyboard produces a semicolon (;) in English and a Greek question mark (;) in Greek. These are distinct characters that are rendered the same (and also an easy way to troll a developer who uses an editor that doesn’t highlight non-ASCII confusables).
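The two are distinct code points, although (perhaps surprisingly) Unicode normalization collapses them, since the Greek question mark canonically decomposes to the ordinary semicolon:

```python
import unicodedata

semicolon = "\u003b"   # ;  SEMICOLON
greek_q = "\u037e"     # ;  GREEK QUESTION MARK
assert greek_q != semicolon
# U+037E has a singleton canonical decomposition to U+003B:
assert unicodedata.normalize("NFC", greek_q) == semicolon
```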

sebstefan · 8h ago
Ah, that's because I thought Turks and Azerbaijanis just switched keyboard layouts to type in English and to type in their native language.
elevatortrim · 7h ago
That's a sensible thought, but the Turkish QWERTY keyboard includes both the English-only (Q, X, W) and the Turkish-only characters, so switching is rarely required.
lifthrasiir · 9h ago
Impossible, because the decision was already made by Turkish encodings, which forced Unicode to pick only one option (round-trip compatibility with legacy encodings) out of the possible trade-offs.
alexey-salmin · 8h ago
What were the other possible trade-offs? I don't really see how a lack of round-trip compatibility is worse than what we have now. It's breaking the whole idea of Unicode code points, and for what?
thaumasiotes · 6h ago
Actually it reflects the idea of Unicode code points correctly. They are meant to represent graphs, not semantics.

This isn't honored; we have many Unicode code points that look identical by definition and differ only in their secret semantics, but all of those points are in violation of the principles of Unicode. The Turkish 'i' is doing the right thing.

ubutler · 4h ago
> Actually it reflects the idea of Unicode code points correctly. They are meant to represent graphs, not semantics.

Why do we then have lots of invisible characters that are intended essentially as semantic markers (eg, zero-width space)?

alexey-salmin · 5h ago
How do you define "look identical" outside of fonts which from my understanding were excluded from Unicode consideration on purpose?

E.g. Cyrillic "а" looks the same as Latin "a" most of the time, they both are distant descendants of the Phoenician 𐤀, but they are two different letters now. I'm very glad they have different code points, it would be a nightmare otherwise.

gtbot2007 · 3h ago
No that’s the opposite of how it’s supposed to work
zokier · 7h ago
How would separate code point break round-tripping specifically?
dhosek · 3h ago
The legacy Turkish encoding used ASCII i but a character in the 128–255 range for İ. Remember that not all documents are monolingual, so you might have a document with, e.g., both English and Turkish text, and in the legacy code page these would use i for both the English and the Turkish letter.
zokier · 1h ago
That didn't answer the question, why would separate code point for Turkish lower-case dotted i break round-tripping?
dhosek · 8m ago
Because just because something is in Turkish doesn't mean it doesn't also include non-Turkish text. So you end up with weird edge cases when translating mixed text back and forth: it would be a single glyph in legacy Turkish 8-bit text but two glyphs in Unicode, so under your scheme, a Unicode–legacy–Unicode round trip of text like "Kırgızistan (English: Kyrgyzstan)" would re-encode the i in the English word as the Turkish dotted i.
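A sketch of why the mapping Unicode actually chose keeps the round trip lossless, using ISO-8859-9 (Latin-5, the legacy Turkish code page) via Java's charset support:

```java
import java.nio.charset.Charset;

public class LegacyRoundTrip {
    public static void main(String[] args) {
        // ISO-8859-9 (Latin-5) is the legacy Turkish code page.
        Charset latin5 = Charset.forName("ISO-8859-9");
        String mixed = "Kırgızistan (English: Kyrgyzstan)";
        byte[] legacy = mixed.getBytes(latin5);
        String back = new String(legacy, latin5);
        // Lossless, because Unicode reused ASCII i/I and only added ı/İ:
        System.out.println(back.equals(mixed)); // true
        // In the legacy bytes, both the Turkish and the English dotted i
        // are plain ASCII 0x69, while dotless ı is 0xFD:
        System.out.println(Integer.toHexString(legacy[1] & 0xFF)); // fd (ı)
        System.out.println(Integer.toHexString(legacy[6] & 0xFF)); // 69 (i)
    }
}
```

Had Unicode minted a separate "Turkish dotted i" code point instead, the legacy byte 0x69 would have had two possible Unicode targets, and the round trip would have to guess.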
sebstefan · 6h ago
I don't know about round-tripping, but suddenly having an entire nation's keyboards output UTF-8 on outdated national systems probably designed for Latin-1 seems like a tough sell just to fix this issue.
sebstefan · 9h ago
Yep I'm aware
jeroenhd · 8h ago
It does (U+0131 = Latin Small Letter Dotless I, U+0069 = Latin Small Letter I).

The problem is that uppercasing the dotted i outputs a different character depending on your current locale. Case-insensitive equality checks also break this way (I == i, except in a Turkish locale, so `QUIT ilike quit` is false).
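This gotcha can be reproduced directly in Java, whose case conversions take an explicit Locale (a sketch of the failure mode, not code from the article):

```java
import java.util.Locale;

public class CaseFoldGotcha {
    public static void main(String[] args) {
        Locale tr = Locale.forLanguageTag("tr");
        // Root-locale lowercasing gives the expected "quit"...
        System.out.println("QUIT".toLowerCase(Locale.ROOT)); // quit
        // ...but under a Turkish locale, I lowercases to dotless ı:
        System.out.println("QUIT".toLowerCase(tr)); // quıt
        // So a naive "lowercase both sides" comparison fails:
        System.out.println("QUIT".toLowerCase(tr).equals("quit")); // false
    }
}
```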

rob74 · 8h ago
Yes - the problem is that "i" and "I" are standard ASCII characters, while the dotted I and the dotless i are not. Creating special "Turkish I" and "Turkish i" characters would have been an alternative, but would have had its own issues (e.g. documents where only some "i"s are Turkish and the rest "regular" because different people edited it with different software/settings).
mrspuratic · 6h ago
Irish script traditionally used a dot-less "i", something that persists in current road signage (anecdotally to save confusion with "í", or with adjacent old-style dotted consonants, I can't find a definitive source to cite). It's only an orthographic/type thing, it's semantically an "i", though the Unicode dot-less "i" is sometimes used online to represent it.
tmtvl · 8h ago
Is it? That's weird, I can't find the code for Latin Small Letter Dotted I. There is a Cyrillic dotted I, but that one doesn't have the dot in capitalised form.

What sebstefan is asking for is a Unicode character which is the non-capitalised form of Latin Capital Letter I With Dot Above (U+0130) which always gets capitalised to U+0130 and which U+0130 gets downcased to.

makeitdouble · 8h ago
I'm imagining coding with some random "i" being a different, completely indistinguishable character from the English "i". Or people writing your name and not matching it in their DB because their local "i" is not your "i".

It's a potential issue already depending on your script, and CJK also has this funny full English alphabet all in double-width characters, which makes it a PITA for people who can't distinguish the two. But having it on a character as common as "i" would feel especially hellish to me.

sebstefan · 8h ago
It wouldn't matter

There's already this problem for Cyrillic 'e' and Latin 'e' and hundreds of other characters

People use it to create lookalike URLs and phish people

https://www.pcmag.com/news/chrome-blocks-crafty-url-phishing...
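A minimal illustration of such confusables (hypothetical domain, Java for illustration): the strings never compare equal, and IDN encoding exposes the spoof as Punycode.

```java
public class Homoglyphs {
    public static void main(String[] args) {
        String latin = "example.com";        // hypothetical domain
        String spoofed = "\u0435xample.com"; // Cyrillic е (U+0435) in place of Latin e
        // Visually near-identical, but never equal as strings:
        System.out.println(latin.equals(spoofed)); // false
        // IDN encoding turns the spoofed label into visibly different Punycode:
        System.out.println(java.net.IDN.toASCII(spoofed)); // xn--...
    }
}
```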

makeitdouble · 6h ago
Cyrillic 'е' is isolated in that you switch scripts when writing it. I'd compare it to the Greek Χ.

Turkish isn't a fully separate script; most letters are standard ASCII and only a few are special (it's closer to French or German with their accented characters), so you don't have the explicit switch — it's always mixed.

sebstefan · 6h ago
Then you have the Greek question mark ;
alexey-salmin · 8h ago
> But having it on a character as common as "i" would feel specially hellish to me.

https://en.wikipedia.org/wiki/Dotted_I_(Cyrillic)

whizzter · 4h ago
This highlights the single biggest problem I have with the MS/C#/.NET runtime/ecosystem (the article appears to be by a .NET developer): so many string-handling functions are locale-dependent, and you have to explicitly select the non-locale variants. That becomes an issue when dealing with common data-interchange and file formats, since those usually assume US semantics.

Many European developers run into this frequently since the default parse for float/double/decimal will assume comma as the decimal separator due to our locale settings.
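The same gotcha can be sketched in Java, whose NumberFormat is locale-sensitive while Double.parseDouble is invariant — roughly analogous to .NET's double.Parse with and without an invariant culture:

```java
import java.text.NumberFormat;
import java.util.Locale;

public class LocaleParse {
    public static void main(String[] args) throws Exception {
        String input = "1.234"; // intended as "one point two three four"
        // Locale-sensitive parsing: '.' is a grouping separator in de-DE
        System.out.println(NumberFormat.getInstance(Locale.GERMANY).parse(input)); // 1234
        System.out.println(NumberFormat.getInstance(Locale.US).parse(input));      // 1.234
        // Locale-independent parsing always uses '.' as the decimal separator:
        System.out.println(Double.parseDouble(input)); // 1.234
    }
}
```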

neonsunset · 4h ago
As a developer or a user, you have full control over this:

  <InvariantGlobalization>true</InvariantGlobalization>
or

  DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=true
(it is still an occasional gotcha, but pretty much everyone learned to be aware of it)
whizzter · 2h ago
I've missed that it's possible to use as a default, thank you.

Still, the need to remember it is a silly cognitive load (also, it seems it was introduced only with .NET Core 2; for the old Framework apps I'm still maintaining, this luxury doesn't seem to be available?).

neonsunset · 1h ago
Historically, it was a decision to avoid cognitive load for GUI applications. To be fair, it's been a while since I've heard anyone complain about it - most servers and containers run with both an en-US locale and/or the invariant globalization set to true. It may sometimes be an issue when debugging locally but it can be quickly addressed, if the solution for some reason does not pass the culture explicitly or does not set the invariant globalization.

I think before there was a global toggle it was a pretty bad default, but nowadays it's not a practical challenge at all. It's a "solved problem". As for .NET Framework applications - they have bigger issues to worry about. The teams willing/forced to stay with it know what they signed up for.

Also, why the downvotes?

anticensor · 3h ago
We need a combining character DELETE DOT ABOVE to make i into ı.
NoMoreNicksLeft · 1h ago
I'm still waiting for my application for the symbol for The Artist Formerly Known as Prince to be accepted. Meh, maybe in Unicode 18.0.

PS Apparently the Stargate SG-1 symbols are completely out of the question. How can they be copyrighted if they're based on constellations?

hudo · 9h ago
Reminds me of a friend's old but brilliant project: using Unicode to draw art in stack trace logs! Enough with boring stack traces in logs — let's make some art there and make life a bit easier for the poor soul that's on support and has to debug the latest prod issue. https://medium.com/@ironcev/stack-trace-art-4b700a8817ea
jongjong · 6h ago
This is one of the reasons why software development is so difficult, most people cannot even begin to imagine how complex the user environment can be. Even within very niche problem domains you may have to deal with a broad range of different environments with different locales, different spoken languages, operating systems, programming languages, compilers/transpilers, engine versions, server frameworks, cache engines, load balancers, TLS certificate provisioning, container engines, container image versions, container orchestrators, browsers, browser extensions, frontend frameworks, test environments, transfer protocols, databases (with different client and servers versions), database indexes, schema constraints, rate limiting... I could probably keep going for hours. Now imagine being aware of all these factors (and much more) and being aware of all possible permutations of these; that's what you need in order to be a senior software developer these days. It's a miracle that any human being can produce any working software at all.

As a developer, if some code works perfectly on your own computer, the journey has barely just begun.