The Turkish İ Problem and Why You Should Care (2012)
84 points by Rygian | 5/6/2025, 8:34:17 AM | haacked.com ↗ | 120 comments
So by alternating case you end up with ß→SS→ss or ẞ→ß→SS. Certainly has potential to screw with naive attempts at case-insensitive comparison via case folding. Then again, Unicode adopting 'ẞ' as the upper of 'ß' in some future version would probably only increase that potential further.
I'm interested to hear from people dealing with a lot of German text how much of a problem this is in practice.
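The round trip described above is easy to check in Python 3, whose `str.upper()`/`str.lower()` implement the default, locale-independent Unicode case mappings (a minimal sketch, not a full case-folding treatment):

```python
# ß (U+00DF) uppercases to the two-letter "SS" under default Unicode rules,
# so alternating case is not a round trip: ß -> SS -> ss.
assert "ß".upper() == "SS"
assert "ß".upper().lower() == "ss"

# The capital sharp s ẞ (U+1E9E) lowercases to ß, but ß still uppercases
# to "SS", giving the chain ẞ -> ß -> SS from the comment above.
assert "ẞ".lower() == "ß"
assert "ẞ".lower().upper() == "SS"
```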
The ‘ß’ is a ligature of the old ‘long s’ [1] which was written ‘ſ’ (because it’s common in old texts there is a Unicode code point for it).
This letter has no upper case version. Capitalized words starting with a long ‘ſ’ always used ‘S’.
Now in German language, to make this lowercase long ‘ſ’ a sharp ‘s’, ‘ſ’ followed by ‘z’ was written: ‘ſz’.
And these two were often typeset as a ligature, ‘ß’, for esthetic reasons.
That ligature then became the common case and eventually a letter recognized in German-speaking countries.
As a hypothetical analogy, imagine an ‘ll’ ligature, as in ‘fallacy’, becoming an English letter – by some twist of history.
As we saw, these were lowercase letters. And there is no uppercase version of ‘ſ’.
So the uppercase ‘ẞ’ that is now officially recognized and has a Unicode code point should not look like this.
It's an absolute eye saw, because all that was done was to somehow make the letter look a bit more like a capital.
But its nature of being, originally, two lowercase letters still makes it stand out like an eyesore to people with a background in typography, like myself.
IMHO It should look like ‘SZ’ (or ‘SS’), made into a ligature.
And as a type designer, I'd either refrain from filling that code point in a font I design, to protest this, or do the above: create a ligature of ‘SZ’ or ‘SS’ (alternative) and put that there.
[1] https://en.m.wikipedia.org/wiki/Long_s
First of all, 'ß' was a ligature -- a long time ago. It is a letter today. Disassembling it according to its original construction makes no sense today for any kind of argument about typesetting or Unicode. Further, 'ſ' is not used today in German at all, except for meta discussions like this or to stress how things used to be spelled. It makes no sense to mention it unless you are talking about font design or historic use of German (and other languages, for that matter).
Also, if you do mention it for the sake of talking about font design: in Latin fonts, 'ſs' is actually the basis for the design of 'ß', not 'ſz' -- that was mainly done in Blackletter/Fraktur, where the 'z' looked different, maybe a bit like 'ʒ' (I used Unicode's 'ezh' here, hoping it looks right), so that old-style 'ß' looks like a ligature of 'ſʒ'. This can still be seen occasionally, e.g., on Berlin street name signs. It is obsolete in most fonts today (although I quite like it).
Moreover, there is an upper case letter for 'ß': 'ẞ'. And it existed in fine typography way before being adopted into Unicode. Actually, its existence was probably the reason why it is now in Unicode. The official German rules are now: use either 'SS' or 'ẞ' for uppercase 'ß'. Most Germans probably do not even know that 'ẞ' exists as a choice today, although it was used on 'DER GROẞE DUDEN' even before Unicode existed.
And finally, how a glyph is designed is not necessarily decided on whether historic parts of an ancient ligature had upper case variants. So that 'ſ' has no upper case equivalent is irrelevant for both Unicode and type design.
But as a font designer or anything else, you can protest. No problem. Everyone has the right to protest. But please don't flood the Internet with wrong information, as there is enough of it already.
And I don't think 'SS'<->'ß' is similar to the Turkish 'I with/without dot' problem, because the default Unicode mapping for 'ß' is correct in all languages, while the Turkish (and also Azerbaijani) behavior is correct or broken depending on the language setting. That is way more problematic, because an assumed universal equivalence does not hold. And you need to carefully distinguish whether a string is language specific or not, e.g., path names or IDs in databases, etc.
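The language-independence of the 'ß' mapping can be seen in Python 3, where `str.casefold()` implements Unicode full case folding (a sketch of the distinction between lowercasing and caseless comparison, not locale-aware code):

```python
# lower() leaves ß untouched; casefold() expands it to "ss" so that
# caseless comparison works regardless of language settings.
assert "ß".lower() == "ß"
assert "ß".casefold() == "ss"
assert "straße".casefold() == "STRASSE".casefold()
```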
I don't know if this counts as "correct" but it's still very confusing.
now Unicode philosophers have to ponder a breaking change vs introducing a new, duplicate ß code point LATIN SMALL LETTER SHARP S WITH CAPITAL SHARP S which as upper case has encoded the proper ẞ
two red buttons meme here...
It may be a typographical abomination, but it's an intentional representation of that particular typographical abomination, just as the ox head in "A" intentionally has its horns pointing down.
I assume you mean "eyesore"
https://en.wikipedia.org/wiki/Mondegreen
No, we don’t have four i’s in Turkish. I(ı) and İ(i) are two separate letters. The Turkish alphabet has 29 letters, and each letter has its own key on a Turkish keyboard. We also have Öö, Üü, Çç, Şş and Ğğ. These are all individual letters, not dotted versions, accents or a writing convention. So the language is as simple as it gets. The complications come from mapping the letters to the English alphabet.
Other languages are hit by the same hard fate.
German has öäüÖÄÜ and ßẞ, yet typographically the only "real" letter among these is ß/ẞ; the others are umlauted aouAOU.
Linguistically they are letters in their own right, with obscure rules for sorting and capitalization, especially if the typeface doesn't have capital ÄÖÜẞ. Then they become what they once were: AE, OE, UE, SZ...
and that's what the article is about: locale matters.
and in that context you have four i-like glyphs in tr-TR. and if you do anything locale sensitive in your code, like case folding, better set the locale explicitly...
(I guess that when it was made, more than one hundred years ago, it didn't seem that bad)
1. i
2. I
> No, we don’t have four i’s in Turkish. I(ı) and İ(i) are two separate letters
1. ı
2. I
3. i
4. İ
--> 4
So while we have two i's, they have four
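Under the default (non-Turkish) Unicode simple case mappings, which is what Python 3's `str` methods use, the four glyphs listed above do not round-trip (a quick demonstration):

```python
# The English pair round-trips cleanly:
assert "i".upper() == "I"
assert "I".lower() == "i"

# Turkish dotless ı (U+0131) uppercases to plain I under default rules...
assert "ı".upper() == "I"
# ...but lowering that I gives dotted i back, silently changing the letter:
assert "ı".upper().lower() == "i"

# Dotted capital İ (U+0130) is never produced by uppercasing plain i:
assert "i".upper() != "İ"
```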
1. i
2. I
No, we don't have four i's in English, I(i) and J(j) are two separate letters
1. i
2. I
3. j
4. J
--> 4
Greek was given entirely separate characters even though many are indistinguishable from the Latin alphabet. In Greek, for instance, 'Ν' lowercases to 'ν' instead of 'n'. The Greek 'Ν' is not a Latin 'N' but an entirely separate character. This makes a lot more sense.
In contrast, Greek encoded its entire alphabet in the 128–255 range even though, e.g., A and Α have identical appearances (similarly with Cyrillic letters).
This legacy use is also why, e.g., Thai and Hindi handle their vowel setting differently (in Thai, vowels are separate characters input in display order, in Hindi they’re spacing marks input in phonetic order¹) although both have their origins in the Brahmi script.
⸻
1. Some vowels are written before the consonant that they follow phonetically, some come after, a handful come before and after and some are written above or below or modify the shape of the consonant (less sure about this last one—I have meagre Thai skills and almost no Hindi).
I posit that engineers and computer scientists gave extra time and attention towards accommodating Greek because they were so familiar with seeing and using those glyphs during their education. They knew that those symbols would be encountered in English, even before full internationalization efforts would take place. Whereas Turkish was merely an afterthought.
Edit: This post of mine is unfounded/inaccurate, thank you to dhosek for providing a proper explanation, see https://news.ycombinator.com/item?id=43905574
https://languagelog.ldc.upenn.edu/nll/?p=73
I'm Turkish. I grew up in Turkey. These things happen, but let's not try to justify them. We should aim to get to a point where people share these "western values" (of not stabbing people).
People suffer worse than death over words all the time, even in the West. Some folks adhere to honour, some to political groups and ideologies, some to religion, some to their social views; there are words that are treated as violence and responded to accordingly in every context.
Could you tell us where you're from, anyhow?
One word that's causing issues is the one for getting bored:
sık – to bore
sik – to fuck
So if I write "sikildim" to say "I got bored", it actually becomes "I got fucked".
One way around it to capitalize. SIKILDIM is "I got bored" but now you are yelling. Typing "sıkıldım" is a hassle on a US keyboard though.
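In real code the clean fix is a locale-aware case mapping (e.g. via ICU); as a minimal, dependency-free sketch, you can pre-map the two problem capitals before the generic pass. `lower_tr` here is a hypothetical helper name, not a standard API:

```python
def lower_tr(text: str) -> str:
    """Turkish-aware lowercasing sketch: map İ (U+0130) -> i and
    I (U+0049) -> ı (U+0131) before the default lowercase pass."""
    return text.replace("İ", "i").replace("I", "ı").lower()

assert lower_tr("SIKILDIM") == "sıkıldım"   # bored, not the other thing
assert lower_tr("İSTANBUL") == "istanbul"
```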
There might be some truth to it but it does not make much sense. Technically, ı would probably show up as □ instead of i if the phone had a hard time displaying it.
There is also the suffix not matching that change: sıkışınca vs sikişince. A becomes E in that suffix when you switch from ı to i. Even if the phone fucked up, "sikişinca" would look weird.
Shame there was no concept of self-defence.
Then on the other end you have text that the user enters. It can be anything (so it may need validation and sanitization). You may not be able to run "to lower" on it (although I'd be tempted to do it on an email address, for example).
The key is just knowing what you have. It's unfortunate that "string" is usually used for everything from paths to user input to db column names etc.
Excluding all emoji is silly but feasible (except for actual thorough custom validation and error handling of all inputs), but excluding some uppercase and lowercase letters because you don't feel up to the task of processing them is demeaning lunacy.
Obviously for text that is both user-input and then displayed back again to users, you are in the other category. Apart from protecting against rendering mishaps and security etc, you probably just want to preserve what they write.
But that was my point: 90% or more of the text you handle is likely in the first category. And very rarely do you even have to deal with text in the second category.
The list of gotchas with any non-trivial software is long and frequently obscure.
The problem with the currently widespread approach of a global, application-wide locale setting is that most applications contain a mix of user-facing strings and technical code interacting with file formats or remote APIs. It doesn't matter whether you set it to the current language (or just let the operating system set it) or force it to a language-independent locale; sooner or later something is going to break.
If you are lucky, a programming language might provide some locale-independent string functions, but using them is often clunky and unlikely to be done consistently across the whole code base and all the third-party libraries. It's easier to do things correctly if you are forced to declare the intention from the start and any mixing of different contexts requires an explicit conversion.
https://en.cppreference.com/w/c/string/multibyte
If you are doing such things then it looks more like a code smell.
Edit: another use case: full text case insensitive searching of documents
Right now I'm tinkering with an old game that transforms all text input to uppercase ASCII.
https://learn.microsoft.com/en-us/windows/apps/design/global...
twitch
A classic which breaks lots of applications is the difference between number format "1,234.5" and "1.234,5" (some European countries).
The extra irony is that my colleagues and I live in a country that actually has this kind of locale, but no one in the entire extended team was using it; everyone uses a US locale.
Very expensive if you fuck up. Very embarrassing if you fuck up, too.
Several years ago we had issues with certification of our game on PS4 because the capitalization in Sony's Turkish translation for "wireless controller" was wrong. The problem being that Turkish dotless I. What was the cause? Some years prior we had had issues with internal system strings (read: stringified enums) breaking on certain international PCs because they were being upper/lowercased using locale-specific capitalization rules. As a quick fix, the choice was made then to change the culture info to invariant globally across the entire game. This of course meant that all strings were now being upper/lowercased according to English rules, including user-facing UI strings. Hence Turkish strings mixing up dotted and dotless I's in several places. The solution? We just pre-uppercased that one "wireless controller" term in our localization sheet, because that was the only bit of text Sony cared about. An ugly fix, and we really should have gone through the code to properly separate system strings from UI texts, but it got the job done.
Many European developers run into this frequently since the default parse for float/double/decimal will assume comma as the decimal separator due to our locale settings.
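One defensive pattern is to never parse with the ambient locale at all and instead require the separators to be stated explicitly. A minimal sketch (`parse_decimal` is a hypothetical helper, not a library function):

```python
def parse_decimal(text: str, decimal_sep: str = ".") -> float:
    """Parse a number with an explicitly declared decimal separator,
    instead of trusting whatever locale the process happens to run in."""
    group_sep = "," if decimal_sep == "." else "."
    return float(text.replace(group_sep, "").replace(decimal_sep, "."))

assert parse_decimal("1,234.5") == 1234.5                    # US-style input
assert parse_decimal("1.234,5", decimal_sep=",") == 1234.5   # European-style
```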
Still, the need to remember it is a silly cognitive load (also, it seems it was introduced only in .NET Core 2; I'm still maintaining old Framework apps where this luxury doesn't seem to be available).
1. Have two "i" characters on Turkish keyboards, one to use when writing in English, one in Turkish. Sounds difficult to get used to. Always need to be conscious about whether writing an "English i", or a "Turkish i".
2. "i" key is interpreted as English "i" when in English locale, as a special unicode character when in Turkish locale. This would be a nightmare as you would then always have to be conscious of your locale. Writing in English? Switch to English locale. Writing code? Switch to English locale. Writing a Turkish string literal in code? Switch to Turkish, then switch back. It would need to be a constant switching between back and forth even though both are Latin alphabet.
But you have to do that anyway to be able to produce the correct capitalized version: an "English I" or a "Turkish İ".
The person you're replying to is pointing out that differentiating English-i from Türkish-i requires some other unwieldy workaround. Would you expect manufacturers to add a third key for English i, or for people with Turkish keyboards to use a modifier key (or locale switching) to distinguish i from i? All workarounds seem extraordinarily unlikely.
Yes, there are two keys, but their function is not to write the character as a "Turkish i" and an "English i". These keys are necessary because there are 4 variations, that need 2 keys to write with caps lock on and off:
Key 1 - big and small Turkish "I": Caps Lock on → I, Caps Lock off → ı
Key 2 - big and small Turkish "İ": Caps Lock on → İ, Caps Lock off → i
For small "Turkish i" and "English i" to be different characters, there would need to be a third key.
Isn’t this already the case with other languages? For instance, the same key on the keyboard produces a semicolon (;) in English and a Greek question mark (;) in Greek. These are distinct characters that are rendered the same (and also an easy way to troll a developer who uses an editor that doesn’t highlight non-ASCII confusables).
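The two characters really are distinct code points, and interestingly Unicode normalization folds them back together; this can be checked with Python's `unicodedata` module:

```python
import unicodedata

semicolon = "\u003b"   # ASCII semicolon
greek_q = "\u037e"     # Greek question mark, renders identically
assert semicolon != greek_q
assert unicodedata.name(greek_q) == "GREEK QUESTION MARK"
# U+037E has a canonical decomposition to U+003B, so NFC merges them:
assert unicodedata.normalize("NFC", greek_q) == semicolon
```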
This isn't honored; we have many Unicode code points that look identical by definition and differ only in their secret semantics, but all of those points are in violation of the principles of Unicode. The Turkish 'i' is doing the right thing.
Why do we then have lots of invisible characters that are intended essentially as semantic markers (eg, zero-width space)?
E.g. Cyrillic "а" looks the same as Latin "a" most of the time, they both are distant descendants of the Phoenician 𐤀, but they are two different letters now. I'm very glad they have different code points, it would be a nightmare otherwise.
The problem is that uppercasing the dotted i outputs a different character depending on your current locale. Case-insensitive equality checks also break this way (I==i, except in a Turkish locale, so `QUIT ilike quit` is false).
What sebstefan is asking for is a Unicode character which is the non-capitalised form of Latin Capital Letter I With Dot Above (U+0130) which always gets capitalised to U+0130 and which U+0130 gets downcased to.
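No such character exists today: under the default (non-Turkish) Unicode special casing, lowercasing U+0130 produces two code points, and uppercasing them does not recombine into U+0130. In Python 3:

```python
lowered = "\u0130".lower()   # LATIN CAPITAL LETTER I WITH DOT ABOVE
# The result is 'i' followed by COMBINING DOT ABOVE: two code points, not one.
assert lowered == "i\u0307"
assert len(lowered) == 2
# Uppercasing does not round-trip back to U+0130:
assert lowered.upper() == "I\u0307"
```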
It's a potential issue already depending on your script, and CJK also has this funny full English alphabet, but all in double-width characters, which makes it a PITA for people who can't distinguish the two. But having it on a character as common as "i" would feel especially hellish to me.
There's already this problem for Cyrillic 'е' and Latin 'e', and hundreds of other characters.
People use it to create lookalike URLs and phish people
https://www.pcmag.com/news/chrome-blocks-crafty-url-phishing...
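A crude way to catch such lookalikes is to check for mixed scripts; the heuristic below (a sketch, with a hypothetical `has_mixed_scripts` helper) uses `unicodedata.name` to reveal that identical-looking glyphs belong to different scripts:

```python
import unicodedata

latin_e, cyrillic_e = "\u0065", "\u0435"   # look identical in most fonts
assert latin_e != cyrillic_e
assert unicodedata.name(cyrillic_e) == "CYRILLIC SMALL LETTER IE"

def has_mixed_scripts(s: str) -> bool:
    """Flag strings that mix Latin and Cyrillic letters (phishing heuristic)."""
    names = [unicodedata.name(c, "") for c in s]
    latin = any(n.startswith("LATIN") for n in names)
    cyrillic = any(n.startswith("CYRILLIC") for n in names)
    return latin and cyrillic

assert has_mixed_scripts("googl\u0435.com")   # final 'е' is Cyrillic
assert not has_mixed_scripts("google.com")
```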
Turkish isn't written in a fully separate script; most letters are standard ASCII and only a few are special (it's closer to French or German with their accented characters), so you don't have the explicit switch; it's always mixed.
https://en.wikipedia.org/wiki/Dotted_I_(Cyrillic)
PS Apparently the Stargate SG-1 symbols are completely out of the question. How can they be copyrighted if they're based on constellations?
As a developer, if some code works perfectly on your own computer, the journey has barely just begun.