I thought most sane and modern languages use Unicode block identification to determine whether something can be used in a valid identifier or not. For example, all the ‘numeric’ Unicode characters can’t be at the beginning of an identifier, similar to how you can’t have ‘3var’.
So once your programming language supports Unicode, it will automatically support any language whose script uses those particular blocks.
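To make that concrete, here’s a minimal Python sketch of the idea, approximating the rule with Unicode general categories (letters and letter-like numerals can start an identifier, digits can only continue one). Real languages, Python included, use the stricter XID_Start/XID_Continue properties instead, so treat this as an illustration, not the actual spec:

```python
import unicodedata

# Rough approximation: letters (L*) and letter-like numerals (Nl) may
# start an identifier; digits (Nd), marks (Mn, Mc) and connector
# punctuation (Pc, which includes '_') may only continue one.
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}

def is_identifier(s: str) -> bool:
    if not s:
        return False
    if s[0] != "_" and unicodedata.category(s[0]) not in START:
        return False
    return all(unicodedata.category(c) in CONTINUE for c in s[1:])

print(is_identifier("переменная3"))  # True: the digit is not first
print(is_identifier("3var"))         # False: digits can't start an identifier
```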
Sanity is subjective here. There are reasons to disallow non-ASCII characters, for example to prevent identical-looking characters from causing sneaky bugs in the code, like this but unintentional: https://en.wikipedia.org/wiki/IDN_homograph_attack (and yes, don’t you worry, this absolutely can happen unintentionally).
OCaml’s old m17n compiler plugin solved this by requiring you to pick one block per ‘word’: you can only switch to another block at an underscore. As such you can do print_แมว but you couldn’t do pℝint_c∀t. This is a totally reasonable solution.
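Roughly, the rule could be checked like this; a toy Python sketch assuming a hand-rolled block table with just three ranges (the real plugin would have used the full Unicode block data):

```python
# Toy block table; real Unicode defines hundreds of blocks.
BLOCKS = [
    ("Basic Latin", range(0x0000, 0x0080)),
    ("Cyrillic",    range(0x0400, 0x0500)),
    ("Thai",        range(0x0E00, 0x0E80)),
]

def block_of(ch: str) -> str:
    for name, rng in BLOCKS:
        if ord(ch) in rng:
            return name
    return "Unknown"

def one_block_per_word(ident: str) -> bool:
    # Blocks may only change at underscores: each word must be uniform.
    return all(
        len({block_of(ch) for ch in word}) <= 1
        for word in ident.split("_")
    )

print(one_block_per_word("print_แมว"))   # True: each word uses one block
print(one_block_per_word("pℝint_c∀t"))  # False: blocks mixed inside a word
```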
That’s pretty cool.

I can’t imagine how something like a homograph attack could happen accidentally. If someone does this in code, they probably intended to troll other contributors.
Multilingual users have multiple keyboard layouts, usually switching with Alt+Shift or a similar key combo. If you’re multitasking you might not realize you’re on the wrong layout. So say you’re chatting with someone in Russian, then you Alt+Tab to your source code and spot a typo: you wrote my_var_xopy instead of my_var_copy. You delete the x and type in a c. You forget this happened, and you never realize the keyboard layout was wrong.
That c that you typed is now actually с, Cyrillic Es.

What do you say, is that realistic enough?
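And the language will happily accept it; a quick Python demonstration of the result (the second identifier here really contains U+0441, not U+0063):

```python
import unicodedata

# These two names look identical, but the 'с' in the second one is
# U+0441 CYRILLIC SMALL LETTER ES, so they are distinct identifiers.
my_var_copy = 1
my_var_сopy = 2  # the 'fixed typo', typed on the Russian layout

print(my_var_copy, my_var_сopy)   # 1 2 -- two separate variables
print("c" == "с")                 # False
print(unicodedata.name("с"))      # CYRILLIC SMALL LETTER ES
```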
I use multilingual keyboard layouts, so I know that at least on Windows it’s specific to each window. If I chat with someone in one language, then switch to my IDE, it will not keep the layout I used in the chat window.
But I also have accidentally hit the combination to change layouts while doing something, so it can happen. I’m just surprised that Cyrillic с is on the same key as C, instead of S.
I believe there’s a setting for whether it’s global or per-window. Personally I prefer global, because I can’t keep track of more than one state, and I absolutely hate the experience of typing something and getting a different language than I expect.
Sorry, I forgot about this. I meant to say that any sane modern language that allows Unicode should use the block specifications (e.g. to determine which code points are alphabetic, numeric, symbolic, and so on) to apply rules similar to the ASCII ones, so that it doesn’t have to support each language individually.
Oh, that I agree with. But then there’s the mess of Unicode updates, and if you’re using an old version of the compiler that was built with an old version of Unicode, it might not recognize every character you use…
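For what it’s worth, Python exposes the Unicode version its tables were built from, so you can see this mismatch directly; a small sketch (the sample character assumes Nag Mundari, a script added in Unicode 15.0):

```python
import unicodedata

# Version of the Unicode Character Database this interpreter ships with.
print(unicodedata.unidata_version)  # e.g. "15.0.0" on CPython 3.12

# U+1E4D0 NAG MUNDARI LETTER O was added in Unicode 15.0; to an
# interpreter built against an older UCD it is an unassigned code
# point, so it won't count as a valid identifier.
ch = "\U0001E4D0"
print(ch.isidentifier())  # False on pre-15.0 builds, True afterwards
```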