Need Help Understanding UNICODE!

I just did the sort challenge in JavaScript and stumbled upon the “Unicode code point system” for the first time. I’m having trouble understanding the whole concept. It would be a great help if you could share some useful links for understanding it, or give a brief beginner-level explanation yourself.
Thanks in advance 🙂

Did you try googling “what is unicode”? If you’ve already read the documentation, Wikipedia, etc., then what part(s) don’t you understand and want help with?

This is a good starting point

A string, i.e. a sequence of characters, is represented as numbers in computer programs, and each distinct character needs a unique numeric assignment. The English alphabet has 26 letters, so we could say A is 0, B is 1, …, Z is 25. That works for A-Z, but what about lowercase letters? Surely there are times when a is treated differently from A, so let’s keep the A-Z assignments and say a is 26, b is 27, …, z is 51. That takes care of upper- and lowercase A-Z, but what about the digits themselves? OK, let’s say 0-9 are assigned 52-61. What about punctuation: comma, period, question mark, exclamation point, …? What about space and tab?
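
To make that concrete, here’s a toy version of such an assignment in JavaScript. The table itself is made up purely for illustration; it’s not any real standard:

```js
// A made-up character-to-number table, just to illustrate the idea
const alphabet =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

const encode = (ch) => alphabet.indexOf(ch); // character -> number
const decode = (n) => alphabet[n];           // number -> character

console.log(encode('A')); // 0
console.log(encode('a')); // 26
console.log(encode('5')); // 57
console.log(decode(25));  // 'Z'
```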

You see where I’m going with this. ASCII is an old standard that assigns the numeric codes 0-127 to the characters commonly used in English, but it covers just 128 characters.
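
You can see those ASCII codes directly from JavaScript, since the first 128 Unicode code points kept the ASCII assignments:

```js
console.log('A'.charCodeAt(0));       // 65
console.log('a'.charCodeAt(0));       // 97
console.log(' '.charCodeAt(0));       // 32
console.log(String.fromCharCode(90)); // 'Z'
```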

Now, what about characters in other languages? If we don’t need to mix and match languages in the same program, we could use similar, possibly overlapping assignments for other languages. For example, Arabic letters, digits, and punctuation are quite different from English, but a program that only needs to process Arabic strings could use the same numbers 0-127 to represent just Arabic characters, and it would work.

These separate per-language representations have an inherent limitation: they are unusable for multiple languages at the same time. People came up with systems for switching between representations, so a program could use 0-127 for English at one point and switch to 0-127 for Arabic at another. It gets complicated fast, but did not seem to get old fast enough, because such systems are still in use. In reality, 0-127 got reserved for English, and other languages began to use 128-255, because 0-255 fits in a single byte of 8 bits.

Unicode is an attempt to standardize a common representation for all the languages of the world. Clearly 128 or 256 numbers are not enough; the numbers run past a million. These numbers are called code points. The code point for A is 65, a is 97, 0 is 48, the Arabic alif ا is 1575, and the Arabic-Indic zero ٠ is 1632.
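
You can verify all of these numbers in JavaScript with codePointAt and String.fromCodePoint:

```js
console.log('A'.codePointAt(0)); // 65
console.log('a'.codePointAt(0)); // 97
console.log('0'.codePointAt(0)); // 48
console.log('ا'.codePointAt(0)); // 1575 (Arabic alif)
console.log('٠'.codePointAt(0)); // 1632 (Arabic-Indic zero)

console.log(String.fromCodePoint(1575)); // 'ا'
```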

Unicode lets programs handle strings in all languages of the world
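
This is also exactly what you ran into in the sort challenge: by default, Array.prototype.sort compares strings by their numeric code values, which is why uppercase letters sort before lowercase ones:

```js
console.log(['banana', 'apple', 'Cherry'].sort());
// ['Cherry', 'apple', 'banana'] because 'C' (67) < 'a' (97)

// For human-friendly ordering, compare by locale rules instead:
console.log(['banana', 'apple', 'Cherry'].sort((a, b) => a.localeCompare(b)));
// ['apple', 'banana', 'Cherry']
```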

Once upon a time, there were a million and one different encoding systems for all the languages of the world. Because we didn’t know any better back then, we would often send information without including the proper metadata about which encoding it was using. Sometimes, software developers would neglect to check which encoding was intended before displaying the information.

Those were dark times, filled with mojibake and ����.

Then, along came Unicode and solved all our problems. The end.

…Well, not quite. Software still exists that doesn’t use UTF-8 as standard. Developers still forget to specify the encoding their content uses. And even the Unicode standard itself is far from perfect… you only have to look at the problems that characters outside the Basic Multilingual Plane (stored as two-code-unit surrogate pairs) cause in programming languages such as our beloved JavaScript, or the struggle to get speakers of some languages to embrace Unicode as a standard, to see that there’s still work to be done.
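
Here’s that JavaScript quirk in action: a character outside the Basic Multilingual Plane is stored as a surrogate pair of two UTF-16 code units, so the familiar string APIs can be surprising:

```js
const s = '😀'; // U+1F600

console.log(s.length);         // 2 -- two UTF-16 code units
console.log([...s].length);    // 1 -- one code point
console.log(s.charCodeAt(0));  // 55357 (high surrogate, not a real character)
console.log(s.codePointAt(0)); // 128512 (the actual code point)
```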

But on the whole… Unicode is an excellent solution to a difficult problem: how to display the many and varied languages of the world without �������ing your ���.