Character Encoding
In computer terminology, character encoding refers to a system in which each character is represented by a specific numeric value, and that value varies depending on the encoding system used. The term "encoding" means converting or transforming one value into another (for example, a string into a binary value and vice versa).
A character represents:
- Numbers
- Letters (uppercase or lowercase, from any language)
- Symbols: @ - _ ! ? # & / . ,
- Whitespace characters
- Control characters, which do not represent any visible symbol but are bits of information used for text processing, such as TAB, ENTER (carriage return / line feed), DELETE, ESC, etc.
Characters displayed on the screen are actually stored in memory as numeric values. The computer maps those numeric values to visible characters, and back again, according to the character encoding in use.
An encoding standard (system, schema) is a database where each character is assigned a numeric value. Character encoding is the process of converting specific characters into their numeric equivalents and vice versa, based on the encoding schema used. Different languages have different characters, so there are various encoding systems/schemas designed to translate these languages into numeric values.
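As a small illustration of this mapping, here is a minimal Python sketch using only built-in functions; the numeric values shown are the ones assigned by the ASCII/Unicode tables, and other encodings may assign different values:

# Characters - including control characters - are stored as numbers.
print(ord('A'))        # 65  -> the number stored for the letter 'A'
print(ord('\t'))       # 9   -> even a control character like TAB is a number
print(chr(65))         # 'A' -> and the mapping works in reverse

# Encoding turns a whole string into bytes; decoding turns bytes back
# into a string, using an agreed-upon encoding schema (here UTF-8).
data = "Hi!".encode("utf-8")
print(list(data))             # [72, 105, 33]
print(data.decode("utf-8"))   # 'Hi!'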
In most text documents, as well as web pages and other data involving strings, the encoding system used for writing that string is specified and stored. This ensures that when the string is opened on another computer, each character is displayed correctly. For instance, if you write a string in Cyrillic and send it to another computer, the Cyrillic letter “И” is represented by the number 201. The receiving computer will recognize that the numeric value 201 corresponds to the letter “И” in Cyrillic and display it accordingly. It is crucial for both computers to use the same encoding system; otherwise, there is a possibility that the Cyrillic letters may be displayed as different characters or even as the placeholder symbol: �.
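What this looks like in practice can be sketched with Python's built-in codecs; Windows-1251 is assumed here as the Cyrillic code page, and the exact numeric values differ between Cyrillic encodings:

# The sender encodes Cyrillic text with a Cyrillic code page.
data = "И".encode("cp1251")
print(list(data))              # [200] in Windows-1251; other code pages differ

# Decoding with the same code page brings the letter back correctly.
print(data.decode("cp1251"))   # 'И'

# Decoding with a different code page turns the byte into another character...
print(data.decode("cp1252"))   # 'È'
# ...and a decoder that cannot interpret the byte at all falls back to
# the replacement symbol.
print(data.decode("utf-8", errors="replace"))   # '�'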
To avoid these problems, universal encoding systems were invented, such as Unicode, which can represent every character in the world in the same way. This ensures that a string written in Cyrillic will be displayed correctly, regardless of the language used, whether it is Japanese, Chinese, Russian, or others.
To best explain the different encoding systems, I will present them chronologically by their appearance and usage. Understanding their evolution is essential because these systems and standards have built upon each other, making string encoding a complex and sometimes confusing topic.
ASCII
In the early days of computing and programming languages, only the English alphabet was used. The only characters that mattered were good old English letters, and a code called ASCII was created to represent every printable English character with a number between 32 and 127 (codes 0–31 were reserved for control characters). Space was 32, the letter "A" was 65, and so on. This fit conveniently in 7 bits (ASCII is almost always stored nowadays as 8-bit bytes with the most significant bit set to 0).
In the ASCII system, each character is assigned a specific binary value:
A -> 01000001 (decimal 65)
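The same lookup can be reproduced with a minimal Python sketch (standard library only); the 8-bit pattern is simply the 7-bit ASCII code padded with a leading zero:

# ASCII codes of a few characters, shown as decimal and as 8-bit binary
# (7 significant bits, most significant bit left at 0).
for ch in ("A", "a", " ", "!"):
    code = ord(ch)
    print(ch, code, format(code, "08b"))
# A 65 01000001
# a 97 01100001
#   32 00100000
# ! 33 00100001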
Since ASCII was a 7-bit system and computers at the time operated with 8 bits, there was unused space in the ASCII system (codes between 128 and 255). Many programmers and companies took advantage of this and created their own extensions to the ASCII system based on their needs. More importantly, as computers spread outside the United States, countries with scripts other than the Latin alphabet began adding their own characters to the ASCII system, using the empty slots. This led to numerous problems:
For example, on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel, they arrived as rגsumגs. In many cases, such as Russian, there were multiple conflicting interpretations of the upper-128 characters, making it impossible to reliably exchange Russian documents.
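That mix-up can still be reproduced today; the sketch below assumes Python's cp437 codec for the original IBM PC code page and cp862 for the Hebrew DOS code page:

# Byte 130 (0x82) as it appeared on US machines (code page 437)
# versus Hebrew DOS machines (code page 862).
b = bytes([130])
print(b.decode("cp437"))   # 'é'
print(b.decode("cp862"))   # 'ג'  (the Hebrew letter Gimel)

# The same mechanism garbles a whole word when the wrong code page is used.
word = "résumé".encode("cp437")
print(word.decode("cp862"))   # 'rגsumג'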
Before Unicode was invented, there were hundreds of different encoding systems, many of them incompatible and conflicting. Since no single encoding system could include all characters, the same numeric value often corresponded to entirely different characters in different systems or languages. This made data exchange between platforms and languages difficult. One early attempt to bring order was the so-called ANSI standard.
ANSI
To overcome these incompatibilities, the ASCII character set was extended into a family of standardized 8-bit encodings commonly referred to as ANSI. On Windows, the best known of these is Windows-1252 (for Western European languages). The term ANSI is somewhat misleading: it refers to the American National Standards Institute, even though these code pages were never actually finalized as ANSI standards, but the name has persisted.
ANSI is essentially an extension of the ASCII character set: it keeps all 128 ASCII characters and adds another 128 character codes, because ANSI encodings use a full 8 bits per character rather than ASCII's 7. Since ANSI uses only 1 byte per character, it can represent at most 256 different characters, which is far too few to cover all the characters Unicode defines.
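A minimal sketch of that limitation, assuming Python's built-in cp1252 codec stands in for the Windows "ANSI" code page:

# Every character that exists in Windows-1252 fits in exactly one byte...
print(len("café".encode("cp1252")))    # 4 bytes for 4 characters

# ...but anything outside its 256 slots simply cannot be represented.
try:
    "И".encode("cp1252")
except UnicodeEncodeError as err:
    print("not representable in cp1252:", err)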
The agreement was that character codes below 128 would keep the Latin characters already used in ASCII (with the same codes), while the new characters were added above that range.
However, handling characters from 128 onwards depended on regional needs. These different systems were called code pages. For example, in Israel, DOS used code page 862, while Greek users used 737. Below 128, they were the same, but from 128 onwards, they differed. This system made it impossible to represent languages like Hebrew and Greek on the same computer without creating custom programs using bitmap graphics.
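The divergence above code 128 is easy to see in Python (a minimal sketch, assuming the cp862 and cp737 codecs in the standard library correspond to those DOS code pages):

# Byte 65 is 'A' in both code pages (the shared ASCII range),
# but byte 130 is a Hebrew letter in one and a Greek letter in the other.
for value in (65, 130):
    b = bytes([value])
    print(value, repr(b.decode("cp862")), repr(b.decode("cp737")))
# 65  'A' 'A'
# 130 'ג' 'Γ'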
While ANSI resolved some issues, it created others, especially for Asian scripts, which contain thousands of characters and could not possibly fit into 8 bits. This led to a messy system called DBCS (double-byte character set), in which some characters were stored in one byte and others in two. However, this approach introduced problems of its own (for example, stepping backwards through a string byte by byte becomes unreliable).
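A brief sketch of that variable-width behaviour, using Shift JIS (a widely used double-byte character set for Japanese) via Python's shift_jis codec:

# In a DBCS such as Shift JIS, ASCII characters take one byte while
# Japanese characters take two, so byte count and character count diverge.
text = "Aあ"                      # one Latin letter, one hiragana letter
data = text.encode("shift_jis")
print(len(text))    # 2 characters
print(len(data))    # 3 bytes: 1 for 'A', 2 for 'あ'
print(list(data))   # [65, 130, 160]  ('あ' is the byte pair 0x82 0xA0)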
On the basis of ANSI, many different 8-bit encoding standards emerged, depending on the language and region; a short sketch after the list shows how the same byte maps to different letters in two of them. For example:
- Latin-10 (South-Eastern European) or ISO/IEC 8859-16 – includes our Latin script
- Latin-2 or ISO 8859-2 (Central and Eastern Europe) – also includes our region
- Windows-1250 – for Windows platforms, our Latin script
- Windows-1251 – includes Cyrillic scripts
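Here is the promised sketch: one and the same byte decoded with two of the code pages listed above (Python's cp1250 and cp1251 codecs are assumed to match Windows-1250 and Windows-1251):

# One byte, two regional code pages, two different letters.
b = bytes([0xE6])
print(b.decode("cp1250"))   # 'ć'  (Latin small c with acute)
print(b.decode("cp1251"))   # 'ж'  (Cyrillic small zhe)

# A string written under one code page therefore turns to gibberish
# when read under another.
print("ćevapi".encode("cp1250").decode("cp1251"))   # 'жevapi'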
But still, most people assumed that a byte equaled a character and a character was 8 bits. As long as strings were not transferred between computers or languages, it seemed to work. However, with the advent of the Internet, string transfers between computers became commonplace, and the entire system broke down. Fortunately, Unicode had been invented.
Unicode
Unicode is a standard in which every character in the world – letters, symbols, numbers – is assigned a unique numeric value, called a code point, regardless of platform, program, or language. Unicode itself is not an encoding but a database containing every possible character on the planet (all numbers, symbols, and letters from all scripts), each represented by a unique code point. The Unicode database is continuously expanded and now contains well over 100,000 characters. Encoding Unicode characters is done with the UTF family of encodings (UTF-8, UTF-16, UTF-32) and similar schemes: an encoding translates code points into bytes and back, which is what allows strings to be written to and read from memory. See more about UTF below.
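A brief sketch of the code point idea using Python built-ins; the code point stays the same while the byte sequence depends on which UTF encoding is chosen:

# Every character has exactly one Unicode code point...
for ch in ("A", "И", "€", "你"):
    print(ch, "U+%04X" % ord(ch))   # U+0041, U+0418, U+20AC, U+4F60

# ...and an encoding such as UTF-8 or UTF-16 decides how that code point
# is written out as bytes.
print("И".encode("utf-8"))        # b'\xd0\x98'  (2 bytes)
print("И".encode("utf-16-be"))    # b'\x04\x18'  (2 bytes, but different ones)

Which byte layout is best depends on the text: UTF-8 is compact for Latin-heavy text, while UTF-16 can be more compact for some Asian scripts.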
Author of the text: makica