Character Set and Character Encoding

  • As we all know, a computer understands only binary. To feed the number 10 to a computer, we can convert decimal 10 to binary, i.e. 1010, and feed that to the computer.
  • But suppose we want to feed the character A into the computer. We cannot convert the character A to binary directly, so we first map A to a numeric value (written in decimal, hex or octal) and then convert that number into binary.
  • For example, let's assign the value 65 to A and convert 65 to binary, i.e. 1000001.
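
The two conversions above can be tried out in Python. This is only a small sketch; the variable names are made up for illustration, and `ord` uses the Unicode/ASCII assignment, in which A happens to be 65.

```python
# Look up the number assigned to the character 'A', then show it in binary.
code = ord("A")              # the numeric value assigned to 'A'
bits = format(code, "b")     # that number written in binary
print(code, bits)

# A plain number such as 10 needs no lookup; it converts to binary directly.
ten_bits = format(10, "b")
print(ten_bits)
```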

Character Set

  • In the above example we assigned the value 65 to A. Similarly, we can assign a number to each and every character.
  • Such a set, in which a number is assigned to every character, is called a Character Set.

Character Encoding

  • Character Encoding is the process of assigning a value to a character and then converting that value to a binary number, so that computers, which can work only on binary numbers, can work on it.
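
As a rough illustration, Python's built-in `str.encode` performs exactly this step: it turns characters into bytes according to a named encoding. The sketch below assumes the ASCII encoding and uses illustrative variable names.

```python
text = "A"

# Encoding: character -> number -> bytes (binary data).
encoded = text.encode("ascii")
as_bits = " ".join(format(b, "08b") for b in encoded)
print(list(encoded), as_bits)

# Decoding reverses the process: bytes -> number -> character.
decoded = encoded.decode("ascii")
print(decoded)
```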

BOM (Byte Order Mark)

  • BOM use is optional. 
  • Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.
  • The BOM is encoded in the same scheme as the rest of the document and becomes a non-character Unicode code point (U+FFFE) if its bytes are swapped.
  • If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a “zero-width non-breaking space” (which inhibits line-breaking between word-glyphs). Since Unicode 3.2, this usage is deprecated in favor of the “Word Joiner” character, U+2060. This allows U+FEFF to be used only as a BOM.
  • This table illustrates how the BOM character is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that interprets each byte as a legacy encoding (CP1252).
  • Reference: at the bottom of the page.
Encoding     Representation (hexadecimal)  Representation (decimal)  Bytes as CP1252 characters
-----------  ----------------------------  ------------------------  --------------------------
UTF-8        EF BB BF                      239 187 191               ï»¿
UTF-16 (BE)  FE FF                         254 255                   þÿ
UTF-16 (LE)  FF FE                         255 254                   ÿþ
UTF-32 (BE)  00 00 FE FF                   0 0 254 255               ^@^@þÿ (^@ is the null character)
UTF-32 (LE)  FF FE 00 00                   255 254 0 0               ÿþ^@^@ (^@ is the null character)
UTF-7        2B 2F 76                      43 47 118                 +/v
UTF-1        F7 64 4C                      247 100 76                ÷dL
UTF-EBCDIC   DD 73 66 73                   221 115 102 115           Ýsfs
SCSU         0E FE FF                      14 254 255                ^Nþÿ (^N is the “shift out” character)
BOCU-1       FB EE 28                      251 238 40                ûî(
GB-18030     84 31 95 33                   132 49 149 51             „1•3
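
To make the table a little more concrete, here is a minimal sketch of BOM detection in Python. The `codecs.BOM_*` constants are part of the standard library; the `detect_bom` helper and the `BOMS` list are hypothetical names for this example, and only the UTF-8/16/32 rows of the table are covered.

```python
import codecs

# Check the longer BOMs first: the UTF-32 (LE) BOM begins with the
# same two bytes (FF FE) as the UTF-16 (LE) BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
]

def detect_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

sample = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(detect_bom(sample))

# The same three bytes read as CP1252 give the familiar junk characters.
bom_as_cp1252 = codecs.BOM_UTF8.decode("cp1252")
print(bom_as_cp1252)

# The "utf-8-sig" codec drops a leading BOM while decoding.
print(sample.decode("utf-8-sig"))
```

A missing BOM does not tell you the encoding; `detect_bom` simply returns None in that case.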

List of Some Character Sets and Encodings

  • ASCII
    • A 7-bit encoding for English characters, in which the 128 possible values are each mapped to a character.
  • Unicode
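    • At the time of writing it defined 128,237 characters, and the number grows with each new version.
    • Its maximum is 1,114,112 code points, corresponding to the hexadecimal numbers 0 to 10FFFF.
    • Unicode is a standard that defines a character set (which number belongs to which character); it is not itself an encoding. How those numbers are stored as bytes is defined by encodings such as UTF-8, UTF-16 and UTF-32.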
  • UTF-8
    • UTF-8 is a character encoding that implements Unicode.
    • Despite its name, UTF-8 isn’t a fixed 8-bit encoding; it is a variable-length encoding that uses one to four 8-bit bytes (8 to 32 bits) per character.
    • It encodes the most common characters, such as the digits and English letters, with 8 bits. This makes it efficient for most data.
    • Another advantage of UTF-8 is that, for English (ASCII) text, it is byte-for-byte identical to ASCII.
  • UTF-16
    • UTF-16 is a character encoding that implements Unicode.
    • Like UTF-8, it is a variable-length encoding that uses up to 32 bits per character.
    • It encodes the most common characters with 16 bits and less common characters with 32 bits (a pair of 16-bit units, called a surrogate pair).
  • UTF-32
    • UTF-32 is a character encoding that implements Unicode as a fixed-width 32-bit code.
    • Unicode only requires 21 bits to encode its limit of 1,114,112 code points.
    • As such, UTF-32 has a number of leading zero bits that pad each code.
    • This is inefficient: data is never larger in UTF-8 or UTF-16 than in UTF-32, and it is usually much smaller.
    • For English data, UTF-32 is typically about 4 times larger than UTF-8.
  • ANSI (CP-1252, CodePage-1252 or ISO-8859-1)
    • “ANSI” encoding is 8-bit rather than 7-bit as ASCII is (although ASCII is almost always stored nowadays in 8-bit bytes with the MSB set to 0).
    • It is essentially an extension of the ASCII character set: it includes all the ASCII characters plus an additional 128 character codes.
    • It is more properly referred to as Windows-1252 on Western/U.S. systems. (The term can refer to other Windows code pages on other systems.)
    • Strictly speaking, there is no such thing as an ANSI encoding. Colloquially, the term ANSI is used for several different encodings (such as ISO 8859-1 or CP1252).
  • Code Page : CP-XXXX
    • The term “code page” originated from IBM’s EBCDIC-based mainframe systems, but Microsoft, SAP, and Oracle Corporation are among the few vendors that use this term.
    • When there is a plethora of character sets (as at IBM), identifying them by number is a convenient way to distinguish them.
    • Originally, the code page numbers referred to the page numbers in the IBM standard character set manual.
    • For example, UTF-8 has been assigned code page number 1208 at IBM, 65001 at Microsoft, and 4110 at SAP.
    • For more details, see the reference at the bottom of this page.
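
The size differences between the UTF encodings listed above are easy to measure. The sketch below is only an illustration: `encoded_sizes` and the sample labels are hypothetical names, and the little-endian codec variants are used so that Python does not prepend a BOM to the result.

```python
def encoded_sizes(text):
    """Byte length of text in each UTF encoding (little-endian, no BOM)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = [
    ("ASCII letter A", "A"),
    ("e with acute", "\u00e9"),
    ("euro sign", "\u20ac"),
    ("emoji", "\U0001F600"),
]
for label, text in samples:
    print(label, encoded_sizes(text))

# For plain English text, UTF-8 and ASCII produce the same bytes.
same_as_ascii = "hello".encode("utf-8") == "hello".encode("ascii")

# Windows-1252 and ISO 8859-1 are not the same thing: the euro sign
# exists in cp1252 (as byte 0x80) but cannot be encoded in ISO 8859-1.
euro_cp1252 = "\u20ac".encode("cp1252")
try:
    "\u20ac".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print(same_as_ascii, euro_cp1252, euro_in_latin1)
```

The last part shows why “ANSI” is a loose term: two encodings that are often lumped together under that name disagree about the euro sign.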

Type a Special Unicode Character in Unix

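  • Look up the Unicode code point of the character on a website (Example 1, Example 2).
  • Open a file in the vi editor.
  • Suppose the character is é, whose Unicode code point is U+00E9.
  • In insert mode, press Ctrl+Shift+U, then type the hex digits 00E9 and press Enter. (This shortcut is provided by the terminal or desktop input method on many Linux systems, not by vi itself, so it may not work everywhere.)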
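
If you only need the character in a script rather than in an editor, the same code point can be used directly. A small Python sketch, with illustrative variable names:

```python
import unicodedata

# U+00E9 written as an escape sequence and as a number give the same character.
ch = "\u00e9"
same = ch == chr(0x00E9)

# Look up the official Unicode name and the UTF-8 bytes of the character.
name = unicodedata.name(ch)
utf8_bytes = ch.encode("utf-8")

print(ch, hex(ord(ch)), name, utf8_bytes)
```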

Debugging Unicode

Finding the character set of a file

How to reproduce the character in Unicode

Find characters that are unsupported in a character set.
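
One possible way to do this, sketched in Python under the assumption that the target character set is available as a Python codec: try to encode each character and collect the ones that fail. The `unsupported_chars` helper is a hypothetical name for this example.

```python
def unsupported_chars(text, encoding):
    """Return the distinct characters of text that encoding cannot represent,
    in order of first appearance."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

sample = "caf\u00e9 \u20ac5"   # contains e-acute and the euro sign
print(unsupported_chars(sample, "ascii"))
print(unsupported_chars(sample, "latin-1"))
print(unsupported_chars(sample, "cp1252"))
```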

Reference :
