- Computers work only with binary. To feed the number 10 to a computer, we convert decimal 10 to binary (1010) and pass that in.
- A character such as A cannot be converted to binary directly. We first have to assign the character a numeric value (which can be written in decimal, hex, or octal) and then convert that number to binary.
- For example, let's assign the value 65 to A and convert 65 to binary: 1000001.
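The two conversions above can be checked with a minimal Python sketch. It uses only the built-ins `ord` and `format`; the variable names are illustrative.

```python
# Map the character 'A' to its assigned number, then write that number in binary.
code = ord("A")             # 65
bits = format(code, "b")    # '1000001'
print(code, bits)

# The plain number 10 from the first bullet converts the same way.
ten_bits = format(10, "b")  # '1010'
print(ten_bits)
```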
Character Set
- In the example above we assigned the value 65 to A. In the same way, we can assign a number to every character we want to represent.
- A set in which every character has an assigned number is called a Character Set.
Character Encoding
- Character Encoding is the process of assigning a number to a character and then converting that number to binary, because computers can work only on binary numbers.
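As a rough illustration of that process, the sketch below encodes a short string to bytes, prints each byte as 8 binary digits, and decodes it back. The choice of ASCII and the sample text "Hi" are assumptions made for the example.

```python
text = "Hi"

# Encoding: each character is mapped to its number and stored as a byte.
encoded = text.encode("ascii")

# Show each byte as an 8-digit binary string.
bit_string = " ".join(format(byte, "08b") for byte in encoded)

# Decoding reverses the process: bytes -> numbers -> characters.
decoded = encoded.decode("ascii")

print(list(encoded))  # [72, 105]
print(bit_string)     # 01001000 01101001
print(decoded)        # Hi
```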
BOM (Byte Order Mark)
- BOM use is optional.
- Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.
- The BOM is encoded in the same scheme as the rest of the document and becomes a non-character Unicode code point if its bytes are swapped.
- If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a “zero-width non-breaking space” (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favor of the “Word Joiner” character, U+2060.[1] This allows U+FEFF to be used only as a BOM.
- This table illustrates how the BOM character is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that is interpreting each byte as a legacy encoding
- Reference – At the bottom of the page.
Encoding | Representation (hexadecimal) | Representation (decimal) | Bytes as CP1252 characters |
---|---|---|---|
UTF-8[a] | EF BB BF | 239 187 191 | ï»¿ |
UTF-16 (BE) | FE FF | 254 255 | þÿ |
UTF-16 (LE) | FF FE | 255 254 | ÿþ |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | ^@^@þÿ (^@ is the null character) |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ^@^@ (^@ is the null character) |
UTF-7[a] | 2B 2F 76[b] | 43 47 118 | +/v |
UTF-1[a] | F7 64 4C | 247 100 76 | ÷dL |
UTF-EBCDIC[a] | DD 73 66 73 | 221 115 102 115 | Ýsfs |
SCSU[a] | 0E FE FF | 14 254 255 | ^Nþÿ (^N is the “shift out” character) |
BOCU-1[a] | FB EE 28 | 251 238 40 | ûî( |
GB-18030[a] | 84 31 95 33 | 132 49 149 51 | „1•3 |
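The byte sequences in the table can be used to guess an encoding from the first bytes of a file. The following is a minimal sketch that covers only the UTF-8, UTF-16, and UTF-32 rows; `detect_bom` and `BOMS` are hypothetical names, while the `codecs.BOM_*` constants come from Python's standard library.

```python
import codecs

# Check longer BOMs first: the UTF-32 (LE) BOM FF FE 00 00 begins with
# the UTF-16 (LE) BOM FF FE, so order matters.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
]


def detect_bom(data: bytes):
    """Return the label of the BOM that starts `data`, or None if there is none."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))             # UTF-8
print(detect_bom("\ufeffhi".encode("utf-16-le")))   # UTF-16 (LE)
print(detect_bom(b"plain ascii"))                   # None

# Reading the UTF-8 BOM bytes as CP1252 gives the three characters in the table.
print(b"\xef\xbb\xbf".decode("cp1252"))             # ï»¿
```

A missing BOM does not say which encoding is in use, since the BOM is optional, so this check can only confirm an encoding, not rule one out.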
List of Some Character Sets
- ASCII
- A 7-bit encoding for English text, in which the 128 possible values are each mapped to a character.
- Unicode
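- At the time of writing it defines 128,237 characters, and the count grows with each new version.
- Its code space has room for 1,114,112 code points, covering the hexadecimal values 0 to 10FFFF.
- Unicode is a standard that assigns a number (code point) to each character. It does not by itself say how those numbers are stored as bytes; that is the job of encodings such as UTF-8, UTF-16, and UTF-32.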
- UTF-8
- UTF-8 is a character encoding that implements Unicode.
- Despite its name, UTF-8 isn’t a fixed 8-bit encoding. It is a variable-length encoding that uses 8 to 32 bits (1 to 4 bytes) per character.
- It encodes the most common characters, such as digits and English letters, with 8 bits. This makes it efficient for most data.
- Another advantage of UTF-8 is that for English, it is identical to ASCII.
- UTF-16
- UTF-16 is a character encoding that implements Unicode.
- Like UTF-8, it is a variable-length encoding that uses up to 32 bits per character.
- It encodes the most common characters with 16 bits and less common characters with 32 bits.
- UTF-32
- UTF-32 is a character encoding that implements Unicode as a fixed 32-bit code.
- Unicode only requires 21 bits to cover its limit of 1,114,112 code points.
- As such, every UTF-32 code has a number of leading zeros as padding.
- This is inefficient: text is never larger in UTF-8 or UTF-16 than in UTF-32, and it is usually much smaller.
- For English data, UTF-32 is typically about 4 times larger than UTF-8.
- ANSI (CP-1252, CodePage-1252, or ISO-8859-1)
- “ANSI” encoding is 8-bit rather than 7-bit as ASCII is (ASCII is almost always encoded nowadays as 8-bit bytes with the MSB set to 0).
- This is essentially an extension of the ASCII character set in that it includes all the ASCII characters with an additional 128 character codes.
- It is more properly referred to as Windows-1252 on Western/U.S. systems. (It can represent certain other Windows code pages on other systems.)
- Strictly speaking, there is no such thing as ANSI encoding. Colloquially the term ANSI is used for several different encodings (ISO 8859-1 or CP1252)
- Code Page : CP-XXXX
- The term “code page” originated from IBM’s EBCDIC-based mainframe systems but Microsoft, SAP, and Oracle Corporation are among the few vendors which use this term.
- In the case when there is a plethora of character sets (like in IBM), identifying character sets through a number is a convenient way to distinguish them.
- Originally, the code page numbers referred to the page numbers in the IBM standard character set manual
- For example, UTF-8 has been assigned page numbers 1208 at IBM, 65001 at Microsoft, and 4110 at SAP.
- For more details, reference at the bottom of this page
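The size claims in the list above can be checked with a short Python sketch. The helper name `encoded_sizes` and the sample strings are hypothetical; the `-le` codec variants are used because Python's plain `utf-16` and `utf-32` codecs prepend a BOM, which would inflate the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes `text` occupies in each Unicode encoding."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("hello"))       # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("\u00e9"))      # é:  {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("\u6f22"))      # 漢: {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes("\U0001f600"))  # 😀: {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}

# For English text, UTF-8 produces exactly the same bytes as ASCII.
print("hello".encode("utf-8") == "hello".encode("ascii"))  # True

# CP1252 ("ANSI") stores é in a single byte above the 7-bit ASCII range.
print("\u00e9".encode("cp1252"))    # b'\xe9'
```

The English sample shows the roughly 4x gap between UTF-8 and UTF-32, while the last Unicode sample shows a character that needs 32 bits in all three encodings.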
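Type a Special Unicode Character in Unix
- Look up the code point of the character on a website (Example 1, Example 2).
- Open a file in the vi editor.
- Suppose the character is é, whose Unicode code point is U+00E9.
- In the vi editor, press Ctrl+Shift+U, type the hex digits 00E9, and then press Enter or Space. This shortcut is usually provided by the desktop input method (for example GTK or IBus on Linux) rather than by vi itself, so it may not work on every system.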
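To confirm that a code point such as U+00E9 is the character you expect, a small Python sketch like the one below can help. It relies only on the built-ins `chr` and `ord` and the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

ch = chr(0x00E9)                        # build the character from its code point
code_point = f"U+{ord(ch):04X}"         # format it back as U+XXXX
name = unicodedata.name(ch)             # official Unicode character name
utf8_bytes = ch.encode("utf-8").hex(" ")  # bytes written to a UTF-8 file

print(ch, code_point, name, utf8_bytes)
# é U+00E9 LATIN SMALL LETTER E WITH ACUTE c3 a9
```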
Debugging Unicode
Finding the character set of a file
How to reproduce the character in Unicode
Find characters that are unsupported in a character set.
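One possible approach, sketched in Python under the assumption that the text is already available as a string: try to encode each character in the target character set and collect the ones that fail. `unsupported_chars` is a hypothetical helper, not a library function.

```python
def unsupported_chars(text: str, encoding: str):
    """Return (index, character, code point) for each character `encoding` cannot represent."""
    bad = []
    for index, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, ch, f"U+{ord(ch):04X}"))
    return bad


sample = "na\u00efve caf\u00e9 \u20ac5"  # "naïve café €5"

print(unsupported_chars(sample, "ascii"))
# [(2, 'ï', 'U+00EF'), (9, 'é', 'U+00E9'), (11, '€', 'U+20AC')]
print(unsupported_chars(sample, "latin-1"))
# [(11, '€', 'U+20AC')]
print(unsupported_chars(sample, "cp1252"))
# []
```

The euro sign shows one of the differences between ISO 8859-1 and CP1252, which are both loosely called ANSI.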
Reference :