Character Representation: ASCII, EBCDIC, and Unicode

Introduction

Even though many people used to think of computers as "number crunchers", it became clear long ago that handling character data is just as important.

Character data isn't just alphabetic characters; it also includes numeric characters, punctuation, spaces, etc. Most keys on the central part of the keyboard (except modifier keys such as Shift and Caps Lock) produce characters.

As we've discussed with signed and unsigned ints, characters need to be represented. In particular, they need to be represented in binary. After all, computers store and manipulate 0's and 1's (and even those 0's and 1's are just abstractions---the implementation is typically voltages).

Unsigned binary and two's complement are used to represent unsigned and signed ints, respectively, because they have nice mathematical properties; in particular, you can add and subtract as you'd expect.

However, there aren't such properties for character data, so assigning binary codes to characters is somewhat arbitrary. The most common character representation is ASCII, which stands for American Standard Code for Information Interchange.

There are two reasons to use ASCII. First, we need some way to represent characters as binary numbers (or, equivalently, as bitstring patterns). There's not much choice about this since computers represent everything in binary.

If you've noticed a common theme, it's that we need representation schemes for everything. However, most importantly, we need representations for numbers and characters. Once you have that (and perhaps pointers), you can build up everything you need.

The other reason we use ASCII is because of the letter "S" in ASCII, which stands for "standard". Standards are good because they allow for common formats that everyone can agree on.

Unfortunately, there's also the letter "A", which stands for American. ASCII is clearly biased toward the English character set. Other languages may have their own character sets, even though English dominates most of the computing world (at least, in programming and software).

Nice Properties

Even though character sets don't have mathematical properties, there are some nice aspects of ASCII. In particular, the lowercase letters are contiguous ('a' through 'z' maps to 97 through 122 in decimal). The uppercase letters are also contiguous ('A' through 'Z' maps to 65 through 90). Finally, the digits are contiguous ('0' through '9' maps to 48 through 57).

Since they are contiguous, it's usually easy to determine whether a character is lowercase or uppercase (by checking if the ASCII code lies in the range of lower or uppercase ASCII codes), or to determine if it's a digit, or to convert a digit in ASCII to an int value.
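
For example, here is a minimal C sketch that relies only on the contiguity of these ranges (the helper names are made up for illustration):

    #include <stdio.h>

    /* These helpers rely only on the contiguity of the ASCII ranges. */
    int is_lowercase(char c) { return c >= 'a' && c <= 'z'; }
    int is_uppercase(char c) { return c >= 'A' && c <= 'Z'; }
    int is_digit(char c)     { return c >= '0' && c <= '9'; }

    /* Convert an ASCII digit to its int value: '7' is 55, '0' is 48, so 55 - 48 == 7. */
    int digit_to_int(char c) { return c - '0'; }

    int main(void) {
        printf("%d\n", is_lowercase('g'));   /* prints 1 */
        printf("%d\n", is_uppercase('g'));   /* prints 0 */
        printf("%d\n", digit_to_int('7'));   /* prints 7 */
        return 0;
    }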

ASCII Code (Decimal)

This chart can be found by typing man ascii.
 0 nul   16 dle   32 sp    48 0     64 @     80 P     96 `    112 p
 1 soh   17 dc1   33 !     49 1     65 A     81 Q     97 a    113 q
 2 stx   18 dc2   34 "     50 2     66 B     82 R     98 b    114 r
 3 etx   19 dc3   35 #     51 3     67 C     83 S     99 c    115 s
 4 eot   20 dc4   36 $     52 4     68 D     84 T    100 d    116 t
 5 enq   21 nak   37 %     53 5     69 E     85 U    101 e    117 u
 6 ack   22 syn   38 &     54 6     70 F     86 V    102 f    118 v
 7 bel   23 etb   39 '     55 7     71 G     87 W    103 g    119 w
 8 bs    24 can   40 (     56 8     72 H     88 X    104 h    120 x
 9 ht    25 em    41 )     57 9     73 I     89 Y    105 i    121 y
10 nl    26 sub   42 *     58 :     74 J     90 Z    106 j    122 z
11 vt    27 esc   43 +     59 ;     75 K     91 [    107 k    123 {
12 np    28 fs    44 ,     60 <     76 L     92 \    108 l    124 |
13 cr    29 gs    45 -     61 =     77 M     93 ]    109 m    125 }
14 so    30 rs    46 .     62 >     78 N     94 ^    110 n    126 ~
15 si    31 us    47 /     63 ?     79 O     95 _    111 o    127 del
The characters with codes 0 through 31 are generally not printable (they are control characters). Code 32 is the space character.

Also note that there are only 128 ASCII characters. This means only 7 bits are required to represent an ASCII character. However, since the smallest unit of storage on most computers is a byte, a byte is used to store an ASCII character. The MSb (most significant bit) of an ASCII character is therefore 0.

Sometimes ASCII has been extended by using the MSb.
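
As a small illustration, here is one way you might test whether a byte holds a 7-bit ASCII code by examining the MSb (the function name is hypothetical):

    /* A byte is a 7-bit ASCII code exactly when its MSb is 0.
       0x80 is 1000 0000 in binary, a mask for the MSb. */
    int is_ascii(unsigned char c) {
        return (c & 0x80) == 0;
    }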

ASCII Code (Hex)

This chart can be found by typing man ascii.
00 nul   10 dle   20 sp    30 0     40 @     50 P     60 `     70 p
01 soh   11 dc1   21 !     31 1     41 A     51 Q     61 a     71 q
02 stx   12 dc2   22 "     32 2     42 B     52 R     62 b     72 r
03 etx   13 dc3   23 #     33 3     43 C     53 S     63 c     73 s
04 eot   14 dc4   24 $     34 4     44 D     54 T     64 d     74 t
05 enq   15 nak   25 %     35 5     45 E     55 U     65 e     75 u
06 ack   16 syn   26 &     36 6     46 F     56 V     66 f     76 v
07 bel   17 etb   27 '     37 7     47 G     57 W     67 g     77 w
08 bs    18 can   28 (     38 8     48 H     58 X     68 h     78 x
09 ht    19 em    29 )     39 9     49 I     59 Y     69 i     79 y
0a nl    1a sub   2a *     3a :     4a J     5a Z     6a j     7a z
0b vt    1b esc   2b +     3b ;     4b K     5b [     6b k     7b {
0c np    1c fs    2c ,     3c <     4c L     5c \     6c l     7c |
0d cr    1d gs    2d -     3d =     4d M     5d ]     6d m     7d }
0e so    1e rs    2e .     3e >     4e N     5e ^     6e n     7e ~
0f si    1f us    2f /     3f ?     4f O     5f _     6f o     7f del
The difference in the ASCII code between an uppercase letter and its corresponding lowercase letter is 20 in hex (32 in decimal).

This makes it easy to convert lower to uppercase (and back) in hex (or binary).
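
Here is a sketch of case conversion that exploits the 0x20 difference (bit 5); the function names are made up, and real code would normally just use toupper()/tolower() from <ctype.h>:

    /* 'a' is 0x61 and 'A' is 0x41: they differ only in bit 5 (0x20).
       Clearing that bit gives uppercase, setting it gives lowercase. */
    char to_upper(char c) {
        if (c >= 'a' && c <= 'z')
            return c & ~0x20;    /* clear bit 5: 0x61 & 0xDF == 0x41 */
        return c;
    }

    char to_lower(char c) {
        if (c >= 'A' && c <= 'Z')
            return c | 0x20;     /* set bit 5: 0x41 | 0x20 == 0x61 */
        return c;
    }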

char as a one byte int

It turns out that C supports more than one char type: plain char (which is usually signed, although that is up to the compiler) and unsigned char, which is always unsigned.

This may seem like a particularly odd feature. What does it mean to have a signed or unsigned char?

This is where it's useful to think of a char as a one byte int. When you want to cast char to an int (of any size), the rules for sign-extension may apply. In particular, if the MSb of a char is 1, then casting it to an int may cause this 1 to sign extend, which may be surprising if you're not expecting it.

Of course, you may ask "but how would the MSb get the value 1?". If you recall, char is one of the data types that you can manipulate with bitwise and bitshift operators. This means you can set or clear any bit of a char. In particular, you can set or clear the MSb of a char.

Another way you might get 1 for the MSb is casting an int down to a char. Usually, this means truncating off the upper bytes, leaving the least significant byte. Since an int can have any bit pattern, there's a possibility that the least significant byte has a 1 in bit position b7.
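
A short sketch of the surprise, assuming a machine where plain char is signed (the exact output is implementation-dependent, but this is the typical behavior):

    #include <stdio.h>

    int main(void) {
        char c = 'A';            /* 0x41, MSb is 0 */
        c = c | 0x80;            /* set the MSb: now 0xC1 */

        int i = c;               /* if char is signed, 0xC1 sign-extends */
        printf("%d\n", i);       /* typically prints -63, not 193 */

        unsigned char u = 0xC1;
        int j = u;               /* unsigned char zero-extends */
        printf("%d\n", j);       /* prints 193 */

        /* The same byte can be viewed as a character or as a small int. */
        char d = 'A';
        printf("%c %d\n", d, d); /* prints: A 65 */
        return 0;
    }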

You should think of a char as both an ASCII character and an 8 bit int. This duality is important because char (along with its signed and unsigned variants) is the only basic C type that occupies exactly 1 byte.

Other Character Codes

While ASCII is still popularly used, another character representation that was used (especially at IBM) was EBCDIC, which stands for Extended Binary Coded Decimal Interchange Code (yes, the word "code" appears twice). This character set has mostly disappeared. EBCDIC does not store the letters contiguously (there are gaps within the alphabet), which can create problems when alphabetizing words.

One problem with ASCII is that it's biased toward the English language, which creates difficulties for speakers of other languages. One common (if unsatisfying) solution is for people in other countries to write their programs in ASCII anyway.

Other countries have used different solutions, in particular, using 8 bits to represent their alphabets, giving up to 256 letters, which is plenty for most alphabet based languages (recall you also need to represent digits, punctuation, etc).

However, Asian languages, whose written symbols often represent entire words rather than individual letters, have far more symbols than 8 bits can represent. In particular, 8 bits can only represent 256 values, which is far smaller than the number of such symbols.

Thus, a new character set called Unicode is now becoming more prevalent. This is a 16 bit code, which allows for about 65,000 different representations. This is enough to encode the popular Asian languages (Chinese, Korean, Japanese, etc.). It also turns out that ASCII codes are preserved. What does this mean? To convert ASCII to Unicode, take all one byte ASCII codes, and zero-extend them to 16 bits. That should be the Unicode version of the ASCII characters.
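
As a rough sketch of that zero-extension (the array size and names here are just for illustration):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *ascii = "Hi";          /* 'H' is 0x48, 'i' is 0x69 */
        uint16_t unicode[3];

        /* Zero-extend each one-byte ASCII code to a two-byte code.
           Converting through unsigned char guarantees zero-extension. */
        for (size_t k = 0; k <= strlen(ascii); k++)
            unicode[k] = (unsigned char) ascii[k];

        printf("%04x %04x\n", (unsigned) unicode[0], (unsigned) unicode[1]);  /* 0048 0069 */
        return 0;
    }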

The biggest consequence of moving from ASCII to Unicode is that text files double in size. The second consequence is that endianness begins to matter again. With single bytes, there's no need to worry about endianness. However, with two byte quantities, you have to consider byte order.
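
A small sketch of why byte order matters for two-byte quantities: the same 16 bit value is laid out differently in memory on little-endian and big-endian machines.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint16_t code = 0x0041;                       /* 'A', zero-extended to 16 bits */
        unsigned char *bytes = (unsigned char *) &code;

        /* A little-endian machine (e.g. x86) prints "41 00";
           a big-endian machine prints "00 41". */
        printf("%02x %02x\n", bytes[0], bytes[1]);
        return 0;
    }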

While C and C++ still primarily use ASCII, Java uses Unicode. This means that Java needs a separate byte type, because char in Java is no longer a single byte. Instead, it's a 2 byte Unicode representation.

ASCII files

It's easy to fool yourself into thinking that numbers written in a file are actually the internal representation. For example, if you write 123 in a file using a text editor, is that really how the integer 123 is stored?

The file does NOT store the integer 123. Instead, it stores the ASCII codes for the characters '1', '2', and '3' (which are 31, 32, and 33 in hex, or 0011 0001, 0011 0010, 0011 0011 in binary).

ASCII files store bytes. Each byte is the ASCII code for some character in the character set. You can think of a text editor as a translator. It translates those binary numbers into symbols on the screen. Thus, when it sees 41 (hex), that's the ASCII code for 'A', and thus 'A' gets displayed.
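
As a small demonstration (the filename is made up), the following sketch writes the text 123 to a file and then prints the raw bytes it actually contains:

    #include <stdio.h>

    int main(void) {
        /* Write the three characters '1', '2', '3' as text. */
        FILE *out = fopen("numbers.txt", "w");
        if (out == NULL) return 1;
        fputs("123", out);
        fclose(out);

        /* Read the raw bytes back: they are the ASCII codes 31 32 33 (hex),
           not the binary representation of the integer 123 (which is 7b). */
        FILE *in = fopen("numbers.txt", "r");
        if (in == NULL) return 1;
        int b;
        while ((b = fgetc(in)) != EOF)
            printf("%02x ", b);
        printf("\n");                 /* prints: 31 32 33 */
        fclose(in);
        return 0;
    }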

Some people think that if they type 0's and 1's in a text editor, they are writing out bits into a binary file. This is not true. The file contains the ASCII code for the character '0' and the character '1'.

There are hex editors that let you type in binary or, more commonly, in hex. Those hex pairs are translated to binary. Thus, when you write F3, the binary number 1111 0011 is written to the file (the space is placed there only to make the binary number easy to read).

Summary

Character data is at least as important as numeric data. Like numeric data, character data is represented using 0's and 1's. The most commonly used character representation (at least, in the US) is ASCII. However, Unicode is gaining popularity, and should eventually become the standard character set in programming languages.
