The Dual Nature of `char`

When C was designed, it probably should have included a type called byte. Instead, char serves that purpose. A char is a character representation, but it is also a small integer: it can be compared, used in arithmetic, and converted to an int.

A char typically stores a character encoded in ASCII. Because ASCII is really a 7-bit code, every ASCII code has its most significant bit (MSb) set to 0.

However, neither C nor C++ enforces this. You can set the high bit to 1, which creates a bit pattern that does not correspond to a valid ASCII code.

Nevertheless, it's useful to allow every possible bit pattern in a char, even though patterns with the MSb equal to 1 are not valid ASCII. The reason? There's no byte type!

And even though there's no byte type, it's convenient to have bytes. For example, you may need to read the contents of a file, byte by byte, say, to write a hexdump utility.
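As a rough sketch of the byte-oriented use just described, the function below formats a buffer of raw bytes as hexadecimal text. The name hex_dump and its signature are made up for illustration, and it works on bytes already in memory rather than reading a file. It takes unsigned char so that every byte is a value from 0 to 255.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original notes): formats raw bytes as
// two-digit lowercase hex values separated by single spaces.
std::string hex_dump(const unsigned char* data, std::size_t size) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < size; ++i) {
        if (i != 0) {
            out += ' ';
        }
        out += digits[data[i] >> 4];    // high four bits of the byte
        out += digits[data[i] & 0x0F];  // low four bits of the byte
    }
    return out;
}
```

Because the parameter is unsigned char, the shift and mask always produce an index from 0 to 15, even for bytes whose high bit is set.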

Furthermore, people often do computations with characters, such as:

   if ( ch >= 'a' && ch <= 'z' )
      ch -= 'a' - 'A' ; // Subtract 32

Notice that we're doing comparisons and arithmetic operations. For that to make any sense, we need the ability to treat those bits like numbers.
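Another common computation of this kind is turning a digit character into the number it stands for. The sketch below uses a made-up helper name, digit_value, and returns -1 for non-digits as an arbitrary convention chosen for this example.

```cpp
// Hypothetical helper (not from the original notes): returns the numeric
// value of a decimal digit character, or -1 if ch is not a digit.
int digit_value(char ch) {
    if (ch >= '0' && ch <= '9') {
        return ch - '0';  // both operands are promoted to int before subtracting
    }
    return -1;
}
```

This only works because the characters '0' through '9' have consecutive codes, so subtracting '0' leaves the digit's value.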

char as 1-byte int

For the purposes of comparing values and casting to an int, a char can be considered a 1-byte integer. On many implementations it is signed and stored in two's complement (2C), although the language leaves it to the implementation whether plain char is signed or unsigned. When you compare two characters, you compare their ASCII values as numbers. Of course, with the high bit set to zero for valid ASCII codes, the comparison is between non-negative values, so it doesn't matter whether the representation is 2C or unsigned binary (UB).

However, if the high bit of a char is 1, the signedness matters. Where char is signed, a char with its MSb set to 1 is treated as a negative int value.
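To make the two readings of one bit pattern concrete, here is a small sketch. The helper names as_signed and as_unsigned are invented for this example. It uses signed char explicitly, since the signedness of plain char depends on the implementation, and it assumes a two's complement machine with 8-bit bytes (which C++20 guarantees for the conversion used here, and which earlier implementations provide in practice).

```cpp
// Hypothetical helpers (not from the original notes): interpret the same
// 8-bit pattern as a signed value and as an unsigned value.
int as_signed(unsigned char bits) {
    // Assumption: two's complement, so patterns with the high bit set
    // become negative values.
    return static_cast<signed char>(bits);
}

int as_unsigned(unsigned char bits) {
    return bits;  // always in the range 0..255
}
```

For a valid ASCII code such as 0x41 both functions return 65. For a pattern with the high bit set, such as 0xE9, as_unsigned returns 233 while as_signed returns 233 - 256 = -23.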

unsigned char

Because char types can be thought of as 1-byte ints, it makes sense to talk about unsigned char. If char only represented character data (implying that you couldn't do arithmetic or comparisons of character data), then unsigned char would make little sense. You'd only need char. However, unsigned char really does exist, and so the question is why.

The reason it exists is simple. A char can be thought of as a signed int, so it makes sense to have an unsigned version as well. You can then cast an unsigned char to a larger int type and have the value zero-extended.

Java, for example, has no unsigned int types. This makes it a pain to zero-extend a value when casting it to a larger size. Either you set the MSb to 0 in the shorter int type, cast to the larger type, and then restore that bit, or you cast to the larger size and zero out the upper bits using a mask.
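The second workaround, casting first and masking afterwards, can be sketched as follows. It is written in C++ rather than Java to match the other code in these notes, with signed char standing in for a signed 1-byte type. The function name zero_extend_with_mask is invented for this example, and two's complement is assumed.

```cpp
// Hypothetical helper (not from the original notes): widens a signed 8-bit
// value to int without carrying the sign bit into the upper bits.
int zero_extend_with_mask(signed char value) {
    int widened = value;    // sign-extends: -1 becomes an int with all bits set
    return widened & 0xFF;  // keep only the low eight bits
}
```

With an unsigned 1-byte type none of this is needed, because the plain conversion already zero-extends.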

Converting between char and unsigned char

You can easily convert between char and unsigned char. Just do something like:

   char ch = 'A' ;
   unsigned char ch2 = static_cast<unsigned char>( ch ) ;

The bit patterns don't change when you make this conversion. That is, both ch and ch2 have the same representation. What changes is how the bits are interpreted when you compare two values. An unsigned char is treated as 1-byte unsigned binary (UB). A char is treated as 1-byte signed 2C, on implementations where plain char is signed.

Also, when you cast an unsigned char to an unsigned int, the value is zero-extended instead of sign-extended.
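The difference between the two kinds of extension shows up when the same 1-byte pattern is widened to unsigned int. In this sketch the helper names widen_signed and widen_unsigned are made up, and signed char is used in place of plain char so the result does not depend on the implementation's choice of signedness.

```cpp
// Hypothetical helpers (not from the original notes): widen a 1-byte value
// to unsigned int from a signed and from an unsigned source type.
unsigned int widen_signed(signed char value) {
    return static_cast<unsigned int>(value);  // negative values fill the upper bits with 1s
}

unsigned int widen_unsigned(unsigned char value) {
    return static_cast<unsigned int>(value);  // upper bits are always 0
}
```

Passing -1 to widen_signed yields an unsigned int with every bit set, while passing 0xFF (the same eight bits on a two's complement machine) to widen_unsigned yields 255.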

Summary

The char type serves two different roles. First, it's a character representation. Second, it acts like a 1-byte int. The second role is useful when comparing two chars, when doing math on chars, and when casting to an int type.
