Chapter 8. Strings

Variables with character datatypes store text and are manipulated by character functions. Because character strings are "free-form," there are few rules concerning their content. For example, you can store numbers and letters, as well as any combination of special characters, in a character-type variable. There are, however, several different kinds of character datatypes, each of which serves a particular purpose.
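To make the free-form nature of character data concrete, here is a minimal PL/SQL sketch; the variable name and length are our own, chosen purely for illustration:

    DECLARE
       flexible VARCHAR2(40);
    BEGIN
       flexible := 'Words';                -- letters
       flexible := '42';                   -- digits, stored as text
       flexible := 'P.O. #102, $15.95!';   -- any mix of special characters
    END;

Each assignment is legal; the variable places no restriction on content beyond its maximum length.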

CLOB (character large object) and LONG, while arguably character types, cannot be used in the same manner as the character types discussed in this chapter, and are more usefully thought of as large object types. We discuss large object types in Chapter 12.


8.1 The Impact of Character Sets

Working with strings used to be a short and simple topic. However, as applications have grown more international in nature, Oracle's support for different character sets, especially Unicode, has expanded, and a good understanding of character set issues is now almost a necessity when working with strings.

8.1.1 What Is a Character Set?

A character set is a mapping between a set of characters meaningful to humans and a set of bit sequences used to represent those characters in a computer or on a disk. 7-bit ASCII is a commonly used character set in the United States. Each 7-bit ASCII character is represented as a sequence of seven bits within an eight-bit byte. The letter G, then, is represented as 0100 0111. Look at that same string of bits as a number, and you end up with 0x47 (hexadecimal) or 71 (decimal). With seven bits, you can represent only 128 characters, enough to handle American English and little else. The characters that a character set's designers choose to represent, together with their underlying numeric values, form the definition of a character set.
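You can see this mapping at work with Oracle's built-in ASCII and CHR functions, which convert between a character and its underlying numeric value. A minimal sketch, assuming server output is enabled (SET SERVEROUTPUT ON in SQL*Plus):

    BEGIN
       -- ASCII returns the numeric value of a character;
       -- CHR performs the inverse conversion.
       DBMS_OUTPUT.PUT_LINE(ASCII('G'));   -- prints 71
       DBMS_OUTPUT.PUT_LINE(CHR(71));      -- prints G
    END;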

7-bit ASCII was one of the first character sets to be defined, and it's very U.S.-centric. By that we mean that the people who defined 7-bit ASCII did not choose to represent any letters needed by languages other than English. As a result, many, many other character sets have been defined by various standards organizations and companies in order to handle characters used by other languages. Many of these character sets are supersets of ASCII that make use of the eighth bit to represent an additional 128 characters. For example, the Microsoft Windows Code Page 1251 8-bit Latin/Cyrillic character set is compatible with ASCII, but also represents Cyrillic characters.
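Which of these character sets your own database uses is recorded in the data dictionary. As a quick check, you can query the NLS_DATABASE_PARAMETERS view; the value returned (for example, WE8MSWIN1252 or AL32UTF8) depends on how the database was created:

    SELECT value
      FROM nls_database_parameters
     WHERE parameter = 'NLS_CHARACTERSET';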

A set of 256 characters is enough for most Western character sets, such as those based on the Latin or Cyrillic alphabets. However, 256 characters is nowhere near enough to represent Asian languages such as Japanese, Korean, and Chinese, which have far more than 256 characters. Consequently, character sets for those languages typically use two or more bytes per character. Such character sets are referred to as multibyte character sets.
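In a multibyte character set, the number of characters in a string and the number of bytes it occupies are no longer the same, which is why Oracle provides byte-oriented variants such as LENGTHB alongside the character-oriented LENGTH. A small sketch; the sample string is our own, and the byte count reported depends on your database character set:

    DECLARE
       v_name VARCHAR2(30) := 'abc';   -- substitute a multibyte string to see a difference
    BEGIN
       DBMS_OUTPUT.PUT_LINE(LENGTH(v_name));    -- number of characters
       DBMS_OUTPUT.PUT_LINE(LENGTHB(v_name));   -- number of bytes; exceeds LENGTH for multibyte data
    END;

For a single-byte character set, the two functions return the same value.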

Unicode is a relative newcomer on the character set scene. Unicode refers to a class of character sets that have been developed to incorporate all known characters into one character set. Different Unicode character sets are available, but each encompasses the same, or almost the same, universal set of characters.
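One practical consequence: in a database that supports Unicode, you can construct characters by code point with the built-in UNISTR function, regardless of what your keyboard can produce. A minimal sketch; the code point shown (U+0413, Cyrillic capital letter Ghe) is our own example:

    SELECT UNISTR('\0413') FROM dual;   -- returns the character at code point U+0413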

For authoritative information on Unicode, visit http://unicode.org.