According to Davis, Unicode encoding consists of three elements: the character, its properties, and textual descriptions. Contrary to popular belief, the character is not the glyph of typography -- in other words, not the physical appearance of a character in any given font -- but the "abstract character" behind all representations of that character. "A given character like an 'A' could have many different shapes," Davis explains. "What Unicode is encoding is the abstract character 'A,' which can be represented with lots and lots of shapes."
This abstract character is defined by such properties as whether it has upper- and lower-case forms, and how it interacts with other characters at a line break. "The properties are driven by the question of what computers need to do with the characters," Davis says, such as using them in a search or displaying them properly on the Web.
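As a rough illustration of what such properties look like to a program, the sketch below uses Python's standard `unicodedata` module to look up a few of them. The helper name `describe` and the choice of properties are assumptions made for this example, not anything Davis or the standard prescribes.

```python
import unicodedata

def describe(ch):
    """Return a few Unicode properties of a single character.

    'describe' is a hypothetical helper written for this illustration.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),        # "Lu" = uppercase letter
        "lowercase": ch.lower(),                      # case mapping
        "bidi_class": unicodedata.bidirectional(ch),  # "L" = left-to-right
    }

info = describe("A")
print(info["name"])      # LATIN CAPITAL LETTER A
print(info["category"])  # Lu
```

None of these values depends on a font: they describe the abstract character, whatever glyph ends up on screen.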
In turn, the textual description comprises such details as "how the properties and characters should be used" in each encoding form of the Unicode standard, such as UTF-8 or UTF-16. The descriptions are based both on observations of how characters are used in writing and on rules developed so that programs can handle each character in any given circumstance. These descriptions have to be written separately for each encoding of Unicode.
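To make the difference between encodings concrete, here is a minimal sketch showing one abstract character stored as different byte sequences in UTF-8 and UTF-16. The euro sign is an arbitrary choice for the example, and big-endian UTF-16 is used so that no byte-order mark is added.

```python
ch = "\u20ac"  # EURO SIGN, chosen arbitrarily for this illustration

utf8_bytes = ch.encode("utf-8")       # three bytes
utf16_bytes = ch.encode("utf-16-be")  # two bytes, big-endian, no byte-order mark

print(utf8_bytes.hex())   # e282ac
print(utf16_bytes.hex())  # 20ac

# Both byte sequences decode back to the same abstract character.
same = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be")
print(same)  # True
```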
In order for Unicode to remain a reliable standard, both properties and descriptions need to be as complete as possible. Davis knows the alternative from the pre-Unicode days of the 1980s, when he co-authored the Japanese and Hebrew versions of the Macintosh operating system, each of which required separate development of text rendering. "We were really working with a Tower of Babel with the way that computers were handling text," he recalls. "It was a mishmash because you couldn't depend on a given character having the same code. Or one character could have two different codes in the computer, or the same code could have two different meanings."
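The problem Davis describes is easy to reproduce with legacy code pages that Python still ships. The sketch below is only an illustration: the particular code pages (Latin-1, Windows-1251, and Mac Roman) and the sample byte are assumptions chosen for the example, not cases Davis cites.

```python
# The same code, two different meanings: byte 0xE9 under two legacy code pages.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")  # LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = raw.decode("cp1251")   # CYRILLIC SMALL LETTER SHORT I

# One character, two different codes: e-acute under Latin-1 and Mac Roman.
e_acute = "\u00e9"
latin1_code = e_acute.encode("latin-1")      # b'\xe9'
macroman_code = e_acute.encode("mac_roman")  # b'\x8e'

print(as_latin1 == as_cp1251)        # False
print(latin1_code == macroman_code)  # False
```

In Unicode, each of these characters has exactly one code point, regardless of platform.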
The Unicode standard grew out of two separate efforts in the mid-1980s: work at Xerox on mapping Chinese and Japanese characters to a common standard, and discussions at Apple about a universal character set. In 1988, the two groups began working together, and by the end of the year they were referring to their joint efforts as "Unicode." Over the next few years, other major companies joined the discussions, and in 1991, the Unicode Consortium was formed and the first version of the standard was released. Since then, Unicode has been criticized for its implementation of some Asian languages -- the common mapping of Chinese, Japanese, and Korean in particular has been called incomplete and ethnocentric -- but, over the last two decades, Unicode has in general helped bring a measure of stability to character encoding, partly through thoroughness and partly through its cooperation with other encoding standards, such as ISO-8859.
Still, the process of defining the properties and descriptions of alphabets and symbol sets remains ongoing, even for those sets already incorporated into Unicode. "Properties can change for a number of reasons," Davis says. "One reason is that, as we find out more information about characters, especially characters used to write languages that are less common or less well-known, we have to modify the properties of characters. Some of it is that, as people get to have a better idea of how characters are used in processing, it becomes clear that a character should have one property instead of another. In the vast majority of cases, the properties assigned to a character are stable, but some of them will change." Davis likens the process to biological classification, in which the characteristics and relations of each species can change as new information becomes available.
Another reason for revisions of the standards is the errata that accumulate after each version. Unicode publishes errors as they are observed, but makes no attempt to correct them until the next minor or major version. Other changes have to be made for security, as in an earlier version, when multiple renderings of a forward slash in UTF-8 might have permitted crackers to navigate directory structures, or duplicates in different character sets made spoofing of addresses and sites easier. Still another reason is to bring the increasingly small group of languages not covered by Unicode into the standard. Moreover, while all these priorities are being juggled, backward compatibility with earlier versions remains a high priority, although it is not always possible.
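Both security problems can be sketched in a few lines. The article does not say which byte sequences were involved, so the example assumes the two-byte sequence 0xC0 0xAF, a commonly cited "overlong" form of the forward slash, and uses a Latin and a Cyrillic letter as an assumed example of look-alike characters. The helper `decodes_as_utf8` is hypothetical.

```python
import unicodedata

def decodes_as_utf8(data):
    """Return True if the bytes are accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Assumed example: an overlong two-byte form of "/" (normally the single byte 0x2F).
overlong_slash = b"\xc0\xaf"
print(decodes_as_utf8(b"/"))             # True
print(decodes_as_utf8(overlong_slash))   # False: a strict decoder rejects it

# Assumed example of look-alikes: Latin "a" and Cyrillic "a" are distinct code points.
latin_a = "a"
cyrillic_a = "\u0430"
print(latin_a == cyrillic_a)             # False
print(unicodedata.name(latin_a))         # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic_a))      # CYRILLIC SMALL LETTER A
```

A decoder that rejects the overlong form leaves only one way to write a slash, which makes path checks harder to bypass; the look-alike letters show why two addresses that appear identical on screen may not be the same string.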
Specific changes in Unicode 5.0 include:
"It's not radical," Davis says, summarizing the new version of the standard. "At this point, we don't want anything radical." All the same, he is already looking ahead to the rendering and potential security issues raised by international domain names (ones entered with characters other than the standard English ones available in ASCII encoding), and the encoding of the last few languages that remain outside the standard.
"For me, it's a fascinating area," Davis says. "I'm really glad I got into it. Trying to figure how people use their languages and how we can bring the advantages of computation -- and that includes everything from personal computers to mobile telephones -- in their own languages all the way around the world -- that has really been the goal of the Unicode Consortium."
The Unicode 5.0 Standard is published by Addison-Wesley Professional and can be ordered from the Unicode Consortium site.
Bruce Byfield is a course designer and instructor, and a computer journalist who writes regularly for NewsForge, Linux.com and IT Manager's Journal.