About the Unicode Standard Version 5.0 Book

Hardcover: 1472 pages
Author: The Unicode Consortium
Publisher: Addison-Wesley Professional; 5th edition (November 3, 2006)
Language: English
Size: 7.5 x 9.5 in (19 x 24 cm)
ISBN: 0321480910
Our Price: $52.99

The Unicode Standard, Version 5.0 is the one reference all developers, programmers, and managers who produce international software must have. With greatly improved figures and tables, and significantly revised text, Version 5.0 is much more accessible and delivers a stable, practical character processing model.
Read what Tim Berners-Lee, Brian Carpenter, Vint Cerf, Bill Gates, Steve Mills, Larry Page, Joel Spolsky, and many others have to say about Unicode, and see the Back Cover and Foreword for more about the importance of this work.

Back Cover of The Unicode Standard, Version 5.0

“Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years.”

—Donald E. Knuth, The Art of Computer Programming

“For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users.”

—Bill Gates, Chairman, Microsoft Corporation

“The path W3C follows to making text on the Web truly global is Unicode.”

—Sir Tim Berners-Lee, KBE, Web Inventor and Director of the World Wide Web Consortium (W3C)

“Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world.”

—James Gosling, Inventor of Java, Sun Microsystems, Inc.

These and other software luminaries recognize that Unicode has become an indispensable tool for supporting an increasingly global marketplace (see Acclaim for Unicode for more testimony). A comprehensive system of standards for representing alphabets throughout the world, Unicode is the basis for modern programming—Windows, XML, Python, PERL, Mac OS, Linux—and every major search engine and browser in operation today.

This new edition of Unicode's official reference manual has been substantially updated to document the latest revisions to the Unicode Standard, with hundreds of pages of new information. It includes major revisions to text, figures, tables, definitions, and conformance clauses, and provides clear and practical answers to common questions. For the first time, the book contains the Unicode Standard Annexes, which specify vital processes such as text normalization and identifier parsing.

New to Unicode Version 5.0

A stable foundation for Unicode Security Mechanisms

Property data for the Unicode Collation Algorithm and Common Locale Data Repository

Improvements to the Unicode Encoding Model for UTF-8

Rigorous stability of case folding and identifiers for improved interoperability and backward compatibility—enabling additional new ways to optimize code

A systematic framework for improved text processing for greater reliability—covering combining characters, Unicode strings, line breaking, and segmentation

These improvements are so important that Version 5.0 is the basis for Microsoft's Vista generation of operating systems, and is included in upgrade plans for Google, Yahoo!, and ICU, to name but a few.

This is the one book all developers using Unicode must have.

Foreword to The Unicode Standard, Version 5.0

Without much fanfare, Unicode has completely transformed the foundation of software and communications over the past decade. Whenever you read or write anything on a computer, you’re using Unicode. Whenever you search on Google, Yahoo!, MSN, Wikipedia, or many other Web sites, you’re using Unicode. Unicode 5.0 marks a major milestone in providing people everywhere the ability to use their own languages on computers.

We began Unicode with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard. Those existing legacy character encodings were both incomplete and inconsistent: Two encodings could use the same internal codes for two different characters and use different internal codes for the same characters; none of the encodings handled any more than a small fraction of the world’s languages. Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption. Programs were hard-coded to support particular encodings, making development of international versions expensive, testing a nightmare, and support costs prohibitive. As a result, product launches in foreign markets were expensive and late—unsatisfactory both for companies and their customers. Developing countries were especially hard-hit; it was not feasible to support smaller markets. Technical fields such as mathematics were also disadvantaged; they were forced to use special fonts to represent arbitrary characters, but when those fonts were unavailable, the content became garbled.
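The conflict described above can be illustrated with a short Python sketch. It is not part of the foreword; the three code pages are chosen here only as examples, and the variable names are hypothetical. It shows one byte value decoding to three different characters, and one character encoding to three different byte values.

```python
# Illustrative only: the choice of legacy code pages is an assumption
# made for this example, not something the foreword specifies.

# The same internal code (byte 0xE9) means a different character
# in each legacy encoding.
raw = bytes([0xE9])
same_byte = {enc: raw.decode(enc) for enc in ("latin-1", "cp1251", "cp437")}

# The same character ("é") is stored under a different internal code
# in each legacy encoding.
same_char = {enc: "é".encode(enc) for enc in ("latin-1", "cp437", "mac_roman")}

for enc, ch in same_byte.items():
    print(f"0xE9 in {enc}: {ch!r} (U+{ord(ch):04X})")
for enc, data in same_char.items():
    print(f"'é' in {enc}: {data!r}")
```

A program that guesses the wrong code page for the byte 0xE9 will show a Cyrillic or Greek letter where an accented Latin letter was intended, which is the kind of corruption the paragraph describes.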

Unicode changed that situation radically. Now, for all text, programs only need to use a single representation—one that supports all the world’s languages. Programs could be easily structured with all translatable material separated from the program code and put into a single representation, providing the basis for rapid deployment in multiple languages. Thus, multiple-language versions of a program can be developed almost simultaneously at a much smaller incremental cost, even for complex programs like Microsoft Office or OpenOffice.

The assignment of characters is only a small fraction of what the Unicode Standard and its associated specifications provide. They give programmers extensive descriptions and a vast amount of data about how characters function: how to form words and break lines; how to sort text in different languages; how to format numbers, dates, times, and other elements appropriate to different languages; how to display languages whose written form flows from right to left, such as Arabic and Hebrew, or whose written form splits, combines, and reorders, such as languages of South Asia; and how to deal with security concerns regarding the many “look-alike” characters from alphabets around the world. Without the properties, algorithms, and other specifications in the Unicode Standard and its associated specifications, interoperability between different implementations would be impossible.
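As a small illustration of the character data mentioned above, the following Python sketch queries two properties through the standard `unicodedata` module. This is an assumption-laden example, not the book's own material: Python bundles whichever Unicode version its release shipped with (not necessarily 5.0), and the variable names are hypothetical.

```python
import unicodedata

# Look-alike characters: Latin "a" and Cyrillic "а" usually render the
# same way but are distinct code points with distinct names.
latin_a = "a"
cyrillic_a = "\u0430"
looks_same_but_differs = latin_a != cyrillic_a
names = (unicodedata.name(latin_a), unicodedata.name(cyrillic_a))

# Bidirectional class: the property a display engine consults to decide
# whether a run of text flows left to right or right to left.
bidi = {
    "Latin A": unicodedata.bidirectional("A"),
    "Hebrew alef": unicodedata.bidirectional("\u05d0"),
    "Arabic alef": unicodedata.bidirectional("\u0627"),
}

print(looks_same_but_differs, names)
print(bidi)
```

Properties such as these are what let independent implementations agree on how the same text should be compared and displayed.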

With the rise of the Web, a single representation for text became absolutely vital for seamless global communication. Thus the textual content of HTML and XML is defined in terms of Unicode—every program handling XML must use Unicode internally. The search engines all use Unicode for good reason; even if a Web page is in a legacy character encoding, the only effective way to index that page for searching is to translate it into the lingua franca, Unicode. All of the text on the Web thus can be stored, searched, and matched with the same program code. Since all of the search engines translate Web pages into Unicode, the most reliable way to have pages searched is to have them be in Unicode in the first place.
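The indexing argument above can be sketched in Python as follows. This is a minimal example under stated assumptions: each page's encoding is taken as already known, the sample pages and the `contains` helper are hypothetical, and a real search engine does far more (encoding detection, normalization, tokenization).

```python
# Hypothetical sample pages: the same word stored in three encodings.
pages = [
    ("latin-1", "café".encode("latin-1")),
    ("utf-8", "café".encode("utf-8")),
    ("mac_roman", "café".encode("mac_roman")),
]

def contains(query: str, encoding: str, data: bytes) -> bool:
    """Decode the page to Unicode text, then match with ordinary string code."""
    text = data.decode(encoding)
    return query in text

# The raw bytes differ from page to page, so byte-level matching would
# need one search routine per encoding.
raw_bytes_differ = len({data for _, data in pages}) == len(pages)

# After decoding, one routine handles every page.
hits = [contains("café", enc, data) for enc, data in pages]

print(raw_bytes_differ, hits)
```

Once every page is decoded to Unicode, the matching code no longer needs to know which encoding the page originally used.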

This edition of The Unicode Standard, Version 5.0, supersedes and obsoletes all previous versions of the standard. The book is smaller in size, less expensive, and yet has hundreds of pages of new material and hundreds more of revised material. Like any human enterprise, Unicode is not without its flaws, of course. This book will help you work around some of the “gotchas” introduced into Unicode over the course of its development. Importantly, it will help you to understand which features may change in the future, and which cannot, so that you can appropriately optimize your implementations. You will also find a wealth of other information on the Unicode Web site (www.unicode.org). If you are interested in having a voice in determining directions for future development of Unicode, or want to follow closely the ongoing work, you will find information there on joining the Consortium.

What you have in your hands is the culmination of many years of experience from experts around the globe. I am sure you will find it very useful.

MARK DAVIS, Ph.D.
President
The Unicode Consortium