=head1 NAME

perluniintro - Perl Unicode introduction

=head1 DESCRIPTION

This document gives a general idea of Unicode and how to use Unicode
in Perl.  See L</Further Resources> for references to more in-depth
treatments of Unicode.

=head2 Unicode

Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.

Unicode and ISO/IEC 10646 are coordinated standards that unify
almost all other modern character set standards,
covering more than 80 writing systems and hundreds of languages,
including all commercially-important modern languages.  All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded.  The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 6.0 in October 2010.

A Unicode I<character> is an abstract entity.  It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text, and it does not generally define fonts or other graphical
layout details.  Unicode operates on characters and on text built from
those characters.

Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
SMALL LETTER ALPHA> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively.  These unique numbers are called
I<code points>.  A code point is essentially the position of the
character within the set of all possible Unicode characters, and thus in
Perl, the term I<ordinal> is often used interchangeably with it.
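
As a minimal sketch of the code point idea, the built-in C<ord> and C<chr>
functions convert between a character and its code point.  The variable
names below are illustrative only.

```perl
use strict;
use warnings;

# ord() returns the code point (ordinal) of the first character of a string.
my $cp_A = ord("A");            # LATIN CAPITAL LETTER A

# chr() goes the other way: from a code point to a one-character string.
my $alpha    = chr(0x03B1);     # GREEK SMALL LETTER ALPHA
my $cp_alpha = ord($alpha);

# Print only the numbers, so no wide characters reach STDOUT.
printf "A     has code point 0x%04X\n", $cp_A;
printf "alpha has code point 0x%04X\n", $cp_alpha;
```

Only the hexadecimal numbers are printed here, which sidesteps any output
encoding questions.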

The Unicode standard prefers using hexadecimal notation for the code
points.  If numbers like C<0x0041> are unfamiliar to you, take a peek
at a later section, L</"Hexadecimal Notation">.  The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
hexadecimal code point and the normative name of the character.
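
The C<U+> notation can be reproduced in Perl with C<sprintf> and the core
C<charnames> module.  This is only a sketch: C<u_plus> is a hypothetical
helper name, not part of any module.

```perl
use strict;
use warnings;
use charnames ();   # gives access to charnames::viacode()

# Format a code point in the style used by the Unicode standard:
# "U+", at least four hexadecimal digits, then the character name.
# u_plus() is a hypothetical helper written for this illustration.
sub u_plus {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);
    $name = '(no name)' unless defined $name;
    return sprintf "U+%04X %s", $cp, $name;
}

print u_plus(0x0041), "\n";
print u_plus(0x03B1), "\n";
print u_plus(hex("263A")), "\n";   # hex() parses a hexadecimal string
```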

Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.
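
In Perl, properties can be tested with C<\p{...}> in a regular expression,
and the case operations are available as C<uc> and C<lc>.  The following
sketch uses a handful of sample characters chosen for illustration.

```perl
use strict;
use warnings;

my $A      = chr(0x0041);   # LATIN CAPITAL LETTER A
my $alpha  = chr(0x03B1);   # GREEK SMALL LETTER ALPHA
my $seven  = chr(0x0037);   # DIGIT SEVEN
my $arabic = chr(0x0663);   # ARABIC-INDIC DIGIT THREE
my $bang   = chr(0x0021);   # EXCLAMATION MARK

# Each test asks about a property of the character, not about its name.
my $A_is_upper      = $A      =~ /\A\p{Uppercase}\z/ ? 1 : 0;
my $alpha_is_lower  = $alpha  =~ /\A\p{Lowercase}\z/ ? 1 : 0;
my $seven_is_digit  = $seven  =~ /\A\p{Nd}\z/        ? 1 : 0;
my $arabic_is_digit = $arabic =~ /\A\p{Nd}\z/        ? 1 : 0;
my $bang_is_punct   = $bang   =~ /\A\p{Punct}\z/     ? 1 : 0;

# Case mapping works beyond ASCII: uppercasing alpha yields
# U+0391 GREEK CAPITAL LETTER ALPHA.
my $upper_alpha = uc $alpha;
printf "uc(U+%04X) is U+%04X\n", ord($alpha), ord($upper_alpha);
```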

A Unicode I<logical> "character" can actually consist of more than one internal
I<actual> "character" or code point.  For Western languages, this is adequately
modelled by a I<base character> (like C<LATIN CAPITAL LETTER A>) followed
by one or more I<modifiers> (like C<COMBINING ACUTE ACCENT>).  This sequence of
base character and modifiers is called a I<combining character
sequence>.  Some non-Western languages require more complicated
models, so Unicode created the I<grapheme cluster> concept, which was
later further refined into the I<extended grapheme cluster>.  For
example, a Korean Hangul syllable is considered a single logical
character, but most often consists of three actual
Unicode characters: a leading consonant followed by an interior vowel followed
by a trailing consonant.
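
The regular expression escape C<\X> matches one extended grapheme cluster,
so it can be used to check whether a sequence of code points forms a single
logical character.  The strings below are sample data chosen for this
sketch.

```perl
use strict;
use warnings;

# Base character plus one modifier:
# LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT.
my $a_acute = "\x{0041}\x{0301}";

# A Hangul syllable written as three conjoining jamo: leading consonant
# U+1112, interior vowel U+1161, trailing consonant U+11AB.
my $hangul = "\x{1112}\x{1161}\x{11AB}";

# \X matches exactly one extended grapheme cluster, so anchoring it at
# both ends asks "is this whole string one logical character?".
my $a_is_one_cluster      = $a_acute =~ /\A\X\z/ ? 1 : 0;
my $hangul_is_one_cluster = $hangul  =~ /\A\X\z/ ? 1 : 0;
my $ab_is_one_cluster     = "AB"     =~ /\A\X\z/ ? 1 : 0;

print "A + acute is one cluster:  $a_is_one_cluster\n";
print "Hangul jamo are one cluster: $hangul_is_one_cluster\n";
print "'AB' is one cluster:       $ab_is_one_cluster\n";
```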

Whether to call these extended grapheme clusters "characters" depends on your
point of view. If you are a programmer, you probably would tend towards seeing
each element in the sequences as one unit, or "character".  However, from
the user's point of view, the whole sequence could be seen as one
"character" since that's probably what it looks like in the context of the
user's language.  In this document, we take the programmer's point of
view: one "character" is one Unicode code point.
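
The two points of view give different counts for the same string.  In the
sketch below, C<length> counts code points (the programmer's view), while
counting C<\X> matches approximates what a user would perceive.  The sample
string is an assumption made for illustration.

```perl
use strict;
use warnings;

# Two user-perceived characters: A with a combining acute accent,
# followed by a Hangul syllable spelled with three jamo.
my $str = "\x{0041}\x{0301}" . "\x{1112}\x{1161}\x{11AB}";

# Programmer's view: one "character" is one code point.
my $code_points = length $str;

# User's view: count extended grapheme clusters instead.
my $clusters = () = $str =~ /\X/g;

print "code points: $code_points, grapheme clusters: $clusters\n";
```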

For some combinations of base character and modifiers, there are
I<precomposed> characters.  There is a single character equivalent, for
example, for the sequence C<LATIN CAPITAL LETTER A> followed by
C<COMBINING ACUTE ACCENT>.  It is called C<LATIN CAPITAL LETTER A WITH
ACUTE>.  These precomposed characters are, however, only available for
some combinations, and are mainly meant to support round-trip
conversions between Unicode and legacy standards (like ISO 8859).  Using
sequences, as Unicode does, allows for needing fewer basic building blocks
(code points) to express many more potential grapheme clusters.  To
support conversion between equivalent forms, various I<normalization
forms> are also defined.  Thus, C<LATIN CAPITAL LETTER A WITH ACUTE> is
in I<Normalization Form Composed> (abbreviated NFC), and the sequence
C<LATIN CAPITAL LETTER A> followed by C<COMBINING ACUTE ACCENT>
represents the same character in I<Normalization Form Decomposed> (NFD).
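
Conversion between the two forms can be sketched with the core
L<Unicode::Normalize> module and its C<NFC> and C<NFD> functions.  The
variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $decomposed  = "\x{0041}\x{0301}";  # A followed by COMBINING ACUTE ACCENT
my $precomposed = "\x{00C1}";          # LATIN CAPITAL LETTER A WITH ACUTE

# As plain strings the two spellings differ, code point by code point.
my $equal_as_is = $decomposed eq $precomposed ? 1 : 0;

# NFC composes the sequence into the precomposed character;
# NFD decomposes the precomposed character into the sequence.
my $nfc = NFC($decomposed);
my $nfd = NFD($precomposed);

# After normalizing both sides to the same form, they compare equal.
my $equal_normalized = NFC($decomposed) eq NFC($precomposed) ? 1 : 0;

printf "NFC: %d code point(s), NFD: %d code point(s)\n",
    length($nfc), length($nfd);
```

Normalizing both operands to the same form before comparing is a common
way to treat equivalent spellings as equal.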

Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character".  The same character could
be represented differently in several legacy encodings.  The
converse is not true: some code points do not have an assigned
character.  Firstly, there are unallocated code points within
otherwise used blocks.  Secondly, there are special Unicode control
characters that do not represent true characters.
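
Both kinds of code point can be probed with general category properties.
This sketch assumes that U+0378 is still unallocated in the Unicode version
bundled with your perl, which may change in a future release; U+200E
LEFT-TO-RIGHT MARK serves as an example of an invisible format control.

```perl
use strict;
use warnings;
use charnames ();

# Assumption: U+0378 lies inside the Greek block but has no character
# assigned in the Unicode version that ships with this perl.
my $gap = 0x0378;
my $gap_is_unassigned = chr($gap) =~ /\A\p{Cn}\z/ ? 1 : 0;   # Cn = Unassigned
my $gap_name          = charnames::viacode($gap);            # undef if unnamed

# U+200E LEFT-TO-RIGHT MARK has a code point and a name, but it only
# influences layout; its general category is Cf (Format).
my $lrm_is_format = chr(0x200E) =~ /\A\p{Cf}\z/ ? 1 : 0;

printf "U+%04X unassigned: %d, U+200E is a format control: %d\n",
    $gap, $gap_is_unassigned, $lrm_is_format;
```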

When Unicode was first conceived, it was thought that all the world's
characters could be represented using a 16-bit word; that is, a maximum of
C<0x10000> (or 65,536) characters would be needed, from C<0x0000> to
C<0xFFFF>.  This soon proved to be wrong, and since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and Unicode 3.1 (March 2001) defined the first characters above C<0xFFFF>.
The first C<0x10000> characters are called the I<Plane 0>, or the
I<Basic Multilingual Plane> (BMP).  With Unicode 3.1, 17 (yes,
seventeen) planes in all were defined--but they are nowhere near full of
defined characters, yet.
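
Since each plane holds C<0x10000> code points, the plane number is simply
the code point shifted right by 16 bits.  A minimal sketch, where
C<plane_of> and C<in_bmp> are hypothetical helper names:

```perl
use strict;
use warnings;

# Return the plane (0 through 16) that a code point belongs to.
# plane_of() and in_bmp() are hypothetical helpers for this illustration.
sub plane_of {
    my ($cp) = @_;
    die sprintf("0x%X is outside the Unicode range\n", $cp)
        if $cp < 0 || $cp > 0x10FFFF;
    return $cp >> 16;    # each plane spans 0x10000 code points
}

# Plane 0 is the Basic Multilingual Plane.
sub in_bmp {
    my ($cp) = @_;
    return plane_of($cp) == 0 ? 1 : 0;
}

printf "U+%04X is in plane %d\n", $_, plane_of($_)
    for 0x0041, 0xFFFF, 0x10000, 0x10FFFF;
```

The highest code point, C<0x10FFFF>, falls in plane 16, which gives
seventeen planes in total when counting from plane 0.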

When a new language is being encoded, Unicode generally will choose a
C<block> of consecutive unallocated code points for its characters.  So
far, the number of code points in these blocks has always been evenly
divisible by 16.  Extras in a block, not currently needed, are left
unallocated, for future growth.  But there have been occasions when
a later release needed more code points than the available extras, and a
new block had to be allocated somewhere else, not contiguous to the initial
one, to handle the overflow.  Thus, it became apparent early on that
"block" wasn't an adequate organizing principle, and so the C<Script>