=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

WARNING: The implementation of Unicode support in Perl is incomplete.
Expect sudden and unannounced changes!

Beginning with version 5.6, Perl uses logically wide characters to
represent strings internally.  This internal representation of strings
uses the UTF-8 encoding.

In future, Perl-level operations will expect to work with characters
rather than bytes, in general.

However, Perl v5.6 aims to provide a safe migration path from byte
semantics to character semantics for programs.  Earlier versions of
Perl allowed byte semantics in Perl operations, because characters
were represented internally as bytes.  To preserve compatibility with
those versions, byte semantics will continue to be in effect until the
C<utf8> pragma is used in the C<main> package, or the C<$^U> global
flag is explicitly set.

Under character semantics, many operations that formerly operated on
bytes change to operating on characters.  For ASCII data this makes
no difference, because UTF-8 stores ASCII in single bytes, but for
any character greater than C<chr(127)>, the character is stored in
a sequence of two or more bytes, all of which have the high bit set.
But by and large, the user need not worry about this, because Perl
hides it from the user.  A character in Perl is logically just a number
ranging from 0 to 2**32 or so.  Larger characters encode to longer
sequences of bytes internally, but again, this is just an internal
detail which is hidden at the Perl level.

The C<byte> pragma can be used to force byte semantics in a particular
lexical scope.  See L<byte>.
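
As a minimal sketch of the difference between the two views of a string,
the example below counts the same one-character string both ways.  It
assumes the pragma is loadable under the name C<bytes>, which is the
spelling found in released Perls and is an assumption relative to the
text above; the variable names are purely illustrative.

```perl
use strict;
use warnings;
use utf8;    # the text above asks for this when characters exceed 255

# One logical character: U+263A WHITE SMILING FACE.
my $smiley = "\x{263A}";

# Character semantics: the string is one character long.
my $chars = length($smiley);

my $bytes;
{
    # ASSUMPTION: the byte-semantics pragma is named "bytes" here.
    use bytes;                       # byte semantics in this scope only
    $bytes = length($smiley);        # counts the internal UTF-8 bytes
}

print "$chars character, $bytes bytes\n";
```

Only numbers are printed, so the sketch does not depend on how the
output handle treats wide characters.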

The C<utf8> pragma is a compatibility device that enables recognition
of UTF-8 in literals encountered by the parser.  It is also used
for enabling some experimental Unicode support features.  Note that
this pragma is only required until a future version of Perl in which
character semantics will become the default.  This pragma may then
become a no-op.  See L<utf8>.
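
The effect on literals can be sketched as follows.  This is only an
illustration: it assumes the source file itself is saved as UTF-8 and
that C<no utf8> may be used to switch the pragma off in a nested scope.
The variable names are hypothetical.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);
{
    use utf8;                        # literal is parsed as UTF-8 characters
    $with_pragma = length("☺");
}
{
    no utf8;                         # literal is taken as the raw source bytes
    $without_pragma = length("☺");
}

print "with utf8: $with_pragma, without utf8: $without_pragma\n";
```

The same literal is written twice; only the pragma in effect where it is
parsed differs.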

Character semantics have the following effects:

=over 4

=item *

Strings and patterns may contain characters that have an ordinal value
larger than 255.  In Perl v5.6, this is only enabled if the lexical
scope has a C<use utf8> declaration (due to compatibility needs) but
future versions may enable this by default.

Presuming you use a Unicode editor to edit your program, such characters
will typically occur directly within the literal strings as UTF-8
characters, but you can also specify a particular character with an
extension of the C<\x> notation.  UTF-8 characters are specified by
putting the hexadecimal code within curlies after the C<\x>.  For instance,
a Unicode smiley face is C<\x{263A}>.  A character in the Latin-1 range
(128..255) should be written C<\x{ab}> rather than C<\xab>, since the
former will turn into a two-byte UTF-8 code, while the latter will
continue to be interpreted as generating an 8-bit byte rather than a
character.  In fact, if C<-w> is turned on, the latter form will produce
a warning that you might be generating invalid UTF-8.

=item *

Identifiers within the Perl script may contain Unicode alphanumeric
characters, including ideographs.  (You are currently on your own when
it comes to using the canonical forms of characters--Perl doesn't (yet)
attempt to canonicalize variable names for you.)

This also needs C<use utf8> currently.  [XXX: Why?  High-bit chars were
syntax errors when they occurred within identifiers in previous versions,
so this should be enabled by default.]

=item *

Regular expressions match characters instead of bytes.  For instance,
"." matches a character instead of a byte.  (However, the C<\C> pattern
is provided to force a match of a single byte ("C<char>" in C, hence
C<\C>).)

Unicode support in regular expressions needs C<use utf8> currently.
[XXX: Because the SWASH routines need to be loaded.  And the RE engine
appears to need an overhaul to Unicode by default anyway.]

=item *

Character classes in regular expressions match characters instead of
bytes, and match against the character properties specified in the
Unicode properties database.  So C<\w> can be used to match an ideograph,
for instance.

C<use utf8> is needed to enable this.  See above.

=item *

Named Unicode properties and block ranges may be used as character
classes via the new C<\p{}> (matches property) and C<\P{}> (doesn't
match property) constructs.  For instance, C<\p{Lu}> matches any
character with the Unicode uppercase property, while C<\p{M}> matches
any mark character.  Single-letter properties may omit the braces, so
the latter can also be written C<\pM>.  Many predefined character
classes are available, such as C<\p{IsMirrored}> and C<\p{InTibetan}>.

C<use utf8> is needed to enable this.  See above.

=item *

The special pattern C<\X> matches any extended Unicode sequence
(a "combining character sequence" in Standardese), where the first
character is a base character and subsequent characters are mark
characters that apply to the base character.  It is equivalent to
C<(?:\PM\pM*)>.

C<use utf8> is needed to enable this.  See above.

=item *

The C<tr///> operator translates characters instead of bytes.  It can also
be forced to translate between 8-bit codes and UTF-8 regardless of the
surrounding utf8 state.  For instance, if you know your input is in
Latin-1, you can say:

    use utf8;
    while (<>) {
        tr/\0-\xff//CU;         # latin1 char to utf8
        ...
    }

Similarly you could translate your output with

    tr/\0-\x{ff}//UC;		# utf8 to latin1 char

No, C<s///> doesn't take /U or /C (yet?).

C<use utf8> is needed to enable this.  See above.

=item *

Case translation operators use the Unicode case translation tables
when provided character input.  Note that C<uc()> translates to
uppercase, while C<ucfirst> translates to titlecase (for languages
that make the distinction).  Naturally the corresponding backslash
sequences have the same semantics.

=item *

Most operators that deal with positions or lengths in the string will
automatically switch to using character positions, including C<chop()>,
C<substr()>, C<pos()>, C<index()>, C<rindex()>, C<sprintf()>,
C<write()>, and C<length()>.  Operators that specifically don't switch
include C<vec()>, C<pack()>, and C<unpack()>.  Operators that really
don't care include C<chomp()>, as well as any other operator that
treats a string as a bucket of bits, such as C<sort()>, and the
operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letters "C<c>" and "C<C>" do I<not> change,
since they're often used for byte-oriented formats.  (Again, think
"C<char>" in the C language.)  However, there is a new "C<U>" specifier
that will convert between UTF-8 characters and integers.  (It works
outside of the utf8 pragma too.)

=item *

The C<chr()> and C<ord()> functions work on characters.  This is like
C<pack("U")> and C<unpack("U")>, not like C<pack("C")> and
C<unpack("C")>.  In fact, the latter are how you now emulate
byte-oriented C<chr()> and C<ord()> under utf8.

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
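
Several of the string-level effects listed above can be tried together.
The following is a sketch rather than a definitive test: it uses
C<\x{}> escapes so that it does not depend on the encoding of the source
file, prints only numbers, and picks U+01C6 as an example of a letter
whose uppercase and titlecase forms differ.  All variable names are
illustrative.

```perl
use strict;
use warnings;
use utf8;    # asked for by the text above; harmless where not needed

# "a", U+263A WHITE SMILING FACE, "b": three characters.
my $str = "a\x{263A}b";

my $len      = length($str);               # counts characters
my $where    = index($str, "b");           # a character offset
my $middle   = ord(substr($str, 1, 1));    # code point of the smiley
my $reversed = scalar reverse($str);       # reversed by character

# chr() and ord() round-trip a code point above 255.
my $code = ord(chr(0x263A));

# pack("U") and unpack("U") convert between integers and characters.
my $packed     = pack("U", 0x263A);
my ($unpacked) = unpack("U", $packed);

# uc() yields uppercase, ucfirst() yields titlecase.
my $upper = ord(uc("\x{1C6}"));
my $title = ord(ucfirst("\x{1C6}"));

printf "len=%d index=%d middle=U+%04X code=U+%04X unpacked=U+%04X\n",
    $len, $where, $middle, $code, $unpacked;
printf "upper=U+%04X title=U+%04X\n", $upper, $title;
```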

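The regular expression effects described in the list can be sketched in
the same hedged way.  The code points are chosen only for illustration
(a CJK ideograph, Greek capital and small delta, and a combining acute
accent), and each result is reduced to a 0 or 1 flag so that nothing
wider than ASCII is printed.

```perl
use strict;
use warnings;
use utf8;    # asked for by the text above; harmless where not needed

# "." consumes one whole character, not one byte of its encoding.
my $dot_ok = "\x{263A}" =~ /^.$/ ? 1 : 0;

# \w consults the Unicode properties: U+4E2D is a CJK ideograph.
my $ideograph_ok = "\x{4E2D}" =~ /^\w$/ ? 1 : 0;

# \p{Lu} matches an uppercase letter, \P{Lu} anything that is not one.
my $upper_ok = "\x{394}" =~ /^\p{Lu}$/ ? 1 : 0;   # GREEK CAPITAL LETTER DELTA
my $lower_ok = "\x{3B4}" =~ /^\P{Lu}$/ ? 1 : 0;   # GREEK SMALL LETTER DELTA

# \pM is the short form of \p{M}: U+0301 is a combining mark.
my $mark_ok = "e\x{301}" =~ /^e\pM$/ ? 1 : 0;

# \X takes a base character together with its marks, so "e" plus the
# accent counts once and the trailing "x" counts separately.
my @sequences = "e\x{301}x" =~ /(\X)/g;
my $count = scalar @sequences;

print "dot=$dot_ok w=$ideograph_ok Lu=$upper_ok notLu=$lower_ok ",
      "mark=$mark_ok sequences=$count\n";
```
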
=head1 CAVEATS

As of yet, there is no method for automatically coercing input and
output to some encoding other than UTF-8.  This is planned in the near
future, however.

Whether a piece of data will be treated as "characters" or "bytes"
by internal operations cannot be divined at the current time.

Use of locales with utf8 may lead to odd results.  Currently there is
some attempt to apply 8-bit locale info to characters in the range
0..255, but this is demonstrably incorrect for locales that use
characters above that range (when mapped into Unicode).  It will also
tend to run slower.  Avoidance of locales is strongly encouraged.

=head1 SEE ALSO

L<byte>, L<utf8>, L<perlvar/"$^U">

=cut