X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/f7d1603dbb882894bc7a423058e0334fa88e4614..54e0f05ce4bb904f953dde352028f27b07cb1fdf:/pod/perluniintro.pod

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index 6d11bb7..6a8c07d 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -83,7 +83,7 @@ Because of backward compatibility with legacy encodings, the
 "a unique number for every character" idea breaks down a bit: instead,
 there is "at least one number for every character". The same character
 could be represented differently in several legacy encodings. The
-converse is also not true: some code points do not have an assigned
+converse is not also true: some code points do not have an assigned
 character. Firstly, there are unallocated code points within
 otherwise used blocks. Secondly, there are special Unicode control
 characters that do not represent true characters.
@@ -155,8 +155,8 @@ character set. Otherwise, it uses UTF-8.

 A user of Perl does not normally need to know nor care how Perl
 happens to encode its internal strings, but it becomes relevant when
-outputting Unicode strings to a stream without a PerlIO layer -- one with
-the "default" encoding. In such a case, the raw bytes used internally
+outputting Unicode strings to a stream without a PerlIO layer (one with
+the "default" encoding). In such a case, the raw bytes used internally
 (the native character set or UTF-8, as appropriate for each string)
 will be used, and a "Wide character" warning will be issued if those
 strings contain a character beyond 0x00FF.
@@ -248,7 +248,7 @@ characters:

 Note that both C<\x{...}> and C<\N{...}> are compile-time string
 constants: you cannot use variables in them. if you want similar
-run-time functionality, use C and C.
+run-time functionality, use C and C.
 If you want to force the result to Unicode characters, use the special
 C<"U0"> prefix.
 It consumes no arguments but causes the following bytes
@@ -344,7 +344,8 @@ layer when opening files
 The I/O layers can also be specified more flexibly with
 the C pragma. See L, or look at the following example.

- use open ':encoding(utf8)'; # input/output default encoding will be UTF-8
+ use open ':encoding(utf8)'; # input/output default encoding will be
+ # UTF-8
  open X, ">file";
  print X chr(0x100), "\n";
  close X;
@@ -355,7 +356,8 @@ the C pragma. See L, or look at the following example.
 With the C pragma you can use the C<:locale> layer

  BEGIN { $ENV{LC_ALL} = $ENV{LANG} = 'ru_RU.KOI8-R' }
- # the :locale will probe the locale environment variables like LC_ALL
+ # the :locale will probe the locale environment variables like
+ # LC_ALL
  use open OUT => ':locale'; # russki parusski
  open(O, ">koi8");
  print O chr(0x430); # Unicode CYRILLIC SMALL LETTER A = KOI8-R 0xc1
@@ -432,13 +434,13 @@ its argument so that Unicode characters with code points greater than
 255 are displayed as C<\x{...}>, control characters (like C<\n>) are
 displayed as C<\x..>, and the rest of the characters as themselves:

- sub nice_string {
- join("",
- map { $_ > 255 ? # if wide character...
- sprintf("\\x{%04X}", $_) : # \x{...}
- chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
- sprintf("\\x%02X", $_) : # \x..
- quotemeta(chr($_)) # else quoted or as themselves
+ sub nice_string {
+ join("",
+ map { $_ > 255 ? # if wide character...
+ sprintf("\\x{%04X}", $_) : # \x{...}
+ chr($_) =~ /[[:cntrl:]]/ ? # else if control character ...
+ sprintf("\\x%02X", $_) : # \x..
+ quotemeta(chr($_)) # else quoted or as themselves
  } unpack("W*", $_[0])); # unpack Unicode characters
  }

@@ -553,19 +555,19 @@ L

 Character Ranges and Classes

-Character ranges in regular expression character classes (C)
-and in the C (also known as C) operator are not magically
-Unicode-aware. What this means is that C<[A-Za-z]> will not magically start
-to mean "all alphabetic letters"; not that it does mean that even for
-8-bit characters, you should be using C in that case.
+Character ranges in regular expression bracketed character classes ( e.g.,
+C) and in the C (also known as C) operator are not
+magically Unicode-aware. What this means is that C<[A-Za-z]> will not
+magically start to mean "all alphabetic letters" (not that it does mean that
+even for 8-bit characters; for those, if you are using locales (L),
+use C; and if not, use the 8-bit-aware property C<\p{alpha}>).

-For specifying character classes like that in regular expressions,
-you can use the various Unicode properties--C<\pL>, or perhaps
-C<\p{Alphabetic}>, in this particular case. You can use Unicode
-code points as the end points of character ranges, but there is no
-magic associated with specifying a certain range. For further
-information--there are dozens of Unicode character classes--see
-L.
+All the properties that begin with C<\p> (and its inverse C<\P>) are actually
+character classes that are Unicode-aware. There are dozens of them, see
+L.
+
+You can use Unicode code points as the end points of character ranges, and the
+range will include all Unicode code points that lie between those end points.

 =item *

@@ -607,7 +609,7 @@ Unicode; for that, see the earlier I/O discussion.
 How Do I Know Whether My String Is In Unicode?

 You shouldn't have to care. But you may, because currently the semantics of the
-characters whose ordinals are in the range 128 to 255 is different depending on
+characters whose ordinals are in the range 128 to 255 are different depending on
 whether the string they are contained within is in Unicode or not.
 (See L.)

@@ -622,8 +624,8 @@ string has any characters at all. All the C does is to return
 the value of the internal "utf8ness" flag attached to the C<$string>. If the
 flag is off, the bytes in the scalar are interpreted as a single byte encoding.
 If the flag is on, the bytes in the scalar
-are interpreted as the (multi-byte, variable-length) UTF-8 encoded code
-points of the characters. Bytes added to an UTF-8 encoded string are
+are interpreted as the (variable-length, potentially multi-byte) UTF-8 encoded
+code points of the characters. Bytes added to a UTF-8 encoded string are
 automatically upgraded to UTF-8. If mixed non-UTF-8 and UTF-8 scalars
 are merged (double-quoted interpolation, explicit concatenation, and
 printf/sprintf parameter substitution), the result will be UTF-8 encoded
@@ -648,6 +650,7 @@ the C function:
  use bytes;
  print length($unicode), "\n"; # will also print 2
  # (the 0xC4 0x80 of the UTF-8)
+ no bytes;

 =item *

@@ -730,11 +733,11 @@ or:

 You can find the bytes that make up a UTF-8 sequence with

- @bytes = unpack("C*", $Unicode_string)
+ @bytes = unpack("C*", $Unicode_string)

 and you can create well-formed Unicode with

- $Unicode_string = pack("U*", 0xff, ...)
+ $Unicode_string = pack("U*", 0xff, ...)

 =item *