X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/04e8f31976b33aacdeb5a6d9c5b75dda622712b8..f57d8456e7b8d6b2dad0bb49899cfdc68007b794:/pod/perlunicode.pod?ds=sidebyside

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index bd70c25..9c13c35 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -36,8 +36,8 @@ Unicode support is an extensive requirement. While Perl does not
 implement the Unicode standard or the accompanying technical reports
 from cover to cover, Perl does support many Unicode features.
 
-Also, the use of Unicode may present security issues that aren't obvious.
-Read L.
+Also, the use of Unicode may present security issues that aren't
+obvious, see L.
 
 =over 4
 
@@ -1633,15 +1633,23 @@ Also, note the following:
 
 Malformed UTF-8
 
-Unfortunately, the original specification of UTF-8 leaves some room for
-interpretation of how many bytes of encoded output one should generate
-from one input Unicode character. Strictly speaking, the shortest
-possible sequence of UTF-8 bytes should be generated,
-because otherwise there is potential for an input buffer overflow at
-the receiving end of a UTF-8 connection. Perl always generates the
-shortest length UTF-8, and with warnings on, Perl will warn about
-non-shortest length UTF-8 along with other malformations, such as the
-surrogates, which are not Unicode code points valid for interchange.
+UTF-8 is very structured, so many combinations of bytes are invalid. In
+the past, Perl tried to soldier on and make some sense of invalid
+combinations, but this can lead to security holes, so now, if the Perl
+core needs to process an invalid combination, it will either raise a
+fatal error, or will replace those bytes by the sequence that forms the
+Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it.
+
+Every code point can be represented by more than one possible
+syntactically valid UTF-8 sequence. Early on, both Unicode and Perl
+considered any of these to be valid, but now, all sequences longer
+than the shortest possible one are considered to be malformed.
+
+Unicode considers many code points to be illegal, or to be avoided.
+Perl generally accepts them, once they have passed through any input
+filters that may try to exclude them. These have been discussed above
+(see "Surrogates" under UTF-16 in L,
+L, and L).
 
 =item *
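
The new "Malformed UTF-8" text names two outcomes for invalid input: a fatal error, or substitution of the REPLACEMENT CHARACTER (U+FFFD). The sketch below is not part of this commit. It illustrates the same two outcomes from user code through the Encode module's decode(), which is an assumption about where a reader would most easily observe them; the wording in the diff refers to the Perl core itself. The byte string "\xC0\x80" is a non-shortest (overlong) encoding of U+0000, chosen here only as a sample malformed sequence.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xC0\x80" is an overlong (non-shortest) two-byte encoding of U+0000,
# so strict UTF-8 decoding treats it as malformed.
my $overlong = "\xC0\x80";

# Default CHECK value: malformed bytes are replaced, not interpreted.
my $replaced        = decode('UTF-8', $overlong);
my $has_replacement = index($replaced, "\x{FFFD}") >= 0 ? 1 : 0;
my $has_nul         = index($replaced, "\x00")     >= 0 ? 1 : 0;

# FB_CROAK: malformed bytes raise a fatal error instead.
# decode() may modify its input when CHECK is set, so pass a copy.
my $copy = $overlong;
my $died = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 0 : 1;

printf "replacement=%d nul=%d died=%d\n",
    $has_replacement, $has_nul, $died;
```

The test is for the presence of U+FFFD rather than a specific count, because the number of replacement characters emitted for one malformed sequence can differ between Encode versions.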