X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/268e690504dd7c01f8946eee54865856e54cb871..6b659339f976d014a1a53731d86cedd01f5921ec:/pod/perlrebackslash.pod

diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index c216f25..f27da1f 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -69,8 +69,6 @@ as C
  \b{}, \b  Boundary. (\b is a backspace in []).
  \B{}, \B  Not a boundary. Not in [].
  \cX       Control-X.
- \C        Single octet, even under UTF-8. Not in [].
-           (Deprecated)
  \d        Character class for digits.
  \D        Character class for non-digits.
  \e        Escape character.
@@ -531,7 +529,7 @@ Mnemonic: I<G>lobal.
 C<\b{...}>, available starting in v5.22, matches a boundary (between two
 characters, or before the first character of the string, or after the
 final character of the string) based on the Unicode rules for the
-boundary type specified inside the braces. The currently known boundary
+boundary type specified inside the braces. The boundary
 types are given a few paragraphs below. C<\B{...}> matches at any place
 between characters where C<\b{...}> of the same type doesn't match.
 
@@ -553,7 +551,7 @@ the non-word "=", there must be a word character immediately previous.
 All plain C<\b> and C<\B> boundary determinations look for word
 characters alone, not for non-word characters nor for string ends. It may
 help to understand how
-<\b> and <\B> work by equating them as follows:
+C<\b> and C<\B> work by equating them as follows:
 
     \b really means (?:(?<=\w)(?!\w)|(? and C<\B{...}> may or may not match at the beginning
 and end of the line, depending on the boundary type. These implement
 the Unicode default boundaries, specified in
+L and L.
 
-The boundary types currently available are:
+The boundary types are:
 
 =over
 
@@ -574,6 +573,18 @@ explained below under L>. In fact, C<\X> is another way to get
 the same functionality. It is equivalent to C. Use
 whichever is most convenient for your situation.
 
+=item C<\b{lb}>
+
+This matches according to the default Unicode Line Breaking Algorithm
+(L), as customized in that
+document
+(L)
+for better handling of numeric expressions.
+
+This is suitable for many purposes, but the L module
+is available on CPAN that provides many more features, including
+customization.
+
 =item C<\b{sb}>
 
 This matches a Unicode "Sentence Boundary". This is an aid to parsing
@@ -594,11 +605,28 @@ future Perl versions.
 
 =item C<\b{wb}>
 
-This matches a Unicode "Word Boundary". This gives better (though not
+This matches a Unicode "Word Boundary", but tailored to Perl
+expectations. This gives better (though not
 perfect) results for natural language processing than plain C<\b>
 (without braces) does. For example, it understands that apostrophes can
 be in the middle of words and that parentheses aren't (see the examples
-below). More details are at L.
+below). More details are at L.
+
+The current Unicode definition of a Word Boundary matches between every
+white space character. Perl tailors this, starting in version 5.24, to
+generally not break up spans of white space, just as plain C<\b> has
+always functioned. This allows C<\b{wb}> to be a drop-in replacement for
+C<\b>, but with generally better results for natural language
+processing. (The exception to this tailoring is when a span of white
+space is immediately followed by something like U+0303, COMBINING TILDE.
+If the final space character in the span is a horizontal white space, it
+is broken out so that it attaches instead to the combining character.
+To be precise, if a span of white space that ends in a horizontal space
+has the character immediately following it have either of the Word
+Boundary property values "Extend" or "Format", the boundary between the
+final horizontal space character and the rest of the span matches
+C<\b{wb}>. In all other cases the boundary between two white space
+characters matches C<\B{wb}>.)
 
 =back
 
@@ -621,10 +649,9 @@ rule.
 
 It is also important to realize that these are default boundary
 definitions, and that implementations may wish to tailor the results for
-particular purposes and locales.
-
-Unicode defines a fourth boundary type, accessible through the
-L module.
+particular purposes and locales. For example, some languages, such as
+Japanese and Thai, require dictionary lookup to determine word
+boundaries.
 
 Mnemonic: I<b>oundary.
 
@@ -663,18 +690,6 @@ categories above. These are:
 
 =over 4
 
-=item \C
-
-(Deprecated.) C<\C> always matches a single octet, even if the source
-string is encoded
-in UTF-8 format, and the character to be matched is a multi-octet character.
-This is very dangerous, because it violates
-the logical character abstraction and can cause UTF-8 sequences to become malformed.
-
-Use C instead.
-
-Mnemonic: oI<C>tet.
-
 =item \K
 
 This appeared in perl 5.10.0. Anything matched left of C<\K> is