X-Git-Url: https://perl5.git.perl.org/perl5.git/blobdiff_plain/268e690504dd7c01f8946eee54865856e54cb871..6b659339f976d014a1a53731d86cedd01f5921ec:/pod/perlrebackslash.pod

diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index c216f25..f27da1f 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -69,8 +69,6 @@ as C<Not in [].>
  \b{}, \b          Boundary. (\b is a backspace in []).
  \B{}, \B          Not a boundary.  Not in [].
  \cX               Control-X.
- \C                Single octet, even under UTF-8.  Not in [].
-                   (Deprecated)
  \d                Character class for digits.
  \D                Character class for non-digits.
  \e                Escape character.
@@ -531,7 +529,7 @@ Mnemonic: I<G>lobal.
 C<\b{...}>, available starting in v5.22, matches a boundary (between two
 characters, or before the first character of the string, or after the
 final character of the string) based on the Unicode rules for the
-boundary type specified inside the braces.  The currently known boundary
+boundary type specified inside the braces.  The boundary
 types are given a few paragraphs below.  C<\B{...}> matches at any place
 between characters where C<\b{...}> of the same type doesn't match.
 
@@ -553,7 +551,7 @@ the non-word "=", there must be a word character immediately previous.
 All plain C<\b> and C<\B> boundary determinations look for word
 characters alone, not for
 non-word characters nor for string ends.  It may help to understand how
-<\b> and <\B> work by equating them as follows:
+C<\b> and C<\B> work by equating them as follows:
 
     \b	really means	(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
     \B	really means	(?:(?<=\w)(?=\w)|(?<!\w)(?!\w))
@@ -561,8 +559,9 @@ non-word characters nor for string ends.  It may help to understand how
 In contrast, C<\b{...}> and C<\B{...}> may or may not match at the
 beginning and end of the line, depending on the boundary type.  These
 implement the Unicode default boundaries, specified in
+L<http://www.unicode.org/reports/tr14/> and
 L<http://www.unicode.org/reports/tr29/>.
-The boundary types currently available are:
+The boundary types are:
 
 =over
 
@@ -574,6 +573,18 @@ explained below under L</C<\X>>.  In fact, C<\X> is another way to get
 the same functionality.  It is equivalent to C</.+?\b{gcb}/>.  Use
 whichever is most convenient for your situation.
 
+=item C<\b{lb}>
+
+This matches according to the default Unicode Line Breaking Algorithm
+(L<http://www.unicode.org/reports/tr14/>), as customized in that
+document
+(L<Example 7 of revision 35|http://www.unicode.org/reports/tr14/tr14-35.html#Example7>)
+for better handling of numeric expressions.
+
+This is suitable for many purposes, but the L<Unicode::LineBreak> module,
+available on CPAN, provides many more features, including
+customization.
+
 =item C<\b{sb}>
 
 This matches a Unicode "Sentence Boundary".  This is an aid to parsing
@@ -594,11 +605,28 @@ future Perl versions.
 
 =item C<\b{wb}>
 
-This matches a Unicode "Word Boundary".  This gives better (though not
+This matches a Unicode "Word Boundary", but tailored to Perl
+expectations.  This gives better (though not
 perfect) results for natural language processing than plain C<\b>
 (without braces) does.  For example, it understands that apostrophes can
 be in the middle of words and that parentheses aren't (see the examples
-below).   More details are at L<http://www.unicode.org/reports/tr29/>.
+below).  More details are at L<http://www.unicode.org/reports/tr29/>.
+
+The current Unicode definition of a Word Boundary matches between every
+white space character.  Perl tailors this, starting in version 5.24, to
+generally not break up spans of white space, just as plain C<\b> has
+always functioned.  This allows C<\b{wb}> to be a drop-in replacement for
+C<\b>, but with generally better results for natural language
+processing.  (The exception to this tailoring is when a span of white
+space is immediately followed by something like U+0303, COMBINING TILDE.
+If the final space character in the span is a horizontal white space, it
+is broken out so that it attaches instead to the combining character.
+To be precise, if a span of white space ends in a horizontal space and
+the character immediately following it has either of the Word
+Boundary property values "Extend" or "Format", the boundary between the
+final horizontal space character and the rest of the span matches
+C<\b{wb}>.  In all other cases the boundary between two white space
+characters matches C<\B{wb}>.)
 
 =back
 
@@ -621,10 +649,9 @@ rule.
 
 It is also important to realize that these are default boundary
 definitions, and that implementations may wish to tailor the results for
-particular purposes and locales.
-
-Unicode defines a fourth boundary type, accessible through the
-L<Unicode::LineBreak> module.
+particular purposes and locales.  For example, some languages, such as
+Japanese and Thai, require dictionary lookup to determine word
+boundaries.
 
 Mnemonic: I<b>oundary.
 
@@ -663,18 +690,6 @@ categories above. These are:
 
 =over 4
 
-=item \C
-
-(Deprecated.) C<\C> always matches a single octet, even if the source
-string is encoded
-in UTF-8 format, and the character to be matched is a multi-octet character.
-This is very dangerous, because it violates
-the logical character abstraction and can cause UTF-8 sequences to become malformed.
-
-Use C<utf8::encode()> instead.
-
-Mnemonic: oI<C>tet.
-
 =item \K
 
 This appeared in perl 5.10.0. Anything matched left of C<\K> is