	1	=head1 NAME
	2
	3	perlrecharclass - Perl Regular Expression Character Classes
	4
	5	=head1 DESCRIPTION
	6
	7	The top level documentation about Perl regular expressions
	8	is found in L<perlre>.
	9
	10	This manual page discusses the syntax and use of character
	11	classes in Perl Regular Expressions.
	12
	13	A character class is a way of denoting a set of characters,
	14	in such a way that one character of the set is matched.
	15	It's important to remember that matching a character class
	16	consumes exactly one character in the source string. (The source
	17	string is the string the regular expression is matched against.)
	18
	19	There are three types of character classes in Perl regular
	20	expressions: the dot, backslashed sequences, and the bracketed form.
	21
	22	=head2 The dot
	23
	24	The dot (or period), C<.>, is probably the most used, and certainly
	25	the best-known, character class. By default, a dot matches any
	26	character, except for the newline. The default can be changed to
	27	add matching the newline with the I<single line> modifier: either
	28	for the entire regular expression using the C</s> modifier, or
	29	locally using C<(?s)>.
	30
	31	Here are some examples:
	32
	33	"a" =~ /./ # Match
	34	"." =~ /./ # Match
	35	"" =~ /./ # No match (dot has to match a character)
	36	"\n" =~ /./ # No match (dot does not match a newline)
	37	"\n" =~ /./s # Match (global 'single line' modifier)
	38	"\n" =~ /(?s:.)/ # Match (local 'single line' modifier)
	39	"ab" =~ /^.$/ # No match (dot matches one character)
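
As a further illustration, the following sketch captures with C<.*> to show
where the dot stops under each modifier. The string and the variable names
are made up for this example.

```perl
use strict;
use warnings;

my $text = "one\ntwo";

# By default the dot stops at the newline.
my ($default) = $text =~ /^(.*)/;

# With /s the dot also matches the newline.
my ($single_line) = $text =~ /^(.*)/s;

# (?s:...) switches on single line mode for that group only.
my ($local) = $text =~ /^((?s:.*))/;

printf "%d %d %d\n", length($default), length($single_line), length($local);
```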
	40
	41
	42	=head2 Backslashed sequences
	43
	44	Perl regular expressions contain many backslashed sequences that
	45	constitute a character class. That is, they will match a single
	46	character, if that character belongs to a specific set of characters
	47	(defined by the sequence). A backslashed sequence is a sequence of
	48	characters starting with a backslash. Not all backslashed sequences
	49	are character classes; for a full list, see L<perlrebackslash>.
	50
	51	Here's a list of the backslashed sequences, which are discussed in
	52	more detail below.
	53
	54	\d Match a digit character.
	55	\D Match a non-digit character.
	56	\w Match a "word" character.
	57	\W Match a non-"word" character.
	58	\s Match a white space character.
	59	\S Match a non-white space character.
	60	\h Match a horizontal white space character.
	61	\H Match a character that isn't horizontal white space.
	62	\v Match a vertical white space character.
	63	\V Match a character that isn't vertical white space.
	64	\pP, \p{Prop} Match a character matching a Unicode property.
	65	\PP, \P{Prop} Match a character that doesn't match a Unicode property.
	66
	67	=head3 Digits
	68
	69	C<\d> matches a single character that is considered to be a I<digit>.
	70	What is considered a digit depends on the internal encoding of
	71	the source string. If the source string is in UTF-8 format, C<\d>
	72	not only matches the digits '0' - '9', but also Arabic, Devanagari and
	73	digits from other languages. Otherwise, if there is a locale in effect,
	74	it will match whatever characters the locale considers digits. Without
	75	a locale, C<\d> matches the digits '0' to '9'.
	76	See L</Locale, Unicode and UTF-8>.
	77
	78	Any character that isn't matched by C<\d> will be matched by C<\D>.
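
A small sketch of the UTF-8 case described above. It assumes a string that
holds a code point above 255, which is always stored in UTF-8 format; the
variable names are illustrative only.

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGIT THREE; a code point above 255 forces UTF-8 format.
my $arabic_three = "\x{0663}";

my $matches_d     = $arabic_three =~ /^\d$/    ? 1 : 0;  # a digit for \d
my $matches_ascii = $arabic_three =~ /^[0-9]$/ ? 1 : 0;  # not an ASCII digit
my $matches_D     = $arabic_three =~ /^\D$/    ? 1 : 0;  # \D is the complement

print "$matches_d $matches_ascii $matches_D\n";
```

If only the digits '0' to '9' are wanted, the bracketed class C<[0-9]> is
the unambiguous way to ask for them.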
	79
	80	=head3 Word characters
	81
	82	C<\w> matches a single I<word> character: an alphanumeric character
	83	(that is, an alphabetic character, or a digit), or the underscore (C<_>).
	84	What is considered a word character depends on the internal encoding
	85	of the string. If it's in UTF-8 format, C<\w> matches those characters
	86	that are considered word characters in the Unicode database. That is, it
	87	not only matches ASCII letters, but also Thai letters, Greek letters, etc.
	88	If the source string isn't in UTF-8 format, C<\w> matches those characters
	89	that are considered word characters by the current locale. Without
	90	a locale in effect, C<\w> matches the ASCII letters, digits and the
	91	underscore.
	92
	93	Any character that isn't matched by C<\w> will be matched by C<\W>.
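
The same point can be sketched for C<\w>. The sample string below is an
assumption chosen for the example: two Greek letters, an underscore and a
digit, held in UTF-8 format because of the Greek code points.

```perl
use strict;
use warnings;

# GREEK SMALL LETTER ALPHA, GREEK SMALL LETTER BETA, "_", "1"
my $greek = "\x{03B1}\x{03B2}_1";

my @word_chars = $greek =~ /\w/g;             # all four characters
my @ascii_only = $greek =~ /[A-Za-z0-9_]/g;   # only "_" and "1"
my @non_word   = "a-b"  =~ /\W/g;             # only the hyphen

printf "%d %d %d\n", scalar(@word_chars), scalar(@ascii_only), scalar(@non_word);
```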
	94
	95	=head3 White space
	96
	97	C<\s> matches any single character that is considered white space. In the
	98	ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line
	99	(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the
	100	space (the vertical tab, C<\cK>, is not matched by C<\s>). The exact set
	101	of characters matched by C<\s> depends on whether the source string is
	102	in UTF-8 format. If it is, C<\s> matches what is considered white space
	103	in the Unicode database. Otherwise, if there is a locale in effect, C<\s>
	104	matches whatever is considered white space by the current locale. Without
	105	a locale, C<\s> matches the five characters mentioned in the beginning
	106	of this paragraph. Perhaps the most notable difference is that C<\s>
	107	matches a non-breaking space only if the non-breaking space is in a
	108	UTF-8 encoded string.
	109
	110	Any character that isn't matched by C<\s> will be matched by C<\S>.
	111
	112	C<\h> will match any character that is considered horizontal white space;
	113	this includes the space and the tab characters. C<\H> will match any character
	114	that is not considered horizontal white space.
	115
	116	C<\v> will match any character that is considered vertical white space;
	117	this includes the carriage return and line feed characters (newline).
	118	C<\V> will match any character that is not considered vertical white space.
	119
	120	C<\R> matches anything that can be considered a newline under Unicode
	121	rules. It's not a character class, as it can match a multi-character
	122	sequence. Therefore, it cannot be used inside a bracketed character
	123	class. Details are discussed in L<perlrebackslash>.
	124
	125	C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.
	126
	127	Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
	128	the same characters, regardless of whether the source string is in UTF-8
	129	format or not. The set of characters they match is also not influenced
	130	by locale.
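
A minimal sketch of that independence: the same characters are tested once
in the native 8-bit format and once after C<utf8::upgrade>, which changes
only the internal representation. The variable names are illustrative.

```perl
use strict;
use warnings;

my $nbsp = "\xA0";   # NO-BREAK SPACE, native 8-bit format
my $nel  = "\x85";   # NEXT LINE, native 8-bit format

my $h_native = $nbsp =~ /^\h$/ ? 1 : 0;
my $v_native = $nel  =~ /^\v$/ ? 1 : 0;

# Switch both strings to UTF-8 format; the characters stay the same.
utf8::upgrade($nbsp);
utf8::upgrade($nel);

my $h_utf8 = $nbsp =~ /^\h$/ ? 1 : 0;
my $v_utf8 = $nel  =~ /^\v$/ ? 1 : 0;

# The vertical tab is vertical white space; the space is not.
my $vt_is_v    = "\x0b" =~ /^\v$/ ? 1 : 0;
my $space_is_v = " "    =~ /^\v$/ ? 1 : 0;

print "$h_native $v_native $h_utf8 $v_utf8 $vt_is_v $space_is_v\n";
```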
	131
	132	One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
	133	The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however,
	134	considered vertical white space. Furthermore, if the source string is
	135	not in UTF-8 format, the next line (C<"\x85">) and the no-break space
	136	(C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively.
	137	If the source string is in UTF-8 format, both the next line and the
	138	no-break space are matched by C<\s>.
	139
	140	The following table is a complete listing of characters matched by
	141	C<\s>, C<\h> and C<\v>.
	142
	143	The first column gives the code point of the character (in hex format),
	144	the second column gives the (Unicode) name. The third column indicates
	145	by which class(es) the character is matched.
	146
	147	0x00009 CHARACTER TABULATION h s
	148	0x0000a LINE FEED (LF) vs
	149	0x0000b LINE TABULATION v
	150	0x0000c FORM FEED (FF) vs
	151	0x0000d CARRIAGE RETURN (CR) vs
	152	0x00020 SPACE h s
	153	0x00085 NEXT LINE (NEL) vs [1]
	154	0x000a0 NO-BREAK SPACE h s [1]
	155	0x01680 OGHAM SPACE MARK h s
	156	0x0180e MONGOLIAN VOWEL SEPARATOR h s
	157	0x02000 EN QUAD h s
	158	0x02001 EM QUAD h s
	159	0x02002 EN SPACE h s
	160	0x02003 EM SPACE h s
	161	0x02004 THREE-PER-EM SPACE h s
	162	0x02005 FOUR-PER-EM SPACE h s
	163	0x02006 SIX-PER-EM SPACE h s
	164	0x02007 FIGURE SPACE h s
	165	0x02008 PUNCTUATION SPACE h s
	166	0x02009 THIN SPACE h s
	167	0x0200a HAIR SPACE h s
	168	0x02028 LINE SEPARATOR vs
	169	0x02029 PARAGRAPH SEPARATOR vs
	170	0x0202f NARROW NO-BREAK SPACE h s
	171	0x0205f MEDIUM MATHEMATICAL SPACE h s
	172	0x03000 IDEOGRAPHIC SPACE h s
	173
	174	=over 4
	175
	176	=item [1]
	177
	178	NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
	179	UTF-8 format.
	180
	181	=back
	182
	183	It is worth noting that C<\d>, C<\w>, etc., match single characters, not
	184	complete numbers or words. To match a number (that consists of digits),
	185	use C<\d+>; to match a word, use C<\w+>.
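
The difference between a class and a quantified class can be sketched as
follows; the sample sentence is invented for the example.

```perl
use strict;
use warnings;

my $line = "Order 66 shipped 3 crates";

my @digits  = $line =~ /\d/g;    # single digits: 6, 6, 3
my @numbers = $line =~ /\d+/g;   # whole numbers: 66, 3
my @words   = $line =~ /\w+/g;   # whole words, the numbers included

print "@digits | @numbers | @words\n";
```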
	186
	187
	188	=head3 Unicode Properties
	189
	190	C<\pP> and C<\p{Prop}> are character classes to match characters that
	191	fit given Unicode classes. One letter classes can be used in the C<\pP>
	192	form, with the class name following the C<\p>. Otherwise, the property
	193	name is enclosed in braces, and follows the C<\p>. For instance, a
	194	match for a number can be written as C</\pN/> or as C</\p{Number}/>.
	195	Lowercase letters are matched by the property I<LowercaseLetter>, whose
	196	short form is I<Ll>. They have to be written as C</\p{Ll}/> or
	197	C</\p{LowercaseLetter}/>. C</\pLl/> is valid, but means something different.
	198	It matches a two character string: a letter (Unicode property C<\pL>),
	199	followed by a lowercase C<l>.
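
The pitfall with the unbraced form can be sketched like this (the variable
names are only for illustration):

```perl
use strict;
use warnings;

# Braced form: a single lowercase letter.
my $braced = "a" =~ /^\p{Ll}$/ ? 1 : 0;

# Unbraced form: \pL (any letter) followed by a literal "l".
my $bare_one_char = "a"  =~ /^\pLl$/ ? 1 : 0;   # too short to match
my $bare_two_char = "Al" =~ /^\pLl$/ ? 1 : 0;   # letter "A", then "l"

# The one letter form and the long form of Number agree.
my $number = ("7" =~ /^\pN$/ && "7" =~ /^\p{Number}$/) ? 1 : 0;

print "$braced $bare_one_char $bare_two_char $number\n";
```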
	200
	201	For a list of possible properties, see
	202	L<perlunicode/Unicode Character Properties>. It is also possible to
	203	define your own properties. This is discussed in
	204	L<perlunicode/User-Defined Character Properties>.
	205
	206
	207	=head4 Examples
	208
	209	"a" =~ /\w/ # Match, "a" is a 'word' character.
	210	"7" =~ /\w/ # Match, "7" is a 'word' character as well.
	211	"a" =~ /\d/ # No match, "a" isn't a digit.
	212	"7" =~ /\d/ # Match, "7" is a digit.
	213	" " =~ /\s/ # Match, a space is white space.
	214	"a" =~ /\D/ # Match, "a" is a non-digit.
	215	"7" =~ /\D/ # No match, "7" is not a non-digit.
	216	" " =~ /\S/ # No match, a space is not non-white space.
	217
	218	" " =~ /\h/ # Match, space is horizontal white space.
	219	" " =~ /\v/ # No match, space is not vertical white space.
	220	"\r" =~ /\v/ # Match, a return is vertical white space.
	221
	222	"a" =~ /\pL/ # Match, "a" is a letter.
	223	"a" =~ /\p{Lu}/ # No match, /\p{Lu}/ matches upper case letters.
	224
	225	"\x{0e0b}" =~ /\p{Thai}/ # Match, \x{0e0b} is the character
	226	# 'THAI CHARACTER SO SO', and that's in
	227	# the Thai Unicode class.
	228	"a" =~ /\P{Lao}/ # Match, as "a" is not a Lao character.
	229
	230
	231	=head2 Bracketed Character Classes
	232
	233	The third form of character class you can use in Perl regular expressions
	234	is the bracketed form. In its simplest form, it lists the characters
	235	that may be matched inside square brackets, like this: C<[aeiou]>.
	236	This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as with the other
	237	character classes, exactly one character will be matched. To match
	238	a longer string consisting of characters mentioned in the character
	239	class, follow the character class with a quantifier. For instance,
	240	C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.
	241
	242	Repeating a character in a character class has no
	243	effect; it's considered to be in the set only once.
	244
	245	Examples:
	246
	247	"e" =~ /[aeiou]/ # Match, as "e" is listed in the class.
	248	"p" =~ /[aeiou]/ # No match, "p" is not listed in the class.
	249	"ae" =~ /^[aeiou]$/ # No match, a character class only matches
	250	# a single character.
	251	"ae" =~ /^[aeiou]+$/ # Match, due to the quantifier.
	252
	253	=head3 Special Characters Inside a Bracketed Character Class
	254
	255	Most characters that are meta characters in regular expressions (that
	256	is, characters that carry a special meaning like C<*> or C<(>) lose
	257	their special meaning and can be used inside a character class without
	258	the need to escape them. For instance, C<[()]> matches either an opening
	259	parenthesis, or a closing parenthesis, and the parens inside the character
	260	class don't group or capture.
	261
	262	Characters that may carry a special meaning inside a character class are:
	263	C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
	264	escaped with a backslash, although this is sometimes not needed, in which
	265	case the backslash may be omitted.
	266
	267	The sequence C<\b> is special inside a bracketed character class. While
	268	outside the character class C<\b> is an assertion indicating a point
	269	that does not have either two word characters or two non-word characters
	270	on either side, inside a bracketed character class, C<\b> matches a
	271	backspace character.
	272
	273	A C<[> is not special inside a character class, unless it's the start
	274	of a POSIX character class (see below). It normally does not need escaping.
	275
	276	A C<]> is either the end of a POSIX character class (see below), or it
	277	signals the end of the bracketed character class. Normally it needs
	278	escaping if you want to include a C<]> in the set of characters.
	279	However, if the C<]> is the I<first> (or the second if the first
	280	character is a caret) character of a bracketed character class, it
	281	does not denote the end of the class (as you cannot have an empty class)
	282	and is considered part of the set of characters that can be matched without
	283	escaping.
	284
	285	Examples:
	286
	287	"+" =~ /[+?*]/ # Match, "+" in a character class is not special.
	288	"\cH" =~ /[\b]/ # Match, \b inside a character class
	289	# is equivalent to a backspace.
	290	"]" =~ /[][]/ # Match, as the character class contains
	291	# both [ and ].
	292	"[]" =~ /[[]]/ # Match, the pattern contains a character class
	293	# containing just [, and the character class is
	294	# followed by a ].
	295
	296	=head3 Character Ranges
	297
	298	It is not uncommon to want to match a range of characters. Luckily, instead
	299	of listing all the characters in the range, one may use the hyphen (C<->).
	300	If inside a bracketed character class you have two characters separated
	301	by a hyphen, it's treated as if all the characters between the two are in
	302	the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
	303	matches any lowercase letter from the first half of the ASCII alphabet.
	304
	305	Note that the two characters on either side of the hyphen need not
	306	both be letters or both be digits. Any character is possible,
	307	although not advisable. C<['-?]> contains a range of characters, but
	308	most people will not know which characters that will be. Furthermore,
	309	such ranges may lead to portability problems if the code has to run on
	310	a platform that uses a different character set, such as EBCDIC.
	311
	312	If a hyphen in a character class cannot be part of a range, for instance
	313	because it is the first or the last character of the character class,
	314	or if it immediately follows a range, the hyphen isn't special, and will be
	315	considered a character that may be matched. You have to escape the hyphen
	316	with a backslash if you want to have a hyphen in your set of characters to
	317	be matched, and its position in the class is such that it can be considered
	318	part of a range.
	319
	320	Examples:
	321
	322	[a-z] # Matches a character that is a lower case ASCII letter.
	323	[a-fz] # Matches any letter between 'a' and 'f' (inclusive) or the
	324	# letter 'z'.
	325	[-z] # Matches either a hyphen ('-') or the letter 'z'.
	326	[a-f-m] # Matches any letter between 'a' and 'f' (inclusive), the
	327	# hyphen ('-'), or the letter 'm'.
	328	['-?] # Matches any of the characters '()*+,-./0123456789:;<=>?
	329	# (But not on an EBCDIC platform).
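
To see the hyphen rules at work against actual strings, here is a small
sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;

# A hyphen right after a range is literal: [a-f-m] is a to f, "-", and "m".
my $hyphen_after_range = "-" =~ /^[a-f-m]$/ ? 1 : 0;
my $m_listed           = "m" =~ /^[a-f-m]$/ ? 1 : 0;
my $g_not_listed       = "g" =~ /^[a-f-m]$/ ? 1 : 0;

# A leading hyphen is literal as well.
my $leading_hyphen = "-" =~ /^[-z]$/ ? 1 : 0;

# An escaped hyphen never forms a range: [a\-z] is just "a", "-" and "z".
my $escaped_hyphen = "-" =~ /^[a\-z]$/ ? 1 : 0;
my $m_in_escaped   = "m" =~ /^[a\-z]$/ ? 1 : 0;

print "$hyphen_after_range $m_listed $g_not_listed ",
      "$leading_hyphen $escaped_hyphen $m_in_escaped\n";
```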
	330
	331
	332	=head3 Negation
	333
	334	It is also possible to instead list the characters you do not want to
	335	match. You can do so by using a caret (C<^>) as the first character in the
	336	character class. For instance, C<[^a-z]> matches a character that is not a
	337	lowercase ASCII letter.
	338
	339	This syntax makes the caret a special character inside a bracketed character
	340	class, but only if it is the first character of the class. So if you want
	341	to have the caret as one of the characters you want to match, you either
	342	have to escape the caret, or not list it first.
	343
	344	Examples:
	345
	346	"e" =~ /[^aeiou]/ # No match, the 'e' is listed.
	347	"x" =~ /[^aeiou]/ # Match, as 'x' isn't a lowercase vowel.
	348	"^" =~ /[^^]/ # No match, matches anything that isn't a caret.
	349	"^" =~ /[x^]/ # Match, caret is not special here.
	350
	351	=head3 Backslash Sequences
	352
	353	You can put a backslash sequence character class inside a bracketed character
	354	class, and it will act just as if you put all the characters matched by
	355	the backslash sequence inside the character class. For instance,
	356	C<[a-f\d]> will match any digit, or any of the lowercase letters between
	357	'a' and 'f' inclusive.
	358
	359	Examples:
	360
	361	/[\p{Thai}\d]/ # Matches a character that is either a Thai
	362	# character, or a digit.
	363	/[^\p{Arabic}()]/ # Matches a character that is neither an Arabic
	364	# character, nor a parenthesis.
	365
	366	Backslash sequence character classes cannot form one of the endpoints
	367	of a range.
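
As a sketch of the C<[a-f\d]> class mentioned above, the helper below (a
hypothetical name, not part of Perl) accepts only strings made of digits
and the lowercase letters 'a' to 'f'.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string consists solely of characters
# from [a-f\d], that is, digits plus the lowercase letters a to f.
sub is_lower_hex {
    my ($string) = @_;
    return $string =~ /\A[a-f\d]+\z/ ? 1 : 0;
}

printf "%d %d %d\n",
    is_lower_hex("c0ffee"),   # digits and a to f only
    is_lower_hex("coffee"),   # "o" is outside the class
    is_lower_hex("C0FFEE");   # uppercase letters are outside the class
```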
	368
	369	=head3 Posix Character Classes
	370
	371	POSIX character classes have the form C<[:class:]>, where I<class> is the
	372	name of the class, and C<[:> and C<:]> are the delimiters. POSIX character
	373	classes appear I<inside> bracketed character classes, and are a convenient and
	374	descriptive way of listing a group of characters. Be careful about the syntax:
	375
	376	# Correct:
	377	$string =~ /[[:alpha:]]/
	378
	379	# Incorrect (will warn):
	380	$string =~ /[:alpha:]/
	381
	382	The latter pattern would be a character class consisting of a colon,
	383	and the letters C<a>, C<l>, C<p> and C<h>.
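
The effect of the missing outer brackets can be sketched as follows. The
warning is silenced on purpose so that the behaviour of the mistaken
pattern can be inspected; the variable names are illustrative.

```perl
use strict;
use warnings;

my $correct = qr/^[[:alpha:]]$/;

# The mistaken form still compiles; it is an ordinary bracketed class.
my $mistaken = do { no warnings 'regexp'; qr/^[:alpha:]$/ };

my $z_correct      = "z" =~ $correct  ? 1 : 0;   # "z" is alphabetic
my $colon_correct  = ":" =~ $correct  ? 1 : 0;   # a colon is not
my $z_mistaken     = "z" =~ $mistaken ? 1 : 0;   # "z" is not one of : a l p h
my $colon_mistaken = ":" =~ $mistaken ? 1 : 0;   # the colon is listed

print "$z_correct $colon_correct $z_mistaken $colon_mistaken\n";
```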
	384
	385	Perl recognizes the following POSIX character classes:
	386
	387	alpha Any alphabetical character.
	388	alnum Any alphanumerical character.
	389	ascii Any ASCII character.
	390	blank A GNU extension, equal to a space or a horizontal tab (C<\t>).
	391	cntrl Any control character.
	392	digit Any digit, equivalent to C<\d>.
	393	graph Any printable character, excluding a space.
	394	lower Any lowercase character.
	395	print Any printable character, including a space.
	396	punct Any punctuation character.
	397	space Any white space character. C<\s> plus the vertical tab (C<\cK>).
	398	upper Any uppercase character.
	399	word Any "word" character, equivalent to C<\w>.
	400	xdigit Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'.
	401
	402	The exact set of characters matched depends on whether the source string
	403	is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>.
	404
	405	Most POSIX character classes have C<\p> counterparts. The difference
	406	is that the C<\p> classes will always match according to the Unicode
	407	properties, regardless of whether the string is in UTF-8 format or not.
	408
	409	The following table shows the relation between POSIX character classes
	410	and the Unicode properties:
	411
	412	[[:...:]]   \p{...}      backslash
	413
	414	alpha       IsAlpha
	415	alnum       IsAlnum
	416	ascii       IsASCII
	417	blank
	418	cntrl       IsCntrl
	419	digit       IsDigit      \d
	420	graph       IsGraph
	421	lower       IsLower
	422	print       IsPrint
	423	punct       IsPunct
	424	space       IsSpace
	425	            IsSpacePerl  \s
	426	upper       IsUpper
	427	word        IsWord
	428	xdigit      IsXDigit
	429
	430	Some character classes may have a non-obvious name:
	431
	432	=over 4
	433
	434	=item cntrl
	435
	436	Any control character. Usually, control characters don't produce output
	437	as such, but instead control the terminal somehow: for example newline
	438	and backspace are control characters. All characters with C<ord()> less
	439	than 32 are usually classified as control characters (in ASCII, the ISO
	440	Latin character sets, and Unicode), as is the character with an C<ord()>
	441	value of 127 (C<DEL>).
	442
	443	=item graph
	444
	445	Any character that is I<graphical>, that is, visible. This class consists
	446	of all the alphanumerical characters and all punctuation characters.
	447
	448	=item print
	449
	450	All printable characters, which is the set of all the graphical characters
	451	plus the space.
	452
	453	=item punct
	454
	455	Any punctuation (special) character.
	456
	457	=back
	458
	459	=head4 Negation
	460
	461	A Perl extension to the POSIX character class is the ability to
	462	negate it. This is done by prefixing the class name with a caret (C<^>).
	463	Some examples:
	464
	465	POSIX Unicode Backslash
	466	[[:^digit:]] \P{IsDigit} \D
	467	[[:^space:]] \P{IsSpace} \S
	468	[[:^word:]] \P{IsWord} \W
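
A short sketch of the negated form in use: stripping everything that is
not a digit, once with the POSIX class and once with the backslash
sequence. The sample string is invented for the example.

```perl
use strict;
use warnings;

my $phone = "+1 (555) 010-9999";

# Remove every run of non-digits, first with the negated POSIX class ...
(my $digits_posix = $phone) =~ s/[[:^digit:]]+//g;

# ... then with the equivalent backslash sequence.
(my $digits_backslash = $phone) =~ s/\D+//g;

print "$digits_posix $digits_backslash\n";
```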
	469
	470	=head4 [= =] and [. .]
	471
	472	Perl will recognize the POSIX character classes C<[=class=]> and
	473	C<[.class.]>, but does not (yet?) support these constructs. Use of
	474	such a construct will lead to an error.
	475
	476
	477	=head4 Examples
	478
	479	/[[:digit:]]/ # Matches a character that is a digit.
	480	/[01[:lower:]]/ # Matches a character that is either a
	481	# lowercase letter, or '0' or '1'.
	482	/[[:digit:][:^xdigit:]]/ # Matches a character that can be anything,
	483	# but the letters 'a' to 'f' in either case.
	484	# This is because the character class contains
	485	# all digits, and anything that isn't a
	486	# hex digit, resulting in a class containing
	487	# all characters, but the letters 'a' to 'f'
	488	# and 'A' to 'F'.
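
The last example can be checked against a few sample characters. This is
only a sketch; the hash and the sample characters are chosen for
illustration.

```perl
use strict;
use warnings;

my $class = qr/^[[:digit:][:^xdigit:]]$/;

# Record for each sample character whether the class matches it.
my %matched;
for my $char (qw(7 a F g Z)) {
    $matched{$char} = $char =~ $class ? 1 : 0;
}

print join(" ", map { "$_=$matched{$_}" } sort keys %matched), "\n";
```

Only the hex letters are rejected: "7" is a digit, "g" and "Z" are not hex
digits, while "a" and "F" are hex digits without being digits.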
	489
	490
	491	=head2 Locale, Unicode and UTF-8
	492
	493	Some of the character classes have a somewhat different behaviour depending
	494	on the internal encoding of the source string, and the locale that is
	495	in effect.
	496
	497	C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
	498	including C<\W>, C<\D>, C<\S>) suffer from this behaviour.
	499
	500	The rule is that if the source string is in UTF-8 format, the character
	501	classes match according to the Unicode properties. If the source string
	502	isn't, then the character classes match according to whatever locale is
	503	in effect. If there is no locale, they match the ASCII defaults
	504	(52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc.).
	505
	506	This usually means that if you are matching against characters whose C<ord()>
	507	values are between 128 and 255 inclusive, your character class may match
	508	or not depending on the current locale, and whether the source string is
	509	in UTF-8 format. The string will be in UTF-8 format if it contains
	510	characters whose C<ord()> value exceeds 255. But a string may be in UTF-8
	511	format without it having such characters.
	512
	513	For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
	514	or the POSIX character classes, and use the Unicode properties instead.
	515
	516	=head4 Examples
	517
	518	$str = "\xDF"; # $str is not in UTF-8 format.
	519	$str =~ /^\w/; # No match, as $str isn't in UTF-8 format.
	520	$str .= "\x{0e0b}"; # Now $str is in UTF-8 format.
	521	$str =~ /^\w/; # Match! $str is now in UTF-8 format.
	522	chop $str;
	523	$str =~ /^\w/; # Still a match! $str remains in UTF-8 format.
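
The portability advice above can be sketched by comparing C<\w> with the
Unicode property C<\pL> on the same character in both internal formats.
This assumes no locale is in effect and that nothing else in the program
switches the pattern to Unicode rules; C<utf8::upgrade> and
C<utf8::is_utf8> are used here only to change and inspect the format.

```perl
use strict;
use warnings;

my $sharp_s = "\xDF";   # LATIN SMALL LETTER SHARP S, native 8-bit format

# Under the rules described above, \w does not match here.
my $w_native = $sharp_s =~ /^\w/  ? 1 : 0;
# \pL matches according to the Unicode properties in either format.
my $p_native = $sharp_s =~ /^\pL/ ? 1 : 0;

# Same character, now stored in UTF-8 format.
utf8::upgrade($sharp_s);
my $is_utf8 = utf8::is_utf8($sharp_s) ? 1 : 0;
my $w_utf8  = $sharp_s =~ /^\w/  ? 1 : 0;
my $p_utf8  = $sharp_s =~ /^\pL/ ? 1 : 0;

print "native: \\w=$w_native \\pL=$p_native\n";
print "utf8:   \\w=$w_utf8 \\pL=$p_utf8\n";
```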
	524
	525	=cut