Commit | Line | Data |
---|---|---|
47f9c88b GS |
1 | =head1 NAME |
2 | ||
3 | perlrequick - Perl regular expressions quick start | |
4 | ||
5 | =head1 DESCRIPTION | |
6 | ||
7 | This page covers the very basics of understanding, creating and | |
6425a278 GS |
8 | using regular expressions ('regexes') in Perl. |
9 | ||
47f9c88b GS |
10 | |
11 | =head1 The Guide | |
12 | ||
c6ae04d3 KW |
13 | This page assumes you already know things, like what a "pattern" is, and |
14 | the basic syntax of using them. If you don't, see L<perlretut>. | |
15 | ||
47f9c88b GS |
16 | =head2 Simple word matching |
17 | ||
6425a278 GS |
18 | The simplest regex is simply a word, or more generally, a string of |
19 | characters. A regex consisting of a word matches any string that | |
47f9c88b GS |
20 | contains that word: |
21 | ||
22 | "Hello World" =~ /World/; # matches | |
23 | ||
6425a278 | 24 | In this statement, C<World> is a regex and the C<//> enclosing |
1e2a213d | 25 | C</World/> tells Perl to search a string for a match. The operator |
6425a278 GS |
26 | C<=~> associates the string with the regex match and produces a true |
27 | value if the regex matched, or false if the regex did not match. In | |
47f9c88b GS |
28 | our case, C<World> matches the second word in C<"Hello World">, so the |
29 | expression is true. This idea has several variations. | |
30 | ||
31 | Expressions like this are useful in conditionals: | |
32 | ||
33 | print "It matches\n" if "Hello World" =~ /World/; | |
34 | ||
35 | The sense of the match can be reversed by using the C<!~> operator: | |
36 | ||
37 | print "It doesn't match\n" if "Hello World" !~ /World/; | |
38 | ||
6425a278 | 39 | The literal string in the regex can be replaced by a variable: |
47f9c88b GS |
40 | |
41 | $greeting = "World"; | |
42 | print "It matches\n" if "Hello World" =~ /$greeting/; | |
43 | ||
44 | If you're matching against C<$_>, the C<$_ =~> part can be omitted: | |
45 | ||
46 | $_ = "Hello World"; | |
47 | print "It matches\n" if /World/; | |
48 | ||
49 | Finally, the C<//> default delimiters for a match can be changed to | |
50 | arbitrary delimiters by putting an C<'m'> out front: | |
51 | ||
52 | "Hello World" =~ m!World!; # matches, delimited by '!' | |
53 | "Hello World" =~ m{World}; # matches, note the matching '{}' | |
54 | "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin', | |
55 | # '/' becomes an ordinary char | |
56 | ||
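The variations above can be collected into one small script. This is only a sketch: the C<@report> array and its messages are made up for illustration, and each line is recorded only when its match behaves as described above.

```perl
use strict;
use warnings;

my $text = "Hello World";
my @report;    # hypothetical collector for the checks that succeed

# =~ yields true when the regex matches somewhere in the string
push @report, "matches /World/" if $text =~ /World/;

# !~ reverses the sense of the match (the match is case sensitive)
push @report, "does not match /world/" if $text !~ /world/;

# a variable can supply the pattern
my $greeting = "World";
push @report, "matches via variable" if $text =~ /$greeting/;

# when the target is $_, the '$_ =~' part can be left out
for ("Hello World") {
    push @report, "matches against \$_" if /World/;
}

# with m, other delimiters are allowed, so '/' needs no escaping
push @report, "matches path" if "/usr/bin/perl" =~ m{/bin/perl};

print "$_\n" for @report;    # prints all five messages
```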
6425a278 | 57 | Regexes must match a part of the string I<exactly> in order for the |
47f9c88b GS |
58 | statement to be true: |
59 | ||
60 | "Hello World" =~ /world/; # doesn't match, case sensitive | |
61 | "Hello World" =~ /o W/; # matches, ' ' is an ordinary char | |
62 | "Hello World" =~ /World /; # doesn't match, no ' ' at end | |
63 | ||
1e2a213d | 64 | Perl will always match at the earliest possible point in the string: |
47f9c88b GS |
65 | |
66 | "Hello World" =~ /o/; # matches 'o' in 'Hello' | |
67 | "That hat is red" =~ /hat/; # matches 'hat' in 'That' | |
68 | ||
69 | Not all characters can be used 'as is' in a match. Some characters, | |
a89a8c8d KW |
70 | called B<metacharacters>, are considered special, and reserved for use |
71 | in regex notation. The metacharacters are | |
47f9c88b GS |
72 | |
73 | {}[]()^$.|*+?\ | |
74 | ||
a89a8c8d KW |
75 | A metacharacter can be matched literally by putting a backslash before |
76 | it: | |
47f9c88b GS |
77 | |
78 | "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter | |
79 | "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary + | |
80 | 'C:\WIN32' =~ /C:\\WIN/; # matches | |
5d525260 | 81 | "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches |
47f9c88b | 82 | |
6425a278 GS |
83 | In the last regex, the forward slash C<'/'> is also backslashed, |
84 | because it is used to delimit the regex. | |
47f9c88b | 85 | |
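As a rough illustration of the escaping rules above, the following sketch records whether each pattern matches. The variable names are invented for this example.

```perl
use strict;
use warnings;

my $sum = "2+2=4";

# Unescaped, '+' is a quantifier: /2+2/ wants one or more '2' followed by '2'
my $unescaped = $sum =~ /2+2/ ? 1 : 0;     # 0, the string has no "22"

# Escaped, '\+' matches a literal plus sign
my $escaped = $sum =~ /2\+2/ ? 1 : 0;      # 1

# A backslash in the string needs a doubled backslash in the regex
my $path_ok = 'C:\WIN32' =~ /C:\\WIN/ ? 1 : 0;    # 1

# Choosing another delimiter avoids escaping every '/'
my $slashes_ok = "/usr/bin/perl" =~ m{/usr/bin/perl} ? 1 : 0;    # 1

print "unescaped=$unescaped escaped=$escaped path=$path_ok slashes=$slashes_ok\n";
# prints "unescaped=0 escaped=1 path=1 slashes=1"
```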
a89a8c8d KW |
86 | Most of the metacharacters aren't always special, and other characters |
87 | (such as the ones delimiting the pattern) become special under various | |
88 | circumstances. This can be confusing and lead to unexpected results. | |
89 | L<S<C<use re 'strict'>>|re/'strict' mode> can notify you of potential | |
90 | pitfalls. | |
91 | ||
47f9c88b GS |
92 | Non-printable ASCII characters are represented by B<escape sequences>. |
93 | Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r> | |
94 | for a carriage return. Arbitrary bytes are represented by octal | |
95 | escape sequences, e.g., C<\033>, or hexadecimal escape sequences, | |
96 | e.g., C<\x1B>: | |
97 | ||
555bd962 | 98 | "1000\t2000" =~ m(0\t2); # matches |
a89a8c8d | 99 | "cat" =~ /\143\x61\x74/; # matches in ASCII, but |
555bd962 | 100 | # a weird way to spell cat |
47f9c88b | 101 | |
caedc70b | 102 | Regexes are treated mostly as double-quoted strings, so variable |
47f9c88b GS |
103 | substitution works: |
104 | ||
105 | $foo = 'house'; | |
106 | 'cathouse' =~ /cat$foo/; # matches | |
107 | 'housecat' =~ /${foo}cat/; # matches | |
108 | ||
6425a278 | 109 | With all of the regexes above, if the regex matched anywhere in the |
47f9c88b GS |
110 | string, it was considered a match. To specify I<where> it should |
111 | match, we would use the B<anchor> metacharacters C<^> and C<$>. The | |
112 | anchor C<^> means match at the beginning of the string and the anchor | |
113 | C<$> means match at the end of the string, or before a newline at the | |
114 | end of the string. Some examples: | |
115 | ||
6425a278 GS |
116 | "housekeeper" =~ /keeper/; # matches |
117 | "housekeeper" =~ /^keeper/; # doesn't match | |
118 | "housekeeper" =~ /keeper$/; # matches | |
119 | "housekeeper\n" =~ /keeper$/; # matches | |
120 | "housekeeper" =~ /^housekeeper$/; # matches | |
47f9c88b GS |
121 | |
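Using both anchors together is a common way to test a whole string rather than a part of it. A minimal sketch, assuming a hypothetical helper named C<is_exactly_keeper>:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line consists of exactly the word 'keeper'
# (a single trailing newline is tolerated, because of how '$' works)
sub is_exactly_keeper {
    my ($line) = @_;
    return $line =~ /^keeper$/ ? 1 : 0;
}

my @results = map { is_exactly_keeper($_) }
    ("keeper", "keeper\n", "housekeeper", "keepers");

print "@results\n";    # prints "1 1 0 0"
```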
122 | =head2 Using character classes | |
123 | ||
124 | A B<character class> allows a set of possible characters, rather than | |
6425a278 | 125 | just a single character, to match at a particular point in a regex. |
a89a8c8d KW |
126 | There are a number of different types of character classes, but usually |
127 | when people use this term, they are referring to the type described in | |
128 | this section, which are technically called "Bracketed character | |
129 | classes", because they are denoted by brackets C<[...]>, with the set of | |
130 | characters to be possibly matched inside. But we'll drop the "bracketed" | |
131 | below to correspond with common usage. Here are some examples of | |
132 | (bracketed) character classes: | |
47f9c88b GS |
133 | |
134 | /cat/; # matches 'cat' | |
6425a278 | 135 | /[bcr]at/; # matches 'bat', 'cat', or 'rat' |
47f9c88b GS |
136 | "abc" =~ /[cab]/; # matches 'a' |
137 | ||
138 | In the last statement, even though C<'c'> is the first character in | |
6425a278 | 139 | the class, the earliest point at which the regex can match is C<'a'>. |
47f9c88b GS |
140 | |
141 | /[yY][eE][sS]/; # match 'yes' in a case-insensitive way | |
142 | # 'yes', 'Yes', 'YES', etc. | |
143 | /yes/i; # also match 'yes' in a case-insensitive way | |
144 | ||
145 | The last example shows a match with an C<'i'> B<modifier>, which makes | |
146 | the match case-insensitive. | |
147 | ||
148 | Character classes also have ordinary and special characters, but the | |
149 | sets of ordinary and special characters inside a character class are | |
150 | different than those outside a character class. The special | |
151 | characters for a character class are C<-]\^$> and are matched using an | |
152 | escape: | |
153 | ||
154 | /[\]c]def/; # matches ']def' or 'cdef' | |
155 | $x = 'bcr'; | |
156 | /[$x]at/; # matches 'bat', 'cat', or 'rat' | |
157 | /[\$x]at/; # matches '$at' or 'xat' | |
158 | /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat' | |
159 | ||
160 | The special character C<'-'> acts as a range operator within character | |
161 | classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]> | |
162 | become the svelte C<[0-9]> and C<[a-z]>: | |
163 | ||
164 | /item[0-9]/; # matches 'item0' or ... or 'item9' | |
165 | /[0-9a-fA-F]/; # matches a hexadecimal digit | |
166 | ||
167 | If C<'-'> is the first or last character in a character class, it is | |
168 | treated as an ordinary character. | |
169 | ||
170 | The special character C<^> in the first position of a character class | |
171 | denotes a B<negated character class>, which matches any character but | |
6425a278 | 172 | those in the brackets. Both C<[...]> and C<[^...]> must match a |
47f9c88b GS |
173 | character, or the match fails. Then |
174 | ||
175 | /[^a]at/; # doesn't match 'aat' or 'at', but matches | |
176 | # all other 'bat', 'cat', '0at', '%at', etc. | |
177 | /[^0-9]/; # matches a non-numeric character | |
178 | /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary | |
179 | ||
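The following sketch filters a few made-up word lists with C<grep> to show plain, negated, and range classes side by side. The anchors C<^> and C<$> are added so that each test applies to the whole word.

```perl
use strict;
use warnings;

my @words = qw(bat cat rat hat aat);    # sample data, assumed for illustration

# [bcr] allows any one of b, c, or r at that position
my @bcr = grep { /^[bcr]at$/ } @words;    # bat cat rat

# [^a] requires one character that is not 'a'
my @not_a = grep { /^[^a]at$/ } @words;   # bat cat rat hat

# ranges: exactly one hexadecimal digit
my @hex = grep { /^[0-9a-fA-F]$/ } qw(7 b F g 10);    # 7 b F

print "@bcr | @not_a | @hex\n";
# prints "bat cat rat | bat cat rat hat | 7 b F"
```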
caedc70b | 180 | Perl has several abbreviations for common character classes. (These |
b81c9143 KW |
181 | definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier. |
182 | Otherwise they could match many more non-ASCII Unicode characters as | |
183 | well. See L<perlrecharclass/Backslash sequences> for details.) | |
47f9c88b GS |
184 | |
185 | =over 4 | |
186 | ||
187 | =item * | |
551e1d92 | 188 | |
5d525260 CW |
189 | \d is a digit and represents |
190 | ||
191 | [0-9] | |
47f9c88b GS |
192 | |
193 | =item * | |
551e1d92 | 194 | |
5d525260 CW |
195 | \s is a whitespace character and represents |
196 | ||
197 | [\ \t\r\n\f] | |
47f9c88b GS |
198 | |
199 | =item * | |
551e1d92 | 200 | |
5d525260 CW |
201 | \w is a word character (alphanumeric or _) and represents |
202 | ||
203 | [0-9a-zA-Z_] | |
47f9c88b GS |
204 | |
205 | =item * | |
551e1d92 | 206 | |
5d525260 CW |
207 | \D is a negated \d; it represents any character but a digit |
208 | ||
209 | [^0-9] | |
47f9c88b GS |
210 | |
211 | =item * | |
551e1d92 | 212 | |
5d525260 CW |
213 | \S is a negated \s; it represents any non-whitespace character |
214 | ||
215 | [^\s] | |
47f9c88b GS |
216 | |
217 | =item * | |
551e1d92 | 218 | |
5d525260 CW |
219 | \W is a negated \w; it represents any non-word character |
220 | ||
221 | [^\w] | |
47f9c88b GS |
222 | |
223 | =item * | |
551e1d92 | 224 | |
47f9c88b GS |
225 | The period '.' matches any character but "\n" |
226 | ||
227 | =back | |
228 | ||
229 | The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside | |
230 | of character classes. Here are some in use: | |
231 | ||
232 | /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format | |
233 | /[\d\s]/; # matches any digit or whitespace character | |
234 | /\w\W\w/; # matches a word char, followed by a | |
235 | # non-word char, followed by a word char | |
236 | /..rt/; # matches any two chars, followed by 'rt' | |
237 | /end\./; # matches 'end.' | |
238 | /end[.]/; # same thing, matches 'end.' | |
239 | ||
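To see these abbreviations at work, here is a small sketch that filters invented sample strings. Note that the first pattern is unanchored, so it finds a time anywhere in the string.

```perl
use strict;
use warnings;

# Sample strings, assumed for illustration
my @candidates = ("12:34:56", "1:23:45", "ab:cd:ef", "at 09:00:00 sharp");

# \d\d:\d\d:\d\d needs two digits in each of the three fields
my @with_time = grep { /\d\d:\d\d:\d\d/ } @candidates;

# \w\W\w needs a non-word character between two word characters;
# '_' counts as a word character, so 'a_b' is rejected
my @separated = grep { /\w\W\w/ } ("a-b", "ab", "a b", "a_b");

print join("|", @with_time), "\n";    # prints "12:34:56|at 09:00:00 sharp"
print join("|", @separated), "\n";    # prints "a-b|a b"
```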
240 | The S<B<word anchor> > C<\b> matches a boundary between a word | |
241 | character and a non-word character C<\w\W> or C<\W\w>: | |
242 | ||
243 | $x = "Housecat catenates house and cat"; | |
244 | $x =~ /\bcat/; # matches cat in 'catenates' | |
245 | $x =~ /cat\b/; # matches cat in 'Housecat' | |
246 | $x =~ /\bcat\b/; # matches 'cat' at end of string | |
247 | ||
248 | In the last example, the end of the string is considered a word | |
249 | boundary. | |
ae3bb8ea KW |
250 | |
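A typical use of C<\b> is to find a word only when it stands alone. The sketch below uses made-up test strings to contrast a whole-word match with a start-of-word match.

```perl
use strict;
use warnings;

my @tests = ("cat", "a cat sat", "concatenate", "cats");    # sample data

# \bcat\b: 'cat' with a word boundary on both sides
my @whole_word = grep { /\bcat\b/ } @tests;

# \bcat: 'cat' only needs to begin a word
my @starts_word = grep { /\bcat/ } @tests;

print join("|", @whole_word), "\n";     # prints "cat|a cat sat"
print join("|", @starts_word), "\n";    # prints "cat|a cat sat|cats"
```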
251 | For natural language processing (so that, for example, apostrophes are | |
252 | included in words), use instead C<\b{wb}> | |
253 | ||
254 | "don't" =~ / .+? \b{wb} /x; # matches the whole string | |
47f9c88b GS |
255 | |
256 | =head2 Matching this or that | |
257 | ||
da75cd15 | 258 | We can match different character strings with the B<alternation> |
6425a278 | 259 | metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex |
1e2a213d | 260 | C<dog|cat>. As before, Perl will try to match the regex at the |
47f9c88b | 261 | earliest possible point in the string. At each character position, |
1e2a213d KW |
262 | Perl will first try to match the first alternative, C<dog>. If |
263 | C<dog> doesn't match, Perl will then try the next alternative, C<cat>. | |
264 | If C<cat> doesn't match either, then the match fails and Perl moves to | |
47f9c88b GS |
265 | the next position in the string. Some examples: |
266 | ||
267 | "cats and dogs" =~ /cat|dog|bird/; # matches "cat" | |
268 | "cats and dogs" =~ /dog|cat|bird/; # matches "cat" | |
269 | ||
6425a278 | 270 | Even though C<dog> is the first alternative in the second regex, |
47f9c88b GS |
271 | C<cat> is able to match earlier in the string. |
272 | ||
273 | "cats" =~ /c|ca|cat|cats/; # matches "c" | |
274 | "cats" =~ /cats|cat|ca|c/; # matches "cats" | |
275 | ||
276 | At a given character position, the first alternative that allows the | |
210b36aa | 277 | regex match to succeed will be the one that matches. Here, all the |
5d525260 | 278 | alternatives match at the first string position, so the first matches. |
47f9c88b GS |
279 | |
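The ordering rules can be checked by looking at what actually matched. In this sketch the parentheses are there only to capture the matched text (capturing is covered under L</Extracting matches>), and the variable names are invented.

```perl
use strict;
use warnings;

# Earliest position wins, even though 'dog' is listed first
my ($first) = "cats and dogs" =~ /(dog|cat|bird)/;

# At the same position, the first alternative that succeeds wins
my ($short) = "cats" =~ /(c|ca|cat|cats)/;
my ($long)  = "cats" =~ /(cats|cat|ca|c)/;

print "$first $short $long\n";    # prints "cat c cats"
```

When a longer alternative should be preferred, listing it before its own prefixes, as in the last pattern, is usually the simplest approach.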
280 | =head2 Grouping things and hierarchical matching | |
281 | ||
6425a278 GS |
282 | The B<grouping> metacharacters C<()> allow a part of a regex to be |
283 | treated as a single unit. Parts of a regex are grouped by enclosing | |
284 | them in parentheses. The regex C<house(cat|keeper)> means match | |
47f9c88b GS |
285 | C<house> followed by either C<cat> or C<keeper>. Some more examples |
286 | are | |
287 | ||
288 | /(a|b)b/; # matches 'ab' or 'bb' | |
289 | /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere | |
290 | ||
291 | /house(cat|)/; # matches either 'housecat' or 'house' | |
292 | /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or | |
293 | # 'house'. Note groups can be nested. | |
294 | ||
295 | "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d', | |
296 | # because '20\d\d' can't match | |
297 | ||
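A short sketch of nested groups and the empty alternative, using invented sample data. The patterns are anchored with C<^> and C<$> so that the whole string has to fit.

```perl
use strict;
use warnings;

my @animals = ("house", "housecat", "housecats", "housekeeper");

# 'house', optionally followed by 'cat', optionally followed by 's'
my @plain = grep { /^house(cat(s|)|)$/ } @animals;
print "@plain\n";    # prints "house housecat housecats"

# The empty alternative lets the century prefix be absent
my ($century_short) = "20"   =~ /^(19|20|)\d\d$/;    # '' (empty alternative)
my ($century_long)  = "2024" =~ /^(19|20|)\d\d$/;    # '20'
print "'$century_short' '$century_long'\n";          # prints "'' '20'"
```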
298 | =head2 Extracting matches | |
299 | ||
300 | The grouping metacharacters C<()> also allow the extraction of the | |
301 | parts of a string that matched. For each grouping, the part that | |
302 | matched inside goes into the special variables C<$1>, C<$2>, etc. | |
303 | They can be used just as ordinary variables: | |
304 | ||
305 | # extract hours, minutes, seconds | |
306 | $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format | |
307 | $hours = $1; | |
308 | $minutes = $2; | |
309 | $seconds = $3; | |
310 | ||
6425a278 | 311 | In list context, a match C</regex/> with groupings will return the |
47f9c88b GS |
312 | list of matched values C<($1,$2,...)>. So we could rewrite it as |
313 | ||
314 | ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/); | |
315 | ||
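Since a failed match returns an empty list in list context, the assignment itself can be tested. A minimal sketch, with a made-up sample value for C<$time>:

```perl
use strict;
use warnings;

my $time = "07:45:30";    # sample value, assumed for illustration

# The list assignment counts the values returned, so it is false on failure
my ($hours, $minutes, $seconds) = $time =~ /^(\d\d):(\d\d):(\d\d)$/
    or die "not in hh:mm:ss format\n";

my $total = $hours * 3600 + $minutes * 60 + $seconds;
print "$total seconds since midnight\n";    # prints "27930 seconds since midnight"
```

Capturing into named variables this way also avoids relying on C<$1>, C<$2>, ... from an earlier, unrelated match.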
6425a278 | 316 | If the groupings in a regex are nested, C<$1> gets the group with the |
47f9c88b | 317 | leftmost opening parenthesis, C<$2> the next opening parenthesis, |
6425a278 | 318 | etc. For example, here is a complex regex and the matching variables |
47f9c88b GS |
319 | indicated below it: |
320 | ||
321 | /(ab(cd|ef)((gi)|j))/; | |
322 |  1  2      34 | |
323 | ||
324 | Associated with the matching variables C<$1>, C<$2>, ... are | |
d8b950dc | 325 | the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are |
6425a278 | 326 | matching variables that can be used I<inside> a regex: |
47f9c88b | 327 | |
d8b950dc | 328 | /(\w\w\w)\s\g1/; # find sequences like 'the the' in string |
47f9c88b | 329 | |
d8b950dc KW |
330 | C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>, |
331 | C<\g2>, ... only inside a regex. | |
47f9c88b GS |
332 | |
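Both the numbering of nested groups and the use of a backreference can be tried out directly. This is a sketch with invented input strings; C<\b> is added to the second pattern so that only whole three-letter words are compared.

```perl
use strict;
use warnings;

# Groups are numbered by the position of their opening parenthesis
my @groups = "abcdgi" =~ /(ab(cd|ef)((gi)|j))/;
print join(",", @groups), "\n";    # prints "abcdgi,cd,gi,gi"

# \g1 inside the regex must repeat whatever group 1 matched
my ($doubled) = "Paris in the the spring" =~ /\b(\w\w\w)\s\g1\b/;
print "$doubled\n";                # prints "the"
```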
333 | =head2 Matching repetitions | |
334 | ||
335 | The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us | |
6425a278 | 336 | to determine the number of repeats of a portion of a regex we |
47f9c88b GS |
337 | consider to be a match. Quantifiers are put immediately after the |
338 | character, character class, or grouping that we want to specify. They | |
339 | have the following meanings: | |
340 | ||
341 | =over 4 | |
342 | ||
cb49b31f RB |
343 | =item * |
344 | ||
345 | C<a?> = match 'a' 1 or 0 times | |
346 | ||
347 | =item * | |
348 | ||
349 | C<a*> = match 'a' 0 or more times, i.e., any number of times | |
350 | ||
351 | =item * | |
47f9c88b | 352 | |
cb49b31f | 353 | C<a+> = match 'a' 1 or more times, i.e., at least once |
47f9c88b | 354 | |
cb49b31f | 355 | =item * |
47f9c88b | 356 | |
cb49b31f | 357 | C<a{n,m}> = match at least C<n> times, but not more than C<m> |
47f9c88b GS |
358 | times. |
359 | ||
cb49b31f RB |
360 | =item * |
361 | ||
362 | C<a{n,}> = match at least C<n> times | |
363 | ||
364 | =item * | |
47f9c88b | 365 | |
cb49b31f | 366 | C<a{n}> = match exactly C<n> times |
47f9c88b GS |
367 | |
368 | =back | |
369 | ||
370 | Here are some examples: | |
371 | ||
372 | /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and | |
373 | # any number of digits | |
d8b950dc | 374 | /(\w+)\s+\g1/; # match doubled words of arbitrary length |
c2ac8995 NS |
375 | $year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more |
376 | # than 4 digits | |
555bd962 | 377 | $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates |
47f9c88b GS |
378 | |
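The difference between the two year patterns shows up when they are run against a few sample values. The lists below are made up for illustration.

```perl
use strict;
use warnings;

my @years = qw(7 98 123 1999 20001);    # sample input

# {2,4} accepts two, three, or four digits
my @loose = grep { /^\d{2,4}$/ } @years;
print "@loose\n";     # prints "98 123 1999"

# exactly four or exactly two digits
my @strict = grep { /^\d{4}$|^\d{2}$/ } @years;
print "@strict\n";    # prints "98 1999"

# '?' makes the preceding character optional
my @spellings = grep { /^colou?r$/ } qw(color colour colouur);
print "@spellings\n"; # prints "color colour"
```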
379 | These quantifiers will try to match as much of the string as possible, | |
6425a278 | 380 | while still allowing the regex to match. So we have |
47f9c88b | 381 | |
6425a278 | 382 | $x = 'the cat in the hat'; |
47f9c88b GS |
383 | $x =~ /^(.*)(at)(.*)$/; # matches, |
384 | # $1 = 'the cat in the h' | |
385 | # $2 = 'at' | |
386 | # $3 = '' (0 matches) | |
387 | ||
388 | The first quantifier C<.*> grabs as much of the string as possible | |
6425a278 | 389 | while still having the regex match. The second quantifier C<.*> has |
47f9c88b GS |
390 | no string left to it, so it matches 0 times. |
391 | ||
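This greediness matters in practice when a delimiter can occur more than once. One way to keep a match from running too far, sketched here with a made-up input line, is to use a negated character class in place of C<.>:

```perl
use strict;
use warnings;

my $line = 'say "hello" and "goodbye"';    # sample input

# .* runs to the last double quote in the string
my ($greedy) = $line =~ /"(.*)"/;

# [^"]* cannot cross a double quote, so it stops at the first closing one
my ($bounded) = $line =~ /"([^"]*)"/;

print "$greedy\n";     # prints: hello" and "goodbye
print "$bounded\n";    # prints: hello
```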
392 | =head2 More matching | |
393 | ||
394 | There are a few more things you might want to know about matching | |
72606c45 | 395 | operators. |
f1dc5bb2 | 396 | The global modifier C</g> allows the matching operator to match |
47f9c88b | 397 | within a string as many times as possible. In scalar context, |
f1dc5bb2 | 398 | successive matches against a string will have C</g> jump from match |
47f9c88b GS |
399 | to match, keeping track of position in the string as it goes along. |
400 | You can get or set the position with the C<pos()> function. | |
401 | For example, | |
402 | ||
403 | $x = "cat dog house"; # 3 words | |
404 | while ($x =~ /(\w+)/g) { | |
405 | print "Word is $1, ends at position ", pos $x, "\n"; | |
406 | } | |
407 | ||
408 | prints | |
409 | ||
410 | Word is cat, ends at position 3 | |
411 | Word is dog, ends at position 7 | |
412 | Word is house, ends at position 13 | |
413 | ||
414 | A failed match or changing the target string resets the position. If | |
415 | you don't want the position reset after failure to match, add the | |
f1dc5bb2 | 416 | C</c> modifier, as in C</regex/gc>. |
47f9c88b | 417 | |
f1dc5bb2 | 418 | In list context, C</g> returns a list of matched groupings, or if |
6425a278 | 419 | there are no groupings, a list of matches to the whole regex. So |
47f9c88b GS |
420 | |
421 | @words = ($x =~ /(\w+)/g); # matches, | |
422 | # $words[0] = 'cat' | |
423 | # $words[1] = 'dog' | |
424 | # $words[2] = 'house' | |
425 | ||
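The two contexts can be contrasted in a short sketch. The input strings and variable names are invented for this example.

```perl
use strict;
use warnings;

# Scalar context: each iteration finds the next match; pos() gives its end
my $dna = "ATGCATGGATG";    # sample input
my @starts;
while ($dna =~ /ATG/g) {
    push @starts, pos($dna) - length("ATG");
}
print "@starts\n";     # prints "0 4 8"

# List context: all matches are returned at once
my $text = "3 apples, 12 pears, 7 plums";
my @numbers = $text =~ /(\d+)/g;
print "@numbers\n";    # prints "3 12 7"

# With two groups, the list alternates between them
my @pairs = $text =~ /(\d+) (\w+)/g;
print scalar(@pairs), "\n";    # prints "6"
```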
426 | =head2 Search and replace | |
427 | ||
6425a278 | 428 | Search and replace is performed using C<s/regex/replacement/modifiers>. |
caedc70b | 429 | The C<replacement> is a Perl double-quoted string that replaces in the |
6425a278 | 430 | string whatever is matched with the C<regex>. The operator C<=~> is |
47f9c88b | 431 | also used here to associate a string with C<s///>. If matching |
caedc70b FC |
432 | against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match, |
433 | C<s///> returns the number of substitutions made; otherwise it returns | |
47f9c88b GS |
434 | false. Here are a few examples: |
435 | ||
436 | $x = "Time to feed the cat!"; | |
437 | $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" | |
438 | $y = "'quoted words'"; | |
439 | $y =~ s/^'(.*)'$/$1/; # strip single quotes, | |
440 | # $y contains "quoted words" | |
441 | ||
442 | With the C<s///> operator, the matched variables C<$1>, C<$2>, etc. | |
443 | are immediately available for use in the replacement expression. With | |
444 | the global modifier, C<s///g> will search and replace all occurrences | |
6425a278 | 445 | of the regex in the string: |
47f9c88b GS |
446 | |
447 | $x = "I batted 4 for 4"; | |
448 | $x =~ s/4/four/; # $x contains "I batted four for 4" | |
449 | $x = "I batted 4 for 4"; | |
450 | $x =~ s/4/four/g; # $x contains "I batted four for four" | |
451 | ||
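The return value and the use of captured groups in the replacement can be seen in this sketch. The sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# s///g returns the number of substitutions made
my $score = "I batted 4 for 4";
my $count = ($score =~ s/4/four/g);
print "$count: $score\n";    # prints "2: I batted four for four"

# with no match, the string is untouched and the result is false
my $none = ($score =~ s/5/five/g);
print "no change\n" unless $none;    # prints "no change"

# $1, $2, $3 are available in the replacement
my $date = "2024-03-15";
$date =~ s/^(\d{4})-(\d\d)-(\d\d)$/$3.$2.$1/;
print "$date\n";             # prints "15.03.2024"
```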
4f4d7508 DC |
452 | The non-destructive modifier C<s///r> causes the result of the substitution |
453 | to be returned instead of modifying C<$_> (or whatever variable the | |
454 | substitute was bound to with C<=~>): | |
455 | ||
456 | $x = "I like dogs."; | |
457 | $y = $x =~ s/dogs/cats/r; | |
458 | print "$x $y\n"; # prints "I like dogs. I like cats." | |
459 | ||
460 | $x = "Cats are great."; | |
555bd962 BG |
461 | print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~ |
462 | s/Frogs/Hedgehogs/r, "\n"; | |
4f4d7508 DC |
463 | # prints "Hedgehogs are great." |
464 | ||
465 | @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3); | |
466 | # @foo is now qw(X X X 1 2 3) | |
467 | ||
47f9c88b GS |
468 | The evaluation modifier C<s///e> wraps an C<eval{...}> around the |
469 | replacement string and the evaluated result is substituted for the | |
6425a278 | 470 | matched substring. Some examples: |
47f9c88b | 471 | |
6425a278 GS |
472 | # reverse all the words in a string |
473 | $x = "the cat in the hat"; | |
474 | $x =~ s/(\w+)/reverse $1/ge; # $x contains "eht tac ni eht tah" | |
47f9c88b | 475 | |
6425a278 GS |
476 | # convert percentage to decimal |
477 | $x = "A 39% hit rate"; | |
478 | $x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate" | |
47f9c88b | 479 | |
6425a278 GS |
480 | The last example shows that C<s///> can use other delimiters, such as |
481 | C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used | |
caedc70b | 482 | C<s'''>, then the regex and replacement are treated as single-quoted |
6425a278 | 483 | strings. |
47f9c88b GS |
484 | |
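A brief sketch combining C</e> with C</g>, and showing bracketing delimiters on a pattern that contains slashes. The input values are made up.

```perl
use strict;
use warnings;

# /e evaluates the replacement as Perl code for every match found by /g
my $dims = "width=10 height=25";    # sample input
$dims =~ s/(\d+)/$1 * 2/ge;
print "$dims\n";    # prints "width=20 height=50"

# s{}{} keeps '/' ordinary, so paths need no backslashes
my $path = "/usr/local/bin";
$path =~ s{^/usr/local}{/opt};
print "$path\n";    # prints "/opt/bin"
```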
485 | =head2 The split operator | |
486 | ||
6425a278 GS |
487 | C<split /regex/, string> splits C<string> into a list of substrings |
488 | and returns that list. The regex determines the character sequence | |
47f9c88b GS |
489 | that C<string> is split with respect to. For example, to split a |
490 | string into words, use | |
491 | ||
492 | $x = "Calvin and Hobbes"; | |
6425a278 GS |
493 | @word = split /\s+/, $x; # $word[0] = 'Calvin' |
494 | # $word[1] = 'and' | |
495 | # $word[2] = 'Hobbes' | |
496 | ||
497 | To extract a comma-delimited list of numbers, use | |
47f9c88b | 498 | |
6425a278 GS |
499 | $x = "1.618,2.718, 3.142"; |
500 | @const = split /,\s*/, $x; # $const[0] = '1.618' | |
501 | # $const[1] = '2.718' | |
502 | # $const[2] = '3.142' | |
503 | ||
504 | If the empty regex C<//> is used, the string is split into individual | |
5d525260 | 505 | characters. If the regex has groupings, then the list produced contains |
47f9c88b GS |
506 | the matched substrings from the groupings as well: |
507 | ||
508 | $x = "/usr/bin"; | |
509 | @parts = split m!(/)!, $x; # $parts[0] = '' | |
510 | # $parts[1] = '/' | |
511 | # $parts[2] = 'usr' | |
512 | # $parts[3] = '/' | |
513 | # $parts[4] = 'bin' | |
514 | ||
6425a278 | 515 | Since the first character of C<$x> matched the regex, C<split> prepended |
47f9c88b GS |
516 | an empty initial element to the list. |
517 | ||
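The three behaviours of C<split> described above can be tried together. This is a sketch only; the record format and variable names are assumptions.

```perl
use strict;
use warnings;

# Split a record on commas followed by optional whitespace
my $record = "alice, 30,  london";    # sample input
my ($name, $age, $city) = split /,\s*/, $record;
print "$name|$age|$city\n";    # prints "alice|30|london"

# The empty regex splits into individual characters
my @chars = split //, "abc";
print "@chars\n";              # prints "a b c"

# A grouping in the regex keeps the separators in the result
my @parts = split /(-)/, "a-b";
print "@parts\n";              # prints "a - b"
```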
67cdf558 KW |
518 | =head2 C<use re 'strict'> |
519 | ||
520 | New in v5.22, this applies stricter rules than otherwise when compiling | |
521 | regular expression patterns. It can find things that, while legal, may | |
522 | not be what you intended. | |
523 | ||
524 | See L<'strict' in re|re/'strict' mode>. | |
525 | ||
47f9c88b GS |
526 | =head1 BUGS |
527 | ||
528 | None. | |
529 | ||
530 | =head1 SEE ALSO | |
531 | ||
532 | This is just a quick start guide. For a more in-depth tutorial on | |
6425a278 | 533 | regexes, see L<perlretut> and for the reference page, see L<perlre>. |
47f9c88b GS |
534 | |
535 | =head1 AUTHOR AND COPYRIGHT | |
536 | ||
537 | Copyright (c) 2000 Mark Kvale | |
538 | All rights reserved. | |
539 | ||
540 | This document may be distributed under the same terms as Perl itself. | |
541 | ||
6425a278 GS |
542 | =head2 Acknowledgments |
543 | ||
544 | The author would like to thank Mark-Jason Dominus, Tom Christiansen, | |
545 | Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful | |
546 | comments. | |
547 | ||
47f9c88b GS |
548 | =cut |
549 |