This file contains the PCRE man page that describes the regular expressions
supported by PCRE version 6.7. Note that not all of the features are relevant
in the context of Exim. In particular, the version of PCRE that is compiled
with Exim does not include UTF-8 support, there is no mechanism for changing
the options with which the PCRE functions are called, and features such as
callout are not accessible.
-----------------------------------------------------------------------------

PCREPATTERN(3)                                                  PCREPATTERN(3)

PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions supported by PCRE
are described below. Regular expressions are also described in the Perl
documentation and in a number of books, some of which have copious
examples. Jeffrey Friedl's "Mastering Regular Expressions", published
by O'Reilly, covers regular expressions in great detail. This descrip-
tion of PCRE's regular expressions is intended as reference material.

The original operation of PCRE was on strings of one-byte characters.
However, there is now also support for UTF-8 character strings. To use
this, you must build PCRE to include UTF-8 support, and then call
pcre_compile() with the PCRE_UTF8 option. How this affects pattern
matching is mentioned in several places below. There is also a summary
of UTF-8 features in the section on UTF-8 support in the main pcre
page.

The remainder of this document discusses the patterns that are sup-
ported by PCRE when its main matching function, pcre_exec(), is used.
From release 6.0, PCRE offers a second matching function,
pcre_dfa_exec(), which matches using a different algorithm that is not
Perl-compatible. The advantages and disadvantages of the alternative
function, and how it differs from the normal function, are discussed in
the pcrematching page.

A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE_CASELESS option), letters are
matched independently of case. In UTF-8 mode, PCRE always understands
the concept of case for characters whose values are less than 128, so
caseless matching is always possible. For characters with higher val-
ues, the concept of case is supported if PCRE is compiled with Unicode
property support, but not otherwise. If you want to use caseless
matching for characters 128 and above, you must ensure that PCRE is
compiled with Unicode property support as well as with UTF-8 support.

The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.

There are two different sets of metacharacters: those that are recog-
nized anywhere in the pattern except within square brackets, and those
that are recognized in square brackets. Outside square brackets, the
metacharacters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class

The following sections describe the use of each of the metacharacters.

BACKSLASH

The backslash character has several uses. Firstly, if it is followed by
a non-alphanumeric character, it takes away any special meaning that
character may have. This use of backslash as an escape character
applies both inside and outside character classes.

For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a back-
slash, you write \\.

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
the pattern (other than in a character class) and characters between a
# outside a character class and the next newline are ignored. An escap-
ing backslash can be used to include a whitespace or # character as
part of the pattern.

If you want to remove the special meaning from a sequence of charac-
ters, you can do so by putting them between \Q and \E. This is differ-
ent from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
tion. Note the following examples:

  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.

Non-printing characters

A second use of backslash provides a way of encoding non-printing char-
acters in patterns in a visible manner. There is no restriction on the
appearance of non-printing characters, apart from the binary zero that
terminates a pattern, but when a pattern is being prepared by text
editing, it is usually easier to use one of the following escape
sequences than the binary character it represents:

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \r         carriage return (hex 0D)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..

The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
becomes hex 7B.

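The \cx rule is plain arithmetic, so it can be checked outside PCRE. The
following Python sketch is illustrative only: control_escape is a
hypothetical helper name, not a PCRE function, and it does nothing more
than apply the two steps described above.

```python
def control_escape(x: str) -> int:
    """Return the character code that \\cx denotes under the rule above."""
    code = ord(x)
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Then bit 6 (hex 40) of the character is inverted.
    return code ^ 0x40


print(hex(control_escape("z")))  # 0x1a
print(hex(control_escape("{")))  # 0x3b
print(hex(control_escape(";")))  # 0x7b
```
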
After \x, from zero to two hexadecimal digits are read (letters can be
in upper or lower case). Any number of hexadecimal digits may appear
between \x{ and }, but the value of the character code must be less
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
the maximum hexadecimal value is 7FFFFFFF). If characters other than
hexadecimal digits appear between \x{ and }, or if there is no termi-
nating }, this form of escape is not recognized. Instead, the initial
\x will be interpreted as a basic hexadecimal escape, with no following
digits, giving a character whose value is zero.

Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x. There is no difference in the way they are han-
dled. For example, \xdc is exactly the same as \x{dc}.

After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the
sequence \0\x\07 specifies two binary zeros followed by a BEL character
(code value 7). Make sure you supply two digits after the initial zero
if the pattern character that follows is itself an octal digit.

The handling of a backslash followed by a digit other than 0 is compli-
cated. Outside a character class, PCRE reads it and any following dig-
its as a decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference. A
description of how this works is given later, following the discussion
of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, PCRE re-reads
up to three octal digits following the backslash, and uses them to gen-
erate a data character. Any subsequent digits stand for themselves. In
non-UTF-8 mode, the value of a character specified in octal must be
less than \400. In UTF-8 mode, values up to \777 are permitted. For
example:

  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
            previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
            writing a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
            character with octal code 113
  \377   might be a back reference, otherwise
            the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
            followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.

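The decision between a back reference and an octal character can be
modelled in a few lines. This Python sketch is not PCRE's code: the names
interpret and groups_so_far are hypothetical, the function only follows
the rules stated above, and it ignores the \400 and \777 value limits.

```python
OCTAL = "01234567"


def interpret(digits: str, groups_so_far: int, in_class: bool = False):
    """Classify the run of decimal digits that follows a backslash.

    Returns ("backref", n, "") or ("char", code, rest), where rest holds
    the digits that are left to stand for themselves.
    """
    if digits[0] == "0":
        # \0 followed by up to two further octal digits.
        end = 1
        while end < 3 and end < len(digits) and digits[end] in OCTAL:
            end += 1
        return ("char", int(digits[:end], 8), digits[end:])
    number = int(digits)
    if not in_class and (number < 10 or number <= groups_so_far):
        return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    end = 0
    while end < 3 and end < len(digits) and digits[end] in OCTAL:
        end += 1
    code = int(digits[:end], 8) if end else 0
    return ("char", code, digits[end:])


print(interpret("40", groups_so_far=5))    # ('char', 32, '')
print(interpret("11", groups_so_far=11))   # ('backref', 11, '')
print(interpret("0113", groups_so_far=0))  # ('char', 9, '3')
print(interpret("81", groups_so_far=0))    # ('char', 0, '81')
```
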
All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
class, the sequence \b is interpreted as the backspace character (hex
08), and the sequence \X is interpreted as the character "X". Outside a
character class, these sequences have different meanings (see below).

Generic character types

The third use of backslash is for specifying generic character types.
The following are always recognized:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.

These character type sequences can appear both inside and outside char-
acter classes. They each match one character of the appropriate type.
If the current matching point is at the end of the subject string, all
of them fail, since there is no character to match.

For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32). (If
"use locale;" is included in a Perl script, \s may match the VT charac-
ter. In PCRE, it never does.)

A "word" character is an underscore or any character less than 256 that
is a letter or digit. The definition of letters and digits is con-
trolled by PCRE's low-valued character tables, and may vary if locale-
specific matching is taking place (see "Locale support" in the pcreapi
page). For example, in the "fr_FR" (French) locale, some character
codes greater than 128 are used for accented letters, and these are
matched by \w.

In UTF-8 mode, characters with values greater than 128 never match \d,
\s, or \w, and always match \D, \S, and \W. This is true even when Uni-
code character property support is available. The use of locales with
Unicode is discouraged.

Unicode character properties

When PCRE is built with Unicode character property support, three addi-
tional escape sequences to match character properties are available
when UTF-8 mode is selected. They are:

  \p{xx}   a character with the xx property
  \P{xx}   a character without the xx property
  \X       an extended Unicode sequence

The property names represented by xx above are limited to the Unicode
script names, the general category properties, and "Any", which matches
any character (including newline). Other properties such as "InMusical-
Symbols" are not currently supported by PCRE. Note that \P{Any} does
not match any characters, so always causes a match failure.

Sets of Unicode characters are defined as belonging to certain scripts.
A character from one of these sets can be matched using a script name.

Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:

Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese, Buhid, Cana-
dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic, Deseret,
Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati,
Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada,
Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag-
banwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,

Each character has exactly one general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and the
property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the gen-
eral category properties that start with that letter. In this case, in
the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:

  \p{L}
  \pL

The following general category property codes are supported:

  Pc    Connector punctuation
  Pi    Initial punctuation
  Sm    Mathematical symbol
  Zp    Paragraph separator

The special property L& is also supported: it matches a character that
has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".

The long synonyms for these properties that Perl supports (such as
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".

No character that is in the Unicode table has the Cn (unassigned) prop-
erty. Instead, this property is assumed for any code point that is not
in the Unicode table.

Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.

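The general category rules above (two-letter codes, one-letter prefixes,
circumflex negation, and L&) can be modelled with Python's unicodedata
module. This is a rough sketch, not PCRE's implementation: has_property
is a hypothetical helper, script names and "Any" are not handled, and
Python's Unicode tables may be a different version from those built into
PCRE.

```python
import unicodedata


def has_property(ch: str, prop: str) -> bool:
    """Rough model of \\p{prop} for general category codes only."""
    negate = prop.startswith("^")
    if negate:
        prop = prop[1:]
    category = unicodedata.category(ch)  # for example "Lu" for "A"
    if prop == "L&":
        result = category in ("Lu", "Ll", "Lt")
    else:
        # A one-letter name covers every category that starts with it.
        result = category.startswith(prop)
    return result != negate


print(has_property("A", "Lu"))   # True
print(has_property("a", "Lu"))   # False: case is never folded here
print(has_property("a", "L"))    # True
print(has_property("A", "^Lu"))  # False, same as \P{Lu}
```
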
The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to

  (?>\PM\pM*)

That is, it matches a character without the "mark" property, followed
by zero or more characters with the "mark" property, and treats the
sequence as an atomic group (see below). Characters with the "mark"
property are typically accents that affect the preceding character.

Matching characters by Unicode property is not fast, because PCRE has
to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE.

Simple assertions

The fourth use of backslash is for certain simple assertions. An asser-
tion specifies a condition that has to be met at a particular point in
a match, without consuming any characters from the subject string. The
use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at start of subject
  \Z     matches at end of subject or before newline at end
  \z     matches at end of subject
  \G     matches at first matching position in subject

These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a char-
acter class).

A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively.

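The word boundary definition can be applied directly and compared with a
regex engine. The sketch below uses Python's re module in ASCII mode as a
stand-in, on the assumption that its \b follows the same definition; it
is not PCRE itself, and is_word_char and boundaries are hypothetical
helper names.

```python
import re


def is_word_char(c: str) -> bool:
    """ASCII-only model of \\w: a letter, a digit, or an underscore."""
    return c == "_" or (c.isascii() and c.isalnum())


def boundaries(subject: str):
    """Positions that satisfy the word-boundary definition above."""
    found = []
    for i in range(len(subject) + 1):
        before = i > 0 and is_word_char(subject[i - 1])
        after = i < len(subject) and is_word_char(subject[i])
        if before != after:
            found.append(i)
    return found


subject = "one, two_3!"
by_definition = boundaries(subject)
by_regex = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
print(by_definition)  # [0, 3, 5, 10]
print(by_regex)       # [0, 3, 5, 10]
```
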
The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
at the very start and end of the subject string, whatever options are
set. Thus, they are independent of multiline mode. These three asser-
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
affect only the behaviour of the circumflex and dollar metacharacters.
However, if the startoffset argument of pcre_exec() is non-zero, indi-
cating that matching is to start at a point other than the beginning of
the subject, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at
the very end, whereas \z matches only at the end.

The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate argu-
ments, you can mimic Perl's /g option, and it is in this kind of imple-
mentation where \G can be useful.

Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
end of the previous match. In Perl, these can be different when the
previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.

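The repeated-call technique described above can be sketched without the C
API. Python's re module has no \G, but Pattern.match(subject, pos) only
succeeds at pos, which is the same anchoring idea as \G combined with a
non-zero startoffset. The function name match_all is hypothetical, and
this is an analogy rather than a description of pcre_exec().

```python
import re


def match_all(pattern: str, subject: str):
    """Collect successive matches, each anchored where the last one ended."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(subject):
        m = compiled.match(subject, pos)  # anchored at pos, like \G
        if m is None:
            break
        results.append(m.group())
        # Step past an empty match so that the loop always makes progress.
        pos = m.end() if m.end() > pos else pos + 1
    return results


print(match_all(r"\d\d", "123456ab78"))  # ['12', '34', '56']
```

The loop stops at "ab" instead of skipping ahead to "78", because each
attempt is tied to the current start point.
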

If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.

CIRCUMFLEX AND DOLLAR

Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching
point is at the start of the subject string. If the startoffset argu-
ment of pcre_exec() is non-zero, circumflex can never match if the
PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the sub-
ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)

A dollar character is an assertion that is true only if the current
matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
be the last character of the pattern if a number of alternatives are
involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.

The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, a circumflex
matches immediately after internal newlines as well as at the start of
the subject string. It does not match after a newline that ends the
string. A dollar matches before any newlines in the string, as well as
at the very end, when PCRE_MULTILINE is set. When newline is specified
as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.

For example, the pattern /^abc$/ matches the subject string "def\nabc"
(where \n represents a newline) in multiline mode, but not otherwise.
Consequently, patterns that are anchored in single line mode because
all branches start with ^ are not anchored in multiline mode, and a
match for circumflex is possible when the startoffset argument of
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.

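The /^abc$/ example can be tried in any Perl-style engine. The sketch
below uses Python's re module, whose re.MULTILINE flag is assumed here to
play the role of PCRE_MULTILINE for an LF newline; it is not a call into
PCRE.

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# In default mode dollar still matches before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")

print(single_line)                  # None
print(multi_line.span())            # (4, 7)
print(before_final_newline.span())  # (0, 3)
```
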

Note that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.

FULL STOP (PERIOD, DOT)

Outside a character class, a dot in the pattern matches any one charac-
ter in the subject string except (by default) a character that signi-
fies the end of a line. In UTF-8 mode, the matched character may be
more than one byte long. When a line ending is defined as a single
character (CR or LF), dot never matches that character; when the two-
character sequence CRLF is used, dot does not match CR if it is immedi-
ately followed by LF, but otherwise it matches all characters (includ-
ing isolated CRs and LFs).

The behaviour of dot with regard to newlines can be changed. If the
PCRE_DOTALL option is set, a dot matches any one character, without
exception. If newline is defined as the two-character sequence CRLF, it
takes two dots to match it.

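A short illustration of the default and "dot matches all" behaviours,
using Python's re module as a stand-in for PCRE. It assumes that
re.DOTALL corresponds to PCRE_DOTALL; Python only ever treats LF as the
line ending for dot, so the CR and CRLF cases described above are not
reproduced.

```python
import re

subject = "a\nb"

default_dot = re.search(r"a.b", subject)            # dot refuses the newline
dotall_dot = re.search(r"a.b", subject, re.DOTALL)  # dot accepts anything

print(default_dot)        # None
print(dotall_dot.span())  # (0, 3)
```
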

The handling of dot is entirely independent of the handling of circum-
flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.

MATCHING A SINGLE BYTE

Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches CR and
LF. The feature is provided in Perl in order to match individual bytes
in UTF-8 mode. Because it breaks up UTF-8 characters into individual
bytes, what remains in the string may be a malformed UTF-8 string. For
this reason, the \C escape sequence is best avoided.

PCRE does not allow \C to appear in lookbehind assertions (described
below), because in UTF-8 mode this would make it impossible to calcu-
late the length of the lookbehind.

SQUARE BRACKETS AND CHARACTER CLASSES

An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not spe-
cial. If a closing square bracket is required as a member of the class,
it should be the first data character in the class (after an initial
circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject. In UTF-8
mode, the character may occupy more than one byte. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion: it still con-
sumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.

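The point that a negated class still consumes a character is easy to
miss. The following sketch shows it with Python's re module, which is
assumed to behave like PCRE for this simple case; the subject strings
are made up for illustration.

```python
import re

# "q" followed by any character that is not "u": needs a character to consume.
followed = re.search(r"q[^u]", "Iraqi")
at_end = re.search(r"q[^u]", "Iraq")

print(followed.group())  # qi
print(at_end)            # None: nothing is left for [^u] to match
```
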

In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use
caseless matching for characters 128 and above, you must ensure that
PCRE is compiled with Unicode property support as well as with UTF-8
support.

Characters that might indicate line breaks (CR and LF) are never
treated in any special way when matching character classes, whatever
line-ending sequence is in use, and whatever setting of the PCRE_DOTALL
and PCRE_MULTILINE options is used. A class such as [^a] always matches
one of these characters.

The minus (hyphen) character can be used to specify a range of charac-
ters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

It is not possible to have the literal character "]" as the end charac-
ter of a range. A pattern such as [W-]46] is interpreted as a class of
two characters ("W" and "-") followed by a literal string "46]", so it
would match "W46]" or "-46]". However, if the "]" is escaped with a
backslash it is interpreted as the end of range, so [W-\]46] is inter-
preted as a class containing a range followed by two other characters.
The octal or hexadecimal representation of "]" can also be used to end
a range.

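The two readings of "]" after a hyphen can be compared side by side. The
sketch below runs both patterns through Python's re module, which is
assumed to parse these two classes in the same way as PCRE; it is an
illustration, not a PCRE test.

```python
import re

# [W-] is a class of "W" and "-"; the rest, 46], is literal text.
literal_tail = re.compile(r"[W-]46]")
# With the "]" escaped, W-\] is a range, and 4 and 6 are further members.
escaped_range = re.compile(r"[W-\]46]")

print(bool(literal_tail.fullmatch("W46]")))  # True
print(bool(literal_tail.fullmatch("-46]")))  # True
print(bool(escaped_range.fullmatch("Z")))    # True: Z lies between W and ]
print(bool(escaped_range.fullmatch("-")))    # False: hyphen is not a member
```
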

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF-8 mode, PCRE supports the
concept of case for characters with values greater than 128 only when
it is compiled with Unicode property support.

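The stated equivalence for [W-c] can be checked by enumerating the range.
The sketch below does this with plain Python sets and then tries the
range in Python's re module with case folding turned on; the assumption
is that re.IGNORECASE treats an ASCII range as PCRE_CASELESS does.

```python
import re

# Characters W..c in code order, folded to lower case.
folded_range = {chr(code).lower() for code in range(ord("W"), ord("c") + 1)}
# The members of the class given in the text: ] [ \ ^ _ ` w x y z a b c
listed_class = set("][\\^_`wxyzabc")

caseless = re.compile(r"[W-c]", re.ASCII | re.IGNORECASE)
matched = [ch for ch in "VWXYZabcdwxyz" if caseless.fullmatch(ch)]

print(folded_range == listed_class)  # True
print("".join(matched))              # WXYZabcwxyz
```
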

The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
in a character class, and add the characters that they match to the
class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
flex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower
case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.

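Both example classes can be tried directly. This sketch uses Python's re
module in ASCII mode as an approximation of PCRE without UTF-8 or locale
tables; the sample strings are made up for illustration.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

sample = "a_b-C 9"
kept = "".join(letter_or_digit.findall(sample))

print(kept)                            # abC9
print(bool(hex_digit.fullmatch("F")))  # True
print(bool(hex_digit.fullmatch("f")))  # False unless matching is caseless
```
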

The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.

POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,

  [01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class
names are

  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (not quite the same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

  [12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.

In UTF-8 mode, characters with values greater than 128 do not match any
of the POSIX character classes.

VERTICAL BAR

Vertical bar characters are used to separate alternative patterns. For
example, the pattern

  gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
to right, and the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.

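The left-to-right rule has visible consequences when one alternative is a
prefix of another. The sketch below shows this with Python's re module,
which is assumed to use the same leftmost-alternative strategy as
pcre_exec(); the patterns other than gilbert|sullivan are made up for
illustration.

```python
import re

either = re.fullmatch(r"gilbert|sullivan", "sullivan")

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the engine falls back to the second alternative here.
rest_counts = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_branch = re.fullmatch(r"x(a|)", "x")

print(either is not None)        # True
print(first_wins)                # a
print(rest_counts)               # ab
print(empty_branch is not None)  # True
```
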

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options can be changed from within the pattern by a
sequence of Perl option letters enclosed between "(?" and ")". The
option letters are

  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a
combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
is also permitted. If a letter appears both before and after the
hyphen, the option is unset.

When an option change occurs at top level (that is, not inside subpat-
tern parentheses), the change applies to the remainder of the pattern
that follows. If the change is placed right at the start of a pattern,
PCRE extracts it into the global options (and it will therefore show up
in data extracted by the pcre_fullinfo() function).

An option change within a subpattern affects only that part of the cur-
rent pattern that follows it, so

  (a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

  (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.

724 Subpatterns are delimited by parentheses (round brackets), which can be
725 nested. Turning part of a pattern into a subpattern does two things:
1. It localizes a set of alternatives. For example, the pattern

  cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without
the parentheses, it would match "cataract", "erpillar" or the empty
string.

735 2. It sets up the subpattern as a capturing subpattern. This means
736 that, when the whole pattern matches, that portion of the subject
737 string that matched the subpattern is passed back to the caller via the
738 ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain numbers for the capturing
subpatterns.

For example, if the string "the red king" is matched against the
pattern

  the ((red|white) (king|queen))

747 the captured substrings are "red king", "red", and "king", and are num-
748 bered 1, 2, and 3, respectively.
750 The fact that plain parentheses fulfil two functions is not always
751 helpful. There are often times when a grouping subpattern is required
752 without a capturing requirement. If an opening parenthesis is followed
753 by a question mark and a colon, the subpattern does not do any captur-
754 ing, and is not counted when computing the number of any subsequent
755 capturing subpatterns. For example, if the string "the white queen" is
756 matched against the pattern
758 the ((?:red|white) (king|queen))
760 the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535, and the
maximum depth of nesting of all subpatterns, both capturing and
non-capturing, is 200.

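The numbering rule can be tried out with a short sketch. It uses Python's
re module, which is not PCRE; the assumption here is only that re counts
opening parentheses in the same way for these two patterns. The variable
names are illustrative.

```python
import re

# Plain parentheses: every opening parenthesis is given a number.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())   # ('red king', 'red', 'king')

# (?:...) groups without capturing and is skipped when numbering.
mixed = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
print(mixed.groups())       # ('white queen', 'queen')
```
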
765 As a convenient shorthand, if any option settings are required at the
766 start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

772 match exactly the same set of strings. Because alternative branches are
773 tried from left to right, and options are not reset until the end of
774 the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".

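A small sketch of the scoped form follows. It uses Python's re module as a
stand-in, not PCRE. Recent Python versions reject an option setting such
as (?i) in the middle of a pattern, so only the (?i:...) shorthand is
shown; the second pattern is a hypothetical example of an option that is
confined to one group.

```python
import re

# Scoped form: the option applies to both alternatives inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")
print(bool(weekend.fullmatch("SUNDAY")))     # True
print(bool(weekend.fullmatch("Saturday")))   # True

# Outside the group the option is not in force.
local = re.compile(r"a(?i:b)c")
print(bool(local.fullmatch("aBc")))          # True
print(bool(local.fullmatch("ABc")))          # False
```
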

NAMED SUBPATTERNS

Identifying capturing parentheses by number is simple, but it can be
782 very hard to keep track of the numbers in complicated regular expres-
783 sions. Furthermore, if an expression is modified, the numbers may
784 change. To help with this difficulty, PCRE supports the naming of sub-
785 patterns, something that Perl does not provide. The Python syntax
786 (?P<name>...) is used. References to capturing parentheses from other
787 parts of the pattern, such as backreferences, recursion, and condi-
788 tions, can be made by name as well as by number.
790 Names consist of up to 32 alphanumeric characters and underscores.
791 Named capturing parentheses are still allocated numbers as well as
792 names. The PCRE API provides function calls for extracting the name-to-
793 number translation table from a compiled pattern. There is also a con-
794 venience function for extracting a captured substring by name.
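The relationship between names and numbers can be sketched with Python's
re module, which accepts the same (?P<name>...) syntax. This is not the
PCRE API: the groupindex attribute is Python's own counterpart of a
name-to-number table, and the pattern and the names "year" and "month"
are made up for the illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2006-07")

# A named group is still numbered, so both lookups give the same text.
print(m.group("year"), m.group(1))   # 2006 2006

# Name-to-number table of the compiled pattern.
table = dict(date.groupindex)
print(table)                         # {'year': 1, 'month': 2}
```
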
796 By default, a name must be unique within a pattern, but it is possible
797 to relax this constraint by setting the PCRE_DUPNAMES option at compile
798 time. This can be useful for patterns where only one instance of the
799 named parentheses can match. Suppose you want to match the name of a
800 weekday, either as a 3-letter abbreviation or as the full name, and in
801 both cases you want to extract the abbreviation. This pattern (ignoring
802 the line breaks) does the job:
804 (?P<DN>Mon|Fri|Sun)(?:day)?|
805 (?P<DN>Tue)(?:sday)?|
806 (?P<DN>Wed)(?:nesday)?|
807 (?P<DN>Thu)(?:rsday)?|
808 (?P<DN>Sat)(?:urday)?
810 There are five capturing substrings, but only one is ever set after a
811 match. The convenience function for extracting the data by name
812 returns the substring for the first, and in this example, the only,
813 subpattern of that name that matched. This saves searching to find
814 which numbered subpattern it was. If you make a reference to a non-
815 unique named subpattern from elsewhere in the pattern, the one that
816 corresponds to the lowest number is used. For further details of the
interfaces for handling named subpatterns, see the pcreapi
documentation.


REPETITION

Repetition is specified by quantifiers, which can follow any of the
following items:

  a literal data character
  the . metacharacter
  the \C escape sequence
  the \X escape sequence (in UTF-8 mode with Unicode properties)
  an escape such as \d that matches a single character
  a character class
  a back reference (see next section)
  a parenthesized subpattern (unless it is an assertion)

835 The general repetition quantifier specifies a minimum and maximum num-
836 ber of permitted matches, by giving the two numbers in curly brackets
837 (braces), separated by a comma. The numbers must be less than 65536,
and the first must be less than or equal to the second. For example:

  z{2,4}

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
special character. If the second number is omitted, but the comma is
present, there is no upper limit; if the second number and the comma
are both omitted, the quantifier specifies an exact number of required
matches. Thus

  [aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

  \d{8}

matches exactly 8 digits. An opening curly bracket that appears in a
855 position where a quantifier is not allowed, or one that does not match
856 the syntax of a quantifier, is taken as a literal character. For exam-
857 ple, {,6} is not a quantifier, but a literal string of four characters.
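The three forms of the general quantifier can be checked with a short
sketch in Python's re module (not PCRE; the helper name "matches" is
illustrative). The {,6} case is left out on purpose, because Python reads
it as a quantifier meaning "at most 6", unlike the behaviour described
above.

```python
import re

def matches(pattern, subject):
    """Return True when the pattern matches the whole subject."""
    return re.fullmatch(pattern, subject) is not None

candidates = ["z", "zz", "zzz", "zzzz", "zzzzz"]
accepted = [s for s in candidates if matches(r"z{2,4}", s)]
print(accepted)                                # ['zz', 'zzz', 'zzzz']

print(matches(r"[aeiou]{3,}", "aeiouaeiou"))   # True: no upper limit
print(matches(r"\d{8}", "12345678"))           # True: exactly 8 digits
print(matches(r"\d{8}", "1234567"))            # False: only 7 digits
print(matches(r"a}", "a}"))                    # True: a lone } is literal
```
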
859 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
860 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
861 acters, each of which is represented by a two-byte sequence. Similarly,
862 when Unicode property support is available, \X{3} matches three Unicode
863 extended sequences, each of which may be several bytes long (and they
864 may be of different lengths).
866 The quantifier {0} is permitted, causing the expression to behave as if
867 the previous item and the quantifier were not present.
869 For convenience (and historical compatibility) the three most common
870 quantifiers have single-character abbreviations:
872 * is equivalent to {0,}
873 + is equivalent to {1,}
874 ? is equivalent to {0,1}
876 It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit,
for example:

  (a?)*

Earlier versions of Perl and PCRE used to give an error at compile time
for such patterns. However, because there are cases where this can be
useful, such patterns are now accepted, but if any repetition of the
subpattern does in fact match no characters, the loop is forcibly
broken.

888 By default, the quantifiers are "greedy", that is, they match as much
889 as possible (up to the maximum number of permitted times), without
890 causing the rest of the pattern to fail. The classic example of where
891 this gives problems is in trying to match comments in C programs. These
892 appear between /* and */ and within the comment, individual * and /
characters may appear. An attempt to match C comments by applying the
pattern

  /\*.*\*/

to the string

  /* first comment */  not comment  /* second comment */

fails, because it matches the entire string owing to the greediness of
the .* item.

However, if a quantifier is followed by a question mark, it ceases to
be greedy, and instead matches the minimum number of times possible, so
the pattern

  /\*.*?\*/

does the right thing with the C comments. The meaning of the various
912 quantifiers is not otherwise changed, just the preferred number of
913 matches. Do not confuse this use of question mark with its use as a
914 quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in

  \d??\d

which matches one digit by preference, but can match two if that is the
920 only way the rest of the pattern matches.
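The difference between greedy and minimizing repetition can be seen in a
short sketch. It uses Python's re module rather than PCRE, on the
assumption that the two agree for these simple patterns; the variable
names are illustrative.

```python
import re

text = "/* first comment */  not comment  /* second comment */"

greedy = re.findall(r"/\*.*\*/", text)
lazy = re.findall(r"/\*.*?\*/", text)

print(len(greedy))   # 1: a single match that covers the entire string
print(lazy)          # ['/* first comment */', '/* second comment */']

# \d??\d prefers one digit, but takes two when the rest requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
print(one_digit, two_digits)   # 1 12
```
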
922 If the PCRE_UNGREEDY option is set (an option which is not available in
923 Perl), the quantifiers are not greedy by default, but individual ones
924 can be made greedy by following them with a question mark. In other
925 words, it inverts the default behaviour.
927 When a parenthesized subpattern is quantified with a minimum repeat
928 count that is greater than 1 or with a limited maximum, more memory is
required for the compiled pattern, in proportion to the size of the
minimum or maximum.

932 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
933 alent to Perl's /s) is set, thus allowing the . to match newlines, the
934 pattern is implicitly anchored, because whatever follows will be tried
935 against every character position in the subject string, so there is no
936 point in retrying the overall match at any position after the first.
937 PCRE normally treats such a pattern as though it were preceded by \A.
939 In cases where it is known that the subject string contains no new-
940 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
941 mization, or alternatively using ^ to indicate anchoring explicitly.
943 However, there is one situation where the optimization cannot be used.
944 When .* is inside capturing parentheses that are the subject of a
945 backreference elsewhere in the pattern, a match at the start may fail,
and a later one succeed. Consider, for example:

  (.*)abc\1

If the subject is "xyz123abc123" the match point is the fourth
character. For this reason, such a pattern is not implicitly anchored.

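The match point quoted above can be confirmed with a sketch in Python's re
module (used here as a stand-in for PCRE). The search has to give up at
the first three starting positions before it succeeds, which is why
treating the pattern as anchored would be wrong.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

print(m.start())    # 3: offset of the fourth character
print(m.group(1))   # 123
```
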
953 When a capturing subpattern is repeated, the value captured is the sub-
954 string that matched the final iteration. For example, after
956 (tweedle[dume]{3}\s*)+
958 has matched "tweedledum tweedledee" the value of the captured substring
959 is "tweedledee". However, if there are nested capturing subpatterns,
the corresponding captured values may have been set in previous
iterations. For example, after

  (a|(b))+

matches "aba" the value of the second captured substring is "b".

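Both statements about repeated capturing subpatterns can be checked with a
brief sketch. It relies on Python's re module, not PCRE, and assumes that
re keeps captured values across iterations in the same way, which it does
for these two patterns.

```python
import re

m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(m.group(1))        # tweedledee: the final iteration wins

# The inner group last took part in the second iteration, so its
# value survives the third one.
nested = re.fullmatch(r"(a|(b))+", "aba")
print(nested.groups())   # ('a', 'b')
```
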
968 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
970 With both maximizing and minimizing repetition, failure of what follows
971 normally causes the repeated item to be re-evaluated to see if a dif-
972 ferent number of repeats allows the rest of the pattern to match. Some-
973 times it is useful to prevent this, either to change the nature of the
match, or to cause it to fail earlier than it otherwise might, when the
975 author of the pattern knows there is no point in carrying on.
Consider, for example, the pattern \d+foo when applied to the subject
line

  123456bar

After matching all 6 digits and then failing to match "foo", the normal
983 action of the matcher is to try again with only 5 digits matching the
984 \d+ item, and then with 4, and so on, before ultimately failing.
985 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
986 the means for specifying that once a subpattern has matched, it is not
987 to be re-evaluated in this way.
989 If we use atomic grouping for the previous example, the matcher would
give up immediately on failing to match "foo" the first time. The
notation is a kind of special parenthesis, starting with (?> as in this
example:

  (?>\d+)foo

This kind of parenthesis "locks up" the part of the pattern it
contains once it has matched, and a failure further into the pattern is
998 prevented from backtracking into it. Backtracking past it to previous
999 items, however, works as normal.
1001 An alternative description is that a subpattern of this type matches
1002 the string of characters that an identical standalone pattern would
1003 match, if anchored at the current point in the subject string.
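The effect of "locking up" a repeat can be sketched in Python, whose re
module is not PCRE and accepts (?>...) only from version 3.11. On older
versions the sketch falls back to a lookahead followed by a back
reference, a common emulation rather than anything PCRE does itself. The
helper name and the test subject are illustrative.

```python
import re

def compile_atomic_digits(tail):
    """Compile an atomic run of digits followed by the given tail."""
    try:
        # Native atomic group syntax (Python 3.11 and later).
        return re.compile(r"(?>\d+)" + tail)
    except re.error:
        # Emulation: a lookahead is never re-entered, and the back
        # reference then consumes exactly what the lookahead captured.
        return re.compile(r"(?=(\d+))\1" + tail)

plain = re.compile(r"\d+\d")
atomic = compile_atomic_digits(r"\d")

print(plain.match("12345") is not None)   # True: \d+ gives back one digit
print(atomic.match("12345") is None)      # True: all five digits are kept
```
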
1005 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
1006 such as the above example can be thought of as a maximizing repeat that
1007 must swallow everything it can. So, while both \d+ and \d+? are pre-
1008 pared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.

1012 Atomic groups in general can of course contain arbitrarily complicated
1013 subpatterns, and can be nested. However, when the subpattern for an
1014 atomic group is just a single repeated item, as in the example above, a
1015 simpler notation, called a "possessive quantifier" can be used. This
1016 consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as

  \d++foo

1021 Possessive quantifiers are always greedy; the setting of the
1022 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1023 simpler forms of atomic group. However, there is no difference in the
meaning or processing of a possessive quantifier and the equivalent
atomic group.

1027 The possessive quantifier syntax is an extension to the Perl syntax.
1028 Jeffrey Friedl originated the idea (and the name) in the first edition
1029 of his book. Mike McCloskey liked it, so implemented it when he built
1030 Sun's Java package, and PCRE copied it from there.
1032 When a pattern contains an unlimited repeat inside a subpattern that
1033 can itself be repeated an unlimited number of times, the use of an
1034 atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern

  (\D+|<\d+>)*[!?]

matches an unlimited number of substrings that either consist of
non-digits, or digits enclosed in <>, followed by either ! or ?. When it
1041 matches, it runs quickly. However, if it is applied to
1043 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1045 it takes a long time before reporting failure. This is because the
1046 string can be divided between the internal \D+ repeat and the external
1047 * repeat in a large number of ways, and all have to be tried. (The
1048 example uses [!?] rather than a single character at the end, because
1049 both PCRE and Perl have an optimization that allows for fast failure
1050 when a single character is used. They remember the last single charac-
1051 ter that is required for a match, and fail early if it is not present
in the string.) If the pattern is changed so that it uses an atomic
group, like this:

  ((?>\D+)|<\d+>)*[!?]

1057 sequences of non-digits cannot be broken, and failure happens quickly.

BACK REFERENCES

Outside a character class, a backslash followed by a digit greater than
1063 0 (and possibly further digits) is a back reference to a capturing sub-
1064 pattern earlier (that is, to its left) in the pattern, provided there
1065 have been that many previous capturing left parentheses.
1067 However, if the decimal number following the backslash is less than 10,
1068 it is always taken as a back reference, and causes an error only if
1069 there are not that many capturing left parentheses in the entire pat-
1070 tern. In other words, the parentheses that are referenced need not be
1071 to the left of the reference for numbers less than 10. A "forward back
1072 reference" of this type can make sense when a repetition is involved
and the subpattern to the right has participated in an earlier
iteration.

It is not possible to have a numerical "forward back reference" to a
subpattern whose number is 10 or more. However, a back reference to any
1078 subpattern is possible using named parentheses (see below). See also
1079 the subsection entitled "Non-printing characters" above for further
1080 details of the handling of digits following a backslash.
1082 A back reference matches whatever actually matched the capturing sub-
1083 pattern in the current subject string, rather than anything matching
1084 the subpattern itself (see "Subpatterns as subroutines" below for a way
1085 of doing that). So the pattern
1087 (sens|respons)e and \1ibility
1089 matches "sense and sensibility" and "response and responsibility", but
1090 not "sense and responsibility". If caseful matching is in force at the
time of the back reference, the case of letters is relevant. For
example,

  ((?i)rah)\s+\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1097 original capturing subpattern is matched caselessly.
1099 Back references to named subpatterns use the Python syntax (?P=name).
1100 We could rewrite the above example as follows:
1102 (?P<p1>(?i)rah)\s+(?P=p1)
1104 A subpattern that is referenced by name may appear in the pattern
1105 before or after the reference.
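The numbered and named forms of a back reference can be compared in a
short sketch. It uses Python's re module, which shares the (?P=name)
syntax but is not PCRE; the group name "stem" is made up for the
illustration.

```python
import re

numbered = re.compile(r"(sens|respons)e and \1ibility")
named = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

subjects = ["sense and sensibility",
            "response and responsibility",
            "sense and responsibility"]

# A back reference repeats the text that was captured, so the mixed
# third subject is rejected by both forms.
results = [(bool(numbered.fullmatch(s)), bool(named.fullmatch(s)))
           for s in subjects]
print(results)   # [(True, True), (True, True), (False, False)]
```
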
1107 There may be more than one back reference to the same subpattern. If a
1108 subpattern has not actually been used in a particular match, any back
references to it always fail. For example, the pattern

  (a|(bc))\2

always fails if it starts to match "a" rather than "bc". Because there
1114 may be many capturing parentheses in a pattern, all digits following
1115 the backslash are taken as part of a potential back reference number.
1116 If the pattern continues with a digit character, some delimiter must be
1117 used to terminate the back reference. If the PCRE_EXTENDED option is
1118 set, this can be whitespace. Otherwise an empty comment (see "Com-
1119 ments" below) can be used.
1121 A back reference that occurs inside the parentheses to which it refers
1122 fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated
subpatterns. For example, the pattern

  (a|b\1)+

matches any number of "a"s and also "aba", "ababbaa" etc. At each
iteration of the subpattern, the back reference matches the character
1130 string corresponding to the previous iteration. In order for this to
1131 work, the pattern must be such that the first iteration does not need
1132 to match the back reference. This can be done using alternation, as in
1133 the example above, or by a quantifier with a minimum of zero.

ASSERTIONS

An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above.

1143 More complicated assertions are coded as subpatterns. There are two
1144 kinds: those that look ahead of the current position in the subject
1145 string, and those that look behind it. An assertion subpattern is
1146 matched in the normal way, except that it does not cause the current
1147 matching position to be changed.
1149 Assertion subpatterns are not capturing subpatterns, and may not be
1150 repeated, because it makes no sense to assert the same thing several
1151 times. If any kind of assertion contains capturing subpatterns within
1152 it, these are counted for the purposes of numbering the capturing sub-
1153 patterns in the whole pattern. However, substring capturing is carried
1154 out only for positive assertions, because it does not make sense for
1155 negative assertions.
1157 Lookahead assertions
1159 Lookahead assertions start with (?= for positive assertions and (?! for
negative assertions. For example,

  \w+(?=;)

matches a word followed by a semicolon, but does not include the
semicolon in the match, and

  foo(?!bar)

matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern

  (?!foo)bar

1174 does not find an occurrence of "bar" that is preceded by something
1175 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1176 the assertion (?!foo) is always true when the next three characters are
1177 "bar". A lookbehind assertion is needed to achieve the other effect.
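The behaviour of lookahead assertions can be sketched with Python's re
module, which is not PCRE but handles these constructs in the same way.
The subject strings are invented for the example.

```python
import re

# The semicolon is tested but not included in the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")
print(words)                 # ['alpha', 'gamma']

# Only the "foo" that is not followed by "bar" is found.
not_followed = re.search(r"foo(?!bar)", "foobar foobaz")
print(not_followed.start())  # 7

# (?!foo)bar finds every "bar", whatever precedes it.
every_bar = [hit.start() for hit in re.finditer(r"(?!foo)bar", "foobar bar")]
print(every_bar)             # [3, 7]
```
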
1179 If you want to force a matching failure at some point in a pattern, the
1180 most convenient way to do it is with (?!) because an empty string
1181 always matches, so an assertion that requires there not to be an empty
1182 string must always fail.
1184 Lookbehind assertions
1186 Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,

  (?<!foo)bar

does find an occurrence of "bar" that is not preceded by "foo". The
1192 contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are
several top-level alternatives, they do not all have to have the same
fixed length. Thus

  (?<=bullock|donkey)

is permitted, but

  (?<!dogs?|cats?)

causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl (at least for 5.8), which
requires all branches to match the same length of string. An assertion
such as

  (?<=ab(c|de))

is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable if rewritten to use two
top-level branches:

  (?<=abc|abde)

1217 The implementation of lookbehind assertions is, for each alternative,
1218 to temporarily move the current position back by the fixed width and
1219 then try to match. If there are insufficient characters before the cur-
1220 rent position, the match is deemed to fail.
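A fixed-width lookbehind can be tried with Python's re module, which is
not PCRE. Python is stricter than PCRE here, since it requires every
alternative to have the same length, so the sketch sticks to single-branch
assertions. The subject strings are invented.

```python
import re

# Only the "bar" that is not preceded by "foo" is reported.
found = [hit.start() for hit in re.finditer(r"(?<!foo)bar", "foobar bar")]
print(found)   # [7]

# Only two characters precede "d", so a three-character positive
# lookbehind cannot be satisfied and the match fails.
short = re.search(r"(?<=abc)d", "bcd")
print(short)   # None
```
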
1222 PCRE does not allow the \C escape (which matches a single byte in UTF-8
1223 mode) to appear in lookbehind assertions, because it makes it impossi-
1224 ble to calculate the length of the lookbehind. The \X escape, which can
1225 match different numbers of bytes, is also not permitted.
1227 Atomic groups can be used in conjunction with lookbehind assertions to
1228 specify efficient matching at the end of the subject string. Consider a
simple pattern such as

  abcd$

when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the
pattern is specified as

  ^.*abcd$

the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as

  ^(?>.*)(?<=abcd)

or, equivalently, using the possessive quantifier syntax,

  ^.*+(?<=abcd)

there can be no backtracking for the .* item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.

1258 Using multiple assertions
1260 Several assertions (of any sort) may occur in succession. For example,
1262 (?<=\d{3})(?<!999)foo
1264 matches "foo" preceded by three digits that are not "999". Notice that
1265 each of the assertions is applied independently at the same point in
1266 the subject string. First there is a check that the previous three
1267 characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo"
preceded by six characters, the first three of which are digits and the
last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

1273 (?<=\d{3}...)(?<!999)foo
1275 This time the first assertion looks at the preceding six characters,
1276 checking that the first three are digits, and then the second assertion
1277 checks that the preceding three characters are not "999".
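The two patterns above can be compared side by side in a sketch that uses
Python's re module in place of PCRE. The variable names and the extra
subject strings are illustrative.

```python
import re

three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# Both assertions of the first pattern look at the same three characters.
print(bool(three_back.search("123foo")))      # True
print(bool(three_back.search("999foo")))      # False
print(bool(three_back.search("123abcfoo")))   # False

# The second pattern looks back six characters, then three.
print(bool(six_back.search("123abcfoo")))     # True
print(bool(six_back.search("123999foo")))     # False
```
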
Assertions can be nested in any combination. For example,

  (?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which in turn
1284 is not preceded by "foo", while
1286 (?<=\d{3}(?!999)...)foo
1288 is another pattern that matches "foo" preceded by three digits and any
1289 three characters that are not "999".
1292 CONDITIONAL SUBPATTERNS
1294 It is possible to cause the matching process to obey a subpattern con-
1295 ditionally or to choose between two alternative subpatterns, depending
on the result of an assertion, or whether a previous capturing
subpattern matched or not. The two possible forms of conditional
subpattern are

1300 (?(condition)yes-pattern)
1301 (?(condition)yes-pattern|no-pattern)
1303 If the condition is satisfied, the yes-pattern is used; otherwise the
1304 no-pattern (if present) is used. If there are more than two alterna-
1305 tives in the subpattern, a compile-time error occurs.
1307 There are three kinds of condition. If the text between the parentheses
1308 consists of a sequence of digits, or a sequence of alphanumeric charac-
1309 ters and underscores, the condition is satisfied if the capturing sub-
1310 pattern of that number or name has previously matched. There is a pos-
1311 sible ambiguity here, because subpattern names may consist entirely of
1312 digits. PCRE looks first for a named subpattern; if it cannot find one
1313 and the text consists entirely of digits, it looks for a subpattern of
1314 that number, which must be greater than zero. Using subpattern names
1315 that consist entirely of digits is not recommended.
1317 Consider the following pattern, which contains non-significant white
1318 space to make it more readable (assume the PCRE_EXTENDED option) and to
1319 divide it into three parts for ease of discussion:
1321 ( \( )? [^()]+ (?(1) \) )
1323 The first part matches an optional opening parenthesis, and if that
1324 character is present, sets it as the first captured substring. The sec-
1325 ond part matches one or more characters that are not parentheses. The
1326 third part is a conditional subpattern that tests whether the first set
of parentheses matched or not. If they did, that is, if the subject started
1328 with an opening parenthesis, the condition is true, and so the yes-pat-
1329 tern is executed and a closing parenthesis is required. Otherwise,
1330 since no-pattern is not present, the subpattern matches nothing. In
1331 other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses. Rewriting it to use a named
subpattern gives this:

  (?P<OPEN> \( )? [^()]+ (?(OPEN) \) )

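Both versions of the conditional pattern can be exercised with Python's re
module, which is not PCRE but accepts the same (?(1)...) and (?(name)...)
forms. The re.VERBOSE flag plays the part of PCRE_EXTENDED in this sketch,
and the subject strings are invented.

```python
import re

numbered = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)
named = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

subjects = ["(abc)", "abc", "(abc", "abc)"]

# A closing parenthesis is required only when an opening one was matched.
results = [(bool(numbered.fullmatch(s)), bool(named.fullmatch(s)))
           for s in subjects]
print(results)   # [(True, True), (True, True), (False, False), (False, False)]
```
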
1337 If the condition is the string (R), and there is no subpattern with the
1338 name R, the condition is satisfied if a recursive call to the pattern
1339 or subpattern has been made. At "top level", the condition is false.
This is a PCRE extension. Recursive patterns are described in the next
section.

1343 If the condition is not a sequence of digits or (R), it must be an
1344 assertion. This may be a positive or negative lookahead or lookbehind
1345 assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:

  (?(?=[^a-z]*[a-z])
  \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

1351 The condition is a positive lookahead assertion that matches an
1352 optional sequence of non-letters followed by a letter. In other words,
1353 it tests for the presence of at least one letter in the subject. If a
1354 letter is found, the subject is matched against the first alternative;
1355 otherwise it is matched against the second. This pattern matches
1356 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1357 letters and dd are digits.

COMMENTS

The sequence (?# marks the start of a comment that continues up to the
next closing parenthesis. Nested parentheses are not permitted. The
characters that make up a comment play no part in the pattern matching
at all.

1367 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1368 character class introduces a comment that continues to immediately
1369 after the next newline in the pattern.

RECURSIVE PATTERNS

Consider the problem of matching a string in parentheses, allowing for
1375 unlimited nested parentheses. Without the use of recursion, the best
1376 that can be done is to use a pattern that matches up to some fixed
1377 depth of nesting. It is not possible to handle an arbitrary nesting
1378 depth. Perl provides a facility that allows regular expressions to
1379 recurse (amongst other things). It does this by interpolating Perl code
1380 in the expression at run time, and the code can refer to the expression
itself. A Perl pattern to solve the parentheses problem can be created
like this:

  $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

1386 The (?p{...}) item interpolates Perl code at run time, and in this case
1387 refers recursively to the pattern in which it appears. Obviously, PCRE
1388 cannot support the interpolation of Perl code. Instead, it supports
1389 some special syntax for recursion of the entire pattern, and also for
1390 individual subpattern recursion.
1392 The special item that consists of (? followed by a number greater than
1393 zero and a closing parenthesis is a recursive call of the subpattern of
1394 the given number, provided that it occurs inside that subpattern. (If
not, it is a "subroutine" call, which is described in the next
section.) The special item (?R) is a recursive call of the entire regular
expression.

1399 A recursive subpattern call is always treated as an atomic group. That
1400 is, once it has matched some of the subject string, it is never re-
1401 entered, even if it contains untried alternatives and there is a subse-
1402 quent matching failure.
1404 This PCRE pattern solves the nested parentheses problem (assume the
1405 PCRE_EXTENDED option is set so that white space is ignored):
1407 \( ( (?>[^()]+) | (?R) )* \)
1409 First it matches an opening parenthesis. Then it matches any number of
1410 substrings which can either be a sequence of non-parentheses, or a
1411 recursive match of the pattern itself (that is, a correctly parenthe-
1412 sized substring). Finally there is a closing parenthesis.
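Python's re module has no (?R) item, so the sketch below is a hand-written
recursive function that mirrors the three steps just described. It is an
illustration of what the pattern recognises, not of how PCRE implements
recursion, and the function name match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group at pos, or -1."""
    if pos >= len(subject) or subject[pos] != "(":
        return -1                      # \( must match first
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1             # \) closes this level
        if ch == "(":
            # Counterpart of (?R): match a nested group recursively.
            pos = match_parens(subject, pos)
            if pos == -1:
                return -1
        else:
            pos += 1                   # one character of [^()]+
    return -1                          # input ended before \)

print(match_parens("(ab(cd)ef)"))   # 10
print(match_parens("(a(b)"))        # -1
```
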
1414 If this were part of a larger pattern, you would not want to recurse
1415 the entire pattern, so instead you could use this:
1417 ( \( ( (?>[^()]+) | (?1) )* \) )
1419 We have put the pattern into parentheses, and caused the recursion to
1420 refer to them instead of the whole pattern. In a larger pattern, keep-
1421 ing track of parenthesis numbers can be tricky. It may be more conve-
1422 nient to use named parentheses instead. For this, PCRE uses (?P>name),
1423 which is an extension to the Python syntax that PCRE uses for named
1424 parentheses (Perl does not provide named parentheses). We could rewrite
1425 the above example as follows:
1427 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
1429 This particular example pattern contains nested unlimited repeats, and
1430 so the use of atomic grouping for matching strings of non-parentheses
1431 is important when applying the pattern to strings that do not match.
1432 For example, when this pattern is applied to
1434 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1436 it yields "no match" quickly. However, if atomic grouping is not used,
1437 the match runs for a very long time indeed because there are so many
1438 different ways the + and * repeats can carve up the subject, and all
1439 have to be tested before failure can be reported.
1441 At the end of a match, the values set for any capturing subpatterns are
1442 those from the outermost level of the recursion at which the subpattern
1443 value is set. If you want to obtain intermediate values, a callout
function can be used (see the next section and the pcrecallout
documentation). If the pattern above is matched against

  (ab(cd)ef)

the value for the capturing parentheses is "ef", which is the last
value taken on at the top level. If additional parentheses are added,
giving

  \( ( ( (?>[^()]+) | (?R) )* ) \)
     ^                        ^

the string they capture is "ab(cd)ef", the contents of the top level
1458 parentheses. If there are more than 15 capturing parentheses in a pat-
1459 tern, PCRE has to obtain extra memory to store data during a recursion,
1460 which it does by using pcre_malloc, freeing it via pcre_free after-
1461 wards. If no memory can be obtained, the match fails with the
1462 PCRE_ERROR_NOMEMORY error.
1464 Do not confuse the (?R) item with the condition (R), which tests for
1465 recursion. Consider this pattern, which matches text in angle brack-
1466 ets, allowing for arbitrary nesting. Only digits are allowed in nested
1467 brackets (that is, when recursing), whereas any characters are permit-
1468 ted at the outer level.
1470 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
1472 In this pattern, (?(R) is the start of a conditional subpattern, with
1473 two different alternatives for the recursive and non-recursive cases.
1474 The (?R) item is the actual recursive call.
1477 SUBPATTERNS AS SUBROUTINES
1479 If the syntax for a recursive subpattern reference (either by number or
1480 by name) is used outside the parentheses to which it refers, it oper-
1481 ates like a subroutine in a programming language. An earlier example
1482 pointed out that the pattern
1484 (sens|respons)e and \1ibility
1486 matches "sense and sensibility" and "response and responsibility", but
1487 not "sense and responsibility". If instead the pattern
1489 (sens|respons)e and (?1)ibility
1491 is used, it does match "sense and responsibility" as well as the other
1492 two strings. Such references, if given numerically, must follow the
1493 subpattern to which they refer. However, named references can refer to
1494 later subpatterns.
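The difference between a back reference and a subroutine call can be
sketched with Python's re module. That module supports \1 but has no
(?1) item, so the subroutine call is imitated here by repeating the
text of the subpattern. This is an assumption made for the sake of a
runnable example, not PCRE's mechanism, and unlike a real subroutine
call the repeated group is not atomic. The names GROUP, backref and
subroutine_like are hypothetical.

```python
import re

GROUP = r"(sens|respons)"

# Back reference: \1 must match the same text that group 1 matched.
backref = re.compile(GROUP + r"e and \1ibility")

# Stand-in for (?1): the subpattern itself is applied again, so either
# alternative may be chosen the second time. The repeat is written as
# a non-capturing group to keep the group numbering unchanged.
subroutine_like = re.compile(GROUP + r"e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

backref_hits = [s for s in subjects if backref.fullmatch(s)]
subroutine_hits = [s for s in subjects if subroutine_like.fullmatch(s)]

print(backref_hits)     # the first two subjects only
print(subroutine_hits)  # all three subjects
```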
1496 Like recursive subpatterns, a "subroutine" call is always treated as an
1497 atomic group. That is, once it has matched some of the subject string,
1498 it is never re-entered, even if it contains untried alternatives and
1499 there is a subsequent matching failure.
1502 CALLOUTS
1504 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1505 Perl code to be obeyed in the middle of matching a regular expression.
1506 This makes it possible, amongst other things, to extract different
1507 substrings that match the same pair of parentheses when there is a
1508 repetition.
1510 PCRE provides a similar feature, but of course it cannot obey arbitrary
1511 Perl code. The feature is called "callout". The caller of PCRE provides
1512 an external function by putting its entry point in the global variable
1513 pcre_callout. By default, this variable contains NULL, which disables
1514 all calling out.
1516 Within a regular expression, (?C) indicates the points at which the
1517 external function is to be called. If you want to identify different
1518 callout points, you can put a number less than 256 after the letter C.
1519 The default value is zero. For example, this pattern has two callout
1520 points:
1522 (?C1)abc(?C2)def
1524 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1525 automatically installed before each item in the pattern. They are all
1526 numbered 255.
1528 During matching, when PCRE reaches a callout point (and pcre_callout is
1529 set), the external function is called. It is provided with the number
1530 of the callout, the position in the pattern, and, optionally, one item
1531 of data originally supplied by the caller of pcre_exec(). The callout
1532 function may cause matching to proceed, to backtrack, or to fail alto-
1533 gether. A complete description of the interface to the callout function
1534 is given in the pcrecallout documentation.
1536 Last updated: 06 June 2006
1537 Copyright (c) 1997-2006 University of Cambridge.