1 This file contains the PCRE man page that describes the regular expressions
2 supported by PCRE version 6.2. Note that not all of the features are relevant
3 in the context of Exim. In particular, the version of PCRE that is compiled
4 with Exim does not include UTF-8 support, there is no mechanism for changing
5 the options with which the PCRE functions are called, and features such as
6 callout are not accessible.
7 -----------------------------------------------------------------------------
9 PCREPATTERN(3) PCREPATTERN(3)
13 PCRE - Perl-compatible regular expressions
16 PCRE REGULAR EXPRESSION DETAILS
18 The syntax and semantics of the regular expressions supported by PCRE
19 are described below. Regular expressions are also described in the Perl
20 documentation and in a number of books, some of which have copious
21 examples. Jeffrey Friedl's "Mastering Regular Expressions", published
22 by O'Reilly, covers regular expressions in great detail. This descrip-
23 tion of PCRE's regular expressions is intended as reference material.
25 The original operation of PCRE was on strings of one-byte characters.
26 However, there is now also support for UTF-8 character strings. To use
27 this, you must build PCRE to include UTF-8 support, and then call
28 pcre_compile() with the PCRE_UTF8 option. How this affects pattern
29 matching is mentioned in several places below. There is also a summary
of UTF-8 features in the section on UTF-8 support in the main pcre
page.
33 The remainder of this document discusses the patterns that are sup-
34 ported by PCRE when its main matching function, pcre_exec(), is used.
35 From release 6.0, PCRE offers a second matching function,
36 pcre_dfa_exec(), which matches using a different algorithm that is not
37 Perl-compatible. The advantages and disadvantages of the alternative
38 function, and how it differs from the normal function, are discussed in
39 the pcrematching page.
41 A regular expression is a pattern that is matched against a subject
42 string from left to right. Most characters stand for themselves in a
43 pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

  The quick brown fox

48 matches a portion of a subject string that is identical to itself. When
49 caseless matching is specified (the PCRE_CASELESS option), letters are
50 matched independently of case. In UTF-8 mode, PCRE always understands
51 the concept of case for characters whose values are less than 128, so
52 caseless matching is always possible. For characters with higher val-
53 ues, the concept of case is supported if PCRE is compiled with Unicode
54 property support, but not otherwise. If you want to use caseless
55 matching for characters 128 and above, you must ensure that PCRE is
56 compiled with Unicode property support as well as with UTF-8 support.
58 The power of regular expressions comes from the ability to include
59 alternatives and repetitions in the pattern. These are encoded in the
60 pattern by the use of metacharacters, which do not stand for themselves
61 but instead are interpreted in some special way.
63 There are two different sets of metacharacters: those that are recog-
64 nized anywhere in the pattern except within square brackets, and those
65 that are recognized in square brackets. Outside square brackets, the
66 metacharacters are as follows:
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
84 Part of a pattern that is in square brackets is called a "character
85 class". In a character class the only metacharacters are:
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class
94 The following sections describe the use of each of the metacharacters.
BACKSLASH

The backslash character has several uses. Firstly, if it is followed by
100 a non-alphanumeric character, it takes away any special meaning that
101 character may have. This use of backslash as an escape character
102 applies both inside and outside character classes.
104 For example, if you want to match a * character, you write \* in the
105 pattern. This escaping action applies whether or not the following
106 character would otherwise be interpreted as a metacharacter, so it is
107 always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a
backslash, you write \\.
111 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
112 the pattern (other than in a character class) and characters between a
113 # outside a character class and the next newline character are ignored.
114 An escaping backslash can be used to include a whitespace or # charac-
115 ter as part of the pattern.
117 If you want to remove the special meaning from a sequence of charac-
118 ters, you can do so by putting them between \Q and \E. This is differ-
119 ent from Perl in that $ and @ are handled as literals in \Q...\E
120 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
121 tion. Note the following examples:
  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
The \Q...\E sequence is recognized both inside and outside character
classes.
133 Non-printing characters
135 A second use of backslash provides a way of encoding non-printing char-
136 acters in patterns in a visible manner. There is no restriction on the
137 appearance of non-printing characters, apart from the binary zero that
138 terminates a pattern, but when a pattern is being prepared by text
139 editing, it is usually easier to use one of the following escape
140 sequences than the binary character it represents:
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh... (UTF-8 mode only)
153 The precise effect of \cx is as follows: if x is a lower case letter,
154 it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
becomes hex 7B.
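
The \cx rule can be checked by hand with a few lines of code. The
following Python sketch is only an illustration of the two steps
described above (upper-casing, then inverting bit 6); the function name
control_escape is hypothetical and is not part of PCRE.

```python
def control_escape(x: str) -> int:
    """Return the character code that the escape \\cx stands for."""
    if len(x) != 1:
        raise ValueError("expected exactly one character")
    code = ord(x)
    # Step 1: a lower case letter is first converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


for ch in "z{;":
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```
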
158 After \x, from zero to two hexadecimal digits are read (letters can be
159 in upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
160 its may appear between \x{ and }, but the value of the character code
161 must be less than 2**31 (that is, the maximum hexadecimal value is
162 7FFFFFFF). If characters other than hexadecimal digits appear between
163 \x{ and }, or if there is no terminating }, this form of escape is not
164 recognized. Instead, the initial \x will be interpreted as a basic
hexadecimal escape, with no following digits, giving a character whose
value is zero.
168 Characters whose value is less than 256 can be defined by either of the
169 two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
in the way they are handled. For example, \xdc is exactly the same as
\x{dc}.
173 After \0 up to two further octal digits are read. In both cases, if
174 there are fewer than two digits, just those that are present are used.
175 Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
176 character (code value 7). Make sure you supply two digits after the
initial zero if the pattern character that follows is itself an octal
digit.
180 The handling of a backslash followed by a digit other than 0 is compli-
181 cated. Outside a character class, PCRE reads it and any following dig-
182 its as a decimal number. If the number is less than 10, or if there
183 have been at least that many previous capturing left parentheses in the
184 expression, the entire sequence is taken as a back reference. A
185 description of how this works is given later, following the discussion
186 of parenthesized subpatterns.
188 Inside a character class, or if the decimal number is greater than 9
189 and there have not been that many capturing subpatterns, PCRE re-reads
190 up to three octal digits following the backslash, and generates a sin-
191 gle byte from the least significant 8 bits of the value. Any subsequent
192 digits stand for themselves. For example:
  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
           previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
           writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
           character with octal code 113
  \377   might be a back reference, otherwise
           the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
           followed by the two characters "8" and "1"
209 Note that octal values of 100 or greater must not be introduced by a
210 leading zero, because no more than three octal digits are ever read.
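
The decision between a back reference and an octal byte can be written
out as a small procedure. The Python sketch below follows the rules as
stated in this section; it is not PCRE's own code, and the function name
and its arguments (interpret_backslash_digits, groups_so_far, in_class)
are hypothetical names chosen for the illustration.

```python
DECIMAL = "0123456789"
OCTAL = "01234567"


def interpret_backslash_digits(text, groups_so_far, in_class=False):
    """Classify the digits that follow a backslash.

    text          -- pattern text after the backslash, starting with a digit
    groups_so_far -- capturing left parentheses seen earlier in the pattern
    Returns ("backref", n, rest) or ("byte", value, rest).
    """
    if text[0] != "0" and not in_class:
        # Read the digit and any following digits as one decimal number.
        end = 0
        while end < len(text) and text[end] in DECIMAL:
            end += 1
        number = int(text[:end])
        if number < 10 or number <= groups_so_far:
            return ("backref", number, text[end:])
    # Otherwise re-read up to three octal digits after the backslash.
    end = 0
    while end < 3 and end < len(text) and text[end] in OCTAL:
        end += 1
    value = int(text[:end], 8) if end else 0
    # Only the least significant 8 bits are kept; later digits are literal.
    return ("byte", value & 0xFF, text[end:])


print(interpret_backslash_digits("0113", 0))
print(interpret_backslash_digits("81", 0))
```

The third element of the result is the text that is left over and stands
for itself, such as the "3" in \0113.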
212 All the sequences that define a single byte value or a single UTF-8
213 character (in UTF-8 mode) can be used both inside and outside character
214 classes. In addition, inside a character class, the sequence \b is
215 interpreted as the backspace character (hex 08), and the sequence \X is
216 interpreted as the character "X". Outside a character class, these
217 sequences have different meanings (see below).
219 Generic character types
221 The third use of backslash is for specifying generic character types.
222 The following are always recognized:
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character
231 Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.
235 These character type sequences can appear both inside and outside char-
236 acter classes. They each match one character of the appropriate type.
237 If the current matching point is at the end of the subject string, all
238 of them fail, since there is no character to match.
240 For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
242 characters are HT (9), LF (10), FF (12), CR (13), and space (32).
244 A "word" character is an underscore or any character less than 256 that
245 is a letter or digit. The definition of letters and digits is con-
246 trolled by PCRE's low-valued character tables, and may vary if locale-
247 specific matching is taking place (see "Locale support" in the pcreapi
248 page). For example, in the "fr_FR" (French) locale, some character
codes greater than 128 are used for accented letters, and these are
matched by \w.
252 In UTF-8 mode, characters with values greater than 128 never match \d,
253 \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
254 code character property support is available.
256 Unicode character properties
258 When PCRE is built with Unicode character property support, three addi-
259 tional escape sequences to match generic character types are available
260 when UTF-8 mode is selected. They are:
262 \p{xx} a character with the xx property
263 \P{xx} a character without the xx property
264 \X an extended Unicode sequence
266 The property names represented by xx above are limited to the Unicode
267 general category properties. Each character has exactly one such prop-
268 erty, specified by a two-letter abbreviation. For compatibility with
269 Perl, negation can be specified by including a circumflex between the
opening brace and the property name. For example, \p{^Lu} is the same
as \P{Lu}.
273 If only one letter is specified with \p or \P, it includes all the
274 properties that start with that letter. In this case, in the absence of
275 negation, the curly brackets in the escape sequence are optional; these
two examples have the same effect:

  \p{L}
  \pL

281 The following property codes are supported:
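
  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate

  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter

  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark

  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number

  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation

  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol

  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator
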
Extended properties such as "Greek" or "InMusicalSymbols" are not
supported by PCRE.
330 Specifying caseless matching does not affect these escape sequences.
331 For example, \p{Lu} always matches only upper case letters.
333 The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to

  (?>\PM\pM*)

338 That is, it matches a character without the "mark" property, followed
339 by zero or more characters with the "mark" property, and treats the
340 sequence as an atomic group (see below). Characters with the "mark"
341 property are typically accents that affect the preceding character.
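
What \X treats as one unit can be shown without a regular expression
engine. The Python sketch below uses the standard unicodedata module to
group each non-mark character with the mark characters that follow it.
It is an illustration only: the function names are hypothetical, and the
category data comes from Python's Unicode tables, not from PCRE's.

```python
import unicodedata


def is_mark(ch):
    # The general categories Mc, Me and Mn all begin with "M".
    return unicodedata.category(ch).startswith("M")


def extended_sequences(text):
    """Split text into runs of one non-mark character plus following marks."""
    sequences = []
    for ch in text:
        if is_mark(ch) and sequences:
            # A mark attaches to the sequence that is already open.
            sequences[-1] += ch
        else:
            # A non-mark character starts a new sequence. A leading mark with
            # no base is kept as its own item here, although a pattern that
            # requires a non-mark first would not match at that position.
            sequences.append(ch)
    return sequences


# "e" followed by COMBINING ACUTE ACCENT (U+0301), then a plain "a".
print([len(s) for s in extended_sequences("e\u0301a")])
```
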
343 Matching characters by Unicode property is not fast, because PCRE has
344 to search a structure that contains data for over fifteen thousand
345 characters. That is why the traditional escape sequences such as \d and
346 \w do not use Unicode properties in PCRE.
Simple assertions

The fourth use of backslash is for certain simple assertions. An assertion
specifies a condition that has to be met at a particular point in
352 a match, without consuming any characters from the subject string. The
353 use of subpatterns for more complicated assertions is described below.
354 The backslashed assertions are:
356 \b matches at a word boundary
357 \B matches when not at a word boundary
358 \A matches at start of subject
359 \Z matches at end of subject or before newline at end
360 \z matches at end of subject
361 \G matches at first matching position in subject
These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a
character class).
367 A word boundary is a position in the subject string where the current
368 character and the previous character do not both match \w or \W (i.e.
369 one matches \w and the other matches \W), or the start or end of the
370 string if the first or last character matches \w, respectively.
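
The word boundary definition can be restated as a comparison of the
characters on either side of a position. The Python sketch below assumes
the default ASCII definition of a "word" character (letters, digits and
underscore) and ignores locale tables; the function names are
hypothetical.

```python
import string

# Assumption: "word" characters are ASCII letters, digits and underscore.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_word(ch):
    return ch in WORD_CHARS


def at_word_boundary(subject, pos):
    """True if position pos (0..len(subject)) is a word boundary."""
    # Beyond either end of the subject there is no word character, so the
    # start or end is a boundary exactly when the adjacent character is one.
    before = is_word(subject[pos - 1]) if pos > 0 else False
    after = is_word(subject[pos]) if pos < len(subject) else False
    return before != after


subject = "ab, cd"
print([p for p in range(len(subject) + 1) if at_word_boundary(subject, p)])
```
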
372 The \A, \Z, and \z assertions differ from the traditional circumflex
373 and dollar (described in the next section) in that they only ever match
374 at the very start and end of the subject string, whatever options are
375 set. Thus, they are independent of multiline mode. These three asser-
376 tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
377 affect only the behaviour of the circumflex and dollar metacharacters.
378 However, if the startoffset argument of pcre_exec() is non-zero, indi-
379 cating that matching is to start at a point other than the beginning of
380 the subject, \A can never match. The difference between \Z and \z is
381 that \Z matches before a newline that is the last character of the
string as well as at the end of the string, whereas \z matches only at
the end.
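
The difference between \Z and \z comes down to one extra position. The
Python sketch below writes the two conditions as plain functions on a
subject string and an offset. The function names are hypothetical.
Python's own re module spells these assertions differently, so it is
deliberately not used here.

```python
def matches_small_z(subject, pos):
    # \z: true only at the very end of the subject.
    return pos == len(subject)


def matches_big_z(subject, pos):
    # \Z: true at the end of the subject, or immediately before a newline
    # that is the last character of the subject.
    if pos == len(subject):
        return True
    return pos == len(subject) - 1 and subject[pos] == "\n"


subject = "abc\n"
for pos in (3, 4):
    print(pos, matches_big_z(subject, pos), matches_small_z(subject, pos))
```
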
385 The \G assertion is true only when the current matching position is at
386 the start point of the match, as specified by the startoffset argument
387 of pcre_exec(). It differs from \A when the value of startoffset is
388 non-zero. By calling pcre_exec() multiple times with appropriate argu-
389 ments, you can mimic Perl's /g option, and it is in this kind of imple-
390 mentation where \G can be useful.
392 Note, however, that PCRE's interpretation of \G, as the start of the
393 current match, is subtly different from Perl's, which defines it as the
394 end of the previous match. In Perl, these can be different when the
395 previously matched string was empty. Because PCRE does just one match
396 at a time, it cannot reproduce this behaviour.
398 If all the alternatives of a pattern begin with \G, the expression is
399 anchored to the starting match position, and the "anchored" flag is set
400 in the compiled regular expression.
403 CIRCUMFLEX AND DOLLAR
405 Outside a character class, in the default matching mode, the circumflex
406 character is an assertion that is true only if the current matching
407 point is at the start of the subject string. If the startoffset argu-
408 ment of pcre_exec() is non-zero, circumflex can never match if the
409 PCRE_MULTILINE option is unset. Inside a character class, circumflex
410 has an entirely different meaning (see below).
412 Circumflex need not be the first character of the pattern if a number
413 of alternatives are involved, but it should be the first thing in each
414 alternative in which it appears if the pattern is ever to match that
415 branch. If all possible alternatives start with a circumflex, that is,
416 if the pattern is constrained to match only at the start of the sub-
417 ject, it is said to be an "anchored" pattern. (There are also other
418 constructs that can cause a pattern to be anchored.)
420 A dollar character is an assertion that is true only if the current
421 matching point is at the end of the subject string, or immediately
422 before a newline character that is the last character in the string (by
423 default). Dollar need not be the last character of the pattern if a
424 number of alternatives are involved, but it should be the last item in
any branch in which it appears. Dollar has no special meaning in a
character class.
428 The meaning of dollar can be changed so that it matches only at the
429 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
430 compile time. This does not affect the \Z assertion.
432 The meanings of the circumflex and dollar characters are changed if the
433 PCRE_MULTILINE option is set. When this is the case, they match immedi-
434 ately after and immediately before an internal newline character,
435 respectively, in addition to matching at the start and end of the sub-
436 ject string. For example, the pattern /^abc$/ matches the subject
437 string "def\nabc" (where \n represents a newline character) in multi-
438 line mode, but not otherwise. Consequently, patterns that are anchored
439 in single line mode because all branches start with ^ are not anchored
440 in multiline mode, and a match for circumflex is possible when the
startoffset argument of pcre_exec() is non-zero. The
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
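
The /^abc$/ example can be tried in any engine that follows the Perl
rules for circumflex and dollar. The sketch below uses Python's re
module as a stand-in for PCRE, on the assumption (true for this
particular case) that its re.MULTILINE flag behaves like PCRE_MULTILINE.
It does not call the PCRE library.

```python
import re

subject = "def\nabc"

# Default mode: ^ and $ are tied to the start and end of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

print(single_line)         # None
print(multi_line.span())   # (4, 7)
```
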
444 Note that the sequences \A, \Z, and \z can be used to match the start
445 and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether PCRE_MULTILINE is set or
not.
450 FULL STOP (PERIOD, DOT)
452 Outside a character class, a dot in the pattern matches any one charac-
453 ter in the subject, including a non-printing character, but not (by
454 default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
455 which might be more than one byte long, except (by default) newline. If
456 the PCRE_DOTALL option is set, dots match newlines as well. The han-
457 dling of dot is entirely independent of the handling of circumflex and
458 dollar, the only relationship being that they both involve newline
459 characters. Dot has no special meaning in a character class.
462 MATCHING A SINGLE BYTE
464 Outside a character class, the escape sequence \C matches any one byte,
465 both in and out of UTF-8 mode. Unlike a dot, it can match a newline.
466 The feature is provided in Perl in order to match individual bytes in
467 UTF-8 mode. Because it breaks up UTF-8 characters into individual
468 bytes, what remains in the string may be a malformed UTF-8 string. For
469 this reason, the \C escape sequence is best avoided.
471 PCRE does not allow \C to appear in lookbehind assertions (described
472 below), because in UTF-8 mode this would make it impossible to calcu-
473 late the length of the lookbehind.
476 SQUARE BRACKETS AND CHARACTER CLASSES
478 An opening square bracket introduces a character class, terminated by a
479 closing square bracket. A closing square bracket on its own is not spe-
480 cial. If a closing square bracket is required as a member of the class,
481 it should be the first data character in the class (after an initial
482 circumflex, if present) or escaped with a backslash.
484 A character class matches a single character in the subject. In UTF-8
485 mode, the character may occupy more than one byte. A matched character
486 must be in the set of characters defined by the class, unless the first
487 character in the class definition is a circumflex, in which case the
488 subject character must not be in the set defined by the class. If a
489 circumflex is actually required as a member of the class, ensure it is
490 not the first character, or escape it with a backslash.
492 For example, the character class [aeiou] matches any lower case vowel,
493 while [^aeiou] matches any character that is not a lower case vowel.
494 Note that a circumflex is just a convenient notation for specifying the
495 characters that are in the class by enumerating those that are not. A
496 class that starts with a circumflex is not an assertion: it still con-
497 sumes a character from the subject string, and therefore it fails if
498 the current pointer is at the end of the string.
500 In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.
504 When caseless matching is set, any letters in a class represent both
505 their upper case and lower case versions, so for example, a caseless
506 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
507 match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
508 understands the concept of case for characters whose values are less
509 than 128, so caseless matching is always possible. For characters with
510 higher values, the concept of case is supported if PCRE is compiled
511 with Unicode property support, but not otherwise. If you want to use
512 caseless matching for characters 128 and above, you must ensure that
PCRE is compiled with Unicode property support as well as with UTF-8
support.
516 The newline character is never treated in any special way in character
517 classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
518 options is. A class such as [^a] will always match a newline.
520 The minus (hyphen) character can be used to specify a range of charac-
521 ters in a character class. For example, [d-m] matches any letter
522 between d and m, inclusive. If a minus character is required in a
523 class, it must be escaped with a backslash or appear in a position
524 where it cannot be interpreted as indicating a range, typically as the
525 first or last character in the class.
527 It is not possible to have the literal character "]" as the end charac-
528 ter of a range. A pattern such as [W-]46] is interpreted as a class of
529 two characters ("W" and "-") followed by a literal string "46]", so it
530 would match "W46]" or "-46]". However, if the "]" is escaped with a
531 backslash it is interpreted as the end of range, so [W-\]46] is inter-
532 preted as a class containing a range followed by two other characters.
The octal or hexadecimal representation of "]" can also be used to end
a range.
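
The two readings of the bracket can be compared directly. The sketch
below uses Python's re module as a stand-in, on the assumption that it
parses these two classes the same way as PCRE. It is an illustration,
not a test of the PCRE library.

```python
import re

# A class of two characters ("W" and "-"), followed by the literal "46]".
two_char_class = re.compile(r"[W-]46]")

# With the bracket escaped: the range W..] plus the characters "4" and "6".
range_class = re.compile(r"[W-\]46]")

for text in ("W46]", "-46]", "X46]"):
    print(text, two_char_class.fullmatch(text) is not None)

for text in ("X", "]", "4", "-"):
    print(text, range_class.fullmatch(text) is not None)
```
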
536 Ranges operate in the collating sequence of character values. They can
537 also be used for characters specified numerically, for example
538 [\000-\037]. In UTF-8 mode, ranges can include characters whose values
539 are greater than 255, for example [\x{100}-\x{2ff}].
541 If a range that includes letters is used when caseless matching is set,
542 it matches the letters in either case. For example, [W-c] is equivalent
543 to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
544 character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
545 accented E characters in both cases. In UTF-8 mode, PCRE supports the
546 concept of case for characters with values greater than 128 only when
547 it is compiled with Unicode property support.
549 The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
550 in a character class, and add the characters that they match to the
551 class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
552 flex can conveniently be used with the upper case character types to
553 specify a more restricted set of characters than the matching lower
case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.
557 The only metacharacters that are recognized in character classes are
558 backslash, hyphen (only where it can be interpreted as specifying a
559 range), circumflex (only at the start), opening square bracket (only
560 when it can be interpreted as introducing a POSIX class name - see the
561 next section), and the terminating closing square bracket. However,
562 escaping other non-alphanumeric characters does no harm.
565 POSIX CHARACTER CLASSES
567 Perl supports the POSIX notation for character classes. This uses names
568 enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,

  [01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class
names are

  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (not quite the same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

591 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
592 and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).
596 The name "word" is a Perl extension, and "blank" is a GNU extension
597 from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

  [12[:^digit:]]

602 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
603 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
604 these are not supported, and an error is given if they are encountered.
606 In UTF-8 mode, characters with values greater than 128 do not match any
607 of the POSIX character classes.
VERTICAL BAR

Vertical bar characters are used to separate alternative patterns. For
example, the pattern

  gilbert|sullivan

617 matches either "gilbert" or "sullivan". Any number of alternatives may
618 appear, and an empty alternative is permitted (matching the empty
619 string). The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means matching
the rest of the main pattern as well as the alternative in the
subpattern.
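
The left-to-right rule has visible consequences when one alternative is
a prefix of another. The sketch below uses Python's re module as a
stand-in for PCRE, on the assumption that both try alternatives in the
same order. The patterns a|ab and (a|ab)c are made up for the
illustration.

```python
import re

# The first alternative that succeeds is used, even if a later one
# would have matched more of the subject.
leftmost = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern, so the matcher falls back to the second alternative here.
with_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alt = re.fullmatch(r"cat(aract|erpillar|)", "cat").group(1)

print(repr(leftmost), repr(with_rest), repr(empty_alt))
```
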
626 INTERNAL OPTION SETTING
628 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
629 PCRE_EXTENDED options can be changed from within the pattern by a
sequence of Perl option letters enclosed between "(?" and ")". The
option letters are

  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED

638 For example, (?im) sets caseless, multiline matching. It is also possi-
639 ble to unset these options by preceding the letter with a hyphen, and a
640 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
641 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
642 is also permitted. If a letter appears both before and after the
643 hyphen, the option is unset.
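
The set and unset rule for option letters can be expressed as a short
parser. The Python sketch below handles only the text between "(?" and
")" and only the four Perl-compatible letters. The function name
parse_option_setting is hypothetical, and PCRE's real parser does more
than this.

```python
OPTION_LETTERS = {
    "i": "PCRE_CASELESS",
    "m": "PCRE_MULTILINE",
    "s": "PCRE_DOTALL",
    "x": "PCRE_EXTENDED",
}


def parse_option_setting(letters):
    """Split option letters such as "im-sx" into (set, unset) option names."""
    to_set, _, to_unset = letters.partition("-")
    unset = {OPTION_LETTERS[c] for c in to_unset}
    # A letter that appears on both sides of the hyphen ends up unset.
    set_ = {OPTION_LETTERS[c] for c in to_set} - unset
    return set_, unset


print(parse_option_setting("im-sx"))
```
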
645 When an option change occurs at top level (that is, not inside subpat-
646 tern parentheses), the change applies to the remainder of the pattern
647 that follows. If the change is placed right at the start of a pattern,
648 PCRE extracts it into the global options (and it will therefore show up
649 in data extracted by the pcre_fullinfo() function).
An option change within a subpattern affects only that part of the current
pattern that follows it, so

  (a(?i)b)c

656 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
657 used). By this means, options can be made to have different settings
658 in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

  (a(?i)b|c)

664 matches "ab", "aB", "c", and "C", even though when matching "C" the
665 first branch is abandoned before the option setting. This is because
666 the effects of option settings happen at compile time. There would be
667 some very weird behaviour otherwise.
669 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
670 in the same way as the Perl-compatible options by using the characters
671 U and X respectively. The (?X) flag setting is special in that it must
672 always occur earlier in the pattern than any of the additional features
it turns on, even when it is at top level. It is best to put it at the
start.

SUBPATTERNS

Subpatterns are delimited by parentheses (round brackets), which can be
680 nested. Turning part of a pattern into a subpattern does two things:
1. It localizes a set of alternatives. For example, the pattern

  cat(aract|erpillar|)

686 matches one of the words "cat", "cataract", or "caterpillar". Without
the parentheses, it would match "cataract", "erpillar" or the empty
string.
690 2. It sets up the subpattern as a capturing subpattern. This means
691 that, when the whole pattern matches, that portion of the subject
692 string that matched the subpattern is passed back to the caller via the
693 ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain numbers for the capturing
subpatterns.
For example, if the string "the red king" is matched against the
pattern

  the ((red|white) (king|queen))

702 the captured substrings are "red king", "red", and "king", and are num-
703 bered 1, 2, and 3, respectively.
705 The fact that plain parentheses fulfil two functions is not always
706 helpful. There are often times when a grouping subpattern is required
707 without a capturing requirement. If an opening parenthesis is followed
708 by a question mark and a colon, the subpattern does not do any captur-
709 ing, and is not counted when computing the number of any subsequent
710 capturing subpatterns. For example, if the string "the white queen" is
711 matched against the pattern
713 the ((?:red|white) (king|queen))
715 the captured substrings are "white queen" and "queen", and are numbered
716 1 and 2. The maximum number of capturing subpatterns is 65535, and the
maximum depth of nesting of all subpatterns, both capturing and
non-capturing, is 200.
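
The numbering of captured substrings can be observed with the two
patterns above. The sketch below uses Python's re module as a stand-in
for PCRE, on the assumption that it counts opening parentheses in the
same way. The groups() call is Python's interface, not the ovector of
pcre_exec().

```python
import re

capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())        # ('red king', 'red', 'king')

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.match(
    r"the ((?:red|white) (king|queen))", "the white queen"
)
print(non_capturing.groups())    # ('white queen', 'queen')
```
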
720 As a convenient shorthand, if any option settings are required at the
721 start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

727 match exactly the same set of strings. Because alternative branches are
728 tried from left to right, and options are not reset until the end of
729 the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".
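
The shorthand with the option letter between "?" and ":" can be tried as
follows. The sketch uses Python's re module as a stand-in for PCRE. It
exercises only the (?i: ... ) spelling, because current Python versions
reject an inline (?i) that is not at the very start of the pattern, so
the second spelling is specific to PCRE and Perl here.

```python
import re

# The caseless option applies to every branch of the subpattern.
scoped = re.compile(r"(?i:saturday|sunday)")

for text in ("Saturday", "SUNDAY", "monday"):
    print(text, scoped.fullmatch(text) is not None)
```
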

NAMED SUBPATTERNS

Identifying capturing parentheses by number is simple, but it can be
737 very hard to keep track of the numbers in complicated regular expres-
738 sions. Furthermore, if an expression is modified, the numbers may
739 change. To help with this difficulty, PCRE supports the naming of sub-
740 patterns, something that Perl does not provide. The Python syntax
741 (?P<name>...) is used. Names consist of alphanumeric characters and
742 underscores, and must be unique within a pattern.
744 Named capturing parentheses are still allocated numbers as well as
745 names. The PCRE API provides function calls for extracting the name-to-
746 number translation table from a compiled pattern. There is also a con-
747 venience function for extracting a captured substring by name. For fur-
748 ther details see the pcreapi documentation.
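As an illustration only, the same (?P<name>...) syntax is accepted by Python's re module, whose groupindex attribute plays a role comparable to the name-to-number table mentioned above. This sketch is not the PCRE C API (see pcreapi for that); the group names year and month are made up for the example.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = pattern.match("2005-02")

by_name = m.group("year")              # extract a captured substring by name
by_number = m.group(1)                 # the same group still has number 1
name_table = dict(pattern.groupindex)  # name-to-number translation table
```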
751 REPETITION
753 Repetition is specified by quantifiers, which can follow any of the
754 following items:
756 a literal data character
757 the . metacharacter
758 the \C escape sequence
759 the \X escape sequence (in UTF-8 mode with Unicode properties)
760 an escape such as \d that matches a single character
761 a character class
762 a back reference (see next section)
763 a parenthesized subpattern (unless it is an assertion)
765 The general repetition quantifier specifies a minimum and maximum num-
766 ber of permitted matches, by giving the two numbers in curly brackets
767 (braces), separated by a comma. The numbers must be less than 65536,
768 and the first must be less than or equal to the second. For example:
770 z{2,4}
772 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
773 special character. If the second number is omitted, but the comma is
774 present, there is no upper limit; if the second number and the comma
775 are both omitted, the quantifier specifies an exact number of required
776 matches. Thus
778 [aeiou]{3,}
780 matches at least 3 successive vowels, but may match many more, while
782 \d{8}
784 matches exactly 8 digits. An opening curly bracket that appears in a
785 position where a quantifier is not allowed, or one that does not match
786 the syntax of a quantifier, is taken as a literal character. For exam-
787 ple, {,6} is not a quantifier, but a literal string of four characters.
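The three forms of the general quantifier can be tried out as follows. Python's re module is used here as a stand-in for PCRE, which is an assumption of this sketch. One difference is worth knowing: Python treats {,6} as a quantifier meaning {0,6}, so the literal-string rule described above for {,6} does not carry over.

```python
import re

two_to_four = re.compile(r"z{2,4}")         # minimum 2, maximum 4
at_least_three = re.compile(r"[aeiou]{3,}") # no upper limit
exactly_eight = re.compile(r"\d{8}")        # exact count

# Which of these strings consist of 2 to 4 "z" characters and nothing else?
zs = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if two_to_four.fullmatch(s)]

# The first run of three or more vowels; the repeat is greedy and takes all five.
vowels = at_least_three.search("queueing").group()

eight_digits = exactly_eight.fullmatch("20050228") is not None
seven_digits = exactly_eight.fullmatch("2005022") is not None
```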
789 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
790 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
791 acters, each of which is represented by a two-byte sequence. Similarly,
792 when Unicode property support is available, \X{3} matches three Unicode
793 extended sequences, each of which may be several bytes long (and they
794 may be of different lengths).
796 The quantifier {0} is permitted, causing the expression to behave as if
797 the previous item and the quantifier were not present.
799 For convenience (and historical compatibility) the three most common
800 quantifiers have single-character abbreviations:
802 * is equivalent to {0,}
803 + is equivalent to {1,}
804 ? is equivalent to {0,1}
806 It is possible to construct infinite loops by following a subpattern
807 that can match no characters with a quantifier that has no upper limit,
808 for example:
810 (a?)*
812 Earlier versions of Perl and PCRE used to give an error at compile time
813 for such patterns. However, because there are cases where this can be
814 useful, such patterns are now accepted, but if any repetition of the
815 subpattern does in fact match no characters, the loop is forcibly
816 broken.
818 By default, the quantifiers are "greedy", that is, they match as much
819 as possible (up to the maximum number of permitted times), without
820 causing the rest of the pattern to fail. The classic example of where
821 this gives problems is in trying to match comments in C programs. These
822 appear between /* and */ and within the comment, individual * and /
823 characters may appear. An attempt to match C comments by applying the
824 pattern
826 /\*.*\*/
828 to the string
830 /* first comment */ not comment /* second comment */
832 fails, because it matches the entire string owing to the greediness of
833 the .* item.
835 However, if a quantifier is followed by a question mark, it ceases to
836 be greedy, and instead matches the minimum number of times possible, so
837 the pattern
839 /\*.*?\*/
841 does the right thing with the C comments. The meaning of the various
842 quantifiers is not otherwise changed, just the preferred number of
843 matches. Do not confuse this use of question mark with its use as a
844 quantifier in its own right. Because it has two uses, it can sometimes
845 appear doubled, as in
847 \d??\d
849 which matches one digit by preference, but can match two if that is the
850 only way the rest of the pattern matches.
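The difference between greedy and minimizing repeats can be seen on the C comment subject used above. The sketch below uses Python's re module as a stand-in for PCRE (an assumption; both engines treat these quantifiers alike), with illustrative variable names.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so the whole subject is one match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first "*/" that lets the pattern succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the pattern needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.match(r"\d??\d$", "12").group()
```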
852 If the PCRE_UNGREEDY option is set (an option which is not available in
853 Perl), the quantifiers are not greedy by default, but individual ones
854 can be made greedy by following them with a question mark. In other
855 words, it inverts the default behaviour.
857 When a parenthesized subpattern is quantified with a minimum repeat
858 count that is greater than 1 or with a limited maximum, more memory is
859 required for the compiled pattern, in proportion to the size of the
860 minimum or maximum.
862 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
863 alent to Perl's /s) is set, thus allowing the . to match newlines, the
864 pattern is implicitly anchored, because whatever follows will be tried
865 against every character position in the subject string, so there is no
866 point in retrying the overall match at any position after the first.
867 PCRE normally treats such a pattern as though it were preceded by \A.
869 In cases where it is known that the subject string contains no new-
870 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
871 mization, or alternatively using ^ to indicate anchoring explicitly.
873 However, there is one situation where the optimization cannot be used.
874 When .* is inside capturing parentheses that are the subject of a
875 backreference elsewhere in the pattern, a match at the start may fail,
876 and a later one succeed. Consider, for example:
878 (.*)abc\1
880 If the subject is "xyz123abc123" the match point is the fourth charac-
881 ter. For this reason, such a pattern is not implicitly anchored.
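The match point can be confirmed with a small sketch, again using Python's re module as a stand-in for PCRE (an assumption). Offsets in Python are zero-based, so "the fourth character" appears as offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()      # zero-based offset of the match
whole = m.group(0)     # the complete match
captured = m.group(1)  # what (.*) held when the back reference succeeded
```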
883 When a capturing subpattern is repeated, the value captured is the sub-
884 string that matched the final iteration. For example, after
886 (tweedle[dume]{3}\s*)+
888 has matched "tweedledum tweedledee" the value of the captured substring
889 is "tweedledee". However, if there are nested capturing subpatterns,
890 the corresponding captured values may have been set in previous
891 iterations. For example, after
893 /(a|(b))+/
895 matches "aba" the value of the second captured substring is "b".
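Both cases can be reproduced in a few lines. Python's re module is assumed here as a stand-in for PCRE; it keeps captured values from earlier iterations in the same way.

```python
import re

# A repeated capturing group reports the text of its final iteration.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The inner group was last set on the second iteration ("b") and keeps that
# value even though the third iteration matched "a" without using it.
m2 = re.match(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)
```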
898 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
900 With both maximizing and minimizing repetition, failure of what follows
901 normally causes the repeated item to be re-evaluated to see if a dif-
902 ferent number of repeats allows the rest of the pattern to match. Some-
903 times it is useful to prevent this, either to change the nature of the
904 match, or to cause it to fail earlier than it otherwise might, when the
905 author of the pattern knows there is no point in carrying on.
907 Consider, for example, the pattern \d+foo when applied to the subject
908 line
910 123456bar
912 After matching all 6 digits and then failing to match "foo", the normal
913 action of the matcher is to try again with only 5 digits matching the
914 \d+ item, and then with 4, and so on, before ultimately failing.
915 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
916 the means for specifying that once a subpattern has matched, it is not
917 to be re-evaluated in this way.
919 If we use atomic grouping for the previous example, the matcher would
920 give up immediately on failing to match "foo" the first time. The
921 notation is a kind of special parenthesis, starting with (?> as in this
922 example:
924 (?>\d+)foo
926 This kind of parenthesis "locks up" the part of the pattern it con-
927 tains once it has matched, and a failure further into the pattern is
928 prevented from backtracking into it. Backtracking past it to previous
929 items, however, works as normal.
931 An alternative description is that a subpattern of this type matches
932 the string of characters that an identical standalone pattern would
933 match, if anchored at the current point in the subject string.
935 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
936 such as the above example can be thought of as a maximizing repeat that
937 must swallow everything it can. So, while both \d+ and \d+? are pre-
938 pared to adjust the number of digits they match in order to make the
939 rest of the pattern match, (?>\d+) can only match an entire sequence of
940 digits.
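The effect of "swallowing everything" can be sketched in Python, with two caveats. Python's re module is not PCRE, and it accepts (?>...) only from version 3.11, so this sketch emulates an atomic group with the idiom (?=(...))\1 instead. That idiom relies on the assumption that a lookahead, once it has matched, is not re-entered on backtracking. The subject strings are made up for the example.

```python
import re

# Ordinary greedy repeat: \d+ gives back one digit so the final "6" can match.
backtracking = re.match(r"\d+6", "123456")

# Emulated atomic group: the lookahead captures the whole run of digits, the
# back reference consumes it, and nothing is handed back, so "6" cannot match.
atomic_like = re.match(r"(?=(\d+))\1(?:6)", "123456")

# When the tail matches without any backtracking, both forms agree.
atomic_ok = re.match(r"(?=(\d+))\1foo", "123456foo")
```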
942 Atomic groups in general can of course contain arbitrarily complicated
943 subpatterns, and can be nested. However, when the subpattern for an
944 atomic group is just a single repeated item, as in the example above, a
945 simpler notation, called a "possessive quantifier" can be used. This
946 consists of an additional + character following a quantifier. Using
947 this notation, the previous example can be rewritten as
949 \d++foo
951 Possessive quantifiers are always greedy; the setting of the
952 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
953 simpler forms of atomic group. However, there is no difference in the
954 meaning or processing of a possessive quantifier and the equivalent
955 atomic group.
957 The possessive quantifier syntax is an extension to the Perl syntax. It
958 originates in Sun's Java package.
960 When a pattern contains an unlimited repeat inside a subpattern that
961 can itself be repeated an unlimited number of times, the use of an
962 atomic group is the only way to avoid some failing matches taking a
963 very long time indeed. The pattern
965 (\D+|<\d+>)*[!?]
967 matches an unlimited number of substrings that either consist of non-
968 digits, or digits enclosed in <>, followed by either ! or ?. When it
969 matches, it runs quickly. However, if it is applied to
971 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
973 it takes a long time before reporting failure. This is because the
974 string can be divided between the internal \D+ repeat and the external
975 * repeat in a large number of ways, and all have to be tried. (The
976 example uses [!?] rather than a single character at the end, because
977 both PCRE and Perl have an optimization that allows for fast failure
978 when a single character is used. They remember the last single charac-
979 ter that is required for a match, and fail early if it is not present
980 in the string.) If the pattern is changed so that it uses an atomic
981 group, like this:
983 ((?>\D+)|<\d+>)*[!?]
985 sequences of non-digits cannot be broken, and failure happens quickly.
988 BACK REFERENCES
990 Outside a character class, a backslash followed by a digit greater than
991 0 (and possibly further digits) is a back reference to a capturing sub-
992 pattern earlier (that is, to its left) in the pattern, provided there
993 have been that many previous capturing left parentheses.
995 However, if the decimal number following the backslash is less than 10,
996 it is always taken as a back reference, and causes an error only if
997 there are not that many capturing left parentheses in the entire pat-
998 tern. In other words, the parentheses that are referenced need not be
999 to the left of the reference for numbers less than 10. See the subsec-
1000 tion entitled "Non-printing characters" above for further details of
1001 the handling of digits following a backslash.
1003 A back reference matches whatever actually matched the capturing sub-
1004 pattern in the current subject string, rather than anything matching
1005 the subpattern itself (see "Subpatterns as subroutines" below for a way
1006 of doing that). So the pattern
1008 (sens|respons)e and \1ibility
1010 matches "sense and sensibility" and "response and responsibility", but
1011 not "sense and responsibility". If caseful matching is in force at the
1012 time of the back reference, the case of letters is relevant. For
1013 example,
1015 ((?i)rah)\s+\1
1017 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1018 original capturing subpattern is matched caselessly.
1020 Back references to named subpatterns use the Python syntax (?P=name).
1021 We could rewrite the above example as follows:
1023 (?P<p1>(?i)rah)\s+(?P=p1)
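These back reference rules can be exercised with Python's re module, used here as a stand-in for PCRE (an assumption). Because Python treats a bare (?i) as a whole-pattern flag, the sketch writes the scoped form (?i:rah), available from Python 3.6, to keep the caseless setting inside the group.

```python
import re

# \1 must repeat the text that group 1 actually matched.
same_stem = re.compile(r"(sens|respons)e and \1ibility")
titles = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
matched = [t for t in titles if same_stem.fullmatch(t)]

# The group matches caselessly, but the named back reference is caseful.
rah = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
rah_ok = [s for s in ("rah rah", "RAH RAH", "RAH rah") if rah.fullmatch(s)]
```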
1025 There may be more than one back reference to the same subpattern. If a
1026 subpattern has not actually been used in a particular match, any back
1027 references to it always fail. For example, the pattern
1029 (a|(bc))\2
1031 always fails if it starts to match "a" rather than "bc". Because there
1032 may be many capturing parentheses in a pattern, all digits following
1033 the backslash are taken as part of a potential back reference number.
1034 If the pattern continues with a digit character, some delimiter must be
1035 used to terminate the back reference. If the PCRE_EXTENDED option is
1036 set, this can be whitespace. Otherwise an empty comment (see "Com-
1037 ments" below) can be used.
1039 A back reference that occurs inside the parentheses to which it refers
1040 fails when the subpattern is first used, so, for example, (a\1) never
1041 matches. However, such references can be useful inside repeated
1042 subpatterns. For example, the pattern
1044 (a|b\1)+
1046 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1047 ation of the subpattern, the back reference matches the character
1048 string corresponding to the previous iteration. In order for this to
1049 work, the pattern must be such that the first iteration does not need
1050 to match the back reference. This can be done using alternation, as in
1051 the example above, or by a quantifier with a minimum of zero.
1054 ASSERTIONS
1056 An assertion is a test on the characters following or preceding the
1057 current matching point that does not actually consume any characters.
1058 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
1059 described above.
1061 More complicated assertions are coded as subpatterns. There are two
1062 kinds: those that look ahead of the current position in the subject
1063 string, and those that look behind it. An assertion subpattern is
1064 matched in the normal way, except that it does not cause the current
1065 matching position to be changed.
1067 Assertion subpatterns are not capturing subpatterns, and may not be
1068 repeated, because it makes no sense to assert the same thing several
1069 times. If any kind of assertion contains capturing subpatterns within
1070 it, these are counted for the purposes of numbering the capturing sub-
1071 patterns in the whole pattern. However, substring capturing is carried
1072 out only for positive assertions, because it does not make sense for
1073 negative assertions.
1075 Lookahead assertions
1077 Lookahead assertions start with (?= for positive assertions and (?! for
1078 negative assertions. For example,
1080 \w+(?=;)
1082 matches a word followed by a semicolon, but does not include the
1083 semicolon in the match, and
1085 foo(?!bar)
1087 matches any occurrence of "foo" that is not followed by "bar". Note
1088 that the apparently similar pattern
1090 (?!foo)bar
1092 does not find an occurrence of "bar" that is preceded by something
1093 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1094 the assertion (?!foo) is always true when the next three characters are
1095 "bar". A lookbehind assertion is needed to achieve the other effect.
1097 If you want to force a matching failure at some point in a pattern, the
1098 most convenient way to do it is with (?!) because an empty string
1099 always matches, so an assertion that requires there not to be an empty
1100 string must always fail.
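The lookahead cases above can be checked as follows. Python's re module is assumed as a stand-in for PCRE, and the subject strings are made up for the example.

```python
import re

# A word followed by a semicolon; the semicolon is not part of the match.
word = re.search(r"\w+(?=;)", "alpha beta; gamma").group()

# "foo" not followed by "bar": only the second "foo" qualifies.
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds every "bar", including the one preceded by "foo".
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar bar")]

# (?!) can never succeed, so the search fails at every position.
forced_failure = re.search(r"(?!)", "anything")
```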
1102 Lookbehind assertions
1104 Lookbehind assertions start with (?<= for positive assertions and (?<!
1105 for negative assertions. For example,
1107 (?<!foo)bar
1109 does find an occurrence of "bar" that is not preceded by "foo". The
1110 contents of a lookbehind assertion are restricted such that all the
1111 strings it matches must have a fixed length. However, if there are
1112 several alternatives, they do not all have to have the same fixed length.
1113 Thus
1115 (?<=bullock|donkey)
1117 is permitted, but
1119 (?<!dogs?|cats?)
1121 causes an error at compile time. Branches that match different length
1122 strings are permitted only at the top level of a lookbehind assertion.
1123 This is an extension compared with Perl (at least for 5.8), which
1124 requires all branches to match the same length of string. An assertion
1125 such as
1127 (?<=ab(c|de))
1129 is not permitted, because its single top-level branch can match two
1130 different lengths, but it is acceptable if rewritten to use two
1131 top-level branches:
1133 (?<=abc|abde)
1135 The implementation of lookbehind assertions is, for each alternative,
1136 to temporarily move the current position back by the fixed width and
1137 then try to match. If there are insufficient characters before the cur-
1138 rent position, the match is deemed to fail.
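A short sketch of lookbehind in use follows, with Python's re module assumed as a stand-in for PCRE. Python is stricter than PCRE here: the whole lookbehind must have a single fixed width, so top-level alternatives of different lengths are written as two separate lookbehinds. The " cart" suffix and the subject strings are made up for the example.

```python
import re

# "bar" not preceded by "foo": only the second "bar" qualifies.
not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bar")]

# Python rejects (?<=bullock|donkey) because the branches differ in length,
# so each branch gets its own lookbehind.
after_animal = re.compile(r"(?:(?<=bullock)|(?<=donkey)) cart")
hits = [s for s in ("bullock cart", "donkey cart", "horse cart")
        if after_animal.search(s)]

# Too few characters before the current position: the lookbehind fails.
too_short = re.search(r"(?<=abc)d", "d")
```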
1140 PCRE does not allow the \C escape (which matches a single byte in UTF-8
1141 mode) to appear in lookbehind assertions, because it makes it impossi-
1142 ble to calculate the length of the lookbehind. The \X escape, which can
1143 match different numbers of bytes, is also not permitted.
1145 Atomic groups can be used in conjunction with lookbehind assertions to
1146 specify efficient matching at the end of the subject string. Consider a
1147 simple pattern such as
1149 abcd$
1151 when applied to a long string that does not match. Because matching
1152 proceeds from left to right, PCRE will look for each "a" in the subject
1153 and then see if what follows matches the rest of the pattern. If the
1154 pattern is specified as
1156 ^.*abcd$
1158 the initial .* matches the entire string at first, but when this fails
1159 (because there is no following "a"), it backtracks to match all but the
1160 last character, then all but the last two characters, and so on. Once
1161 again the search for "a" covers the entire string, from right to left,
1162 so we are no better off. However, if the pattern is written as
1164 ^(?>.*)(?<=abcd)
1166 or, equivalently, using the possessive quantifier syntax,
1168 ^.*+(?<=abcd)
1170 there can be no backtracking for the .* item; it can match only the
1171 entire string. The subsequent lookbehind assertion does a single test
1172 on the last four characters. If it fails, the match fails immediately.
1173 For long strings, this approach makes a significant difference to the
1174 processing time.
1176 Using multiple assertions
1178 Several assertions (of any sort) may occur in succession. For example,
1180 (?<=\d{3})(?<!999)foo
1182 matches "foo" preceded by three digits that are not "999". Notice that
1183 each of the assertions is applied independently at the same point in
1184 the subject string. First there is a check that the previous three
1185 characters are all digits, and then there is a check that the same
1186 three characters are not "999". This pattern does not match "foo"
1187 preceded by six characters, the first three of which are digits and
1188 the last three of which are not "999". For example, it doesn't match
1189 "123abcfoo". A pattern to do that is
1191 (?<=\d{3}...)(?<!999)foo
1193 This time the first assertion looks at the preceding six characters,
1194 checking that the first three are digits, and then the second assertion
1195 checks that the preceding three characters are not "999".
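The two patterns can be compared side by side. Python's re module is assumed as a stand-in for PCRE; both lookbehinds here are fixed-width, so they are accepted unchanged. The test subjects are made up for the example.

```python
import re

# Both assertions inspect the same three characters before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks six characters back, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
three_back_hits = [s for s in subjects if three_back.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]
```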
1197 Assertions can be nested in any combination. For example,
1199 (?<=(?<!foo)bar)baz
1201 matches an occurrence of "baz" that is preceded by "bar" which in turn
1202 is not preceded by "foo", while
1204 (?<=\d{3}(?!999)...)foo
1206 is another pattern that matches "foo" preceded by three digits and any
1207 three characters that are not "999".
1210 CONDITIONAL SUBPATTERNS
1212 It is possible to cause the matching process to obey a subpattern con-
1213 ditionally or to choose between two alternative subpatterns, depending
1214 on the result of an assertion, or whether a previous capturing
1215 subpattern matched or not. The two possible forms of conditional subpattern
1216 are
1218 (?(condition)yes-pattern)
1219 (?(condition)yes-pattern|no-pattern)
1221 If the condition is satisfied, the yes-pattern is used; otherwise the
1222 no-pattern (if present) is used. If there are more than two alterna-
1223 tives in the subpattern, a compile-time error occurs.
1225 There are three kinds of condition. If the text between the parentheses
1226 consists of a sequence of digits, the condition is satisfied if the
1227 capturing subpattern of that number has previously matched. The number
1228 must be greater than zero. Consider the following pattern, which con-
1229 tains non-significant white space to make it more readable (assume the
1230 PCRE_EXTENDED option) and to divide it into three parts for ease of
1231 discussion:
1233 ( \( )? [^()]+ (?(1) \) )
1235 The first part matches an optional opening parenthesis, and if that
1236 character is present, sets it as the first captured substring. The sec-
1237 ond part matches one or more characters that are not parentheses. The
1238 third part is a conditional subpattern that tests whether the first set
1239 of parentheses matched or not. If they did, that is, if the subject started
1240 with an opening parenthesis, the condition is true, and so the yes-pat-
1241 tern is executed and a closing parenthesis is required. Otherwise,
1242 since no-pattern is not present, the subpattern matches nothing. In
1243 other words, this pattern matches a sequence of non-parentheses,
1244 optionally enclosed in parentheses.
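The pattern above runs unchanged under Python's re module, which is assumed here as a stand-in for PCRE; re.VERBOSE plays the part of PCRE_EXTENDED. The test subjects are made up for the example.

```python
import re

# ( \( )?     optional opening parenthesis, captured as group 1
# [^()]+      one or more non-parenthesis characters
# (?(1) \) )  a closing parenthesis is required only if group 1 matched
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ["(abc)", "abc", "(abc", "abc)"]
accepted = [s for s in subjects if optional_parens.fullmatch(s)]
```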
1246 If the condition is the string (R), it is satisfied if a recursive call
1247 to the pattern or subpattern has been made. At "top level", the condi-
1248 tion is false. This is a PCRE extension. Recursive patterns are
1249 described in the next section.
1251 If the condition is not a sequence of digits or (R), it must be an
1252 assertion. This may be a positive or negative lookahead or lookbehind
1253 assertion. Consider this pattern, again containing non-significant
1254 white space, and with the two alternatives on the second line:
1256 (?(?=[^a-z]*[a-z])
1257 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1259 The condition is a positive lookahead assertion that matches an
1260 optional sequence of non-letters followed by a letter. In other words,
1261 it tests for the presence of at least one letter in the subject. If a
1262 letter is found, the subject is matched against the first alternative;
1263 otherwise it is matched against the second. This pattern matches
1264 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1265 letters and dd are digits.
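Python's re module, used here as a stand-in, supports only group-number and group-name conditions, not assertion conditions. The sketch below is therefore a rewrite rather than the PCRE syntax: each alternative is guarded by the assertion or its negation, which selects the same branch as the conditional would. The test subjects are made up for the example.

```python
import re

date_like = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter lies ahead
    )
    """,
    re.VERBOSE,
)

subjects = ["12-abc-34", "12-34-56", "12-ab-34"]
accepted = [s for s in subjects if date_like.fullmatch(s)]
```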
1268 COMMENTS
1270 The sequence (?# marks the start of a comment that continues up to the
1271 next closing parenthesis. Nested parentheses are not permitted. The
1272 characters that make up a comment play no part in the pattern matching
1273 at all.
1275 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1276 character class introduces a comment that continues up to the next new-
1277 line character in the pattern.
1280 RECURSIVE PATTERNS
1282 Consider the problem of matching a string in parentheses, allowing for
1283 unlimited nested parentheses. Without the use of recursion, the best
1284 that can be done is to use a pattern that matches up to some fixed
1285 depth of nesting. It is not possible to handle an arbitrary nesting
1286 depth. Perl provides a facility that allows regular expressions to
1287 recurse (amongst other things). It does this by interpolating Perl code
1288 in the expression at run time, and the code can refer to the expression
1289 itself. A Perl pattern to solve the parentheses problem can be created
1290 like this:
1292 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1294 The (?p{...}) item interpolates Perl code at run time, and in this case
1295 refers recursively to the pattern in which it appears. Obviously, PCRE
1296 cannot support the interpolation of Perl code. Instead, it supports
1297 some special syntax for recursion of the entire pattern, and also for
1298 individual subpattern recursion.
1300 The special item that consists of (? followed by a number greater than
1301 zero and a closing parenthesis is a recursive call of the subpattern of
1302 the given number, provided that it occurs inside that subpattern. (If
1303 not, it is a "subroutine" call, which is described in the next
1304 section.) The special item (?R) is a recursive call of the entire regular
1305 expression.
1307 For example, this PCRE pattern solves the nested parentheses problem
1308 (assume the PCRE_EXTENDED option is set so that white space is
1309 ignored):
1311 \( ( (?>[^()]+) | (?R) )* \)
1313 First it matches an opening parenthesis. Then it matches any number of
1314 substrings which can either be a sequence of non-parentheses, or a
1315 recursive match of the pattern itself (that is, a correctly
1316 parenthesized substring). Finally there is a closing parenthesis.
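Python's re module has no (?R), so the structure of the recursive pattern is sketched here as a hand-written function instead. This is an analogue for illustration, not PCRE's implementation; the name match_balanced is hypothetical.

```python
def match_balanced(subject, pos=0):
    """Return the index just past a balanced "(...)" starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                    # the opening \( is required
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1             # the closing \)
        if ch == "(":
            pos = match_balanced(subject, pos)   # plays the part of (?R)
            if pos is None:
                return None
        else:
            pos += 1                   # one character of [^()]+
    return None                        # input ended before the closing ")"
```

A call such as match_balanced("(ab(cd)ef)") returns the offset after the final parenthesis, while an unbalanced subject yields None.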
1318 If this were part of a larger pattern, you would not want to recurse
1319 the entire pattern, so instead you could use this:
1321 ( \( ( (?>[^()]+) | (?1) )* \) )
1323 We have put the pattern into parentheses, and caused the recursion to
1324 refer to them instead of the whole pattern. In a larger pattern, keep-
1325 ing track of parenthesis numbers can be tricky. It may be more conve-
1326 nient to use named parentheses instead. For this, PCRE uses (?P>name),
1327 which is an extension to the Python syntax that PCRE uses for named
1328 parentheses (Perl does not provide named parentheses). We could rewrite
1329 the above example as follows:
1331 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
1333 This particular example pattern contains nested unlimited repeats, and
1334 so the use of atomic grouping for matching strings of non-parentheses
1335 is important when applying the pattern to strings that do not match.
1336 For example, when this pattern is applied to
1338 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1340 it yields "no match" quickly. However, if atomic grouping is not used,
1341 the match runs for a very long time indeed because there are so many
1342 different ways the + and * repeats can carve up the subject, and all
1343 have to be tested before failure can be reported.
1345 At the end of a match, the values set for any capturing subpatterns are
1346 those from the outermost level of the recursion at which the subpattern
1347 value is set. If you want to obtain intermediate values, a callout
1348 function can be used (see the next section and the pcrecallout
1349 documentation). If the pattern above is matched against
1351 (ab(cd)ef)
1353 the value for the capturing parentheses is "ef", which is the last
1354 value taken on at the top level. If additional parentheses are added,
1355 giving
1357 \( ( ( (?>[^()]+) | (?R) )* ) \)
1361 the string they capture is "ab(cd)ef", the contents of the top level
1362 parentheses. If there are more than 15 capturing parentheses in a pat-
1363 tern, PCRE has to obtain extra memory to store data during a recursion,
1364 which it does by using pcre_malloc, freeing it via pcre_free after-
1365 wards. If no memory can be obtained, the match fails with the
1366 PCRE_ERROR_NOMEMORY error.
1368 Do not confuse the (?R) item with the condition (R), which tests for
1369 recursion. Consider this pattern, which matches text in angle brack-
1370 ets, allowing for arbitrary nesting. Only digits are allowed in nested
1371 brackets (that is, when recursing), whereas any characters are permit-
1372 ted at the outer level.
1374 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
1376 In this pattern, (?(R) is the start of a conditional subpattern, with
1377 two different alternatives for the recursive and non-recursive cases.
1378 The (?R) item is the actual recursive call.
1381 SUBPATTERNS AS SUBROUTINES
1383 If the syntax for a recursive subpattern reference (either by number or
1384 by name) is used outside the parentheses to which it refers, it oper-
1385 ates like a subroutine in a programming language. An earlier example
1386 pointed out that the pattern
1388 (sens|respons)e and \1ibility
1390 matches "sense and sensibility" and "response and responsibility", but
1391 not "sense and responsibility". If instead the pattern
1393 (sens|respons)e and (?1)ibility
1395 is used, it does match "sense and responsibility" as well as the other
1396 two strings. Such references must, however, follow the subpattern to
1397 which they refer.
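The contrast between a back reference and a subroutine call can be imitated in Python, whose re module has no (?1). In this sketch the subpattern text is simply spliced in a second time, which gives the same effect for this example. The variable stem is a hypothetical helper, and the approach is an approximation, not the PCRE mechanism.

```python
import re

stem = r"sens|respons"   # hypothetical helper holding the subpattern text

# \1 repeats the text that was matched; the spliced copy re-applies the pattern.
back_reference = re.compile(rf"({stem})e and \1ibility")
subroutine_like = re.compile(rf"({stem})e and (?:{stem})ibility")

mixed = "sense and responsibility"
with_back_reference = back_reference.fullmatch(mixed) is not None
with_subroutine = subroutine_like.fullmatch(mixed) is not None
```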
1400 CALLOUTS
1402 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1403 Perl code to be obeyed in the middle of matching a regular expression.
1404 This makes it possible, amongst other things, to extract different
1405 substrings that match the same pair of parentheses when there is a
1406 repetition.
1408 PCRE provides a similar feature, but of course it cannot obey arbitrary
1409 Perl code. The feature is called "callout". The caller of PCRE provides
1410 an external function by putting its entry point in the global variable
1411 pcre_callout. By default, this variable contains NULL, which disables
1412 all calling out.
1414 Within a regular expression, (?C) indicates the points at which the
1415 external function is to be called. If you want to identify different
1416 callout points, you can put a number less than 256 after the letter C.
1417 The default value is zero. For example, this pattern has two callout
1418 points:
1420 (?C1)abc(?C2)def
1422 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1423 automatically installed before each item in the pattern. They are all
1424 numbered 255.
1426 During matching, when PCRE reaches a callout point (and pcre_callout is
1427 set), the external function is called. It is provided with the number
1428 of the callout, the position in the pattern, and, optionally, one item
1429 of data originally supplied by the caller of pcre_exec(). The callout
1430 function may cause matching to proceed, to backtrack, or to fail alto-
1431 gether. A complete description of the interface to the callout function
1432 is given in the pcrecallout documentation.
1434 Last updated: 28 February 2005
1435 Copyright (c) 1997-2005 University of Cambridge.