This file contains the PCRE man page that describes the regular expressions
supported by PCRE version 6.0. Note that not all of the features are relevant
in the context of Exim. In particular, the version of PCRE that is compiled
with Exim does not include UTF-8 support, there is no mechanism for changing
the options with which the PCRE functions are called, and features such as
callout are not accessible.
-----------------------------------------------------------------------------

PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions supported by PCRE
are described below. Regular expressions are also described in the Perl
documentation and in a number of books, some of which have copious
examples. Jeffrey Friedl's "Mastering Regular Expressions", published
by O'Reilly, covers regular expressions in great detail. This
description of PCRE's regular expressions is intended as reference
material.

The original operation of PCRE was on strings of one-byte characters.
However, there is now also support for UTF-8 character strings. To use
this, you must build PCRE to include UTF-8 support, and then call
pcre_compile() with the PCRE_UTF8 option. How this affects pattern
matching is mentioned in several places below. There is also a summary
of UTF-8 features in the section on UTF-8 support in the main pcre
page.

The remainder of this document discusses the patterns that are
supported by PCRE when its main matching function, pcre_exec(), is used.
From release 6.0, PCRE offers a second matching function,
pcre_dfa_exec(), which matches using a different algorithm that is not
Perl-compatible. The advantages and disadvantages of the alternative
function, and how it differs from the normal function, are discussed in
the pcrematching page.

A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE_CASELESS option), letters are
matched independently of case. In UTF-8 mode, PCRE always understands
the concept of case for characters whose values are less than 128, so
caseless matching is always possible. For characters with higher
values, the concept of case is supported if PCRE is compiled with
Unicode property support, but not otherwise. If you want to use
caseless matching for characters 128 and above, you must ensure that
PCRE is compiled with Unicode property support as well as with UTF-8
support.

The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.

There are two different sets of metacharacters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized in square brackets. Outside square brackets,
the metacharacters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class

The following sections describe the use of each of the metacharacters.

BACKSLASH

The backslash character has several uses. Firstly, if it is followed by
a non-alphanumeric character, it takes away any special meaning that
character may have. This use of backslash as an escape character
applies both inside and outside character classes.

For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a
backslash, you write \\.

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
the pattern (other than in a character class) and characters between a
# outside a character class and the next newline character are ignored.
An escaping backslash can be used to include a whitespace or #
character as part of the pattern.

If you want to remove the special meaning from a sequence of
characters, you can do so by putting them between \Q and \E. This is
different from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE, whereas in Perl, $ and @ cause variable
interpolation. Note the following examples:

  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.

Non-printing characters

A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on
the appearance of non-printing characters, apart from the binary zero
that terminates a pattern, but when a pattern is being prepared by text
editing, it is usually easier to use one of the following escape
sequences than the binary character it represents:

  \a        alarm, that is, the BEL character (hex 07)
  \cx       "control-x", where x is any character
  \e        escape (hex 1B)
  \f        formfeed (hex 0C)
  \n        newline (hex 0A)
  \r        carriage return (hex 0D)
  \t        tab (hex 09)
  \ddd      character with octal code ddd, or backreference
  \xhh      character with hex code hh
  \x{hhh..} character with hex code hhh... (UTF-8 mode only)

The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
becomes hex 7B.

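The \cx conversion can be sketched in a few lines. This is an illustration in Python, not PCRE's own code, and the function name control_escape is hypothetical:

```python
def control_escape(x: str) -> int:
    """Return the character code produced by the escape backslash-c-x."""
    code = ord(x)
    # A lower case ASCII letter is first converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Then bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


for ch in "z{;":
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

The loop prints the three cases named in the text: hex 1A, hex 3B, and hex 7B.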
After \x, from zero to two hexadecimal digits are read (letters can be
in upper or lower case). In UTF-8 mode, any number of hexadecimal
digits may appear between \x{ and }, but the value of the character
code must be less than 2**31 (that is, the maximum hexadecimal value is
7FFFFFFF). If characters other than hexadecimal digits appear between
\x{ and }, or if there is no terminating }, this form of escape is not
recognized. Instead, the initial \x will be interpreted as a basic
hexadecimal escape, with no following digits, giving a character whose
value is zero.

Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
in the way they are handled. For example, \xdc is exactly the same as
\x{dc}.

After \0 up to two further octal digits are read. In both cases, if
there are fewer than two digits, just those that are present are used.
Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
character (code value 7). Make sure you supply two digits after the
initial zero if the pattern character that follows is itself an octal
digit.

The handling of a backslash followed by a digit other than 0 is
complicated. Outside a character class, PCRE reads it and any following
digits as a decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference. A
description of how this works is given later, following the discussion
of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, PCRE re-reads
up to three octal digits following the backslash, and generates a
single byte from the least significant 8 bits of the value. Any
subsequent digits stand for themselves. For example:

  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
            previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
            writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
            character with octal code 113
  \377   might be a back reference, otherwise
            the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
            followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.

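The decision between a back reference and an octal byte can be sketched as a small function. This is a Python illustration of the rules as stated above, not PCRE's implementation. The name interpret_backslash_digits and its return format are hypothetical:

```python
def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    """Classify the run of digits that follows a backslash.

    digits        -- the decimal digits after the backslash, e.g. "113"
    groups_so_far -- capturing left parentheses seen earlier in the pattern
    Returns ("backref", n) or ("byte", value, literal_tail).
    """
    # Outside a class, a digit run not starting with 0 is read as decimal.
    if not in_class and digits[0] != "0":
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number)
    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    value, used = 0, 0
    while used < 3 and used < len(digits) and digits[used] in "01234567":
        value = value * 8 + int(digits[used])
        used += 1
    # Any digits left over stand for themselves.
    return ("byte", value & 0xFF, digits[used:])


print(interpret_backslash_digits("40", groups_so_far=0))
print(interpret_backslash_digits("40", groups_so_far=40))
print(interpret_backslash_digits("81", groups_so_far=0))
```

With no previous capturing subpatterns, "40" yields the byte 32 (a space) and "81" yields a binary zero with the literal tail "81". With 40 previous capturing subpatterns, "40" is a back reference.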
All the sequences that define a single byte value or a single UTF-8
character (in UTF-8 mode) can be used both inside and outside character
classes. In addition, inside a character class, the sequence \b is
interpreted as the backspace character (hex 08), and the sequence \X is
interpreted as the character "X". Outside a character class, these
sequences have different meanings (see below).

Generic character types

The third use of backslash is for specifying generic character types.
The following are always recognized:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.

These character type sequences can appear both inside and outside
character classes. They each match one character of the appropriate
type. If the current matching point is at the end of the subject
string, all of them fail, since there is no character to match.

For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32).

A "word" character is an underscore or any character less than 256 that
is a letter or digit. The definition of letters and digits is
controlled by PCRE's low-valued character tables, and may vary if
locale-specific matching is taking place (see "Locale support" in the
pcreapi page). For example, in the "fr_FR" (French) locale, some
character codes greater than 128 are used for accented letters, and
these are matched by \w.

In UTF-8 mode, characters with values greater than 128 never match \d,
\s, or \w, and always match \D, \S, and \W. This is true even when
Unicode character property support is available.

Unicode character properties

When PCRE is built with Unicode character property support, three
additional escape sequences to match generic character types are
available when UTF-8 mode is selected. They are:

  \p{xx}   a character with the xx property
  \P{xx}   a character without the xx property
  \X       an extended Unicode sequence

The property names represented by xx above are limited to the Unicode
general category properties. Each character has exactly one such
property, specified by a two-letter abbreviation. For compatibility
with Perl, negation can be specified by including a circumflex between
the opening brace and the property name. For example, \p{^Lu} is the
same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the
properties that start with that letter. In this case, in the absence of
negation, the curly brackets in the escape sequence are optional; these
two examples have the same effect:

  \p{L}
  \pL

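The following property codes are supported:

  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate

  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter

  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark

  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number

  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation

  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol

  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator
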
Extended properties such as "Greek" or "InMusicalSymbols" are not
supported by PCRE.

Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to

  (?>\PM\pM*)

That is, it matches a character without the "mark" property, followed
by zero or more characters with the "mark" property, and treats the
sequence as an atomic group (see below). Characters with the "mark"
property are typically accents that affect the preceding character.

Matching characters by Unicode property is not fast, because PCRE has
to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE.

Simple assertions

The fourth use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are:

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at start of subject
  \Z     matches at end of subject or before newline at end
  \z     matches at end of subject
  \G     matches at first matching position in subject

These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a
character class).

A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively.

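The word boundary definition can be written out directly. The sketch below is a Python illustration that assumes the default tables, where a "word" character is an ASCII letter, digit, or underscore. The names is_word_char and at_word_boundary are hypothetical:

```python
import string

# Assumption: default (non-locale) tables, so \w is ASCII-only.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_word_char(ch):
    return ch in WORD_CHARS


def at_word_boundary(subject, pos):
    """True if position pos (0..len(subject)) is a word boundary."""
    # Outside the string there is no character, which counts as non-word.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    # A boundary exists when exactly one side is a word character.
    return before != after


subject = "red king!"
print([p for p in range(len(subject) + 1) if at_word_boundary(subject, p)])
```

For the subject "red king!" the boundaries fall at offsets 0, 3, 4, and 8. The end of the string is not a boundary, because the last character "!" does not match \w.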
The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
at the very start and end of the subject string, whatever options are
set. Thus, they are independent of multiline mode. These three
assertions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options,
which affect only the behaviour of the circumflex and dollar
metacharacters. However, if the startoffset argument of pcre_exec() is
non-zero, indicating that matching is to start at a point other than
the beginning of the subject, \A can never match. The difference
between \Z and \z is that \Z matches before a newline that is the last
character of the string as well as at the end of the string, whereas \z
matches only at the end.

The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate
arguments, you can mimic Perl's /g option, and it is in this kind of
implementation where \G can be useful.

Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
end of the previous match. In Perl, these can be different when the
previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.

If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.

CIRCUMFLEX AND DOLLAR

Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching
point is at the start of the subject string. If the startoffset
argument of pcre_exec() is non-zero, circumflex can never match if the
PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the
subject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)

A dollar character is an assertion that is true only if the current
matching point is at the end of the subject string, or immediately
before a newline character that is the last character in the string (by
default). Dollar need not be the last character of the pattern if a
number of alternatives are involved, but it should be the last item in
any branch in which it appears. Dollar has no special meaning in a
character class.

The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, they match
immediately after and immediately before an internal newline character,
respectively, in addition to matching at the start and end of the
subject string. For example, the pattern /^abc$/ matches the subject
string "def\nabc" (where \n represents a newline character) in
multiline mode, but not otherwise. Consequently, patterns that are
anchored in single line mode because all branches start with ^ are not
anchored in multiline mode, and a match for circumflex is possible when
the startoffset argument of pcre_exec() is non-zero. The
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.

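The effect of multiline mode on circumflex and dollar can be tried out with any Perl-style engine. The sketch below uses Python's re module as a stand-in, which is an assumption: it is not PCRE, but its ^, $, and MULTILINE flag behave the same way for these particular cases.

```python
import re

subject = "def\nabc"

# Default mode: ^ and $ refer to the whole subject, so there is no match.
default_match = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multiline_match = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

print(default_match)                # None
print(multiline_match.span())       # (4, 7)
print(before_final_newline.span())  # (0, 3)
```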
Note that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether PCRE_MULTILINE is set or
not.

FULL STOP (PERIOD, DOT)

Outside a character class, a dot in the pattern matches any one
character in the subject, including a non-printing character, but not
(by default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
which might be more than one byte long, except (by default) newline. If
the PCRE_DOTALL option is set, dots match newlines as well. The
handling of dot is entirely independent of the handling of circumflex
and dollar, the only relationship being that they both involve newline
characters. Dot has no special meaning in a character class.

MATCHING A SINGLE BYTE

Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it can match a newline.
The feature is provided in Perl in order to match individual bytes in
UTF-8 mode. Because it breaks up UTF-8 characters into individual
bytes, what remains in the string may be a malformed UTF-8 string. For
this reason, the \C escape sequence is best avoided.

PCRE does not allow \C to appear in lookbehind assertions (described
below), because in UTF-8 mode this would make it impossible to
calculate the length of the lookbehind.

SQUARE BRACKETS AND CHARACTER CLASSES

An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not
special. If a closing square bracket is required as a member of the
class, it should be the first data character in the class (after an
initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject. In UTF-8
mode, the character may occupy more than one byte. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion: it still
consumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.

In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use
caseless matching for characters 128 and above, you must ensure that
PCRE is compiled with Unicode property support as well as with UTF-8
support.

The newline character is never treated in any special way in character
classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
options is. A class such as [^a] will always match a newline.

The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as a
class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a class containing a range followed by two
other characters. The octal or hexadecimal representation of "]" can
also be used to end a range.

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].

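The treatment of "-" and "]" inside a class is easy to get wrong, so a short experiment may help. The sketch below uses Python's re module as a stand-in engine, which is an assumption: it is not PCRE, but it parses these three classes the same way.

```python
import re

# A class of "W" and "-", followed by the literal string "46]".
two_char_class = re.compile(r"[W-]46]")

# Escaping "]" makes it the end of a range: W to ], plus "4" and "6".
range_class = re.compile(r"[W-\]46]")

# Ranges can also be given numerically (octal codes 000 to 037).
control_chars = re.compile(r"[\000-\037]")

print(bool(two_char_class.fullmatch("W46]")))  # True
print(bool(two_char_class.fullmatch("-46]")))  # True
print(bool(range_class.fullmatch("Z")))        # True
print(bool(range_class.fullmatch("-")))        # False
print(bool(control_chars.fullmatch("\t")))     # True
```

"Z" matches the second class because its character value lies between those of "W" and "]", while "-" does not, since there it only marks the range.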
If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF-8 mode, PCRE supports the
concept of case for characters with values greater than 128 only when
it is compiled with Unicode property support.

The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
in a character class, and add the characters that they match to the
class. For example, [\dABCDEF] matches any hexadecimal digit. A
circumflex can conveniently be used with the upper case character types
to specify a more restricted set of characters than the matching lower
case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.

The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.

POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,

  [01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class
names are

  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (not quite the same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

  [12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.

In UTF-8 mode, characters with values greater than 128 do not match any
of the POSIX character classes.

VERTICAL BAR

Vertical bar characters are used to separate alternative patterns. For
example, the pattern

  gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from
left to right, and the first one that succeeds is used. If the
alternatives are within a subpattern (defined below), "succeeds" means
matching the rest of the main pattern as well as the alternative in the
subpattern.

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options can be changed from within the pattern by a
sequence of Perl option letters enclosed between "(?" and ")". The
option letters are

  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.

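How the letters on either side of the hyphen combine can be sketched as follows. This is a Python illustration of the rule just described, not PCRE code. The function name parse_option_letters is hypothetical, and it takes the letters found between "(?" and ")":

```python
OPTION_NAMES = {
    "i": "PCRE_CASELESS",
    "m": "PCRE_MULTILINE",
    "s": "PCRE_DOTALL",
    "x": "PCRE_EXTENDED",
}


def parse_option_letters(letters):
    """Split option letters such as "im-sx" into (set, unset) option names."""
    set_part, _, unset_part = letters.partition("-")
    unset = {OPTION_NAMES[c] for c in unset_part}
    # A letter on both sides of the hyphen ends up unset.
    enabled = {OPTION_NAMES[c] for c in set_part} - unset
    return enabled, unset


enabled, unset = parse_option_letters("im-sx")
print(sorted(enabled), sorted(unset))
```

For "im-sx" this reports PCRE_CASELESS and PCRE_MULTILINE as set, and PCRE_DOTALL and PCRE_EXTENDED as unset. For "i-i" nothing is set and PCRE_CASELESS is unset.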
When an option change occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the
pattern that follows. If the change is placed right at the start of a
pattern, PCRE extracts it into the global options (and it will
therefore show up in data extracted by the pcre_fullinfo() function).

An option change within a subpattern affects only that part of the
current pattern that follows it, so

  (a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

  (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
in the same way as the Perl-compatible options by using the characters
U and X respectively. The (?X) flag setting is special in that it must
always occur earlier in the pattern than any of the additional features
it turns on, even when it is at top level. It is best to put it at the
start.

SUBPATTERNS

Subpatterns are delimited by parentheses (round brackets), which can be
nested. Turning part of a pattern into a subpattern does two things:

1. It localizes a set of alternatives. For example, the pattern

  cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without
the parentheses, it would match "cataract", "erpillar" or the empty
string.

2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain numbers for the capturing
subpatterns.

For example, if the string "the red king" is matched against the
pattern

  the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are
numbered 1, 2, and 3, respectively.

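The numbering by opening parenthesis can be checked with a short experiment. The sketch below uses Python's re module as a stand-in, which is an assumption: it is not PCRE and has no ovector, but it numbers capturing parentheses from left to right in the same way.

```python
import re

pattern = re.compile(r"the ((red|white) (king|queen))")
match = pattern.search("the red king")

# groups() lists the captured substrings in order of opening parenthesis.
for number, text in enumerate(match.groups(), start=1):
    print(number, text)
```

The outer parentheses open first, so group 1 is "red king", followed by "red" as group 2 and "king" as group 3.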
The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any
capturing, and is not counted when computing the number of any
subsequent capturing subpatterns. For example, if the string "the white
queen" is matched against the pattern

  the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535, and the
maximum depth of nesting of all subpatterns, both capturing and
non-capturing, is 200.

As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".

735 Identifying capturing parentheses by number is simple, but it can be
736 very hard to keep track of the numbers in complicated regular expres-
737 sions. Furthermore, if an expression is modified, the numbers may
738 change. To help with this difficulty, PCRE supports the naming of sub-
739 patterns, something that Perl does not provide. The Python syntax
740 (?P<name>...) is used. Names consist of alphanumeric characters and
741 underscores, and must be unique within a pattern.
743 Named capturing parentheses are still allocated numbers as well as
744 names. The PCRE API provides function calls for extracting the name-to-
745 number translation table from a compiled pattern. There is also a con-
746 venience function for extracting a captured substring by name. For fur-
747 ther details see the pcreapi documentation.
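Since the (?P<name>...) syntax comes from Python, Python's re module gives a convenient way to see named groups at work. This is only a sketch: the pattern and group names are invented for the example, and the groupindex attribute is Python's counterpart of the name-to-number table, not the PCRE API described in pcreapi.

```python
import re

# Each named group is still allocated a number as well as a name.
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = date_re.search("released 2005-02")

by_name = match.group("month")               # extract by name
by_number = match.group(2)                   # the same group, by number
name_to_number = dict(date_re.groupindex)    # name-to-number table
```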
752 Repetition is specified by quantifiers, which can follow any of the
753 following items:
755 a literal data character
757 the \C escape sequence
758 the \X escape sequence (in UTF-8 mode with Unicode properties)
759 an escape such as \d that matches a single character
761 a back reference (see next section)
762 a parenthesized subpattern (unless it is an assertion)
764 The general repetition quantifier specifies a minimum and maximum num-
765 ber of permitted matches, by giving the two numbers in curly brackets
766 (braces), separated by a comma. The numbers must be less than 65536,
767 and the first must be less than or equal to the second. For example:
769 z{2,4}
771 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
772 special character. If the second number is omitted, but the comma is
773 present, there is no upper limit; if the second number and the comma
774 are both omitted, the quantifier specifies an exact number of required
775 matches. Thus
777 [aeiou]{3,}
779 matches at least 3 successive vowels, but may match many more, while
781 \d{8}
783 matches exactly 8 digits. An opening curly bracket that appears in a
784 position where a quantifier is not allowed, or one that does not match
785 the syntax of a quantifier, is taken as a literal character. For exam-
786 ple, {,6} is not a quantifier, but a literal string of four characters.
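The three forms of the general quantifier can be tried out as follows. The sketch uses Python's re module as a stand-in for PCRE. The {,6} case is deliberately left out, because Python treats it as a quantifier with a minimum of zero, unlike the behaviour described above. The helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
bounded = [s for s in candidates if matches_whole(r"z{2,4}", s)]   # {min,max}

open_ended = matches_whole(r"[aeiou]{3,}", "aeiouaeiou")           # {min,}

exact = (matches_whole(r"\d{8}", "20050228"),                      # {n}
         matches_whole(r"\d{8}", "2005022"))
```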
788 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
789 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
790 acters, each of which is represented by a two-byte sequence. Similarly,
791 when Unicode property support is available, \X{3} matches three Unicode
792 extended sequences, each of which may be several bytes long (and they
793 may be of different lengths).
795 The quantifier {0} is permitted, causing the expression to behave as if
796 the previous item and the quantifier were not present.
798 For convenience (and historical compatibility) the three most common
799 quantifiers have single-character abbreviations:
801 * is equivalent to {0,}
802 + is equivalent to {1,}
803 ? is equivalent to {0,1}
805 It is possible to construct infinite loops by following a subpattern
806 that can match no characters with a quantifier that has no upper limit,
811 Earlier versions of Perl and PCRE used to give an error at compile time
812 for such patterns. However, because there are cases where this can be
813 useful, such patterns are now accepted, but if any repetition of the
814 subpattern does in fact match no characters, the loop is forcibly broken.
817 By default, the quantifiers are "greedy", that is, they match as much
818 as possible (up to the maximum number of permitted times), without
819 causing the rest of the pattern to fail. The classic example of where
820 this gives problems is in trying to match comments in C programs. These
821 appear between /* and */ and within the comment, individual * and /
822 characters may appear. An attempt to match C comments by applying the
823 pattern
825 /\*.*\*/
827 to the string
829 /* first comment */ not comment /* second comment */
831 fails, because it matches the entire string owing to the greediness of
832 the .* item.
834 However, if a quantifier is followed by a question mark, it ceases to
835 be greedy, and instead matches the minimum number of times possible, so
836 the pattern
838 /\*.*?\*/
840 does the right thing with the C comments. The meaning of the various
841 quantifiers is not otherwise changed, just the preferred number of
842 matches. Do not confuse this use of question mark with its use as a
843 quantifier in its own right. Because it has two uses, it can sometimes
844 appear doubled, as in
846 \d??\d
848 which matches one digit by preference, but can match two if that is the
849 only way the rest of the pattern matches.
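The difference between greedy and minimizing quantifiers can be seen in a small sketch. It uses Python's re module as a stand-in for PCRE; the comment patterns are the obvious ones for the C comment problem described above and are assumptions of this example.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the string.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing: .*? stops at the first */ that lets the match succeed.
lazy = re.findall(r"/\*.*?\*/", source)

# A doubled question mark: an optional digit that prefers to be absent.
prefers_one = re.match(r"\d??\d", "12").group()
can_take_two = re.fullmatch(r"\d??\d", "12").group()
```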
851 If the PCRE_UNGREEDY option is set (an option which is not available in
852 Perl), the quantifiers are not greedy by default, but individual ones
853 can be made greedy by following them with a question mark. In other
854 words, it inverts the default behaviour.
856 When a parenthesized subpattern is quantified with a minimum repeat
857 count that is greater than 1 or with a limited maximum, more memory is
858 required for the compiled pattern, in proportion to the size of the
859 minimum or maximum.
861 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
862 alent to Perl's /s) is set, thus allowing the . to match newlines, the
863 pattern is implicitly anchored, because whatever follows will be tried
864 against every character position in the subject string, so there is no
865 point in retrying the overall match at any position after the first.
866 PCRE normally treats such a pattern as though it were preceded by \A.
868 In cases where it is known that the subject string contains no new-
869 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
870 mization, or alternatively using ^ to indicate anchoring explicitly.
872 However, there is one situation where the optimization cannot be used.
873 When .* is inside capturing parentheses that are the subject of a
874 backreference elsewhere in the pattern, a match at the start may fail,
875 and a later one succeed. Consider, for example:
879 If the subject is "xyz123abc123" the match point is the fourth charac-
880 ter. For this reason, such a pattern is not implicitly anchored.
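The following sketch shows why such a pattern cannot be anchored. It assumes the pattern (.*)abc\1, which fits the description above, and uses Python's re module as a stand-in for PCRE. No attempt starting at the first three characters can succeed, so the match begins at offset 3, the fourth character.

```python
import re

# Assumed pattern: .* inside capturing parentheses that are back-referenced.
match = re.search(r"(.*)abc\1", "xyz123abc123")

match_start = match.start()   # zero-based offset of the match point
captured = match.group(1)     # the text that \1 had to repeat
```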
882 When a capturing subpattern is repeated, the value captured is the sub-
883 string that matched the final iteration. For example, after
885 (tweedle[dume]{3}\s*)+
887 has matched "tweedledum tweedledee" the value of the captured substring
888 is "tweedledee". However, if there are nested capturing subpatterns,
889 the corresponding captured values may have been set in previous itera-
890 tions. For example, after
894 matches "aba" the value of the second captured substring is "b".
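Both cases can be reproduced in a short sketch using Python's re module as a stand-in for PCRE. The second pattern, (a|(b))+, is an assumption chosen to fit the "aba" example: the inner group keeps the value set in an earlier iteration, even though the final iteration took the other branch.

```python
import re

# The outer group is overwritten on every iteration; the last one wins.
last_iteration = re.match(r"(tweedle[dume]{3}\s*)+",
                          "tweedledum tweedledee").group(1)

# The nested group was set in the second iteration and not in the third.
nested = re.match(r"(a|(b))+", "aba").groups()
```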
897 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
899 With both maximizing and minimizing repetition, failure of what follows
900 normally causes the repeated item to be re-evaluated to see if a dif-
901 ferent number of repeats allows the rest of the pattern to match. Some-
902 times it is useful to prevent this, either to change the nature of the
903 match, or to cause it to fail earlier than it otherwise might, when the
904 author of the pattern knows there is no point in carrying on.
906 Consider, for example, the pattern \d+foo when applied to the subject
911 After matching all 6 digits and then failing to match "foo", the normal
912 action of the matcher is to try again with only 5 digits matching the
913 \d+ item, and then with 4, and so on, before ultimately failing.
914 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
915 the means for specifying that once a subpattern has matched, it is not
916 to be re-evaluated in this way.
918 If we use atomic grouping for the previous example, the matcher would
919 give up immediately on failing to match "foo" the first time. The notation
920 is a kind of special parenthesis, starting with (?> as in this example:
923 (?>\d+)foo
925 This kind of parenthesis "locks up" the part of the pattern it con-
926 tains once it has matched, and a failure further into the pattern is
927 prevented from backtracking into it. Backtracking past it to previous
928 items, however, works as normal.
930 An alternative description is that a subpattern of this type matches
931 the string of characters that an identical standalone pattern would
932 match, if anchored at the current point in the subject string.
934 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
935 such as the above example can be thought of as a maximizing repeat that
936 must swallow everything it can. So, while both \d+ and \d+? are pre-
937 pared to adjust the number of digits they match in order to make the
938 rest of the pattern match, (?>\d+) can only match an entire sequence of
939 digits.
941 Atomic groups in general can of course contain arbitrarily complicated
942 subpatterns, and can be nested. However, when the subpattern for an
943 atomic group is just a single repeated item, as in the example above, a
944 simpler notation, called a "possessive quantifier" can be used. This
945 consists of an additional + character following a quantifier. Using
946 this notation, the previous example can be rewritten as
948 \d++foo
950 Possessive quantifiers are always greedy; the setting of the
951 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
952 simpler forms of atomic group. However, there is no difference in the
953 meaning or processing of a possessive quantifier and the equivalent
954 atomic group.
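The effect of refusing to give characters back can be sketched as follows. Python's re module is used as a stand-in for PCRE, and it accepts (?>...) and possessive quantifiers only from Python 3.11 onwards, so the example is guarded by a version check. The patterns are a variation on the \d+foo example, chosen so that the ordinary form succeeds only by backtracking.

```python
import re
import sys

# Atomic groups and possessive quantifiers need Python 3.11 or later.
HAS_ATOMIC = sys.version_info >= (3, 11)

results = {}
if HAS_ATOMIC:
    # \d+ first takes all five digits, then gives one back for the final \d.
    results["backtracking"] = re.fullmatch(r"\d+\d", "12345") is not None
    # The atomic group keeps all five digits, so the final \d cannot match.
    results["atomic"] = re.fullmatch(r"(?>\d+)\d", "12345") is not None
    # The possessive quantifier is shorthand for the same thing.
    results["possessive"] = re.fullmatch(r"\d++\d", "12345") is not None
```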
956 The possessive quantifier syntax is an extension to the Perl syntax. It
957 originates in Sun's Java package.
959 When a pattern contains an unlimited repeat inside a subpattern that
960 can itself be repeated an unlimited number of times, the use of an
961 atomic group is the only way to avoid some failing matches taking a
962 very long time indeed. The pattern
964 (\D+|<\d+>)*[!?]
966 matches an unlimited number of substrings that either consist of non-
967 digits, or digits enclosed in <>, followed by either ! or ?. When it
968 matches, it runs quickly. However, if it is applied to
970 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
972 it takes a long time before reporting failure. This is because the
973 string can be divided between the internal \D+ repeat and the external
974 * repeat in a large number of ways, and all have to be tried. (The
975 example uses [!?] rather than a single character at the end, because
976 both PCRE and Perl have an optimization that allows for fast failure
977 when a single character is used. They remember the last single charac-
978 ter that is required for a match, and fail early if it is not present
979 in the string.) If the pattern is changed so that it uses an atomic
980 group, like this:
982 ((?>\D+)|<\d+>)*[!?]
984 sequences of non-digits cannot be broken, and failure happens quickly.
987 BACK REFERENCES
989 Outside a character class, a backslash followed by a digit greater than
990 0 (and possibly further digits) is a back reference to a capturing sub-
991 pattern earlier (that is, to its left) in the pattern, provided there
992 have been that many previous capturing left parentheses.
994 However, if the decimal number following the backslash is less than 10,
995 it is always taken as a back reference, and causes an error only if
996 there are not that many capturing left parentheses in the entire pat-
997 tern. In other words, the parentheses that are referenced need not be
998 to the left of the reference for numbers less than 10. See the subsec-
999 tion entitled "Non-printing characters" above for further details of
1000 the handling of digits following a backslash.
1002 A back reference matches whatever actually matched the capturing sub-
1003 pattern in the current subject string, rather than anything matching
1004 the subpattern itself (see "Subpatterns as subroutines" below for a way
1005 of doing that). So the pattern
1007 (sens|respons)e and \1ibility
1009 matches "sense and sensibility" and "response and responsibility", but
1010 not "sense and responsibility". If caseful matching is in force at the
1011 time of the back reference, the case of letters is relevant. For example,
1014 ((?i)rah)\s+\1
1016 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
1017 original capturing subpattern is matched caselessly.
1019 Back references to named subpatterns use the Python syntax (?P=name).
1020 We could rewrite the above example as follows:
1022 (?P<p1>(?i)rah)\s+(?P=p1)
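Numbered and named back references can be compared in a brief sketch. It uses Python's re module as a stand-in for PCRE; the group name "stem" is invented for the example.

```python
import re

numbered = re.compile(r"(sens|respons)e and \1ibility")
named = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

phrases = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",   # the stem differs, so the reference fails
]

# A back reference repeats the text that was matched, not the subpattern.
numbered_hits = [p for p in phrases if numbered.fullmatch(p)]
named_hits = [p for p in phrases if named.fullmatch(p)]
```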
1024 There may be more than one back reference to the same subpattern. If a
1025 subpattern has not actually been used in a particular match, any back
1026 references to it always fail. For example, the pattern
1028 (a|(bc))\2
1030 always fails if it starts to match "a" rather than "bc". Because there
1031 may be many capturing parentheses in a pattern, all digits following
1032 the backslash are taken as part of a potential back reference number.
1033 If the pattern continues with a digit character, some delimiter must be
1034 used to terminate the back reference. If the PCRE_EXTENDED option is
1035 set, this can be whitespace. Otherwise an empty comment (see "Com-
1036 ments" below) can be used.
1038 A back reference that occurs inside the parentheses to which it refers
1039 fails when the subpattern is first used, so, for example, (a\1) never
1040 matches. However, such references can be useful inside repeated sub-
1041 patterns. For example, the pattern
1045 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1046 ation of the subpattern, the back reference matches the character
1047 string corresponding to the previous iteration. In order for this to
1048 work, the pattern must be such that the first iteration does not need
1049 to match the back reference. This can be done using alternation, as in
1050 the example above, or by a quantifier with a minimum of zero.
1053 ASSERTIONS
1055 An assertion is a test on the characters following or preceding the
1056 current matching point that does not actually consume any characters.
1057 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
1058 described above.
1060 More complicated assertions are coded as subpatterns. There are two
1061 kinds: those that look ahead of the current position in the subject
1062 string, and those that look behind it. An assertion subpattern is
1063 matched in the normal way, except that it does not cause the current
1064 matching position to be changed.
1066 Assertion subpatterns are not capturing subpatterns, and may not be
1067 repeated, because it makes no sense to assert the same thing several
1068 times. If any kind of assertion contains capturing subpatterns within
1069 it, these are counted for the purposes of numbering the capturing sub-
1070 patterns in the whole pattern. However, substring capturing is carried
1071 out only for positive assertions, because it does not make sense for
1072 negative assertions.
1074 Lookahead assertions
1076 Lookahead assertions start with (?= for positive assertions and (?! for
1077 negative assertions. For example,
1079 \w+(?=;)
1081 matches a word followed by a semicolon, but does not include the semicolon
1082 in the match, and
1084 foo(?!bar)
1086 matches any occurrence of "foo" that is not followed by "bar". Note
1087 that the apparently similar pattern
1089 (?!foo)bar
1091 does not find an occurrence of "bar" that is preceded by something
1092 other than "foo"; it finds any occurrence of "bar" whatsoever, because
1093 the assertion (?!foo) is always true when the next three characters are
1094 "bar". A lookbehind assertion is needed to achieve the other effect.
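The three lookahead cases can be tried in a short sketch, using Python's re module as a stand-in for PCRE. The subject strings are invented for the example.

```python
import re

# Positive lookahead: the semicolon is required but not consumed.
word_before_semicolon = re.search(r"\w+(?=;)", "return value;").group()

# Negative lookahead: only the "foo" that is not followed by "bar".
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" itself begins, so it never rules anything out.
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar bazbar")]
```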
1096 If you want to force a matching failure at some point in a pattern, the
1097 most convenient way to do it is with (?!) because an empty string
1098 always matches, so an assertion that requires there not to be an empty
1099 string must always fail.
1101 Lookbehind assertions
1103 Lookbehind assertions start with (?<= for positive assertions and (?<!
1104 for negative assertions. For example,
1106 (?<!foo)bar
1108 does find an occurrence of "bar" that is not preceded by "foo". The
1109 contents of a lookbehind assertion are restricted such that all the
1110 strings it matches must have a fixed length. However, if there are sev-
1111 eral alternatives, they do not all have to have the same fixed length.
1120 causes an error at compile time. Branches that match different length
1121 strings are permitted only at the top level of a lookbehind assertion.
1122 This is an extension compared with Perl (at least for 5.8), which
1123 requires all branches to match the same length of string. An assertion
1128 is not permitted, because its single top-level branch can match two
1129 different lengths, but it is acceptable if rewritten to use two top-
1134 The implementation of lookbehind assertions is, for each alternative,
1135 to temporarily move the current position back by the fixed width and
1136 then try to match. If there are insufficient characters before the cur-
1137 rent position, the match is deemed to fail.
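A small sketch of lookbehind behaviour, using Python's re module as a stand-in for PCRE. Python is stricter than PCRE about alternatives of different lengths, so only single fixed-width lookbehinds are shown. The last two lines show what happens when there are too few characters before the current position: the lookbehind's own match fails, so a positive assertion fails and a negative one succeeds.

```python
import re

subject = "foobar bazbar bar"

# Offsets of every "bar" that is not immediately preceded by "foo".
not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]

# Nothing precedes "bar" here, so there is no room to move back three places.
at_start_positive = re.search(r"(?<=foo)bar", "bar")
at_start_negative = re.search(r"(?<!foo)bar", "bar")
```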
1139 PCRE does not allow the \C escape (which matches a single byte in UTF-8
1140 mode) to appear in lookbehind assertions, because it makes it impossi-
1141 ble to calculate the length of the lookbehind. The \X escape, which can
1142 match different numbers of bytes, is also not permitted.
1144 Atomic groups can be used in conjunction with lookbehind assertions to
1145 specify efficient matching at the end of the subject string. Consider a
1146 simple pattern such as
1150 when applied to a long string that does not match. Because matching
1151 proceeds from left to right, PCRE will look for each "a" in the subject
1152 and then see if what follows matches the rest of the pattern. If the
1153 pattern is specified as
1157 the initial .* matches the entire string at first, but when this fails
1158 (because there is no following "a"), it backtracks to match all but the
1159 last character, then all but the last two characters, and so on. Once
1160 again the search for "a" covers the entire string, from right to left,
1161 so we are no better off. However, if the pattern is written as
1165 or, equivalently, using the possessive quantifier syntax,
1169 there can be no backtracking for the .* item; it can match only the
1170 entire string. The subsequent lookbehind assertion does a single test
1171 on the last four characters. If it fails, the match fails immediately.
1172 For long strings, this approach makes a significant difference to the
1173 processing time.
1175 Using multiple assertions
1177 Several assertions (of any sort) may occur in succession. For example,
1179 (?<=\d{3})(?<!999)foo
1181 matches "foo" preceded by three digits that are not "999". Notice that
1182 each of the assertions is applied independently at the same point in
1183 the subject string. First there is a check that the previous three
1184 characters are all digits, and then there is a check that the same
1185 three characters are not "999". This pattern does not match "foo" preceded
1186 by six characters, the first three of which are digits and the last
1187 three of which are not "999". For example, it doesn't match "123abcfoo".
1188 A pattern to do that is
1190 (?<=\d{3}...)(?<!999)foo
1192 This time the first assertion looks at the preceding six characters,
1193 checking that the first three are digits, and then the second assertion
1194 checks that the preceding three characters are not "999".
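The two patterns can be compared side by side in a sketch that uses Python's re module as a stand-in for PCRE. The subject strings beyond "123abcfoo" are invented for the example.

```python
import re

# Both assertions look at the same three characters before "foo".
independent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks six characters back; the second still looks three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
independent_hits = [s for s in subjects if independent.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]
```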
1196 Assertions can be nested in any combination. For example,
1198 (?<=(?<!foo)bar)baz
1200 matches an occurrence of "baz" that is preceded by "bar" which in turn
1201 is not preceded by "foo", while
1203 (?<=\d{3}(?!999)...)foo
1205 is another pattern that matches "foo" preceded by three digits and any
1206 three characters that are not "999".
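Nested assertions can be sketched in the same way, again using Python's re module as a stand-in for PCRE. The first pattern is an assumption that fits the description of "baz" preceded by "bar" which is not preceded by "foo". The subject strings are invented for the example.

```python
import re

# A negative lookbehind nested inside a positive lookbehind.
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")

# A negative lookahead nested inside a lookbehind: after the three digits,
# the next three characters must not be "999".
ahead_inside_behind = re.compile(r"(?<=\d{3}(?!999)...)foo")

nested_results = (nested_behind.search("barbaz") is not None,
                  nested_behind.search("foobarbaz") is not None)
ahead_results = (ahead_inside_behind.search("123abcfoo") is not None,
                 ahead_inside_behind.search("123999foo") is not None)
```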
1209 CONDITIONAL SUBPATTERNS
1211 It is possible to cause the matching process to obey a subpattern con-
1212 ditionally or to choose between two alternative subpatterns, depending
1213 on the result of an assertion, or whether a previous capturing subpattern
1214 matched or not. The two possible forms of conditional subpattern are
1217 (?(condition)yes-pattern)
1218 (?(condition)yes-pattern|no-pattern)
1220 If the condition is satisfied, the yes-pattern is used; otherwise the
1221 no-pattern (if present) is used. If there are more than two alterna-
1222 tives in the subpattern, a compile-time error occurs.
1224 There are three kinds of condition. If the text between the parentheses
1225 consists of a sequence of digits, the condition is satisfied if the
1226 capturing subpattern of that number has previously matched. The number
1227 must be greater than zero. Consider the following pattern, which con-
1228 tains non-significant white space to make it more readable (assume the
1229 PCRE_EXTENDED option) and to divide it into three parts for ease of
1230 discussion):
1232 ( \( )? [^()]+ (?(1) \) )
1234 The first part matches an optional opening parenthesis, and if that
1235 character is present, sets it as the first captured substring. The sec-
1236 ond part matches one or more characters that are not parentheses. The
1237 third part is a conditional subpattern that tests whether the first set
1238 of parentheses matched or not. If they did, that is, if subject started
1239 with an opening parenthesis, the condition is true, and so the yes-pat-
1240 tern is executed and a closing parenthesis is required. Otherwise,
1241 since no-pattern is not present, the subpattern matches nothing. In
1242 other words, this pattern matches a sequence of non-parentheses,
1243 optionally enclosed in parentheses.
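The conditional on a group number can be exercised with a short sketch. It uses Python's re module as a stand-in for PCRE, with re.VERBOSE playing the part of PCRE_EXTENDED. The whole subject is required to match so that the effect of the condition is visible.

```python
import re

optional_parens = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

subjects = ["abc", "(abc)", "(abc", "abc)"]

# A closing parenthesis is demanded only when group 1 took part in the match.
accepted = [s for s in subjects if optional_parens.fullmatch(s)]
```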
1245 If the condition is the string (R), it is satisfied if a recursive call
1246 to the pattern or subpattern has been made. At "top level", the condi-
1247 tion is false. This is a PCRE extension. Recursive patterns are
1248 described in the next section.
1250 If the condition is not a sequence of digits or (R), it must be an
1251 assertion. This may be a positive or negative lookahead or lookbehind
1252 assertion. Consider this pattern, again containing non-significant
1253 white space, and with the two alternatives on the second line:
1255 (?(?=[^a-z]*[a-z])
1256 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1258 The condition is a positive lookahead assertion that matches an
1259 optional sequence of non-letters followed by a letter. In other words,
1260 it tests for the presence of at least one letter in the subject. If a
1261 letter is found, the subject is matched against the first alternative;
1262 otherwise it is matched against the second. This pattern matches
1263 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1264 letters and dd are digits.
1267 COMMENTS
1269 The sequence (?# marks the start of a comment that continues up to the
1270 next closing parenthesis. Nested parentheses are not permitted. The
1271 characters that make up a comment play no part in the pattern matching
1272 at all.
1274 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1275 character class introduces a comment that continues up to the next new-
1276 line character in the pattern.
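Both comment styles can be seen in a brief sketch, using Python's re module as a stand-in for PCRE and re.VERBOSE in place of PCRE_EXTENDED. The date pattern is invented for the example.

```python
import re

# Inline comments: everything from (?# to the next ) is ignored.
inline = re.compile(r"\d{4}(?# year)-\d{2}(?# month)")

# Extended mode: an unescaped # starts a comment that runs to end of line.
extended = re.compile(r"""
    \d{4}   # year
    -
    \d{2}   # month
""", re.VERBOSE)

same_result = (inline.fullmatch("2005-02") is not None,
               extended.fullmatch("2005-02") is not None)
```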
1281 Consider the problem of matching a string in parentheses, allowing for
1282 unlimited nested parentheses. Without the use of recursion, the best
1283 that can be done is to use a pattern that matches up to some fixed
1284 depth of nesting. It is not possible to handle an arbitrary nesting
1285 depth. Perl provides a facility that allows regular expressions to
1286 recurse (amongst other things). It does this by interpolating Perl code
1287 in the expression at run time, and the code can refer to the expression
1288 itself. A Perl pattern to solve the parentheses problem can be created
1289 like this:
1291 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1293 The (?p{...}) item interpolates Perl code at run time, and in this case
1294 refers recursively to the pattern in which it appears. Obviously, PCRE
1295 cannot support the interpolation of Perl code. Instead, it supports
1296 some special syntax for recursion of the entire pattern, and also for
1297 individual subpattern recursion.
1299 The special item that consists of (? followed by a number greater than
1300 zero and a closing parenthesis is a recursive call of the subpattern of
1301 the given number, provided that it occurs inside that subpattern. (If
1302 not, it is a "subroutine" call, which is described in the next section.)
1303 The special item (?R) is a recursive call of the entire regular expression.
1306 For example, this PCRE pattern solves the nested parentheses problem
1307 (assume the PCRE_EXTENDED option is set so that white space is
1308 ignored):
1310 \( ( (?>[^()]+) | (?R) )* \)
1312 First it matches an opening parenthesis. Then it matches any number of
1313 substrings which can either be a sequence of non-parentheses, or a
1314 recursive match of the pattern itself (that is, a correctly parenthesized
1315 substring). Finally there is a closing parenthesis.
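Python's re module has no (?R), so the recursion cannot be shown with a pattern there. As an illustration of what the recursive pattern does, here is a hand-written Python function that follows the same three steps. It is only a sketch of the idea, not how PCRE implements recursion. The function name match_parens is hypothetical.

```python
def match_parens(text, pos=0):
    """Return the offset just past a balanced group starting at pos, or -1."""
    # Step 1: the opening parenthesis must be present.
    if pos >= len(text) or text[pos] != "(":
        return -1
    pos += 1
    # Step 2: any number of non-parenthesis runs or nested groups.
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "(":
            # The recursive call, corresponding to (?R).
            pos = match_parens(text, pos)
            if pos == -1:
                return -1
        else:
            # Swallow a whole run of non-parentheses, as the atomic group does.
            while pos < len(text) and text[pos] not in "()":
                pos += 1
    # Step 3: the closing parenthesis must be present.
    if pos < len(text):
        return pos + 1
    return -1

balanced_end = match_parens("(ab(cd)ef)")
unbalanced_end = match_parens("(a(b)")
```

Because the two alternatives can never match the same character, this function needs no backtracking, which is the property the atomic group gives the pattern.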
1317 If this were part of a larger pattern, you would not want to recurse
1318 the entire pattern, so instead you could use this:
1320 ( \( ( (?>[^()]+) | (?1) )* \) )
1322 We have put the pattern into parentheses, and caused the recursion to
1323 refer to them instead of the whole pattern. In a larger pattern, keep-
1324 ing track of parenthesis numbers can be tricky. It may be more conve-
1325 nient to use named parentheses instead. For this, PCRE uses (?P>name),
1326 which is an extension to the Python syntax that PCRE uses for named
1327 parentheses (Perl does not provide named parentheses). We could rewrite
1328 the above example as follows:
1330 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
1332 This particular example pattern contains nested unlimited repeats, and
1333 so the use of atomic grouping for matching strings of non-parentheses
1334 is important when applying the pattern to strings that do not match.
1335 For example, when this pattern is applied to
1337 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1339 it yields "no match" quickly. However, if atomic grouping is not used,
1340 the match runs for a very long time indeed because there are so many
1341 different ways the + and * repeats can carve up the subject, and all
1342 have to be tested before failure can be reported.
1344 At the end of a match, the values set for any capturing subpatterns are
1345 those from the outermost level of the recursion at which the subpattern
1346 value is set. If you want to obtain intermediate values, a callout
1347 function can be used (see the next section and the pcrecallout documentation).
1348 If the pattern above is matched against
1350 (ab(cd)ef)
1352 the value for the capturing parentheses is "ef", which is the last
1353 value taken on at the top level. If additional parentheses are added,
1354 giving
1356 \( ( ( (?>[^()]+) | (?R) )* ) \)
1360 the string they capture is "ab(cd)ef", the contents of the top level
1361 parentheses. If there are more than 15 capturing parentheses in a pat-
1362 tern, PCRE has to obtain extra memory to store data during a recursion,
1363 which it does by using pcre_malloc, freeing it via pcre_free after-
1364 wards. If no memory can be obtained, the match fails with the
1365 PCRE_ERROR_NOMEMORY error.
1367 Do not confuse the (?R) item with the condition (R), which tests for
1368 recursion. Consider this pattern, which matches text in angle brack-
1369 ets, allowing for arbitrary nesting. Only digits are allowed in nested
1370 brackets (that is, when recursing), whereas any characters are permit-
1371 ted at the outer level.
1373 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
1375 In this pattern, (?(R) is the start of a conditional subpattern, with
1376 two different alternatives for the recursive and non-recursive cases.
1377 The (?R) item is the actual recursive call.
1380 SUBPATTERNS AS SUBROUTINES
1382 If the syntax for a recursive subpattern reference (either by number or
1383 by name) is used outside the parentheses to which it refers, it oper-
1384 ates like a subroutine in a programming language. An earlier example
1385 pointed out that the pattern
1387 (sens|respons)e and \1ibility
1389 matches "sense and sensibility" and "response and responsibility", but
1390 not "sense and responsibility". If instead the pattern
1392 (sens|respons)e and (?1)ibility
1394 is used, it does match "sense and responsibility" as well as the other
1395 two strings. Such references must, however, follow the subpattern to
1396 which they refer.
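The difference between a back reference and a subroutine call can be approximated in Python, whose re module has no (?1) syntax. The sketch below imitates the subroutine call by pasting the subpattern's text in a second time. This is an assumption of the example and not PCRE's mechanism, but it matches the same strings. The name stem is hypothetical.

```python
import re

stem = r"(?:sens|respons)"   # the subpattern's text, reused below

# \1 must repeat the text that group 1 actually matched.
back_reference = re.compile(r"(sens|respons)e and \1ibility")

# Re-using the subpattern itself lets each occurrence choose its own branch.
subroutine_like = re.compile(stem + r"e and " + stem + r"ibility")

mixed = "sense and responsibility"
outcomes = (back_reference.fullmatch(mixed) is not None,
            subroutine_like.fullmatch(mixed) is not None)
```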
1401 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1402 Perl code to be obeyed in the middle of matching a regular expression.
1403 This makes it possible, amongst other things, to extract different substrings
1404 that match the same pair of parentheses when there is a repetition.
1407 PCRE provides a similar feature, but of course it cannot obey arbitrary
1408 Perl code. The feature is called "callout". The caller of PCRE provides
1409 an external function by putting its entry point in the global variable
1410 pcre_callout. By default, this variable contains NULL, which disables
1411 all calling out.
1413 Within a regular expression, (?C) indicates the points at which the
1414 external function is to be called. If you want to identify different
1415 callout points, you can put a number less than 256 after the letter C.
1416 The default value is zero. For example, this pattern has two callout
1421 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1422 automatically installed before each item in the pattern. They are all
1425 During matching, when PCRE reaches a callout point (and pcre_callout is
1426 set), the external function is called. It is provided with the number
1427 of the callout, the position in the pattern, and, optionally, one item
1428 of data originally supplied by the caller of pcre_exec(). The callout
1429 function may cause matching to proceed, to backtrack, or to fail alto-
1430 gether. A complete description of the interface to the callout function
1431 is given in the pcrecallout documentation.
1433 Last updated: 28 February 2005
1434 Copyright (c) 1997-2005 University of Cambridge.