doc/doc-txt/pcrepattern.txt

   1 This file contains the PCRE man page that describes the regular expressions
   2 supported by PCRE version 6.0. Note that not all of the features are relevant
   3 in the context of Exim. In particular, the version of PCRE that is compiled
   4 with Exim does not include UTF-8 support, there is no mechanism for changing
   5 the options with which the PCRE functions are called, and features such as
   6 callout are not accessible.
   7 -----------------------------------------------------------------------------
   8
   9
  10
  11 NAME
  12        PCRE - Perl-compatible regular expressions
  13
  14
  15 PCRE REGULAR EXPRESSION DETAILS
  16
  17        The  syntax  and semantics of the regular expressions supported by PCRE
  18        are described below. Regular expressions are also described in the Perl
  19        documentation  and  in  a  number  of books, some of which have copious
  20        examples.  Jeffrey Friedl's "Mastering Regular Expressions",  published
  21        by  O'Reilly, covers regular expressions in great detail. This descrip-
  22        tion of PCRE's regular expressions is intended as reference material.
  23
  24        The original operation of PCRE was on strings of  one-byte  characters.
  25        However,  there is now also support for UTF-8 character strings. To use
  26        this, you must build PCRE to  include  UTF-8  support,  and  then  call
  27        pcre_compile()  with  the  PCRE_UTF8  option.  How this affects pattern
  28        matching is mentioned in several places below. There is also a  summary
  29        of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
  30        page.
  31
  32        The remainder of this document discusses the  patterns  that  are  sup-
  33        ported  by  PCRE when its main matching function, pcre_exec(), is used.
  34        From  release  6.0,   PCRE   offers   a   second   matching   function,
  35        pcre_dfa_exec(),  which matches using a different algorithm that is not
  36        Perl-compatible. The advantages and disadvantages  of  the  alternative
  37        function, and how it differs from the normal function, are discussed in
  38        the pcrematching page.
  39
  40        A regular expression is a pattern that is  matched  against  a  subject
  41        string  from  left  to right. Most characters stand for themselves in a
  42        pattern, and match the corresponding characters in the  subject.  As  a
  43        trivial example, the pattern
  44
  45          The quick brown fox
  46
  47        matches a portion of a subject string that is identical to itself. When
  48        caseless matching is specified (the PCRE_CASELESS option), letters  are
  49        matched  independently  of case. In UTF-8 mode, PCRE always understands
  50        the concept of case for characters whose values are less than  128,  so
  51        caseless  matching  is always possible. For characters with higher val-
  52        ues, the concept of case is supported if PCRE is compiled with  Unicode
  53        property  support,  but  not  otherwise.   If  you want to use caseless
  54        matching for characters 128 and above, you must  ensure  that  PCRE  is
  55        compiled with Unicode property support as well as with UTF-8 support.
  56
  57        The  power  of  regular  expressions  comes from the ability to include
  58        alternatives and repetitions in the pattern. These are encoded  in  the
  59        pattern by the use of metacharacters, which do not stand for themselves
  60        but instead are interpreted in some special way.
  61
  62        There are two different sets of metacharacters: those that  are  recog-
  63        nized  anywhere in the pattern except within square brackets, and those
  64        that are recognized in square brackets. Outside  square  brackets,  the
  65        metacharacters are as follows:
  66
  67          \      general escape character with several uses
  68          ^      assert start of string (or line, in multiline mode)
  69          $      assert end of string (or line, in multiline mode)
  70          .      match any character except newline (by default)
  71          [      start character class definition
  72          |      start of alternative branch
  73          (      start subpattern
  74          )      end subpattern
  75          ?      extends the meaning of (
  76                 also 0 or 1 quantifier
  77                 also quantifier minimizer
  78          *      0 or more quantifier
  79          +      1 or more quantifier
  80                 also "possessive quantifier"
  81          {      start min/max quantifier
  82
  83        Part  of  a  pattern  that is in square brackets is called a "character
  84        class". In a character class the only metacharacters are:
  85
  86          \      general escape character
  87          ^      negate the class, but only if the first character
  88          -      indicates character range
  89          [      POSIX character class (only if followed by POSIX
  90                   syntax)
  91          ]      terminates the character class
  92
  93        The following sections describe the use of each of the  metacharacters.
  94
  95
  96 BACKSLASH
  97
  98        The backslash character has several uses. Firstly, if it is followed by
  99        a non-alphanumeric character, it takes away any  special  meaning  that
 100        character  may  have.  This  use  of  backslash  as an escape character
 101        applies both inside and outside character classes.
 102
 103        For example, if you want to match a * character, you write  \*  in  the
 104        pattern.   This  escaping  action  applies whether or not the following
 105        character would otherwise be interpreted as a metacharacter, so  it  is
 106        always  safe  to  precede  a non-alphanumeric with backslash to specify
 107        that it stands for itself. In particular, if you want to match a  back-
 108        slash, you write \\.
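
            The escaping rule can be tried in any engine that shares it. The
            sketch below uses Python's re module as a stand-in (an assumption:
            it is not PCRE, but it treats a backslash before a non-alphanumeric
            character in the same way). The variable names are illustrative.

```python
import re

# A backslash before a non-alphanumeric character makes it literal.
star = re.compile(r"a\*b")       # matches the three characters a*b
backslash = re.compile(r"\\")    # matches one literal backslash

# Escaping a character that is not a metacharacter does no harm.
harmless = re.compile(r"\-\!")

print(bool(star.search("xa*by")))
print(bool(star.search("aaab")))
print(bool(backslash.search("C:\\temp")))
print(bool(harmless.fullmatch("-!")))
```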
 109
 110        If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
 111        the pattern (other than in a character class) and characters between  a
 112        # outside a character class and the next newline character are ignored.
 113        An escaping backslash can be used to include a whitespace or #  charac-
 114        ter as part of the pattern.
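
            As a rough illustration, Python's re.VERBOSE flag follows  the  same
            conventions as described here for PCRE_EXTENDED. It is used below as
            an assumed stand-in; the pattern itself is made up for the example.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED in this sketch.
pattern = re.compile(r"""
    \d+      # one or more digits
    \        # an escaped space is part of the pattern
    \#       # an escaped hash is a literal hash character
    [ #]     # inside a class, space and hash are ordinary characters
""", re.VERBOSE)

print(bool(pattern.fullmatch("42 # ")))
print(bool(pattern.fullmatch("42#")))
```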
 115
 116        If  you  want  to remove the special meaning from a sequence of charac-
 117        ters, you can do so by putting them between \Q and \E. This is  differ-
 118        ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
 119        sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
 120        tion. Note the following examples:
 121
 122          Pattern            PCRE matches   Perl matches
 123
 124          \Qabc$xyz\E        abc$xyz        abc followed by the
 125                                              contents of $xyz
 126          \Qabc\$xyz\E       abc\$xyz       abc\$xyz
 127          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
 128
 129        The  \Q...\E  sequence  is recognized both inside and outside character
 130        classes.
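
            Python's re module has no \Q...\E, so the sketch below  models  the
            PCRE column of the table with a hypothetical helper,
            expand_literal_spans(), that rewrites each quoted span as an escaped
            literal. It is an illustration only: it does not notice a \Q that is
            itself preceded by an escaping backslash, and it assumes that an
            unterminated \Q runs to the end of the pattern.

```python
import re

def expand_literal_spans(pattern):
    """Rewrite \\Q...\\E spans as escaped literals (hypothetical helper)."""
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])          # no more quoted spans
            break
        out.append(pattern[pos:start])
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # Assumption: an unterminated \Q quotes the rest of the pattern.
            out.append(re.escape(pattern[start + 2:]))
            break
        # $ and @ inside the span become plain literals, as in PCRE.
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)

for quoted in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\Qabc\E\$\Qxyz\E"):
    print(quoted, "->", expand_literal_spans(quoted))
```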
 131
 132    Non-printing characters
 133
 134        A second use of backslash provides a way of encoding non-printing char-
 135        acters  in patterns in a visible manner. There is no restriction on the
 136        appearance of non-printing characters, apart from the binary zero  that
 137        terminates  a  pattern,  but  when  a pattern is being prepared by text
 138        editing, it is usually easier  to  use  one  of  the  following  escape
 139        sequences than the binary character it represents:
 140
 141          \a        alarm, that is, the BEL character (hex 07)
 142          \cx       "control-x", where x is any character
 143          \e        escape (hex 1B)
 144          \f        formfeed (hex 0C)
 145          \n        newline (hex 0A)
 146          \r        carriage return (hex 0D)
 147          \t        tab (hex 09)
 148          \ddd      character with octal code ddd, or backreference
 149          \xhh      character with hex code hh
 150          \x{hhh..} character with hex code hhh... (UTF-8 mode only)
 151
 152        The  precise  effect of \cx is as follows: if x is a lower case letter,
 153        it is converted to upper case. Then bit 6 of the character (hex 40)  is
 154        inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
 155        becomes hex 7B.
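
            The \cx rule is small enough to write out. The function name
            control_code() is hypothetical; the arithmetic follows the two steps
            given above.

```python
def control_code(x):
    """Return the character code produced by \\cx for a one-character x."""
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())    # lower case letters become upper case
    return code ^ 0x40           # invert bit 6 (hex 40)

for ch in "z{;":
    print(ch, hex(control_code(ch)))
```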
 156
 157        After \x, from zero to two hexadecimal digits are read (letters can  be
 158        in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
 159        its may appear between \x{ and }, but the value of the  character  code
 160        must  be  less  than  2**31  (that is, the maximum hexadecimal value is
 161        7FFFFFFF). If characters other than hexadecimal digits  appear  between
 162        \x{  and }, or if there is no terminating }, this form of escape is not
 163        recognized. Instead, the initial \x will  be  interpreted  as  a  basic
 164        hexadecimal  escape, with no following digits, giving a character whose
 165        value is zero.
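
            The reading rules for \x can be modelled as follows. This is a sketch
            rather than PCRE's own code: parse_hex_escape() is a hypothetical
            name, the treatment of empty braces is an assumption, and raising an
            exception for an over-large value is simply this sketch's way of
            flagging the limit.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def parse_hex_escape(rest, utf8=False):
    """Interpret the text that follows \\x in a pattern.

    Returns (character code, number of characters consumed from rest).
    """
    if utf8 and rest.startswith("{"):
        close = rest.find("}")
        digits = rest[1:close] if close != -1 else ""
        # Assumption of this sketch: empty braces are not a valid escape.
        if digits and all(c in HEX_DIGITS for c in digits):
            value = int(digits, 16)
            if value >= 2**31:
                raise ValueError("character code must be less than 2**31")
            return value, close + 1
    # Basic form: zero, one or two hexadecimal digits.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX_DIGITS:
        n += 1
    return (int(rest[:n], 16) if n else 0), n

print(parse_hex_escape("dc"))
print(parse_hex_escape("{100}", utf8=True))
print(parse_hex_escape("{1g}", utf8=True))
```

            When the braced form is not recognized, the function consumes nothing
            after \x and reports the value zero, so the brace and what follows it
            are left to be read as ordinary pattern text.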
 166
 167        Characters whose value is less than 256 can be defined by either of the
 168        two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
 169        in the way they are handled. For example, \xdc is exactly the  same  as
 170        \x{dc}.
 171
 172        After  \0  up  to  two further octal digits are read. In both cases, if
 173        there are fewer than two digits, just those that are present are  used.
 174        Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL
 175        character (code value 7). Make sure you supply  two  digits  after  the
 176        initial  zero  if the pattern character that follows is itself an octal
 177        digit.
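
            The \0\x\07 example can be checked with a small decoder. The function
            decode_escapes() is hypothetical and deliberately handles nothing but
            \0 and the basic \x form.

```python
OCTAL = "01234567"
HEX = "0123456789abcdefABCDEF"

def decode_escapes(pattern):
    """Decode a pattern that consists only of \\0 and \\x escapes.

    Returns the list of character codes. At most two digits are read
    after \\0 or \\x; if fewer are present, only those are used.
    """
    codes = []
    i = 0
    while i < len(pattern):
        kind = pattern[i + 1:i + 2]
        if pattern[i] != "\\" or kind not in ("0", "x"):
            raise ValueError("this sketch handles only \\0 and \\x escapes")
        allowed, base = (OCTAL, 8) if kind == "0" else (HEX, 16)
        start = end = i + 2
        while end < len(pattern) and end - start < 2 and pattern[end] in allowed:
            end += 1
        codes.append(int(pattern[start:end], base) if end > start else 0)
        i = end
    return codes

print(decode_escapes(r"\0\x\07"))
```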
 178
 179        The handling of a backslash followed by a digit other than 0 is compli-
 180        cated.  Outside a character class, PCRE reads it and any following dig-
 181        its as a decimal number. If the number is less than  10,  or  if  there
 182        have been at least that many previous capturing left parentheses in the
 183        expression, the entire  sequence  is  taken  as  a  back  reference.  A
 184        description  of how this works is given later, following the discussion
 185        of parenthesized subpatterns.
 186
 187        Inside a character class, or if the decimal number is  greater  than  9
 188        and  there have not been that many capturing subpatterns, PCRE re-reads
 189        up to three octal digits following the backslash, and generates a  sin-
 190        gle byte from the least significant 8 bits of the value. Any subsequent
 191        digits stand for themselves.  For example:
 192
 193          \040   is another way of writing a space
 194          \40    is the same, provided there are fewer than 40
 195                    previous capturing subpatterns
 196          \7     is always a back reference
 197          \11    might be a back reference, or another way of
 198                    writing a tab
 199          \011   is always a tab
 200          \0113  is a tab followed by the character "3"
 201          \113   might be a back reference, otherwise the
 202                    character with octal code 113
 203          \377   might be a back reference, otherwise
 204                    the byte consisting entirely of 1 bits
 205          \81    is either a back reference, or a binary zero
 206                    followed by the two characters "8" and "1"
 207
 208        Note that octal values of 100 or greater must not be  introduced  by  a
 209        leading zero, because no more than three octal digits are ever read.
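
            The decision between a back reference and an octal escape can be
            written as a short function. This is a model of the rules in this
            section, not PCRE's implementation; interpret() and its arguments are
            hypothetical names.

```python
def interpret(digits, groups_so_far, in_class=False):
    """Classify the digits that follow a backslash.

    digits         the run of decimal digits after the backslash
    groups_so_far  number of capturing left parentheses seen so far
    Returns ("backref", number, "") or ("char", code, leftover digits).
    """
    if digits[0] != "0" and not in_class:
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits from the start.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in "01234567":
        n += 1
    value = int(digits[:n], 8) if n else 0
    # Only the least significant 8 bits are kept; the rest are literals.
    return ("char", value & 0xFF, digits[n:])

for text, groups in (("040", 0), ("7", 0), ("11", 0), ("0113", 0), ("81", 0)):
    print("\\" + text, interpret(text, groups))
```

            The leftover digits in the third field stand for themselves, which is
            how \0113 becomes a tab followed by "3" and \81 becomes a binary zero
            followed by "8" and "1".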
 210
 211        All  the  sequences  that  define a single byte value or a single UTF-8
 212        character (in UTF-8 mode) can be used both inside and outside character
 213        classes.  In  addition,  inside  a  character class, the sequence \b is
 214        interpreted as the backspace character (hex 08), and the sequence \X is
 215        interpreted  as  the  character  "X".  Outside a character class, these
 216        sequences have different meanings (see below).
 217
 218    Generic character types
 219
 220        The third use of backslash is for specifying generic  character  types.
 221        The following are always recognized:
 222
 223          \d     any decimal digit
 224          \D     any character that is not a decimal digit
 225          \s     any whitespace character
 226          \S     any character that is not a whitespace character
 227          \w     any "word" character
 228          \W     any "non-word" character
 229
 230        Each pair of escape sequences partitions the complete set of characters
 231        into two disjoint sets. Any given character matches one, and only  one,
 232        of each pair.
 233
 234        These character type sequences can appear both inside and outside char-
 235        acter classes. They each match one character of the  appropriate  type.
 236        If  the current matching point is at the end of the subject string, all
 237        of them fail, since there is no character to match.
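
            Both properties (the partition into disjoint sets, and failure at the
            end of the subject) can be checked mechanically. The sketch uses
            Python's re module with the re.ASCII flag as an assumed stand-in for
            PCRE in its default one-byte mode.

```python
import re

PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]

def matches_exactly_one(ch):
    """True if ch matches exactly one escape of every pair."""
    return all(
        (re.fullmatch(lower, ch, re.ASCII) is None)
        != (re.fullmatch(upper, ch, re.ASCII) is None)
        for lower, upper in PAIRS
    )

# Every character code from 0 to 255 falls on one side of each pair.
all_partitioned = all(matches_exactly_one(chr(c)) for c in range(256))

# At offset 3 the subject "abc" is exhausted, so all six escapes fail.
at_end = [re.compile(p, re.ASCII).match("abc", 3)
          for pair in PAIRS for p in pair]

print(all_partitioned, at_end)
```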
 238
 239        For compatibility with Perl, \s does not match the VT  character  (code
  240        11).   This makes it different from the POSIX "space" class. The \s
 241        characters are HT (9), LF (10), FF (12), CR (13), and space (32).
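
            Other engines differ on this point. Python's \s, for example, does
            match VT, so emulating the set listed above in Python calls for an
            explicit class. The name PCRE6_S is hypothetical.

```python
import re

PCRE6_S = r"[\t\n\f\r ]"     # HT, LF, FF, CR and space, as listed above

vt = "\x0b"
python_s_matches_vt = re.fullmatch(r"\s", vt) is not None
emulated_s_matches_vt = re.fullmatch(PCRE6_S, vt) is not None
emulated_codes = sorted(
    c for c in range(128) if re.fullmatch(PCRE6_S, chr(c))
)

print(python_s_matches_vt, emulated_s_matches_vt, emulated_codes)
```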
 242
 243        A "word" character is an underscore or any character less than 256 that
 244        is  a  letter  or  digit.  The definition of letters and digits is con-
 245        trolled by PCRE's low-valued character tables, and may vary if  locale-
 246        specific  matching is taking place (see "Locale support" in the pcreapi
 247        page). For example, in the  "fr_FR"  (French)  locale,  some  character
 248        codes  greater  than  128  are used for accented letters, and these are
 249        matched by \w.
 250
 251        In UTF-8 mode, characters with values greater than 128 never match  \d,
 252        \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
 253        code character property support is available.
 254
 255    Unicode character properties
 256
 257        When PCRE is built with Unicode character property support, three addi-
 258        tional  escape sequences to match generic character types are available
 259        when UTF-8 mode is selected. They are:
 260
 261         \p{xx}   a character with the xx property
 262         \P{xx}   a character without the xx property
 263         \X       an extended Unicode sequence
 264
 265        The property names represented by xx above are limited to  the  Unicode
 266        general  category properties. Each character has exactly one such prop-
 267        erty, specified by a two-letter abbreviation.  For  compatibility  with
 268        Perl,  negation  can be specified by including a circumflex between the
 269        opening brace and the property name. For example, \p{^Lu} is  the  same
 270        as \P{Lu}.
 271
 272        If  only  one  letter  is  specified with \p or \P, it includes all the
 273        properties that start with that letter. In this case, in the absence of
 274        negation, the curly brackets in the escape sequence are optional; these
 275        two examples have the same effect:
 276
 277          \p{L}
 278          \pL
 279
 280        The following property codes are supported:
 281
 282          C     Other
 283          Cc    Control
 284          Cf    Format
 285          Cn    Unassigned
 286          Co    Private use
 287          Cs    Surrogate
 288
 289          L     Letter
 290          Ll    Lower case letter
 291          Lm    Modifier letter
 292          Lo    Other letter
 293          Lt    Title case letter
 294          Lu    Upper case letter
 295
 296          M     Mark
 297          Mc    Spacing mark
 298          Me    Enclosing mark
 299          Mn    Non-spacing mark
 300
 301          N     Number
 302          Nd    Decimal number
 303          Nl    Letter number
 304          No    Other number
 305
 306          P     Punctuation
 307          Pc    Connector punctuation
 308          Pd    Dash punctuation
 309          Pe    Close punctuation
 310          Pf    Final punctuation
 311          Pi    Initial punctuation
 312          Po    Other punctuation
 313          Ps    Open punctuation
 314
 315          S     Symbol
 316          Sc    Currency symbol
 317          Sk    Modifier symbol
 318          Sm    Mathematical symbol
 319          So    Other symbol
 320
 321          Z     Separator
 322          Zl    Line separator
 323          Zp    Paragraph separator
 324          Zs    Space separator
 325
 326        Extended properties such as "Greek" or "InMusicalSymbols" are not  sup-
 327        ported by PCRE.
 328
 329        Specifying  caseless  matching  does not affect these escape sequences.
 330        For example, \p{Lu} always matches only upper case letters.
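
            Python's re module does not support \p, but the matching rule can be
            modelled with the standard unicodedata module. has_property() is a
            hypothetical helper, and Python's Unicode tables are newer than those
            built into PCRE 6.0, so individual characters may be classified
            differently.

```python
import unicodedata

def has_property(ch, prop, negate=False):
    """Model \\p{prop}, or \\P{prop} when negate is true, for one character."""
    category = unicodedata.category(ch)     # for example "Lu", "Nd", "Zs"
    # A one-letter name such as "L" covers every category that starts with it.
    return category.startswith(prop) != negate

print(has_property("A", "Lu"), has_property("a", "Lu"), has_property("a", "L"))
```

            Because the test looks only at the general category, it is unaffected
            by any notion of caseless matching, in line with the note above.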
 331
 332        The \X escape matches any number of Unicode  characters  that  form  an
 333        extended Unicode sequence. \X is equivalent to
 334
 335          (?>\PM\pM*)
 336
 337        That  is,  it matches a character without the "mark" property, followed
 338        by zero or more characters with the "mark"  property,  and  treats  the
 339        sequence  as  an  atomic group (see below).  Characters with the "mark"
 340        property are typically accents that affect the preceding character.
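
            The behaviour of \X can be sketched directly from that equivalence.
            match_extended_sequence() is a hypothetical function; it relies on
            Python's unicodedata module to decide which characters are marks.

```python
import unicodedata

def is_mark(ch):
    return unicodedata.category(ch).startswith("M")

def match_extended_sequence(subject, pos=0):
    """Return the end offset of an \\X match at pos, or None if it fails."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None                 # \PM needs one character that is not a mark
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                    # \pM* takes every following mark
    return end                      # nothing is given back: the group is atomic

subject = "e\u0301a"                # "e", combining acute accent, "a"
print(match_extended_sequence(subject), match_extended_sequence(subject, 2))
```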
 341
 342        Matching characters by Unicode property is not fast, because  PCRE  has
 343        to  search  a  structure  that  contains data for over fifteen thousand
 344        characters. That is why the traditional escape sequences such as \d and
 345        \w do not use Unicode properties in PCRE.
 346
 347    Simple assertions
 348
 349        The fourth use of backslash is for certain simple assertions. An asser-
 350        tion specifies a condition that has to be met at a particular point  in
 351        a  match, without consuming any characters from the subject string. The
 352        use of subpatterns for more complicated assertions is described  below.
 353        The backslashed assertions are:
 354
 355          \b     matches at a word boundary
 356          \B     matches when not at a word boundary
 357          \A     matches at start of subject
 358          \Z     matches at end of subject or before newline at end
 359          \z     matches at end of subject
 360          \G     matches at first matching position in subject
 361
 362        These  assertions may not appear in character classes (but note that \b
 363        has a different meaning, namely the backspace character, inside a char-
 364        acter class).
 365
 366        A  word  boundary is a position in the subject string where the current
 367        character and the previous character do not both match \w or  \W  (i.e.
 368        one  matches  \w  and the other matches \W), or the start or end of the
 369        string if the first or last character matches \w, respectively.
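
            The definition translates into a short test on neighbouring
            characters. The sketch below implements it for ASCII word characters
            and compares the result with Python's own \b, which follows the same
            definition; the function names are hypothetical.

```python
import re

def is_word_char(ch):
    return ch == "_" or (ch.isascii() and ch.isalnum())

def word_boundaries(subject):
    """List the offsets at which \\b would succeed, per the rule above."""
    found = []
    for i in range(len(subject) + 1):
        before = i > 0 and is_word_char(subject[i - 1])
        after = i < len(subject) and is_word_char(subject[i])
        if before != after:         # one side is \w, the other is not
            found.append(i)
    return found

subject = "ab cd_1!"
from_rule = word_boundaries(subject)
from_engine = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
print(from_rule, from_engine)
```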
 370
 371        The \A, \Z, and \z assertions differ from  the  traditional  circumflex
 372        and dollar (described in the next section) in that they only ever match
 373        at the very start and end of the subject string, whatever  options  are
 374        set.  Thus,  they are independent of multiline mode. These three asser-
 375        tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
 376        affect  only the behaviour of the circumflex and dollar metacharacters.
 377        However, if the startoffset argument of pcre_exec() is non-zero,  indi-
 378        cating that matching is to start at a point other than the beginning of
 379        the subject, \A can never match. The difference between \Z  and  \z  is
 380        that  \Z  matches  before  a  newline that is the last character of the
 381        string as well as at the end of the string, whereas \z matches only  at
 382        the end.
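
            These distinctions can be reproduced in Python, with one caveat that
            the sketch makes explicit: Python's \Z matches only at the very end,
            like PCRE's \z, so PCRE's \Z is emulated here with a lookahead. The
            constant names are hypothetical.

```python
import re

PCRE_BIG_Z = r"(?=\n?\Z)"    # emulates PCRE \Z: end, or before a final newline
PCRE_SMALL_Z = r"\Z"         # Python's \Z behaves like PCRE \z: very end only

subject = "abc\nabc\n"

big_z = re.search("abc" + PCRE_BIG_Z, subject, re.MULTILINE)
small_z = re.search("abc" + PCRE_SMALL_Z, subject, re.MULTILINE)

# In multiline mode $ matches at every line end, and ^ at every line start,
# but \A still matches only at the very start of the subject.
dollar = [m.start() for m in re.finditer(r"abc$", subject, re.MULTILINE)]
start_a = [m.start() for m in re.finditer(r"\Aabc", subject, re.MULTILINE)]

print(big_z, small_z, dollar, start_a)
```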
 383
 384        The  \G assertion is true only when the current matching position is at
 385        the start point of the match, as specified by the startoffset  argument
 386        of  pcre_exec().  It  differs  from \A when the value of startoffset is
 387        non-zero. By calling pcre_exec() multiple times with appropriate  argu-
 388        ments, you can mimic Perl's /g option, and it is in this kind of imple-
 389        mentation where \G can be useful.
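
            The calling pattern looks roughly like the loop below. Python has no
            \G, so the sketch uses Pattern.match() with a start position, which
            succeeds only at that position and therefore plays the same role.
            leading_tokens() and TOKEN are hypothetical names.

```python
import re

TOKEN = re.compile(r"\d+|[a-z]+")

def leading_tokens(subject, startoffset=0):
    """Collect tokens that follow one another without any gap."""
    tokens = []
    pos = startoffset
    while pos < len(subject):
        # match() only succeeds at pos itself, which is the job of \G.
        m = TOKEN.match(subject, pos)
        if m is None:
            break
        tokens.append(m.group())
        pos = m.end()               # the next call starts where this one ended
    return tokens

print(leading_tokens("12ab34 56"))
print(leading_tokens("12ab34 56", 7))
```

            The loop stops at the space because no token can start there; a
            search that was free to skip ahead would have gone on to find "56".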
 390
 391        Note, however, that PCRE's interpretation of \G, as the  start  of  the
 392        current match, is subtly different from Perl's, which defines it as the
 393        end of the previous match. In Perl, these can  be  different  when  the
 394        previously  matched  string was empty. Because PCRE does just one match
 395        at a time, it cannot reproduce this behaviour.
 396
 397        If all the alternatives of a pattern begin with \G, the  expression  is
 398        anchored to the starting match position, and the "anchored" flag is set
 399        in the compiled regular expression.
 400
 401
 402 CIRCUMFLEX AND DOLLAR
 403
 404        Outside a character class, in the default matching mode, the circumflex
 405        character  is  an  assertion  that is true only if the current matching
 406        point is at the start of the subject string. If the  startoffset  argu-
 407        ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
 408        PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
 409        has an entirely different meaning (see below).
 410
 411        Circumflex  need  not be the first character of the pattern if a number
 412        of alternatives are involved, but it should be the first thing in  each
 413        alternative  in  which  it appears if the pattern is ever to match that
 414        branch. If all possible alternatives start with a circumflex, that  is,
 415        if  the  pattern  is constrained to match only at the start of the sub-
 416        ject, it is said to be an "anchored" pattern.  (There  are  also  other
 417        constructs that can cause a pattern to be anchored.)
 418
 419        A  dollar  character  is  an assertion that is true only if the current
 420        matching point is at the end of  the  subject  string,  or  immediately
 421        before a newline character that is the last character in the string (by
 422        default). Dollar need not be the last character of  the  pattern  if  a
 423        number  of alternatives are involved, but it should be the last item in
 424        any branch in which it appears.  Dollar has no  special  meaning  in  a
 425        character class.
 426
 427        The  meaning  of  dollar  can be changed so that it matches only at the
 428        very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
 429        compile time. This does not affect the \Z assertion.
 430
 431        The meanings of the circumflex and dollar characters are changed if the
 432        PCRE_MULTILINE option is set. When this is the case, they match immedi-
 433        ately  after  and  immediately  before  an  internal newline character,
 434        respectively, in addition to matching at the start and end of the  sub-
 435        ject  string.  For  example,  the  pattern  /^abc$/ matches the subject
 436        string "def\nabc" (where \n represents a newline character)  in  multi-
 437        line mode, but not otherwise.  Consequently, patterns that are anchored
 438        in single line mode because all branches start with ^ are not  anchored
 439        in  multiline  mode,  and  a  match for circumflex is possible when the
 440        startoffset  argument  of  pcre_exec()  is  non-zero.   The   PCRE_DOL-
 441        LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
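
            The /^abc$/ example, and the interaction with a non-zero start
            offset, can be tried in Python, whose ^ and $ behave in the same way
            in these cases. The pos argument of search() is used as an assumed
            analogue of startoffset.

```python
import re

subject = "def\nabc"

single = re.search(r"^abc$", subject)
multi = re.search(r"^abc$", subject, re.MULTILINE)

# Searching from offset 4 (just after the newline): ^ can match there
# only in multiline mode.
from_offset_single = re.compile(r"^abc").search(subject, 4)
from_offset_multi = re.compile(r"^abc", re.MULTILINE).search(subject, 4)

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

print(single, multi, from_offset_single, from_offset_multi)
```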
 442
 443        Note  that  the sequences \A, \Z, and \z can be used to match the start
 444        and end of the subject in both modes, and if all branches of a  pattern
 445        start  with  \A it is always anchored, whether PCRE_MULTILINE is set or
 446        not.
 447
 448
 449 FULL STOP (PERIOD, DOT)
 450
 451        Outside a character class, a dot in the pattern matches any one charac-
 452        ter  in  the  subject,  including a non-printing character, but not (by
 453        default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,
 454        which might be more than one byte long, except (by default) newline. If
 455        the PCRE_DOTALL option is set, dots match newlines as  well.  The  han-
 456        dling  of dot is entirely independent of the handling of circumflex and
 457        dollar, the only relationship being  that  they  both  involve  newline
 458        characters. Dot has no special meaning in a character class.
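
            A minimal check of these rules, using Python's re module as an
            assumed stand-in (re.DOTALL corresponds to PCRE_DOTALL):

```python
import re

subject = "a\nb"

default_dot = re.fullmatch(r"a.b", subject)             # dot refuses newline
dotall_dot = re.fullmatch(r"a.b", subject, re.DOTALL)   # dot accepts newline
non_printing = re.fullmatch(r"a.b", "a\x01b")           # any other character
dot_in_class = re.fullmatch(r"[.]", "x")                # literal dot only

print(default_dot, dotall_dot, non_printing, dot_in_class)
```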
 459
 460
 461 MATCHING A SINGLE BYTE
 462
 463        Outside a character class, the escape sequence \C matches any one byte,
 464        both in and out of UTF-8 mode. Unlike a dot, it can  match  a  newline.
 465        The  feature  is provided in Perl in order to match individual bytes in
 466        UTF-8 mode. Because it  breaks  up  UTF-8  characters  into  individual
 467        bytes,  what remains in the string may be a malformed UTF-8 string. For
 468        this reason, the \C escape sequence is best avoided.
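
            The risk can be seen without a regex engine at all. The sketch below
            splits the UTF-8 encoding of one character after its first byte,
            which is what matching that byte with \C would do, and then checks
            what is left. is_valid_utf8() is a hypothetical helper.

```python
def is_valid_utf8(data):
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

encoded = "\u00e9".encode("utf-8")      # e with acute accent: two bytes
first_byte, remainder = encoded[:1], encoded[1:]

print(len(encoded), is_valid_utf8(encoded), is_valid_utf8(remainder))
```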
 469
 470        PCRE does not allow \C to appear in  lookbehind  assertions  (described
 471        below),  because  in UTF-8 mode this would make it impossible to calcu-
 472        late the length of the lookbehind.
 473
 474
 475 SQUARE BRACKETS AND CHARACTER CLASSES
 476
 477        An opening square bracket introduces a character class, terminated by a
 478        closing square bracket. A closing square bracket on its own is not spe-
 479        cial. If a closing square bracket is required as a member of the class,
 480        it  should  be  the first data character in the class (after an initial
 481        circumflex, if present) or escaped with a backslash.
 482
 483        A character class matches a single character in the subject.  In  UTF-8
 484        mode,  the character may occupy more than one byte. A matched character
 485        must be in the set of characters defined by the class, unless the first
 486        character  in  the  class definition is a circumflex, in which case the
 487        subject character must not be in the set defined by  the  class.  If  a
 488        circumflex  is actually required as a member of the class, ensure it is
 489        not the first character, or escape it with a backslash.
 490
 491        For example, the character class [aeiou] matches any lower case  vowel,
 492        while  [^aeiou]  matches  any character that is not a lower case vowel.
 493        Note that a circumflex is just a convenient notation for specifying the
 494        characters  that  are in the class by enumerating those that are not. A
 495        class that starts with a circumflex is not an assertion: it still  con-
 496        sumes  a  character  from the subject string, and therefore it fails if
 497        the current pointer is at the end of the string.
 498
 499        In UTF-8 mode, characters with values greater than 255 can be  included
 500        in  a  class as a literal string of bytes, or by using the \x{ escaping
 501        mechanism.
 502
 503        When caseless matching is set, any letters in a  class  represent  both
 504        their  upper  case  and lower case versions, so for example, a caseless
 505        [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
 506        match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
 507        understands the concept of case for characters whose  values  are  less
 508        than  128, so caseless matching is always possible. For characters with
 509        higher values, the concept of case is supported  if  PCRE  is  compiled
 510        with  Unicode  property support, but not otherwise.  If you want to use
 511        caseless matching for characters 128 and above, you  must  ensure  that
 512        PCRE  is  compiled  with Unicode property support as well as with UTF-8
 513        support.
 514
 515        The newline character is never treated in any special way in  character
 516        classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE
 517        options is. A class such as [^a] will always match a newline.
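
            The class behaviour described in the preceding paragraphs can be
            checked as follows. Python's re module is used as an assumed
            stand-in; it agrees with PCRE on each of these points.

```python
import re

# A negated class still has to consume a character.
at_end = re.search(r"x[^aeiou]", "x")

# Caseless matching applies to the letters in a class.
caseless_pos = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)
caseless_neg = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)
caseful_neg = re.fullmatch(r"[^aeiou]", "A")

# Newline gets no special treatment inside a class.
newline = re.fullmatch(r"[^a]", "\n")

# A closing bracket placed first, and a circumflex placed later, are data.
bracket_first = re.fullmatch(r"[]x]", "]")
circumflex_later = re.fullmatch(r"[x^]", "^")

print(at_end, caseless_pos, caseless_neg, caseful_neg)
```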
 518
 519        The minus (hyphen) character can be used to specify a range of  charac-
 520        ters  in  a  character  class.  For  example,  [d-m] matches any letter
 521        between d and m, inclusive. If a  minus  character  is  required  in  a
 522        class,  it  must  be  escaped  with a backslash or appear in a position
 523        where it cannot be interpreted as indicating a range, typically as  the
 524        first or last character in the class.
 525
 526        It is not possible to have the literal character "]" as the end charac-
 527        ter of a range. A pattern such as [W-]46] is interpreted as a class  of
 528        two  characters ("W" and "-") followed by a literal string "46]", so it
 529        would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
 530        backslash  it is interpreted as the end of range, so [W-\]46] is inter-
 531        preted as a class containing a range followed by two other  characters.
 532        The  octal or hexadecimal representation of "]" can also be used to end
 533        a range.
 534
 535        Ranges operate in the collating sequence of character values. They  can
 536        also   be  used  for  characters  specified  numerically,  for  example
 537        [\000-\037]. In UTF-8 mode, ranges can include characters whose  values
 538        are greater than 255, for example [\x{100}-\x{2ff}].
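
            The range rules above can be explored by listing what each class
            matches. The sketch uses Python's re module as an assumed stand-in
            and is limited to ASCII subjects; members() is a hypothetical helper.

```python
import re

def members(pattern):
    """Return the ASCII characters that a one-character pattern matches."""
    rx = re.compile(pattern)
    return {chr(c) for c in range(128) if rx.fullmatch(chr(c))}

simple_range = members(r"[d-m]")
literal_hyphens = members(r"[-a]") | members(r"[a-]") | members(r"[a\-z]")
range_to_bracket = members(r"[W-\]46]")      # W..] plus "4" and "6"
range_via_hex = members(r"[W-\x5d46]")       # hex 5D is "]"
control_codes = members(r"[\000-\037]")

# [W-]46] is the class [W-] followed by the literal string "46]".
class_then_literal = re.compile(r"[W-]46]")

print(sorted(range_to_bracket))
print(bool(class_then_literal.fullmatch("-46]")))
```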
 539
 540        If a range that includes letters is used when caseless matching is set,
 541        it matches the letters in either case. For example, [W-c] is equivalent
 542        to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
 543        character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
 544        accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
 545        concept of case for characters with values greater than 128  only  when
 546        it is compiled with Unicode property support.
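
            The [W-c] equivalence can be confirmed by comparing the two classes
            over the ASCII range. Python's re module with re.IGNORECASE is an
            assumed stand-in for caseless PCRE matching, and members() is a
            hypothetical helper.

```python
import re

ASCII = [chr(c) for c in range(128)]

def members(pattern, flags=0):
    """Return the ASCII characters matched by a one-character pattern."""
    rx = re.compile(pattern, flags)
    return {ch for ch in ASCII if rx.fullmatch(ch)}

caseless_range = members(r"[W-c]", re.IGNORECASE)
# The class given in the text, written here with both brackets escaped.
explicit = members(r"[\]\[\\^_`wxyzabc]", re.IGNORECASE)
caseful_range = members(r"[W-c]")

print(caseless_range == explicit, sorted(caseless_range - caseful_range))
```

            The second value printed lists the letters that the range gains only
            because matching is caseless.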
 547
 548        The  character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
 549        in a character class, and add the characters that  they  match  to  the
  550        class. For example, [\dABCDEF] matches any hexadecimal digit written
  551        with upper case letters. A circumflex can conveniently be used with the
  552        upper case character types to specify a more restricted set of
  553        characters than the matching lower case type. For example, the class
  554        [^\W_] matches any letter or digit, but not underscore.
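
            Both example classes can be tried in Python, used here as an assumed
            stand-in with re.ASCII so that \d and \W keep their ASCII meanings.
            Note that, with caseful matching, [\dABCDEF] does not match the lower
            case letters a to f.

```python
import re

hex_upper = re.compile(r"[\dABCDEF]", re.ASCII)
alnum_only = re.compile(r"[^\W_]", re.ASCII)

print([ch for ch in "7BbG" if hex_upper.fullmatch(ch)])
print([ch for ch in "a9_-" if alnum_only.fullmatch(ch)])
```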
 555
 556        The  only  metacharacters  that are recognized in character classes are
 557        backslash, hyphen (only where it can be  interpreted  as  specifying  a
 558        range),  circumflex  (only  at the start), opening square bracket (only
 559        when it can be interpreted as introducing a POSIX class name - see  the
 560        next  section),  and  the  terminating closing square bracket. However,
 561        escaping other non-alphanumeric characters does no harm.
 562
 563
 564 POSIX CHARACTER CLASSES
 565
 566        Perl supports the POSIX notation for character classes. This uses names
 567        enclosed  by  [: and :] within the enclosing square brackets. PCRE also
 568        supports this notation. For example,
 569
 570          [01[:alpha:]%]
 571
 572        matches "0", "1", any alphabetic character, or "%". The supported class
 573        names are
 574
 575          alnum    letters and digits
 576          alpha    letters
 577          ascii    character codes 0 - 127
 578          blank    space or tab only
 579          cntrl    control characters
 580          digit    decimal digits (same as \d)
 581          graph    printing characters, excluding space
 582          lower    lower case letters
 583          print    printing characters, including space
 584          punct    printing characters, excluding letters and digits
 585          space    white space (not quite the same as \s)
 586          upper    upper case letters
 587          word     "word" characters (same as \w)
 588          xdigit   hexadecimal digits
 589
 590        The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
 591        and space (32). Notice that this list includes the VT  character  (code
 592        11). This makes "space" different to \s, which does not include VT (for
 593        Perl compatibility).
 594
 595        The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
 596        from  Perl  5.8. Another Perl extension is negation, which is indicated
 597        by a ^ character after the colon. For example,
 598
 599          [12[:^digit:]]
 600
 601        matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
 602        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
 603        these are not supported, and an error is given if they are encountered.
 604
 605        In UTF-8 mode, characters with values greater than 128 do not match any
 606        of the POSIX character classes.
 607
 608
 609 VERTICAL BAR
 610
 611        Vertical bar characters are used to separate alternative patterns.  For
 612        example, the pattern
 613
 614          gilbert|sullivan
 615
 616        matches  either "gilbert" or "sullivan". Any number of alternatives may
 617        appear, and an empty  alternative  is  permitted  (matching  the  empty
 618        string).   The  matching  process  tries each alternative in turn, from
 619        left to right, and the first one that succeeds is used. If the alterna-
 620        tives  are within a subpattern (defined below), "succeeds" means match-
 621        ing the rest of the main pattern as well as the alternative in the sub-
 622        pattern.
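
            The left-to-right rule can be observed with a short sketch. It
            assumes Python's re module, whose alternation is also tried in
            order; the variable names are illustrative.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
# At top level the first alternative that succeeds is used.
first_wins = re.match(r'a|ab', 'ab').group()

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern, so the second alternative is chosen here.
in_group = re.match(r'(a|ab)c', 'abc').group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r'gilbert|sullivan|', '')
```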
 623
 624
 625 INTERNAL OPTION SETTING
 626
 627        The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
 628        PCRE_EXTENDED options can be changed  from  within  the  pattern  by  a
 629        sequence  of  Perl  option  letters  enclosed between "(?" and ")". The
 630        option letters are
 631
 632          i  for PCRE_CASELESS
 633          m  for PCRE_MULTILINE
 634          s  for PCRE_DOTALL
 635          x  for PCRE_EXTENDED
 636
 637        For example, (?im) sets caseless, multiline matching. It is also possi-
 638        ble to unset these options by preceding the letter with a hyphen, and a
 639        combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
 640        LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
 641        is also permitted. If a  letter  appears  both  before  and  after  the
 642        hyphen, the option is unset.
 643
 644        When  an option change occurs at top level (that is, not inside subpat-
 645        tern parentheses), the change applies to the remainder of  the  pattern
 646        that follows.  If the change is placed right at the start of a pattern,
 647        PCRE extracts it into the global options (and it will therefore show up
 648        in data extracted by the pcre_fullinfo() function).
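
            Python's re module shows a comparable effect, sketched below as
            an analogy rather than as PCRE's API: option letters at the start
            of a pattern become flags of the compiled pattern. The flags
            attribute is assumed here to play the part of the options
            reported by pcre_fullinfo().

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
compiled = re.compile(r'(?im)^abc$')

# The leading (?im) shows up in the compiled pattern's global flags.
is_caseless = bool(compiled.flags & re.IGNORECASE)
is_multiline = bool(compiled.flags & re.MULTILINE)

# Caseless, multiline matching: "ABC" matches on the second line.
found = compiled.search('xyz\nABC\n').group()
```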
 649
 650        An option change within a subpattern affects only that part of the cur-
 651        rent pattern that follows it, so
 652
 653          (a(?i)b)c
 654
 655        matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
 656        used).   By  this means, options can be made to have different settings
 657        in different parts of the pattern. Any changes made in one  alternative
 658        do  carry  on  into subsequent branches within the same subpattern. For
 659        example,
 660
 661          (a(?i)b|c)
 662
 663        matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
 664        first  branch  is  abandoned before the option setting. This is because
 665        the effects of option settings happen at compile time. There  would  be
 666        some very weird behaviour otherwise.
 667
 668        The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed
 669        in the same way as the Perl-compatible options by using the  characters
 670        U  and X respectively. The (?X) flag setting is special in that it must
 671        always occur earlier in the pattern than any of the additional features
 672        it  turns on, even when it is at top level. It is best to put it at the
 673        start.
 674
 675
 676 SUBPATTERNS
 677
 678        Subpatterns are delimited by parentheses (round brackets), which can be
 679        nested.  Turning part of a pattern into a subpattern does two things:
 680
 681        1. It localizes a set of alternatives. For example, the pattern
 682
 683          cat(aract|erpillar|)
 684
 685        matches  one  of the words "cat", "cataract", or "caterpillar". Without
 686        the parentheses, it would match "cataract",  "erpillar"  or  the  empty
 687        string.
 688
 689        2.  It  sets  up  the  subpattern as a capturing subpattern. This means
 690        that, when the whole pattern  matches,  that  portion  of  the  subject
 691        string that matched the subpattern is passed back to the caller via the
 692        ovector argument of pcre_exec(). Opening parentheses are  counted  from
 693        left  to  right  (starting  from 1) to obtain numbers for the capturing
 694        subpatterns.
 695
 696        For example, if the string "the red king" is matched against  the  pat-
 697        tern
 698
 699          the ((red|white) (king|queen))
 700
 701        the captured substrings are "red king", "red", and "king", and are num-
 702        bered 1, 2, and 3, respectively.
 703
 704        The fact that plain parentheses fulfil  two  functions  is  not  always
 705        helpful.   There are often times when a grouping subpattern is required
 706        without a capturing requirement. If an opening parenthesis is  followed
 707        by  a question mark and a colon, the subpattern does not do any captur-
 708        ing, and is not counted when computing the  number  of  any  subsequent
 709        capturing  subpatterns. For example, if the string "the white queen" is
 710        matched against the pattern
 711
 712          the ((?:red|white) (king|queen))
 713
 714        the captured substrings are "white queen" and "queen", and are numbered
 715        1  and 2. The maximum number of capturing subpatterns is 65535, and the
 716        maximum depth of nesting of all subpatterns, both  capturing  and  non-
 717        capturing, is 200.
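
            The numbering of capturing and non-capturing parentheses can be
            checked with a small sketch. It assumes Python's re module, which
            counts opening parentheses in the same way; the variable names
            are illustrative.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
capturing = re.match(r'the ((red|white) (king|queen))', 'the red king')
three_groups = capturing.groups()

# (?: ... ) groups without capturing and is skipped in the numbering.
mixed = re.match(r'the ((?:red|white) (king|queen))', 'the white queen')
two_groups = mixed.groups()
```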
 718
 719        As  a  convenient shorthand, if any option settings are required at the
 720        start of a non-capturing subpattern,  the  option  letters  may  appear
 721        between the "?" and the ":". Thus the two patterns
 722
 723          (?i:saturday|sunday)
 724          (?:(?i)saturday|sunday)
 725
 726        match exactly the same set of strings. Because alternative branches are
 727        tried from left to right, and options are not reset until  the  end  of
 728        the  subpattern is reached, an option setting in one branch does affect
 729        subsequent branches, so the above patterns match "SUNDAY"  as  well  as
 730        "Saturday".
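
            A sketch of the shorthand form, assuming Python's re module.
            Python 3.11 and later reject an option setting such as (?i) that
            is not at the start of the pattern, so only the (?i: form is
            exercised here.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
weekend = re.compile(r'(?i:saturday|sunday)')

candidates = ('Saturday', 'SUNDAY', 'sunday', 'monday')
accepted = [s for s in candidates if weekend.fullmatch(s)]

# The option applies only inside the group that sets it.
inside_only = re.fullmatch(r'(?i:a)b', 'Ab') is not None
outside_unaffected = re.fullmatch(r'(?i:a)b', 'AB') is None
```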
 731
 732
 733 NAMED SUBPATTERNS
 734
 735        Identifying  capturing  parentheses  by number is simple, but it can be
 736        very hard to keep track of the numbers in complicated  regular  expres-
 737        sions.  Furthermore,  if  an  expression  is  modified, the numbers may
 738        change. To help with this difficulty, PCRE supports the naming of  sub-
 739        patterns,  something  that  Perl  does  not  provide. The Python syntax
 740        (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
 741        underscores, and must be unique within a pattern.
 742
 743        Named  capturing  parentheses  are  still  allocated numbers as well as
 744        names. The PCRE API provides function calls for extracting the name-to-
 745        number  translation table from a compiled pattern. There is also a con-
 746        venience function for extracting a captured substring by name. For fur-
 747        ther details see the pcreapi documentation.
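
            Because the syntax is borrowed from Python, the behaviour can be
            sketched with Python's own re module. The groupindex attribute is
            Python's name-to-number table and is only an analogue of the
            table exposed by the PCRE API; the group names are hypothetical.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')
m = date.match('2005-06')

by_name = m.group('year')      # extraction by name
by_number = m.group(1)         # the same group still has a number
name_table = dict(date.groupindex)

# Names must be unique within a pattern.
try:
    re.compile(r'(?P<n>a)(?P<n>b)')
    duplicate_rejected = False
except re.error:
    duplicate_rejected = True
```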
 748
 749
 750 REPETITION
 751
 752        Repetition  is  specified  by  quantifiers, which can follow any of the
 753        following items:
 754
 755          a literal data character
 756          the . metacharacter
 757          the \C escape sequence
 758          the \X escape sequence (in UTF-8 mode with Unicode properties)
 759          an escape such as \d that matches a single character
 760          a character class
 761          a back reference (see next section)
 762          a parenthesized subpattern (unless it is an assertion)
 763
 764        The general repetition quantifier specifies a minimum and maximum  num-
 765        ber  of  permitted matches, by giving the two numbers in curly brackets
 766        (braces), separated by a comma. The numbers must be  less  than  65536,
 767        and the first must be less than or equal to the second. For example:
 768
 769          z{2,4}
 770
 771        matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
 772        special character. If the second number is omitted, but  the  comma  is
 773        present,  there  is  no upper limit; if the second number and the comma
 774        are both omitted, the quantifier specifies an exact number of  required
 775        matches. Thus
 776
 777          [aeiou]{3,}
 778
 779        matches at least 3 successive vowels, but may match many more, while
 780
 781          \d{8}
 782
 783        matches  exactly  8  digits. An opening curly bracket that appears in a
 784        position where a quantifier is not allowed, or one that does not  match
 785        the  syntax of a quantifier, is taken as a literal character. For exam-
 786        ple, {,6} is not a quantifier, but a literal string of four characters.
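
            The three quantifier examples above can be sketched with Python's
            re module, assumed here as a stand-in for PCRE. Python differs
            from PCRE on {,6}, which it accepts as a quantifier, so that case
            is deliberately left out. The subject strings are illustrative.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
runs = ('z', 'zz', 'zzz', 'zzzz', 'zzzzz')
between_two_and_four = [s for s in runs if re.fullmatch(r'z{2,4}', s)]

# No upper limit: at least three vowels, and as many more as follow.
vowel_run = re.search(r'[aeiou]{3,}', 'queueing').group()

# Exact count: eight digits and nothing else.
exactly_eight = re.fullmatch(r'\d{8}', '20050607') is not None
too_short = re.fullmatch(r'\d{8}', '2005060') is None
```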
 787
 788        In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
 789        individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
 790        acters, each of which is represented by a two-byte sequence. Similarly,
 791        when Unicode property support is available, \X{3} matches three Unicode
 792        extended  sequences,  each of which may be several bytes long (and they
 793        may be of different lengths).
 794
 795        The quantifier {0} is permitted, causing the expression to behave as if
 796        the previous item and the quantifier were not present.
 797
 798        For  convenience  (and  historical compatibility) the three most common
 799        quantifiers have single-character abbreviations:
 800
 801          *    is equivalent to {0,}
 802          +    is equivalent to {1,}
 803          ?    is equivalent to {0,1}
 804
 805        It is possible to construct infinite loops by  following  a  subpattern
 806        that can match no characters with a quantifier that has no upper limit,
 807        for example:
 808
 809          (a?)*
 810
 811        Earlier versions of Perl and PCRE used to give an error at compile time
 812        for  such  patterns. However, because there are cases where this can be
 813        useful, such patterns are now accepted, but if any  repetition  of  the
 814        subpattern  does in fact match no characters, the loop is forcibly bro-
 815        ken.
 816
 817        By default, the quantifiers are "greedy", that is, they match  as  much
 818        as  possible  (up  to  the  maximum number of permitted times), without
 819        causing the rest of the pattern to fail. The classic example  of  where
 820        this gives problems is in trying to match comments in C programs. These
 821        appear between /* and */ and within the comment,  individual  *  and  /
 822        characters  may  appear. An attempt to match C comments by applying the
 823        pattern
 824
 825          /\*.*\*/
 826
 827        to the string
 828
 829          /* first comment */  not comment  /* second comment */
 830
 831        fails, because it matches the entire string owing to the greediness  of
 832        the .*  item.
 833
 834        However,  if  a quantifier is followed by a question mark, it ceases to
 835        be greedy, and instead matches the minimum number of times possible, so
 836        the pattern
 837
 838          /\*.*?\*/
 839
 840        does  the  right  thing with the C comments. The meaning of the various
 841        quantifiers is not otherwise changed,  just  the  preferred  number  of
 842        matches.   Do  not  confuse this use of question mark with its use as a
 843        quantifier in its own right. Because it has two uses, it can  sometimes
 844        appear doubled, as in
 845
 846          \d??\d
 847
 848        which matches one digit by preference, but can match two if that is the
 849        only way the rest of the pattern matches.
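
            The difference between greedy and minimizing quantifiers can be
            seen in a short sketch, assuming Python's re module behaves like
            PCRE for these patterns. The variable names are illustrative.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
subject = '/* first comment */  not comment  /* second comment */'

# Greedy .* runs to the last "*/" in the subject.
greedy = re.search(r'/\*.*\*/', subject).group()
# Minimizing .*? stops at the first "*/".
lazy = re.search(r'/\*.*?\*/', subject).group()
comments = re.findall(r'/\*.*?\*/', subject)

# \d??\d prefers one digit, but takes two when the rest requires it.
one_digit = re.match(r'\d??\d', '12').group()
two_digits = re.fullmatch(r'\d??\d', '12').group()
```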
 850
 851        If the PCRE_UNGREEDY option is set (an option which is not available in
 852        Perl),  the  quantifiers are not greedy by default, but individual ones
 853        can be made greedy by following them with a  question  mark.  In  other
 854        words, it inverts the default behaviour.
 855
 856        When  a  parenthesized  subpattern  is quantified with a minimum repeat
 857        count that is greater than 1 or with a limited maximum, more memory  is
 858        required  for  the  compiled  pattern, in proportion to the size of the
 859        minimum or maximum.
 860
 861        If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
 862        alent  to Perl's /s) is set, thus allowing the . to match newlines, the
 863        pattern is implicitly anchored, because whatever follows will be  tried
 864        against  every character position in the subject string, so there is no
 865        point in retrying the overall match at any position  after  the  first.
 866        PCRE normally treats such a pattern as though it were preceded by \A.
 867
 868        In  cases  where  it  is known that the subject string contains no new-
 869        lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
 870        mization, or alternatively using ^ to indicate anchoring explicitly.
 871
 872        However,  there is one situation where the optimization cannot be used.
 873        When .*  is inside capturing parentheses that  are  the  subject  of  a
 874        backreference  elsewhere in the pattern, a match at the start may fail,
 875        and a later one succeed. Consider, for example:
 876
 877          (.*)abc\1
 878
 879        If the subject is "xyz123abc123" the match point is the fourth  charac-
 880        ter. For this reason, such a pattern is not implicitly anchored.
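
            A sketch of this case, assuming Python's re module as a stand-in
            for PCRE: the search has to move past the start of the subject
            before the back reference can be satisfied.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
m = re.search(r'(.*)abc\1', 'xyz123abc123')

match_offset = m.start()     # zero-based offset of the match point
captured = m.group(1)
```

            An offset of 3 corresponds to the fourth character.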
 881
 882        When a capturing subpattern is repeated, the value captured is the sub-
 883        string that matched the final iteration. For example, after
 884
 885          (tweedle[dume]{3}\s*)+
 886
 887        has matched "tweedledum tweedledee" the value of the captured substring
 888        is  "tweedledee".  However,  if there are nested capturing subpatterns,
 889        the corresponding captured values may have been set in previous  itera-
 890        tions. For example, after
 891
 892          /(a|(b))+/
 893
 894        matches "aba" the value of the second captured substring is "b".
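
            Both cases can be sketched with Python's re module, which is
            assumed here to keep captured values across iterations in the
            same way as PCRE. The variable names are illustrative.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
repeated = re.match(r'(tweedle[dume]{3}\s*)+', 'tweedledum tweedledee')
final_iteration = repeated.group(1)

# The inner group was set during the second iteration and keeps its
# value even though the third iteration matched "a".
nested = re.match(r'(a|(b))+', 'aba')
outer_value, inner_value = nested.groups()
```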
 895
 896
 897 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
 898
 899        With both maximizing and minimizing repetition, failure of what follows
 900        normally causes the repeated item to be re-evaluated to see if  a  dif-
 901        ferent number of repeats allows the rest of the pattern to match. Some-
 902        times it is useful to prevent this, either to change the nature of  the
 903        match, or to cause it to fail earlier than it otherwise might, when the
 904        author of the pattern knows there is no point in carrying on.
 905
 906        Consider, for example, the pattern \d+foo when applied to  the  subject
 907        line
 908
 909          123456bar
 910
 911        After matching all 6 digits and then failing to match "foo", the normal
 912        action of the matcher is to try again with only 5 digits  matching  the
 913        \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
 914        "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
 915        the  means for specifying that once a subpattern has matched, it is not
 916        to be re-evaluated in this way.
 917
 918        If we use atomic grouping for the previous example, the  matcher  would
 919        give up immediately on failing to match "foo" the first time. The nota-
 920        tion is a kind of special parenthesis, starting with  (?>  as  in  this
 921        example:
 922
 923          (?>\d+)foo
 924
 925        This  kind  of  parenthesis "locks up" the  part of the pattern it con-
 926        tains once it has matched, and a failure further into  the  pattern  is
 927        prevented  from  backtracking into it. Backtracking past it to previous
 928        items, however, works as normal.
 929
 930        An alternative description is that a subpattern of  this  type  matches
 931        the  string  of  characters  that an identical standalone pattern would
 932        match, if anchored at the current point in the subject string.
 933
 934        Atomic grouping subpatterns are not capturing subpatterns. Simple cases
 935        such as the above example can be thought of as a maximizing repeat that
 936        must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
 937        pared  to  adjust  the number of digits they match in order to make the
 938        rest of the pattern match, (?>\d+) can only match an entire sequence of
 939        digits.
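
            The effect of an atomic group can be sketched in Python, under
            two assumptions: Python's re module stands in for PCRE, and
            because it accepts (?>...) only from version 3.11 onwards, older
            versions fall back to an emulation that uses a lookahead followed
            by a back reference. That emulation is a substitute technique,
            not PCRE syntax.

```python
import re
import sys

# Assumption: Python's re module approximates PCRE for this feature.
if sys.version_info >= (3, 11):
    atomic_digits = r'(?>\d+)'
else:
    # Emulation: capture inside a lookahead, then consume the capture.
    atomic_digits = r'(?=(\d+))\1'

# An ordinary \d+ gives back one digit so that the final \d can match.
plain = re.search(r'\d+\d', '123456')

# The atomic form swallows every digit and never gives any back.
atomic = re.search(atomic_digits + r'\d', '123456')

# When the rest of the pattern matches, the result is unchanged.
with_foo = re.search(atomic_digits + r'foo', '123456foo')
```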
 940
 941        Atomic  groups in general can of course contain arbitrarily complicated
 942        subpatterns, and can be nested. However, when  the  subpattern  for  an
 943        atomic group is just a single repeated item, as in the example above, a
 944        simpler notation, called a "possessive quantifier" can  be  used.  This
 945        consists  of  an  additional  + character following a quantifier. Using
 946        this notation, the previous example can be rewritten as
 947
 948          \d++foo
 949
 950        Possessive  quantifiers  are  always  greedy;  the   setting   of   the
 951        PCRE_UNGREEDY option is ignored. They are a convenient notation for the
 952        simpler forms of atomic group. However, there is no difference  in  the
 953        meaning  or  processing  of  a possessive quantifier and the equivalent
 954        atomic group.
 955
 956        The possessive quantifier syntax is an extension to the Perl syntax. It
 957        originates in Sun's Java package.
 958
 959        When  a  pattern  contains an unlimited repeat inside a subpattern that
 960        can itself be repeated an unlimited number of  times,  the  use  of  an
 961        atomic  group  is  the  only way to avoid some failing matches taking a
 962        very long time indeed. The pattern
 963
 964          (\D+|<\d+>)*[!?]
 965
 966        matches an unlimited number of substrings that either consist  of  non-
 967        digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
 968        matches, it runs quickly. However, if it is applied to
 969
 970          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 971
 972        it takes a long time before reporting  failure.  This  is  because  the
 973        string  can be divided between the internal \D+ repeat and the external
 974        * repeat in a large number of ways, and all  have  to  be  tried.  (The
 975        example  uses  [!?]  rather than a single character at the end, because
 976        both PCRE and Perl have an optimization that allows  for  fast  failure
 977        when  a single character is used. They remember the last single charac-
 978        ter that is required for a match, and fail early if it is  not  present
 979        in  the  string.)  If  the pattern is changed so that it uses an atomic
 980        group, like this:
 981
 982          ((?>\D+)|<\d+>)*[!?]
 983
 984        sequences of non-digits cannot be broken, and failure happens  quickly.
 985
 986
 987 BACK REFERENCES
 988
 989        Outside a character class, a backslash followed by a digit greater than
 990        0 (and possibly further digits) is a back reference to a capturing sub-
 991        pattern  earlier  (that is, to its left) in the pattern, provided there
 992        have been that many previous capturing left parentheses.
 993
 994        However, if the decimal number following the backslash is less than 10,
 995        it  is  always  taken  as a back reference, and causes an error only if
 996        there are not that many capturing left parentheses in the  entire  pat-
 997        tern.  In  other words, the parentheses that are referenced need not be
 998        to the left of the reference for numbers less than 10. See the  subsec-
 999        tion  entitled  "Non-printing  characters" above for further details of
1000        the handling of digits following a backslash.
1001
1002        A back reference matches whatever actually matched the  capturing  sub-
1003        pattern  in  the  current subject string, rather than anything matching
1004        the subpattern itself (see "Subpatterns as subroutines" below for a way
1005        of doing that). So the pattern
1006
1007          (sens|respons)e and \1ibility
1008
1009        matches  "sense and sensibility" and "response and responsibility", but
1010        not "sense and responsibility". If caseful matching is in force at  the
1011        time  of the back reference, the case of letters is relevant. For exam-
1012        ple,
1013
1014          ((?i)rah)\s+\1
1015
1016        matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
1017        original capturing subpattern is matched caselessly.
1018
1019        Back  references  to named subpatterns use the Python syntax (?P=name).
1020        We could rewrite the above example as follows:
1021
1022          (?P<p1>(?i)rah)\s+(?P=p1)
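
            Numbered and named back references can be compared in a short
            sketch. It assumes Python's re module, from which the named
            syntax is taken; the group name "stem" is hypothetical.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
numbered = re.compile(r'(sens|respons)e and \1ibility')
named = re.compile(r'(?P<stem>sens|respons)e and (?P=stem)ibility')

subjects = [
    'sense and sensibility',
    'response and responsibility',
    'sense and responsibility',
]

# A back reference repeats the text that was captured, not the
# subpattern, so the mixed third subject is rejected by both forms.
numbered_hits = [s for s in subjects if numbered.fullmatch(s)]
named_hits = [s for s in subjects if named.fullmatch(s)]
```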
1023
1024        There may be more than one back reference to the same subpattern. If  a
1025        subpattern  has  not actually been used in a particular match, any back
1026        references to it always fail. For example, the pattern
1027
1028          (a|(bc))\2
1029
1030        always fails if it starts to match "a" rather than "bc". Because  there
1031        may  be  many  capturing parentheses in a pattern, all digits following
1032        the backslash are taken as part of a potential back  reference  number.
1033        If the pattern continues with a digit character, some delimiter must be
1034        used to terminate the back reference. If the  PCRE_EXTENDED  option  is
1035        set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
1036        ments" below) can be used.
1037
1038        A back reference that occurs inside the parentheses to which it  refers
1039        fails  when  the subpattern is first used, so, for example, (a\1) never
1040        matches.  However, such references can be useful inside  repeated  sub-
1041        patterns. For example, the pattern
1042
1043          (a|b\1)+
1044
1045        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1046        ation of the subpattern,  the  back  reference  matches  the  character
1047        string  corresponding  to  the previous iteration. In order for this to
1048        work, the pattern must be such that the first iteration does  not  need
1049        to  match the back reference. This can be done using alternation, as in
1050        the example above, or by a quantifier with a minimum of zero.
1051
1052
1053 ASSERTIONS
1054
1055        An assertion is a test on the characters  following  or  preceding  the
1056        current  matching  point that does not actually consume any characters.
1057        The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
1058        described above.
1059
1060        More  complicated  assertions  are  coded as subpatterns. There are two
1061        kinds: those that look ahead of the current  position  in  the  subject
1062        string,  and  those  that  look  behind  it. An assertion subpattern is
1063        matched in the normal way, except that it does not  cause  the  current
1064        matching position to be changed.
1065
1066        Assertion  subpatterns  are  not  capturing subpatterns, and may not be
1067        repeated, because it makes no sense to assert the  same  thing  several
1068        times.  If  any kind of assertion contains capturing subpatterns within
1069        it, these are counted for the purposes of numbering the capturing  sub-
1070        patterns in the whole pattern.  However, substring capturing is carried
1071        out only for positive assertions, because it does not  make  sense  for
1072        negative assertions.
1073
1074    Lookahead assertions
1075
1076        Lookahead assertions start with (?= for positive assertions and (?! for
1077        negative assertions. For example,
1078
1079          \w+(?=;)
1080
1081        matches a word followed by a semicolon, but does not include the  semi-
1082        colon in the match, and
1083
1084          foo(?!bar)
1085
1086        matches  any  occurrence  of  "foo" that is not followed by "bar". Note
1087        that the apparently similar pattern
1088
1089          (?!foo)bar
1090
1091        does not find an occurrence of "bar"  that  is  preceded  by  something
1092        other  than "foo"; it finds any occurrence of "bar" whatsoever, because
1093        the assertion (?!foo) is always true when the next three characters are
1094        "bar". A lookbehind assertion is needed to achieve the other effect.
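
            The three lookahead patterns above can be sketched with Python's
            re module, assumed here to behave like PCRE for these cases. The
            subject strings are illustrative.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
# The semicolon is required but is not part of the match.
word = re.search(r'\w+(?=;)', 'foo bar; baz').group()

# The first "foo" is followed by "bar", so the second one is found.
foo_offset = re.search(r'foo(?!bar)', 'foobar foobaz').start()

# (?!foo)bar still finds "bar" directly after "foo".
bar_offset = re.search(r'(?!foo)bar', 'foobar').start()
```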
1095
1096        If you want to force a matching failure at some point in a pattern, the
1097        most convenient way to do it is  with  (?!)  because  an  empty  string
1098        always  matches, so an assertion that requires there not to be an empty
1099        string must always fail.
1100
1101    Lookbehind assertions
1102
1103        Lookbehind assertions start with (?<= for positive assertions and  (?<!
1104        for negative assertions. For example,
1105
1106          (?<!foo)bar
1107
1108        does  find  an  occurrence  of "bar" that is not preceded by "foo". The
1109        contents of a lookbehind assertion are restricted  such  that  all  the
1110        strings it matches must have a fixed length. However, if there are sev-
1111        eral alternatives, they do not all have to have the same fixed  length.
1112        Thus
1113
1114          (?<=bullock|donkey)
1115
1116        is permitted, but
1117
1118          (?<!dogs?|cats?)
1119
1120        causes  an  error at compile time. Branches that match different length
1121        strings are permitted only at the top level of a lookbehind  assertion.
1122        This  is  an  extension  compared  with  Perl (at least for 5.8), which
1123        requires all branches to match the same length of string. An  assertion
1124        such as
1125
1126          (?<=ab(c|de))
1127
1128        is  not  permitted,  because  its single top-level branch can match two
1129        different lengths, but it is acceptable if rewritten to  use  two  top-
1130        level branches:
1131
1132          (?<=abc|abde)
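
            A sketch of the fixed-length rule, assuming Python's re module as
            a stand-in. Python is stricter than PCRE: it also rejects
            top-level alternatives of different lengths such as
            (?<=bullock|donkey), so only the cases on which both agree are
            exercised here.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
not_after_foo = re.compile(r'(?<!foo)bar')

rejected = not_after_foo.search('foobar') is None
accepted_offset = not_after_foo.search('bazbar').start()

# A branch that can match two different lengths is a compile-time error.
try:
    re.compile(r'(?<!dogs?|cats?)')
    variable_length_rejected = False
except re.error:
    variable_length_rejected = True
```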
1133
1134        The  implementation  of lookbehind assertions is, for each alternative,
1135        to temporarily move the current position back by the  fixed  width  and
1136        then try to match. If there are insufficient characters before the cur-
1137        rent position, the match is deemed to fail.
1138
1139        PCRE does not allow the \C escape (which matches a single byte in UTF-8
1140        mode)  to appear in lookbehind assertions, because it makes it impossi-
1141        ble to calculate the length of the lookbehind. The \X escape, which can
1142        match different numbers of bytes, is also not permitted.
1143
1144        Atomic  groups can be used in conjunction with lookbehind assertions to
1145        specify efficient matching at the end of the subject string. Consider a
1146        simple pattern such as
1147
1148          abcd$
1149
1150        when  applied  to  a  long string that does not match. Because matching
1151        proceeds from left to right, PCRE will look for each "a" in the subject
1152        and  then  see  if what follows matches the rest of the pattern. If the
1153        pattern is specified as
1154
1155          ^.*abcd$
1156
1157        the initial .* matches the entire string at first, but when this  fails
1158        (because there is no following "a"), it backtracks to match all but the
1159        last character, then all but the last two characters, and so  on.  Once
1160        again  the search for "a" covers the entire string, from right to left,
1161        so we are no better off. However, if the pattern is written as
1162
1163          ^(?>.*)(?<=abcd)
1164
1165        or, equivalently, using the possessive quantifier syntax,
1166
1167          ^.*+(?<=abcd)
1168
1169        there can be no backtracking for the .* item; it  can  match  only  the
1170        entire  string.  The subsequent lookbehind assertion does a single test
1171        on the last four characters. If it fails, the match fails  immediately.
1172        For  long  strings, this approach makes a significant difference to the
1173        processing time.
1174
1175    Using multiple assertions
1176
1177        Several assertions (of any sort) may occur in succession. For example,
1178
1179          (?<=\d{3})(?<!999)foo
1180
1181        matches "foo" preceded by three digits that are not "999". Notice  that
1182        each  of  the  assertions is applied independently at the same point in
1183        the subject string. First there is a  check  that  the  previous  three
1184        characters  are  all  digits,  and  then there is a check that the same
1185        three characters are not "999". This pattern does not match "foo" pre-
1186        ceded by six characters, the first three of which are digits and the
1187        last three of which are not "999". For example, it does not match
1188        "123abcfoo". A pattern to do that is
1189
1190          (?<=\d{3}...)(?<!999)foo
1191
1192        This  time  the  first assertion looks at the preceding six characters,
1193        checking that the first three are digits, and then the second assertion
1194        checks that the preceding three characters are not "999".
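
            The two patterns can be compared side by side in a sketch that
            assumes Python's re module, which accepts both lookbehinds
            because each has a fixed length. The names are illustrative.

```python
import re

# Assumption: Python's re module approximates PCRE for this feature.
same_three = re.compile(r'(?<=\d{3})(?<!999)foo')
six_before = re.compile(r'(?<=\d{3}...)(?<!999)foo')

subjects = ['123foo', '999foo', '123abcfoo', '123999foo']

# Both assertions inspect the same three preceding characters.
same_three_hits = [s for s in subjects if same_three.search(s)]
# The first assertion now spans six characters, the second only three.
six_before_hits = [s for s in subjects if six_before.search(s)]
```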
1195
1196        Assertions can be nested in any combination. For example,
1197
1198          (?<=(?<!foo)bar)baz
1199
1200        matches  an occurrence of "baz" that is preceded by "bar" which in turn
1201        is not preceded by "foo", while
1202
1203          (?<=\d{3}(?!999)...)foo
1204
1205        is another pattern that matches "foo" preceded by three digits and  any
1206        three characters that are not "999".
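
       As a rough check, both nested patterns can be tried in Python,
       assuming that its re module handles an assertion nested inside a
       fixed-width lookbehind in the way described here. The names are
       illustrative only.

```python
import re

# "baz" preceded by "bar", which in turn is not preceded by "foo".
bar_not_after_foo = re.compile(r"(?<=(?<!foo)bar)baz")
# "foo" preceded by three digits and then three characters other than "999".
digits_then_not_999 = re.compile(r"(?<=\d{3}(?!999)...)foo")

print(bool(bar_not_after_foo.search("xyzbarbaz")))
print(bool(bar_not_after_foo.search("foobarbaz")))
print(bool(digits_then_not_999.search("123abcfoo")))
print(bool(digits_then_not_999.search("123999foo")))
```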
1207
1208
1209 CONDITIONAL SUBPATTERNS
1210
1211        It  is possible to cause the matching process to obey a subpattern con-
1212        ditionally or to choose between two alternative subpatterns,  depending
1213        on  the result of an assertion, or whether a previous capturing subpat-
1214        tern matched or not. The two possible forms of  conditional  subpattern
1215        are
1216
1217          (?(condition)yes-pattern)
1218          (?(condition)yes-pattern|no-pattern)
1219
1220        If  the  condition is satisfied, the yes-pattern is used; otherwise the
1221        no-pattern (if present) is used. If there are more  than  two  alterna-
1222        tives in the subpattern, a compile-time error occurs.
1223
1224        There are three kinds of condition. If the text between the parentheses
1225        consists of a sequence of digits, the condition  is  satisfied  if  the
1226        capturing  subpattern of that number has previously matched. The number
1227        must be greater than zero. Consider the following pattern,  which  con-
1228        tains  non-significant white space to make it more readable (assume the
1229        PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of
1230        discussion:
1231
1232          ( \( )?    [^()]+    (?(1) \) )
1233
1234        The  first  part  matches  an optional opening parenthesis, and if that
1235        character is present, sets it as the first captured substring. The sec-
1236        ond  part  matches one or more characters that are not parentheses. The
1237        third part is a conditional subpattern that tests whether the first set
1238        of parentheses matched or not. If they did, that is, if the subject started
1239        with an opening parenthesis, the condition is true, and so the yes-pat-
1240        tern  is  executed  and  a  closing parenthesis is required. Otherwise,
1241        since no-pattern is not present, the  subpattern  matches  nothing.  In
1242        other  words,  this  pattern  matches  a  sequence  of non-parentheses,
1243        optionally enclosed in parentheses.
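
       Python's re module accepts the same numeric condition syntax, so the
       pattern can be tried unchanged, with re.VERBOSE standing in for
       PCRE_EXTENDED. This is a minimal sketch; the helper name is an
       assumption, and fullmatch is used so that the whole subject must be
       consumed.

```python
import re

OPTIONAL_PARENS = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

def is_optionally_parenthesized(subject):
    """True if subject is a run of non-parentheses, optionally enclosed."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None

for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, is_optionally_parenthesized(subject))
```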
1244
1245        If the condition is the string (R), it is satisfied if a recursive call
1246        to the pattern or subpattern has been made. At "top level", the
1247        condition is false. This is a PCRE extension. Recursive patterns are
1248        described in a later section.
1249
1250        If  the  condition  is  not  a sequence of digits or (R), it must be an
1251        assertion.  This may be a positive or negative lookahead or  lookbehind
1252        assertion.  Consider  this  pattern,  again  containing non-significant
1253        white space, and with the two alternatives on the second line:
1254
1255          (?(?=[^a-z]*[a-z])
1256          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
1257
1258        The condition  is  a  positive  lookahead  assertion  that  matches  an
1259        optional  sequence of non-letters followed by a letter. In other words,
1260        it tests for the presence of at least one letter in the subject.  If  a
1261        letter  is found, the subject is matched against the first alternative;
1262        otherwise it is  matched  against  the  second.  This  pattern  matches
1263        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1264        letters and dd are digits.
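
       Python's re module does not accept an assertion as a condition, so
       the sketch below is an emulation rather than the PCRE syntax: it
       rewrites the conditional as two alternatives, the first guarded by the
       assertion and the second by its negation. The name DATE_LIKE is
       hypothetical.

```python
import re

DATE_LIKE = re.compile(
    r"""(?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
          | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # no letter lies ahead
        )""",
    re.VERBOSE,
)

for subject in ("12-abc-34", "12-34-56", "12-34-56 x"):
    print(repr(subject), bool(DATE_LIKE.match(subject)))
```

       The last subject shows why this differs from a plain alternation:
       the trailing letter satisfies the condition, so only the first form
       is tried, and it fails.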
1265
1266
1267 COMMENTS
1268
1269        The sequence (?# marks the start of a comment that continues up to  the
1270        next  closing  parenthesis.  Nested  parentheses are not permitted. The
1271        characters that make up a comment play no part in the pattern  matching
1272        at all.
1273
1274        If  the PCRE_EXTENDED option is set, an unescaped # character outside a
1275        character class introduces a comment that continues up to the next new-
1276        line character in the pattern.
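
       Both comment forms also exist in Python's re module, which is used
       below as a stand-in for PCRE; re.VERBOSE plays the part of
       PCRE_EXTENDED. The pattern names are illustrative.

```python
import re

# A (?#...) comment takes no part in matching.
inline = re.compile(r"abc(?# this text is ignored)def")

# With the extended option, an unescaped "#" starts a comment, except
# inside a character class, where it is an ordinary character.
extended = re.compile(r"""
    \d+     # one or more digits
    [#]     # a literal hash character
    \d+     # one or more digits
""", re.VERBOSE)

print(bool(inline.fullmatch("abcdef")))
print(bool(extended.fullmatch("12#34")))
```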
1277
1278
1279 RECURSIVE PATTERNS
1280
1281        Consider  the problem of matching a string in parentheses, allowing for
1282        unlimited nested parentheses. Without the use of  recursion,  the  best
1283        that  can  be  done  is  to use a pattern that matches up to some fixed
1284        depth of nesting. It is not possible to  handle  an  arbitrary  nesting
1285        depth.  Perl  provides  a  facility  that allows regular expressions to
1286        recurse (amongst other things). It does this by interpolating Perl code
1287        in the expression at run time, and the code can refer to the expression
1288        itself. A Perl pattern to solve the parentheses problem can be  created
1289        like this:
1290
1291          $re = qr{\( (?: (?>[^()]+) | (??{$re}) )* \)}x;
1292
1293        The (??{...}) item interpolates Perl code at run time, and in this case
1294        refers recursively to the pattern in which it appears. Obviously,  PCRE
1295        cannot  support  the  interpolation  of Perl code. Instead, it supports
1296        some special syntax for recursion of the entire pattern, and  also  for
1297        individual subpattern recursion.
1298
1299        The  special item that consists of (? followed by a number greater than
1300        zero and a closing parenthesis is a recursive call of the subpattern of
1301        the  given  number, provided that it occurs inside that subpattern. (If
1302        not, it is a "subroutine" call, which is described  in  the  next  sec-
1303        tion.)  The special item (?R) is a recursive call of the entire regular
1304        expression.
1305
1306        For example, this PCRE pattern solves the  nested  parentheses  problem
1307        (assume  the  PCRE_EXTENDED  option  is  set  so  that  white  space is
1308        ignored):
1309
1310          \( ( (?>[^()]+) | (?R) )* \)
1311
1312        First it matches an opening parenthesis. Then it matches any number  of
1313        substrings  which  can  either  be  a sequence of non-parentheses, or a
1314        recursive match of the pattern itself (that is, a correctly
1315        parenthesized substring). Finally there is a closing parenthesis.
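
       Python's re module has no (?R) item, so the following is not a use of
       the pattern itself but a hand-written Python function that follows the
       same three steps. It is a sketch only, and the name match_parens is
       an assumption made for the example.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced "(...)" at pos, or -1."""
    if pos >= len(subject) or subject[pos] != "(":
        return -1                    # the leading \( did not match
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":                # the final \)
            return pos + 1
        if ch == "(":                # the (?R) alternative: recurse
            pos = match_parens(subject, pos)
            if pos == -1:
                return -1
        else:                        # a non-parenthesis, as in [^()]+
            pos += 1
    return -1                        # input ended before the closing ")"

print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```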
1316
1317        If  this  were  part of a larger pattern, you would not want to recurse
1318        the entire pattern, so instead you could use this:
1319
1320          ( \( ( (?>[^()]+) | (?1) )* \) )
1321
1322        We have put the pattern into parentheses, and caused the  recursion  to
1323        refer  to them instead of the whole pattern. In a larger pattern, keep-
1324        ing track of parenthesis numbers can be tricky. It may be  more  conve-
1325        nient  to use named parentheses instead. For this, PCRE uses (?P>name),
1326        which is an extension to the Python syntax that  PCRE  uses  for  named
1327        parentheses (Perl does not provide named parentheses). We could rewrite
1328        the above example as follows:
1329
1330          (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
1331
1332        This particular example pattern contains nested unlimited repeats,  and
1333        so  the  use of atomic grouping for matching strings of non-parentheses
1334        is important when applying the pattern to strings that  do  not  match.
1335        For example, when this pattern is applied to
1336
1337          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1338
1339        it  yields "no match" quickly. However, if atomic grouping is not used,
1340        the match runs for a very long time indeed because there  are  so  many
1341        different  ways  the  + and * repeats can carve up the subject, and all
1342        have to be tested before failure can be reported.
1343
1344        At the end of a match, the values set for any capturing subpatterns are
1345        those from the outermost level of the recursion at which the subpattern
1346        value is set. If you want to obtain intermediate values, a callout
1347        function can be used (see the section on callouts below and the
1348        pcrecallout documentation). If the pattern above is matched against
1349
1350          (ab(cd)ef)
1351
1352        the value for the capturing parentheses is  "ef",  which  is  the  last
1353        value  taken  on at the top level. If additional parentheses are added,
1354        giving
1355
1356          \( ( ( (?>[^()]+) | (?R) )* ) \)
1357             ^                        ^
1358             ^                        ^
1359
1360        the string they capture is "ab(cd)ef", the contents of  the  top  level
1361        parentheses.  If there are more than 15 capturing parentheses in a pat-
1362        tern, PCRE has to obtain extra memory to store data during a recursion,
1363        which  it  does  by  using pcre_malloc, freeing it via pcre_free after-
1364        wards. If  no  memory  can  be  obtained,  the  match  fails  with  the
1365        PCRE_ERROR_NOMEMORY error.
1366
1367        Do  not  confuse  the (?R) item with the condition (R), which tests for
1368        recursion.  Consider this pattern, which matches text in  angle  brack-
1369        ets,  allowing for arbitrary nesting. Only digits are allowed in nested
1370        brackets (that is, when recursing), whereas any characters are  permit-
1371        ted at the outer level.
1372
1373          < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
1374
1375        In  this  pattern, (?(R) is the start of a conditional subpattern, with
1376        two different alternatives for the recursive and  non-recursive  cases.
1377        The (?R) item is the actual recursive call.
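
       The role of the (R) condition can be sketched with a hand-written
       Python function in which a flag takes the place of the condition.
       Python's re module supports neither (?R) nor (R), so this is an
       emulation; match_angle and its recursing parameter are hypothetical
       names.

```python
DIGITS = "0123456789"

def match_angle(subject, pos=0, recursing=False):
    """Return the index just past a balanced "<...>" at pos, or -1."""
    if pos >= len(subject) or subject[pos] != "<":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ">":
            return pos + 1
        if ch == "<":
            # The (?R) item: everything inside is matched "when recursing".
            pos = match_angle(subject, pos, recursing=True)
            if pos == -1:
                return -1
        elif recursing and ch not in DIGITS:
            return -1                # the (R) branch allows digits only
        else:
            pos += 1                 # digits when nested, anything at top level
    return -1

print(match_angle("<abc<123>def>"))
print(match_angle("<abc<12a>def>"))
```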
1378
1379
1380 SUBPATTERNS AS SUBROUTINES
1381
1382        If the syntax for a recursive subpattern reference (either by number or
1383        by name) is used outside the parentheses to which it refers,  it  oper-
1384        ates  like  a  subroutine in a programming language. An earlier example
1385        pointed out that the pattern
1386
1387          (sens|respons)e and \1ibility
1388
1389        matches "sense and sensibility" and "response and responsibility",  but
1390        not "sense and responsibility". If instead the pattern
1391
1392          (sens|respons)e and (?1)ibility
1393
1394        is  used, it does match "sense and responsibility" as well as the other
1395        two strings. Such references must, however, follow  the  subpattern  to
1396        which they refer.
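
       The contrast with a back reference can be illustrated in Python.
       Its re module supports \1 but has no (?1) item, so the subroutine
       call is emulated here by inserting the subpattern's text a second
       time; STEM is a hypothetical helper introduced for that purpose.

```python
import re

STEM = r"(?:sens|respons)"   # the subpattern text, reused like a subroutine

# \1 must match the same text that group 1 matched.
backreference = re.compile(r"(sens|respons)e and \1ibility")
# Repeating the subpattern lets each occurrence match independently.
subroutine_like = re.compile(rf"({STEM})e and {STEM}ibility")

subject = "sense and responsibility"
print(bool(backreference.fullmatch(subject)))
print(bool(subroutine_like.fullmatch(subject)))
```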
1397
1398
1399 CALLOUTS
1400
1401        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1402        Perl code to be obeyed in the middle of matching a regular  expression.
1403        This makes it possible, amongst other things, to extract different sub-
1404        strings that match the same pair of parentheses when there is a repeti-
1405        tion.
1406
1407        PCRE provides a similar feature, but of course it cannot obey arbitrary
1408        Perl code. The feature is called "callout". The caller of PCRE provides
1409        an  external function by putting its entry point in the global variable
1410        pcre_callout.  By default, this variable contains NULL, which  disables
1411        all calling out.
1412
1413        Within  a  regular  expression,  (?C) indicates the points at which the
1414        external function is to be called. If you want  to  identify  different
1415        callout  points, you can put a number less than 256 after the letter C.
1416        The default value is zero.  For example, this pattern has  two  callout
1417        points:
1418
1419          (?C1)abc(?C2)def
1420
1421        If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1422        automatically installed before each item in the pattern. They  are  all
1423        numbered 255.
1424
1425        During matching, when PCRE reaches a callout point (and pcre_callout is
1426        set), the external function is called. It is provided with  the  number
1427        of  the callout, the position in the pattern, and, optionally, one item
1428        of data originally supplied by the caller of pcre_exec().  The  callout
1429        function  may cause matching to proceed, to backtrack, or to fail alto-
1430        gether. A complete description of the interface to the callout function
1431        is given in the pcrecallout documentation.
1432
1433 Last updated: 28 February 2005
1434 Copyright (c) 1997-2005 University of Cambridge.