doc/doc-txt/pcrepattern.txt

   1 This file contains the PCRE man page that describes the regular expressions
   2 supported by PCRE version 6.2. Note that not all of the features are relevant
   3 in the context of Exim. In particular, the version of PCRE that is compiled
   4 with Exim does not include UTF-8 support, there is no mechanism for changing
   5 the options with which the PCRE functions are called, and features such as
   6 callout are not accessible.
   7 -----------------------------------------------------------------------------
   8
   9 PCREPATTERN(3)                                                  PCREPATTERN(3)
  10
  11
  12 NAME
  13        PCRE - Perl-compatible regular expressions
  14
  15
  16 PCRE REGULAR EXPRESSION DETAILS
  17
  18        The  syntax  and semantics of the regular expressions supported by PCRE
  19        are described below. Regular expressions are also described in the Perl
  20        documentation  and  in  a  number  of books, some of which have copious
  21        examples.  Jeffrey Friedl's "Mastering Regular Expressions",  published
  22        by  O'Reilly, covers regular expressions in great detail. This descrip-
  23        tion of PCRE's regular expressions is intended as reference material.
  24
  25        The original operation of PCRE was on strings of  one-byte  characters.
  26        However,  there is now also support for UTF-8 character strings. To use
  27        this, you must build PCRE to  include  UTF-8  support,  and  then  call
  28        pcre_compile()  with  the  PCRE_UTF8  option.  How this affects pattern
  29        matching is mentioned in several places below. There is also a  summary
  30        of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
  31        page.
  32
  33        The remainder of this document discusses the  patterns  that  are  sup-
  34        ported  by  PCRE when its main matching function, pcre_exec(), is used.
  35        From  release  6.0,   PCRE   offers   a   second   matching   function,
  36        pcre_dfa_exec(),  which matches using a different algorithm that is not
  37        Perl-compatible. The advantages and disadvantages  of  the  alternative
  38        function, and how it differs from the normal function, are discussed in
  39        the pcrematching page.
  40
  41        A regular expression is a pattern that is  matched  against  a  subject
  42        string  from  left  to right. Most characters stand for themselves in a
  43        pattern, and match the corresponding characters in the  subject.  As  a
  44        trivial example, the pattern
  45
  46          The quick brown fox
  47
  48        matches a portion of a subject string that is identical to itself. When
  49        caseless matching is specified (the PCRE_CASELESS option), letters  are
  50        matched  independently  of case. In UTF-8 mode, PCRE always understands
  51        the concept of case for characters whose values are less than  128,  so
  52        caseless  matching  is always possible. For characters with higher val-
  53        ues, the concept of case is supported if PCRE is compiled with  Unicode
  54        property  support,  but  not  otherwise.   If  you want to use caseless
  55        matching for characters 128 and above, you must  ensure  that  PCRE  is
  56        compiled with Unicode property support as well as with UTF-8 support.
  57
  58        The  power  of  regular  expressions  comes from the ability to include
  59        alternatives and repetitions in the pattern. These are encoded  in  the
  60        pattern by the use of metacharacters, which do not stand for themselves
  61        but instead are interpreted in some special way.
  62
  63        There are two different sets of metacharacters: those that  are  recog-
  64        nized  anywhere in the pattern except within square brackets, and those
  65        that are recognized in square brackets. Outside  square  brackets,  the
  66        metacharacters are as follows:
  67
  68          \      general escape character with several uses
  69          ^      assert start of string (or line, in multiline mode)
  70          $      assert end of string (or line, in multiline mode)
  71          .      match any character except newline (by default)
  72          [      start character class definition
  73          |      start of alternative branch
  74          (      start subpattern
  75          )      end subpattern
  76          ?      extends the meaning of (
  77                 also 0 or 1 quantifier
  78                 also quantifier minimizer
  79          *      0 or more quantifier
  80          +      1 or more quantifier
  81                 also "possessive quantifier"
  82          {      start min/max quantifier
  83
  84        Part  of  a  pattern  that is in square brackets is called a "character
  85        class". In a character class the only metacharacters are:
  86
  87          \      general escape character
   88          ^      negate the class, but only if it is the first character
  89          -      indicates character range
  90          [      POSIX character class (only if followed by POSIX
  91                   syntax)
  92          ]      terminates the character class
  93
  94        The following sections describe the use of each of the  metacharacters.
  95
  96
  97 BACKSLASH
  98
  99        The backslash character has several uses. Firstly, if it is followed by
 100        a non-alphanumeric character, it takes away any  special  meaning  that
 101        character  may  have.  This  use  of  backslash  as an escape character
 102        applies both inside and outside character classes.
 103
 104        For example, if you want to match a * character, you write  \*  in  the
 105        pattern.   This  escaping  action  applies whether or not the following
 106        character would otherwise be interpreted as a metacharacter, so  it  is
 107        always  safe  to  precede  a non-alphanumeric with backslash to specify
 108        that it stands for itself. In particular, if you want to match a  back-
 109        slash, you write \\.
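
             As an illustration only, the rule that a backslash before any
             non-alphanumeric character is always safe can be used to quote a
             literal string for inclusion in a pattern. The Python helper below
             is a minimal sketch: the name quote_literal is hypothetical, and it
             assumes that only ASCII letters and digits count as alphanumeric,
             leaving characters above 127 untouched.

```python
ASCII_ALNUM = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
)


def quote_literal(text):
    # Put a backslash in front of every ASCII character that is not a
    # letter or digit, so that it stands for itself in a pattern.
    pieces = []
    for ch in text:
        if ord(ch) < 128 and ch not in ASCII_ALNUM:
            pieces.append("\\" + ch)
        else:
            # Letters, digits, and (by assumption) non-ASCII characters
            # are copied unchanged.
            pieces.append(ch)
    return "".join(pieces)
```

             Letters and digits are deliberately left alone, because a backslash
             before an alphanumeric character has other meanings.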
 110
 111        If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
 112        the pattern (other than in a character class) and characters between  a
 113        # outside a character class and the next newline character are ignored.
 114        An escaping backslash can be used to include a whitespace or #  charac-
 115        ter as part of the pattern.
 116
 117        If  you  want  to remove the special meaning from a sequence of charac-
 118        ters, you can do so by putting them between \Q and \E. This is  differ-
 119        ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
 120        sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
 121        tion. Note the following examples:
 122
 123          Pattern            PCRE matches   Perl matches
 124
 125          \Qabc$xyz\E        abc$xyz        abc followed by the
 126                                              contents of $xyz
 127          \Qabc\$xyz\E       abc\$xyz       abc\$xyz
 128          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
 129
 130        The  \Q...\E  sequence  is recognized both inside and outside character
 131        classes.
 132
 133    Non-printing characters
 134
 135        A second use of backslash provides a way of encoding non-printing char-
 136        acters  in patterns in a visible manner. There is no restriction on the
 137        appearance of non-printing characters, apart from the binary zero  that
 138        terminates  a  pattern,  but  when  a pattern is being prepared by text
 139        editing, it is usually easier  to  use  one  of  the  following  escape
 140        sequences than the binary character it represents:
 141
 142          \a        alarm, that is, the BEL character (hex 07)
 143          \cx       "control-x", where x is any character
 144          \e        escape (hex 1B)
 145          \f        formfeed (hex 0C)
 146          \n        newline (hex 0A)
 147          \r        carriage return (hex 0D)
 148          \t        tab (hex 09)
 149          \ddd      character with octal code ddd, or backreference
 150          \xhh      character with hex code hh
 151          \x{hhh..} character with hex code hhh... (UTF-8 mode only)
 152
 153        The  precise  effect of \cx is as follows: if x is a lower case letter,
 154        it is converted to upper case. Then bit 6 of the character (hex 40)  is
 155        inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
 156        becomes hex 7B.
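
             The arithmetic can be checked with a few lines of Python. This is a
             sketch of the rule as stated above, not PCRE's own code; the
             function name control_escape is hypothetical.

```python
def control_escape(x):
    # Model of \cx: fold a lower case ASCII letter to upper case, then
    # invert bit 6 (hex 40) of the character code.
    code = ord(x)
    if ord("a") <= code <= ord("z"):
        code -= 0x20
    return code ^ 0x40
```

             With this model, control_escape("z") is hex 1A, control_escape("{")
             is hex 3B, and control_escape(";") is hex 7B, which agrees with the
             three cases given in the text.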
 157
 158        After \x, from zero to two hexadecimal digits are read (letters can  be
 159        in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
 160        its may appear between \x{ and }, but the value of the  character  code
 161        must  be  less  than  2**31  (that is, the maximum hexadecimal value is
 162        7FFFFFFF). If characters other than hexadecimal digits  appear  between
 163        \x{  and }, or if there is no terminating }, this form of escape is not
 164        recognized. Instead, the initial \x will  be  interpreted  as  a  basic
 165        hexadecimal  escape, with no following digits, giving a character whose
 166        value is zero.
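
             The fallback behaviour is easy to misread, so here is a small Python
             model of it. It is a sketch under stated assumptions: the function
             name and the utf8 flag are hypothetical, raising ValueError stands
             in for whatever error PCRE reports for an over-large value, and
             treating empty braces as zero is a guess, since the text does not
             cover that case.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_hex_escape(pattern, pos, utf8=False):
    # pos is the index of the character immediately after "\x".
    # Returns (character_code, index_of_next_unread_character).
    if utf8 and pattern.startswith("{", pos):
        close = pattern.find("}", pos + 1)
        if close != -1:
            digits = pattern[pos + 1:close]
            if all(d in HEX_DIGITS for d in digits):
                # Assumption: empty braces give the value zero.
                value = int(digits, 16) if digits else 0
                if value >= 2**31:
                    raise ValueError("character code must be less than 2**31")
                return value, close + 1
        # Otherwise the braced form is not recognized: fall through.
    # Basic form: read zero, one, or two hexadecimal digits.
    end = pos
    while end < len(pattern) and end < pos + 2 and pattern[end] in HEX_DIGITS:
        end += 1
    value = int(pattern[pos:end], 16) if end > pos else 0
    return value, end
```

             When the braced form is rejected, the returned index still points
             at the "{", so the brace and whatever follows it are treated as
             ordinary pattern text after a character whose value is zero.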
 167
 168        Characters whose value is less than 256 can be defined by either of the
 169        two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
 170        in the way they are handled. For example, \xdc is exactly the  same  as
 171        \x{dc}.
 172
  173        After \0 up to two further octal digits are read. In both cases (\x and
  174        \0), if there are fewer than two digits, just those present  are  used.
 175        Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL
 176        character (code value 7). Make sure you supply  two  digits  after  the
 177        initial  zero  if the pattern character that follows is itself an octal
 178        digit.
 179
 180        The handling of a backslash followed by a digit other than 0 is compli-
 181        cated.  Outside a character class, PCRE reads it and any following dig-
 182        its as a decimal number. If the number is less than  10,  or  if  there
 183        have been at least that many previous capturing left parentheses in the
 184        expression, the entire  sequence  is  taken  as  a  back  reference.  A
 185        description  of how this works is given later, following the discussion
 186        of parenthesized subpatterns.
 187
 188        Inside a character class, or if the decimal number is  greater  than  9
 189        and  there have not been that many capturing subpatterns, PCRE re-reads
 190        up to three octal digits following the backslash, and generates a  sin-
 191        gle byte from the least significant 8 bits of the value. Any subsequent
 192        digits stand for themselves.  For example:
 193
 194          \040   is another way of writing a space
 195          \40    is the same, provided there are fewer than 40
 196                    previous capturing subpatterns
 197          \7     is always a back reference
 198          \11    might be a back reference, or another way of
 199                    writing a tab
 200          \011   is always a tab
 201          \0113  is a tab followed by the character "3"
 202          \113   might be a back reference, otherwise the
 203                    character with octal code 113
 204          \377   might be a back reference, otherwise
 205                    the byte consisting entirely of 1 bits
 206          \81    is either a back reference, or a binary zero
 207                    followed by the two characters "8" and "1"
 208
 209        Note that octal values of 100 or greater must not be  introduced  by  a
 210        leading zero, because no more than three octal digits are ever read.
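
             The decision procedure for a backslash followed by digits can be
             written out as a short Python model. This is a sketch of the rules
             as described in this section, not PCRE's implementation; the
             function name, its arguments, and the returned tuple are all
             hypothetical.

```python
DECIMAL = "0123456789"
OCTAL = "01234567"


def interpret_digit_escape(tail, captures_so_far, in_class=False):
    # tail is the pattern text after the backslash; it must start with a
    # digit. Returns (kind, value, rest), where kind is "backref" or
    # "byte" and rest is the text left over to stand for itself.
    if tail[0] != "0" and not in_class:
        n = 0
        while n < len(tail) and tail[n] in DECIMAL:
            n += 1
        number = int(tail[:n])
        if number < 10 or number <= captures_so_far:
            return ("backref", number, tail[n:])
    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    n = 0
    while n < 3 and n < len(tail) and tail[n] in OCTAL:
        n += 1
    value = (int(tail[:n], 8) & 0xFF) if n else 0
    return ("byte", value, tail[n:])
```

             Calling interpret_digit_escape("81", 0) returns a zero byte with
             "81" left over, and interpret_digit_escape("0113", 0) returns a tab
             with "3" left over, matching the table above.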
 211
 212        All  the  sequences  that  define a single byte value or a single UTF-8
 213        character (in UTF-8 mode) can be used both inside and outside character
 214        classes.  In  addition,  inside  a  character class, the sequence \b is
 215        interpreted as the backspace character (hex 08), and the sequence \X is
 216        interpreted  as  the  character  "X".  Outside a character class, these
 217        sequences have different meanings (see below).
 218
 219    Generic character types
 220
 221        The third use of backslash is for specifying generic  character  types.
 222        The following are always recognized:
 223
 224          \d     any decimal digit
 225          \D     any character that is not a decimal digit
 226          \s     any whitespace character
 227          \S     any character that is not a whitespace character
 228          \w     any "word" character
 229          \W     any "non-word" character
 230
 231        Each pair of escape sequences partitions the complete set of characters
 232        into two disjoint sets. Any given character matches one, and only  one,
 233        of each pair.
 234
 235        These character type sequences can appear both inside and outside char-
 236        acter classes. They each match one character of the  appropriate  type.
 237        If  the current matching point is at the end of the subject string, all
 238        of them fail, since there is no character to match.
 239
 240        For compatibility with Perl, \s does not match the VT  character  (code
  241        11).  This makes it different from the POSIX "space"  class.  The  \s
 242        characters are HT (9), LF (10), FF (12), CR (13), and space (32).
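
             The two sets can be compared directly. The Python below is a sketch
             that simply transcribes the character codes listed in this document
             for \s and for the POSIX "space" class; the names are hypothetical.

```python
BACKSLASH_S = frozenset({9, 10, 12, 13, 32})       # HT, LF, FF, CR, space
POSIX_SPACE = frozenset({9, 10, 11, 12, 13, 32})   # the same, plus VT


def matches_backslash_s(ch):
    # True if ch is one of the characters matched by \s.
    return ord(ch) in BACKSLASH_S


def matches_posix_space(ch):
    # True if ch is one of the characters in the POSIX "space" class.
    return ord(ch) in POSIX_SPACE
```

             The only member of POSIX_SPACE that is missing from BACKSLASH_S is
             code 11, the VT character.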
 243
 244        A "word" character is an underscore or any character less than 256 that
 245        is  a  letter  or  digit.  The definition of letters and digits is con-
 246        trolled by PCRE's low-valued character tables, and may vary if  locale-
 247        specific  matching is taking place (see "Locale support" in the pcreapi
 248        page). For example, in the  "fr_FR"  (French)  locale,  some  character
 249        codes  greater  than  128  are used for accented letters, and these are
 250        matched by \w.
 251
 252        In UTF-8 mode, characters with values greater than 128 never match  \d,
 253        \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
 254        code character property support is available.
 255
 256    Unicode character properties
 257
 258        When PCRE is built with Unicode character property support, three addi-
 259        tional  escape sequences to match generic character types are available
 260        when UTF-8 mode is selected. They are:
 261
 262         \p{xx}   a character with the xx property
 263         \P{xx}   a character without the xx property
 264         \X       an extended Unicode sequence
 265
 266        The property names represented by xx above are limited to  the  Unicode
 267        general  category properties. Each character has exactly one such prop-
 268        erty, specified by a two-letter abbreviation.  For  compatibility  with
 269        Perl,  negation  can be specified by including a circumflex between the
 270        opening brace and the property name. For example, \p{^Lu} is  the  same
 271        as \P{Lu}.
 272
 273        If  only  one  letter  is  specified with \p or \P, it includes all the
 274        properties that start with that letter. In this case, in the absence of
 275        negation, the curly brackets in the escape sequence are optional; these
 276        two examples have the same effect:
 277
 278          \p{L}
 279          \pL
 280
 281        The following property codes are supported:
 282
 283          C     Other
 284          Cc    Control
 285          Cf    Format
 286          Cn    Unassigned
 287          Co    Private use
 288          Cs    Surrogate
 289
 290          L     Letter
 291          Ll    Lower case letter
 292          Lm    Modifier letter
 293          Lo    Other letter
 294          Lt    Title case letter
 295          Lu    Upper case letter
 296
 297          M     Mark
 298          Mc    Spacing mark
 299          Me    Enclosing mark
 300          Mn    Non-spacing mark
 301
 302          N     Number
 303          Nd    Decimal number
 304          Nl    Letter number
 305          No    Other number
 306
 307          P     Punctuation
 308          Pc    Connector punctuation
 309          Pd    Dash punctuation
 310          Pe    Close punctuation
 311          Pf    Final punctuation
 312          Pi    Initial punctuation
 313          Po    Other punctuation
 314          Ps    Open punctuation
 315
 316          S     Symbol
 317          Sc    Currency symbol
 318          Sk    Modifier symbol
 319          Sm    Mathematical symbol
 320          So    Other symbol
 321
 322          Z     Separator
 323          Zl    Line separator
 324          Zp    Paragraph separator
 325          Zs    Space separator
 326
 327        Extended properties such as "Greek" or "InMusicalSymbols" are not  sup-
 328        ported by PCRE.
 329
 330        Specifying  caseless  matching  does not affect these escape sequences.
 331        For example, \p{Lu} always matches only upper case letters.
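
             The way one-letter and two-letter codes relate can be explored with
             Python's standard unicodedata module, which reports the same
             two-letter general category abbreviations. This is only a sketch:
             the helper names are hypothetical, and Python's Unicode tables are
             newer than those in this PCRE release, so individual characters may
             be classified differently.

```python
import unicodedata


def has_property(ch, prop):
    # A one-letter code such as "L" covers every category that starts
    # with that letter; a two-letter code must match exactly.
    return unicodedata.category(ch).startswith(prop)


def matches_p(ch, spec):
    # Model of \p{spec}; a leading circumflex negates, as in \p{^Lu}.
    if spec.startswith("^"):
        return not has_property(ch, spec[1:])
    return has_property(ch, spec)
```

             In this model matches_p("A", "L") and matches_p("A", "Lu") are both
             true, while matches_p("a", "Lu") is false and matches_p("a", "^Lu")
             is true.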
 332
 333        The \X escape matches any number of Unicode  characters  that  form  an
 334        extended Unicode sequence. \X is equivalent to
 335
 336          (?>\PM\pM*)
 337
 338        That  is,  it matches a character without the "mark" property, followed
 339        by zero or more characters with the "mark"  property,  and  treats  the
 340        sequence  as  an  atomic group (see below).  Characters with the "mark"
 341        property are typically accents that affect the preceding character.
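
             The shape of an extended sequence can be sketched in Python using
             the standard unicodedata module to test for the "mark" property.
             The function names are hypothetical, and Python's Unicode tables
             are assumed to agree with PCRE's for the characters of interest.

```python
import unicodedata


def is_mark(ch):
    # True for the categories Mc, Me, and Mn.
    return unicodedata.category(ch).startswith("M")


def extended_sequence_end(subject, pos):
    # Model of \X at position pos. Returns the index just past the
    # sequence, or None if no sequence can start here.
    if pos >= len(subject) or is_mark(subject[pos]):
        return None                  # \PM needs one non-mark character
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                     # \pM* takes every following mark
    return end
```

             For the subject "e" followed by U+0301 (combining acute accent) and
             "x", a sequence starting at offset 0 ends at offset 2, so the
             letter and its accent are taken together.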
 342
 343        Matching characters by Unicode property is not fast, because  PCRE  has
 344        to  search  a  structure  that  contains data for over fifteen thousand
 345        characters. That is why the traditional escape sequences such as \d and
 346        \w do not use Unicode properties in PCRE.
 347
 348    Simple assertions
 349
 350        The fourth use of backslash is for certain simple assertions. An asser-
 351        tion specifies a condition that has to be met at a particular point  in
 352        a  match, without consuming any characters from the subject string. The
 353        use of subpatterns for more complicated assertions is described  below.
 354        The backslashed assertions are:
 355
 356          \b     matches at a word boundary
 357          \B     matches when not at a word boundary
 358          \A     matches at start of subject
 359          \Z     matches at end of subject or before newline at end
 360          \z     matches at end of subject
 361          \G     matches at first matching position in subject
 362
 363        These  assertions may not appear in character classes (but note that \b
 364        has a different meaning, namely the backspace character, inside a char-
 365        acter class).
 366
 367        A  word  boundary is a position in the subject string where the current
 368        character and the previous character do not both match \w or  \W  (i.e.
 369        one  matches  \w  and the other matches \W), or the start or end of the
 370        string if the first or last character matches \w, respectively.
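
             The definition reduces to a comparison of the characters on either
             side of the position. The Python below is a minimal sketch; the
             names are hypothetical, and it assumes the default tables, in which
             the word characters are the ASCII letters, digits, and underscore.

```python
WORD_CHARS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789_"
)


def is_word_char(ch):
    return ch in WORD_CHARS


def is_word_boundary(subject, pos):
    # The positions before the start and after the end of the subject
    # are treated as if they held non-word characters.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after
```

             For the subject "ab cd" the boundaries are at offsets 0, 2, 3, and
             5. \B is true at every other offset.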
 371
 372        The \A, \Z, and \z assertions differ from  the  traditional  circumflex
 373        and dollar (described in the next section) in that they only ever match
 374        at the very start and end of the subject string, whatever  options  are
 375        set.  Thus,  they are independent of multiline mode. These three asser-
 376        tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
 377        affect  only the behaviour of the circumflex and dollar metacharacters.
 378        However, if the startoffset argument of pcre_exec() is non-zero,  indi-
 379        cating that matching is to start at a point other than the beginning of
 380        the subject, \A can never match. The difference between \Z  and  \z  is
 381        that  \Z  matches  before  a  newline that is the last character of the
 382        string as well as at the end of the string, whereas \z matches only  at
 383        the end.
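
             The three assertions can be stated as predicates on an offset into
             the subject. This Python sketch uses hypothetical function names
             and assumes that matching starts at offset zero.

```python
def at_A(subject, pos):
    # \A: only at the very start of the subject.
    return pos == 0


def at_z(subject, pos):
    # \z: only at the very end of the subject.
    return pos == len(subject)


def at_Z(subject, pos):
    # \Z: at the very end, or just before a newline that is the last
    # character of the subject.
    if pos == len(subject):
        return True
    return pos == len(subject) - 1 and subject[pos] == "\n"
```

             For the subject "abc" followed by a newline, at_Z is true at
             offsets 3 and 4, whereas at_z is true only at offset 4.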
 384
 385        The  \G assertion is true only when the current matching position is at
 386        the start point of the match, as specified by the startoffset  argument
 387        of  pcre_exec().  It  differs  from \A when the value of startoffset is
  388        non-zero. By calling pcre_exec() multiple times with appropriate  argu-
  389        ments, you can mimic Perl's /g option, and it is in this kind of imple-
  390        mentation that \G can be useful.
 391
 392        Note, however, that PCRE's interpretation of \G, as the  start  of  the
 393        current match, is subtly different from Perl's, which defines it as the
 394        end of the previous match. In Perl, these can  be  different  when  the
 395        previously  matched  string was empty. Because PCRE does just one match
 396        at a time, it cannot reproduce this behaviour.
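
             The repeated-call technique can be sketched in Python, with the
             caveat that Python's re module is a separate engine and has no \G.
             As an analogy only, Pattern.match(string, pos) succeeds only when
             the match begins exactly at pos, which is the role a leading \G
             plays when pcre_exec() is given a startoffset. The function name
             scan_all is hypothetical, and stepping forward by one after an
             empty match is a choice made for this sketch.

```python
import re


def scan_all(pattern, subject):
    # Collect successive matches, each of which must begin exactly where
    # the previous one ended.
    compiled = re.compile(pattern)
    offset = 0
    found = []
    while offset <= len(subject):
        m = compiled.match(subject, offset)   # anchored at offset
        if m is None:
            break                             # gap in the subject: stop
        found.append(m.group(0))
        # Step past an empty match so that the loop always advances.
        offset = m.end() if m.end() > offset else offset + 1
    return found
```

             Scanning "1,22,x,3," for \d+, in this way stops at the "x", so the
             final "3," is not reported, whereas an unanchored search would find
             it.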
 397
 398        If all the alternatives of a pattern begin with \G, the  expression  is
 399        anchored to the starting match position, and the "anchored" flag is set
 400        in the compiled regular expression.
 401
 402
 403 CIRCUMFLEX AND DOLLAR
 404
 405        Outside a character class, in the default matching mode, the circumflex
 406        character  is  an  assertion  that is true only if the current matching
 407        point is at the start of the subject string. If the  startoffset  argu-
 408        ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
 409        PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
 410        has an entirely different meaning (see below).
 411
 412        Circumflex  need  not be the first character of the pattern if a number
 413        of alternatives are involved, but it should be the first thing in  each
 414        alternative  in  which  it appears if the pattern is ever to match that
 415        branch. If all possible alternatives start with a circumflex, that  is,
 416        if  the  pattern  is constrained to match only at the start of the sub-
 417        ject, it is said to be an "anchored" pattern.  (There  are  also  other
 418        constructs that can cause a pattern to be anchored.)
 419
 420        A  dollar  character  is  an assertion that is true only if the current
 421        matching point is at the end of  the  subject  string,  or  immediately
 422        before a newline character that is the last character in the string (by
 423        default). Dollar need not be the last character of  the  pattern  if  a
 424        number  of alternatives are involved, but it should be the last item in
 425        any branch in which it appears.  Dollar has no  special  meaning  in  a
 426        character class.
 427
 428        The  meaning  of  dollar  can be changed so that it matches only at the
 429        very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
 430        compile time. This does not affect the \Z assertion.
 431
 432        The meanings of the circumflex and dollar characters are changed if the
 433        PCRE_MULTILINE option is set. When this is the case, they match immedi-
 434        ately  after  and  immediately  before  an  internal newline character,
 435        respectively, in addition to matching at the start and end of the  sub-
 436        ject  string.  For  example,  the  pattern  /^abc$/ matches the subject
 437        string "def\nabc" (where \n represents a newline character)  in  multi-
 438        line mode, but not otherwise.  Consequently, patterns that are anchored
 439        in single line mode because all branches start with ^ are not  anchored
 440        in  multiline  mode,  and  a  match for circumflex is possible when the
 441        startoffset  argument  of  pcre_exec()  is  non-zero.   The   PCRE_DOL-
 442        LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
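
             The interaction of these options can be modelled for the simple
             pattern /^literal$/, where the literal contains no newline and no
             metacharacters. The Python below is a sketch of the rules described
             above, not a regular expression engine; the function and argument
             names are hypothetical.

```python
def anchored_literal_matches(literal, subject, multiline=False,
                             dollar_endonly=False):
    # Would /^literal$/ match somewhere in subject?
    if multiline:
        # ^ and $ also match around newlines, so any whole line will do.
        # dollar_endonly is ignored in this mode.
        return literal in subject.split("\n")
    if subject == literal:
        return True
    # By default $ also matches before a newline that ends the subject.
    return not dollar_endonly and subject == literal + "\n"
```

             With the literal "abc" and the subject "def", a newline, and "abc",
             the result is true only in multiline mode, as in the example given
             in the text.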
 443
 444        Note  that  the sequences \A, \Z, and \z can be used to match the start
 445        and end of the subject in both modes, and if all branches of a  pattern
 446        start  with  \A it is always anchored, whether PCRE_MULTILINE is set or
 447        not.
 448
 449
 450 FULL STOP (PERIOD, DOT)
 451
 452        Outside a character class, a dot in the pattern matches any one charac-
 453        ter  in  the  subject,  including a non-printing character, but not (by
 454        default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,
 455        which might be more than one byte long, except (by default) newline. If
 456        the PCRE_DOTALL option is set, dots match newlines as  well.  The  han-
 457        dling  of dot is entirely independent of the handling of circumflex and
 458        dollar, the only relationship being  that  they  both  involve  newline
 459        characters. Dot has no special meaning in a character class.
 460
 461
 462 MATCHING A SINGLE BYTE
 463
 464        Outside a character class, the escape sequence \C matches any one byte,
 465        both in and out of UTF-8 mode. Unlike a dot, it can  match  a  newline.
 466        The  feature  is provided in Perl in order to match individual bytes in
 467        UTF-8 mode. Because it  breaks  up  UTF-8  characters  into  individual
 468        bytes,  what remains in the string may be a malformed UTF-8 string. For
 469        this reason, the \C escape sequence is best avoided.
 470
 471        PCRE does not allow \C to appear in  lookbehind  assertions  (described
 472        below),  because  in UTF-8 mode this would make it impossible to calcu-
 473        late the length of the lookbehind.
 474
 475
 476 SQUARE BRACKETS AND CHARACTER CLASSES
 477
 478        An opening square bracket introduces a character class, terminated by a
 479        closing square bracket. A closing square bracket on its own is not spe-
 480        cial. If a closing square bracket is required as a member of the class,
 481        it  should  be  the first data character in the class (after an initial
 482        circumflex, if present) or escaped with a backslash.
 483
 484        A character class matches a single character in the subject.  In  UTF-8
 485        mode,  the character may occupy more than one byte. A matched character
 486        must be in the set of characters defined by the class, unless the first
 487        character  in  the  class definition is a circumflex, in which case the
 488        subject character must not be in the set defined by  the  class.  If  a
 489        circumflex  is actually required as a member of the class, ensure it is
 490        not the first character, or escape it with a backslash.
 491
 492        For example, the character class [aeiou] matches any lower case  vowel,
 493        while  [^aeiou]  matches  any character that is not a lower case vowel.
 494        Note that a circumflex is just a convenient notation for specifying the
 495        characters  that  are in the class by enumerating those that are not. A
 496        class that starts with a circumflex is not an assertion: it still  con-
 497        sumes  a  character  from the subject string, and therefore it fails if
 498        the current pointer is at the end of the string.
 499
 500        In UTF-8 mode, characters with values greater than 255 can be  included
 501        in  a  class as a literal string of bytes, or by using the \x{ escaping
 502        mechanism.
 503
 504        When caseless matching is set, any letters in a  class  represent  both
 505        their  upper  case  and lower case versions, so for example, a caseless
 506        [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
 507        match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
 508        understands the concept of case for characters whose  values  are  less
 509        than  128, so caseless matching is always possible. For characters with
 510        higher values, the concept of case is supported  if  PCRE  is  compiled
 511        with  Unicode  property support, but not otherwise.  If you want to use
 512        caseless matching for characters 128 and above, you  must  ensure  that
 513        PCRE  is  compiled  with Unicode property support as well as with UTF-8
 514        support.
 515
 516        The newline character is never treated in any special way in  character
 517        classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE
 518        options is. A class such as [^a] will always match a newline.
 519
 520        The minus (hyphen) character can be used to specify a range of  charac-
 521        ters  in  a  character  class.  For  example,  [d-m] matches any letter
 522        between d and m, inclusive. If a  minus  character  is  required  in  a
 523        class,  it  must  be  escaped  with a backslash or appear in a position
 524        where it cannot be interpreted as indicating a range, typically as  the
 525        first or last character in the class.
 526
 527        It is not possible to have the literal character "]" as the end charac-
 528        ter of a range. A pattern such as [W-]46] is interpreted as a class  of
 529        two  characters ("W" and "-") followed by a literal string "46]", so it
 530        would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
 531        backslash  it is interpreted as the end of range, so [W-\]46] is inter-
 532        preted as a class containing a range followed by two other  characters.
 533        The  octal or hexadecimal representation of "]" can also be used to end
 534        a range.
 535
 536        Ranges operate in the collating sequence of character values. They  can
 537        also   be  used  for  characters  specified  numerically,  for  example
 538        [\000-\037]. In UTF-8 mode, ranges can include characters whose  values
 539        are greater than 255, for example [\x{100}-\x{2ff}].
 540
 541        If a range that includes letters is used when caseless matching is set,
 542        it matches the letters in either case. For example, [W-c] is equivalent
 543        to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
 544        character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
 545        accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
 546        concept of case for characters with values greater than 128  only  when
 547        it is compiled with Unicode property support.
 548
 549        The  character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
 550        in a character class, and add the characters that  they  match  to  the
 551        class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
 552        flex can conveniently be used with the upper case  character  types  to
 553        specify  a  more  restricted  set of characters than the matching lower
 554        case type. For example, the class [^\W_] matches any letter  or  digit,
 555        but not underscore.
 556
 557        The  only  metacharacters  that are recognized in character classes are
 558        backslash, hyphen (only where it can be  interpreted  as  specifying  a
 559        range),  circumflex  (only  at the start), opening square bracket (only
 560        when it can be interpreted as introducing a POSIX class name - see  the
 561        next  section),  and  the  terminating closing square bracket. However,
 562        escaping other non-alphanumeric characters does no harm.
 563
 564
 565 POSIX CHARACTER CLASSES
 566
 567        Perl supports the POSIX notation for character classes. This uses names
 568        enclosed  by  [: and :] within the enclosing square brackets. PCRE also
 569        supports this notation. For example,
 570
 571          [01[:alpha:]%]
 572
 573        matches "0", "1", any alphabetic character, or "%". The supported class
 574        names are
 575
 576          alnum    letters and digits
 577          alpha    letters
 578          ascii    character codes 0 - 127
 579          blank    space or tab only
 580          cntrl    control characters
 581          digit    decimal digits (same as \d)
 582          graph    printing characters, excluding space
 583          lower    lower case letters
 584          print    printing characters, including space
 585          punct    printing characters, excluding letters and digits
 586          space    white space (not quite the same as \s)
 587          upper    upper case letters
 588          word     "word" characters (same as \w)
 589          xdigit   hexadecimal digits
 590
 591        The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
 592        and space (32). Notice that this list includes the VT  character  (code
 593        11). This makes "space" different to \s, which does not include VT (for
 594        Perl compatibility).
 595
 596        The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
 597        from  Perl  5.8. Another Perl extension is negation, which is indicated
 598        by a ^ character after the colon. For example,
 599
 600          [12[:^digit:]]
 601
 602        matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
 603        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
 604        these are not supported, and an error is given if they are encountered.
 605
 606        In UTF-8 mode, characters with values greater than 128 do not match any
 607        of the POSIX character classes.
 608
 609
 610 VERTICAL BAR
 611
 612        Vertical bar characters are used to separate alternative patterns.  For
 613        example, the pattern
 614
 615          gilbert|sullivan
 616
 617        matches  either "gilbert" or "sullivan". Any number of alternatives may
 618        appear, and an empty  alternative  is  permitted  (matching  the  empty
 619        string).   The  matching  process  tries each alternative in turn, from
 620        left to right, and the first one that succeeds is used. If the alterna-
 621        tives  are within a subpattern (defined below), "succeeds" means match-
 622        ing the rest of the main pattern as well as the alternative in the sub-
 623        pattern.
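            The left-to-right rule can be sketched with Python's re module, which
            is assumed here to behave like PCRE for plain alternation. The
            patterns other than gilbert|sullivan are illustrative additions.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
composers = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group(0)

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the second alternative is chosen when "c" has to follow.
with_tail = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"cat(aract|)", "cat").group(1)
```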
 624
 625
 626 INTERNAL OPTION SETTING
 627
 628        The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
 629        PCRE_EXTENDED options can be changed  from  within  the  pattern  by  a
 630        sequence  of  Perl  option  letters  enclosed between "(?" and ")". The
 631        option letters are
 632
 633          i  for PCRE_CASELESS
 634          m  for PCRE_MULTILINE
 635          s  for PCRE_DOTALL
 636          x  for PCRE_EXTENDED
 637
 638        For example, (?im) sets caseless, multiline matching. It is also possi-
 639        ble to unset these options by preceding the letter with a hyphen, and a
 640        combined setting and unsetting such as (?im-sx), which sets
 641        PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
 642        PCRE_EXTENDED, is also permitted. If a letter appears both before and
 643        after the hyphen, the option is unset.
 644
 645        When  an option change occurs at top level (that is, not inside subpat-
 646        tern parentheses), the change applies to the remainder of  the  pattern
 647        that follows.  If the change is placed right at the start of a pattern,
 648        PCRE extracts it into the global options (and it will therefore show up
 649        in data extracted by the pcre_fullinfo() function).
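            A minimal sketch of an option setting at the very start of a pattern,
            using Python's re module as an assumed stand-in for PCRE. Python
            exposes the resulting options through the compiled pattern's flags
            attribute, which is only loosely comparable to what pcre_fullinfo()
            reports; the sample subject text is illustrative.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
pattern = re.compile(r"(?im)^abc$")

# The leading (?im) becomes part of the compiled pattern's global options.
has_caseless = bool(pattern.flags & re.IGNORECASE)
has_multiline = bool(pattern.flags & re.MULTILINE)

# Caseless, multiline matching: ^ and $ also match at internal line breaks.
found = pattern.findall("ABC\nabc\nabcd")
```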
 650
 651        An option change within a subpattern affects only that part of the cur-
 652        rent pattern that follows it, so
 653
 654          (a(?i)b)c
 655
 656        matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
 657        used).   By  this means, options can be made to have different settings
 658        in different parts of the pattern. Any changes made in one  alternative
 659        do  carry  on  into subsequent branches within the same subpattern. For
 660        example,
 661
 662          (a(?i)b|c)
 663
 664        matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
 665        first  branch  is  abandoned before the option setting. This is because
 666        the effects of option settings happen at compile time. There  would  be
 667        some very weird behaviour otherwise.
 668
 669        The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed
 670        in the same way as the Perl-compatible options by using the  characters
 671        U  and X respectively. The (?X) flag setting is special in that it must
 672        always occur earlier in the pattern than any of the additional features
 673        it  turns on, even when it is at top level. It is best to put it at the
 674        start.
 675
 676
 677 SUBPATTERNS
 678
 679        Subpatterns are delimited by parentheses (round brackets), which can be
 680        nested.  Turning part of a pattern into a subpattern does two things:
 681
 682        1. It localizes a set of alternatives. For example, the pattern
 683
 684          cat(aract|erpillar|)
 685
 686        matches  one  of the words "cat", "cataract", or "caterpillar". Without
 687        the parentheses, it would match "cataract",  "erpillar"  or  the  empty
 688        string.
 689
 690        2.  It  sets  up  the  subpattern as a capturing subpattern. This means
 691        that, when the whole pattern  matches,  that  portion  of  the  subject
 692        string that matched the subpattern is passed back to the caller via the
 693        ovector argument of pcre_exec(). Opening parentheses are  counted  from
 694        left  to  right  (starting  from 1) to obtain numbers for the capturing
 695        subpatterns.
 696
 697        For example, if the string "the red king" is matched against  the  pat-
 698        tern
 699
 700          the ((red|white) (king|queen))
 701
 702        the captured substrings are "red king", "red", and "king", and are num-
 703        bered 1, 2, and 3, respectively.
 704
 705        The fact that plain parentheses fulfil  two  functions  is  not  always
 706        helpful.   There are often times when a grouping subpattern is required
 707        without a capturing requirement. If an opening parenthesis is  followed
 708        by  a question mark and a colon, the subpattern does not do any captur-
 709        ing, and is not counted when computing the  number  of  any  subsequent
 710        capturing  subpatterns. For example, if the string "the white queen" is
 711        matched against the pattern
 712
 713          the ((?:red|white) (king|queen))
 714
 715        the captured substrings are "white queen" and "queen", and are numbered
 716        1  and 2. The maximum number of capturing subpatterns is 65535, and the
 717        maximum depth of nesting of all subpatterns, both  capturing  and  non-
 718        capturing, is 200.
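            The numbering of captured substrings in the two examples above can be
            checked with a short sketch. It assumes that Python's re module
            numbers groups in the same way as PCRE; the variable names are
            illustrative.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
capturing_groups = capturing.groups()

# (?: ... ) groups without capturing, so it is skipped when numbering.
mixed = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
mixed_groups = mixed.groups()
```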
 719
 720        As  a  convenient shorthand, if any option settings are required at the
 721        start of a non-capturing subpattern,  the  option  letters  may  appear
 722        between the "?" and the ":". Thus the two patterns
 723
 724          (?i:saturday|sunday)
 725          (?:(?i)saturday|sunday)
 726
 727        match exactly the same set of strings. Because alternative branches are
 728        tried from left to right, and options are not reset until  the  end  of
 729        the  subpattern is reached, an option setting in one branch does affect
 730        subsequent branches, so the above patterns match "SUNDAY"  as  well  as
 731        "Saturday".
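            The shorthand form can be tried in Python's re module, assumed here
            as a stand-in for PCRE. Only the (?i:...) form is shown, because
            Python treats an option setting in the middle of a pattern
            differently from PCRE. The second pattern is an illustrative
            addition.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
days = re.compile(r"(?i:saturday|sunday)")
day_results = [
    days.fullmatch(s) is not None for s in ("Saturday", "SUNDAY", "monday")
]

# The option is local to the group: text outside it stays case-sensitive.
scoped = re.compile(r"(?i:sun)day")
scoped_results = [
    scoped.fullmatch(s) is not None for s in ("SUNday", "SUNDAY")
]
```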
 732
 733
 734 NAMED SUBPATTERNS
 735
 736        Identifying  capturing  parentheses  by number is simple, but it can be
 737        very hard to keep track of the numbers in complicated  regular  expres-
 738        sions.  Furthermore,  if  an  expression  is  modified, the numbers may
 739        change. To help with this difficulty, PCRE supports the naming of  sub-
 740        patterns,  something  that  Perl  does  not  provide. The Python syntax
 741        (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
 742        underscores, and must be unique within a pattern.
 743
 744        Named  capturing  parentheses  are  still  allocated numbers as well as
 745        names. The PCRE API provides function calls for extracting the name-to-
 746        number  translation table from a compiled pattern. There is also a con-
 747        venience function for extracting a captured substring by name. For fur-
 748        ther details see the pcreapi documentation.
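            Because the syntax is borrowed from Python, the idea can be sketched
            directly with Python's re module. The groupindex attribute plays the
            role of the name-to-number table; it is Python's interface, not the
            PCRE API, and the pattern and names below are illustrative.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups are still allocated ordinary numbers.
name_table = dict(date.groupindex)

m = date.match("2005-08")
by_name = m.group("month")
by_number = m.group(2)
```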
 749
 750
 751 REPETITION
 752
 753        Repetition  is  specified  by  quantifiers, which can follow any of the
 754        following items:
 755
 756          a literal data character
 757          the . metacharacter
 758          the \C escape sequence
 759          the \X escape sequence (in UTF-8 mode with Unicode properties)
 760          an escape such as \d that matches a single character
 761          a character class
 762          a back reference (see next section)
 763          a parenthesized subpattern (unless it is an assertion)
 764
 765        The general repetition quantifier specifies a minimum and maximum  num-
 766        ber  of  permitted matches, by giving the two numbers in curly brackets
 767        (braces), separated by a comma. The numbers must be  less  than  65536,
 768        and the first must be less than or equal to the second. For example:
 769
 770          z{2,4}
 771
 772        matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
 773        special character. If the second number is omitted, but  the  comma  is
 774        present,  there  is  no upper limit; if the second number and the comma
 775        are both omitted, the quantifier specifies an exact number of  required
 776        matches. Thus
 777
 778          [aeiou]{3,}
 779
 780        matches at least 3 successive vowels, but may match many more, while
 781
 782          \d{8}
 783
 784        matches  exactly  8  digits. An opening curly bracket that appears in a
 785        position where a quantifier is not allowed, or one that does not  match
 786        the  syntax of a quantifier, is taken as a literal character. For exam-
 787        ple, {,6} is not a quantifier, but a literal string of four characters.
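            The quantifier examples above can be exercised as follows. This is a
            sketch that assumes Python's re module in place of PCRE; the {,6}
            case is deliberately left out because Python reads it as a
            quantifier, unlike PCRE. The helper name and the a{x} pattern are
            illustrative.

```python
import re


# Assumption: Python's re module stands in for PCRE in this sketch.
def whole(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


# z{2,4}: which run lengths from 1 to 6 are accepted?
z_counts = [n for n in range(1, 7) if whole(r"z{2,4}", "z" * n)]

# {3,} has no upper limit; {8} is an exact count.
vowels = [whole(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")]
digits = [whole(r"\d{8}", s) for s in ("1234567", "12345678", "123456789")]

# A brace that does not fit the quantifier syntax is a literal character.
literal_brace = whole(r"a{x}", "a{x}")
```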
 788
 789        In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
 790        individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
 791        acters, each of which is represented by a two-byte sequence. Similarly,
 792        when Unicode property support is available, \X{3} matches three Unicode
 793        extended  sequences,  each of which may be several bytes long (and they
 794        may be of different lengths).
 795
 796        The quantifier {0} is permitted, causing the expression to behave as if
 797        the previous item and the quantifier were not present.
 798
 799        For  convenience  (and  historical compatibility) the three most common
 800        quantifiers have single-character abbreviations:
 801
 802          *    is equivalent to {0,}
 803          +    is equivalent to {1,}
 804          ?    is equivalent to {0,1}
 805
 806        It is possible to construct infinite loops by  following  a  subpattern
 807        that can match no characters with a quantifier that has no upper limit,
 808        for example:
 809
 810          (a?)*
 811
 812        Earlier versions of Perl and PCRE used to give an error at compile time
 813        for  such  patterns. However, because there are cases where this can be
 814        useful, such patterns are now accepted, but if any  repetition  of  the
 815        subpattern  does in fact match no characters, the loop is forcibly bro-
 816        ken.
 817
 818        By default, the quantifiers are "greedy", that is, they match  as  much
 819        as  possible  (up  to  the  maximum number of permitted times), without
 820        causing the rest of the pattern to fail. The classic example  of  where
 821        this gives problems is in trying to match comments in C programs. These
 822        appear between /* and */ and within the comment,  individual  *  and  /
 823        characters  may  appear. An attempt to match C comments by applying the
 824        pattern
 825
 826          /\*.*\*/
 827
 828        to the string
 829
 830          /* first comment */  not comment  /* second comment */
 831
 832        fails, because it matches the entire string owing to the greediness  of
 833        the .*  item.
 834
 835        However,  if  a quantifier is followed by a question mark, it ceases to
 836        be greedy, and instead matches the minimum number of times possible, so
 837        the pattern
 838
 839          /\*.*?\*/
 840
 841        does  the  right  thing with the C comments. The meaning of the various
 842        quantifiers is not otherwise changed,  just  the  preferred  number  of
 843        matches.   Do  not  confuse this use of question mark with its use as a
 844        quantifier in its own right. Because it has two uses, it can  sometimes
 845        appear doubled, as in
 846
 847          \d??\d
 848
 849        which matches one digit by preference, but can match two if that is the
 850        only way the rest of the pattern matches.
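            The difference between the greedy and non-greedy forms can be seen in
            a short sketch, assuming Python's re module behaves like PCRE for
            these quantifiers. The subject strings for \d??\d are illustrative.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
source = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the string; .*? stops at the first one.
greedy = re.findall(r"/\*.*\*/", source)
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the pattern
# cannot match otherwise.
prefers_one = re.match(r"\d??\d", "12").group(0)
needs_two = re.match(r"\d??\dx", "12x").group(0)
```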
 851
 852        If the PCRE_UNGREEDY option is set (an option which is not available in
 853        Perl),  the  quantifiers are not greedy by default, but individual ones
 854        can be made greedy by following them with a  question  mark.  In  other
 855        words, it inverts the default behaviour.
 856
 857        When  a  parenthesized  subpattern  is quantified with a minimum repeat
 858        count that is greater than 1 or with a limited maximum, more memory  is
 859        required  for  the  compiled  pattern, in proportion to the size of the
 860        minimum or maximum.
 861
 862        If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
 863        alent  to Perl's /s) is set, thus allowing the . to match newlines, the
 864        pattern is implicitly anchored, because whatever follows will be  tried
 865        against  every character position in the subject string, so there is no
 866        point in retrying the overall match at any position  after  the  first.
 867        PCRE normally treats such a pattern as though it were preceded by \A.
 868
 869        In  cases  where  it  is known that the subject string contains no new-
 870        lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
 871        mization, or alternatively using ^ to indicate anchoring explicitly.
 872
 873        However,  there is one situation where the optimization cannot be used.
 874        When .*  is inside capturing parentheses that  are  the  subject  of  a
 875        backreference  elsewhere in the pattern, a match at the start may fail,
 876        and a later one succeed. Consider, for example:
 877
 878          (.*)abc\1
 879
 880        If the subject is "xyz123abc123" the match point is the fourth  charac-
 881        ter. For this reason, such a pattern is not implicitly anchored.
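            The match point in this example can be confirmed with a sketch that
            assumes Python's re module in place of PCRE. Note that the offset
            reported by Python is zero-based, so the fourth character has offset
            3.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
m = re.search(r"(.*)abc\1", "xyz123abc123")

match_offset = m.start()   # zero-based offset of the start of the match
captured = m.group(1)      # what (.*) held when the match succeeded
```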
 882
 883        When a capturing subpattern is repeated, the value captured is the sub-
 884        string that matched the final iteration. For example, after
 885
 886          (tweedle[dume]{3}\s*)+
 887
 888        has matched "tweedledum tweedledee" the value of the captured substring
 889        is  "tweedledee".  However,  if there are nested capturing subpatterns,
 890        the corresponding captured values may have been set in previous  itera-
 891        tions. For example, after
 892
 893          /(a|(b))+/
 894
 895        matches "aba" the value of the second captured substring is "b".
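            Both cases can be reproduced in a sketch, assuming that Python's re
            module records captures from repeated groups in the same way as PCRE
            for these two patterns. The slashes around the second pattern are
            delimiters and are not part of it.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
m = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)   # value from the final iteration only

# The inner group keeps the value set in an earlier iteration.
nested = re.match(r"(a|(b))+", "aba")
nested_groups = nested.groups()
```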
 896
 897
 898 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
 899
 900        With both maximizing and minimizing repetition, failure of what follows
 901        normally causes the repeated item to be re-evaluated to see if  a  dif-
 902        ferent number of repeats allows the rest of the pattern to match. Some-
 903        times it is useful to prevent this, either to change the nature of  the
 904        match, or to cause it to fail earlier than it otherwise might, when the
 905        author of the pattern knows there is no point in carrying on.
 906
 907        Consider, for example, the pattern \d+foo when applied to  the  subject
 908        line
 909
 910          123456bar
 911
 912        After matching all 6 digits and then failing to match "foo", the normal
 913        action of the matcher is to try again with only 5 digits  matching  the
 914        \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
 915        "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
 916        the  means for specifying that once a subpattern has matched, it is not
 917        to be re-evaluated in this way.
 918
 919        If we use atomic grouping for the previous example, the  matcher  would
 920        give up immediately on failing to match "foo" the first time. The nota-
 921        tion is a kind of special parenthesis, starting with  (?>  as  in  this
 922        example:
 923
 924          (?>\d+)foo
 925
 926        This  kind  of  parenthesis "locks up" the  part of the pattern it con-
 927        tains once it has matched, and a failure further into  the  pattern  is
 928        prevented  from  backtracking into it. Backtracking past it to previous
 929        items, however, works as normal.
 930
 931        An alternative description is that a subpattern of  this  type  matches
 932        the  string  of  characters  that an identical standalone pattern would
 933        match, if anchored at the current point in the subject string.
 934
 935        Atomic grouping subpatterns are not capturing subpatterns. Simple cases
 936        such as the above example can be thought of as a maximizing repeat that
 937        must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
 938        pared  to  adjust  the number of digits they match in order to make the
 939        rest of the pattern match, (?>\d+) can only match an entire sequence of
 940        digits.
 941
 942        Atomic  groups in general can of course contain arbitrarily complicated
 943        subpatterns, and can be nested. However, when  the  subpattern  for  an
 944        atomic group is just a single repeated item, as in the example above, a
 945        simpler notation, called a "possessive quantifier" can  be  used.  This
 946        consists  of  an  additional  + character following a quantifier. Using
 947        this notation, the previous example can be rewritten as
 948
 949          \d++foo
 950
 951        Possessive  quantifiers  are  always  greedy;  the   setting   of   the
 952        PCRE_UNGREEDY option is ignored. They are a convenient notation for the
 953        simpler forms of atomic group. However, there is no difference  in  the
 954        meaning  or  processing  of  a possessive quantifier and the equivalent
 955        atomic group.
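            A sketch of how an atomic group changes the nature of a match, under
            two assumptions: that Python's re module stands in for PCRE, and that
            the interpreter is Python 3.11 or later, the first release whose re
            module accepts (?>...) and possessive quantifiers. The pattern with a
            trailing \d is an illustrative addition.

```python
import re
import sys

# Assumption: Python's re module stands in for PCRE in this sketch.
# (?>...) and ++ are accepted by re only from Python 3.11 onwards.
ATOMIC_SUPPORTED = sys.version_info >= (3, 11)

# Ordinary repetition gives back one digit so that the final \d can match.
backtracking = re.fullmatch(r"\d+\d", "1234") is not None

if ATOMIC_SUPPORTED:
    # The atomic group and the possessive quantifier keep every digit,
    # so nothing is left for the final \d and the match fails.
    atomic = re.fullmatch(r"(?>\d+)\d", "1234") is not None
    possessive = re.fullmatch(r"\d++\d", "1234") is not None
else:
    atomic = possessive = None   # cannot be tried on this interpreter
```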
 956
 957        The possessive quantifier syntax is an extension to the Perl syntax. It
 958        originates in Sun's Java package.
 959
 960        When  a  pattern  contains an unlimited repeat inside a subpattern that
 961        can itself be repeated an unlimited number of  times,  the  use  of  an
 962        atomic  group  is  the  only way to avoid some failing matches taking a
 963        very long time indeed. The pattern
 964
 965          (\D+|<\d+>)*[!?]
 966
 967        matches an unlimited number of substrings that either consist  of  non-
 968        digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
 969        matches, it runs quickly. However, if it is applied to
 970
 971          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 972
 973        it takes a long time before reporting  failure.  This  is  because  the
 974        string  can be divided between the internal \D+ repeat and the external
 975        * repeat in a large number of ways, and all  have  to  be  tried.  (The
 976        example  uses  [!?]  rather than a single character at the end, because
 977        both PCRE and Perl have an optimization that allows  for  fast  failure
 978        when  a single character is used. They remember the last single charac-
 979        ter that is required for a match, and fail early if it is  not  present
 980        in  the  string.)  If  the pattern is changed so that it uses an atomic
 981        group, like this:
 982
 983          ((?>\D+)|<\d+>)*[!?]
 984
 985        sequences of non-digits cannot be broken, and failure happens  quickly.
 986
 987
 988 BACK REFERENCES
 989
 990        Outside a character class, a backslash followed by a digit greater than
 991        0 (and possibly further digits) is a back reference to a capturing sub-
 992        pattern  earlier  (that is, to its left) in the pattern, provided there
 993        have been that many previous capturing left parentheses.
 994
 995        However, if the decimal number following the backslash is less than 10,
 996        it  is  always  taken  as a back reference, and causes an error only if
 997        there are not that many capturing left parentheses in the  entire  pat-
 998        tern.  In  other words, the parentheses that are referenced need not be
 999        to the left of the reference for numbers less than 10. See the  subsec-
1000        tion  entitled  "Non-printing  characters" above for further details of
1001        the handling of digits following a backslash.
1002
1003        A back reference matches whatever actually matched the  capturing  sub-
1004        pattern  in  the  current subject string, rather than anything matching
1005        the subpattern itself (see "Subpatterns as subroutines" below for a way
1006        of doing that). So the pattern
1007
1008          (sens|respons)e and \1ibility
1009
1010        matches  "sense and sensibility" and "response and responsibility", but
1011        not "sense and responsibility". If caseful matching is in force at  the
1012        time  of the back reference, the case of letters is relevant. For exam-
1013        ple,
1014
1015          ((?i)rah)\s+\1
1016
1017        matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
1018        original capturing subpattern is matched caselessly.
1019
1020        Back  references  to named subpatterns use the Python syntax (?P=name).
1021        We could rewrite the above example as follows:
1022
1023          (?P<p1>(?i)rah)\s+(?P=p1)
1024
1025        There may be more than one back reference to the same subpattern. If  a
1026        subpattern  has  not actually been used in a particular match, any back
1027        references to it always fail. For example, the pattern
1028
1029          (a|(bc))\2
1030
1031        always fails if it starts to match "a" rather than "bc". Because  there
1032        may  be  many  capturing parentheses in a pattern, all digits following
1033        the backslash are taken as part of a potential back  reference  number.
1034        If the pattern continues with a digit character, some delimiter must be
1035        used to terminate the back reference. If the  PCRE_EXTENDED  option  is
1036        set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
1037        ments" below) can be used.
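            The behaviour of back references described above can be sketched with
            Python's re module, assumed here as a stand-in for PCRE. The named
            pattern is a simplified, illustrative variant of the "rah" example
            without the internal option setting.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
pair = re.compile(r"(sens|respons)e and \1ibility")
pair_results = [
    pair.fullmatch(s) is not None
    for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
]

# Named form of a back reference, using the Python syntax.
named = re.compile(r"(?P<word>rah)\s+(?P=word)")
named_ok = named.fullmatch("rah rah") is not None

# Group 2 is unset when the first alternative is taken, so \2 fails.
unset = re.compile(r"(a|(bc))\2")
unset_results = [unset.match(s) is not None for s in ("aa", "abc", "bcbc")]
```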
1038
1039        A back reference that occurs inside the parentheses to which it  refers
1040        fails  when  the subpattern is first used, so, for example, (a\1) never
1041        matches.  However, such references can be useful inside  repeated  sub-
1042        patterns. For example, the pattern
1043
1044          (a|b\1)+
1045
1046        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1047        ation of the subpattern,  the  back  reference  matches  the  character
1048        string  corresponding  to  the previous iteration. In order for this to
1049        work, the pattern must be such that the first iteration does  not  need
1050        to  match the back reference. This can be done using alternation, as in
1051        the example above, or by a quantifier with a minimum of zero.
1052
1053
1054 ASSERTIONS
1055
1056        An assertion is a test on the characters  following  or  preceding  the
1057        current  matching  point that does not actually consume any characters.
1058        The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
1059        described above.
1060
1061        More  complicated  assertions  are  coded as subpatterns. There are two
1062        kinds: those that look ahead of the current  position  in  the  subject
1063        string,  and  those  that  look  behind  it. An assertion subpattern is
1064        matched in the normal way, except that it does not  cause  the  current
1065        matching position to be changed.
1066
1067        Assertion  subpatterns  are  not  capturing subpatterns, and may not be
1068        repeated, because it makes no sense to assert the  same  thing  several
1069        times.  If  any kind of assertion contains capturing subpatterns within
1070        it, these are counted for the purposes of numbering the capturing  sub-
1071        patterns in the whole pattern.  However, substring capturing is carried
1072        out only for positive assertions, because it does not  make  sense  for
1073        negative assertions.
1074
1075    Lookahead assertions
1076
1077        Lookahead assertions start with (?= for positive assertions and (?! for
1078        negative assertions. For example,
1079
1080          \w+(?=;)
1081
1082        matches a word followed by a semicolon, but does not include the  semi-
1083        colon in the match, and
1084
1085          foo(?!bar)
1086
1087        matches  any  occurrence  of  "foo" that is not followed by "bar". Note
1088        that the apparently similar pattern
1089
1090          (?!foo)bar
1091
1092        does not find an occurrence of "bar"  that  is  preceded  by  something
1093        other  than "foo"; it finds any occurrence of "bar" whatsoever, because
1094        the assertion (?!foo) is always true when the next three characters are
1095        "bar". A lookbehind assertion is needed to achieve the other effect.
1096
1097        If you want to force a matching failure at some point in a pattern, the
1098        most convenient way to do it is  with  (?!)  because  an  empty  string
1099        always  matches, so an assertion that requires there not to be an empty
1100        string must always fail.
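            The lookahead examples can be tried in a sketch that assumes Python's
            re module treats these assertions as PCRE does. The subject strings
            are illustrative.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
# The semicolon is tested for but is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "alpha beta; gamma").group(0)

# Only the "foo" that is not followed by "bar" is found.
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds every "bar", whatever precedes it.
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]

# (?!) can never succeed, so it forces a failure at that point.
forced_failure = re.search(r"a(?!)", "aaa")
```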
1101
1102    Lookbehind assertions
1103
1104        Lookbehind assertions start with (?<= for positive assertions and  (?<!
1105        for negative assertions. For example,
1106
1107          (?<!foo)bar
1108
1109        does  find  an  occurrence  of "bar" that is not preceded by "foo". The
1110        contents of a lookbehind assertion are restricted  such  that  all  the
1111        strings it matches must have a fixed length. However, if there are sev-
1112        eral alternatives, they do not all have to have the same fixed  length.
1113        Thus
1114
1115          (?<=bullock|donkey)
1116
1117        is permitted, but
1118
1119          (?<!dogs?|cats?)
1120
1121        causes  an  error at compile time. Branches that match different length
1122        strings are permitted only at the top level of a lookbehind  assertion.
1123        This  is  an  extension  compared  with  Perl (at least for 5.8), which
1124        requires all branches to match the same length of string. An  assertion
1125        such as
1126
1127          (?<=ab(c|de))
1128
1129        is  not  permitted,  because  its single top-level branch can match two
1130        different lengths, but it is acceptable if rewritten to  use  two  top-
1131        level branches:
1132
1133          (?<=abc|abde)
1134
1135        The  implementation  of lookbehind assertions is, for each alternative,
1136        to temporarily move the current position back by the  fixed  width  and
1137        then try to match. If there are insufficient characters before the cur-
1138        rent position, the match is deemed to fail.
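            A sketch of lookbehind behaviour, assuming Python's re module in
            place of PCRE. Python is stricter than PCRE on one point: every
            alternative in a lookbehind must have the same length, so the
            (?<=bullock|donkey) example is not used here. The helper name and
            the subject strings are illustrative.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
# Only occurrences of "bar" that are not preceded by "foo" are found.
not_after_foo = [
    m.start() for m in re.finditer(r"(?<!foo)bar", "foobar xyzbar bar")
]

# With too few characters before the current position, the lookbehind
# itself fails, so the positive assertion cannot succeed at offset 0.
too_short = re.match(r"(?<=foo)bar", "bar")


def compiles(pattern):
    """Return True if the pattern is accepted by the compiler."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


fixed_length_ok = compiles(r"(?<=abc|xyz)d")
variable_length_ok = compiles(r"(?<!dogs?|cats?)x")
```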
1139
1140        PCRE does not allow the \C escape (which matches a single byte in UTF-8
1141        mode)  to appear in lookbehind assertions, because it makes it impossi-
1142        ble to calculate the length of the lookbehind. The \X escape, which can
1143        match different numbers of bytes, is also not permitted.
1144
1145        Atomic  groups can be used in conjunction with lookbehind assertions to
1146        specify efficient matching at the end of the subject string. Consider a
1147        simple pattern such as
1148
1149          abcd$
1150
1151        when  applied  to  a  long string that does not match. Because matching
1152        proceeds from left to right, PCRE will look for each "a" in the subject
1153        and  then  see  if what follows matches the rest of the pattern. If the
1154        pattern is specified as
1155
1156          ^.*abcd$
1157
1158        the initial .* matches the entire string at first, but when this  fails
1159        (because there is no following "a"), it backtracks to match all but the
1160        last character, then all but the last two characters, and so  on.  Once
1161        again  the search for "a" covers the entire string, from right to left,
1162        so we are no better off. However, if the pattern is written as
1163
1164          ^(?>.*)(?<=abcd)
1165
1166        or, equivalently, using the possessive quantifier syntax,
1167
1168          ^.*+(?<=abcd)
1169
1170        there can be no backtracking for the .* item; it  can  match  only  the
1171        entire  string.  The subsequent lookbehind assertion does a single test
1172        on the last four characters. If it fails, the match fails  immediately.
1173        For  long  strings, this approach makes a significant difference to the
1174        processing time.
1175
1176    Using multiple assertions
1177
1178        Several assertions (of any sort) may occur in succession. For example,
1179
1180          (?<=\d{3})(?<!999)foo
1181
1182        matches "foo" preceded by three digits that are not "999". Notice  that
1183        each  of  the  assertions is applied independently at the same point in
1184        the subject string. First there is a  check  that  the  previous  three
1185        characters  are  all  digits,  and  then there is a check that the same
1186        three characters are not "999". This pattern does not match "foo"
1187        preceded by six characters, the first three of which are digits and the
1188        last three of which are not "999". For example, it doesn't match
1189        "123abcfoo". A pattern to do that is
1190
1191          (?<=\d{3}...)(?<!999)foo
1192
1193        This  time  the  first assertion looks at the preceding six characters,
1194        checking that the first three are digits, and then the second assertion
1195        checks that the preceding three characters are not "999".
1196
1197        Assertions can be nested in any combination. For example,
1198
1199          (?<=(?<!foo)bar)baz
1200
1201        matches  an occurrence of "baz" that is preceded by "bar" which in turn
1202        is not preceded by "foo", while
1203
1204          (?<=\d{3}(?!999)...)foo
1205
1206        is another pattern that matches "foo" preceded by three digits and  any
1207        three characters that are not "999".
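            The nested forms can be checked in the same way. This is a sketch
            using Python's re module, under the assumption that it treats
            nested fixed-length assertions as PCRE does; the variable names
            and subject strings are made up for the example.

```python
import re

# "baz" preceded by "bar" that is not itself preceded by "foo".
bar_without_foo = re.compile(r"(?<=(?<!foo)bar)baz")
# "foo" preceded by three digits and then any three characters except "999".
nested_lookahead = re.compile(r"(?<=\d{3}(?!999)...)foo")


def matches(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


nested_results = {
    "xyzbarbaz": matches(bar_without_foo, "xyzbarbaz"),
    "foobarbaz": matches(bar_without_foo, "foobarbaz"),
    "123abcfoo": matches(nested_lookahead, "123abcfoo"),
    "123999foo": matches(nested_lookahead, "123999foo"),
}
```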
1208
1209
1210 CONDITIONAL SUBPATTERNS
1211
1212        It  is possible to cause the matching process to obey a subpattern con-
1213        ditionally or to choose between two alternative subpatterns,  depending
1214        on  the result of an assertion, or whether a previous capturing subpat-
1215        tern matched or not. The two possible forms of  conditional  subpattern
1216        are
1217
1218          (?(condition)yes-pattern)
1219          (?(condition)yes-pattern|no-pattern)
1220
1221        If  the  condition is satisfied, the yes-pattern is used; otherwise the
1222        no-pattern (if present) is used. If there are more  than  two  alterna-
1223        tives in the subpattern, a compile-time error occurs.
1224
1225        There are three kinds of condition. If the text between the parentheses
1226        consists of a sequence of digits, the condition  is  satisfied  if  the
1227        capturing  subpattern of that number has previously matched. The number
1228        must be greater than zero. Consider the following pattern,  which  con-
1229        tains  non-significant white space to make it more readable (assume the
1230        PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of
1231        discussion:
1232
1233          ( \( )?    [^()]+    (?(1) \) )
1234
1235        The  first  part  matches  an optional opening parenthesis, and if that
1236        character is present, sets it as the first captured substring. The sec-
1237        ond  part  matches one or more characters that are not parentheses. The
1238        third part is a conditional subpattern that tests whether the first set
1239        of parentheses matched or not. If they did, that is, if the subject
1240        started with an opening parenthesis, the condition is true, and so the
1241        yes-pattern is executed and a closing parenthesis is required.
1242        Otherwise, since no-pattern is not present, the subpattern matches
1243        nothing. In other words, this pattern matches a sequence of
1244        non-parentheses, optionally enclosed in parentheses.
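            A minimal sketch of this pattern in use, written with Python's re
            module, which supports the same numbered-condition syntax. The
            re.VERBOSE flag stands in for PCRE_EXTENDED, and the function name
            is hypothetical.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED: white space in the pattern
# is ignored.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return optional_parens.fullmatch(subject) is not None
```

            With this definition, whole_match("(abc)") and whole_match("abc")
            succeed, while "(abc" and "abc)" are rejected because the closing
            parenthesis is required exactly when the opening one was captured.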
1245
1246        If the condition is the string (R), it is satisfied if a recursive call
1247        to the pattern or subpattern has been made. At "top level", the
1248        condition is false. This is a PCRE extension. Recursive patterns are
1249        described in the RECURSIVE PATTERNS section below.
1250
1251        If  the  condition  is  not  a sequence of digits or (R), it must be an
1252        assertion.  This may be a positive or negative lookahead or  lookbehind
1253        assertion.  Consider  this  pattern,  again  containing non-significant
1254        white space, and with the two alternatives on the second line:
1255
1256          (?(?=[^a-z]*[a-z])
1257          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
1258
1259        The condition  is  a  positive  lookahead  assertion  that  matches  an
1260        optional  sequence of non-letters followed by a letter. In other words,
1261        it tests for the presence of at least one letter in the subject.  If  a
1262        letter  is found, the subject is matched against the first alternative;
1263        otherwise it is  matched  against  the  second.  This  pattern  matches
1264        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1265        letters and dd are digits.
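            Python's re module does not accept an assertion as a condition, so
            the following sketch emulates the pattern rather than reproducing
            it: the lookahead guards the first alternative and its negation
            guards the second. This substitution is an assumption made for
            illustration; in PCRE the conditional form shown above can be used
            directly.

```python
import re

# Emulation of (?(?=[^a-z]*[a-z]) yes | no): the condition is written twice,
# once as a positive lookahead and once as a negative lookahead.
date_like = re.compile(
    r"""(?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}
          | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2} )""",
    re.VERBOSE,
)


def whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return date_like.fullmatch(subject) is not None
```

            Note that the condition inspects the rest of the subject, not only
            the part that is eventually matched: date_like.match("12-34-56 x")
            fails, because the trailing letter selects the first alternative.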
1266
1267
1268 COMMENTS
1269
1270        The sequence (?# marks the start of a comment that continues up to  the
1271        next  closing  parenthesis.  Nested  parentheses are not permitted. The
1272        characters that make up a comment play no part in the pattern  matching
1273        at all.
1274
1275        If  the PCRE_EXTENDED option is set, an unescaped # character outside a
1276        character class introduces a comment that continues up to the next new-
1277        line character in the pattern.
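            Both comment forms can be sketched with Python's re module, whose
            re.VERBOSE flag behaves like PCRE_EXTENDED in this respect. The
            patterns themselves are invented for the example.

```python
import re

# An inline (?#...) comment is ignored wherever it appears.
inline = re.compile(r"abc(?#this text plays no part in matching)def")

# With re.VERBOSE, an unescaped # outside a character class starts a
# comment that runs to the end of the line.
extended = re.compile(
    r"""
    \d{3}   # three digits
    [#-]    # a literal hash or hyphen; inside a class, no comment starts
    \d{2}   # two digits
    """,
    re.VERBOSE,
)
```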
1278
1279
1280 RECURSIVE PATTERNS
1281
1282        Consider  the problem of matching a string in parentheses, allowing for
1283        unlimited nested parentheses. Without the use of  recursion,  the  best
1284        that  can  be  done  is  to use a pattern that matches up to some fixed
1285        depth of nesting. It is not possible to  handle  an  arbitrary  nesting
1286        depth.  Perl  provides  a  facility  that allows regular expressions to
1287        recurse (amongst other things). It does this by interpolating Perl code
1288        in the expression at run time, and the code can refer to the expression
1289        itself. A Perl pattern to solve the parentheses problem can be  created
1290        like this:
1291
1292          $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1293
1294        The (?p{...}) item interpolates Perl code at run time, and in this case
1295        refers recursively to the pattern in which it appears. Obviously,  PCRE
1296        cannot  support  the  interpolation  of Perl code. Instead, it supports
1297        some special syntax for recursion of the entire pattern, and  also  for
1298        individual subpattern recursion.
1299
1300        The  special item that consists of (? followed by a number greater than
1301        zero and a closing parenthesis is a recursive call of the subpattern of
1302        the  given  number, provided that it occurs inside that subpattern. (If
1303        not, it is a "subroutine" call, which is described  in  the  next  sec-
1304        tion.)  The special item (?R) is a recursive call of the entire regular
1305        expression.
1306
1307        For example, this PCRE pattern solves the  nested  parentheses  problem
1308        (assume  the  PCRE_EXTENDED  option  is  set  so  that  white  space is
1309        ignored):
1310
1311          \( ( (?>[^()]+) | (?R) )* \)
1312
1313        First it matches an opening parenthesis. Then it matches any number  of
1314        substrings  which  can  either  be  a sequence of non-parentheses, or a
1315        recursive match of the pattern itself (that is, a correctly
1316        parenthesized substring). Finally there is a closing parenthesis.
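            Python's re module has no (?R) item, so the sketch below is not a
            regular expression at all. It is a hand-written recursive function
            (a hypothetical helper, not PCRE code) that follows the three steps
            just described, to show what the recursive pattern accepts when it
            is applied at a given position.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                      # the opening \( is required
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1               # the closing \)
        if ch == "(":
            end = match_parens(subject, pos)   # plays the role of (?R)
            if end is None:
                return None
            pos = end
        else:
            pos += 1                     # part of a run of non-parentheses
    return None                          # input ended before the closing \)
```

            For example, match_parens("(ab(cd)ef)") returns 10, the length of
            the whole string, whereas match_parens("(a(b)") returns None.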
1317
1318        If  this  were  part of a larger pattern, you would not want to recurse
1319        the entire pattern, so instead you could use this:
1320
1321          ( \( ( (?>[^()]+) | (?1) )* \) )
1322
1323        We have put the pattern into parentheses, and caused the  recursion  to
1324        refer  to them instead of the whole pattern. In a larger pattern, keep-
1325        ing track of parenthesis numbers can be tricky. It may be  more  conve-
1326        nient  to use named parentheses instead. For this, PCRE uses (?P>name),
1327        which is an extension to the Python syntax that  PCRE  uses  for  named
1328        parentheses (Perl does not provide named parentheses). We could rewrite
1329        the above example as follows:
1330
1331          (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
1332
1333        This particular example pattern contains nested unlimited repeats,  and
1334        so  the  use of atomic grouping for matching strings of non-parentheses
1335        is important when applying the pattern to strings that  do  not  match.
1336        For example, when this pattern is applied to
1337
1338          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1339
1340        it  yields "no match" quickly. However, if atomic grouping is not used,
1341        the match runs for a very long time indeed because there  are  so  many
1342        different  ways  the  + and * repeats can carve up the subject, and all
1343        have to be tested before failure can be reported.
1344
1345        At the end of a match, the values set for any capturing subpatterns are
1346        those from the outermost level of the recursion at which the subpattern
1347        value is set. If you want to obtain intermediate values, a callout
1348        function can be used (see the CALLOUTS section below and the
1349        pcrecallout documentation). If the pattern above is matched against
1350
1351          (ab(cd)ef)
1352
1353        the value for the capturing parentheses is  "ef",  which  is  the  last
1354        value  taken  on at the top level. If additional parentheses are added,
1355        giving
1356
1357          \( ( ( (?>[^()]+) | (?R) )* ) \)
1358             ^                        ^
1360
1361        the string they capture is "ab(cd)ef", the contents of  the  top  level
1362        parentheses.  If there are more than 15 capturing parentheses in a pat-
1363        tern, PCRE has to obtain extra memory to store data during a recursion,
1364        which  it  does  by  using pcre_malloc, freeing it via pcre_free after-
1365        wards. If  no  memory  can  be  obtained,  the  match  fails  with  the
1366        PCRE_ERROR_NOMEMORY error.
1367
1368        Do  not  confuse  the (?R) item with the condition (R), which tests for
1369        recursion.  Consider this pattern, which matches text in  angle  brack-
1370        ets,  allowing for arbitrary nesting. Only digits are allowed in nested
1371        brackets (that is, when recursing), whereas any characters are  permit-
1372        ted at the outer level.
1373
1374          < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
1375
1376        In  this  pattern, (?(R) is the start of a conditional subpattern, with
1377        two different alternatives for the recursive and  non-recursive  cases.
1378        The (?R) item is the actual recursive call.
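            The difference between the two items can be sketched with another
            hand-written recursive function in Python (again a hypothetical
            helper, since Python's re module supports neither item). The
            nested argument plays the role of the (R) condition, and the
            recursive call plays the role of (?R).

```python
def match_angle(subject, pos=0, nested=False):
    """Return the index just past a bracketed group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "<":
        return None
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ">":
            return pos + 1
        if ch == "<":
            end = match_angle(subject, pos, nested=True)   # the (?R) call
            if end is None:
                return None
            pos = end
        elif nested and not ("0" <= ch <= "9"):
            return None          # (R) is true: only digits are allowed here
        else:
            pos += 1             # outer level: any character except < or >
    return None                  # input ended before the closing >
```

            Here match_angle("<abc<123>def>") returns 13, but
            match_angle("<abc<12a>def>") returns None because of the letter
            inside the nested brackets.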
1379
1380
1381 SUBPATTERNS AS SUBROUTINES
1382
1383        If the syntax for a recursive subpattern reference (either by number or
1384        by name) is used outside the parentheses to which it refers,  it  oper-
1385        ates  like  a  subroutine in a programming language. An earlier example
1386        pointed out that the pattern
1387
1388          (sens|respons)e and \1ibility
1389
1390        matches "sense and sensibility" and "response and responsibility",  but
1391        not "sense and responsibility". If instead the pattern
1392
1393          (sens|respons)e and (?1)ibility
1394
1395        is  used, it does match "sense and responsibility" as well as the other
1396        two strings. Such references must, however, follow  the  subpattern  to
1397        which they refer.
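            Python's re module supports back references but not the (?1)
            syntax, so this sketch imitates the subroutine call by splicing
            the text of the subpattern into the pattern a second time. That
            substitution, and the names used, are assumptions made purely to
            show the difference in behaviour.

```python
import re

# The alternatives that group 1 of the original pattern contains.
STEM = r"(?:sens|respons)"

# Back reference: \1 must repeat the text that group 1 actually matched.
back_reference = re.compile(r"(sens|respons)e and \1ibility")
# Stand-in for (?1): the subpattern itself is used again, so either
# alternative may match at the second position.
subroutine_like = re.compile(rf"({STEM})e and {STEM}ibility")


def whole_match(pattern, subject):
    """Return True if the entire subject matches the pattern."""
    return pattern.fullmatch(subject) is not None
```

            Both patterns accept "sense and sensibility", but only the second
            accepts "sense and responsibility".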
1398
1399
1400 CALLOUTS
1401
1402        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1403        Perl code to be obeyed in the middle of matching a regular  expression.
1404        This makes it possible, amongst other things, to extract different sub-
1405        strings that match the same pair of parentheses when there is a repeti-
1406        tion.
1407
1408        PCRE provides a similar feature, but of course it cannot obey arbitrary
1409        Perl code. The feature is called "callout". The caller of PCRE provides
1410        an  external function by putting its entry point in the global variable
1411        pcre_callout.  By default, this variable contains NULL, which  disables
1412        all calling out.
1413
1414        Within  a  regular  expression,  (?C) indicates the points at which the
1415        external function is to be called. If you want  to  identify  different
1416        callout  points, you can put a number less than 256 after the letter C.
1417        The default value is zero.  For example, this pattern has  two  callout
1418        points:
1419
1420          (?C1)abc(?C2)def
1421
1422        If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
1423        automatically installed before each item in the pattern. They  are  all
1424        numbered 255.
1425
1426        During matching, when PCRE reaches a callout point (and pcre_callout is
1427        set), the external function is called. It is provided with  the  number
1428        of  the callout, the position in the pattern, and, optionally, one item
1429        of data originally supplied by the caller of pcre_exec().  The  callout
1430        function  may cause matching to proceed, to backtrack, or to fail alto-
1431        gether. A complete description of the interface to the callout function
1432        is given in the pcrecallout documentation.
1433
1434 Last updated: 28 February 2005
1435 Copyright (c) 1997-2005 University of Cambridge.