Apply John Jetmore's patch to allow tls-on-connect and STARTTLS to be

diff --git a/doc/doc-txt/pcrepattern.txt b/doc/doc-txt/pcrepattern.txt

index 1dc800af4e3fd35882b0ba8d9c69d98a714900f6..9712c86b41d24070ed528080610b078e8f507886 100644
--- a/doc/doc-txt/pcrepattern.txt
+++ b/doc/doc-txt/pcrepattern.txt
@@ -1,18 +1,18 @@
-This file contains the PCRE man page that describes the regular expressions 
-supported by PCRE version 5.0. Note that not all of the features are relevant 
+This file contains the PCRE man page that describes the regular expressions
+supported by PCRE version 6.7. Note that not all of the features are relevant
  in the context of Exim. In particular, the version of PCRE that is compiled
  with Exim does not include UTF-8 support, there is no mechanism for changing
  the options with which the PCRE functions are called, and features such as
  callout are not accessible.
  -----------------------------------------------------------------------------
  
-PCRE(3)                                                                PCRE(3)
-
+PCREPATTERN(3)                                                  PCREPATTERN(3)
  
  
  NAME
         PCRE - Perl-compatible regular expressions
  
+
  PCRE REGULAR EXPRESSION DETAILS
  
         The  syntax  and semantics of the regular expressions supported by PCRE
@@ -30,6 +30,14 @@ PCRE REGULAR EXPRESSION DETAILS
         of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
         page.
  
+       The remainder of this document discusses the  patterns  that  are  sup-
+       ported  by  PCRE when its main matching function, pcre_exec(), is used.
+       From  release  6.0,   PCRE   offers   a   second   matching   function,
+       pcre_dfa_exec(),  which matches using a different algorithm that is not
+       Perl-compatible. The advantages and disadvantages  of  the  alternative
+       function, and how it differs from the normal function, are discussed in
+       the pcrematching page.
+
         A regular expression is a pattern that is  matched  against  a  subject
         string  from  left  to right. Most characters stand for themselves in a
         pattern, and match the corresponding characters in the  subject.  As  a
@@ -37,15 +45,24 @@ PCRE REGULAR EXPRESSION DETAILS
  
           The quick brown fox
  
-       matches  a portion of a subject string that is identical to itself. The
-       power of regular expressions comes from the ability to include alterna-
-       tives  and repetitions in the pattern. These are encoded in the pattern
-       by the use of metacharacters, which do not  stand  for  themselves  but
-       instead are interpreted in some special way.
-
-       There  are  two different sets of metacharacters: those that are recog-
-       nized anywhere in the pattern except within square brackets, and  those
-       that  are  recognized  in square brackets. Outside square brackets, the
+       matches a portion of a subject string that is identical to itself. When
+       caseless matching is specified (the PCRE_CASELESS option), letters  are
+       matched  independently  of case. In UTF-8 mode, PCRE always understands
+       the concept of case for characters whose values are less than  128,  so
+       caseless  matching  is always possible. For characters with higher val-
+       ues, the concept of case is supported if PCRE is compiled with  Unicode
+       property  support,  but  not  otherwise.   If  you want to use caseless
+       matching for characters 128 and above, you must  ensure  that  PCRE  is
+       compiled with Unicode property support as well as with UTF-8 support.
+
+       The  power  of  regular  expressions  comes from the ability to include
+       alternatives and repetitions in the pattern. These are encoded  in  the
+       pattern by the use of metacharacters, which do not stand for themselves
+       but instead are interpreted in some special way.
+
+       There are two different sets of metacharacters: those that  are  recog-
+       nized  anywhere in the pattern except within square brackets, and those
+       that are recognized in square brackets. Outside  square  brackets,  the
         metacharacters are as follows:
  
           \      general escape character with several uses
@@ -64,7 +81,7 @@ PCRE REGULAR EXPRESSION DETAILS
                  also "possessive quantifier"
           {      start min/max quantifier
  
-       Part of a pattern that is in square brackets  is  called  a  "character
+       Part  of  a  pattern  that is in square brackets is called a "character
         class". In a character class the only metacharacters are:
  
           \      general escape character
@@ -74,33 +91,33 @@ PCRE REGULAR EXPRESSION DETAILS
                    syntax)
           ]      terminates the character class
  
-       The  following sections describe the use of each of the metacharacters.
+       The following sections describe the use of each of the  metacharacters.
  
  
  BACKSLASH
  
         The backslash character has several uses. Firstly, if it is followed by
-       a  non-alphanumeric  character,  it takes away any special meaning that
-       character may have. This  use  of  backslash  as  an  escape  character
+       a non-alphanumeric character, it takes away any  special  meaning  that
+       character  may  have.  This  use  of  backslash  as an escape character
         applies both inside and outside character classes.
  
-       For  example,  if  you want to match a * character, you write \* in the
-       pattern.  This escaping action applies whether  or  not  the  following
-       character  would  otherwise be interpreted as a metacharacter, so it is
-       always safe to precede a non-alphanumeric  with  backslash  to  specify
-       that  it stands for itself. In particular, if you want to match a back-
+       For example, if you want to match a * character, you write  \*  in  the
+       pattern.   This  escaping  action  applies whether or not the following
+       character would otherwise be interpreted as a metacharacter, so  it  is
+       always  safe  to  precede  a non-alphanumeric with backslash to specify
+       that it stands for itself. In particular, if you want to match a  back-
         slash, you write \\.
  
-       If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
-       the  pattern (other than in a character class) and characters between a
-       # outside a character class and the next newline character are ignored.
-       An  escaping backslash can be used to include a whitespace or # charac-
-       ter as part of the pattern.
+       If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
+       the pattern (other than in a character class) and characters between  a
+       # outside a character class and the next newline are ignored. An escap-
+       ing backslash can be used to include a whitespace  or  #  character  as
+       part of the pattern.
  
-       If you want to remove the special meaning from a  sequence  of  charac-
-       ters,  you can do so by putting them between \Q and \E. This is differ-
-       ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
-       sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
+       If  you  want  to remove the special meaning from a sequence of charac-
+       ters, you can do so by putting them between \Q and \E. This is  differ-
+       ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
+       sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
         tion. Note the following examples:
  
           Pattern            PCRE matches   Perl matches
@@ -110,16 +127,16 @@ BACKSLASH
           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
  
-       The \Q...\E sequence is recognized both inside  and  outside  character
+       The  \Q...\E  sequence  is recognized both inside and outside character
         classes.
  
     Non-printing characters
  
         A second use of backslash provides a way of encoding non-printing char-
-       acters in patterns in a visible manner. There is no restriction on  the
-       appearance  of non-printing characters, apart from the binary zero that
-       terminates a pattern, but when a pattern  is  being  prepared  by  text
-       editing,  it  is  usually  easier  to  use  one of the following escape
+       acters  in patterns in a visible manner. There is no restriction on the
+       appearance of non-printing characters, apart from the binary zero  that
+       terminates  a  pattern,  but  when  a pattern is being prepared by text
+       editing, it is usually easier  to  use  one  of  the  following  escape
         sequences than the binary character it represents:
  
           \a        alarm, that is, the BEL character (hex 07)
@@ -131,48 +148,48 @@ BACKSLASH
           \t        tab (hex 09)
           \ddd      character with octal code ddd, or backreference
           \xhh      character with hex code hh
-         \x{hhh..} character with hex code hhh... (UTF-8 mode only)
+         \x{hhh..} character with hex code hhh..
  
-       The precise effect of \cx is as follows: if x is a lower  case  letter,
-       it  is converted to upper case. Then bit 6 of the character (hex 40) is
-       inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
+       The  precise  effect of \cx is as follows: if x is a lower case letter,
+       it is converted to upper case. Then bit 6 of the character (hex 40)  is
+       inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
         becomes hex 7B.
  
-       After  \x, from zero to two hexadecimal digits are read (letters can be
-       in upper or lower case). In UTF-8 mode, any number of hexadecimal  dig-
-       its  may  appear between \x{ and }, but the value of the character code
-       must be less than 2**31 (that is,  the  maximum  hexadecimal  value  is
-       7FFFFFFF).  If  characters other than hexadecimal digits appear between
-       \x{ and }, or if there is no terminating }, this form of escape is  not
-       recognized. Instead, the initial \x will be interpreted as a basic hex-
-       adecimal escape, with no following digits,  giving  a  character  whose
-       value is zero.
+       After \x, from zero to two hexadecimal digits are read (letters can  be
+       in  upper  or  lower case). Any number of hexadecimal digits may appear
+       between \x{ and }, but the value of the character  code  must  be  less
+       than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
+       the maximum hexadecimal value is 7FFFFFFF). If  characters  other  than
+       hexadecimal  digits  appear between \x{ and }, or if there is no termi-
+       nating }, this form of escape is not recognized.  Instead, the  initial
+       \x will be interpreted as a basic hexadecimal escape, with no following
+       digits, giving a character whose value is zero.
  
         Characters whose value is less than 256 can be defined by either of the
-       two syntaxes for \x when PCRE is in UTF-8 mode. There is no  difference
-       in  the  way they are handled. For example, \xdc is exactly the same as
-       \x{dc}.
+       two  syntaxes  for  \x. There is no difference in the way they are han-
+       dled. For example, \xdc is exactly the same as \x{dc}.
  
-       After \0 up to two further octal digits are read.  In  both  cases,  if
-       there  are fewer than two digits, just those that are present are used.
-       Thus the sequence \0\x\07 specifies two binary zeros followed by a  BEL
-       character  (code  value  7).  Make sure you supply two digits after the
-       initial zero if the pattern character that follows is itself  an  octal
-       digit.
+       After \0 up to two further octal digits are read. If  there  are  fewer
+       than  two  digits,  just  those  that  are  present  are used. Thus the
+       sequence \0\x\07 specifies two binary zeros followed by a BEL character
+       (code  value 7). Make sure you supply two digits after the initial zero
+       if the pattern character that follows is itself an octal digit.
  
         The handling of a backslash followed by a digit other than 0 is compli-
         cated.  Outside a character class, PCRE reads it and any following dig-
-       its  as  a  decimal  number. If the number is less than 10, or if there
+       its as a decimal number. If the number is less than  10,  or  if  there
         have been at least that many previous capturing left parentheses in the
-       expression,  the  entire  sequence  is  taken  as  a  back reference. A
-       description of how this works is given later, following the  discussion
+       expression, the entire  sequence  is  taken  as  a  back  reference.  A
+       description  of how this works is given later, following the discussion
         of parenthesized subpatterns.
  
-       Inside  a  character  class, or if the decimal number is greater than 9
-       and there have not been that many capturing subpatterns, PCRE  re-reads
-       up  to three octal digits following the backslash, and generates a sin-
-       gle byte from the least significant 8 bits of the value. Any subsequent
-       digits stand for themselves.  For example:
+       Inside a character class, or if the decimal number is  greater  than  9
+       and  there have not been that many capturing subpatterns, PCRE re-reads
+       up to three octal digits following the backslash, and uses them to gen-
+       erate  a data character. Any subsequent digits stand for themselves. In
+       non-UTF-8 mode, the value of a character specified  in  octal  must  be
+       less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For
+       example:
  
           \040   is another way of writing a space
           \40    is the same, provided there are fewer than 40
@@ -189,15 +206,14 @@ BACKSLASH
           \81    is either a back reference, or a binary zero
                     followed by the two characters "8" and "1"
  
-       Note  that  octal  values of 100 or greater must not be introduced by a
+       Note that octal values of 100 or greater must not be  introduced  by  a
         leading zero, because no more than three octal digits are ever read.
  
-       All the sequences that define a single byte value  or  a  single  UTF-8
-       character (in UTF-8 mode) can be used both inside and outside character
-       classes. In addition, inside a character  class,  the  sequence  \b  is
-       interpreted as the backspace character (hex 08), and the sequence \X is
-       interpreted as the character "X".  Outside  a  character  class,  these
-       sequences have different meanings (see below).
+       All the sequences that define a single character value can be used both
+       inside and outside character classes. In addition, inside  a  character
+       class,  the  sequence \b is interpreted as the backspace character (hex
+       08), and the sequence \X is interpreted as the character "X". Outside a
+       character class, these sequences have different meanings (see below).
  
     Generic character types
  
@@ -222,7 +238,9 @@ BACKSLASH
  
         For  compatibility  with Perl, \s does not match the VT character (code
         11).  This makes it different from the POSIX "space" class. The  \s
-       characters are HT (9), LF (10), FF (12), CR (13), and space (32).
+       characters  are  HT (9), LF (10), FF (12), CR (13), and space (32). (If
+       "use locale;" is included in a Perl script, \s may match the VT charac-
+       ter. In PCRE, it never does.)
  
         A "word" character is an underscore or any character less than 256 that
         is a letter or digit. The definition of  letters  and  digits  is  con-
@@ -234,34 +252,59 @@ BACKSLASH
  
         In  UTF-8 mode, characters with values greater than 128 never match \d,
         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
-       code character property support is available.
+       code  character  property support is available. The use of locales with
+       Unicode is discouraged.
  
     Unicode character properties
  
         When PCRE is built with Unicode character property support, three addi-
-       tional escape sequences to match generic character types are  available
+       tional  escape  sequences  to  match character properties are available
         when UTF-8 mode is selected. They are:
  
-        \p{xx}   a character with the xx property
-        \P{xx}   a character without the xx property
-        \X       an extended Unicode sequence
-
-       The  property  names represented by xx above are limited to the Unicode
-       general category properties. Each character has exactly one such  prop-
-       erty,  specified  by  a two-letter abbreviation. For compatibility with
-       Perl, negation can be specified by including a circumflex  between  the
-       opening  brace  and the property name. For example, \p{^Lu} is the same
-       as \P{Lu}.
-
-       If only one letter is specified with \p or  \P,  it  includes  all  the
-       properties that start with that letter. In this case, in the absence of
-       negation, the curly brackets in the escape sequence are optional; these
-       two examples have the same effect:
+         \p{xx}   a character with the xx property
+         \P{xx}   a character without the xx property
+         \X       an extended Unicode sequence
+
+       The property names represented by xx above are limited to  the  Unicode
+       script names, the general category properties, and "Any", which matches
+       any character (including newline). Other properties such as "InMusical-
+       Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does
+       not match any characters, so always causes a match failure.
+
+       Sets of Unicode characters are defined as belonging to certain scripts.
+       A  character from one of these sets can be matched using a script name.
+       For example:
+
+         \p{Greek}
+         \P{Han}
+
+       Those that are not part of an identified script are lumped together  as
+       "Common". The current list of scripts is:
+
+       Arabic,  Armenian,  Bengali,  Bopomofo, Braille, Buginese, Buhid, Cana-
+       dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic,  Deseret,
+       Devanagari,  Ethiopic,  Georgian,  Glagolitic, Gothic, Greek, Gujarati,
+       Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,  Inherited,  Kannada,
+       Katakana,  Kharoshthi,  Khmer,  Lao, Latin, Limbu, Linear_B, Malayalam,
+       Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
+       Osmanya,  Runic,  Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag-
+       banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
+       Ugaritic, Yi.
+
+       Each  character has exactly one general category property, specified by
+       a two-letter abbreviation. For compatibility with Perl, negation can be
+       specified  by  including a circumflex between the opening brace and the
+       property name. For example, \p{^Lu} is the same as \P{Lu}.
+
+       If only one letter is specified with \p or \P, it includes all the gen-
+       eral  category properties that start with that letter. In this case, in
+       the absence of negation, the curly brackets in the escape sequence  are
+       optional; these two examples have the same effect:
  
           \p{L}
           \pL
  
-       The following property codes are supported:
+       The following general category property codes are supported:
  
           C     Other
           Cc    Control
@@ -307,33 +350,42 @@ BACKSLASH
           Zp    Paragraph separator
           Zs    Space separator
  
-       Extended  properties such as "Greek" or "InMusicalSymbols" are not sup-
-       ported by PCRE.
+       The  special property L& is also supported: it matches a character that
+       has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
+       classified as a modifier or "other".
+
+       The  long  synonyms  for  these  properties that Perl supports (such as
+       \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
+       any of these properties with "Is".
+
+       No character that is in the Unicode table has the Cn (unassigned) prop-
+       erty.  Instead, this property is assumed for any code point that is not
+       in the Unicode table.
  
-       Specifying caseless matching does not affect  these  escape  sequences.
+       Specifying  caseless  matching  does not affect these escape sequences.
         For example, \p{Lu} always matches only upper case letters.
  
-       The  \X  escape  matches  any number of Unicode characters that form an
+       The \X escape matches any number of Unicode  characters  that  form  an
         extended Unicode sequence. \X is equivalent to
  
           (?>\PM\pM*)
  
-       That is, it matches a character without the "mark"  property,  followed
-       by  zero  or  more  characters with the "mark" property, and treats the
-       sequence as an atomic group (see below).  Characters  with  the  "mark"
+       That  is,  it matches a character without the "mark" property, followed
+       by zero or more characters with the "mark"  property,  and  treats  the
+       sequence  as  an  atomic group (see below).  Characters with the "mark"
         property are typically accents that affect the preceding character.
  
-       Matching  characters  by Unicode property is not fast, because PCRE has
-       to search a structure that contains  data  for  over  fifteen  thousand
+       Matching characters by Unicode property is not fast, because  PCRE  has
+       to  search  a  structure  that  contains data for over fifteen thousand
         characters. That is why the traditional escape sequences such as \d and
         \w do not use Unicode properties in PCRE.
  
     Simple assertions
  
         The fourth use of backslash is for certain simple assertions. An asser-
-       tion  specifies a condition that has to be met at a particular point in
-       a match, without consuming any characters from the subject string.  The
-       use  of subpatterns for more complicated assertions is described below.
+       tion specifies a condition that has to be met at a particular point  in
+       a  match, without consuming any characters from the subject string. The
+       use of subpatterns for more complicated assertions is described  below.
         The backslashed assertions are:
  
           \b     matches at a word boundary
@@ -343,27 +395,26 @@ BACKSLASH
           \z     matches at end of subject
           \G     matches at first matching position in subject
  
-       These assertions may not appear in character classes (but note that  \b
+       These  assertions may not appear in character classes (but note that \b
         has a different meaning, namely the backspace character, inside a char-
         acter class).
  
-       A word boundary is a position in the subject string where  the  current
-       character  and  the previous character do not both match \w or \W (i.e.
-       one matches \w and the other matches \W), or the start or  end  of  the
+       A  word  boundary is a position in the subject string where the current
+       character and the previous character do not both match \w or  \W  (i.e.
+       one  matches  \w  and the other matches \W), or the start or end of the
         string if the first or last character matches \w, respectively.
  
-       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
+       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
         and dollar (described in the next section) in that they only ever match
-       at  the  very start and end of the subject string, whatever options are
-       set. Thus, they are independent of multiline mode. These  three  asser-
+       at the very start and end of the subject string, whatever  options  are
+       set.  Thus,  they are independent of multiline mode. These three asser-
         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
-       affect only the behaviour of the circumflex and dollar  metacharacters.
-       However,  if the startoffset argument of pcre_exec() is non-zero, indi-
+       affect  only the behaviour of the circumflex and dollar metacharacters.
+       However, if the startoffset argument of pcre_exec() is non-zero,  indi-
         cating that matching is to start at a point other than the beginning of
-       the  subject,  \A  can never match. The difference between \Z and \z is
-       that \Z matches before a newline that is  the  last  character  of  the
-       string  as well as at the end of the string, whereas \z matches only at
-       the end.
+       the subject, \A can never match. The difference between \Z  and  \z  is
+       that \Z matches before a newline at the end of the string as well as at
+       the very end, whereas \z matches only at the end.
  
         The \G assertion is true only when the current matching position is  at
         the  start point of the match, as specified by the startoffset argument
@@ -402,57 +453,70 @@ CIRCUMFLEX AND DOLLAR
  
         A dollar character is an assertion that is true  only  if  the  current
         matching  point  is  at  the  end of the subject string, or immediately
-       before a newline character that is the last character in the string (by
-       default).  Dollar  need  not  be the last character of the pattern if a
-       number of alternatives are involved, but it should be the last item  in
-       any  branch  in  which  it appears.  Dollar has no special meaning in a
-       character class.
-
-       The meaning of dollar can be changed so that it  matches  only  at  the
-       very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
+       before a newline at the end of the string (by default). Dollar need not
+       be  the  last  character of the pattern if a number of alternatives are
+       involved, but it should be the last item in  any  branch  in  which  it
+       appears. Dollar has no special meaning in a character class.
+
+       The  meaning  of  dollar  can be changed so that it matches only at the
+       very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
         compile time. This does not affect the \Z assertion.
  
         The meanings of the circumflex and dollar characters are changed if the
-       PCRE_MULTILINE option is set. When this is the case, they match immedi-
-       ately after and  immediately  before  an  internal  newline  character,
-       respectively,  in addition to matching at the start and end of the sub-
-       ject string. For example,  the  pattern  /^abc$/  matches  the  subject
-       string  "def\nabc"  (where \n represents a newline character) in multi-
-       line mode, but not otherwise.  Consequently, patterns that are anchored
-       in  single line mode because all branches start with ^ are not anchored
-       in multiline mode, and a match for  circumflex  is  possible  when  the
-       startoffset   argument   of  pcre_exec()  is  non-zero.  The  PCRE_DOL-
-       LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
+       PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
+       matches  immediately after internal newlines as well as at the start of
+       the subject string. It does not match after a  newline  that  ends  the
+       string.  A dollar matches before any newlines in the string, as well as
+       at the very end, when PCRE_MULTILINE is set. When newline is  specified
+       as  the  two-character  sequence CRLF, isolated CR and LF characters do
+       not indicate newlines.
+
+       For example, the pattern /^abc$/ matches the subject string  "def\nabc"
+       (where  \n  represents a newline) in multiline mode, but not otherwise.
+       Consequently, patterns that are anchored in single  line  mode  because
+       all  branches  start  with  ^ are not anchored in multiline mode, and a
+       match for circumflex is  possible  when  the  startoffset  argument  of
+       pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
+       PCRE_MULTILINE is set.
  
         Note that the sequences \A, \Z, and \z can be used to match  the  start
         and  end of the subject in both modes, and if all branches of a pattern
-       start with \A it is always anchored, whether PCRE_MULTILINE is  set  or
-       not.
+       start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
+       set.
  
  
  FULL STOP (PERIOD, DOT)
  
         Outside a character class, a dot in the pattern matches any one charac-
-       ter in the subject, including a non-printing  character,  but  not  (by
-       default)  newline.   In  UTF-8 mode, a dot matches any UTF-8 character,
-       which might be more than one byte long, except (by default) newline. If
-       the  PCRE_DOTALL  option  is set, dots match newlines as well. The han-
-       dling of dot is entirely independent of the handling of circumflex  and
-       dollar,  the  only  relationship  being  that they both involve newline
-       characters. Dot has no special meaning in a character class.
+       ter in the subject string except (by default) a character  that  signi-
+       fies  the  end  of  a line. In UTF-8 mode, the matched character may be
+       more than one byte long. When a line ending  is  defined  as  a  single
+       character  (CR  or LF), dot never matches that character; when the two-
+       character sequence CRLF is used, dot does not match CR if it is immedi-
+       ately  followed by LF, but otherwise it matches all characters (includ-
+       ing isolated CRs and LFs).
+
+       The behaviour of dot with regard to newlines can  be  changed.  If  the
+       PCRE_DOTALL  option  is  set,  a dot matches any one character, without
+       exception. If newline is defined as the two-character sequence CRLF, it
+       takes two dots to match it.
+
+       The  handling of dot is entirely independent of the handling of circum-
+       flex and dollar, the only relationship being  that  they  both  involve
+       newlines. Dot has no special meaning in a character class.
  
  
  MATCHING A SINGLE BYTE
  
         Outside a character class, the escape sequence \C matches any one byte,
-       both  in  and  out of UTF-8 mode. Unlike a dot, it can match a newline.
-       The feature is provided in Perl in order to match individual  bytes  in
-       UTF-8  mode.  Because  it  breaks  up  UTF-8 characters into individual
-       bytes, what remains in the string may be a malformed UTF-8 string.  For
+       both in and out of UTF-8 mode. Unlike a dot, it always matches  CR  and
+       LF.  The feature is provided in Perl in order to match individual bytes
+       in UTF-8 mode.  Because it breaks up UTF-8 characters  into  individual
+       bytes,  what remains in the string may be a malformed UTF-8 string. For
         this reason, the \C escape sequence is best avoided.
  
-       PCRE  does  not  allow \C to appear in lookbehind assertions (described
-       below), because in UTF-8 mode this would make it impossible  to  calcu-
+       PCRE does not allow \C to appear in  lookbehind  assertions  (described
+       below),  because  in UTF-8 mode this would make it impossible to calcu-
         late the length of the lookbehind.
  
  
@@ -461,39 +525,46 @@ SQUARE BRACKETS AND CHARACTER CLASSES
         An opening square bracket introduces a character class, terminated by a
         closing square bracket. A closing square bracket on its own is not spe-
         cial. If a closing square bracket is required as a member of the class,
-       it should be the first data character in the class  (after  an  initial
+       it  should  be  the first data character in the class (after an initial
         circumflex, if present) or escaped with a backslash.
  
-       A  character  class matches a single character in the subject. In UTF-8
-       mode, the character may occupy more than one byte. A matched  character
+       A character class matches a single character in the subject.  In  UTF-8
+       mode,  the character may occupy more than one byte. A matched character
         must be in the set of characters defined by the class, unless the first
-       character in the class definition is a circumflex, in  which  case  the
-       subject  character  must  not  be in the set defined by the class. If a
-       circumflex is actually required as a member of the class, ensure it  is
+       character  in  the  class definition is a circumflex, in which case the
+       subject character must not be in the set defined by  the  class.  If  a
+       circumflex  is actually required as a member of the class, ensure it is
         not the first character, or escape it with a backslash.
  
-       For  example, the character class [aeiou] matches any lower case vowel,
-       while [^aeiou] matches any character that is not a  lower  case  vowel.
+       For example, the character class [aeiou] matches any lower case  vowel,
+       while  [^aeiou]  matches  any character that is not a lower case vowel.
         Note that a circumflex is just a convenient notation for specifying the
-       characters that are in the class by enumerating those that are  not.  A
-       class  that starts with a circumflex is not an assertion: it still con-
-       sumes a character from the subject string, and therefore  it  fails  if
+       characters  that  are in the class by enumerating those that are not. A
+       class that starts with a circumflex is not an assertion: it still  con-
+       sumes  a  character  from the subject string, and therefore it fails if
         the current pointer is at the end of the string.
  
-       In  UTF-8 mode, characters with values greater than 255 can be included
-       in a class as a literal string of bytes, or by using the  \x{  escaping
+       In UTF-8 mode, characters with values greater than 255 can be  included
+       in  a  class as a literal string of bytes, or by using the \x{ escaping
         mechanism.
  
-       When  caseless  matching  is set, any letters in a class represent both
-       their upper case and lower case versions, so for  example,  a  caseless
-       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
-       match "A", whereas a caseful version would. When running in UTF-8 mode,
-       PCRE  supports  the  concept of case for characters with values greater
-       than 128 only when it is compiled with Unicode property support.
-
-       The newline character is never treated in any special way in  character
-       classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE
-       options is. A class such as [^a] will always match a newline.
+       When caseless matching is set, any letters in a  class  represent  both
+       their  upper  case  and lower case versions, so for example, a caseless
+       [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
+       match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
+       understands the concept of case for characters whose  values  are  less
+       than  128, so caseless matching is always possible. For characters with
+       higher values, the concept of case is supported  if  PCRE  is  compiled
+       with  Unicode  property support, but not otherwise.  If you want to use
+       caseless matching for characters 128 and above, you  must  ensure  that
+       PCRE  is  compiled  with Unicode property support as well as with UTF-8
+       support.
+
+       Characters that might indicate  line  breaks  (CR  and  LF)  are  never
+       treated  in  any  special way when matching character classes, whatever
+       line-ending sequence is in use, and whatever setting of the PCRE_DOTALL
+       and PCRE_MULTILINE options is used. A class such as [^a] always matches
+       one of these characters.
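
         As a rough illustration, assuming Python's re module behaves like
         PCRE on this point, a negated class matches CR and LF regardless
         of the multiline setting:

```python
import re

# Python's re is used here only as a stand-in for PCRE (assumption).
negated = re.compile(r"[^a]")

cr_match = negated.fullmatch("\r")
lf_match = negated.fullmatch("\n")

# The multiline option has no effect on what a character class matches.
lf_multiline = re.fullmatch(r"[^a]", "\n", re.MULTILINE)

# The class still excludes the one character it lists.
a_match = negated.fullmatch("a")

print(cr_match is not None, lf_match is not None,
      lf_multiline is not None, a_match is None)
```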
  
         The minus (hyphen) character can be used to specify a range of  charac-
         ters  in  a  character  class.  For  example,  [d-m] matches any letter
@@ -594,11 +665,10 @@ VERTICAL BAR
  
         matches  either "gilbert" or "sullivan". Any number of alternatives may
         appear, and an empty  alternative  is  permitted  (matching  the  empty
-       string).   The  matching  process  tries each alternative in turn, from
-       left to right, and the first one that succeeds is used. If the alterna-
-       tives  are within a subpattern (defined below), "succeeds" means match-
-       ing the rest of the main pattern as well as the alternative in the sub-
-       pattern.
+       string). The matching process tries each alternative in turn, from left
+       to right, and the first one that succeeds is used. If the  alternatives
+       are  within a subpattern (defined below), "succeeds" means matching the
+       rest of the main pattern as well as the alternative in the  subpattern.
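
         The left-to-right rule, and the meaning of "succeeds" inside a
         subpattern, can be sketched as follows. Python's re module is used
         as a stand-in for PCRE (an assumption), and the patterns "a|ab"
         and "(a|ab)c" are made up for the illustration.

```python
import re

# At top level the first alternative that matches wins, even though a
# later alternative would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group(0)

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern: "a" alone cannot be followed by "c" here, so "ab" is used.
inside_group = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|sullivan|", "")

print(first_wins, inside_group, empty_alternative is not None)
```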
  
  
  INTERNAL OPTION SETTING
@@ -644,12 +714,9 @@ INTERNAL OPTION SETTING
         the effects of option settings happen at compile time. There  would  be
         some very weird behaviour otherwise.
  
-       The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed
-       in the same way as the Perl-compatible options by using the  characters
-       U  and X respectively. The (?X) flag setting is special in that it must
-       always occur earlier in the pattern than any of the additional features
-       it  turns on, even when it is at top level. It is best to put it at the
-       start.
+       The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
+       can be changed in the same way as the Perl-compatible options by  using
+       the characters J, U and X respectively.
  
  
  SUBPATTERNS
@@ -661,18 +728,18 @@ SUBPATTERNS
  
           cat(aract|erpillar|)
  
-       matches  one  of the words "cat", "cataract", or "caterpillar". Without
-       the parentheses, it would match "cataract",  "erpillar"  or  the  empty
+       matches one of the words "cat", "cataract", or  "caterpillar".  Without
+       the  parentheses,  it  would  match "cataract", "erpillar" or the empty
         string.
  
-       2.  It  sets  up  the  subpattern as a capturing subpattern. This means
-       that, when the whole pattern  matches,  that  portion  of  the  subject
+       2. It sets up the subpattern as  a  capturing  subpattern.  This  means
+       that,  when  the  whole  pattern  matches,  that portion of the subject
         string that matched the subpattern is passed back to the caller via the
-       ovector argument of pcre_exec(). Opening parentheses are  counted  from
-       left  to  right  (starting  from 1) to obtain numbers for the capturing
+       ovector  argument  of pcre_exec(). Opening parentheses are counted from
+       left to right (starting from 1) to obtain  numbers  for  the  capturing
         subpatterns.
  
-       For example, if the string "the red king" is matched against  the  pat-
+       For  example,  if the string "the red king" is matched against the pat-
         tern
  
           the ((red|white) (king|queen))
@@ -680,50 +747,75 @@ SUBPATTERNS
         the captured substrings are "red king", "red", and "king", and are num-
         bered 1, 2, and 3, respectively.
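
         A quick check of this numbering, sketched with Python's re module
         as a stand-in for PCRE (an assumption; PCRE itself returns the
         substrings through the ovector argument of pcre_exec()):

```python
import re

match = re.match(r"the ((red|white) (king|queen))", "the red king")

# groups() lists the captured substrings in order of their opening
# parentheses, numbered from 1.
captured = match.groups()

print(captured)
```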
  
-       The fact that plain parentheses fulfil  two  functions  is  not  always
-       helpful.   There are often times when a grouping subpattern is required
-       without a capturing requirement. If an opening parenthesis is  followed
-       by  a question mark and a colon, the subpattern does not do any captur-
-       ing, and is not counted when computing the  number  of  any  subsequent
-       capturing  subpatterns. For example, if the string "the white queen" is
+       The  fact  that  plain  parentheses  fulfil two functions is not always
+       helpful.  There are often times when a grouping subpattern is  required
+       without  a capturing requirement. If an opening parenthesis is followed
+       by a question mark and a colon, the subpattern does not do any  captur-
+       ing,  and  is  not  counted when computing the number of any subsequent
+       capturing subpatterns. For example, if the string "the white queen"  is
         matched against the pattern
  
           the ((?:red|white) (king|queen))
  
         the captured substrings are "white queen" and "queen", and are numbered
-       1  and 2. The maximum number of capturing subpatterns is 65535, and the
-       maximum depth of nesting of all subpatterns, both  capturing  and  non-
+       1 and 2. The maximum number of capturing subpatterns is 65535, and  the
+       maximum  depth  of  nesting of all subpatterns, both capturing and non-
         capturing, is 200.
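
         The effect of (?: on numbering can be sketched like this, again
         assuming Python's re module as a stand-in for PCRE:

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
match = pattern.match("the white queen")

# The (?:...) group is not counted, so only two substrings are captured.
captured = match.groups()
group_count = pattern.groups

print(captured, group_count)
```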
  
-       As  a  convenient shorthand, if any option settings are required at the
-       start of a non-capturing subpattern,  the  option  letters  may  appear
+       As a convenient shorthand, if any option settings are required  at  the
+       start  of  a  non-capturing  subpattern,  the option letters may appear
         between the "?" and the ":". Thus the two patterns
  
           (?i:saturday|sunday)
           (?:(?i)saturday|sunday)
  
         match exactly the same set of strings. Because alternative branches are
-       tried from left to right, and options are not reset until  the  end  of
-       the  subpattern is reached, an option setting in one branch does affect
-       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
+       tried  from  left  to right, and options are not reset until the end of
+       the subpattern is reached, an option setting in one branch does  affect
+       subsequent  branches,  so  the above patterns match "SUNDAY" as well as
         "Saturday".
  
  
  NAMED SUBPATTERNS
  
-       Identifying  capturing  parentheses  by number is simple, but it can be
-       very hard to keep track of the numbers in complicated  regular  expres-
-       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
-       change. To help with this difficulty, PCRE supports the naming of  sub-
-       patterns,  something  that  Perl  does  not  provide. The Python syntax
-       (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
-       underscores, and must be unique within a pattern.
-
-       Named  capturing  parentheses  are  still  allocated numbers as well as
+       Identifying capturing parentheses by number is simple, but  it  can  be
+       very  hard  to keep track of the numbers in complicated regular expres-
+       sions. Furthermore, if an  expression  is  modified,  the  numbers  may
+       change.  To help with this difficulty, PCRE supports the naming of sub-
+       patterns, something that Perl  does  not  provide.  The  Python  syntax
+       (?P<name>...)  is  used. References to capturing parentheses from other
+       parts of the pattern, such as  backreferences,  recursion,  and  condi-
+       tions, can be made by name as well as by number.
+
+       Names  consist  of  up  to  32 alphanumeric characters and underscores.
+       Named capturing parentheses are still  allocated  numbers  as  well  as
         names. The PCRE API provides function calls for extracting the name-to-
-       number  translation table from a compiled pattern. There is also a con-
-       venience function for extracting a captured substring by name. For fur-
-       ther details see the pcreapi documentation.
+       number translation table from a compiled pattern. There is also a  con-
+       venience function for extracting a captured substring by name.
+
+       By  default, a name must be unique within a pattern, but it is possible
+       to relax this constraint by setting the PCRE_DUPNAMES option at compile
+       time.  This  can  be useful for patterns where only one instance of the
+       named parentheses can match. Suppose you want to match the  name  of  a
+       weekday,  either as a 3-letter abbreviation or as the full name, and in
+       both cases you want to extract the abbreviation. This pattern (ignoring
+       the line breaks) does the job:
+
+         (?P<DN>Mon|Fri|Sun)(?:day)?|
+         (?P<DN>Tue)(?:sday)?|
+         (?P<DN>Wed)(?:nesday)?|
+         (?P<DN>Thu)(?:rsday)?|
+         (?P<DN>Sat)(?:urday)?
+
+       There  are  five capturing substrings, but only one is ever set after a
+       match.  The convenience  function  for  extracting  the  data  by  name
+       returns  the  substring  for  the first, and in this example, the only,
+       subpattern of that name that matched.  This  saves  searching  to  find
+       which  numbered  subpattern  it  was. If you make a reference to a non-
+       unique named subpattern from elsewhere in the  pattern,  the  one  that
+       corresponds  to  the  lowest number is used. For further details of the
+       interfaces for handling named subpatterns, see the  pcreapi  documenta-
+       tion.
  
  
  REPETITION
@@ -932,8 +1024,10 @@ ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
         meaning  or  processing  of  a possessive quantifier and the equivalent
         atomic group.
  
-       The possessive quantifier syntax is an extension to the Perl syntax. It
-       originates in Sun's Java package.
+       The possessive quantifier syntax is an extension to  the  Perl  syntax.
+       Jeffrey  Friedl originated the idea (and the name) in the first edition
+       of his book. Mike McCloskey liked it, so he implemented it when  he
+       built Sun's Java package, and PCRE copied it from there.
  
         When  a  pattern  contains an unlimited repeat inside a subpattern that
         can itself be repeated an unlimited number of  times,  the  use  of  an
@@ -974,31 +1068,41 @@ BACK REFERENCES
         it  is  always  taken  as a back reference, and causes an error only if
         there are not that many capturing left parentheses in the  entire  pat-
         tern.  In  other words, the parentheses that are referenced need not be
-       to the left of the reference for numbers less than 10. See the  subsec-
-       tion  entitled  "Non-printing  characters" above for further details of
-       the handling of digits following a backslash.
+       to the left of the reference for numbers less than 10. A "forward  back
+       reference"  of  this  type can make sense when a repetition is involved
+       and the subpattern to the right has participated in an  earlier  itera-
+       tion.
+
+       It is not possible to have a numerical "forward back reference" to a
+       subpattern whose number is 10 or more. However, a back  reference  to  any
+       subpattern  is  possible  using named parentheses (see below). See also
+       the subsection entitled "Non-printing  characters"  above  for  further
+       details of the handling of digits following a backslash.
  
-       A back reference matches whatever actually matched the  capturing  sub-
-       pattern  in  the  current subject string, rather than anything matching
+       A  back  reference matches whatever actually matched the capturing sub-
+       pattern in the current subject string, rather  than  anything  matching
         the subpattern itself (see "Subpatterns as subroutines" below for a way
         of doing that). So the pattern
  
           (sens|respons)e and \1ibility
  
-       matches  "sense and sensibility" and "response and responsibility", but
-       not "sense and responsibility". If caseful matching is in force at  the
-       time  of the back reference, the case of letters is relevant. For exam-
+       matches "sense and sensibility" and "response and responsibility",  but
+       not  "sense and responsibility". If caseful matching is in force at the
+       time of the back reference, the case of letters is relevant. For  exam-
         ple,
  
           ((?i)rah)\s+\1
  
-       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
+       matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
         original capturing subpattern is matched caselessly.
  
-       Back  references  to named subpatterns use the Python syntax (?P=name).
+       Back references to named subpatterns use the Python  syntax  (?P=name).
         We could rewrite the above example as follows:
  
-         (?<p1>(?i)rah)\s+(?P=p1)
+         (?P<p1>(?i)rah)\s+(?P=p1)
+
+       A  subpattern  that  is  referenced  by  name may appear in the pattern
+       before or after the reference.
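
         The two back-reference examples above can be sketched in Python,
         whose re module shares the (?P<name>...) and (?P=name) syntax. One
         adaptation is an assumption of this sketch: recent Python versions
         reject (?i) in the middle of a pattern, so the scoped form (?i:rah)
         is used instead.

```python
import re

# A back reference matches what the group actually matched, not
# anything that the group's pattern could match.
sense = re.compile(r"(sens|respons)e and \1ibility")
sense_results = [
    sense.fullmatch(s) is not None
    for s in ("sense and sensibility",
              "response and responsibility",
              "sense and responsibility")
]

# The group is matched caselessly, but the back reference is caseful.
rah = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
rah_results = [
    rah.fullmatch(s) is not None
    for s in ("rah rah", "RAH RAH", "RAH rah")
]

print(sense_results, rah_results)
```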
  
         There may be more than one back reference to the same subpattern. If  a
         subpattern  has  not actually been used in a particular match, any back
@@ -1087,8 +1191,8 @@ ASSERTIONS
         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
         contents of a lookbehind assertion are restricted  such  that  all  the
         strings it matches must have a fixed length. However, if there are sev-
-       eral alternatives, they do not all have to have the same fixed  length.
-       Thus
+       eral top-level alternatives, they do not all  have  to  have  the  same
+       fixed length. Thus
  
           (?<=bullock|donkey)
  
@@ -1201,12 +1305,18 @@ CONDITIONAL SUBPATTERNS
         tives in the subpattern, a compile-time error occurs.
  
         There are three kinds of condition. If the text between the parentheses
-       consists of a sequence of digits, the condition  is  satisfied  if  the
-       capturing  subpattern of that number has previously matched. The number
-       must be greater than zero. Consider the following pattern,  which  con-
-       tains  non-significant white space to make it more readable (assume the
-       PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of
-       discussion:
+       consists of a sequence of digits, or a sequence of alphanumeric charac-
+       ters  and underscores, the condition is satisfied if the capturing sub-
+       pattern of that number or name has previously matched. There is a  pos-
+       sible  ambiguity here, because subpattern names may consist entirely of
+       digits. PCRE looks first for a named subpattern; if it cannot find  one
+       and  the text consists entirely of digits, it looks for a subpattern of
+       that number, which must be greater than zero.  Using  subpattern  names
+       that consist entirely of digits is not recommended.
+
+       Consider  the  following  pattern, which contains non-significant white
+       space to make it more readable (assume the PCRE_EXTENDED option) and to
+       divide it into three parts for ease of discussion:
  
           ( \( )?    [^()]+    (?(1) \) )
  
@@ -1219,12 +1329,16 @@ CONDITIONAL SUBPATTERNS
         tern  is  executed  and  a  closing parenthesis is required. Otherwise,
         since no-pattern is not present, the  subpattern  matches  nothing.  In
         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
-       optionally enclosed in parentheses.
+       optionally enclosed in parentheses. Rewriting it to use a named subpat-
+       tern gives this:
+
+         (?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )
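
         Both forms of the conditional can be exercised with a short
         sketch. It assumes Python's re module, which accepts the same
         (?(1)...) and (?(name)...) syntax, with re.VERBOSE standing in for
         PCRE_EXTENDED.

```python
import re

numbered = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)
named = re.compile(r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE)

subjects = ["(abc)", "abc", "(abc", "abc)"]

# A closing parenthesis is required only if the opening one matched.
numbered_results = [numbered.fullmatch(s) is not None for s in subjects]
named_results = [named.fullmatch(s) is not None for s in subjects]

print(numbered_results, named_results)
```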
  
-       If the condition is the string (R), it is satisfied if a recursive call
-       to  the pattern or subpattern has been made. At "top level", the condi-
-       tion is false.  This  is  a  PCRE  extension.  Recursive  patterns  are
-       described in the next section.
+       If the condition is the string (R), and there is no subpattern with the
+       name R, the condition is satisfied if a recursive call to  the  pattern
+       or  subpattern  has  been made. At "top level", the condition is false.
+       This is a PCRE extension.  Recursive patterns are described in the next
+       section.
  
         If  the  condition  is  not  a sequence of digits or (R), it must be an
         assertion.  This may be a positive or negative lookahead or  lookbehind
@@ -1251,8 +1365,8 @@ COMMENTS
         at all.
  
         If  the PCRE_EXTENDED option is set, an unescaped # character outside a
-       character class introduces a comment that continues up to the next new-
-       line character in the pattern.
+       character class introduces a  comment  that  continues  to  immediately
+       after the next newline in the pattern.
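
         A small sketch of such comments, assuming Python's re.VERBOSE flag
         as a stand-in for PCRE_EXTENDED; the pattern itself is made up for
         the illustration:

```python
import re

pattern = re.compile(r"""
    \d+   # one or more digits; this comment runs to the end of the line
    [#]   # inside a character class, the hash is an ordinary character
    \w+
""", re.VERBOSE)

matched = pattern.fullmatch("42#abc")
not_matched = pattern.fullmatch("42abc")

print(matched is not None, not_matched is None)
```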
  
  
  RECURSIVE PATTERNS
@@ -1282,15 +1396,19 @@ RECURSIVE PATTERNS
         tion.)  The special item (?R) is a recursive call of the entire regular
         expression.
  
-       For example, this PCRE pattern solves the  nested  parentheses  problem
-       (assume  the  PCRE_EXTENDED  option  is  set  so  that  white  space is
-       ignored):
+       A recursive subpattern call is always treated as an atomic group.  That
+       is,  once  it  has  matched some of the subject string, it is never re-
+       entered, even if it contains untried alternatives and there is a subse-
+       quent matching failure.
+
+       This  PCRE  pattern  solves  the nested parentheses problem (assume the
+       PCRE_EXTENDED option is set so that white space is ignored):
  
           \( ( (?>[^()]+) | (?R) )* \)
  
         First it matches an opening parenthesis. Then it matches any number  of
         substrings  which  can  either  be  a sequence of non-parentheses, or a
-       recursive match of the pattern itself (that is  a  correctly  parenthe-
+       recursive match of the pattern itself (that is, a  correctly  parenthe-
         sized substring).  Finally there is a closing parenthesis.
  
         If  this  were  part of a larger pattern, you would not want to recurse
@@ -1371,8 +1489,14 @@ SUBPATTERNS AS SUBROUTINES
           (sens|respons)e and (?1)ibility
  
         is  used, it does match "sense and responsibility" as well as the other
-       two strings. Such references must, however, follow  the  subpattern  to
-       which they refer.
+       two strings. Such references, if given  numerically,  must  follow  the
+       subpattern  to which they refer. However, named references can refer to
+       later subpatterns.
+
+       Like recursive subpatterns, a "subroutine" call is always treated as an
+       atomic  group. That is, once it has matched some of the subject string,
+       it is never re-entered, even if it contains  untried  alternatives  and
+       there is a subsequent matching failure.
  
  
  CALLOUTS
@@ -1409,5 +1533,5 @@ CALLOUTS
         gether. A complete description of the interface to the callout function
         is given in the pcrecallout documentation.
  
-Last updated: 09 September 2004
-Copyright (c) 1997-2004 University of Cambridge.
+Last updated: 06 June 2006
+Copyright (c) 1997-2006 University of Cambridge.