X-Git-Url: https://git.exim.org/users/heiko/exim.git/blobdiff_plain/495ae4b01f36d0d8bb0e34a1d7263c2b8224aa4a..ad26813496addda838a0512075cacd58dca01b30:/doc/doc-txt/pcretest.txt?ds=sidebyside diff --git a/doc/doc-txt/pcretest.txt b/doc/doc-txt/pcretest.txt index 9e9b70ef4..dfa03b80b 100644 --- a/doc/doc-txt/pcretest.txt +++ b/doc/doc-txt/pcretest.txt @@ -1,19 +1,18 @@ -This file contains the PCRE man page that described the pcretest program. Note -that not all of the features of PCRE are available in the limited version that +This file contains the PCRE man page that described the pcretest program. Note +that not all of the features of PCRE are available in the limited version that is built with Exim. ------------------------------------------------------------------------------- PCRETEST(1) PCRETEST(1) - NAME pcretest - a program for testing Perl-compatible regular expressions. + SYNOPSIS - pcretest [-C] [-d] [-i] [-m] [-o osize] [-p] [-t] [source] - [destination] + pcretest [options] [source] [destination] pcretest was written as a test program for the PCRE regular expression library itself, but it can also be used for experimenting with regular @@ -29,55 +28,67 @@ OPTIONS able information about the optional features that are included, and then exit. - -d Behave as if each regex had the /D (debug) modifier; the + -d Behave as if each regex has the /D (debug) modifier; the internal form is output after compilation. - -i Behave as if each regex had the /I modifier; information + -dfa Behave as if each data line contains the \D escape sequence; + this causes the alternative matching function, + pcre_dfa_exec(), to be used instead of the standard + pcre_exec() function (more detail is given below). + + -i Behave as if each regex has the /I modifier; information about the compiled pattern is given after compilation. - -m Output the size of each compiled pattern after it has been - compiled. This is equivalent to adding /M to each regular - expression. 
For compatibility with earlier versions of + -m Output the size of each compiled pattern after it has been + compiled. This is equivalent to adding /M to each regular + expression. For compatibility with earlier versions of pcretest, -s is a synonym for -m. - -o osize Set the number of elements in the output vector that is used - when calling pcre_exec() to be osize. The default value is + -o osize Set the number of elements in the output vector that is used + when calling pcre_exec() to be osize. The default value is 45, which is enough for 14 capturing subexpressions. The vec- - tor size can be changed for individual matching calls by + tor size can be changed for individual matching calls by including \O in the data line (see below). - -p Behave as if each regex has /P modifier; the POSIX wrapper - API is used to call PCRE. None of the other options has any - effect when -p is set. + -p Behave as if each regex has the /P modifier; the POSIX wrap- + per API is used to call PCRE. None of the other options has + any effect when -p is set. + + -q Do not output the version number of pcretest at the start of + execution. - -t Run each compile, study, and match many times with a timer, - and output resulting time per compile or match (in millisec- - onds). Do not set -m with -t, because you will then get the - size output a zillion times, and the timing will be dis- + -S size On Unix-like systems, set the size of the runtime stack to + size megabytes. + + -t Run each compile, study, and match many times with a timer, + and output resulting time per compile or match (in millisec- + onds). Do not set -m with -t, because you will then get the + size output a zillion times, and the timing will be dis- torted. DESCRIPTION - If pcretest is given two filename arguments, it reads from the first + If pcretest is given two filename arguments, it reads from the first and writes to the second. 
If it is given only one filename argument, it - reads from that file and writes to stdout. Otherwise, it reads from - stdin and writes to stdout, and prompts for each line of input, using + reads from that file and writes to stdout. Otherwise, it reads from + stdin and writes to stdout, and prompts for each line of input, using "re>" to prompt for regular expressions, and "data>" to prompt for data lines. The program handles any number of sets of input on a single input file. - Each set starts with a regular expression, and continues with any num- + Each set starts with a regular expression, and continues with any num- ber of data lines to be matched against the pattern. - Each data line is matched separately and independently. If you want to - do multiple-line matches, you have to use the \n escape sequence in a - single line of input to encode the newline characters. The maximum - length of data line is 30,000 characters. + Each data line is matched separately and independently. If you want to + do multi-line matches, you have to use the \n escape sequence (or \r or + \r\n, depending on the newline setting) in a single line of input to + encode the newline characters. There is no limit on the length of data + lines; the input buffer is automatically extended if it is too small. An empty line signals the end of the data lines, at which point a new regular expression is read. 
The regular expressions are given enclosed - in any non-alphanumeric delimiters other than backslash, for example + in any non-alphanumeric delimiters other than backslash, for example: /(a|bc)x+yz/ @@ -125,12 +136,23 @@ PATTERN MODIFIERS The following table shows additional modifiers for setting PCRE options that do not correspond to anything in Perl: - /A PCRE_ANCHORED - /C PCRE_AUTO_CALLOUT - /E PCRE_DOLLAR_ENDONLY - /N PCRE_NO_AUTO_CAPTURE - /U PCRE_UNGREEDY - /X PCRE_EXTRA + /A PCRE_ANCHORED + /C PCRE_AUTO_CALLOUT + /E PCRE_DOLLAR_ENDONLY + /f PCRE_FIRSTLINE + /J PCRE_DUPNAMES + /N PCRE_NO_AUTO_CAPTURE + /U PCRE_UNGREEDY + /X PCRE_EXTRA + /<cr> PCRE_NEWLINE_CR + /<lf> PCRE_NEWLINE_LF + /<crlf> PCRE_NEWLINE_CRLF + + Those specifying line endings are literal strings as shown. Details of + the meanings of these PCRE options are given in the pcreapi documenta- + tion. + + Finding all matches in a string Searching for all possible matches within each subject string can be requested by the /g or /G modifier. After finding a match, PCRE is @@ -149,6 +171,8 @@ PATTERN MODIFIERS one, and the normal match is retried. This imitates the way Perl han- dles such cases when using the /g modifier or the split() function. + Other modifiers + There are yet more modifiers for controlling the way pcretest operates. 
The /+ modifier requests that as well as outputting the substring that @@ -224,6 +248,8 @@ DATA LINES \e escape \f formfeed \n newline + \qdd set the PCRE_MATCH_LIMIT limit to dd + (any number of digits) \r carriage return \t tab \v vertical tab @@ -232,7 +258,9 @@ DATA LINES \x{hh...} hexadecimal character, any number of digits in UTF-8 mode \A pass the PCRE_ANCHORED option to pcre_exec() + or pcre_dfa_exec() \B pass the PCRE_NOTBOL option to pcre_exec() + or pcre_dfa_exec() \Cdd call pcre_copy_substring() for substring dd after a successful match (number less than 32) \Cname call pcre_copy_named_substring() for substring @@ -247,6 +275,8 @@ DATA LINES reached for the nth time \C*n pass the number n (may be negative) as callout data; this is used as the callout return value + \D use the pcre_dfa_exec() match function + \F only shortest match for pcre_dfa_exec() \Gdd call pcre_get_substring() for substring dd after a successful match (number less than 32) \Gname call pcre_get_named_substring() for substring @@ -254,47 +284,84 @@ DATA LINES ated by next non-alphanumeric character) \L call pcre_get_substringlist() after a successful match - \M discover the minimum MATCH_LIMIT setting + \M discover the minimum MATCH_LIMIT and + MATCH_LIMIT_RECURSION settings \N pass the PCRE_NOTEMPTY option to pcre_exec() + or pcre_dfa_exec() \Odd set the size of the output vector passed to pcre_exec() to dd (any number of digits) \P pass the PCRE_PARTIAL option to pcre_exec() + or pcre_dfa_exec() + \Qdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd + (any number of digits) + \R pass the PCRE_DFA_RESTART option to pcre_dfa_exec() \S output details of memory get/free calls during matching \Z pass the PCRE_NOTEOL option to pcre_exec() + or pcre_dfa_exec() \? 
pass the PCRE_NO_UTF8_CHECK option to - pcre_exec() + pcre_exec() or pcre_dfa_exec() \>dd start the match at offset dd (any number of digits); this sets the startoffset argument for pcre_exec() - - A backslash followed by anything else just escapes the anything else. - If the very last character is a backslash, it is ignored. This gives a - way of passing an empty line as data, since a real empty line termi- - nates the data input. - - If \M is present, pcretest calls pcre_exec() several times, with dif- - ferent values in the match_limit field of the pcre_extra data struc- - ture, until it finds the minimum number that is needed for pcre_exec() - to complete. This number is a measure of the amount of recursion and - backtracking that takes place, and checking it out can be instructive. - For most simple matches, the number is quite small, but for patterns - with very large numbers of matching possibilities, it can become large - very quickly with increasing length of subject string. + or pcre_dfa_exec() + \<cr> pass the PCRE_NEWLINE_CR option to pcre_exec() + or pcre_dfa_exec() + \<lf> pass the PCRE_NEWLINE_LF option to pcre_exec() + or pcre_dfa_exec() + \<crlf> pass the PCRE_NEWLINE_CRLF option to pcre_exec() + or pcre_dfa_exec() + + The escapes that specify line endings are literal strings, exactly as + shown. A backslash followed by anything else just escapes the anything + else. If the very last character is a backslash, it is ignored. This + gives a way of passing an empty line as data, since a real empty line + terminates the data input. + + If \M is present, pcretest calls pcre_exec() several times, with dif- + ferent values in the match_limit and match_limit_recursion fields of + the pcre_extra data structure, until it finds the minimum numbers for + each parameter that allow pcre_exec() to complete. The match_limit num- + ber is a measure of the amount of backtracking that takes place, and + checking it out can be instructive. 
For most simple matches, the number + is quite small, but for patterns with very large numbers of matching + possibilities, it can become large very quickly with increasing length + of subject string. The match_limit_recursion number is a measure of how + much stack (or, if PCRE is compiled with NO_RECURSE, how much heap) + memory is needed to complete the match attempt. When \O is used, the value specified may be higher or lower than the size set by the -o command line option (or defaulted to 45); \O applies only to the call of pcre_exec() for the line in which it appears. If the /P modifier was present on the pattern, causing the POSIX wrap- - per API to be used, only \B and \Z have any effect, causing REG_NOTBOL - and REG_NOTEOL to be passed to regexec() respectively. + per API to be used, the only option-setting sequences that have any + effect are \B and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively, + to be passed to regexec(). + + The use of \x{hh...} to represent UTF-8 characters is not dependent on + the use of the /8 modifier on the pattern. It is recognized always. + There may be any number of hexadecimal digits inside the braces. The + result is from one to six bytes, encoded according to the UTF-8 rules. + + +THE ALTERNATIVE MATCHING FUNCTION + + By default, pcretest uses the standard PCRE matching function, + pcre_exec(), to match each data line. From release 6.0, PCRE supports an + alternative matching function, pcre_dfa_exec(), which operates in a + different way, and has some restrictions. The differences between the + two functions are described in the pcrematching documentation. - The use of \x{hh...} to represent UTF-8 characters is not dependent on - the use of the /8 modifier on the pattern. It is recognized always. - There may be any number of hexadecimal digits inside the braces. The - result is from one to six bytes, encoded according to the UTF-8 rules. 
+ If a data line contains the \D escape sequence, or if the command line + contains the -dfa option, the alternative matching function is called. + This function finds all possible matches at a given point. If, however, + the \F escape sequence is present in the data line, it stops after the + first match is found. This is always the shortest possible match. -OUTPUT FROM PCRETEST +DEFAULT OUTPUT FROM PCRETEST + + This section describes the output when the normal matching function, + pcre_exec(), is being used. When a match succeeds, pcretest outputs the list of captured substrings that pcre_exec() returns, starting with number 0 for the string that @@ -347,15 +414,67 @@ OUTPUT FROM PCRETEST Note that while patterns can be continued over several lines (a plain ">" prompt is used for continuations), data lines may not. However new- - lines can be included in data by means of the \n escape. + lines can be included in data by means of the \n escape (or \r or \r\n + for those newline settings). + + +OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION + + When the alternative matching function, pcre_dfa_exec(), is used (by + means of the \D escape sequence or the -dfa command line option), the + output consists of a list of all the matches that start at the first + point in the subject where there is at least one match. For example: + + re> /(tang|tangerine|tan)/ + data> yellow tangerine\D + 0: tangerine + 1: tang + 2: tan + + (Using the normal matching function on this data finds only "tang".) + The longest matching string is always given first (and numbered zero). + + If /g is present on the pattern, the search for further matches + resumes at the end of the longest match. For example: + + re> /(tang|tangerine|tan)/g + data> yellow tangerine and tangy sultana\D + 0: tangerine + 1: tang + 2: tan + 0: tang + 1: tan + 0: tan + + Since the matching function does not support substring capture, the + escape sequences that are concerned with captured substrings are not + relevant. 
+ + +RESTARTING AFTER A PARTIAL MATCH + + When the alternative matching function has given the PCRE_ERROR_PARTIAL + return, indicating that the subject partially matched the pattern, you + can restart the match with additional subject data by means of the \R + escape sequence. For example: + + re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ + data> 23ja\P\D + Partial match: 23ja + data> n05\R\D + 0: n05 + + For further information about partial matching, see the pcrepartial + documentation. CALLOUTS - If the pattern contains any callout requests, pcretest's callout func- - tion is called during matching. By default, it displays the callout - number, the start and current positions in the text at the callout - time, and the next pattern item to be tested. For example, the output + If the pattern contains any callout requests, pcretest's callout func- + tion is called during matching. This works with both matching func- + tions. By default, the called function displays the callout number, the + start and current positions in the text at the callout time, and the + next pattern item to be tested. For example, the output --->pqrabcdef 0 ^ ^ \d @@ -381,7 +500,7 @@ CALLOUTS 0: E* The callout function in pcretest returns zero (carry on matching) by - default, but you can use an \C item in a data line (as described above) + default, but you can use a \C item in a data line (as described above) to change this. Inserting callouts can be helpful when using pcretest to check compli- @@ -447,9 +566,9 @@ SAVING AND RELOADING COMPILED PATTERNS AUTHOR - Philip Hazel + Philip Hazel University Computing Service, Cambridge CB2 3QG, England. -Last updated: 10 September 2004 -Copyright (c) 1997-2004 University of Cambridge. +Last updated: 29 June 2006 +Copyright (c) 1997-2006 University of Cambridge.
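
Note on the tangerine example in OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION: the difference between the two result lists can be reproduced outside pcretest. The sketch below is an illustration only, written in Python with its standard re module, which is a backtracking matcher comparable in this respect to pcre_exec(). It does not call PCRE, and the brute-force scan is not how pcre_dfa_exec() works internally. The function names first_match_backtracking and all_matches_at_first_point are hypothetical and exist only for this example.

```python
import re


def first_match_backtracking(pattern, subject):
    """Return the single match a backtracking matcher reports, or None."""
    m = re.search(pattern, subject)
    return m.group(0) if m else None


def all_matches_at_first_point(pattern, subject):
    """Emulate the listing described for the alternative matching function:
    every match that starts at the first position where at least one match
    exists, longest first. Brute force, for illustration only; patterns that
    depend on context beyond the candidate end (such as $ or lookahead) may
    behave differently here.
    """
    compiled = re.compile(pattern)
    for start in range(len(subject) + 1):
        # Try every candidate end position, longest candidate first.
        found = [
            subject[start:end]
            for end in range(len(subject), start - 1, -1)
            if compiled.fullmatch(subject, start, end)
        ]
        if found:
            return found
    return []


if __name__ == "__main__":
    pattern = r"(tang|tangerine|tan)"
    subject = "yellow tangerine"
    # The backtracking matcher stops at the first alternative that succeeds.
    print(first_match_backtracking(pattern, subject))
    # The emulated listing numbers the matches from zero, longest first.
    for number, text in enumerate(all_matches_at_first_point(pattern, subject)):
        print(f"{number}: {text}")
```

The backtracking search reports only "tang" because alternatives are tried left to right and the first success ends the search, whereas the emulated listing collects every match length available at that starting point.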