X-Git-Url: https://git.exim.org/users/jgh/exim.git/blobdiff_plain/8ac170f35ed82789928f9e94beaa38991761a88c..56f5d9bd6bb563f4f0eab011ed665da234d93e37:/doc/doc-txt/pcretest.txt diff --git a/doc/doc-txt/pcretest.txt b/doc/doc-txt/pcretest.txt index 384a6c38f..dfa03b80b 100644 --- a/doc/doc-txt/pcretest.txt +++ b/doc/doc-txt/pcretest.txt @@ -6,14 +6,13 @@ is built with Exim. PCRETEST(1) PCRETEST(1) - NAME pcretest - a program for testing Perl-compatible regular expressions. + SYNOPSIS - pcretest [-C] [-d] [-dfa] [-i] [-m] [-o osize] [-p] [-t] [source] - [destination] + pcretest [options] [source] [destination] pcretest was written as a test program for the PCRE regular expression library itself, but it can also be used for experimenting with regular @@ -55,6 +54,12 @@ OPTIONS per API is used to call PCRE. None of the other options has any effect when -p is set. + -q Do not output the version number of pcretest at the start of + execution. + + -S size On Unix-like systems, set the size of the runtime stack to + size megabytes. + -t Run each compile, study, and match many times with a timer, and output resulting time per compile or match (in millisec- onds). Do not set -m with -t, because you will then get the @@ -76,53 +81,54 @@ DESCRIPTION ber of data lines to be matched against the pattern. Each data line is matched separately and independently. If you want to - do multiple-line matches, you have to use the \n escape sequence in a - single line of input to encode the newline characters. The maximum - length of data line is 30,000 characters. + do multi-line matches, you have to use the \n escape sequence (or \r or + \r\n, depending on the newline setting) in a single line of input to + encode the newline characters. There is no limit on the length of data + lines; the input buffer is automatically extended if it is too small. - An empty line signals the end of the data lines, at which point a new - regular expression is read. 
The regular expressions are given enclosed - in any non-alphanumeric delimiters other than backslash, for example + An empty line signals the end of the data lines, at which point a new + regular expression is read. The regular expressions are given enclosed + in any non-alphanumeric delimiters other than backslash, for example: /(a|bc)x+yz/ - White space before the initial delimiter is ignored. A regular expres- - sion may be continued over several input lines, in which case the new- - line characters are included within it. It is possible to include the + White space before the initial delimiter is ignored. A regular expres- + sion may be continued over several input lines, in which case the new- + line characters are included within it. It is possible to include the delimiter within the pattern by escaping it, for example /abc\/def/ - If you do so, the escape and the delimiter form part of the pattern, - but since delimiters are always non-alphanumeric, this does not affect - its interpretation. If the terminating delimiter is immediately fol- + If you do so, the escape and the delimiter form part of the pattern, + but since delimiters are always non-alphanumeric, this does not affect + its interpretation. If the terminating delimiter is immediately fol- lowed by a backslash, for example, /abc/\ - then a backslash is added to the end of the pattern. This is done to - provide a way of testing the error condition that arises if a pattern + then a backslash is added to the end of the pattern. This is done to + provide a way of testing the error condition that arises if a pattern finishes with a backslash, because /abc\/ - is interpreted as the first line of a pattern that starts with "abc/", + is interpreted as the first line of a pattern that starts with "abc/", causing pcretest to read the next line as a continuation of the regular expression. PATTERN MODIFIERS - A pattern may be followed by any number of modifiers, which are mostly - single characters. 
Following Perl usage, these are referred to below - as, for example, "the /i modifier", even though the delimiter of the - pattern need not always be a slash, and no slash is used when writing - modifiers. Whitespace may appear between the final pattern delimiter + A pattern may be followed by any number of modifiers, which are mostly + single characters. Following Perl usage, these are referred to below + as, for example, "the /i modifier", even though the delimiter of the + pattern need not always be a slash, and no slash is used when writing + modifiers. Whitespace may appear between the final pattern delimiter and the first modifier, and between the modifiers themselves. The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE, - PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre_com- - pile() is called. These four modifier letters have the same effect as + PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre_com- + pile() is called. These four modifier letters have the same effect as they do in Perl. For example: /caseless/i @@ -130,99 +136,111 @@ PATTERN MODIFIERS The following table shows additional modifiers for setting PCRE options that do not correspond to anything in Perl: - /A PCRE_ANCHORED - /C PCRE_AUTO_CALLOUT - /E PCRE_DOLLAR_ENDONLY - /f PCRE_FIRSTLINE - /N PCRE_NO_AUTO_CAPTURE - /U PCRE_UNGREEDY - /X PCRE_EXTRA - - Searching for all possible matches within each subject string can be - requested by the /g or /G modifier. After finding a match, PCRE is + /A PCRE_ANCHORED + /C PCRE_AUTO_CALLOUT + /E PCRE_DOLLAR_ENDONLY + /f PCRE_FIRSTLINE + /J PCRE_DUPNAMES + /N PCRE_NO_AUTO_CAPTURE + /U PCRE_UNGREEDY + /X PCRE_EXTRA + /<cr> PCRE_NEWLINE_CR + /<lf> PCRE_NEWLINE_LF + /<crlf> PCRE_NEWLINE_CRLF + + Those specifying line endings are literal strings as shown. Details of + the meanings of these PCRE options are given in the pcreapi documenta- + tion. 
+ + Finding all matches in a string + + Searching for all possible matches within each subject string can be + requested by the /g or /G modifier. After finding a match, PCRE is called again to search the remainder of the subject string. The differ- ence between /g and /G is that the former uses the startoffset argument - to pcre_exec() to start searching at a new point within the entire - string (which is in effect what Perl does), whereas the latter passes - over a shortened substring. This makes a difference to the matching + to pcre_exec() to start searching at a new point within the entire + string (which is in effect what Perl does), whereas the latter passes + over a shortened substring. This makes a difference to the matching process if the pattern begins with a lookbehind assertion (including \b or \B). - If any call to pcre_exec() in a /g or /G sequence matches an empty - string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED - flags set in order to search for another, non-empty, match at the same - point. If this second match fails, the start offset is advanced by - one, and the normal match is retried. This imitates the way Perl han- + If any call to pcre_exec() in a /g or /G sequence matches an empty + string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED + flags set in order to search for another, non-empty, match at the same + point. If this second match fails, the start offset is advanced by + one, and the normal match is retried. This imitates the way Perl han- dles such cases when using the /g modifier or the split() function. + Other modifiers + There are yet more modifiers for controlling the way pcretest operates. - The /+ modifier requests that as well as outputting the substring that - matched the entire pattern, pcretest should in addition output the - remainder of the subject string. 
This is useful for tests where the + The /+ modifier requests that as well as outputting the substring that + matched the entire pattern, pcretest should in addition output the + remainder of the subject string. This is useful for tests where the subject contains multiple copies of the same substring. - The /L modifier must be followed directly by the name of a locale, for + The /L modifier must be followed directly by the name of a locale, for example, /pattern/Lfr_FR For this reason, it must be the last modifier. The given locale is set, - pcre_maketables() is called to build a set of character tables for the - locale, and this is then passed to pcre_compile() when compiling the - regular expression. Without an /L modifier, NULL is passed as the - tables pointer; that is, /L applies only to the expression on which it + pcre_maketables() is called to build a set of character tables for the + locale, and this is then passed to pcre_compile() when compiling the + regular expression. Without an /L modifier, NULL is passed as the + tables pointer; that is, /L applies only to the expression on which it appears. - The /I modifier requests that pcretest output information about the - compiled pattern (whether it is anchored, has a fixed first character, - and so on). It does this by calling pcre_fullinfo() after compiling a - pattern. If the pattern is studied, the results of that are also out- + The /I modifier requests that pcretest output information about the + compiled pattern (whether it is anchored, has a fixed first character, + and so on). It does this by calling pcre_fullinfo() after compiling a + pattern. If the pattern is studied, the results of that are also out- put. The /D modifier is a PCRE debugging feature, which also assumes /I. It - causes the internal form of compiled regular expressions to be output + causes the internal form of compiled regular expressions to be output after compilation. 
If the pattern was studied, the information returned is also output. The /F modifier causes pcretest to flip the byte order of the fields in - the compiled pattern that contain 2-byte and 4-byte numbers. This - facility is for testing the feature in PCRE that allows it to execute + the compiled pattern that contain 2-byte and 4-byte numbers. This + facility is for testing the feature in PCRE that allows it to execute patterns that were compiled on a host with a different endianness. This - feature is not available when the POSIX interface to PCRE is being - used, that is, when the /P pattern modifier is specified. See also the + feature is not available when the POSIX interface to PCRE is being + used, that is, when the /P pattern modifier is specified. See also the section about saving and reloading compiled patterns below. - The /S modifier causes pcre_study() to be called after the expression + The /S modifier causes pcre_study() to be called after the expression has been compiled, and the results used when the expression is matched. - The /M modifier causes the size of memory block used to hold the com- + The /M modifier causes the size of memory block used to hold the com- piled pattern to be output. - The /P modifier causes pcretest to call PCRE via the POSIX wrapper API - rather than its native API. When this is done, all other modifiers - except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present, - and REG_NEWLINE is set if /m is present. The wrapper functions force - PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set. + The /P modifier causes pcretest to call PCRE via the POSIX wrapper API + rather than its native API. When this is done, all other modifiers + except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present, + and REG_NEWLINE is set if /m is present. The wrapper functions force + PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set. 
- The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option - set. This turns on support for UTF-8 character handling in PCRE, pro- - vided that it was compiled with this support enabled. This modifier + The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option + set. This turns on support for UTF-8 character handling in PCRE, pro- + vided that it was compiled with this support enabled. This modifier also causes any non-printing characters in output strings to be printed using the \x{hh...} notation if they are valid UTF-8 sequences. - If the /? modifier is used with /8, it causes pcretest to call - pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the + If the /? modifier is used with /8, it causes pcretest to call + pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the checking of the string for UTF-8 validity. DATA LINES - Before each data line is passed to pcre_exec(), leading and trailing - whitespace is removed, and it is then scanned for \ escapes. Some of - these are pretty esoteric features, intended for checking out some of - the more complicated features of PCRE. If you are just testing "ordi- - nary" regular expressions, you probably don't need any of these. The + Before each data line is passed to pcre_exec(), leading and trailing + whitespace is removed, and it is then scanned for \ escapes. Some of + these are pretty esoteric features, intended for checking out some of + the more complicated features of PCRE. If you are just testing "ordi- + nary" regular expressions, you probably don't need any of these. 
The following escapes are recognized: \a alarm (= BEL) @@ -230,6 +248,8 @@ DATA LINES \e escape \f formfeed \n newline + \qdd set the PCRE_MATCH_LIMIT limit to dd + (any number of digits) \r carriage return \t tab \v vertical tab @@ -238,7 +258,9 @@ DATA LINES \x{hh...} hexadecimal character, any number of digits in UTF-8 mode \A pass the PCRE_ANCHORED option to pcre_exec() + or pcre_dfa_exec() \B pass the PCRE_NOTBOL option to pcre_exec() + or pcre_dfa_exec() \Cdd call pcre_copy_substring() for substring dd after a successful match (number less than 32) \Cname call pcre_copy_named_substring() for substring @@ -262,41 +284,58 @@ DATA LINES ated by next non-alphanumeric character) \L call pcre_get_substringlist() after a successful match - \M discover the minimum MATCH_LIMIT setting + \M discover the minimum MATCH_LIMIT and + MATCH_LIMIT_RECURSION settings \N pass the PCRE_NOTEMPTY option to pcre_exec() + or pcre_dfa_exec() \Odd set the size of the output vector passed to pcre_exec() to dd (any number of digits) \P pass the PCRE_PARTIAL option to pcre_exec() or pcre_dfa_exec() + \Qdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd + (any number of digits) \R pass the PCRE_DFA_RESTART option to pcre_dfa_exec() \S output details of memory get/free calls during matching \Z pass the PCRE_NOTEOL option to pcre_exec() + or pcre_dfa_exec() \? pass the PCRE_NO_UTF8_CHECK option to - pcre_exec() + pcre_exec() or pcre_dfa_exec() \>dd start the match at offset dd (any number of digits); this sets the startoffset argument for pcre_exec() + or pcre_dfa_exec() + \<cr> pass the PCRE_NEWLINE_CR option to pcre_exec() + or pcre_dfa_exec() + \<lf> pass the PCRE_NEWLINE_LF option to pcre_exec() + or pcre_dfa_exec() + \<crlf> pass the PCRE_NEWLINE_CRLF option to pcre_exec() + or pcre_dfa_exec() - A backslash followed by anything else just escapes the anything else. - If the very last character is a backslash, it is ignored. 
This gives a - way of passing an empty line as data, since a real empty line termi- - nates the data input. + The escapes that specify line endings are literal strings, exactly as + shown. A backslash followed by anything else just escapes the anything + else. If the very last character is a backslash, it is ignored. This + gives a way of passing an empty line as data, since a real empty line + terminates the data input. If \M is present, pcretest calls pcre_exec() several times, with dif- - ferent values in the match_limit field of the pcre_extra data struc- - ture, until it finds the minimum number that is needed for pcre_exec() - to complete. This number is a measure of the amount of recursion and - backtracking that takes place, and checking it out can be instructive. - For most simple matches, the number is quite small, but for patterns - with very large numbers of matching possibilities, it can become large - very quickly with increasing length of subject string. - - When \O is used, the value specified may be higher or lower than the + ferent values in the match_limit and match_limit_recursion fields of + the pcre_extra data structure, until it finds the minimum numbers for + each parameter that allow pcre_exec() to complete. The match_limit num- + ber is a measure of the amount of backtracking that takes place, and + checking it out can be instructive. For most simple matches, the number + is quite small, but for patterns with very large numbers of matching + possibilities, it can become large very quickly with increasing length + of subject string. The match_limit_recursion number is a measure of how + much stack (or, if PCRE is compiled with NO_RECURSE, how much heap) + memory is needed to complete the match attempt. + + When \O is used, the value specified may be higher or lower than the size set by the -O command line option (or defaulted to 45); \O applies only to the call of pcre_exec() for the line in which it appears. 
- If the /P modifier was present on the pattern, causing the POSIX wrap- - per API to be used, only \B and \Z have any effect, causing REG_NOTBOL - and REG_NOTEOL to be passed to regexec() respectively. + If the /P modifier was present on the pattern, causing the POSIX wrap- + per API to be used, the only option-setting sequences that have any + effect are \B and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively, + to be passed to regexec(). The use of \x{hh...} to represent UTF-8 characters is not dependent on the use of the /8 modifier on the pattern. It is recognized always. @@ -375,14 +414,15 @@ DEFAULT OUTPUT FROM PCRETEST Note that while patterns can be continued over several lines (a plain ">" prompt is used for continuations), data lines may not. However new- - lines can be included in data by means of the \n escape. + lines can be included in data by means of the \n escape (or \r or \r\n + for those newline settings). OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION - When the alternative matching function, pcre_dfa_exec(), is used (by - means of the \D escape sequence or the -dfa command line option), the - output consists of a list of all the matches that start at the first + When the alternative matching function, pcre_dfa_exec(), is used (by + means of the \D escape sequence or the -dfa command line option), the + output consists of a list of all the matches that start at the first point in the subject where there is at least one match. For example: re> /(tang|tangerine|tan)/ @@ -391,10 +431,10 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tang 2: tan - (Using the normal matching function on this data finds only "tang".) - The longest matching string is always given first (and numbered zero). + (Using the normal matching function on this data finds only "tang".) + The longest matching string is always given first (and numbered zero). 
- If /gP is present on the pattern, the search for further matches + If /gP is present on the pattern, the search for further matches resumes at the end of the longest match. For example: re> /(tang|tangerine|tan)/g @@ -406,16 +446,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION 1: tan 0: tan - Since the matching function does not support substring capture, the - escape sequences that are concerned with captured substrings are not + Since the matching function does not support substring capture, the + escape sequences that are concerned with captured substrings are not relevant. RESTARTING AFTER A PARTIAL MATCH When the alternative matching function has given the PCRE_ERROR_PARTIAL - return, indicating that the subject partially matched the pattern, you - can restart the match with additional subject data by means of the \R + return, indicating that the subject partially matched the pattern, you + can restart the match with additional subject data by means of the \R escape sequence. For example: re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/ @@ -424,30 +464,30 @@ RESTARTING AFTER A PARTIAL MATCH data> n05\R\D 0: n05 - For further information about partial matching, see the pcrepartial + For further information about partial matching, see the pcrepartial documentation. CALLOUTS - If the pattern contains any callout requests, pcretest's callout func- - tion is called during matching. This works with both matching func- + If the pattern contains any callout requests, pcretest's callout func- + tion is called during matching. This works with both matching func- tions. By default, the called function displays the callout number, the - start and current positions in the text at the callout time, and the + start and current positions in the text at the callout time, and the next pattern item to be tested. 
For example, the output --->pqrabcdef 0 ^ ^ \d - indicates that callout number 0 occurred for a match attempt starting - at the fourth character of the subject string, when the pointer was at - the seventh character of the data, and when the next pattern item was - \d. Just one circumflex is output if the start and current positions + indicates that callout number 0 occurred for a match attempt starting + at the fourth character of the subject string, when the pointer was at + the seventh character of the data, and when the next pattern item was + \d. Just one circumflex is output if the start and current positions are the same. Callouts numbered 255 are assumed to be automatic callouts, inserted as - a result of the /C pattern modifier. In this case, instead of showing - the callout number, the offset in the pattern, preceded by a plus, is + a result of the /C pattern modifier. In this case, instead of showing + the callout number, the offset in the pattern, preceded by a plus, is output. For example: re> /\d?[A-E]\*/C @@ -459,68 +499,68 @@ CALLOUTS +10 ^ ^ 0: E* - The callout function in pcretest returns zero (carry on matching) by - default, but you can use a \C item in a data line (as described above) + The callout function in pcretest returns zero (carry on matching) by + default, but you can use a \C item in a data line (as described above) to change this. - Inserting callouts can be helpful when using pcretest to check compli- - cated regular expressions. For further information about callouts, see + Inserting callouts can be helpful when using pcretest to check compli- + cated regular expressions. For further information about callouts, see the pcrecallout documentation. SAVING AND RELOADING COMPILED PATTERNS - The facilities described in this section are not available when the + The facilities described in this section are not available when the POSIX inteface to PCRE is being used, that is, when the /P pattern mod- ifier is specified. 
When the POSIX interface is not in use, you can cause pcretest to write - a compiled pattern to a file, by following the modifiers with > and a + a compiled pattern to a file, by following the modifiers with > and a file name. For example: /pattern/im >/some/file - See the pcreprecompile documentation for a discussion about saving and + See the pcreprecompile documentation for a discussion about saving and re-using compiled patterns. - The data that is written is binary. The first eight bytes are the - length of the compiled pattern data followed by the length of the - optional study data, each written as four bytes in big-endian order - (most significant byte first). If there is no study data (either the + The data that is written is binary. The first eight bytes are the + length of the compiled pattern data followed by the length of the + optional study data, each written as four bytes in big-endian order + (most significant byte first). If there is no study data (either the pattern was not studied, or studying did not return any data), the sec- - ond length is zero. The lengths are followed by an exact copy of the + ond length is zero. The lengths are followed by an exact copy of the compiled pattern. If there is additional study data, this follows imme- - diately after the compiled pattern. After writing the file, pcretest + diately after the compiled pattern. After writing the file, pcretest expects to read a new pattern. A saved pattern can be reloaded into pcretest by specifing < and a file - name instead of a pattern. The name of the file must not contain a < - character, as otherwise pcretest will interpret the line as a pattern + name instead of a pattern. The name of the file must not contain a < + character, as otherwise pcretest will interpret the line as a pattern delimited by < characters. For example: re>