pcrecpp (original) (raw)

PCRECPP(3) Library Functions Manual PCRECPP(3)

NAME PCRE - Perl-compatible regular expressions.

SYNOPSIS OF C++ WRAPPER

   #include <pcrecpp.h>

DESCRIPTION

   The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
   functionality was added by Giuseppe Maxia. This brief man page was con-
   structed  from  the  notes  in the pcrecpp.h file, which should be con-
   sulted for further details. Note that the C++ wrapper supports only the
   original  8-bit  PCRE  library. There is no 16-bit or 32-bit support at
   present.

MATCHING INTERFACE

   The "FullMatch" operation checks that supplied text matches a  supplied
   pattern  exactly.  If pointer arguments are supplied, it copies matched
   sub-strings that match sub-patterns into them.

     Example: successful match
        pcrecpp::RE re("h.*o");
        re.FullMatch("hello");

     Example: unsuccessful match (requires full match):
        pcrecpp::RE re("e");
        !re.FullMatch("hello");

     Example: creating a temporary RE object:
        pcrecpp::RE("h.*o").FullMatch("hello");

   You can pass in a "const char*" or a "string" for "text". The  examples
   below  tend to use a const char*. You can, as in the different examples
   above, store the RE object explicitly in a variable or use a  temporary
   RE  object.  The  examples below use one mode or the other arbitrarily.
   Either could correctly be used for any of these examples.

   You must supply extra pointer arguments to extract matched subpieces.

     Example: extracts "ruby" into "s" and 1234 into "i"
        int i;
        string s;
        pcrecpp::RE re("(\\w+):(\\d+)");
        re.FullMatch("ruby:1234", &s, &i);

     Example: does not try to extract any extra sub-patterns
        re.FullMatch("ruby:1234", &s);

     Example: does not try to extract into NULL
        re.FullMatch("ruby:1234", NULL, &i);

     Example: integer overflow causes failure
        !re.FullMatch("ruby:1234567891234", NULL, &i);

     Example: fails because there aren't enough sub-patterns:
        !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

     Example: fails because string cannot be stored in integer
        !pcrecpp::RE("(.*)").FullMatch("ruby", &i);

   The provided pointer arguments can be pointers to  any  scalar  numeric
   type, or one of:

      string        (matched piece is copied to string)
      StringPiece   (StringPiece is mutated to point to matched piece)
      T             (where "bool T::ParseFrom(const char*, int)" exists)
      NULL          (the corresponding matched sub-pattern is not copied)

   The  function returns true iff all of the following conditions are sat-
   isfied:

     a. "text" matches "pattern" exactly;

     b. The number of matched sub-patterns is >= number of supplied
        pointers;

     c. The "i"th argument has a suitable type for holding the
        string captured as the "i"th sub-pattern. If you pass in
        void * NULL for the "i"th argument, or a non-void * NULL
        of the correct type, or pass fewer arguments than the
        number of sub-patterns, "i"th captured sub-pattern is
        ignored.

   CAVEAT: An optional sub-pattern that does  not  exist  in  the  matched
   string  is assigned the empty string. Therefore, the following will re-
   turn false (because the empty string is not a valid number):

      int number;
      pcrecpp::RE::FullMatch("abc", "[a-z]+(\\d+)?", &number);

   The matching interface supports at most 16 arguments per call.  If  you
   need  more,  consider using the more general interface pcrecpp::RE::Do-
   Match. See pcrecpp.h for the signature for DoMatch.

   NOTE: Do not use no_arg, which is used internally to mark the end of  a
   list  of optional arguments, as a placeholder for missing arguments, as
   this can lead to segfaults.

QUOTING METACHARACTERS

   You can use the "QuoteMeta" operation to insert backslashes before  all
   potentially  meaningful  characters  in  a string. The returned string,
   used as a regular expression, will exactly match the original string.

     Example:
        string quoted = RE::QuoteMeta(unquoted);

   Note that it's legal to escape a character even if it  has  no  special
   meaning  in  a  regular expression -- so this function does that. (This
   also makes it identical to the perl function  of  the  same  name;  see
   "perldoc    -f    quotemeta".)    For   example,   "1.5-2.0?"   becomes
   "1\.5\-2\.0\?".

PARTIAL MATCHES

   You can use the "PartialMatch" operation when you want the  pattern  to
   match any substring of the text.

     Example: simple search for a string:
        pcrecpp::RE("ell").PartialMatch("hello");

     Example: find first number in a string:
        int number;
        pcrecpp::RE re("(\\d+)");
        re.PartialMatch("x*100 + 20", &number);
        assert(number == 100);

UTF-8 AND THE MATCHING INTERFACE

   By  default,  pattern  and text are plain text, one byte per character.
   The UTF8 flag, passed to  the  constructor,  causes  both  pattern  and
   string to be treated as UTF-8 text, still a byte stream but potentially
   multiple bytes per character. In practice, the text is likelier  to  be
   UTF-8  than  the pattern, but the match returned may depend on the UTF8
   flag, so always use it when matching UTF8 text. For example,  "."  will
   match  one  byte normally but with UTF8 set may match up to three bytes
   of a multi-byte character.

     Example:
        pcrecpp::RE_Options options;
        options.set_utf8();
        pcrecpp::RE re(utf8_pattern, options);
        re.FullMatch(utf8_string);

     Example: using the convenience function UTF8():
        pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
        re.FullMatch(utf8_string);

   NOTE: The UTF8 flag is ignored if pcre was not configured with the
         --enable-utf8 flag.

PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE

   PCRE defines some modifiers to change the behavior of the  regular  ex-
   pression  engine.  The  C++  wrapper defines an auxiliary class, RE_Op-
   tions, as a vehicle to pass such modifiers to a  RE  class.  Currently,
   the following modifiers are supported:

      modifier              description               Perl corresponding

      PCRE_CASELESS         case insensitive match      /i
      PCRE_MULTILINE        multiple lines match        /m
      PCRE_DOTALL           dot matches newlines        /s
      PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
      PCRE_EXTRA            strict escape parsing       N/A
      PCRE_EXTENDED         ignore white spaces         /x
      PCRE_UTF8             handles UTF8 chars          built-in
      PCRE_UNGREEDY         reverses * and *?           N/A
      PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)

   (*)  Both Perl and PCRE allow non capturing parentheses by means of the
   "?:" modifier within the pattern itself. e.g. (?:ab|cd) does  not  cap-
   ture, while (ab|cd) does.

   For  a  full  account on how each modifier works, please check the PCRE
   API reference page.

   For each modifier, there are two member functions whose  name  is  made
   out  of  the modifier in lowercase, without the "PCRE_" prefix. For in-
   stance, PCRE_CASELESS is handled by

     bool caseless()

   which returns true if the modifier is set, and

     RE_Options & set_caseless(bool)

   which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
   be  accessed  through  the  set_match_limit()  and match_limit() member
   functions. Setting match_limit to a non-zero value will limit the  exe-
   cution  of pcre to keep it from doing bad things like blowing the stack
   or taking an eternity to return a result.  A  value  of  5000  is  good
   enough  to stop stack blowup in a 2MB thread stack. Setting match_limit
   to  zero  disables  match  limiting.  Alternatively,   you   can   call
   match_limit_recursion()  which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to
   limit how much  PCRE  recurses.  match_limit()  limits  the  number  of
   matches PCRE does; match_limit_recursion() limits the depth of internal
   recursion, and therefore the amount of stack that is used.

   Normally, to pass one or more modifiers to a RE class,  you  declare  a
   RE_Options object, set the appropriate options, and pass this object to
   a RE constructor. Example:

      RE_Options opt;
      opt.set_caseless(true);
      if (RE("HELLO", opt).PartialMatch("hello world")) ...

   RE_options has two constructors. The default constructor takes no argu-
   ments  and creates a set of flags that are off by default. The optional
   parameter option_flags is to facilitate transfer of legacy code from  C
   programs.  This lets you do

      RE(pattern,
        RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);

   However, new code is better off doing

      RE(pattern,
        RE_Options().set_caseless(true).set_multiline(true))
          .PartialMatch(str);

   If you are going to pass one of the most used modifiers, there are some
   convenience functions that return a RE_Options class with the appropri-
   ate  modifier  already  set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
   and EXTENDED().

   If you need to set several options at once, and you don't  want  to  go
   through  the pains of declaring a RE_Options object and setting several
   options, there is a parallel method that give you such ability  on  the
   fly.  You  can  concatenate several set_xxxxx() member functions, since
   each of them returns a reference to its class object. For  example,  to
   pass  PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
   statement, you may write:

      RE(" ^ xyz \\s+ .* blah$",
        RE_Options()
          .set_caseless(true)
          .set_extended(true)
          .set_multiline(true)).PartialMatch(sometext);

SCANNING TEXT INCREMENTALLY

   The "Consume" operation may be useful if you want to  repeatedly  match
   regular expressions at the front of a string and skip over them as they
   match. This requires use of the "StringPiece" type, which represents  a
   sub-range  of  a  real  string.  Like RE, StringPiece is defined in the
   pcrecpp namespace.

     Example: read lines of the form "var = value" from a string.
        string contents = ...;                 // Fill string somehow
        pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

        string var;
        int value;
        pcrecpp::RE re("(\\w+) = (\\d+)\n");
        while (re.Consume(&input, &var, &value)) {
          ...;
        }

   Each successful call to "Consume" will set "var/value",  and  also  ad-
   vance "input" so it points past the matched text.

   The "FindAndConsume" operation is similar to "Consume" but does not an-
   chor your match at the beginning of the string. For example, you  could
   extract all words from a string by repeatedly calling

     pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)

PARSING HEX/OCTAL/C-RADIX NUMBERS

   By default, if you pass a pointer to a numeric value, the corresponding
   text is interpreted as a base-10  number.  You  can  instead  wrap  the
   pointer with a call to one of the operators Hex(), Octal(), or CRadix()
   to interpret the text in another base. The CRadix  operator  interprets
   C-style  "0"  (base-8)  and  "0x"  (base-16)  prefixes, but defaults to
   base-10.

     Example:
       int a, b, c, d;
       pcrecpp::RE re("(.*) (.*) (.*) (.*)");
       re.FullMatch("100 40 0100 0x40",
                    pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                    pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));

   will leave 64 in a, b, c, and d.

REPLACING PARTS OF STRINGS

   You can replace the first match of "pattern" in "str"  with  "rewrite".
   Within  "rewrite",  backslash-escaped  digits (\1 to \9) can be used to
   insert text matching corresponding parenthesized group  from  the  pat-
   tern. \0 in "rewrite" refers to the entire matching text. For example:

     string s = "yabba dabba doo";
     pcrecpp::RE("b+").Replace("d", &s);

   will  leave  "s" containing "yada dabba doo". The result is true if the
   pattern matches and a replacement occurs, false otherwise.

   GlobalReplace is like Replace except that it replaces  all  occurrences
   of  the  pattern  in  the string with the rewrite. Replacements are not
   subject to re-matching. For example:

     string s = "yabba dabba doo";
     pcrecpp::RE("b+").GlobalReplace("d", &s);

   will leave "s" containing "yada dada doo". It returns the number of re-
   placements made.

   Extract  is like Replace, except that if the pattern matches, "rewrite"
   is copied into "out" (an additional argument) with substitutions.   The
   non-matching  portions  of "text" are ignored. Returns true iff a match
   occurred and the extraction happened successfully;  if no match occurs,
   the string is left unaffected.

AUTHOR

   The C++ wrapper was contributed by Google Inc.
   Copyright (c) 2007 Google Inc.

REVISION

   Last updated: 08 January 2012

PCRE 8.30 08 January 2012 PCRECPP(3)