An introduction to the re2 regular expression library for OCaml
Hi! I'm Ryan Moore, NBA fan & PhD candidate in Eric Wommack's viral ecology lab @ UD. Follow me on Twitter!
In this tutorial, we will talk about re2, an OCaml library providing bindings to RE2, Google’s regular expression library.
This post is intended for newer OCaml programmers, or those who want to use the re2 library but could use a couple of examples to help get started. This is not a general introduction to regular expressions, however. If you have never used regular expressions before, read up a little bit on the syntax before tackling this post.
Overview
There are a few choices for regular expression libraries available for OCaml on Opam. Some of the most popular include
- re, a pure OCaml library (installed 7667 times last month),
- pcre, bindings to the Perl Compatible Regular Expressions library (PCRE) (installed 1115 times last month), and
- re2, OCaml bindings for RE2, Google’s regular expression library (installed 114 times last month).
The first two are by far the most popular in terms of raw Opam install counts. However, re2 integrates nicely into the Jane Street Base/Core/Async ecosystem (it’s a Jane Street package after all!), and is covered under the MIT license rather than the LGPL with OCaml linking exception, which may be appealing depending on your situation.
Note: According to this blog post and this GitHub issue, Jane Street is phasing out its use of re2. The re2 GitHub does have recent commits, though, so your mileage may vary.
One issue that newcomers may face when getting started with the re2 library is the slightly terse API documentation. While it is detailed and thorough, it can be hard to get started with if you’re not already used to reading Jane Street mli files and source code.
Note: if you want to follow along, you can paste the examples into the toplevel (or utop). However, don’t paste in lines starting with "- :". These lines show the type of the expression as reported by utop.
Creating regular expressions
You create regular expressions with Re2.create and Re2.create_exn. The former returns Re2.t Or_error.t and the latter Re2.t.
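Here is a minimal sketch of the two constructors side by side. The helper name compiles and the patterns are made up for illustration, and I'm assuming the re2 package is linked into your build.

```ocaml
(* Assumes the re2 opam package is linked, e.g. (libraries re2) in dune,
   or #require "re2";; in utop. *)

(* create returns an Or_error.t, so an invalid pattern is an Error value. *)
let compiles pattern =
  match Re2.create pattern with
  | Ok (_ : Re2.t) -> true
  | Error _ -> false

let good = compiles "a+b"
let bad = compiles "a(b" (* unbalanced parenthesis *)

(* create_exn returns the regex directly and raises on an invalid pattern. *)
let re : Re2.t = Re2.create_exn "a+b"
```

A rough rule of thumb: reach for create when the pattern comes from user input, and create_exn when the pattern is a literal you control.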
Matching options
You can control how regular expression matching works by passing the options argument to the create and create_exn functions. If you omit this argument, the default options will be used. Here they are:
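As a small sketch, you can inspect the defaults yourself through Re2.Options.default. This assumes a release of re2 in which Re2.Options.t is a record (older releases used a different representation), and it only looks at the one field discussed in this post.

```ocaml
(* Assumes the re2 opam package is linked, and a release in which
   Re2.Options.t is a record type. *)
let defaults : Re2.Options.t = Re2.Options.default

(* Read a single field out of the default options. *)
let case_sensitive_by_default = defaults.Re2.Options.case_sensitive

let () = Printf.printf "case_sensitive = %b\n" case_sensitive_by_default
```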
For a more detailed description of these options, see the re2.h header file.
By default, re2 uses case-sensitive matching. To create a case-insensitive regex, pass in an options value like so.
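A sketch of what that can look like, again assuming a release where Re2.Options.t is a record with a case_sensitive field. The pattern and inputs are made up.

```ocaml
(* Assumes the re2 opam package is linked, and a release in which
   Re2.Options.t is a record type. *)

(* Start from the defaults and override only the field we care about. *)
let options = { Re2.Options.default with Re2.Options.case_sensitive = false }

let insensitive = Re2.create_exn ~options "hello"
let sensitive = Re2.create_exn "hello"

let a = Re2.matches insensitive "HeLLo world"
let b = Re2.matches sensitive "HeLLo world"
```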
Checking for a match
Perhaps the most basic regex task is to check if a string matches a given regular expression. You can use Re2.matches for this.
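For example, a quick sketch with a made-up pattern. Note that the match does not have to cover the whole string unless you anchor the pattern yourself.

```ocaml
(* Assumes the re2 opam package is linked. *)
let re = Re2.create_exn "a+b"

(* The pattern may match anywhere in the input. *)
let found = Re2.matches re "xxaaabxx"
let not_found = Re2.matches re "xyz"

(* Anchor with ^ and $ to require that the whole input matches. *)
let anchored = Re2.create_exn "^a+b$"
let whole = Re2.matches anchored "xxaaabxx"
```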
Finding matches
To find all matches of a regular expression in a string, you can use the find_* functions.
Find first match
To return the first match in the query string, use find_first or find_first_exn. These functions return the matched string rather than the underlying Re2.Match.t.
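A short sketch of both variants, using a made-up digits pattern:

```ocaml
(* Assumes the re2 opam package is linked. *)
let digits = Re2.create_exn "[0-9]+"

(* find_first_exn returns the matched text directly. *)
let first = Re2.find_first_exn digits "abc 123 def 456"

(* find_first wraps the result in Or_error.t, so no match is an Error. *)
let missing =
  match Re2.find_first digits "no digits here" with
  | Ok s -> Some s
  | Error _ -> None
```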
Find all matches
While find_first returns the first match in a query string, find_all and find_all_exn return lists of all non-overlapping matches in the query string.
Submatches and capturing groups
You can use the sub argument to return submatches defined by capturing groups rather than the whole match.
Be aware that passing an index greater than the number of capturing groups will raise an error.
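To make find_all and the sub argument concrete, here is a sketch with a made-up key=value pattern that has two capturing groups:

```ocaml
(* Assumes the re2 opam package is linked. *)
let pair = Re2.create_exn "([a-z]+)=([0-9]+)"
let input = "a=1, bb=22, ccc=333"

(* Without sub, each element is a whole match. *)
let whole = Re2.find_all_exn pair input

(* With sub, each element is the text of that capturing group. *)
let keys = Re2.find_all_exn ~sub:(`Index 1) pair input
let values = Re2.find_all_exn ~sub:(`Index 2) pair input
```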
Or_error returning vs. Exception raising
Like most of the functions in the Re2 module, the find functions come in both Or_error.t returning and exception raising versions. If the regular expression doesn’t match, find_all returns an Error whereas find_all_exn raises an exception.
It is important to remember that the find_all functions return non-overlapping matches.
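Here is a sketch of both points: handling the no-match case without exceptions, and what non-overlapping means in practice. The helper count is hypothetical.

```ocaml
(* Assumes the re2 opam package is linked. *)
let digits = Re2.create_exn "[0-9]+"

(* Treat "no match" as zero matches instead of letting an exception escape. *)
let count s =
  match Re2.find_all digits s with
  | Ok ms -> List.length ms
  | Error _ -> 0

let two = count "7 and 42"
let zero = count "no digits here"

(* "aa" could match at offsets 0, 1 and 2 of "aaaa", but only the
   non-overlapping matches at offsets 0 and 2 are reported. *)
let pairs = Re2.find_all_exn (Re2.create_exn "aa") "aaaa"
```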
Finding submatches
If you need a bit more control than provided by find_all with the sub argument (e.g., find_all ~sub:(`Index 1)), then you may need to use find_submatches or find_submatches_exn. These return the first match in the query string. The match is returned as a string option array, where the first element is the whole match, and subsequent elements are submatches as defined by any capturing groups.
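A small sketch with a made-up key=value pattern. Only the first match in the input is considered.

```ocaml
(* Assumes the re2 opam package is linked. *)
let pair = Re2.create_exn "([a-z]+)=([0-9]+)"

(* Element 0 is the whole match; elements 1 and 2 are the capturing groups. *)
let subs = Re2.find_submatches_exn pair "x=10 y=20"
```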
You may wonder why find_submatches_exn returns a string option array and not simply a string array. The reason is that find_submatches_exn uses Match.get under the hood. Basically, find_submatches_exn processes a Match.t Sequence.t of matches, calling get on each one. And the Match.get function returns a string option.
This little code snippet will hopefully give you an idea of what’s going on.
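As a sketch of why the option is useful, consider a made-up pattern with two alternatives, each in its own capturing group. Only one group can take part in any given match, so the other one has nothing to report.

```ocaml
(* Assumes the re2 opam package is linked. *)
let either = Re2.create_exn "(a)|(b)"
let m = Re2.first_match_exn either "b"

(* Index 0 is the whole match. *)
let whole = Re2.Match.get ~sub:(`Index 0) m

(* Group 1 did not take part in this match; group 2 did. *)
let group1 = Re2.Match.get ~sub:(`Index 1) m
let group2 = Re2.Match.get ~sub:(`Index 2) m
```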
If the Index you pass to ~sub is higher than the number of capturing groups plus one (i.e., the number returned from Re2.num_submatches), None is returned.
More complicated submatch interface
If you want to work with the Re2.Match.t directly, you can use functions from the complicated interface like first_match and get_matches.
If you need to work with submatches of every match in a string rather than just the first, and you need direct access to the Match.t, you will want to use get_matches or get_matches_exn. Let’s try it out with a weird, little example.
Say we have a string made up of chunks. Each chunk is a number followed by an A (for add) or an S (for subtract) (e.g., 50A and 3S). The chunk describes an arithmetic operation: 12A means add 12 to the previous total; 3S means subtract 3 from the previous total.
A full string then might look something like this: 10A5S2S3A, which represents the following sequence of operations: 0 + 10 - 5 - 2 + 3.
One way to solve this little problem is with regexes and the get_matches function. Let’s see how it might go.
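Here is one possible sketch of a solution. The names chunk_re and eval are made up, and it uses the standard library's List.fold_left rather than anything from Core, so adjust to taste if you have Core opened.

```ocaml
(* Assumes the re2 opam package is linked. *)

(* Group 1 is the number, group 2 is the operation. *)
let chunk_re = Re2.create_exn "([0-9]+)([AS])"

(* Walk over every chunk in order, updating a running total. *)
let eval s =
  Re2.get_matches_exn chunk_re s
  |> List.fold_left
       (fun total m ->
         let n = int_of_string (Re2.Match.get_exn ~sub:(`Index 1) m) in
         match Re2.Match.get_exn ~sub:(`Index 2) m with
         | "A" -> total + n
         | "S" -> total - n
         | op -> failwith ("unexpected operation: " ^ op))
       0

let result = eval "10A5S2S3A"
```

For the string above, the running total goes 10, 5, 3, 6. Note that this sketch quietly skips any text that does not look like a chunk; a stricter version would validate the whole string first.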
Note: This weird format is actually loosely based on the CIGAR strings found in SAM files describing biological sequence alignments.
Controlling submatches
In the last two examples, we used the sub argument along with a polymorphic variant to select capture groups. Let’s take a closer look at the type used for that.
To select submatches, we use id_t, which looks like this:
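As a sketch, the shape of the type as I understand it from the mli is a polymorphic variant with two cases. The local type below only mirrors it for illustration; the two values are annotated with the real Re2.id_t.

```ocaml
(* Assumes the re2 opam package is linked. *)

(* A local mirror of Re2.id_t, written out for illustration only. *)
type id_t =
  [ `Index of int
  | `Name of string
  ]

(* Either case can be passed wherever a function takes sub. *)
let by_index : Re2.id_t = `Index 1
let by_name : Re2.id_t = `Name "year"
```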
This type is used to refer to submatches. E.g., `Index 1 would be the result of the first capturing group, `Index 2 the second, etc. Remember that `Index 0 refers to the whole match.
In addition to referring to submatches/capturing groups by index, you can refer to them by name.
When using a complicated regular expression with multiple capturing groups, it is often less error prone to use named submatches rather than numbered ones.
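A sketch of named groups, using RE2's (?P<name>...) syntax. The pattern, group names, and input are made up.

```ocaml
(* Assumes the re2 opam package is linked. *)
let date = Re2.create_exn "(?P<year>[0-9]{4})-(?P<month>[0-9]{2})"
let input = "released 2021-03"

(* Refer to a group by name through the Match.t ... *)
let m = Re2.first_match_exn date input
let year = Re2.Match.get_exn ~sub:(`Name "year") m

(* ... or pass the name straight to one of the find functions. *)
let month = Re2.find_first_exn ~sub:(`Name "month") date input
```

If you later add another group to the front of the pattern, the names keep working while numbered indices would all shift by one.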
Note: It is not a compile-time error to try to access a capturing group that doesn’t exist in the regular expression. Depending on the function, you may get None or an exception may be raised.
Using id_t to control match efficiency
Many of the regex matching functions take a ?sub:id_t argument.
In some cases, you can increase the efficiency of matching by restricting the number of submatches. If you only care about whether a pattern matches, and not about submatches, you could pass in ~sub:(`Index (-1)) to many of the above functions. (The parentheses around -1 are required in OCaml.)
You can get increasingly more information by increasing the n passed to `Index n.
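As a rough sketch of the idea, the following asks get_matches_exn for whole-match information only, even though the (made-up) pattern has two capturing groups. Whether this is measurably faster will depend on your pattern and input.

```ocaml
(* Assumes the re2 opam package is linked. *)
let chunk_re = Re2.create_exn "([0-9]+)([AS])"

(* Only request information up to index 0, i.e. the whole match. *)
let ms = Re2.get_matches_exn ~sub:(`Index 0) chunk_re "10A5S"

(* The whole match is still available for each Match.t. *)
let wholes = List.map (Re2.Match.get_exn ~sub:(`Index 0)) ms
```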
This section of the documentation has more info on how specifying the sub argument can have an impact on regex performance, and which functions are affected by its usage.
Splitting strings
Another common regex task is splitting an input string based on a regular expression pattern. Re2 provides the split function for this purpose.
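A minimal sketch, splitting on a comma followed by any number of spaces (a made-up separator pattern):

```ocaml
(* Assumes the re2 opam package is linked. *)
let sep = Re2.create_exn ", *"

(* The separators themselves are dropped from the output. *)
let parts = Re2.split sep "a, b,c"
```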
If you need to include the actual matches in the output, you can. Passing ~include_matches:true keeps the “separators” in the output alongside the other pieces.
Just be aware of that final empty string at the end!
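A sketch of include_matches with a made-up comma separator. The input here deliberately does not end with a separator; if yours does, expect the extra empty string mentioned above.

```ocaml
(* Assumes the re2 opam package is linked. *)
let comma = Re2.create_exn ","

(* Pieces and separators are interleaved in their original order. *)
let with_seps = Re2.split ~include_matches:true comma "a,b"
```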
You can also limit the number of matches with the max argument. You could use this to get the first value separated from the remaining values in a string of tab-separated values, for example.
If the regular expression has no matches in the query string, then a one-element list containing the whole query string is returned.
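Both behaviors in one small sketch, using a literal tab as the separator. The input strings are made up.

```ocaml
(* Assumes the re2 opam package is linked. *)
let tab = Re2.create_exn "\t"

(* Split only at the first tab: the head, then everything else. *)
let head_and_rest = Re2.split ~max:1 tab "id\tfoo\tbar"

(* No tab in the input, so the whole input comes back as one element. *)
let unsplit = Re2.split tab "no tabs here"
```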
Replacing
Using rewrite
The simpler interface for regex replacement consists of the rewrite and rewrite_exn functions. The template argument defines how you want to replace any matches in the query string. In this case, we replace any matches with a capital A.
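A sketch of that kind of replacement, with a made-up digits pattern. Every match in the input is rewritten, not just the first.

```ocaml
(* Assumes the re2 opam package is linked. *)
let digits = Re2.create_exn "[0-9]+"

(* Each run of digits becomes a single capital A. *)
let out = Re2.rewrite_exn digits ~template:"A" "x1y22z"
```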
You can reference the submatches in the template string using the syntax \\n, where n is the index of the capturing group. Check it out.
If you have multiple submatches, just keep referring to them in the same way: \\1 ... \\2 ... etc.
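For instance, this sketch swaps the two sides of each key=value pair (a made-up pattern). The double backslash is just OCaml string escaping for a single backslash in the template.

```ocaml
(* Assumes the re2 opam package is linked. *)
let pair = Re2.create_exn "([a-z]+)=([0-9]+)"

(* \\1 is the key and \\2 is the value, so this template swaps them. *)
let swapped = Re2.rewrite_exn pair ~template:"\\2=\\1" "a=1 b=2"
```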
If you need to check if your rewrite template is valid before running rewrite, use the valid_rewrite_template function.
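A short sketch: a template that refers to a group the (made-up) pattern doesn't have is reported as invalid.

```ocaml
(* Assumes the re2 opam package is linked. *)
let pair = Re2.create_exn "([a-z]+)=([0-9]+)"

(* The pattern has two groups, so \\1 and \\2 are fine ... *)
let ok = Re2.valid_rewrite_template pair ~template:"\\2=\\1"

(* ... but \\3 refers to a group that does not exist. *)
let not_ok = Re2.valid_rewrite_template pair ~template:"\\3"
```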
Using replace
The re2 library also provides more powerful replacing functions: replace and replace_exn. You can use them if you need direct access to the Match.t.
Here is a silly example that picks a different replacement value depending on the match.
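Something along these lines, as a sketch: the function passed as f receives each Match.t and returns the replacement text. The pattern and the replacement words are made up.

```ocaml
(* Assumes the re2 opam package is linked. *)
let digits = Re2.create_exn "[0-9]+"

(* Pick the replacement based on how long the matched number is. *)
let out =
  Re2.replace_exn digits "1 20 300" ~f:(fun m ->
    let s = Re2.Match.get_exn ~sub:(`Index 0) m in
    if String.length s > 1 then "many" else "one")
```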
While the replace function is more complicated than rewrite, it gives you more control and has a few other options you may find useful.
Miscellaneous info
Escaping strings for regular expressions
Properly escaping regular expressions can sometimes be tricky, especially if you want to avoid illegal backslash characters in your strings.
Re2 provides a function escape that escapes its input in such a way that if you create a regex from the resulting escaped string, it would match the original string. Here’s how it works.
Depending on how many special characters are in the string you use to build the regex, escaping can be pretty noisy! In these cases, escape is especially useful.
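A sketch with a made-up literal that is full of regex metacharacters. Without escaping, the + and the parentheses would be interpreted as regex syntax.

```ocaml
(* Assumes the re2 opam package is linked. *)
let literal = "1+1=2 (really?)"

(* Build a regex that matches the literal text exactly. *)
let exact = Re2.create_exn (Re2.escape literal)
let found = Re2.matches exact "I heard 1+1=2 (really?) today"

(* The unescaped version treats +, ( ) and ? as operators instead. *)
let naive = Re2.create_exn literal
let naive_found = Re2.matches naive "I heard 1+1=2 (really?) today"
```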
Infix matching operator
If you’re feeling nostalgic for Perl, feel free to use the =~ infix operator!
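A sketch of the operator in use, assuming it lives in the Re2.Infix module with the input string on the left and the regex on the right. The helper is_number is made up.

```ocaml
(* Assumes the re2 opam package is linked, and that the operator is
   exposed as Re2.Infix.( =~ ). *)
let number = Re2.create_exn "^[0-9]+$"

(* Open Infix locally so the operator does not leak into other code. *)
let is_number s = Re2.Infix.(s =~ number)

let yes = is_number "12345"
let no = is_number "12a45"
```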
“Precompiling” your regular expressions
Unless you have a good reason not to, you will probably want to create your regular expression outside of the function that will be using it.
To see why, let’s check out this little benchmark program that compares two functions. The first one reuses a regex that is created outside of the function, whereas the second one creates a new regex each time the function is called.
Note: This benchmark program uses Jane Street’s core_bench micro-benchmarking library.
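The two functions being compared might look roughly like this. This is only a sketch with made-up names and a made-up pattern, and it leaves out the core_bench harness itself.

```ocaml
(* Assumes the re2 opam package is linked. *)

(* The regex is compiled once, when the module is initialized. *)
let digits = Re2.create_exn "[0-9]+"
let has_digits_outside s = Re2.matches digits s

(* The regex is compiled again on every single call. *)
let has_digits_inside s = Re2.matches (Re2.create_exn "[0-9]+") s
```

Both functions return the same answers; the only difference is how often the pattern gets compiled.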
Name | Time/Run | mWd/Run | Percentage |
---|---|---|---|
outside | 272.60 ns | 2.00 w | 3.74% |
inside | 7_281.55 ns | 91.00 w | 100.00% |
As you can see, reusing a regex rather than creating a new one each time a function is called makes a big difference in this benchmark. Keep in mind that this is a micro-benchmark, and that this difference may not be that important to the run time of your program as a whole. That said, if you had the slow version of the above function in a hot loop, it could really be wasting a lot of CPU cycles.
Wrap up
Hopefully this overview helps you get started with using re2!
To get more info about using re2, check out the API docs. Additionally, the re2 source code is quite readable. I encourage you to take a look at how the functions are defined; it may help clear up any additional questions you have!