Regular expressions: a short tutorial

Also known as “I typed an expression and the replaced text is not at all what I expected!” and “What’s a regular expression?”

Regular expressions are a powerful way of expressing what you’re looking for. Many people get confused by regular expressions, so if you’re feeling confused, don’t despair, and read on.

Regular expressions are heavily used to process text, because they let you describe precisely what you’re looking for. You can also use them to transform every occurrence of the found text: transposing words, changing plurals to singulars, incrementing a number, changing prefixes, the sky’s the limit.

Thus, you’ll be hearing a lot about “matches”. A “match” is just an elegant and short way of describing any text found by your regular expression. If you’re replacing text instead of just searching (as is the case with WordPress search and replace), any text that matches your expression will get replaced by the replacement text you specified.

How to compose a regular expression

…details, details…

This tutorial documents Perl-compatible regular expressions (as implemented by PHP), which are very similar to extended regular expressions. Basic sed and grep expressions have much less functionality.

What is a regular expression? A regular expression is simply a string that describes a pattern. Patterns are in common use these days; examples are the patterns typed into a search engine to find web pages and the patterns used to list files in a directory, e.g., ls *.txt or dir *.*. The patterns described by regular expressions are used to search strings in your content and replace all matches of the expression with the replacement text.

The simplest regexp is simply a word, or more generally, a string of characters. A regexp consisting of a word catdog matches any string that contains “catdog” (and, if you’re doing search and replace, all instances of the word “catdog” get replaced by your replacement text). Your word can contain spaces, so cat chases dog would match “cat chases dog” in your content. Beware that a word does not necessarily match a complete word in the English sense: it can also match parts of a word (for example, cat would match “catdog” and “catsup”). To learn how to match an actual word instead of substrings, keep reading.
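To see the substring behaviour for yourself, here is a minimal sketch in Python, whose re module understands much the same syntax as the PHP flavour described in this tutorial. The sample sentence and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, invented for this example.
text = "My catdog ate the catsup while the cat slept."

# A plain word is already a regular expression. It matches anywhere in the
# text, including inside longer words such as "catdog" and "catsup".
found = re.findall(r"cat", text)

# Search and replace: every match of "catdog" becomes "hamster".
replaced = re.sub(r"catdog", "hamster", text)

print(found)
print(replaced)
```

Note that “cat” is found three times, even though it appears as a word of its own only once.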

Metacharacters

Some characters have special meaning in a regular expression: {}[]()^$.|*+?\. These are called metacharacters. If you need to look for anything that has these metacharacters, you need to escape them: prefix them with a backslash (\). So if you need to search and replace “What is it?”, your regular expression would need to be What is it\?.
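Forgetting to escape is a classic source of surprises, so here is a small Python sketch of what goes wrong. The sample text and replacement are invented; re.escape is Python’s own helper for escaping a literal string, and other languages have their own equivalent.

```python
import re

# Hypothetical sample text, invented for this example.
text = "What is it? Nobody knows."

# Unescaped: "t?" means "an optional t", so the question mark in the text
# is never matched and survives the replacement.
unescaped = re.sub(r"What is it?", "Who cares.", text)

# Escaped: "\?" matches a literal question mark.
escaped = re.sub(r"What is it\?", "Who cares.", text)

# re.escape() adds the backslashes for you when the search text is literal.
automatic = re.sub(re.escape("What is it?"), "Who cares.", text)

print(unescaped)
print(escaped)
```

The unescaped version leaves a stray “?” behind, which is exactly the kind of result that is “not at all what I expected”.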

In addition to the metacharacters, there are some characters which don’t have printable character equivalents and can only be represented by escape sequences instead. Common examples are \t for a tab and \n for a newline.

The almighty period and his friends

Simply put, a period matches any character. comp.ter would match “computer”, “compater”, “competer”, and so on. This means that, to actually look for a period, you need to escape it with the backslash character (\.).

The expression \s matches a single whitespace character (a space, a tab or a newline, among others). In contrast, the expression \S matches any character that is not whitespace. \d matches a single digit (0 to 9). \w matches a word character (a letter, a digit or an underscore).
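The period and the backslash shorthands are easiest to understand by trying them on short strings. The following Python sketch does that; all sample strings are invented, and Python’s \d and \w also accept non-ASCII digits and letters by default, which may differ from your PHP setup.

```python
import re

# Hypothetical sample text, invented for this example.
words = "computer compater comp.ter compter"

any_char = re.findall(r"comp.ter", words)      # the period stands for any one character
literal_dot = re.findall(r"comp\.ter", words)  # the escaped period matches only "."

digits = re.findall(r"\d", "Room 42")          # one digit per match
spaces = re.findall(r"\s", "a b\tc\n")         # space, tab and newline all count
word_chars = re.findall(r"\w", "a_1-b")        # letters, digits and the underscore
non_spaces = re.findall(r"\S", "a b")          # everything that is not whitespace

print(any_char)
print(literal_dot)
```

“compter” is not matched by comp.ter, because the period insists on exactly one character being there.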

The ^ character matches the beginning of the string, or the beginning of every line when multiline mode (the m modifier) is enabled. In other words, ^Mom would match the word “Mom”, but only if it was at the beginning of a line. The $ character is its counterpart: it matches at the end of the string or line.
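Whether ^ and $ work per line or only on the whole text depends on a switch, so it is worth checking in your own tool. This Python sketch shows both behaviours; re.MULTILINE is Python’s name for the multiline switch, and the three-line sample text is invented.

```python
import re

# Hypothetical three-line sample text, invented for this example.
text = "Mom called.\nI called Mom\nMom"

# Without the multiline switch, ^ and $ anchor to the whole string.
start_default = re.findall(r"^Mom", text)
end_default = re.findall(r"Mom$", text)

# With it, ^ and $ also anchor to the start and end of every line.
start_multiline = re.findall(r"^Mom", text, flags=re.MULTILINE)
end_multiline = re.findall(r"Mom$", text, flags=re.MULTILINE)

print(len(start_default), len(start_multiline))
print(len(end_default), len(end_multiline))
```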

The \b sequence matches a word boundary, that is, the position between a word character and a non-word character. Put another way, \bdog\b will match the full word “dog”, whether it’s at the beginning of the line, at the end of the line, followed or preceded by whitespace or punctuation, but it will never match “dogs” or “catdog”, because of the word boundaries at the sides of \bdog\b.
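Here is the word-boundary behaviour as a short Python sketch. The sample sentence and the replacement word are invented for illustration.

```python
import re

# Hypothetical sample text, invented for this example.
text = "dog, dogs, catdog and a dog"

# \b on both sides means "dog" must stand alone as a word.
whole_words = re.findall(r"\bdog\b", text)
replaced = re.sub(r"\bdog\b", "wolf", text)

# Without the boundaries, "dogs" and "catdog" are hit as well.
substrings = re.findall(r"dog", text)

print(replaced)
```

Only the first and the last “dog” are replaced; “dogs” and “catdog” are left alone.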

Examples:

   /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
   /[\d\s]/;         # matches any digit or whitespace character
   /\w\W\w/;         # matches a word char, followed by a
                     # non-word char, followed by a word char
   /..rt/;           # matches any two chars, followed by ’rt’
   /end\./;          # matches ’end.’
   /end[.]/;         # same thing, matches ’end.’

Character classes

You can express single-character alternatives by using a character class. Character classes are denoted by brackets [...], with the set of characters to be possibly matched inside. Hence, a regexp [abc]ar would match “aar”, “bar” and “car”. [yY][eE][sS] would match any possible combination of upper and lowercase characters of the word “yes”. [0-9] would, in turn, match any single digit.

An expression of the form [^0-9], in contrast, would match a single character that is not a digit from zero to nine. The initial caret inverts the meaning of the character class, in effect saying “match anything that’s not in this list”.
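Character classes and their inverted form can be tried out with a few lines of Python. This is only a sketch; the sample strings (including the phone number) are invented.

```python
import re

# [abc] accepts exactly one of the listed characters.
one_of = re.findall(r"[abc]ar", "aar bar car dar ear")

# One class per letter gives every upper/lowercase spelling of "yes".
candidates = ["yes", "YES", "yEs", "yep"]
yes_like = [s for s in candidates if re.fullmatch(r"[yY][eE][sS]", s)]

# The inverted class [^0-9] matches every character that is NOT a digit,
# so replacing its matches with nothing keeps only the digits.
digits_only = re.sub(r"[^0-9]", "", "tel: 555-0199")

print(one_of)
print(yes_like)
print(digits_only)
```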

Grouping and replacement identifiers

You can define groups of matches using parentheses (...), and within those groups, alternatives, with the pipe character |. (at most|at least|exactly) would match the phrase “at most” in “at most 30 seconds”, “at least” in “at least 10 kilometers” and “exactly” in “arrive exactly at ten o’clock”.
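A quick way to check which alternative was picked is to ask for the contents of the group. The Python sketch below does that for the three sample sentences, plus one invented sentence that contains none of the alternatives.

```python
import re

pattern = r"(at most|at least|exactly)"

sentences = [
    "at most 30 seconds",
    "at least 10 kilometers",
    "arrive exactly at ten o'clock",
]

# group(1) is the text matched by the first pair of parentheses.
picked = [re.search(pattern, s).group(1) for s in sentences]

# A sentence with none of the alternatives gives no match at all.
no_match = re.search(pattern, "about an hour")

print(picked)
```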

A note on the replacement identifiers described below: their spelling varies between tools. sed, for instance, writes them as \1 ... \9. This tutorial uses the $1 ... $9 form accepted by PHP, which is the one that matters for WordPress search and replace.

What’s most interesting about groups is that the text they matched can be reused in the replacement text. Say you have a regular expression My wife has (one|two|[1-9]) (cat|dog)s as a search expression. Using My mistress ignores that my wife has $1 $2s as a replacement, you can ensure that your mistress will keep ignoring your wife’s animals, whether they are “one”, “two”, any digit from 1 to 9, or whether they are dogs or cats.

In other words, $1 represents “the text matched by the first pair of parentheses”, $2 represents “the text matched by the second pair”, and so on.
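Here is the same search and replace as a Python sketch. One assumption to keep in mind: Python spells the identifiers \1 and \2 in the replacement rather than $1 and $2. That is a quirk of Python, not of the expressions in this tutorial. The word-swapping line at the end is an extra, invented example.

```python
import re

pattern = r"My wife has (one|two|[1-9]) (cat|dog)s"

# Python writes \1 and \2 where PHP accepts $1 and $2.
replacement = r"My mistress ignores that my wife has \1 \2s"

first = re.sub(pattern, replacement, "My wife has two dogs")
second = re.sub(pattern, replacement, "My wife has 7 cats")

# Groups also make it easy to transpose two words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(first)
print(second)
print(swapped)
```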

Repetition

Now that you know the basic rules for composing regular expressions, it’s time to touch on the subject of repetition. It’s kind of hard to explain theoretically, so I’ll use examples:

  • a*: match “a” any number of times
    a* would match “”, “a” and the “aa” at the start of “aabanhy” or “aardvark”
  • a+: match “a” at least one time
    a+ would match “a” and the “aa” at the start of “aabanhy” or “aardvark”, but it would not match “”
  • a{3}: match “a” three times
    6{3} would match “666”
  • a{1,5}: match “a” from one to five times
    a{1,2}rdvark would match “aardvark” and “ardvark”
  • a?: match “a” zero or one times
    neighbou?r would match “neighbor” and “neighbour”

These characters ? * {x} {x,y} + are called quantifiers. You can place them after characters, character classes ([...]) or groups. For example, [0-9]{3} matches any three-digit string. (dog|cat){3} would match any three consecutive “dog” or “cat” strings, whether it’s “dogcatdog”, “catcatcat” or any other combination.
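The quantifiers can be checked one by one with a small Python sketch. The helper matches_whole is a made-up name for this example: it reports whether an expression matches a string from its first character to its last, which makes the counting easy to see.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "a* on empty":          matches_whole(r"a*", ""),                       # zero times is fine
    "a+ on empty":          matches_whole(r"a+", ""),                       # needs at least one
    "6{3}":                 matches_whole(r"6{3}", "666"),
    "one a":                matches_whole(r"a{1,2}rdvark", "ardvark"),
    "two a":                matches_whole(r"a{1,2}rdvark", "aardvark"),
    "three a":              matches_whole(r"a{1,2}rdvark", "aaardvark"),    # one too many
    "neighbor":             matches_whole(r"neighbou?r", "neighbor"),
    "neighbour":            matches_whole(r"neighbou?r", "neighbour"),
    "three digits":         matches_whole(r"[0-9]{3}", "042"),
    "three animals":        matches_whole(r"(dog|cat){3}", "dogcatdog"),
    "two animals with {3}": matches_whole(r"(dog|cat){3}", "dogcat"),
    "one animal with +":    matches_whole(r"(dog|cat)+", "dog"),            # + is "one or more"
}

# a+ grabs each run of consecutive "a" characters.
runs = re.findall(r"a+", "aabanhy")

for name, result in checks.items():
    print(name, result)
```

Note the difference between the last two group examples: {3} demands exactly three repetitions, while + is satisfied by a single one.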

Greediness in regular expressions

Greediness is a big deal when dealing with regular expressions. In short: regular expressions are “greedy” (for lack of a better word). This means that an expression which contains repetitions will attempt to “swallow up” as many characters as possible.

Let’s use an example. If the original text is an HTML snippet that says “<b>hey</b> dude <b>how you doin’</b>” and you use <b>.*</b> as an expression and “<i>replaced!</i>” as replacement, what would you expect as the final text? Common sense says “<i>replaced!</i> dude <i>replaced!</i>”, right? Wrong. The final text would be “<i>replaced!</i>”.

Why is that? Let’s see. The .* in the expression means match anything except a newline, any number of times. Since regular expressions are greedy, they interpret “any number of times” to be “as much text as possible, damnit!”. Thus, that regular expression would actually match the entire text, instead of just the first portion.

How can greediness be disabled? Simple. Any quantifier can be “un-greedinized” by adding a ? (question mark) after the quantifier. In other words, the more “correct” expression in the last example would be <b>.*?</b>. In a sense, the question mark is saying to the expression “please eat up as few characters as possible!”.
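The difference is easy to reproduce. This Python sketch runs the HTML example from this section with the greedy and the non-greedy expression; the variable names are invented.

```python
import re

text = "<b>hey</b> dude <b>how you doin'</b>"

# Greedy: .* runs to the LAST </b>, so the whole snippet is one match.
greedy = re.sub(r"<b>.*</b>", "<i>replaced!</i>", text)

# Non-greedy: .*? stops at the FIRST </b>, so each bold span is its own match.
lazy = re.sub(r"<b>.*?</b>", "<i>replaced!</i>", text)

print(greedy)
print(lazy)
```

The first line printed is a single “<i>replaced!</i>”; the second keeps “dude” between two replaced spans.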

In summary

You’re done! I just gave you a lot of rope to hang yourself with. Go try it. After you’ve used regular expressions, I guarantee you’ll miss them where you can’t use them.

If you want an in-depth explanation of regular expressions, please consult the perlretut and perlre man pages – if you don’t have Perl installed, try googling for perlretut or perlre.

Copyright attribution: portions of this text are derived from the Perl regular expressions manual, in good faith that this action falls under fair use.