Beginner’s guide to regular expressions

Borrowed from Carl Alexander’s excellent tutorial.

No programming concept frightens programmers more than regular expressions. For a lot of us, seeing code with regular expressions in it can bring a sense of dread and anxiety. We often have no idea what’s going on or how a regular expression does what it does.

That said, regular expressions are a really powerful tool. (That’s why they’re used so much.) There’s always a problem that you can solve with a regular expression around the corner. (Or you can always imagine one!) On top of that, you can use them almost anywhere.

The good news is that you can get over that fear of regular expressions! Learning the basics is often enough to solve a wide range of problems. So let’s go over them together!

Using regular expressions is using a different programming language

But first, let’s start by talking about the biggest mistake people make with regular expressions. It’s that they don’t view using regular expressions as using a separate programming language. This, more often than not, is the source of the problems that people have with them.

If you don’t think that you’re using a different programming language, you won’t look for tools to use with regular expressions. You’ll try to use the same programming tools that you’re using to view the code that the regular expression is in. And that makes things a lot more complicated for you.

The easiest way to improve your ability to use regular expressions is by using a specialized IDE. There’s a lot of them that you can use right in your browser like regex101 or RegExr. If you’re on Windows, there’s RegexBuddy which is just fantastic. (It’s the only tool I miss from using Windows!)

These IDEs can help you debug your regular expressions. They can also explain what your regular expression is doing. This second feature is really useful while you’re familiarizing yourself with regular expressions. (That’s why using an IDE with regular expressions is so useful!)

So make sure to pick an IDE before you continue on! It’ll help make the rest of this article easier to process and understand. And you’ll be able to test things as we go along.

A bit of history

Alright, so what are regular expressions? And why do we use them so much? At its core, a regular expression is a search pattern that you use to find substrings inside a string. You can also use it to validate whether a string follows a specific pattern or not.

Regular expressions didn’t start off as a computer science tool like we think of them today. In the 1950s, the mathematician Stephen Cole Kleene created regular expressions as a way to describe regular languages. That’s a language that you can express as a regular expression. (So meta!)

It’s only later on that computer scientists adopted regular expressions. That’s because regular expressions were great at searching strings. This made them super useful to solve problems revolving around that.

Syntax

The hardest part of regular expressions is understanding the syntax. Everyone always sees regular expressions for the first time and wonders, “What are all these weird symbols and what do they mean!?” That’s, without a doubt, the biggest hurdle at first. It’s also what the rest of this article is going to be about!

Delimiters

The first element of a regular expression is the delimiters. These are the boundaries of your regular expressions. The most common delimiter that you’ll see with regular expressions is the slash (/) or forward slash.

That said, there’s more than one type of delimiter that you can use. It depends on the syntax that your programming language uses. For example, PCRE (which is the syntax that PHP uses) also supports the following delimiters:

  • Hashtags (#)
  • Percentage signs (%)
  • Plus signs (+)
  • Tildes (~)

You can also use brackets (such as (), {}, [] and <>) as delimiters. But this isn’t something that you’ll see very often or even at all. We’re only mentioning it for the sake of completeness.

Why use different delimiters?

All these delimiters might feel a bit like overkill. (And they kinda are.) Why would you even want to use different delimiters in the first place? The main reason to use a different delimiter is readability.

If your regular expression contains a lot of slashes, you’re stuck having to escape them. For example, let’s say that you want to check if a request uses HTTPS. Your regular expression with the default delimiter would be /https:\/\//. As you can see, this is hard to read since we have to escape every slash using a backslash.

But we can fix this by switching to another delimiter like the percentage sign. If we do that, our regular expression then becomes %https://%. And that’s a lot easier to read!
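As a point of comparison, here’s a minimal sketch in Python. Its re module takes the pattern as a plain string with no delimiters at all, so slashes never need escaping there. The uses_https helper and the sample URLs are made up for illustration.

```python
import re

def uses_https(url: str) -> bool:
    """Return True when the URL starts with the https:// scheme."""
    # Python patterns have no delimiters, so the slashes are plain
    # character literals and need no backslash.
    return re.match(r"https://", url) is not None

print(uses_https("https://example.com"))  # True
print(uses_https("http://example.com"))   # False
```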

Pattern, atoms and metacharacters

Inside our delimiters, we have the pattern that we want our regular expression to look for. This pattern is made up of what we call atoms. An atom can either be a character literal or a metacharacter.

Metacharacters are by far the most confusing part of regular expressions. They’re characters that have a special meaning when used in regular expressions. That said, understanding the meaning of these metacharacters is also super important.

In fact, it’s the essence of learning how to use regular expressions. A regular expression without metacharacters isn’t even a regular expression anymore. It’s just a string literal (also known as just a string!) like any other. (If you’re not comfortable with the idea of a string literal, you can check out this article on strings in PHP.)

This is why you’ll have to learn the meaning of these metacharacters. Most regular expression processors support at least fourteen metacharacters. These common metacharacters are {}[]()^$.|*+?\.

Defining the meaning of metacharacters

Alright, so you now know how important metacharacters are to regular expressions. The next step is to go over their meanings. These meanings fall into four broad groups: escaping, grouping, matching literal characters and quantifiers.

Escaping

The first meaning that we’ll look at is escaping. The backslash (\) is the escape character used by regular expressions. We saw an example of it earlier when we explained why we would use different delimiters.

In the example, we use the \/ character sequence to escape the slash delimiter. This told our regular expression processor that the slash wasn’t a delimiter. Instead, it told the regular expression processor to look for a literal slash character.

This example applies to all metacharacters that we listed earlier. If you want to use one of them as a literal character in your regular expression, you need to escape it with a backslash. That’s why you can sometimes see a regular expression with a lot of backslashes. It just means that it had to escape a lot of metacharacters.

Escaping is also used to convert literal characters into character classes. So for example, you can have w which is just a literal w. But you can also escape it so that it becomes \w. \w is a shortcut character class which represents all alphanumeric characters as well as _. (We’ll talk more about the \w shortcut character class soon!)
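Both uses of the backslash can be tried out with a short sketch. It uses Python’s re module (patterns there are plain strings without delimiters), and the sample strings are invented for the example.

```python
import re

# Unescaped, the dot is a metacharacter that matches any character.
print(bool(re.fullmatch(r"1.5", "1x5")))   # True
# Escaped, the dot only matches a literal dot.
print(bool(re.fullmatch(r"1\.5", "1x5")))  # False
print(bool(re.fullmatch(r"1\.5", "1.5")))  # True

# A plain w is a character literal; an escaped w is the \w shortcut class.
print(re.findall(r"w", "wow_1"))   # ['w', 'w']
print(re.findall(r"\w", "wow_1"))  # ['w', 'o', 'w', '_', '1']
```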

Matching character literals

Talking about the \w character class is a good way to bring up our next group of metacharacters! It’s the one used to match character literals. Knowing when to use these metacharacters instead of character literals is an important part of knowing how to use regular expressions.

If it’s not clear what a character literal is, it just means a letter, number or a symbol like a percentage sign. Let’s imagine that you have the following regular expression: /http:/. The four letters and colon inside the regular expression delimiters are character literals.

Character classes

Now, let’s say that you want to represent a single number between 0 and 9. (We also call these digits.) There doesn’t exist a character literal that represents all those numbers since each number is itself a character literal. (Ok, that might have been a bit of a confusing explanation!) That’s where the concept of character classes comes into play.

Character classes (also referred to sometimes as “character sets”) are a way to represent these multiple character literals that you want to match. We use square brackets ([]) to contain these character literals. This tells the regular expression to look for one of these character literals when searching a string.

Here’s a small example of character class usage. In English, you can write the word demoralize with either an s or a z. If you’re using a regular expression to look for that word with both spellings, you would use this regular expression: demorali[sz]e.

So what’s going on in this regular expression? Well, most of it is just character literals. So the regular expression processor wants to match a d, then an e, then an m and so on.

Well, that’s up until it hits the opening square bracket. At that point, it’ll pause and read what’s inside the square brackets which are our s and z character literals. Once it’s done that it’ll try to match one of the two character literals. After that, it’ll finish by trying to match the e character literal.
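Here’s a quick way to check that behaviour, sketched with Python’s re module. The third test word is a deliberate misspelling added for contrast.

```python
import re

# [sz] matches exactly one character: either an s or a z.
pattern = re.compile(r"demorali[sz]e")

for word in ["demoralise", "demoralize", "demoralice"]:
    # fullmatch only succeeds when the whole word fits the pattern.
    print(word, bool(pattern.fullmatch(word)))
```

The first two words match and the third one doesn’t, because c isn’t one of the character literals inside the square brackets.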

Representing a range of character literals

But this doesn’t quite solve our initial problem. Should we just represent numbers as [0123456789] in our regular expression? Well, you could, but you don’t have to!

Character classes also let us represent ranges of character literals. Instead of representing a number as [0123456789], you can represent it as [0-9]. This makes our character class take less space in our regular expression.

You can even represent letters of the alphabet using a range! It would look like this: [A-Za-z]. This character class covers both lowercase and uppercase letters of the alphabet. But if you only want lowercase letters, you can just use [a-z]. Same with only uppercase letters, you can use [A-Z].

Shortcut character classes

Using [A-Za-z0-9] as a character class whenever you want to match a letter or number is pretty cumbersome. Lucky for us, a lot of the regular expression processors have shortcuts for common character classes. For example, you can use \w (which we mentioned earlier) in your regular expression instead of the [A-Za-z0-9_] character class.

Each regular expression processor can have different shortcuts. That said, there are a few (like \w) that are common to most of them. The other common ones are \d and \s. \d is a shortcut for the [0-9] character class which you use to match a number. Meanwhile, \s is a shortcut for the [ \t\r\n\v\f] character class which you use to match a whitespace character.
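A small sketch of ranges and shortcut classes, using Python’s re module and a made-up sample string. One caveat that’s specific to Python 3: on regular strings, \d and \w also match non-ASCII digits and letters unless you pass the re.ASCII flag, so they’re only identical to [0-9] and [A-Za-z0-9_] on ASCII text like this one.

```python
import re

text = "Room 42, floor_3\tB"

# Ranges inside square brackets.
print(re.findall(r"[0-9]", text))  # ['4', '2', '3']
print(re.findall(r"[A-Z]", text))  # ['R', 'B']

# Shortcut character classes.
print(re.findall(r"\d", text))     # ['4', '2', '3']
print(re.findall(r"\w+", text))    # ['Room', '42', 'floor_3', 'B']
print(re.findall(r"\s", text))     # [' ', ' ', '\t']
```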

Metacharacters inside character classes

This is a good time to point out that most metacharacters don’t work inside a character class. The only ones that do work are the backslash (\), the closing square bracket (]) (for obvious reasons!) and the caret (^). The hyphen (-) is also special inside a character class since it’s what defines a range. We’re going to talk about the caret in just a moment.

If you want to use one of those metacharacters as a character literal, you have to escape it using a backslash. That’s why you can use the backslash in a character class. So, for example, [\^] would be a character class that would match a caret as a character literal.

But you can also use the backslash to escape literal characters. We saw this in the previous section when we talked about the \s shortcut character class. The \t is how we can represent a tab in a regular expression.

Negated character classes

Here’s another cool thing that you can do with character classes! You can define character classes that match anything but what’s in them. We call them negated character classes.

For example, we now know that we can use the [0-9] character class to match a number. So how would we rewrite that class if we wanted to say that we wanted to match anything but a number? Well, it’s simple! You just need to write your character class as [^0-9].

The [^ is how we define the negated character class. When your character class starts with a [^, it reverses the meaning of the character class. (That’s why we call it negated character class!) You’re telling the regular expression processor that you want to match any character literal but the ones inside the character class.

There also exists shortcut character classes for our negated character classes. In general, the shortcut character class for a negated character class is the capitalized version of the normal one. So, if we go back to three shortcut character classes that we saw earlier, the negated character classes would be:

  • \W which matches anything but an alphanumeric character literal or an underscore.
  • \D which matches anything but a number.
  • \S which matches anything but a whitespace character.
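To see negated character classes and their shortcuts side by side, here’s a minimal sketch with Python’s re module. The sample string is invented for the example.

```python
import re

text = "a1 b2"

# [^0-9] and \D both match anything that isn't a digit.
print(re.findall(r"[^0-9]", text))  # ['a', ' ', 'b']
print(re.findall(r"\D", text))      # ['a', ' ', 'b']

# \S matches anything that isn't whitespace.
print(re.findall(r"\S", text))      # ['a', '1', 'b', '2']

# \W matches anything that \w doesn't, which here is only the space.
print(re.findall(r"\W", text))      # [' ']
```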

Matching (almost) any character literal

There will be times where you don’t care about the character literal that you want your regular expression to match. This is where the famous dot (.) metacharacter comes in. You use the dot when you want to match a character literal without caring what that character literal is.

The only character literals that a dot won’t match by default are line break characters. That’s because, early on, regular expression processors worked line by line and not on whole files. This meant that it wasn’t necessary for the dot to match line breaks.

Nowadays, it’s pretty common to want to apply a regular expression to a whole file or large block of text. Most regular expression processors adapted to this need. They have an option to make the dot match all character literals including the line break ones.
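Here’s a small sketch of that default and of the option that changes it. It uses Python’s re module, where the option is the re.DOTALL flag, and a made-up sample string with a line break in it.

```python
import re

text = "cat cut c\nt"

# By default, the dot matches any character except a line break.
print(re.findall(r"c.t", text))             # ['cat', 'cut']

# With re.DOTALL, the dot matches line breaks as well.
print(re.findall(r"c.t", text, re.DOTALL))  # ['cat', 'cut', 'c\nt']
```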

Alright, so that’s it for the history lesson! But, as you can guess, the dot is a very powerful metacharacter. You’re going to see it and use it a lot. That said, as with most powerful things, it’s also easy to misuse it.

The main reason to be careful with the dot is that it can cause unforeseen bugs. That’s because the dot can match anything. This can cause your regular expression to match something you hadn’t expected.

A better alternative to the dot is the negated character class that we just saw. What’s good about the negated character class is that it forces you not to be lazy. You have to take a moment and think about the character literals that you don’t want your character class to match. We’ll see examples of this throughout the rest of the article!

Quantifiers

So far, we’ve only discussed how to represent a character literal or a group of character literals. We haven’t looked at how to represent if we want to match multiple of those in a row. This is where we need to introduce the concept of quantifiers.

What are quantifiers? They’re a way for you to tell the regular expression processor how many times you want to match an atom. (In case you forgot, an atom can be a character literal, a metacharacter or a character class.) There are three main quantifier metacharacters.

Quantifier metacharacters

The first quantifier metacharacter is the question mark (?). It tells the regular expression processor that the preceding atom is optional. Or, to put it another way, the question mark is a way to say that you want to match the preceding atom 0 or 1 times.

For example, you can use the question mark to deal with the “o” vs “ou” in US vs non-US English. The regular expression /behaviou?r/ will match both behavior and behaviour.

The next quantifier metacharacter is the star (*). It tells the regular expression processor that you want to match the preceding atom 0 or more times. An atom and quantifier combo that you’ll encounter a lot is the (in)famous .*. The .* combo tells a regular expression processor that you want to match anything for as long as possible.

The last quantifier metacharacter is the plus sign (+). It’s very similar to the star. It tells the regular expression processor that you want to match the preceding atom 1 or more times.
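The three quantifier metacharacters can be compared with a short sketch in Python’s re module. The ab patterns and the sample strings are only there for illustration.

```python
import re

# ? matches the preceding atom 0 or 1 times, so the u is optional.
print(re.findall(r"behaviou?r", "behavior and behaviour"))
# ['behavior', 'behaviour']

sample = "a ab abbb"

# * matches the preceding atom 0 or more times, so a lone a is fine.
print(re.findall(r"ab*", sample))  # ['a', 'ab', 'abbb']

# + matches the preceding atom 1 or more times, so at least one b is needed.
print(re.findall(r"ab+", sample))  # ['ab', 'abbb']
```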

Greediness

An important concept with quantifiers is greediness. Quantifiers give the regular expression processor a choice between two options. It can keep matching the repeating atom or it can stop.

By default, a regular expression processor is always greedy. It’ll always try to match a repeating atom as many times as it can. It will only stop matching a repeating atom when continuing to do so would prevent the regular expression processor from finding a match.

That’s why using the .* combo is so dangerous. By default, the regular expression processor will match any character as long as it can. This often leads to bugs because we underestimated how long the regular expression processor would keep matching .*!

An example of greed

Let’s look at a classic example of this greediness problem. That problem is trying to match an HTML tag. Let’s imagine that you have the following HTML:

Most of the time, you’ll start by using this regular expression: /<.*>/. You’d expect this regular expression to match each HTML tag in that HTML block. But it won’t.

Because of greediness, this regular expression is going to match the whole HTML block. This is a pretty common mistake that a lot of developers make. (You have or will make this mistake one day. It’s ok. Everyone does, even me!)

Making quantifiers lazy

There are two ways to solve this problem. The first one is to make a quantifier lazy instead of greedy. Lazy quantifiers will try to match the repeating atom as few times as possible.

So how do you turn a greedy quantifier into a lazy one? It’s simple! You just need to add a question mark at the end of it. This works for all three quantifier metacharacters that we’ve seen so far too!

?? is the lazy version of the ? quantifier metacharacter. It’ll try to match an atom 0 times instead of once if it can. Meanwhile, *? and +? are the lazy versions of the * and + quantifier metacharacters respectively. They’ll both try to match an atom as few times as possible instead of as many times as possible.

So that’s one way to solve the issue with the .* combo in our HTML example. You can use the lazy version of it which is .*?. So, if you changed the regular expression from /<.*>/ to /<.*?>/, your regular expression would match every HTML tag in the HTML block.
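The difference between the greedy and lazy versions is easy to see in a small sketch. It uses Python’s re module and a one-line HTML snippet that’s made up for this example (it isn’t the block from the original article).

```python
import re

# Hypothetical HTML snippet, assumed only for this illustration.
html = "<p>Hello <strong>world</strong></p>"

# Greedy: .* runs to the last > it can find, swallowing everything.
print(re.findall(r"<.*>", html))
# ['<p>Hello <strong>world</strong></p>']

# Lazy: .*? stops at the first > it can find, so each tag matches alone.
print(re.findall(r"<.*?>", html))
# ['<p>', '<strong>', '</strong>', '</p>']
```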

Use negated character classes

That said, laziness isn’t the only way to make a quantifier less greedy. We mentioned earlier that the negated character class is a good alternative to the dot metacharacter. Well, that’s also something that we can use with quantifiers too!

A negated character class is also a great way to make quantifiers less greedy. It’s a bit trickier to visualize when you’re starting off with regular expressions. But that’s why we have our HTML example!

So let’s go back to our HTML block. We’re trying to match all the HTML tags in it using a regular expression. The trick to using a negated character class to do this is to think about what defines an HTML tag.

We know what that is already. It’s the less-than sign (<) and greater-than sign (>). That’s why our greedy regular expression is /<.*>/.

But that’s also the key to replacing our dot with a negated character class! What we really want the dot to do is match anything but the greater-than sign. This, in turn, tells us the change that we need to make to our regular expression!

We need to change our regular expression from /<.*>/ to /<[^>]*>/. This will make it match all the HTML tags in our HTML block. No need for a lazy quantifier!
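Here’s a sketch of the negated character class version in Python’s re module. The HTML snippet is invented, and it deliberately has a line break inside a tag to show one more difference: [^>] happily matches a line break, while the dot doesn’t by default.

```python
import re

# Hypothetical snippet with a line break inside the opening tag.
html = '<a\nhref="#">link</a>'

# The negated character class matches both tags, line break included.
print(re.findall(r"<[^>]*>", html))  # ['<a\nhref="#">', '</a>']

# The lazy dot misses the opening tag because the dot stops at the line break.
print(re.findall(r"<.*?>", html))    # ['</a>']
```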

Custom quantifiers

But what happens when you need to match an atom a specific number of times? How can you define that you want, for example, to match a number 1 or 2 times? In a situation like that, you want to use curly brackets ({}) to define how many times you want to match an atom.

The format of curly brackets is {min,max}. min is the minimum number of times that you want the regular expression processor to match your atom. max is the maximum number of times you want it to match your atom. If you don’t want to set a maximum number of matches, you can leave it blank.

You might have realized that it’s possible to represent the quantifiers that we’ve seen so far using curly brackets. The ? quantifier is the same as {0,1}, * is the same as {0,} and + is the same as {1,}. These quantifier metacharacters are really just shortcuts for these common custom quantifiers.
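A minimal sketch of custom quantifiers, written with Python’s re module. The sample values are made up, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# {1,2} matches the preceding atom at least once and at most twice.
one_or_two_digits = re.compile(r"[0-9]{1,2}")

for value in ["7", "42", "365"]:
    print(value, bool(one_or_two_digits.fullmatch(value)))

# The shortcut quantifiers behave like their curly bracket equivalents.
sample = "a ab abbb"
print(re.findall(r"ab{0,}", sample) == re.findall(r"ab*", sample))  # True
print(re.findall(r"ab{1,}", sample) == re.findall(r"ab+", sample))  # True
```

Only "7" and "42" pass the first check since "365" has three digits.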

Grouping

Up to this point, we’ve only looked at quantifiers in the context of a repetition of a single atom. But what happens if you need to check for a repeating pattern of atoms? Well, for that, you need to use grouping.

Grouping atoms together is quite easy. You just need to put them in parentheses (()). This tells the regular expression processor that you want it to use these atoms as a single unit.

For example, let’s say that you have this regular expression: /dish(es)?/. This regular expression would match either the word “dish” or its plural form “dishes”. That’s because (es)? tells the regular expression processor to match the “es” character literals 0 or 1 times.
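Here’s how that looks in a short Python sketch using the re module. The second pattern pair isn’t from the article; it’s added to show how a quantifier applies to a whole group versus a single atom.

```python
import re

# (es)? makes the whole "es" group optional.
for word in ["dish", "dishes", "dishe"]:
    print(word, bool(re.fullmatch(r"dish(es)?", word)))

# With a group, + repeats "ab" as a unit. Without it, + only repeats the b.
print(bool(re.fullmatch(r"(ab)+", "ababab")))  # True
print(bool(re.fullmatch(r"ab+", "ababab")))    # False
print(bool(re.fullmatch(r"ab+", "abbb")))      # True
```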

Choosing between different regular expression patterns

You can also use grouping to let the regular expression processor pick between more than one pattern to match. This isn’t any different from the “or” operator that we use in programming. (The mathematical term for it is logical disjunction.) This can come in handy in a lot of situations.

A good example of that is if you want to look for specific HTTP request methods. In such a scenario, you might want to scan for only GET and POST methods in an access log file. With regular expressions, you would represent this choice between the two methods with (GET|POST).

The vertical bar (|) metacharacter is the “or” operator for regular expressions. It’s how we represent this choice between multiple patterns. You use it to separate the different pattern choices that you want the regular expression processor to look for.

There are no limits to how many pattern choices you can put inside your parentheses. That said, you should be aware that the regular expression processor will always stop when it reaches the first match. It will never try to find the best match out of all the choices.

This means that you should be careful with how you order your pattern choices. For example, let’s say that you’re looking for a URL path like /path and /path/value. If you write it as /(\/path|\/path\/value)/ (\/ is an escaped slash), the regular expression processor will never match /path/value. Instead, you want to write it as /(\/path\/value|\/path)/.
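The effect of the ordering can be checked with a small sketch in Python’s re module. Python patterns have no delimiters, so the slashes don’t need escaping here.

```python
import re

url_path = "/path/value"

# The first choice that matches wins, so /path is picked and the
# processor never tries /path/value.
print(re.search(r"(/path|/path/value)", url_path).group(1))  # /path

# Putting the longer choice first gives it a chance to match.
print(re.search(r"(/path/value|/path)", url_path).group(1))  # /path/value
```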

Capturing

But why should you care about the order of your pattern choices like we just saw? It’s because regular expressions don’t use parentheses only for grouping atoms together. They also use them for something called capturing.

Capturing lets you store part of a matched string for reuse. This is useful if you’re using regular expressions to extract information from a string. Or if you’re using regular expressions to do a complex search and replace inside a document.

Numbered capture group

By default, a capture group (that’s what we call the whole parentheses block) is numbered. What do we mean by that? Well, let’s say that we had a regular expression like this: /(\/path(\/[^\/]+))/.

This is a variation of our earlier regular expression for a URL path. (Also notice the awesome use of the negated character class!) It’ll match any string that starts with /path and that has a second subdirectory after /path. So for example, /path wouldn’t match, but /path/value would.

This regular expression also has two capture groups. The first one encompasses the entire regular expression. The second is only for the subdirectory after /path.

This order also represents the number of each capture group. (\/path(\/[^\/]+)) is capture group #1 and (\/[^\/]+) is capture group #2. You can reference the first one as \1 in a regular expression or $1 when using PHP or an application like a text editor. We call these types of references “backreferences”.
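Here’s a sketch of numbered capture groups and backreferences in Python’s re module. Python writes backreferences as \1 and \2 in both the pattern and the replacement string (it doesn’t use the $1 form), and the doubled-letter pattern at the end is an extra illustration that isn’t from the article.

```python
import re

match = re.search(r"(/path(/[^/]+))", "/path/value")

# Groups are numbered by the position of their opening parenthesis.
print(match.group(1))  # /path/value
print(match.group(2))  # /value

# In a replacement, \2 inserts whatever the second group captured.
print(re.sub(r"(/path(/[^/]+))", r"\2", "/path/value"))  # /value

# Inside a pattern, \1 matches the same text that group 1 just captured.
print(re.search(r"(\w)\1", "letter").group(0))  # tt
```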

Named capture group

Numbered capture groups like we just saw can often be tricky to use with large regular expressions. If you add or remove capture groups, the numbers representing those capture groups might change. This makes it hard to make changes to the regular expression without causing bugs.

To solve this issue, most modern regular expression processors let you name capture groups. This name replaces the number that you used to reference a capture group before. This lets you reference a capture group without having to worry that the reference to it might change.

So how do you use named capture groups? Well, let’s update our previous regular expression to use them so that you can see! We’ll name the first capture group full_path and the second capture group subdir.

/(?<full_path>\/path(?<subdir>\/[^\/]+))/ is our updated regular expression with named capture groups. You’ll notice that the way to define a named capture group is by using (?<name> at the beginning. You just need to replace name with whatever name you want to use for your capture group.
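Named capture groups are a place where syntax varies between regular expression processors. As a sketch, here’s the same idea in Python’s re module, which spells a named group as (?P<name>...) instead of the PCRE form used in this article.

```python
import re

# Python's re module uses the (?P<name>...) spelling for named groups.
pattern = re.compile(r"(?P<full_path>/path(?P<subdir>/[^/]+))")

match = pattern.search("/path/value")

# Groups can be read by name instead of by number.
print(match.group("full_path"))  # /path/value
print(match.group("subdir"))     # /value
print(match.groupdict())
# {'full_path': '/path/value', 'subdir': '/value'}
```

The groups still keep their numbers too, so match.group(2) returns the same thing as match.group("subdir").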

Anchors

So far, the metacharacters that we’ve seen were always used in one specific way. We always used them to try to match a single character literal or a group of character literals. This won’t be the case with this last set of metacharacters.

We call these metacharacters “anchors”. Unlike the other metacharacters that we’ve seen, they don’t represent character literals. Instead, we use them to match a specific position inside a string.

The two anchors that you’ll see used the most often are the caret (^) and dollar sign ($). The caret represents the starting position of a string. Meanwhile, the dollar sign represents the ending position of a string.

So, let’s imagine that we have this regular expression: /^\/path$/. This regular expression would only match the /path string. That’s because we used both the starting and ending position anchors.

However, if you want to just match a string that starts with /path, you would just use the /^\/path/ regular expression. This is the same regular expression without the ending position anchor.

This is a much more common use of anchors. You tend to use them to ensure your string either starts or ends in a specific way. It’s not as common (or useful) to use both anchors in the same regular expression.
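The three ways of using anchors can be compared with a short sketch in Python’s re module. The list of paths is made up for the example.

```python
import re

paths = ["/path", "/path/value", "/other/path"]

# Both anchors: the whole string has to be /path.
exact = [p for p in paths if re.search(r"^/path$", p)]
print(exact)   # ['/path']

# Starting anchor only: the string has to start with /path.
prefix = [p for p in paths if re.search(r"^/path", p)]
print(prefix)  # ['/path', '/path/value']

# Ending anchor only: the string has to end with /path.
suffix = [p for p in paths if re.search(r"/path$", p)]
print(suffix)  # ['/path', '/other/path']
```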

Modifiers

So let’s say that you have a list of URL paths like this:

What would happen if you used the /^\/path/ regular expression that we just saw on it? Well, you’d expect the regular expression processor to match both the first and last URL paths in that list. As you might guess, that’s not what will happen.

It’ll only match the first URL path and not the second one. That’s because, by default, a regular expression processor treats a text block as a single string. That means that the caret anchor only matches the starting position at the beginning of the text block.

This is where modifiers come in. Modifiers are flags that let you change how a regular expression processor will process your regular expression. In general, you tend to add them after the closing delimiter of your regular expression.

Useful modifiers

There are a lot of different modifiers. That said, for this guide, we’re just going to focus on a few useful ones. (These are the ones that you’re more likely to use anyways.)

The first useful modifier that you’ll use pretty often is i. It makes your regular expression case insensitive. For example, let’s say that you have /this/i as a regular expression. It would match any variation of “this” with lowercase and uppercase letters.

The next modifier that you should know about is s. It makes the dot match all possible characters including line breaks. This is useful when you’re using a regular expression on large blocks of texts. (You’re going to do that with regular expressions pretty often.)

It’s not unusual to have an HTML block like the one above. (HTML generators tend to do weird things when they output HTML!) If you tried to use the /<strong>.*<\/strong>/ regular expression, it wouldn’t work because of the line breaks. But the /<strong>.*<\/strong>/s regular expression would because of the s modifier. (Note: /<strong>[^<]*<\/strong>/ wouldn’t need the modifier. That’s another reason why negated character classes are super useful!)

The last modifier that’s useful to know about is m. This modifier makes the caret and dollar sign match the starting and ending position of a line instead of a string. This is the modifier that you’d need to use to fix the issue with the /^\/path/ regular expression from earlier.
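Here’s a sketch of the three modifiers in Python, where they’re passed as flags instead of being written after a delimiter: re.IGNORECASE for i, re.DOTALL for s and re.MULTILINE for m. The sample strings, including the list of paths and the HTML snippet, are made up for the example.

```python
import re

# i modifier: case-insensitive matching.
print(re.findall(r"this", "This THIS this", re.IGNORECASE))
# ['This', 'THIS', 'this']

# s modifier: the dot also matches line breaks.
html = "<strong>bold\ntext</strong>"
print(re.search(r"<strong>.*</strong>", html))  # None
print(re.search(r"<strong>.*</strong>", html, re.DOTALL) is not None)  # True
# The negated character class version works without any flag.
print(re.search(r"<strong>[^<]*</strong>", html) is not None)  # True

# m modifier: ^ and $ match at the start and end of every line.
paths = "/path/value\n/other\n/path"
print(re.findall(r"^/path", paths))                # ['/path']
print(re.findall(r"^/path", paths, re.MULTILINE))  # ['/path', '/path']
```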

Additional resources

Wow! So that pretty much wraps up this beginner’s guide to regular expressions. That was quite a lot to take in!

That said, this was still an introduction. There’s still a lot more to regular expressions that we didn’t cover in this article. If you want to learn more about regular expressions, there’s plenty of resources out there.

One of the better ones out there is regular-expressions.info. It has a handy tutorial that covers a lot of the concepts seen in this article in more detail. It also covers more advanced concepts such as lookaround and conditionals.

But you don’t need any of those advanced concepts to get started with regular expressions! This guide has more than enough for you to begin using them more often. The key is to practice. (And to use an IDE!) You’ll get better at it with time.
