What are capturing groups and backreferences, and how do they work?
Let’s learn about capturing groups and backreferences in regular expressions.
A capturing group allows you to “capture” a portion of the matched string to use however you might need. Capturing groups are defined by parentheses containing the pattern to capture, with no leading characters like a lookahead.
Let’s capture the code
from our freeCodeCamp
regular expression. To do that, we’ll enclose code
in parentheses and define it as a capture group:
const regex = /free(code)camp/i;
To confirm the behavior, we can test it against a freecodecamp
string:
const regex = /free(code)camp/i;
console.log(regex.test("freecodecamp")); // true
But this doesn’t actually make use of our captured group. Instead, let’s take a look at the result of using match
:
const regex = /free(code)camp/i;
console.log("freecodecamp".match(regex));
// [
// 'freecodecamp',
// 'code', <--
// index: 0,
// input: 'freecodecamp',
// groups: undefined
// ]
Here we can see that our match
array has a second element, which is the portion of the string which was captured by our capture group.
Notice how the capture group matches the exact pattern code
, where a character class would match a single character from the list c
, o
, d
, and e
.
But how can we actually use this? Well, capture groups are often used when replacing contents of a string. Let’s set up some code to do that. We’re going to turn freecodecamp
into paidcodeworld
:
const regex = /free(code)camp/i;
console.log("freecodecamp".replace(regex, "paidcodeworld"));
This works on its own, but what if we didn’t know how many o
’s were in code
? If we need a quantifier for one or more o
s:
const regex = /free(co+de)camp/i;
console.log("freecoooooooodecamp".replace(regex, "paidcodeworld"));
We’re getting paidcodeworld
as our result. We want to preserve the number of o
’s, so we need to reuse what was captured by the regular expression.
This is where a backreference comes in. Instead of hardcoding the code
portion of our replacement string, we can reference the captured group directly.
In a replace
call, you achieve a backreference by using a dollar sign ($
) followed by the number of the capture group to use. In our case, that would be $1
, since code
is captured in the first capture group:
const regex = /free(co+de)camp/i;
console.log("freecoooooooodecamp".replace(regex, "paid$1world")); // paidcooooooooworld
We have now successfully preserved an unknown number of o
characters when converting freecodecamp
into paidcodeworld
. But backreferences aren’t just limited to the replace call. You can actually use them directly in a regular expression.
This would allow you to match a previously captured pattern later on in the regular expression.
Let’s say we want to match freecodecamp
twice, with the same number of o
’s, but anywhere in the string.
First, we need to separate them with our wildcard character, and allow any number of characters to match that wildcard:
const regex = /free(co+de)camp.*free(co+de)camp/i;
This current expression won’t ensure that the number of o
characters is the same, however. To achieve that, we need to replace the second capture group with a reference to the first.
Inside a regular expression, a backreference is denoted with a backslash followed by the number of the capture group:
const regex = /free(co+de)camp.*free\1camp/i;
console.log(regex.test("freecooooodecamp is great i love freecooooodecamp")); // true
console.log(regex.test("freecooooodecamp is great i love freecodecamp")); // false
And with that, we can see that a string with the correct number of o
s matches, while a string with two different numbers of o
s does not.
This syntax is great, but can quickly get confusing when you are referencing multiple capture groups. Thankfully, instead of using numbers, you can give your groups names.
To define a named capture group, you add a question mark (?
) followed by the name enclosed in less than and greater than signs to the beginning of the group. Let’s name our capture group code
:
const regex = /free(?<code>co+de)camp.*free\1camp/i;
Now we can update our backreference in the regular expression to refer to this group. A named backreference starts with a backslash followed by the letter k
in JavaScript. Then you add the name, again enclosed in less than (<
) and greater than (>
) signs. Let’s take a look at that:
const regex = /free(?<code>co+de)camp.*free\k<code>camp/i;
Now if we check our test()
call, we can see that we still pass:
const regex = /free(?<code>co+de)camp.*free\k<code>camp/i;
console.log(regex.test("freecooooodecamp is freecooooodecamp")); // true
To use our named capture group in a replace()
call, we’d insert a dollar sign into the string, followed by the name enclosed in less than and greater than signs:
const regex = /free(?<code>co+de)camp/i;
console.log("freecooooodecamp".replace(regex, "paid$<code>camp")); // paidcooooodecamp
Finally, sometimes you want to create a group of characters, but don’t need the captured result.
Let’s say we want to match either freecodecamp
or freecandycamp
. You could create two patterns separated by an OR assertion:
const regex = /freecodecamp|freecandycamp/i;
But this can become quite lengthy for larger-scale regular expressions. Instead, you can create a non-capturing group around the characters that you need to OR:
const regex = /free(?:code|candy)camp/i;
A non-capturing group does not store the code|candy
match separately in memory. But it can be helpful for creating alternate patterns without sacrificing readability or performance.