Regular Expressions 101
Learn the basics of regular expressions, when to use them and how to use them.
- regex
- regular expressions
Introduction
Regular expressions seem complicated, confusing, even ciphered. They easily become hard to read and can lead to performance issues. Their reputation precedes them as a quirky mix between chaos and maths.
I ask you to put your prejudices aside and let me introduce you to the friendly and powerful regex I know. Let's start with the basics.
What is a regex?
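A regex is a pattern, written as a string, that describes the text you want to match. In several languages, such as JavaScript, Perl, and Ruby, you surround the pattern with forward slashes so it is recognized and treated as a regex. /foo/, /[0-9]+\.[0-9]+/ and /(0[1-9]|1[0-2])\/(0[1-9]|[12][0-9]|3[01])\/\d{4}/ are valid regular expressions. The first matches the sequence of characters 'foo', the second matches decimal numbers, and the third matches dates in the mm/dd/yyyy format. Note that the forward slashes inside the third pattern are escaped with a backslash so they are not mistaken for the closing delimiter.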
When to use a regex?
Regular expressions are used when you need to find, replace, split or validate strings. These tasks are so common that most major programming languages ship with a regex engine, either built into the language or as part of the standard library.
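To make the four tasks concrete, here is a minimal sketch in JavaScript. The choice of language is an assumption (the article does not name one), and the sample strings are hypothetical. It uses a few pieces of syntax (\d, +, ^, $) that are explained in the sections that follow.

```javascript
// Hypothetical sample text, not taken from the article.
const text = "Order 66 shipped on 2021-12-27, order 67 is pending";

// Find: collect every run of digits (the g flag asks for all matches).
const numbers = text.match(/\d+/g);

// Replace: mask every single digit with a # character.
const masked = text.replace(/\d/g, "#");

// Split: break a string on commas, ignoring whitespace around them.
const parts = "a , b,c".split(/\s*,\s*/);

// Validate: check that the whole string is exactly four digits.
const isYear = /^\d{4}$/.test("2021");

console.log(numbers); // [ '66', '2021', '12', '27', '67' ]
console.log(masked);  // Order ## shipped on ####-##-##, order ## is pending
console.log(parts);   // [ 'a', 'b', 'c' ]
console.log(isYear);  // true
```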
How to use a regex?
The capabilities of regular expressions are vast and we only go through the basics in this article. You can find more sophisticated use-cases and examples {{here}}.
Most characters, like letters and numbers, match themselves. Creating a regex composed of only alphanumerical characters will be no different than looking for a substring - a function present in the standard library of modern languages. In order to unlock the full potential of regular expressions, we add a little bit of special syntax in the form of quantifiers, tokens, anchors, and groups.
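As a quick illustration of that point, the sketch below (JavaScript, with a made-up sentence) compares a regex made only of letters with the plain substring functions from the standard library. Both approaches should give the same answers.

```javascript
// Hypothetical sample sentence.
const sentence = "The quick brown fox";

// A regex made of plain letters behaves like a substring search.
const viaRegex = /quick/.test(sentence);
const viaSubstring = sentence.includes("quick");

// The position of the first match is the same as well.
const regexIndex = sentence.search(/quick/);
const substringIndex = sentence.indexOf("quick");

console.log(viaRegex, viaSubstring);     // true true
console.log(regexIndex, substringIndex); // 4 4
```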
Quantifiers
Quantifiers are symbols that indicate how many times the preceding character (or group) should be matched.
Examples: ?, *, +, ...
Cheatsheet
- a? - matches zero or one of a
- a* - matches zero or more of a
- a+ - matches one or more of a
- a{1} - matches exactly one of a
- a{1,3} - matches one, two, or three of a
- a{1,} - matches at least one of a
Examples
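Let's take a test sentence: Son, it's too soon for you to drink alcohol, with the goal of matching both son and soon.
We can write any of the following regular expressions to do the trick:
- /so{1,2}n/ - there are either 1 or 2 "o"s
- /soo?n/ - the second "o" is optional
- /soo{0,1}n/ - the second "o" occurs zero or one time, which is the long form of ?
Matching is case-sensitive by default, so the capitalized Son at the start of the sentence is only matched when you add the case-insensitive flag, as in /so{1,2}n/i.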
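The three variants can be checked side by side. This is a small JavaScript sketch (the language is an assumption); the g flag collects all matches and the i flag ignores case, which is needed here because the sentence starts with a capital S.

```javascript
const sentence = "Son, it's too soon for you to drink alcohol";

// Three equivalent ways to accept one or two "o"s.
const patterns = [/so{1,2}n/gi, /soo?n/gi, /soo{0,1}n/gi];

// Run each pattern against the sentence and collect the matched text.
const results = patterns.map((pattern) => sentence.match(pattern));

for (const found of results) {
  console.log(found); // [ 'Son', 'soon' ]
}

// Without the i flag only the lowercase word is found.
const caseSensitive = sentence.match(/so{1,2}n/g);
console.log(caseSensitive); // [ 'soon' ]
```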
Tokens
Tokens generally start with a backslash \ and represent a single character or a class of characters. They also provide a way of matching the special characters that are part of the regex syntax.
Examples: \?, \+, \s, \d, \D, \n, \t, ...
Cheatsheet
- \? - matches the character ?
- \+ - matches the character +
- \s - matches any whitespace character (space, tab, newline, ...)
- \d - matches any digit
- \D - matches any non-digit; the opposite of \d
- \n - matches the newline character
- \t - matches the tab character
- . - matches any character except, by default, a newline
- [abc] - matches a single character: a, b or c
- [a-z] - matches a single character in the range from a to z
- [0-9] - matches any digit
Examples
- /\?{3}/ - matches 3 consecutive question marks
- /\s+/ - matches one or more whitespace characters
  ◦ useful for splitting user input, like names
- /\d{4}/ - matches 4 consecutive digits
  ◦ think of years, ZIP codes
- /fi.e/ - matches fire, file, five, because the 3rd character can be anything
- /[A-Z][a-z]+/ - matches words that begin with an uppercase letter and have at least 2 letters
- /[0-5][0-9]/ - matches all numbers that represent seconds
  ◦ imagine a digital clock that goes from 00 to 59
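The token examples above can be tried out as follows. This is a JavaScript sketch with hypothetical input strings, meant as an illustration rather than a complete test of each pattern.

```javascript
// Three literal question marks in a row.
const hasThreeQuestionMarks = /\?{3}/.test("What???");

// Split a name on any run of whitespace (spaces and a tab here).
const nameParts = "Ada   Lovelace\tByron".split(/\s+/);

// Pull the first run of four digits out of a sentence.
const year = "Released in 1999".match(/\d{4}/)[0];

// The dot stands for any single character.
const fiWords = "fire file five film".match(/fi.e/g);

// An uppercase letter followed by one or more lowercase letters.
const capitalized = "Alice met Bob in Paris".match(/[A-Z][a-z]+/g);

// Two digits in the range 00 to 59.
const validSeconds = /[0-5][0-9]/.test("59");
const invalidSeconds = /[0-5][0-9]/.test("75");

console.log(hasThreeQuestionMarks); // true
console.log(nameParts);             // [ 'Ada', 'Lovelace', 'Byron' ]
console.log(year);                  // 1999
console.log(fiWords);               // [ 'fire', 'file', 'five' ]
console.log(capitalized);           // [ 'Alice', 'Bob', 'Paris' ]
console.log(validSeconds, invalidSeconds); // true false
```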
Anchors
Anchors match positions rather than characters: the start and the end of a string, and the boundaries of words.
Examples: ^, $, \b, \B
Cheatsheet
- ^ - start of string
- $ - end of string
- \b - word boundary; a position between a word character (a letter, digit or underscore) and a non-word character, or the start or end of the string next to a word character
- \B - non-word boundary; a position between two word characters or between two non-word characters
Examples
- /^Regex/ - matches strings that begin with Regex
- /regex\.$/ - matches strings that end with regex. (including the final dot)
- /^Some\b/ - matches strings that begin with the word Some
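A short JavaScript sketch of the anchor examples, using made-up strings. The end-of-string pattern is written with $ and an escaped dot, /regex\.$/, so that it requires a literal "regex." at the very end.

```javascript
// ^ pins the match to the start of the string.
const startsWithRegex = /^Regex/.test("Regex is fun");
const regexLater = /^Regex/.test("I like Regex");

// $ pins the match to the end of the string; \. is a literal dot.
const endsWithRegexDot = /regex\.$/.test("I like regex.");
const regexDotEarlier = /regex\.$/.test("regex. is fun");

// \b requires the word to end right after "Some".
const someAsWord = /^Some\b/.test("Some people");
const someAsPrefix = /^Some\b/.test("Sometimes");

console.log(startsWithRegex, regexLater);       // true false
console.log(endsWithRegexDot, regexDotEarlier); // true false
console.log(someAsWord, someAsPrefix);          // true false
```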
Groups
Groups, or capturing groups, are a way to treat multiple characters as a single unit. This way you can isolate regex logic inside a group, or name a group so you can inspect what text it has matched.
Cheatsheet
- ({regex}) - a group is indicated with parentheses around your custom {regex}
- (?<{name}>{regex}) - you name a group by writing a question mark right after the opening parenthesis and surrounding the {name} with angle brackets (the less-than and greater-than signs)
Examples
- /(\d+)/ - captures a number, that is, a run of one or more digits
- /(?<long_spaces>\s{2,})/ - captures an occurrence of two or more consecutive whitespace characters
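Here is how the captured text can be read back in JavaScript (an assumed language; other engines expose groups through different APIs). Numbered groups are available by index on the match result, named groups under its groups property. The input strings are hypothetical.

```javascript
// A numbered group: index 0 is the whole match, index 1 the first group.
const numberMatch = "Room 404 is empty".match(/(\d+)/);
const roomNumber = numberMatch[1];

// With the g flag, matchAll yields one match object per occurrence.
const allNumbers = [..."Rooms 7, 42 and 404".matchAll(/(\d+)/g)].map(
  (match) => match[1]
);

// A named group is read by its name from match.groups.
const spaceMatch = "too   many spaces".match(/(?<long_spaces>\s{2,})/);
const longSpaces = spaceMatch.groups.long_spaces;

console.log(roomNumber);        // 404
console.log(allNumbers);        // [ '7', '42', '404' ]
console.log(longSpaces.length); // 3
```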
Extras
There are logical operators in regex too! You can negate and use or, just like in your favorite programming language.
Cheatsheet
- ^ - negation, when it is the first character inside square brackets
- | - or
Examples
- /[^a-z]/ - matches any character which is not a lowercase letter from the English alphabet
- /al|eel/ - matches either the 2 characters al, or the word eel
- /(al|ee)l/ - matches either the word all, or the word eel
  ◦ note that using a group isolates the or logic inside of it
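The effect of grouping on the or operator is easiest to see by running both forms on the same input. The JavaScript sketch below uses the patterns /al|eel/ and /(al|ee)l/ together with the words all and eel; the inputs are chosen for illustration.

```javascript
// Negated set: every character that is not a lowercase letter.
const nonLowercase = "abC1".match(/[^a-z]/g);

// Without a group, | splits the whole pattern: "al" OR "eel".
const ungroupedOnAll = "all".match(/al|eel/)[0];
const ungroupedOnEel = "eel".match(/al|eel/)[0];

// With a group, | only applies inside it: ("al" OR "ee") followed by "l".
const groupedOnAll = "all".match(/(al|ee)l/)[0];
const groupedOnEel = "eel".match(/(al|ee)l/)[0];

console.log(nonLowercase);                   // [ 'C', '1' ]
console.log(ungroupedOnAll, ungroupedOnEel); // al eel
console.log(groupedOnAll, groupedOnEel);     // all eel
```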
Bringing the pieces together
Regular expressions need some time to sink in. Piling up more syntax will only confuse you so let us practice instead.
Email validation
Basic
/[^@]+@[^.]+\.[a-z]{2,}/
This should be one of the simplest regular expressions that validate emails. It can be broken down into 5 parts:
- [^@]+ - matches one or more characters that are not the character @
- @ - matches the character @
- [^.]+ - matches one or more characters that are not the character .
  ◦ inside square brackets the dot is literal; escaping it there, as in [^\.], is allowed but not required
- \. - matches the character .
  ◦ here we escape the dot, as an unescaped dot means any character!
- [a-z]{2,} - matches two or more lowercase letters
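A minimal JavaScript sketch of the basic email check, with hypothetical addresses. Note that the pattern has no ^ or $ anchors, so it also reports a match when a valid-looking address appears inside a longer string.

```javascript
const emailRegex = /[^@]+@[^.]+\.[a-z]{2,}/;

// A well-formed address passes.
const plainAddress = emailRegex.test("john.doe@test.com");

// No @ at all: the second part of the pattern can never match.
const missingAt = emailRegex.test("john.doe.test.com");

// No dot after the @: the \. part can never match.
const missingDot = emailRegex.test("john@test");

// Unanchored: an address embedded in a sentence still counts as a match.
const embedded = emailRegex.test("write to john@test.com today");

console.log(plainAddress, missingAt, missingDot, embedded); // true false false true
```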
With groups
/(?<username>[^@]+)@(?<mail_server>[^\.]+)\.(?<domain>[a-z]{2,})/
Adding groups allows us to match a test string and see what each group captures.
john.doe@test.com
Groups:
- username: john.doe
- mail_server: test
- domain: com
emil.kirilov@lexis.solutions
Groups:
- username: emil.kirilov
- mail_server: lexis
- domain: solutions
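The captures listed above can be reproduced with a small helper. This is a JavaScript sketch; parseEmail is a hypothetical function name introduced only for this example.

```javascript
const emailWithGroups =
  /(?<username>[^@]+)@(?<mail_server>[^\.]+)\.(?<domain>[a-z]{2,})/;

// parseEmail is a hypothetical helper: it returns the named groups
// as a plain object, or null when the address does not match.
function parseEmail(address) {
  const match = address.match(emailWithGroups);
  return match ? { ...match.groups } : null;
}

const first = parseEmail("john.doe@test.com");
const second = parseEmail("emil.kirilov@lexis.solutions");
const noMatch = parseEmail("not an email");

console.log(first);   // { username: 'john.doe', mail_server: 'test', domain: 'com' }
console.log(second);  // { username: 'emil.kirilov', mail_server: 'lexis', domain: 'solutions' }
console.log(noMatch); // null
```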
Date validation
We will use the dd.mm.yyyy format in this example.
Naive with groups
/(?<day>\d{2})\.(?<month>\d{2})\.(?<year>\d{4})/
27.12.2021
Groups:
- day: 27
- month: 12
- year: 2021
33.13.2021
Groups:
- day: 33
- month: 13
- year: 2021
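Running the naive pattern in JavaScript (an assumed language) shows both captures, including the nonsensical second date, which is accepted without complaint.

```javascript
const naiveDate = /(?<day>\d{2})\.(?<month>\d{2})\.(?<year>\d{4})/;

// A real date: each group holds the expected part.
const realDate = "27.12.2021".match(naiveDate).groups;

// An impossible date: the pattern only checks the number of digits.
const impossibleDate = "33.13.2021".match(naiveDate).groups;

console.log(realDate.day, realDate.month, realDate.year);                   // 27 12 2021
console.log(impossibleDate.day, impossibleDate.month, impossibleDate.year); // 33 13 2021
```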
This regex is naive because it matches too optimistically. We could provide laughable input and it would find it acceptable. Let's fix it!
Perfect enough with groups
/(?<day>0[1-9]|[12][0-9]|30|31)\.(?<month>0[1-9]|1[0-2])\.(?<year>[12][0-9]{3})/
Breakdown
Group day:
- the minimum day is 01, so days starting with 0 can't end with 0
- all 10s and 20s are OK
- 30 and 31 are both possible
Group month:
- same logic as with day: 01 to 09 are OK, and so are 10, 11 and 12
Group year:
- only accepts 1xxx and 2xxx years
Examples
27.12.2021
Groups:
- day: 27
- month: 12
- year: 2021
33.13.2021 - no match
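The stricter pattern can be checked the same way. In this JavaScript sketch the month group is written as 0[1-9]|1[0-2], which covers 01 to 12; the extra test inputs are hypothetical.

```javascript
const dateRegex =
  /(?<day>0[1-9]|[12][0-9]|30|31)\.(?<month>0[1-9]|1[0-2])\.(?<year>[12][0-9]{3})/;

// A real date matches and exposes its parts through the named groups.
const valid = "27.12.2021".match(dateRegex);

// Day 33 and month 13 are rejected: match returns null.
const invalid = "33.13.2021".match(dateRegex);

// Edge cases: October is a valid month, day 00 is not a valid day.
const october = dateRegex.test("05.10.2021");
const dayZero = dateRegex.test("00.12.2021");

console.log(valid.groups.day, valid.groups.month, valid.groups.year); // 27 12 2021
console.log(invalid);          // null
console.log(october, dayZero); // true false
```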
Considerations
I labeled this regex as 'perfect enough' because not all months have 31 days, or 30 days, or 29 days for that matter. We could, of course, write a monster regex and account for each month's possible day count, but we won't. It would become unreadable and still fall short on leap years.
Be careful not to misuse regular expressions. Our date regex will do an excellent job if you need to extract dates from a long text. Perhaps your task is to find the min/max date? Do that with a library or a built-in date parser. Don't compare the year, month, and day named groups of each match by hand.
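One way to split the work, sketched in JavaScript under the assumption that the built-in Date object is good enough for the task: the regex extracts candidate dates from text, and Date handles comparison and calendar rules such as leap years. The sample text and the isRealDate helper are hypothetical.

```javascript
const dateRegex =
  /(?<day>0[1-9]|[12][0-9]|30|31)\.(?<month>0[1-9]|1[0-2])\.(?<year>[12][0-9]{3})/g;

// Hypothetical sample text.
const text = "Invoices were sent on 27.12.2021, 03.01.2021 and 15.06.2021.";

// The regex finds the dates; Date objects take over from there.
// Date months are zero-based, hence the "- 1".
const dates = [...text.matchAll(dateRegex)].map(
  ({ groups }) =>
    new Date(Number(groups.year), Number(groups.month) - 1, Number(groups.day))
);

// Compare timestamps instead of comparing the groups by hand.
const timestamps = dates.map((date) => date.getTime());
const earliest = new Date(Math.min(...timestamps));
const latest = new Date(Math.max(...timestamps));

// isRealDate is a hypothetical helper: Date silently rolls over
// impossible days (31 February becomes early March), so a round
// trip through Date reveals dates the regex wrongly accepts.
function isRealDate(day, month, year) {
  const date = new Date(year, month - 1, day);
  return (
    date.getFullYear() === year &&
    date.getMonth() === month - 1 &&
    date.getDate() === day
  );
}

console.log(earliest.getDate(), earliest.getMonth() + 1); // 3 1
console.log(latest.getDate(), latest.getMonth() + 1);     // 27 12
console.log(isRealDate(31, 2, 2021)); // false
console.log(isRealDate(29, 2, 2020)); // true
```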
End thoughts
Regular expressions are indispensable. They are a specific tool that you won't need every day but can save you days when you correctly recognize a use case.
I hope that by the end of this article you have a basic understanding of regular expressions and an appreciation for their power, and that you will revisit it whenever you need a cheat sheet for building them.