Photo by Agence Olloweb

Primer on Regular Expressions

In this post, I will give you a practical overview of Regular Expressions: what they are, what they can be used for, and a quick intro to how you can use them.

What are Regular Expressions even?

Regular Expressions (Regexes for short) are Strings that work as a DSL (domain-specific language) for common tasks within other Strings. A DSL can also be described as “a programming language within a programming language”.

What can I do with the “Regular Expression DSL”?

/.*@.*\.com/.match?('') # => false
/.*@.*\.com/.match?('') # => true
/(.*)@(.*)\.com/.match('').captures # => ["queenie.goldstein", "ilvermorny"]
''.gsub(/(.*)@(.*)\.com/, '\\1@\\') # => ""
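The three typical tasks are validating, extracting and replacing. Here is a minimal, runnable sketch of all three. The address `jane.doe@example.com`, the constant names and the replacement text are made up for illustration and are not taken from the examples above.

```ruby
# Hypothetical sample data, chosen only for this sketch.
EMAIL = 'jane.doe@example.com'
EMAIL_PATTERN = /(.*)@(.*)\.com/

# 1. Validate: does the String match the pattern at all?
p EMAIL_PATTERN.match?('not an email')    # => false
p EMAIL_PATTERN.match?(EMAIL)             # => true

# 2. Extract: read the parts captured by the parentheses.
p EMAIL_PATTERN.match(EMAIL).captures     # => ["jane.doe", "example"]

# 3. Replace: reuse the captured parts (\1, \2) in a new String.
p EMAIL.gsub(EMAIL_PATTERN, '\1 at \2')   # => "jane.doe at example"
```

Note the single quotes around the replacement: inside them, `\1` and `\2` reach `gsub` unchanged and are filled in with the first and second group.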

What does the “Regular Expressions DSL” look like?

There are plenty of useful “cheat sheets” for this with great examples:

Character/Group Modifiers (e.g. *, +, {,}, ?)

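The default building block of Regexes is the character. After each character, you can write a modifier that tells how many times the preceding character is matched. The following modifiers are available:

  • * reads "zero or more times"
  • + reads "one or more times"
  • ? reads "zero or one time"
  • {n,m} reads "at least n and at most m times"; either bound can be left out, as in {n,} or {,m}, and {n} reads "exactly n times"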
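To make the modifiers from the heading (*, +, ? and the {,} form) more tangible, here is a small sketch. The sample words are made up. It uses `String#[]` with a Regex, which returns the first matching substring or `nil`.

```ruby
# * allows zero or more of the preceding character.
p 'ct'[/ca*t/]            # => "ct"
p 'caaat'[/ca*t/]         # => "caaat"

# + needs at least one of the preceding character.
p 'ct'[/ca+t/]            # => nil
p 'caat'[/ca+t/]          # => "caat"

# ? makes the preceding character optional.
p 'color'[/colou?r/]      # => "color"
p 'colour'[/colou?r/]     # => "colour"

# {2,3} asks for at least two and at most three repetitions.
p 'caaat'[/ca{2,3}t/]     # => "caaat"
p 'caaaaat'[/ca{2,3}t/]   # => nil (five times "a" is too many)
```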

Custom Sets (created with [ and ])

You can define custom sets of characters by listing them without any separator within brackets, e.g. for a set of the characters a, b, c and numbers 1, 2, 3 we would write [abc123]. This is then considered as "one character of this set", so matching several of them in a row needs a character modifier, as in [abc123]* or [abc123]{2,5}.
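A short sketch of the [abc123] set in action, with made-up sample strings. The range form ([a-c1-3]) and the negated form ([^...]) are not explained above; they are standard set syntax and are included here only for comparison.

```ruby
# A set on its own matches exactly one character out of the set.
p 'xyz2'[/[abc123]/]               # => "2"

# With a modifier, the set matches a run of such characters.
p 'zzcab31zz'[/[abc123]{2,5}/]     # => "cab31"

# The same set written with ranges.
p 'zzcab31zz'[/[a-c1-3]{2,5}/]     # => "cab31"

# A leading ^ inside the brackets negates the set.
p 'cab31zz'[/[^abc123]+/]          # => "zz"
```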

Predefined Sets (\s, \S, \d, \D, \w, \W)

The following sets (simplified) are already pre-defined and can be used directly:

  • \s effectively same as [ \t\n], reads "any whitespace character"
  • \S effectively same as [^ \t\n], reads "any non-whitespace character"
  • \d effectively same as [0-9], reads "any digit"
  • \D effectively same as [^0-9], reads "any non-digit"
  • \w similar to [a-zA-Z_0-9] (some regex engines also include Umlauts etc.; Ruby's \w is ASCII-only), reads "any word character"
  • \W similar to [^a-zA-Z_0-9], reads "any non-word character"
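Here is how the predefined sets behave on a made-up sample line (the constant name and its content are only for illustration):

```ruby
# Hypothetical sample line containing spaces, a tab and digits.
LINE = "Order 66 shipped\tto Naboo"

p LINE[/\d+/]          # => "66"      first run of digits
p LINE[/\D+/]          # => "Order "  first run of non-digits
p LINE[/\w+/]          # => "Order"   first run of word characters
p LINE[/\W/]           # => " "       first non-word character

# \S+ collects every chunk without whitespace, \s+ works as a separator.
p LINE.scan(/\S+/)     # => ["Order", "66", "shipped", "to", "Naboo"]
p LINE.split(/\s+/)    # => ["Order", "66", "shipped", "to", "Naboo"]
```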

Groups (e.g. ( and ), (?<name> and ))

Groups can be thought of as “words” or “sentences”. They change the building block that a modifier applies to from a single character to a sequence of characters, the “group”. For example, writing abc* reads "one time a, one time b and any number of times c". If you want to write "any number of times abc", you do this: (abc)*. The abc is then considered one group and the regex would match the whole string abcabcabc.
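The difference between abc* and (abc)* can be checked directly. The second half of this sketch shows the named form (?<name> ... ) from the heading; the pattern, the group names and the sample text are assumptions made up for illustration.

```ruby
# Without a group, * only applies to the "c".
p 'abcabcabc'[/abc*/]      # => "abc"
p 'abccc'[/abc*/]          # => "abccc"

# With a group, * applies to the whole "abc".
p 'abcabcabc'[/(abc)*/]    # => "abcabcabc"

# Named groups can be read by name instead of by position.
DATE_PATTERN = /(?<year>\d{4})-(?<month>\d{2})/
DATE_MATCH = DATE_PATTERN.match('released 2020-06')
p DATE_MATCH[:year]        # => "2020"
p DATE_MATCH[:month]       # => "06"
```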

Match Modifiers (e.g. \A, \z, ^, $, Lookaheads and Lookbehinds)

By default, a Regex like abc matches the way a contains method would: anywhere inside the string. But you can also specify that abc needs to be at the beginning or end of a given string or of a line. For example, the ^ in ^abc makes sure abc only matches at the beginning of a line. This will match def\nabc but not defabc. The $ in abc$ makes sure there's a line end after abc. Use \A and \z instead to anchor to the beginning and end of the entire String, even if it spans multiple lines.
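A quick sketch of the anchors in Ruby, using the def/abc strings from above. The last two lines illustrate the lookahead and lookbehind named in the heading; those patterns and the price strings are made-up examples, not part of the explanation above.

```ruby
# ^ and $ work per line.
p "def\nabc".match?(/^abc/)     # => true  (abc starts the second line)
p "defabc".match?(/^abc/)       # => false
p "abc\ndef".match?(/abc$/)     # => true  (abc ends the first line)

# \A and \z work on the String as a whole.
p "def\nabc".match?(/\Aabc/)    # => false
p "abc\ndef".match?(/abc\z/)    # => false
p "abc".match?(/\Aabc\z/)       # => true

# Lookahead (?=...) and lookbehind (?<=...) check the surroundings
# without making them part of the match.
p 'price: 42 EUR'[/\d+(?= EUR)/]    # => "42"
p 'price: $42'[/(?<=\$)\d+/]        # => "42"
```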

Common Quirks and validating new Regexes

One common quirk is that . does not match the newline character by default in most languages. This can typically be turned on with an option; in Ruby, you append /m to the Regex, which stands for "make dot match newlines".
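The effect of the /m option is easy to see on a two-line String (the sample text and constant name are made up):

```ruby
# Hypothetical two-line sample text.
TEXT = "first line\nsecond line"

# Without /m, the dot stops at the newline, so the pattern cannot span both lines.
p TEXT[/first.*second/]     # => nil

# With /m, the dot also matches the newline.
p TEXT[/first.*second/m]    # => "first line\nsecond"
```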

  1. A Regex cheat sheet visible right on the screen to look up things.
  2. A live matcher for the regex you write among the given sample String.

How can I use Regexes in my projects today?

There’s no need to wait for a good opportunity to use Regexes. You can simply lint your projects using Regular Expressions (including Auto-Correction support) via AnyLint.
