Revisiting Regular Expression

How much do we know about regular expressions ?

There’s always more beyond what we know.

Format

/RegExpr/Flags

Flags - gmiuy

Global - g

When not specified, return only the first match, once found.

When turned on, continue and return subsequent matches as well.
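A quick sketch of the difference, using String.prototype.match. The sample string and variable names are made up for illustration.

```javascript
const text = "cat bat rat";

// Without g: match() stops at the first match.
const first = text.match(/[a-z]at/);

// With g: match() returns every match as a plain array of strings.
const all = text.match(/[a-z]at/g);

console.log(first[0]); // "cat"
console.log(all);      // ["cat", "bat", "rat"]
```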

Multiline - m

This flag only affects ^ and $, which by default match the start and end position of the entire string.

When m is turned on, ^ and $ also match the start and end position of each line.
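As a small illustration (the three-line sample string is an assumption), here is the same pattern with and without m:

```javascript
const lines = "one\ntwo\nthree";

// Without m, ^ only matches at the very start of the string.
const withoutM = lines.match(/^\w+/g);

// With m, ^ also matches right after each line break.
const withM = lines.match(/^\w+/gm);

console.log(withoutM); // ["one"]
console.log(withM);    // ["one", "two", "three"]
```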

Unicode - u

Required for the code point escape \u{FFFFF}.

Case Insensitive - i

Handy, isn’t it ?

Character Classes

Match a digit / character / whitespace / their opposite / Anything

To match	Regex
Digit or not digit	\d \D
ASCII word character (letter, digit or _) or not	\w \W
Whitespace or not	\s \S

Word Character

\w matches only ASCII letters, digits and the underscore.

That is, the alphanumeric characters plus _.

For example, letters with an umlaut such as ü will not be matched.

# More precisely, \w is equivalent to this
[a-zA-Z0-9_]
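To see the umlaut case in action, a minimal sketch. The sample word is made up, and ü is written as \u00FC so the example does not depend on file encoding.

```javascript
const word = "\u00FCber_9"; // "über_9"

// ü is skipped; the letters, underscore and digit after it are matched.
const wordChars = word.match(/\w+/g);

console.log(wordChars);            // ["ber_9"]
console.log(/\w/.test("\u00FC")); // false
```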

Whitespace

Each single space / tab / line-break is a match.

To match a consecutive sequence of spaces, use \s+

Anything - Really ?

Well, everyone knows it.

Still, it’s worth noting that . matches anything but line terminators.

So it’s actually [^\n\r\u2028\u2029]
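A small check of that behaviour. The strings are illustrative, and [\s\S] is shown as one common way to really match anything.

```javascript
const twoLines = "a\nb";

// . skips the line break, so only a and b are returned.
const dotMatches = twoLines.match(/./g);

// . cannot bridge the line break, but [\s\S] can.
const dotBridges = /a.b/.test(twoLines);
const anyBridges = /a[\s\S]b/.test(twoLines);

console.log(dotMatches); // ["a", "b"]
console.log(dotBridges); // false
console.log(anyBridges); // true
```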

Escaped Characters

Common ones

Tab \t and vertical tab \v

I’ve never seen the vertical tab ..

Null \0

Line feed \n

Carriage return \r

Often appears in HTTP headers, where every line ends with \r\n.

Control Characters \c

These correspond to the ASCII codes 1 to 26, one per letter from A to Z.

\cA matches SOH, whose ASCII code is 1

\cZ matches SUB, whose ASCII code is 26
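A short sketch to check the mapping. The \cI line is an extra illustration: I is the 9th letter, and code 9 is the tab.

```javascript
const matchesSOH = /\cA/.test("\x01"); // code 1
const matchesSUB = /\cZ/.test("\x1A"); // code 26
const matchesTab = /\cI/.test("\t");   // code 9, same character as \t

console.log(matchesSOH, matchesSUB, matchesTab); // true true true
```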

Unicode and extension

For example, \uFFFF (always 4 hex digits)

In UTF-16, a code point takes 2 or 4 bytes

For code points above U+FFFF (the 4-byte ones), use \u{FFFFF} AND the flag /u
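A minimal sketch with a code point beyond U+FFFF. The choice of U+1F600 is just an example.

```javascript
const emoji = "\u{1F600}"; // one code point, stored as two UTF-16 code units

// With u, the braces form is read as a single code point.
const matchesEscape = /^\u{1F600}$/u.test(emoji);

// With u, . consumes the whole code point; without it, only half.
const dotWithU = /^.$/u.test(emoji);
const dotWithoutU = /^.$/.test(emoji);

console.log(emoji.length);  // 2
console.log(matchesEscape); // true
console.log(dotWithU);      // true
console.log(dotWithoutU);   // false
```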

Just Escape

Octal can go up to \377 (Character code 255)

Best written with 3 digits, for example \007, since a shorter form such as \7 can be read as a group reference

Hexadecimal \xFF
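A quick sketch of both forms, using the letter A (code 65, hex 41, octal 101) as an arbitrary example. Octal escapes are a legacy feature, so this assumes the pattern has no /u flag.

```javascript
const hexMatchesA = /\x41/.test("A");      // hex 41 is 65
const octalMatchesA = /\101/.test("A");    // octal 101 is 65
const octalMax = /\377/.test("\xFF");      // octal 377 is 255

console.log(hexMatchesA, octalMatchesA, octalMax); // true true true
```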

Reserved Characters \

Use backslash \ to escape any character that carries a non-literal meaning

.+?*
^$
[](){}
/\|
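When the text to match comes from somewhere else, the escaping can be done in code. A minimal sketch, where escapeRegExp is a hypothetical helper name (not a built-in) covering the characters listed above:

```javascript
// escapeRegExp is a hypothetical helper, not part of the language.
function escapeRegExp(text) {
  // Prefix every reserved character with a backslash.
  return text.replace(/[.*+?^${}()|[\]\\\/]/g, "\\$&");
}

const literal = "1+1=2";
const unescaped = new RegExp(literal);             // + acts as a quantifier
const escaped = new RegExp(escapeRegExp(literal)); // + is a plain character

console.log(escapeRegExp(literal));   // 1\+1=2
console.log(unescaped.test(literal)); // false
console.log(escaped.test(literal));   // true
```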

Grouping

One Of / None of

[ABC] matches a single character, A, B or C.

[^ABC] matches any character not in the set

Range of characters

For example, [a-z] or [0-9]

Match any character within the range, bounds included.

Capturing group ()

(abc)+ will match abc and abcabc

Reference a group

Use \1, \2, .. to match again whatever a group captured, e.g. in a symmetric pattern

For example, (a|b).+\1 will match axa and bab, but not axb

Group without reference (?:)

(?:a|b)+(x|y)z\1 will match axzx, ayzy, bxzx

Here the numeric reference \1 points to (x|y), because (?:a|b) is not counted
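The examples above can be checked with a few lines. The ^ and $ anchors are added here only to force a match against the whole test string.

```javascript
const symmetric = /^(a|b).+\1$/;
const skipFirstGroup = /^(?:a|b)+(x|y)z\1$/;

console.log(symmetric.test("axa")); // true
console.log(symmetric.test("bab")); // true
console.log(symmetric.test("axb")); // false: \1 must repeat the captured "a"

// Group 1 is (x|y); the (?:a|b) group gets no number.
const captured = "axzx".match(skipFirstGroup)[1];
console.log(captured);                     // "x"
console.log(skipFirstGroup.test("axzy")); // false
```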

Position Matching

Anchors

^ and $ match the beginning and end of entire string

To match the beginning and end of each line, use the ‘multiline’ flag m, as in /^abc$/m

Boundary

\b : The position between a word character and a non-word character (or the start / end of the string).

\B : Any position that is not a word boundary.

What is a word again ?

In the context of Regex, a word character is [a-zA-Z0-9_]

So a\b will match aß , but will not match a_

In contrast, a\B will match a_ , but will not match aß

Position is invisible

For example, when a\b matches aß , the result is just a, without the character after the boundary.
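A small sketch of the boundary examples. ß is written as \u00DF so the example does not depend on file encoding.

```javascript
const eszett = "a\u00DF"; // "aß"

// ß is not a word character, so there is a boundary right after a.
const boundaryMatch = eszett.match(/a\b/);

// _ is a word character, so there is no boundary after a.
const underscoreMatch = "a_".match(/a\b/);

console.log(boundaryMatch[0]);        // "a" (the position adds nothing to the result)
console.log(boundaryMatch[0].length); // 1
console.log(underscoreMatch);         // null
console.log(/a\B/.test("a_"));        // true
console.log(/a\B/.test(eszett));      // false
```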

Preceding & Following

Remember \b and \B ?

What about a specific character position ?

\d(?=ies) will match any digit that is followed by ies

\w(?!ies) will match any word character that is not followed by ies
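A minimal sketch of both look-aheads: (?=...) for "followed by" and (?!...) for "not followed by". The sample strings are made up.

```javascript
// Only the digit directly in front of "ies" is returned.
const followed = "3ies 4x".match(/\d(?=ies)/g);

// Every word character except the one directly in front of "ies".
const notFollowed = "flies".match(/\w(?!ies)/g);

console.log(followed);    // ["3"]
console.log(notFollowed); // ["f", "i", "e", "s"]
```

Note that ies itself is never part of the result: like \b, a look-ahead only checks a position.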

Quantifiers & Alternation

To match .. of preceding token	Regex
0 or 1	?
1 or more	+
Any number	*
Has to be exactly 3 occurrences	{3}
Can have 1 to 10 (inclusive) occurrences	{1,10}

How many is more ?

By default, regex takes a greedy approach and will match as many as possible

Make it lazy ?

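+? matches as few as possible: on its own it behaves like {1}
*? matches as few as possible: on its own it behaves like {0}, i.e. nothing
Both take more characters only when the rest of the pattern cannot match otherwise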
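Greedy versus lazy, as a short sketch. The tag-like sample string is only an illustration.

```javascript
const html = "<b>x</b>";

const greedy = html.match(/<.+>/)[0]; // runs to the last >
const lazy = html.match(/<.+?>/)[0];  // stops at the first >

// On their own, lazy quantifiers take the minimum.
const lazyPlus = "aaa".match(/a+?/)[0];
const lazyStar = "aaa".match(/a*?/)[0];

// They still expand when the rest of the pattern needs it.
const forced = "aaa".match(/^a+?$/)[0];

console.log(greedy);   // "<b>x</b>"
console.log(lazy);     // "<b>"
console.log(lazyPlus); // "a"
console.log(lazyStar); // ""
console.log(forced);   // "aaa"
```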

Alternation |

Usually used within a group, wrapped in parentheses.

Question

Write a regex that matches a valid IP address

More precisely, how can we match a number between 0 and 255,
which may have leading zeros ?

(25[0-5])|(2[0-4][0-9])|(0|1)?[0-9]{1,2}

Does each digit group precede a dot ?

x.x.x.y
Only the first 3 groups.
Have to copy paste the first part ..

(((25[0-5])|(2[0-4][0-9])|(0|1)?[0-9]{1,2})\.){3}
((25[0-5])|(2[0-4][0-9])|(0|1)?[0-9]{1,2})
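One way to avoid the copy paste is to build the pattern from a string. This is only a sketch: the names octet and ipv4 are made up, non-capturing groups keep the alternation contained, and ^ / $ are added so the whole input has to be an address.

```javascript
// A number from 0 to 255, leading zeros allowed (up to 3 digits).
const octet = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})";

// Three "octet + dot" groups, then a final octet without a dot.
const ipv4 = new RegExp(`^(?:${octet}\\.){3}${octet}$`);

console.log(ipv4.test("192.168.0.1"));     // true
console.log(ipv4.test("200.210.220.230")); // true
console.log(ipv4.test("010.001.000.200")); // true
console.log(ipv4.test("256.1.1.1"));       // false
console.log(ipv4.test("1.2.3"));           // false
```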

Can a regex match any palindrome ?

No. Regex matching, in its classic form, relies on a finite state automaton (FSA).
Each character read from the input string triggers a state transition.
- Self-looping transitions are allowed.
A palindrome of arbitrary length requires an arbitrary number of states.
An FSA must have a finite number of states.

Given a string, can we create a regex to check if it’s one ?

Yes, remember the capturing group references ?
OR, we can perform multiple match attempts until the string has 1 or 0 characters left.
- Match ^(.)(.*)\1$ and strip off the surrounding character pair after each match.
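The repeated-match idea could look roughly like this. isPalindrome is a hypothetical helper name, and [\s\S] stands in for . so that line breaks are accepted too.

```javascript
// isPalindrome is a hypothetical helper, sketched for illustration.
function isPalindrome(input) {
  let rest = input;
  // Peel off one matching outer pair per iteration.
  while (rest.length > 1) {
    const match = rest.match(/^([\s\S])([\s\S]*)\1$/);
    if (match === null) return false; // first and last character differ
    rest = match[2];                  // keep only the inner part
  }
  return true; // 0 or 1 characters left
}

console.log(isPalindrome("racecar")); // true
console.log(isPalindrome("abba"));    // true
console.log(isPalindrome("abca"));    // false
```

It compares UTF-16 code units, so characters outside the basic plane would need the u flag.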

References

https://regexr.com provides an interactive way of learning it.
