How much do we know about regular expressions ?

There’s always more beyond what we know.

 

Format

/RegExpr/Flags

 

Flags - gmiuy

Global - g

When not specified, return only the first match, once found.

When turned on, continue and return subsequent matches as well.
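A minimal sketch of the difference, assuming JavaScript (the /pattern/flags notation above is JavaScript's). The sample string and variable names are only illustrative.

```javascript
const text = "cat bat rat";

// Without g: match() returns only the first match (with its index).
const first = text.match(/[a-z]at/);
console.log(first[0]);     // "cat"
console.log(first.index);  // 0

// With g: match() returns every match as a plain array of strings.
const all = text.match(/[a-z]at/g);
console.log(all);          // ["cat", "bat", "rat"]
```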

 

Multiline - m

This flag only affects ^ and $, which by default match the start and end position of the entire string.

When m is turned on, ^ and $ will match the start and end position of each line.
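A small sketch of how m changes ^, assuming JavaScript. The three-line sample string is made up for illustration.

```javascript
const lines = "one\ntwo\nthree";

// Without m, ^ only matches at the very start of the string.
const withoutM = lines.match(/^\w+/g);  // ["one"]

// With m, ^ also matches right after each line break.
const withM = lines.match(/^\w+/gm);    // ["one", "two", "three"]

console.log(withoutM, withM);
```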

 

Unicode - u

Has to be used with code point escapes of the form \u{FFFFF}.

 

Case Insensitive - i

Handy, isn’t it ?

 

Character Classes

Match a digit / character / whitespace / their opposite / Anything

To match                                      Regex
Digit or not digit                            \d  \D
ASCII letter, digit or _ , or not             \w  \W
Whitespace or not                             \s  \S

Word Character

\w matches only ASCII letters, digits and the underscore.

  • In other words, alphanumeric characters plus _.

For example, a letter with an umlaut such as ü will not be matched.

# More precisely, \w is equivalent to this
[a-zA-Z0-9_]
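To see the ASCII-only behaviour, here is a short sketch assuming JavaScript. The sample word is hypothetical; the ü is written as an escape so the example does not depend on file encoding.

```javascript
const word = "\u00FCber_9";  // "über_9"

// \w+ skips the ü and matches only ASCII letters, digits and underscore.
console.log(word.match(/\w+/)[0]);            // "ber_9"

// The explicit character class behaves the same way.
console.log(word.match(/[a-zA-Z0-9_]+/)[0]);  // "ber_9"
```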

Whitespace

Each single space / tab / line-break is a match.

To match a consecutive sequence of spaces, use \s+
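A quick sketch of \s versus \s+, assuming JavaScript. The input string is made up and contains two spaces, two tabs and two line breaks.

```javascript
const messy = "a  b\t\tc\n\nd";

// \s: every single whitespace character is its own match.
console.log(messy.match(/\s/g).length);   // 6

// \s+: each consecutive run of whitespace is one match.
console.log(messy.match(/\s+/g).length);  // 3

// Handy for splitting on any amount of whitespace.
console.log(messy.split(/\s+/));          // ["a", "b", "c", "d"]
```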

 

Anything - Really ?

Well, everyone knows it.

Still, it’s worth noting that . matches anything but line breaks.

So it’s actually [^\r\n] (JavaScript also excludes the separators \u2028 and \u2029)
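A short sketch of the line-break exception, assuming JavaScript. The [\s\S] class is a common idiom for "really anything" and is not part of the notes above.

```javascript
const twoLines = "ab\ncd";

// . stops at the line break, so each line is matched separately.
console.log(twoLines.match(/.+/g));       // ["ab", "cd"]

// [\s\S] (whitespace or non-whitespace) also covers line breaks.
console.log(twoLines.match(/[\s\S]+/g));  // ["ab\ncd"]
```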

 

Escaped Characters

Common ones

Tab \t and vertical tab \v

  • I’ve never seen the vertical tab ..

Null \0

Line feed \n

Carriage return \r

  • Often appears in an HTTP header.

Control Characters \c

\cA to \cZ correspond to the ASCII codes 1 to 26

\cA matches SOH, whose ASCII code is 1

..

\cZ matches SUB, whose ASCII code is 26
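The mapping can be checked with a few lines, assuming JavaScript. The \cI case is an extra illustration: I is the 9th letter, and code 9 is the tab character.

```javascript
// \cA is the control character with code 1, \cZ the one with code 26.
console.log(/\cA/.test("\x01"));                   // true
console.log(/\cZ/.test(String.fromCharCode(26)));  // true

// \cI has code 9, which is the tab character.
console.log(/\cI/.test("\t"));                     // true
```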

 

Unicode and extension

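For example, \uFFFF

  • \u takes exactly 4 hex digits, so it covers code points up to U+FFFF.
  • A Unicode code point takes 2 or 4 bytes in UTF-16, the encoding of JavaScript strings.

Above 2 bytes, use \u{FFFFF} AND the flag /u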
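A sketch of both escape forms, assuming JavaScript. The emoji code point U+1F600 is just a convenient example of a character above U+FFFF.

```javascript
// \uXXXX takes exactly four hex digits and works with or without the u flag.
console.log(/\u00FC/.test("\u00FC"));      // true

// \u{...} is only read as a code point escape when the u flag is set.
const smiley = "\u{1F600}";                // a code point above U+FFFF
console.log(/^\u{1F600}$/u.test(smiley));  // true
console.log(/^\u{1F600}$/.test(smiley));   // false
```

Without the u flag the braces are not treated as part of the escape, which is why the last test fails.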

 

Just Escape

Octal can go up to \377 (character code 255)

  • Write all 3 digits, for example \007, so that it cannot be read as a backreference such as \7

Hexadecimal \xFF
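A brief sketch comparing the two notations, assuming JavaScript without the u flag (octal escapes are a legacy feature and are rejected when u is set).

```javascript
// \101 (octal) and \x41 (hexadecimal) both denote character code 65, "A".
console.log(/\101/.test("A"));  // true
console.log(/\x41/.test("A"));  // true

// \377 and \xFF both denote character code 255.
const last = String.fromCharCode(255);
console.log(/\377/.test(last)); // true
console.log(/\xFF/.test(last)); // true
```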

 

Reserved Characters \

Use backslash \ to escape any character that carries a non-literal meaning

.+?*
^$
[](){}
/\|
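When the text to match comes from a variable, the escaping can be automated. The sketch below assumes JavaScript; escapeRegExp is a hypothetical helper written for this example, not a built-in, and it covers exactly the characters listed above.

```javascript
// Hypothetical helper: prefix every reserved character with a backslash
// so that the input is matched literally.
function escapeRegExp(literal) {
  return literal.replace(/[.+?*^$[\](){}\/\\|]/g, "\\$&");
}

const price = "$9.99 (approx.)";
const pattern = new RegExp(escapeRegExp(price));
console.log(pattern.test("It costs $9.99 (approx.) today"));  // true

// Why escaping matters: an unescaped . matches any character.
console.log(/9.99/.test("9x99"));   // true
console.log(/9\.99/.test("9x99"));  // false
```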

 

 

Grouping

One Of / None of

[ABC] matches a single character, A, B or C.

[^ABC] matches any character not in the set

 

Range of characters

For example, [a-z] or [0-9]

Match any character within the range, bounds included.

Capturing group ()

(abc)+ will match abc, abcabc and so on

 

Reference a group

  • to match a symmetric pattern

For example, (a|b).+\1 will match axa and bab, but not axb
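The example can be verified with a short sketch, assuming JavaScript. The ^ and $ anchors are an addition here so that the whole string has to fit the pattern.

```javascript
// \1 must repeat whatever the first group actually matched.
const symmetric = /^(a|b).+\1$/;

console.log(symmetric.test("axa"));  // true
console.log(symmetric.test("bab"));  // true
console.log(symmetric.test("axb"));  // false: group 1 is a, but the string ends in b
```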

 

Group without reference (?:)

(?:a|b)+(x|y)z\1 will match axzx, ayzy, bxzx

  • Here the numeric reference points to (x|y)
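A sketch showing that (?:...) does not take up a group number, assuming JavaScript. Anchors are added so the whole string is tested.

```javascript
const re = /^(?:a|b)+(x|y)z\1$/;
const m = "abxzx".match(re);

console.log(m.length);         // 2: the full match plus one capturing group
console.log(m[1]);             // "x": (x|y) is group 1, (?:a|b) is not numbered
console.log(re.test("ayzy"));  // true
console.log(re.test("axzy"));  // false: \1 must repeat the x
```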

 

Position Matching

Anchors

^ and $ match the beginning and end of the entire string

To match the beginning and end of each line, use the ‘multiline’ flag: /^/m and /$/m

Boundary

\b : The position between a word character and a non-word character (or the start / end of the string).

\B : Any position that is not a word boundary.

What is a word again ?

In the context of regex, a word character is [a-zA-Z0-9_]

So a\b will match the a in "a " (a followed by a space), but will not match a_

In contrast, a\B will match the a in a_ , but will not match the a in "a "

Position is invisible

For example, when a\b matches "a ", the result is just a, without the boundary character.
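A sketch of the boundary behaviour, assuming JavaScript. The inputs use a space as the non-word character; any character outside [a-zA-Z0-9_] would behave the same.

```javascript
// a\b: the a must be followed by a non-word character or the end of the string.
console.log(/a\b/.test("a "));  // true
console.log(/a\b/.test("a"));   // true
console.log(/a\b/.test("a_"));  // false: _ is a word character

// a\B: the a must be followed by another word character.
console.log(/a\B/.test("a_"));  // true
console.log(/a\B/.test("a "));  // false

// The boundary itself has no width: only the a ends up in the match.
console.log("a ".match(/a\b/)[0].length);  // 1
```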

Preceding & Following

Remember \b and \B ?

What about a specific character position ?

\d(?=ies) will match any digit that is followed by ies

\w(?!ies) will match any word character that is not followed by ies
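Both lookaheads in action, as a sketch assuming JavaScript. Note that the negative form is written (?!ies), with no = sign. The sample strings are invented.

```javascript
// (?=ies): the digit must be followed by ies, but ies is not part of the match.
console.log("3ies".match(/\d(?=ies)/)[0]);  // "3"
console.log(/\d(?=ies)/.test("3ie"));       // false

// (?!ies): every word character except the one directly before ies.
console.log("flies".match(/\w(?!ies)/g));   // ["f", "i", "e", "s"]
```

In the last line the l is skipped because it is the only character immediately followed by ies.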

 

Quantifiers & Alternation

To match .. of preceding token                Regex
0 or 1                                        ?
1 or more                                     +
Any number (0 or more)                        *
Exactly 3 occurrences                         {3}
1 to 10 occurrences (inclusive)               {1,10}  (no space inside the braces)
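A sketch exercising each quantifier, assuming JavaScript. Anchors are added so the whole string is tested. The last line shows that a space inside the braces stops them from being read as a quantifier.

```javascript
console.log(/^colou?r$/.test("color"));    // true: ? allows 0 occurrences of u
console.log(/^colou?r$/.test("colour"));   // true: or exactly 1

console.log(/^a+$/.test(""));              // false: + needs at least 1
console.log(/^a*$/.test(""));              // true: * accepts 0

console.log(/^\d{3}$/.test("123"));        // true
console.log(/^\d{3}$/.test("1234"));       // false

console.log(/^\d{1,10}$/.test("12345"));   // true
console.log(/^\d{1, 10}$/.test("12345"));  // false: "{1, 10}" is taken literally
```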

How many is more ?

By default, regex takes a greedy approach and will match as many as possible

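Make it lazy ?

Append ? to a quantifier and it matches as few as possible, taking more only when the rest of the pattern needs it.

+? on its own is the same as {1}
*? on its own is the same as nothing or {0}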
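Greedy and lazy matching side by side, as a sketch assuming JavaScript. The HTML-like input is only an illustration.

```javascript
const html = "<b>one</b><b>two</b>";

// Greedy: .+ grabs as much as possible, up to the last >.
console.log(html.match(/<.+>/)[0]);   // "<b>one</b><b>two</b>"

// Lazy: .+? stops at the first > that lets the match succeed.
console.log(html.match(/<.+?>/)[0]);  // "<b>"

// With nothing after it, a lazy quantifier settles for its minimum.
console.log("aaa".match(/a+?/)[0]);   // "a"
console.log("aaa".match(/a*?/)[0]);   // ""
```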

Alternation |

  • Usually used within a group, wrapped by parentheses.

 

Question

Write a regex that matches a valid IP address (IPv4)

  • More precisely, how can we match a number between 0 and 255.
  • which may have preceding zeros.
(25[0-5])|(2[0-4][0-9])|(0|1)?[0-9]{1,2}

Does each digit group precede a dot ?

  • x.x.x.y
  • Only the first 3 groups.
  • Have to copy paste the first part ..
(((25[0-5])|(2[0-4][0-9])|(0|1)?[0-9]{1,2})\.){3}
((25[0-5])|(2[0-4][0-9])|(0|1)?[0-9]{1,2})
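To avoid the copy paste, the octet pattern can be kept in a string and assembled. This is a sketch assuming JavaScript. It uses 2[0-4][0-9] so that values such as 200 are accepted, wraps the octet in its own group so that \. applies to every alternative, and adds ^ and $ so the whole string is checked. The names octet and ipv4 are illustrative.

```javascript
// One octet, 0 to 255, leading zeros allowed (e.g. 007).
const octet = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})";

// Three "octet." parts followed by a final octet, anchored to the whole string.
const ipv4 = new RegExp(`^(?:${octet}\\.){3}${octet}$`);

console.log(ipv4.test("192.168.0.1"));      // true
console.log(ipv4.test("200.210.255.007"));  // true
console.log(ipv4.test("256.1.1.1"));        // false
console.log(ipv4.test("1.2.3"));            // false
```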

 

Can a regex match any palindrome ?

  • No. Regex matching relies on a finite state automaton (FSA).
  • Each character read from the input string triggers a state transition.
    • Self-looping transitions are allowed.
  • Matching palindromes of arbitrary length requires an unbounded number of states, because the first half has to be remembered.
  • An FSA must have a finite number of states.

Given a string, can we create a regex to check if it’s one ?

  • Yes, remember the capturing group references ? For a known length the groups can be mirrored, e.g. ^(.)(.).?\2\1$ for 4 or 5 characters.
  • OR, we can perform multiple match attempts until it has 1 or 0 characters left.
    • Match ^(.)(.*)\1$ and strip off the surrounding character pair after each match.
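The repeated-match idea can be sketched as follows, assuming JavaScript. isPalindrome is a hypothetical helper written for this example. The pattern is anchored with ^ and $ so that \1 has to be the last character, and since . does not match line breaks, the sketch is meant for single-line input.

```javascript
// Hypothetical helper: peel off one matching outer pair per iteration.
function isPalindrome(input) {
  const outerPair = /^(.)(.*)\1$/;
  let rest = input;
  while (rest.length > 1) {
    const m = rest.match(outerPair);
    if (m === null) return false;  // first and last characters differ
    rest = m[2];                   // keep only what was between the pair
  }
  return true;                     // 0 or 1 characters left
}

console.log(isPalindrome("racecar"));  // true
console.log(isPalindrome("abba"));     // true
console.log(isPalindrome("abca"));     // false
```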

 

References

https://regexr.com provides an interactive way of learning regex.