How much do we know about regular expressions ?
There’s always more beyond what we know.
Format
/RegExpr/Flags
Flags - gmiuy
Global - g
When not specified, return only the first match, once found.
When turned on, continue and return subsequent matches as well.
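A minimal sketch of the difference, using String.prototype.match on a made-up sample string:

```javascript
const text = "cat bat rat";

// Without g, match() stops at the first match (and also reports its index).
const first = text.match(/[a-z]at/);

// With g, match() returns every match as a plain array of strings.
const all = text.match(/[a-z]at/g);

console.log(first[0], first.index); // "cat" 0
console.log(all);                   // ["cat", "bat", "rat"]
```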
Multiline - m
This flag only affects ^ and $, which by default match the start and end position of the entire string.
When m is turned on, ^ and $ also match the start and end position of each line.
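For illustration, here is how ^ behaves with and without m on a small three-line string (the sample text is made up):

```javascript
const lines = "one\ntwo\nthree";

// Without m, ^ only matches at the very start of the string.
const withoutM = lines.match(/^\w+/g);

// With m, ^ also matches right after each line break.
const withM = lines.match(/^\w+/gm);

console.log(withoutM); // ["one"]
console.log(withM);    // ["one", "two", "three"]
```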
Unicode - u
Required for the code point escape \u{FFFFF}.
Case Insensitive - i
Handy, isn’t it ?
Character Classes
Match a digit / character / whitespace / their opposite / Anything
To match | Regex |
---|---|
Digit or not digit | \d \D |
Word character (ASCII letter, digit or _) or not | \w \W |
Whitespace or not | \s \S |
Word Character
\w matches only ASCII letters, digits and the underscore.
- That is, alphanumeric characters plus _.
For example, a letter with an umlaut such as ü will not be matched.
# More precisely, \w is equivalent to this
[a-zA-Z0-9_]
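A quick check of that claim, as a sketch with made-up sample words:

```javascript
const word = /^\w+$/;
const explicit = /^[a-zA-Z0-9_]+$/;

const samples = ["user_42", "über", "naïve", "ABC"];

// Both patterns should agree on every sample.
const results = samples.map((s) => [s, word.test(s), explicit.test(s)]);

console.log(results);
// [["user_42", true, true], ["über", false, false],
//  ["naïve", false, false], ["ABC", true, true]]
```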
Whitespace
Each single space / tab / line-break is a match.
To match a consecutive sequence of spaces, use \s+
Anything - Really ?
Well, everyone knows it.
Still, it’s worth noting that .
matches anything but line terminators.
So it’s actually [^\n\r\u2028\u2029]
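A small sketch of the consequence. The s (dotAll) flag used in the last line is not covered in this post; it is an ES2018 addition that lets the dot match line breaks too, and [\s\S] is the older workaround.

```javascript
const twoLines = "ab\ncd";

// The dot stops at the line break.
const dotDefault = twoLines.match(/.+/)[0];

// [\s\S] matches really anything, line breaks included.
const anything = twoLines.match(/[\s\S]+/)[0];

// With the s flag, the dot behaves the same way.
const dotAll = twoLines.match(/.+/s)[0];

console.log(dotDefault);          // "ab"
console.log(anything === dotAll); // true
```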
Escaped Characters
Common ones
Tab \t
and vertical tab \v
- I’ve never seen the vertical tab ..
Null \0
Line feed \n
Carriage return \r
- Often appears in HTTP headers, where each line ends with \r\n.
Control Characters \c
\cA to \cZ match the control characters with ASCII codes 1 to 26
\cA
matches SOH, whose ASCII code is 1
..
\cZ
matches SUB, whose ASCII code is 26
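To see the letter-to-code mapping in action, a short sketch (the variable names are only for illustration):

```javascript
// \cA is code 1, \cZ is code 26; the letter's position in the alphabet is the code.
const matchesSoh = /\cA/.test(String.fromCharCode(1));
const matchesSub = /\cZ/.test(String.fromCharCode(26));

// I is the 9th letter, and code 9 is the tab character.
const matchesTab = /\cI/.test("\t");

console.log(matchesSoh, matchesSub, matchesTab); // true true true
```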
Unicode and extension
For example, \uFFFF
- \uXXXX takes exactly 4 hex digits (2 bytes), so it stops at U+FFFF
For code points above that, use \u{FFFFF}
AND the u flag
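A sketch of what goes wrong without the flag, using U+1F600 (an emoji) as an arbitrary example of a code point above U+FFFF:

```javascript
const smiley = "\u{1F600}"; // one code point, stored as two UTF-16 code units

// With u, \u{1F600} is read as a code point escape.
const withU = /^\u{1F600}$/u.test(smiley);

// Without u, the same pattern is not read as a code point escape.
const withoutU = /^\u{1F600}$/.test(smiley);

// The u flag also makes the dot consume the whole code point.
const dotWithU = /^.$/u.test(smiley);
const dotWithoutU = /^.$/.test(smiley);

console.log(withU, withoutU);       // true false
console.log(dotWithU, dotWithoutU); // true false
```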
Just Escape
Octal can go up to \377
(Character code 255)
- Best written with all 3 digits (a shorter form such as \7 can be read as a group reference), for example
\007
Hexadecimal \xFF
Reserved Characters \
Use a backslash \ to match literally any character that otherwise carries a special meaning
.+?*
^$
[](){}
/\|
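When the text to match comes from a variable, escaping by hand is not an option. Below is a minimal sketch of a helper that escapes the characters listed above; escapeRegExp is a hypothetical name, not a built-in.

```javascript
// Hypothetical helper: prefix every reserved character with a backslash.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\\/]/g, "\\$&");
}

const literal = "1+1=2";

// Unescaped, + is a quantifier, so the pattern means "one or more 1, then 1=2".
const unescaped = new RegExp(literal).test(literal);

// Escaped, + is just a plus sign.
const escaped = new RegExp(escapeRegExp(literal)).test(literal);

console.log(escapeRegExp("a.b")); // a\.b
console.log(unescaped, escaped);  // false true
```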
Grouping
One Of / None of
[ABC]
matches a single character, A, B or C.
[^ABC]
matches any character not in the set
Range of characters
For example, [a-z] or [0-9]
Match any character within the range, bounds included.
Capturing group ()
(abc)+
will match abc
and abcabc
Reference a group
- to match a symmetric pattern
For example, (a|b).+\1
will match axa
and bab
, but not axb
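The same three cases as a runnable sketch. The ^ and $ anchors are added here so that the whole string has to be symmetric:

```javascript
const symmetric = /^(a|b).+\1$/;

// \1 must repeat whatever group 1 actually captured.
const axa = symmetric.test("axa");
const bab = symmetric.test("bab");
const axb = symmetric.test("axb");

console.log(axa, bab, axb); // true true false
```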
Non-capturing group (?:)
(?:a|b)+(x|y)z\1
will match axzx
, ayzy
, bxzx
- (?:a|b) does not get a number, so here the numeric reference points to
(x|y)
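Inspecting the match result shows which group got the number 1. A small sketch, with anchors added so the whole string must match:

```javascript
const pattern = /^(?:a|b)+(x|y)z\1$/;

const match = "axzx".match(pattern);

// Index 0 is the full match, index 1 is the only captured group.
console.log(match.length); // 2
console.log(match[1]);     // "x"

const mismatch = pattern.test("ayzx"); // y was captured, but x follows the z
console.log(mismatch);     // false
```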
Position Matching
Anchors
^ and $
match the beginning and end of the entire string
To match the beginning and end of each line, add the multiline flag m, as in /^/m and /$/m
Boundary
\b
: The position between a word character and a non-word character (the start / end of the string counts as non-word).
\B: Any position that is not a word boundary.
What is a word again ?
In the context of Regex, a word character is [a-zA-Z0-9_]
So a\b
will match aß
, but will not match a_
In contrast, a\B
will match a_
, but will not match aß
Position is invisible
For example, when a\b
matches aß
, the result is just a
, without the ß that follows the boundary.
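All four combinations in one sketch, plus the matched text itself:

```javascript
// ß is not in [a-zA-Z0-9_], so there is a boundary between a and ß.
const boundaryBeforeEszett = /a\b/.test("aß");
const boundaryBeforeUnderscore = /a\b/.test("a_");

// \B is the exact opposite.
const noBoundaryBeforeUnderscore = /a\B/.test("a_");
const noBoundaryBeforeEszett = /a\B/.test("aß");

// The boundary has no width, so only the a is returned.
const matched = "aß".match(/a\b/)[0];

console.log(boundaryBeforeEszett, boundaryBeforeUnderscore);     // true false
console.log(noBoundaryBeforeUnderscore, noBoundaryBeforeEszett); // true false
console.log(matched);                                            // "a"
```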
Preceding & Following
Remember \b and \B ?
What about the position right before a specific sequence of characters ?
\d(?=ies)
will match any digit that is followed by ies
\w(?!ies)
will match any word character that is not followed by ies
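A sketch of both lookaheads on a made-up sample string. Positive lookahead is written (?=...) and negative lookahead (?!...); like \b, neither consumes any characters.

```javascript
const sample = "5ies 7up";

// Digit followed by "ies".
const followed = sample.match(/\d(?=ies)/g);

// Digit NOT followed by "ies".
const notFollowed = sample.match(/\d(?!ies)/g);

// Only the digit is part of the match, so only the digit gets replaced.
const replaced = "5ies".replace(/\d(?=ies)/, "#");

console.log(followed);    // ["5"]
console.log(notFollowed); // ["7"]
console.log(replaced);    // "#ies"
```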
Quantifiers & Alternation
To match .. of preceding token | Regex |
---|---|
0 or 1 | ? |
1 or more | + |
Any number | * |
Has to be exactly 3 occurrences | {3} |
Can have 1 to 10 (inclusive) occurrences | {1,10} |
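A few of these quantifiers as a sketch. Note the last pattern: in JavaScript (without the u flag) a space inside the braces stops them from being read as a quantifier, so they are matched as literal text.

```javascript
const optional = /^colou?r$/;   // ?     : 0 or 1 "u"
const exactlyThree = /^\d{3}$/; // {3}   : exactly 3 digits
const oneToThree = /^a{1,3}$/;  // {1,3} : 1 to 3 "a"
const withSpace = /^a{1, 3}$/;  // not a quantifier because of the space

console.log(optional.test("color"), optional.test("colour")); // true true
console.log(exactlyThree.test("1234"));                        // false
console.log(oneToThree.test("aa"));                            // true
console.log(withSpace.test("aa"));                             // false
```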
How many is more ?
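By default, a quantifier is greedy and will match as many characters as possible
Make it lazy ?
+? matches as few as possible, starting from 1, so on its own it acts like {1}
*? matches as few as possible, starting from 0, so on its own it acts like {0}
Both still expand when the rest of the pattern needs more characters to match.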
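Greedy versus lazy on a tiny HTML-like string, as a sketch. A lazy quantifier starts with as few characters as it can and only takes more when the rest of the pattern cannot match otherwise.

```javascript
const html = "<b>bold</b>";

// Greedy: .+ runs to the end, then backs off to the LAST >.
const greedy = html.match(/<.+>/)[0];

// Lazy: .+? grows one character at a time and stops at the FIRST >.
const lazy = html.match(/<.+?>/)[0];

// With nothing after them, lazy quantifiers settle for the minimum.
const lazyPlus = "aaa".match(/a+?/)[0];
const lazyStar = "aaa".match(/a*?/)[0];

console.log(greedy);   // "<b>bold</b>"
console.log(lazy);     // "<b>"
console.log(lazyPlus); // "a"
console.log(lazyStar); // ""
```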
Alternation |
- a|b matches either a or b.
- Usually used within a group, wrapped in parentheses, because | binds more loosely than anything else in the pattern.
Question
Write a regex that matches a valid IP address
- More precisely, how can we match a number between 0 and 255,
- which may have leading zeros ?
25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2}
Does each digit group precede a dot ?
- x.x.x.y
- Only the first 3 groups.
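- Have to copy paste the first part ..
- Each copy needs its own parentheses, otherwise \. and {3} only apply to the last alternative.
- Anchor the whole thing with ^ and $ so that nothing else surrounds the address.
^((25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})$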
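One way around the copy-paste is to build the pattern from a string. This is only a sketch under a few assumptions: the names octet and ipv4 are made up, the address is IPv4 in dotted form, and the whole string must be the address (hence ^ and $).

```javascript
// One number between 0 and 255, leading zeros allowed.
const octet = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})";

// Three "octet + dot" groups, then a final octet without a dot.
const ipv4 = new RegExp(`^(?:${octet}\\.){3}${octet}$`);

console.log(ipv4.test("192.168.0.1"));     // true
console.log(ipv4.test("010.001.000.200")); // true
console.log(ipv4.test("256.1.1.1"));       // false
console.log(ipv4.test("1.2.3"));           // false
```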
Can a regex match any palindrome ?
- No. Regex matching relies on a finite state automaton (FSA).
- Each character read from the input string triggers a state transition.
- Self-looping transitions are allowed.
- Recognizing palindromes of arbitrary length requires an unbounded number of states, because the first half has to be remembered.
- An FSA must have a finite number of states.
Given a string, can we create a regex to check if it’s one ?
- Yes, remember the capturing group references ?
- OR, we can perform multiple match attempts with
^(.)(.*)\1$
and strip off the surrounding character pair after each match, until 1 or 0 characters are left.
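The repeated-match idea could look like the sketch below. isPalindrome is a hypothetical helper; it assumes the pattern is anchored with ^ and $, uses the s flag so that the dot also accepts line breaks, and compares UTF-16 code units, so characters above U+FFFF are not handled.

```javascript
function isPalindrome(text) {
  // Group 1 is the first character, \1 forces the last one to be equal,
  // group 2 is whatever remains in between.
  const outerPair = /^(.)(.*)\1$/s;
  let rest = text;
  while (rest.length > 1) {
    const match = rest.match(outerPair);
    if (match === null) return false; // first and last character differ
    rest = match[2];                  // strip the matching pair
  }
  return true; // 0 or 1 characters left
}

console.log(isPalindrome("racecar")); // true
console.log(isPalindrome("abba"));    // true
console.log(isPalindrome("abca"));    // false
```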
References
https://regexr.com provides an interactive way of learning it.