Definition
A regular expression (regex) describes a pattern for matching strings.
Building Blocks
| Expression | Matches |
|---|---|
| Nothing (empty language) | |
| Empty string only | |
| (symbol) | Just the string "a" |
Operations
| Operation | Notation | Meaning |
|---|---|---|
| Union | or $R_1 | R_2$ |
| Concatenation | followed by | |
| Kleene Star | Zero or more of |
Precedence (highest to lowest)
- Kleene star (*)
- Concatenation
- Union (|)
Use parentheses to override.
Examples
Basic Patterns
| Regex | Language |
|---|---|
| All strings over |
More Complex Examples
| Regex | Description |
|---|---|
| Binary strings ending in 1 | |
| Exactly one 1 | |
| Contains 00 | |
| Alternating, starting with 0 or 1 | |
| No consecutive 0s |
Additional Operators
These are shorthand (not fundamental):
| Operator | Meaning | Equivalent |
|---|---|---|
| One or more | ||
| Zero or one | ||
| Exactly | ( times) | |
| Between and | — | |
| Character class | ||
| Any character |
Equivalence with Finite Automata
Theorem (Kleene)
A language is regular if and only if it can be described by a regular expression.
Regular expressions ↔ NFAs ↔ DFAs
All three describe exactly the regular languages.
Regex to NFA
Thompson's Construction
Base cases:
:
→(i)──ε──→((f))
:
→(i)──a──→((f))
Union ():
ε [N1] ε
→(i)──────→ ──────→((f))
ε [N2] ε
Concatenation ():
→(i)──→[N1]──→[N2]──→((f))
Kleene Star ():
ε
→(i)──────────────────→((f))
ε ε
↓ ↑
[N]──→──┘
NFA to Regex
State Elimination Method
- Add new start and accept states
- Eliminate states one by one
- Update edge labels with regex
- Final edge label is the regex
Algebraic Properties
Identities
Distributive Laws
Arden's Lemma
If where , then:
Useful for converting DFAs to regex.
Practical Regex (Programming)
Modern regex engines add features beyond regular languages:
POSIX Character Classes
\d= digit =[0-9]\w= word character =[a-zA-Z0-9_]\s= whitespace
Anchors
^= start of string$= end of string\b= word boundary
Backreferences
(.)\1 matches repeated character (NOT regular!)
Lookahead/Lookbehind
(?=...)= positive lookahead(?!...)= negative lookahead
Common Patterns
| Pattern | Regex |
|---|---|
[\w.]+@[\w.]+\.\w+ | |
| Phone (US) | \d{3}-\d{3}-\d{4} |
| IP address | \d{1,3}(\.\d{1,3}){3} |
| Integer | -?[0-9]+ |
| Identifier | [a-zA-Z_][a-zA-Z0-9_]* |
Limitations
Regular expressions cannot match:
- Balanced parentheses:
- Palindromes
- Nested structures
- Anything requiring counting or memory
For these, use context-free grammars/parsers.