Automata & LanguagesTopic #37 of 40

Regular Expressions

Regular expression syntax, equivalence with finite automata, and pattern matching.

Definition

A regular expression (regex) describes a pattern for matching strings.

Building Blocks

ExpressionMatches
\emptysetNothing (empty language)
ϵ\epsilonEmpty string only
aa (symbol)Just the string "a"

Operations

OperationNotationMeaning
UnionR1R2R_1 \cup R_2 or $R_1R_2$
ConcatenationR1R2R_1 R_2R1R_1 followed by R2R_2
Kleene StarRR^*Zero or more of RR

Precedence (highest to lowest)

  1. Kleene star (*)
  2. Concatenation
  3. Union (|)

Use parentheses to override.

Examples

Basic Patterns

RegexLanguage
aa{a}\{a\}
abab{ab}\{ab\}
aba \cup b{a,b}\{a, b\}
aa^*{ϵ,a,aa,aaa,}\{\epsilon, a, aa, aaa, \ldots\}
(ab)(ab)^*{ϵ,ab,abab,}\{\epsilon, ab, abab, \ldots\}
aba^*b^*{aibji,j0}\{a^i b^j \mid i, j \geq 0\}
(ab)(a \cup b)^*All strings over {a,b}\{a, b\}

More Complex Examples

RegexDescription
(01)1(0 \cup 1)^* 1Binary strings ending in 1
0100^* 1 0^*Exactly one 1
(01)00(01)(0 \cup 1)^* 00 (0 \cup 1)^*Contains 00
(0110)(01 \cup 10)^*Alternating, starting with 0 or 1
1(01+)1^* (01^+ )^*No consecutive 0s

Additional Operators

These are shorthand (not fundamental):

OperatorMeaningEquivalent
R+R^+One or moreRRRR^*
R?R?Zero or oneRϵR \cup \epsilon
R{n}R\{n\}Exactly nnRRRRR\cdots R (nn times)
R{n,m}R\{n,m\}Between nn and mm
[abc][abc]Character classabca \cup b \cup c
..Any characterΣ\Sigma

Equivalence with Finite Automata

Theorem (Kleene)

A language is regular if and only if it can be described by a regular expression.

Regular expressions ↔ NFAs ↔ DFAs

All three describe exactly the regular languages.

Regex to NFA

Thompson's Construction

Base cases:

ϵ\epsilon:

→(i)──ε──→((f))

aa:

→(i)──a──→((f))

Union (R1R2R_1 \cup R_2):

        ε     [N1]     ε
→(i)──────→       ──────→((f))
        ε     [N2]     ε

Concatenation (R1R2R_1 R_2):

→(i)──→[N1]──→[N2]──→((f))

Kleene Star (RR^*):

        ε
→(i)──────────────────→((f))
     ε      ε
     ↓      ↑
    [N]──→──┘

NFA to Regex

State Elimination Method

  1. Add new start and accept states
  2. Eliminate states one by one
  3. Update edge labels with regex
  4. Final edge label is the regex

Algebraic Properties

Identities

R=RR \cup \emptyset = R Rϵ=ϵR=RR \cdot \epsilon = \epsilon \cdot R = R R=R=R \cdot \emptyset = \emptyset \cdot R = \emptyset RR=RR \cup R = R (R)=R(R^*)^* = R^* =ϵ\emptyset^* = \epsilon ϵ=ϵ\epsilon^* = \epsilon

Distributive Laws

R(ST)=RSRTR(S \cup T) = RS \cup RT (RS)T=RTST(R \cup S)T = RT \cup ST

Arden's Lemma

If X=AXBX = AX \cup B where ϵL(A)\epsilon \notin L(A), then: X=ABX = A^* B

Useful for converting DFAs to regex.

Practical Regex (Programming)

Modern regex engines add features beyond regular languages:

POSIX Character Classes

  • \d = digit = [0-9]
  • \w = word character = [a-zA-Z0-9_]
  • \s = whitespace

Anchors

  • ^ = start of string
  • $ = end of string
  • \b = word boundary

Backreferences

(.)\1 matches repeated character (NOT regular!)

Lookahead/Lookbehind

  • (?=...) = positive lookahead
  • (?!...) = negative lookahead

Common Patterns

PatternRegex
Email[\w.]+@[\w.]+\.\w+
Phone (US)\d{3}-\d{3}-\d{4}
IP address\d{1,3}(\.\d{1,3}){3}
Integer-?[0-9]+
Identifier[a-zA-Z_][a-zA-Z0-9_]*

Limitations

Regular expressions cannot match:

  • Balanced parentheses: {anbn}\{a^n b^n\}
  • Palindromes
  • Nested structures
  • Anything requiring counting or memory

For these, use context-free grammars/parsers.