Automata & LanguagesTopic #40 of 40

Context-Free Grammars

CFG notation, derivations, parse trees, ambiguity, and Chomsky normal form.

Definition

A context-free grammar (CFG) is a 4-tuple G=(V,Σ,R,S)G = (V, \Sigma, R, S) where:

  • VV = finite set of variables (non-terminals)
  • Σ\Sigma = finite set of terminals (disjoint from VV)
  • RR = finite set of production rules of form AαA \to \alpha
    • AVA \in V (single variable on left)
    • α(VΣ)\alpha \in (V \cup \Sigma)^* (string of variables and terminals)
  • SVS \in V = start symbol

Example Grammar

Grammar for balanced parentheses: S(S)SSϵS \to (S) \mid SS \mid \epsilon

Notation Conventions

  • Capital letters: variables (S,A,BS, A, B)
  • Lowercase letters: terminals (a,b,ca, b, c)
  • ϵ\epsilon: empty string
  • \mid: separates alternatives

Derivations

A derivation shows how to generate a string from the start symbol.

One-Step Derivation (\Rightarrow)

αAβαγβ\alpha A \beta \Rightarrow \alpha \gamma \beta if AγA \to \gamma is a rule.

Multi-Step Derivation (\Rightarrow^*)

Zero or more derivation steps.

Example

Grammar: SaSbϵS \to aSb \mid \epsilon

Derivation of "aabb": SaSbaaSbbaabbS \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb

Language of a Grammar

L(G)={wΣSw}L(G) = \{w \in \Sigma^* \mid S \Rightarrow^* w\}

All terminal strings derivable from the start symbol.

Context-Free Languages

A language is context-free if some CFG generates it.

Examples

LanguageGrammar
{anbn}\{a^n b^n\}SaSbϵS \to aSb \mid \epsilon
Balanced parenthesesS(S)SSϵS \to (S) \mid SS \mid \epsilon
PalindromesSaSabSbabϵS \to aSa \mid bSb \mid a \mid b \mid \epsilon
{anbmcmdn}\{a^n b^m c^m d^n\}SaSdBS \to aSd \mid B, BbBcϵB \to bBc \mid \epsilon

Parse Trees

A parse tree visualizes a derivation:

  • Root = start symbol
  • Internal nodes = variables
  • Leaves = terminals (or ϵ\epsilon)
  • Children of node AA match right side of some rule A...A \to ...

Yield

Reading leaves left-to-right gives the derived string.

Leftmost and Rightmost Derivations

Leftmost derivation: Always expand the leftmost variable.

Rightmost derivation: Always expand the rightmost variable.

Every parse tree corresponds to exactly one leftmost and one rightmost derivation.

Ambiguity

A grammar is ambiguous if some string has two or more parse trees.

Example: Ambiguous Grammar

EE+EE×E(E)aE \to E + E \mid E \times E \mid (E) \mid a

String "a+a×aa + a \times a" has two parse trees:

  • (a+a)×a(a + a) \times a (add first)
  • a+(a×a)a + (a \times a) (multiply first)

Resolving Ambiguity

Rewrite grammar to enforce precedence/associativity: EE+TTE \to E + T \mid T TT×FFT \to T \times F \mid F F(E)aF \to (E) \mid a

Inherently Ambiguous Languages

Some CFLs have no unambiguous grammar.

Example: {aibjcki=j or j=k}\{a^i b^j c^k \mid i = j \text{ or } j = k\}

Chomsky Normal Form (CNF)

Every rule has form:

  • ABCA \to BC (two variables), or
  • AaA \to a (single terminal)

(Start symbol may have SϵS \to \epsilon)

Conversion to CNF

  1. Eliminate ϵ\epsilon-rules (except from start)
  2. Eliminate unit rules (ABA \to B)
  3. Replace terminals in long rules with variables
  4. Break long rules into binary rules

Importance

  • Simplifies parsing algorithms (CYK)
  • Guarantees O(n3)O(n^3) parsing

Greibach Normal Form (GNF)

Every rule has form: AaαA \to a\alpha

where aa is a terminal and α\alpha is a (possibly empty) string of variables.

Pumping Lemma for CFLs

If LL is context-free, there exists pp such that:

For any wLw \in L with wp|w| \geq p, we can write w=uvxyzw = uvxyz where:

  1. vxyp|vxy| \leq p
  2. vy1|vy| \geq 1
  3. For all i0i \geq 0: uvixyizLuv^i xy^i z \in L

Example: {anbncn}\{a^n b^n c^n\} is not context-free

Choose w=apbpcpw = a^p b^p c^p. Any pumping produces imbalance. \square

Closure Properties

OperationCFLs Closed?
UnionYes
ConcatenationYes
Kleene starYes
IntersectionNo
ComplementNo
Intersection with regularYes

Parsing Algorithms

AlgorithmTimeNotes
CYKO(n3)O(n^3)Any CFG in CNF
EarleyO(n3)O(n^3)Any CFG
LL(k)O(n)O(n)Restricted grammars
LR(k)O(n)O(n)Most programming languages

Applications

Programming Languages

  • Syntax defined by CFGs
  • Parsed by compiler front-end

Natural Language Processing

  • Sentence structure analysis
  • Machine translation

Markup Languages

  • HTML, XML structure
  • JSON format