Terminals = symbols of the alphabet of the language being defined.
Variables = Nonterminals = a finite set of other symbols, each of which represents a language.
Start Symbol = the variable whose language is the one being defined.
Production
variable (head) -> string of variables and terminals (body)
Iterated Derivation
=>* means "zero or more derivation steps."
Sentential Form
Any string of variables and/or terminals derived from the start symbol.
Formally, α is a sentential form iff S =>* α.
Context-Free Language
If G is a CFG, then the language of G, i.e., L(G), is {w | S =>* w}.
A language that is defined by some CFG is called a context-free language.
Derivations allow us to replace any of the variables in a string.
Leads to many different derivations of the same string.
By forcing the leftmost variable (or alternatively, the rightmost variable) to be replaced, we avoid these "distinctions without a difference".
Say wAα =>lm wβα if w is a string of terminals only and A → β is a production.
Also, α =>*lm β if α becomes β by a sequence of 0 or more =>lm steps.
Say αAw =>rm αβw if w is a string of terminals only and A → β is a production.
Also, α =>*rm β if α becomes β by a sequence of 0 or more =>rm steps.
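As a small sketch, a leftmost derivation step can be simulated in Python. The grammar S -> 0S1 | 01 and the uppercase-means-variable convention are assumptions for illustration:

```python
# Sketch: a leftmost derivation, assuming grammar S -> 0S1 | 01 and the
# convention that uppercase symbols are variables, all others terminals.
def leftmost_step(sent, head, body):
    """Replace the leftmost variable in `sent` (expected to be `head`) by `body`."""
    for i, sym in enumerate(sent):
        if sym.isupper():              # leftmost variable found
            assert sym == head
            return sent[:i] + body + sent[i + 1:]
    raise ValueError("no variable to replace")

form = "S"
for body in ("0S1", "0S1", "01"):      # S =>lm 0S1 =>lm 00S11 =>lm 000111
    form = leftmost_step(form, "S", body)
print(form)                            # 000111
```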
The concatenation of the labels of the leaves in left-to-right order (that is, in the order of a preorder traversal) is called the yield of the parse tree.
We sometimes talk about trees that are not exactly parse trees, but only because the root is labeled by some variable A that is not the start symbol.
Call these parse trees with root A.
A CFG is ambiguous if there is a string in the language that is the yield of two or more parse trees.
Equivalent definitions of "ambiguous grammar"
Ambiguity is a Property of Grammars, not Languages.
For the balanced-parentheses language, here is another CFG, which is unambiguous.
A symbol is useful if it appears in some derivation of some terminal string from the start symbol.
Otherwise, it is useless.
- Eliminate symbols that derive no terminal string.
- Eliminate unreachable symbols.
- Discover all variables that derive terminal strings.
- For all other variables, remove all productions in which they appear in either the head or body.
S -> AB | C, A -> aA | a, B -> bB, C -> c
Basis: A and C are discovered because of A -> a and C -> c.
Induction: S is discovered because of S -> C.
Nothing else can be discovered.
Result: S -> C, A -> aA | a, C -> c
- Remove from the grammar all symbols not discovered to be reachable from S, and all productions that involve these symbols.
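Both elimination passes can be sketched on the example grammar above. A minimal sketch, assuming uppercase letters are variables and lowercase letters are terminals:

```python
# Pass 1: find variables that derive some terminal string; drop the rest.
prods = {"S": ["AB", "C"], "A": ["aA", "a"], "B": ["bB"], "C": ["c"]}

generating, changed = set(), True
while changed:
    changed = False
    for head, bodies in prods.items():
        for body in bodies:
            if head not in generating and \
               all(s.islower() or s in generating for s in body):
                generating.add(head)
                changed = True
# Remove productions that mention a non-generating variable.
prods = {h: [b for b in bs if all(s.islower() or s in generating for s in b)]
         for h, bs in prods.items() if h in generating}

# Pass 2: keep only variables reachable from S.
reachable, stack = {"S"}, ["S"]
while stack:
    for body in prods.get(stack.pop(), []):
        for s in body:
            if s.isupper() and s not in reachable:
                reachable.add(s)
                stack.append(s)
prods = {h: bs for h, bs in prods.items() if h in reachable}
print(prods)   # {'S': ['C'], 'C': ['c']}
```

Note the order matters: eliminating non-generating symbols first can make more symbols unreachable, as happens to A here.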
Theorem: If L is a CFL, then L-{ε} has a CFG with no ε-productions.
Note: ε cannot be in the language of any grammar that has no ε–productions.
Nullable symbols = variables A such that A =>* ε.
S -> AB, A -> aA | ε, B -> bB | A
Basis: A is nullable because of A -> ε.
Induction: B is nullable because of B -> A.
Then, S is nullable because of S -> AB.
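The basis/induction discovery above is a fixed-point computation. A minimal sketch on the grammar S -> AB, A -> aA | ε, B -> bB | A, with ε represented as the empty string:

```python
# Discover nullable variables by iterating to a fixed point.
prods = {"S": ["AB"], "A": ["aA", ""], "B": ["bB", "A"]}  # "" stands for ε

nullable, changed = set(), True
while changed:
    changed = False
    for head, bodies in prods.items():
        # A body of all-nullable symbols (including the empty body) makes head nullable.
        if head not in nullable and \
           any(all(s in nullable for s in b) for b in bodies):
            nullable.add(head)
            changed = True
print(sorted(nullable))   # ['A', 'B', 'S']
```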
Key idea: turn each production A→X1…Xn into a family of productions.
For each subset of the nullable X's, there is one production with those symbols eliminated from the right side "in advance".
- Find all pairs (A, B) such that A =>* B by a sequence of unit productions only.
Theorem: if L is a CFL, then there is a CFG for L – {ε} that has no ε-productions, no unit productions, and no useless symbols;
i.e., every body is either a single terminal or has length ≥ 2.
Perform the following steps in order:
- Eliminate ε-productions.
- Eliminate unit productions.
- Eliminate variables that derive no terminal string.
- Eliminate variables not reached from the start symbol.
A CFG is said to be in Chomsky Normal Form if every production is of one of these two forms:
- A -> BC (body is two variables).
- A -> a (body is a single terminal).
Theorem: If L is a CFL, then L – {ε} has a CFG in CNF.
Step 1: Clean the grammar, so every body is either a single terminal or of length at least 2.
Step 2: For each body that is not a single terminal, make the right side all variables.
Consider production A → BcDe.
We need variables Ac and Ae with productions Ac → c and Ae → e.
Replace A → BcDe by A → BAcDAe.
Step 3: Break right sides longer than 2 into a chain of productions with right sides of two variables.
Example: A -> BCDE is replaced by A -> BF, F -> CG, and G -> DE.
- F and G must be used nowhere else.
Recall A -> BCDE is replaced by A -> BF, F -> CG, and G -> DE.
In the new grammar, A => BF => BCG => BCDE.
More importantly: Once we choose to replace A by BF, we must continue to BCG and BCDE.
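Step 3 can be sketched as a small helper that mints a chain of fresh variables. The names F, G, … are assumptions (they just need to be unused elsewhere in the grammar):

```python
# Sketch of Step 3: break a body of k >= 3 variables into a chain of
# two-variable bodies, minting fresh variables used nowhere else.
def split_body(head, body, fresh):
    """Yield CNF productions equivalent to head -> body (body: list of variables)."""
    while len(body) > 2:
        new = next(fresh)
        yield (head, [body[0], new])   # peel off the first variable
        head, body = new, body[1:]
    yield (head, body)                 # final two-variable body

fresh = iter("FGHIJK")                 # assumed-fresh variable names
result = list(split_body("A", list("BCDE"), fresh))
print(result)   # [('A', ['B', 'F']), ('F', ['C', 'G']), ('G', ['D', 'E'])]
```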
A PDA is described by:
If δ(q, a, Z) contains (p, α) among its actions, then one thing the PDA can do in state q, with a at the front of the input and Z on top of the stack, is:
1. Change the state to p.
2. Remove a from the front of the input (but a may be ε).
3. Replace Z on the top of the stack by α.
Design a PDA to accept {0^n 1^n | n ≥ 1}.
The states:
The stack symbols:
The transitions:
We can formalize the pictures just seen with an instantaneous description (ID).
An ID is a triple (q, w, α), where:
To say that ID I can become ID J in one move of the PDA, we write I ⊦ J.
Formally, (q, aw, Xα) ⊦ (p, w, βα) for any w and α, if δ(q, a, X) contains (p, β).
Extend ⊦ to ⊦*, meaning "zero or more moves".
Using the previous example PDA, we can describe the sequence of moves by:
(q,000111,Z0) ⊦ (q,00111,XZ0) ⊦ (q,0111,XXZ0) ⊦ (q,111,XXXZ0) ⊦ (p,11,XXZ0) ⊦ (p,1,XZ0) ⊦ (p,ε,Z0) ⊦ (f,ε,Z0).
Thus, (q,000111,Z0) ⊦* (f,ε,Z0).
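The moves in the trace above can be simulated directly. A deterministic sketch, with the state and stack-symbol names (q, p, f, X, Z0) taken from the trace:

```python
# Sketch of the example PDA for {0^n 1^n | n >= 1}: state q pushes an X per 0,
# state p pops an X per 1, and an ε-move to final state f is possible once
# the stack is back to Z0.
def accepts(w):
    state, stack = "q", ["Z0"]
    for c in w:
        if state == "q" and c == "0":
            stack.append("X")                      # count a 0
        elif state == "q" and c == "1" and stack[-1] == "X":
            stack.pop(); state = "p"               # first 1: start matching
        elif state == "p" and c == "1" and stack[-1] == "X":
            stack.pop()                            # match another 1
        else:
            return False                           # no move available
    return state == "p" and stack == ["Z0"]        # then ε-move to f

print(accepts("000111"), accepts("0011"), accepts("0110"))  # True True False
```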
Theorem 1: Given a PDA P, if (q, x, α) ⊦* (p, y, β), then for every string w in Σ* and every string γ in Γ*, we have (q, xw, αγ) ⊦* (p, yw, βγ).
Theorem 2: Given a PDA P, if (q, xw, α) ⊦* (p, yw, β), then (q, x, α) ⊦* (p, y, β).
The common way to define the language of a PDA is by final state.
If P is a PDA, then L(P) is the set of strings w such that (q0, w, Z0) ⊦* (f, ε, α) for some final state f and any α.
Another language defined by the same PDA is by empty stack.
If P is a PDA, then N(P) is the set of strings w such that (q0,w,Z0)⊦∗(q,ε,ε) for any state q.
Think about ww^R.
Theorem:
If L is a regular language, there exists a DPDA P, such that L=L(P).
Think about a DFA with a stack that it never changes…
Note: It's NOT true of empty stack. Take {0}* for example.
RE ⊆ DPDA(L(P)) ⊆ NPDA
RE ⊄ DPDA(N(P)), but DPDA(N(P)) ⊆ DPDA(L(P))
CFG's and PDA's are both useful for dealing with properties of the CFL's.
Also, PDA's, being "algorithmic", are often easier to use when arguing that a language is a CFL.
Let L=L(G) .
Construct PDA P such that N(P)=L .
P has:
At each step, P represents some left-sentential form (step of a leftmost derivation).
If the stack of P is α , and P has so far consumed x from its input, then P represents left-sentential form xα .
At empty stack, the input consumed is a string in L(G) .
Transition Function of P
δ(q, a, a) contains (q, ε). (Type 1 rules)
This step does not change the LSF represented, but "moves" responsibility for a from the stack to the consumed input.
If A→α is a production of G, then δ(q,ε,A) contains (q,α) . (Type 2 rules)
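The two rule types can be sketched as a nondeterministic search over configurations of this one-state PDA, accepting by empty stack. The grammar S -> 0S1 | 01 and the bounded search are assumptions for illustration:

```python
# Sketch of the grammar-to-PDA construction: the stack holds the unconsumed
# tail of the current left-sentential form; acceptance is by empty stack.
prods = {"S": ["0S1", "01"]}           # assumed example grammar

def in_language(w, limit=50):
    """Breadth-first search over PDA configurations (stack, remaining input)."""
    frontier = [("S", w)]              # start: start symbol on the stack
    for _ in range(limit):
        nxt = []
        for stack, rest in frontier:
            if not stack:
                if not rest:
                    return True        # empty stack, all input consumed
                continue
            top, below = stack[0], stack[1:]
            if top.isupper():          # Type 2: expand the variable on top
                nxt += [(body + below, rest) for body in prods[top]]
            elif rest and rest[0] == top:
                nxt.append((below, rest[1:]))   # Type 1: match a terminal
        frontier = nxt
    return False

print(in_language("000111"), in_language("0101"))  # True False
```

The search mirrors the invariant above: consumed input plus stack contents is always a left-sentential form.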
Now, assume L=N(P) .
We'll construct a CFG G such that L = L(G).
G will have variables [pXq] generating exactly the inputs that cause P to have the net effect of popping stack symbol X while going from state p to state q.
G's variables are of the form [pXq].
This variable generates all and only the strings w such that (p,w,X)⊦∗(q,ε,ε) .
There is also a start symbol S, which we'll discuss later.
Each production for [pXq] comes from a move of P in state p with stack symbol X.
Simplest case:
δ(p, a, X) contains (q, ε).
Note a can be an input symbol or ε.
Then the production is [pXq] -> a.
Here, [pXq] generates a, because reading a is one way to pop X and go from p to q.
Next simplest case:
δ(p, a, X) contains (r, Y) for some state r and symbol Y.
G has production [pXq] -> a[rYq].
We can erase X and go from p to q by reading a (entering state r and replacing the X by Y) and then reading some w that gets P from r to q while erasing the Y.
Third simplest case:
δ(p, a, X) contains (r, YZ) for some state r and symbols Y and Z.
Now, P has replaced X by YZ.
To have the net effect of erasing X, P must erase Y, going from state r to some state s, and then erase Z, going from s to q.
Since we do not know state s, we must generate a family of productions: [pXq] -> a[rYs][sZq], for all states s.
[pXq] =>* auv whenever [rYs] =>* u and [sZq] =>* v.
General Case:
Suppose δ(p, a, X) contains (r, Y1Y2…Yk) for some state r and k ≥ 3.
Generate family of productions:
[pXq] → a[rY1s1][s1Y2s2]…[sk−2Yk−1sk−1][sk−1Ykq], for all choices of states s1, …, sk−1.
We can prove that (q0,w,Z0)⊦∗(p,ε,ε) iff [q0Z0p]=>∗w .
Add to G another variable, the start symbol S, and add productions S → [q0Z0p] for each state p.
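Generating the family of productions for a single move can be sketched with a Cartesian product over the seam states. The tuple encoding of a variable [pXq] as (p, X, q) is an assumption for illustration:

```python
# Sketch: productions contributed by one move δ(p, a, X) containing (r, Y1...Yk).
# A variable [pXq] is encoded as the tuple (p, X, q).
from itertools import product

def triple_productions(states, p, a, X, r, ys):
    """Yield ([pXq], a [rY1s1]...[s_{k-1}Ykq]) for all choices of seam states."""
    if not ys:                          # k = 0: the move pops X, so q must be r
        yield ((p, X, r), [a])
        return
    for seams in product(states, repeat=len(ys)):
        seq = (r,) + seams              # seams = (s1, ..., s_{k-1}, q)
        body = [a] + [(seq[i], ys[i], seq[i + 1]) for i in range(len(ys))]
        yield ((p, X, seq[-1]), body)

fam = list(triple_productions({"p", "q", "r"}, "p", "a", "X", "r", ("Y", "Z")))
print(len(fam))   # 9: three choices of the seam state times three choices of q
```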
We can always find two pieces of any sufficiently long string to "pump" in tandem.
That is: if we repeat each of the two pieces the same number of times, we get another string in the language.
For every context-free language L, there is an integer n, such that
for every string z in L of length ≥ n, we can write z = uvwxy, where |vwx| ≤ n, |vx| ≥ 1, and u v^i w x^i y is in L for all i ≥ 0.
{0^i 1 0^i | i ≥ 1} is a CFL.
But L = {0^i 1 0^i 1 0^i | i ≥ 1} is not.
Proof (using the pumping lemma)
Suppose L were a CFL.
Let n be L's pumping-lemma constant.
Consider z = 0^n 1 0^n 1 0^n.
We can write z = uvwxy, where |vwx| ≤ n and |vx| ≥ 1.
Case 1: vx has no 0's.
Then v and x consist only of 1's, and since |vx| ≥ 1 they contain at least one 1 between them, so uwy has at most one 1; but every string in L has exactly two 1's.
Case 2: vx has at least one 0.
Since |vwx| ≤ n, vwx cannot reach both the first and third blocks of 0's, so in uwy at most two of the three blocks of 0's shrink while another keeps length n. Then uwy either has unequal blocks of 0's or fewer than two 1's; either way it is not in L. Both cases contradict the pumping lemma, so L is not a CFL.
Many questions that can be decided for regular sets cannot be decided for CFL's.
If the start symbol is one of the useless variables, then the CFL is empty; otherwise not.
Assume G is in CNF, or convert the given grammar to CNF.
Note: w = ε is a special case, solved by testing if the start symbol is nullable.
Algorithm CYK is a good example of dynamic programming and runs in time O(n^3), where n = |w|.
Let w = a1…an .
We construct an n-by-n triangular array of sets of variables.
Xij = {variables A | A =>* ai⋯aj }.
Induction on j − i + 1 (the length of the derived string).
Finally, ask if S is in X1n .
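The table-filling algorithm can be sketched directly. The CNF grammar below, generating {0^n 1^n | n ≥ 1}, is an assumption for illustration:

```python
# Sketch of CYK on an assumed CNF grammar for {0^n 1^n | n >= 1}:
# S -> AB | AC, C -> SB, A -> 0, B -> 1
prods = [("S", "AB"), ("S", "AC"), ("C", "SB"), ("A", "0"), ("B", "1")]

def cyk(w):
    n = len(w)
    X = {}                                         # X[i, j] = {A | A =>* a_i...a_j}
    for i in range(1, n + 1):                      # basis: substrings of length 1
        X[i, i] = {h for h, b in prods if b == w[i - 1]}
    for length in range(2, n + 1):                 # induction on substring length
        for i in range(1, n - length + 2):
            j = i + length - 1
            X[i, j] = {h for h, b in prods if len(b) == 2
                       for k in range(i, j)        # try every split point
                       if b[0] in X[i, k] and b[1] in X[k + 1, j]}
    return "S" in X[1, n]

print(cyk("0011"), cyk("0101"))  # True False
```

As noted above, w = ε must be handled separately, by testing whether the start symbol is nullable.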
(Note: The idea is essentially the same as for regular languages.)
Use the pumping lemma constant n.
If there is a string in the language of length between n and 2n−1 , then the language is infinite; otherwise not.
CFL's are closed under union, concatenation, and Kleene closure.
Also, under reversal, homomorphisms, and inverse homomorphisms.
But NOT under intersection or difference.
Proof:
Proof:
Proof:
Proof:
Example:
Let G have S→0S1 | 01.
The reversal of L(G) has grammar S→ 1S0 | 10.
Proof:
Example:
G has productions S -> 0S1 | 01.
h is defined by h(0) = ab, h(1) = ε.
h(L(G)) has the grammar with productions S -> abS | ab.
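The grammar construction in this example can be sketched as a simple rewrite: apply h to each terminal in every body and leave variables alone.

```python
# Sketch: apply a homomorphism to a grammar by rewriting terminals in each body.
h = {"0": "ab", "1": ""}               # h(0) = ab, h(1) = ε (ε as empty string)
prods = {"S": ["0S1", "01"]}

h_prods = {head: ["".join(h.get(s, s) for s in body)   # variables pass through
                  for body in bodies]
           for head, bodies in prods.items()}
print(h_prods)   # {'S': ['abS', 'ab']}
```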
Proof:
CFG: S -> AB, A -> 0A1 | 01, B -> 2B | 2.
Proof:
We can prove something more general:
Any class of languages that is closed under difference is closed under intersection.
Proof: L ∩ M = L – (L – M).
Thus, if CFL's were closed under difference, they would be closed under intersection, but they are not.
The intersection of a CFL with a regular language is always a CFL.
Proof:
Formal Construction