I want to present an important and interesting topic in computer science, the Finite State Machine (FSM). In this part we start with the basics, gaining an understanding of what FSMs are and what they can be used for. This part is very elementary, so please be patient. In subsequent parts things will become much more complicated and interesting.
A finite state machine is a conceptual model that can be used to describe how many things work. Think about a light bulb for instance. The circuit consists of a switch that can be ON or OFF, a few wires and the bulb itself. At any moment in time the bulb is in some state - it is either turned on (emits light) or turned off (no visible effect). For a more focused discussion, let's assume that we have two buttons - one for "turn on" and one for "turn off".
How would you describe the light bulb circuit? You'd probably put it like this: When it's dark and I press ON, the bulb starts emitting light. Then if I press ON again, nothing changes. If I press OFF the bulb is turned off. Then if I press OFF again, nothing changes.
This description is very simple and intuitive, but in fact it describes a state machine!
Think of it the following way: we have a machine (the bulb) with two states, ON and OFF. We have two inputs, an ON switch and an OFF switch. If we are in state ON, pressing ON changes nothing, but pressing OFF moves the machine to state OFF. If we are in state OFF, pressing OFF changes nothing, but pressing ON moves the machine to state ON.
The above is a rephrasing of the first description, just a little more formal. It is, in fact, a formal description of a state machine. Another customary way to describe state machines is with a diagram (people like diagrams and drawings more than words for some insights):
Textual descriptions can become quite wordy, and a simple diagram like this can contain a lot of information. Note how states are represented by circles, and transitions by arrows. This is almost all one needs to know, which makes diagrams very descriptive. This state machine can also be translated to (C++) code:
typedef enum {ON, OFF} bulb_state;
typedef enum {TURN_ON, TURN_OFF} switch_command;
......
bulb_state state;
switch_command command;

switch (state)
{
    case ON:
        if (command == TURN_ON)
        {
        }
        else if (command == TURN_OFF)
        {
            state = OFF;
        }
        break;
    case OFF:
        if (command == TURN_ON)
        {
            state = ON;
        }
        else if (command == TURN_OFF)
        {
        }
        break;
    default:
        assert(0);
}
If code such as this looks familiar, it's hardly surprising. Many of us write state machines in our code without even noticing. Most of the state machines we write are "implicit", in the sense that there isn't a single switch statement that handles the whole machine; rather, it's distributed throughout the whole program. If, additionally, the state variables don't have the word "state" in their name, guessing that a certain piece of code is a state machine in disguise is even harder.
Many programs we write are state machines. Think about a chess game for a moment. If we write a program to play chess against human beings, it's actually a state machine in some sense. It waits for the human to move - an idle state. Once the human moves, it goes into active state of "thinking" about a move to make. A game can have an "end" state such as a victory for one side or a draw. If we think of a GUI for a chess game, the state machine is even more obvious. There is a basic state, when it's a human's move. If a human clicks on some piece, we go into the "clicked" state. In this state, a click on an empty tile may produce a move by the piece. Note that if we click on an empty tile in the "basic" state (no pieces selected), nothing happens. Can you see an obvious state machine here?
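To make the GUI example concrete, here is a minimal sketch of the click handling as an explicit state machine. The names (GuiState, Click, Step, on_click) are hypothetical, move legality is ignored, and what happens when a piece is clicked while another is already selected is my assumption; the text above doesn't say.

```cpp
#include <cassert>

// Hypothetical states of the GUI while it is the human's move.
enum class GuiState { Basic, PieceSelected };

// Hypothetical click events.
enum class Click { OnPiece, OnEmptyTile };

// Result of one click: the next state and whether a move was produced.
struct Step {
    GuiState next;
    bool move_made;
};

Step on_click(GuiState state, Click click) {
    switch (state) {
    case GuiState::Basic:
        // Clicking an empty tile with no piece selected does nothing.
        if (click == Click::OnPiece) return {GuiState::PieceSelected, false};
        return {GuiState::Basic, false};
    case GuiState::PieceSelected:
        // Clicking an empty tile produces a move (legality not checked here).
        if (click == Click::OnEmptyTile) return {GuiState::Basic, true};
        // Assumption: clicking a piece again keeps a piece selected.
        return {GuiState::PieceSelected, false};
    }
    return {state, false};  // not reached for valid states
}
```

Calling on_click(GuiState::Basic, Click::OnEmptyTile) leaves the machine in the basic state with no move, which is exactly the "nothing happens" case described above.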
In the beginning of this part, I said we were going to discuss FSMs. By now I hope you already know what a state machine is, but what has "finite" got to do with it?
Well, our computers are finite. There's only so much memory (even if it's quite large these days). Therefore, the applications are finite. Finite = Limited. For state machines it means that the number of states is limited. 1 is limited, 2 is limited, but 10^7 is also limited, though quite large.
This point may seem banal to some of you, but it is important to emphasize. So now you know what a FSM is: a state machine with a finite number of states. It can be inferred that all state machines implemented in computer hardware and software must be FSMs.
State machines can be also used explicitly. We can benefit a lot from knowingly incorporating state machines in our code. First and foremost, it's a great way to reason about difficult problems. If you see that your code or a part of it can actually be in several states, with inputs affecting these states and outputs resulting from them, you can reason about this code using a state machine. Draw a diagram; visualizing always helps. With a diagram errors can be spotted more easily. Think about all the inputs to your code - what should they change in every state? Does your code cover all possibilities (all inputs in all states)? Are some state transitions illegal?
Another use of state machines is for certain algorithms. In the next part you'll see how essential state machines are for a very common and important application: regular expressions.
Think of an identifier in C++ (such as this, temp_var2, assert etc.). How would you describe what qualifies for an identifier? Let's see if we remember... an identifier consists of letters, digits and the underscore character (_), and must start with a letter (it's possible to start with an underscore, but better not to, as these identifiers are reserved by the language).
A regular expression is a notation that allows us to define such things precisely:
letter (letter | digit | underscore) *
In regular expressions, the vertical bar | means "or", the parentheses are used to group sub expressions (just like in math) and the asterisk (*) means "zero or more instances of the previous". So the regular expression above defines the set of all C++ identifiers (a letter followed by zero or more letters, digits or underscores). Let's see some more examples:
There is another symbol we'll use: eps (usually denoted with the Greek letter Epsilon). eps means "nothing". So, for example, the regular expression for "either xy or xyz" is: xy(z|eps).
People familiar with regexes know that there are more complicated forms than * and |. However, anything can be built from *, | and eps. For instance, x? (zero or one instance of x) is a shorthand for (x|eps). x+ (one or more instances of x) is a shorthand for xx*. Note also the interesting fact that * can be represented with +, namely: x* is equivalent to (x+)|eps.
[Perl programmers and those familiar with Perl syntax (Python programmers, that would include you) will recognize eps as a more elegant alternative to the numerical notation {m, n} where both m and n are zero. In this notation, x* is equivalent to x{0,} (unbound upper limit), x+ is x{1,} and all other cases can be built from these two base cases. - Ed.]
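The shorthand equivalences above can be spot-checked with the standard std::regex library. This is only a sketch: the helper agree_on and the sample list kSamples are hypothetical, and agreeing on a handful of samples illustrates the equivalences without proving them.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Returns true if both patterns accept exactly the same strings
// among the given samples (a spot check, not a proof).
bool agree_on(const std::string& p1, const std::string& p2,
              const std::vector<std::string>& samples) {
    const std::regex r1(p1);
    const std::regex r2(p2);
    for (const std::string& s : samples) {
        if (std::regex_match(s, r1) != std::regex_match(s, r2)) {
            return false;
        }
    }
    return true;
}

// Sample inputs, chosen for illustration only.
const std::vector<std::string> kSamples = {"", "x", "xx", "xxx", "y", "xy"};
```

For example, agree_on("x+", "xx*", kSamples) and agree_on("x*", "x{0,}", kSamples) both hold, while "x*" and "x+" disagree on the empty string.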
Usually a regex is implemented to solve some recognition problem. For example, suppose your application asks a question and the user should answer Yes or No. The legal input, expressed as a regex is (yes)|(no). Pretty simple and not too exciting - but things can get much more interesting.
Suppose we want to recognize the following regex: (a|b)*abb, namely all words consisting of a's and b's and ending with abb. For example: "ababb", "aaabbbaaabbbabb". Say you'd like to write the code that accepts such words and rejects others. The following function does the job:
bool recognize(string str)
{
    string::size_type len = str.length();

    // can't be shorter than 3 chars
    if (len < 3)
        return false;

    // last 3 chars must be "abb"
    if (str.substr(len - 3, 3) != "abb")
        return false;

    // must contain no chars other than "a" and "b"
    if (str.find_first_not_of("ab") != string::npos)
        return false;

    return true;
}
It's pretty clean and robust - it will recognize (a|b)*abb and reject anything else. However, it is clear that the techniques employed in the code are very "personal" to the regex at hand.
If we slightly change the regex to (a|b)*abb(a|b)* for instance (all sequences of a's and b's that have abb somewhere in them), it would change the algorithm completely. (We'd then probably want to go over the string, a char at a time and record the appearance of abb. If the string ends and abb wasn't found, it's a rejection, etc.). It seems that for each regex, we should think of some algorithm to handle it, and this algorithm can be completely different from algorithms for other regexes.
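To see how different the hand-written recognizer becomes, here is one possible sketch for (a|b)*abb(a|b)*. Instead of walking the string a char at a time as suggested above, it leans on std::string::find; the function name recognize_contains_abb is hypothetical.

```cpp
#include <cassert>
#include <string>

// Accepts strings over {a, b} that contain "abb" somewhere.
bool recognize_contains_abb(const std::string& str) {
    // must contain no chars other than "a" and "b"
    if (str.find_first_not_of("ab") != std::string::npos) {
        return false;
    }
    // "abb" must appear somewhere in the string
    return str.find("abb") != std::string::npos;
}
```

Note that almost nothing is shared with the recognizer for (a|b)*abb: the length check and the suffix comparison are gone, and a substring search took their place.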
So what is the solution? Is there any standardized way to handle regexes? Can we even dream of a general algorithm that can produce a recognizer function given a regex? We sure can!
It so happens that Finite State Machines are a very useful tool for regular expressions. More specifically, a regex (any regex!) can be represented as an FSM. To show how, however, we must present two additional definitions (which actually are very logical, assuming we use a FSM for a regex).
It is best presented with an example:
The start state 0 is denoted with a "Start" arrow. 1 is the accepting state (it is denoted with the double border). Now, try to figure out what regex this FSM represents.
It actually represents xy* - x and then 0 or more of y. Do you see how? Note that x leads the FSM to state 1, which is the accepting state. Adding y keeps the FSM in the accepting state. If a x appears when in state 1, the FSM moves to state 2, which is non accepting and "stuck", since any input keeps the FSM in state 2 (because xy* rejects strings where a x comes after y-s). But what happens with other letters? For simplicity we'll now assume that for this FSM our language consists of solely x and y. If our input set would be larger (say the whole lowercase alphabet), we could define that each transition not shown (for instance, on input "z" in state 0) leads us into some "unaccepting" state.
I will now present the general algorithm for figuring out whether a given FSM recognizes a given word. It's called "FSM Simulation". But first let's define an auxiliary function: move(state, input) returns the state resulting from getting input in state state. For the sample FSM above, move(0, x) is 1, move(1, x) is 2, etc. So, the algorithm goes as follows:

state = start_state
input = get_next_input

while not end of input do
    state = move(state, input)
    input = get_next_input
end

if state is a final state
    return "ACCEPT"
else
    return "REJECT"
The algorithm is presented in very general terms and should be well understood. Let's "run" it on the simple xy* FSM, with the input "xyy". We start from the start state - 0; get next input: x; end of input? not yet; move(0, x) moves us to state 1; input now becomes y; not yet end of input; move(1, y) moves us to state 1; exactly the same with the second y; now it's end of input; state 1 is a final state, so "ACCEPT".
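The same run can be reproduced in code. The sketch below hard-codes the xy* machine; the names move_xy and simulate_xy are hypothetical. I assume that y in state 0 leads to the stuck state 2 (as xy* requires), and that any character other than x and y also leads there.

```cpp
#include <cassert>
#include <string>

// Transition function of the xy* machine: 0 is the start state,
// 1 is accepting, 2 is the non-accepting "stuck" state.
int move_xy(int state, char input) {
    switch (state) {
    case 0:
        return input == 'x' ? 1 : 2;  // assumption: anything but x gets stuck
    case 1:
        return input == 'y' ? 1 : 2;  // y stays accepting, x gets stuck
    default:
        return 2;                     // stuck for good
    }
}

// FSM Simulation: feed the inputs one by one, then check the final state.
bool simulate_xy(const std::string& word) {
    int state = 0;
    for (char input : word) {
        state = move_xy(state, input);
    }
    return state == 1;
}
```

simulate_xy("xyy") goes through states 0, 1, 1, 1 and accepts, just like the manual run; simulate_xy("xyx") ends in state 2 and rejects.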
Piece of cake isn't it? Well, it really is! It's a straightforward algorithm and a very easy one for the computer to execute.
Let's go back to the regex we started with - (a|b)*abb. Here is the FSM that represents (recognizes) it:
Although it is much more complicated than the previous FSM, it is still simple to comprehend. This is the nature of FSMs - looking at them you can easily characterize the states they can be in, what transitions occur, and when. Again, note that for simplicity our alphabet consists of solely "a" and "b".
Paper and pencil is all you need to "run" the FSM Simulation algorithm on some simple string. I encourage you to do it, to understand better how this FSM relates to the (a|b)*abb regex.
Just for example take a look at the final state - 3. How can we reach it? From 2 with the input b. How can we reach 2? From 1 with input b. How can we reach 1? Actually from any state with input a. So, "abb" leads us to accept a string, and it indeed fits the regex.
As I said earlier, it is possible to generate code straight from any regex. That is, given a regular expression, a general tool can generate the code that will correctly recognize all strings that fit the regex, and reject the others. Let's see what it takes...
The task is indeed massive, but consider dividing it into two distinct stages:
Hey, we already know how to do the second stage! It is actually the FSM Simulation algorithm we saw earlier. Most of the algorithm is the same from FSM to FSM (just dealing with input). The only part that changes is the "move" function, which represents the transition diagram of some FSM, and we learned how to code the transition function in the previous part. It is very basic, and the technique is the same for any FSM!
Let's now write down the full code that recognizes (a|b)*abb.
#include <iostream>
#include <cassert>
#include <string>
using namespace std;

typedef int fsm_state;
typedef char fsm_input;

bool is_final_state(fsm_state state)
{
    return state == 3;
}

fsm_state get_start_state(void)
{
    return 0;
}

fsm_state move(fsm_state state, fsm_input input)
{
    // our alphabet includes only 'a' and 'b'
    if (input != 'a' && input != 'b')
        assert(0);

    switch (state)
    {
        case 0:
            if (input == 'a')
            {
                return 1;
            }
            else if (input == 'b')
            {
                return 0;
            }
            break;
        case 1:
            if (input == 'a')
            {
                return 1;
            }
            else if (input == 'b')
            {
                return 2;
            }
            break;
        case 2:
            if (input == 'a')
            {
                return 1;
            }
            else if (input == 'b')
            {
                return 3;
            }
            break;
        case 3:
            if (input == 'a')
            {
                return 1;
            }
            else if (input == 'b')
            {
                return 0;
            }
            break;
        default:
            assert(0);
    }

    // not reached for legal states and inputs
    return state;
}

bool recognize(string str)
{
    if (str == "")
        return false;

    fsm_state state = get_start_state();
    string::const_iterator i = str.begin();

    while (i != str.end())
    {
        state = move(state, *i);
        ++i;
    }

    return is_final_state(state);
}

// simple driver for testing
int main(int argc, char** argv)
{
    // expects the string to test as the only argument
    if (argc < 2)
        return 1;

    cout << (recognize(argv[1]) ? 1 : 0) << endl;
    return 0;
}
Take a good look at the recognize function. You should immediately see how closely it follows the FSM Simulation algorithm. The FSM is initialized to the start state, and the first input is read. Then, in a loop, the machine moves to its next state and fetches the next input, etc. until the input string ends. Eventually, we check whether we reached a final state.
Note that this recognize function will be the same for any regex. The only functions that change are the trivial is_final_state and get_start_state, and the more complicated transition function move. But move is very structural - it closely follows the graphical description of the FSM. As you'll see later, such transition functions are easily generated from the description.
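One way to see why such transition functions are easy to generate is to store the diagram as a table instead of a switch. The sketch below does this for the same (a|b)*abb machine; kTable, kStart, kFinal and recognize_table are hypothetical names, and unlike the code above it rejects characters outside the alphabet instead of asserting.

```cpp
#include <cassert>
#include <string>

const int kNumStates = 4;
const int kStart = 0;
const int kFinal = 3;

// kTable[state][0] is the next state on 'a',
// kTable[state][1] is the next state on 'b'.
const int kTable[kNumStates][2] = {
    {1, 0},  // state 0
    {1, 2},  // state 1
    {1, 3},  // state 2
    {1, 0},  // state 3
};

bool recognize_table(const std::string& str) {
    int state = kStart;
    for (char c : str) {
        if (c != 'a' && c != 'b') {
            return false;  // outside the alphabet: reject
        }
        state = kTable[state][c == 'a' ? 0 : 1];
    }
    return state == kFinal;
}
```

With this layout, changing the regex means changing only the numbers in kTable (plus the start and final states); the simulation loop stays untouched.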
So, what have we got so far? We know how to write code that runs a state machine on a string. What don't we know? We still don't know how to generate the FSM from a regex.
FSM, as you already know, stands for Finite State Machine. A more scientific name for it is FA - Finite Automaton (plural automata). Finite Automata can be classified into several categories, but the one we need for the sake of regex recognition is the notion of determinism. Something is deterministic when it involves no chance - everything is known and can be prescribed and simulated beforehand. On the other hand, nondeterminism is about choice: from the same situation, more than one continuation may be possible. It is commonly defined as "A property of a computation which may have more than one result".
Thus, the world of FSMs can be divided in two: a deterministic FSM is called DFA (Deterministic Finite Automaton) and a nondeterministic FSM is called NFA (Nondeterministic Finite Automaton).
A nondeterministic finite automaton is a mathematical model that consists of:
I will now elaborate on a few fine points (trying to simplify and avoid mathematical implications).
A NFA accepts an input string X if and only if there is some path in the transition graph from the start state to some accepting (final) state, such that the edge labels along this path spell out X.
The definition of a NFA doesn't pose a restriction on the number of states that can result from some input in some state. So, given we're in some state N, it is completely legal (in a NFA) to transition to several different states given the input a.
Furthermore, epsilon (eps) transitions are allowed in a NFA. That is, there may be a transition from state to state given "no input".
I know this must sound very confusing if it's the first time you learn about NFAs, but an example I'll show a little later should make things more understandable.
By definition, a deterministic finite automaton is a special case of a NFA, in which
You can immediately see that a DFA is a more "normal" FSM. In fact the FSMs we were discussing earlier are DFAs.
To make this more tolerable, consider an example comparing the DFA and the NFA for the regex (a|b)*abb we saw earlier. Here is the DFA:
And this is the NFA:
Can you see a feature unique to NFAs in this diagram? Look at state 0. When the input is a, where can we move? To state 0 and state 1 - a multiple transition, something that is illegal in a DFA. Take a minute to convince yourself that this NFA indeed accepts (a|b)*abb. For instance, consider the input string abababb. Recall how a NFA's acceptance of a string is defined. So, is there a path in the NFA graph above that "spells out" abababb? There indeed is. The path will stay in state 0 for the first 4 characters, and then will move to states 1->2->3. Consider the input string baabab. Is there a path that spells out this string? No, there isn't, as in order to reach the final state, we must go through abb in the end, which the input string lacks.
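Searching for a path by hand gets tedious, so here is a minimal sketch of one common way to do it in code: keep the set of all states the NFA could be in after each character. The transitions are the ones described above (state 0 loops on a and b and also moves to 1 on a; then 1 to 2 and 2 to 3 on b); the names nfa_move and nfa_accepts are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>

// All states reachable from 'state' on 'input' (possibly none).
std::set<int> nfa_move(int state, char input) {
    if (state == 0 && input == 'a') return {0, 1};  // the multiple transition
    if (state == 0 && input == 'b') return {0};
    if (state == 1 && input == 'b') return {2};
    if (state == 2 && input == 'b') return {3};
    return {};
}

// Accepts if some path from state 0 to state 3 spells out the word.
bool nfa_accepts(const std::string& word) {
    std::set<int> current = {0};
    for (char input : word) {
        std::set<int> next;
        for (int s : current) {
            const std::set<int> targets = nfa_move(s, input);
            next.insert(targets.begin(), targets.end());
        }
        current = next;
    }
    return current.count(3) > 0;
}
```

For abababb the set of possible states alternates between {0, 1} and {0, 2} and ends as {0, 3}, so the string is accepted; for baabab it ends as {0, 2}, so it is rejected.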
Both NFAs and DFAs are important in computer science theory and especially in regular expressions. Here are a few points of difference between these constructs:
There are several techniques involving DFAs and NFAs to build recognizers from regexes:
At first, I was determined to spare you from the whole DFA/NFA discussion and just use the third - direct DFA - technique for recognizer generation. Then, I changed my mind, for two reasons. First, the distinction between NFAs and DFAs in the regex world is important. Different tools use different techniques (for instance, Perl uses NFA while lex and egrep use DFA), and it is valuable to have at least a basic grasp of these topics. Second, and more important, I couldn't help falling to the charms of the NFA-from-regex construction algorithm. It is simple, robust, powerful and complete – in one word, beautiful.
So, I decided to go for the second technique.
Recall the basic building blocks of regular expressions: eps which represents "nothing" or "no input"; characters from the input alphabet (we used a and b most often here); characters may be concatenated, like this: abb; alternation a|b meaning a or b; the star * meaning "zero or more of the previous"; and grouping ().
What follows is Thompson's construction - an algorithm that builds a NFA from a regex. The algorithm is syntax directed, in the sense that it uses the syntactic structure of the regex to guide the construction process.
The beauty and simplicity of this algorithm is in its modularity. First, construction of trivial building blocks is presented.
For eps, construct the NFA:
Here i is a new start state and f is a new accepting state. It's easy to see that this NFA recognizes the regex eps.
For some a from the input alphabet, construct the NFA:
Again, it's easy to see that this NFA recognizes the trivial regex a.
Now, the interesting part of the algorithm: an inductive construction of complex NFAs from simple NFAs. More specifically, given that N(s) and N(t) are NFAs for regular expressions s and t, we'll see how to combine the NFAs N(s) and N(t) according to the combination of their regexes.
For the regular expression s|t, construct the following composite NFA N(s|t):
The eps transitions into and out of the simple NFAs assure that we can be in either of them when the match starts. Any path from the initial to the final state must pass through either N(s) or N(t) exclusively. Thus we see that this composite NFA recognizes s|t.
For the regular expression st (s and then t), construct the composite NFA N(st):
The composite NFA will have the start state of N(s) and the end state of N(t). The accepting (final) state of N(s) is merged with the start state of N(t). Therefore, all paths going through the composite NFA must go through N(s) and then through N(t), so it indeed recognizes st.
For the regular expression s*, construct the composite NFA N(s*):
Note how simply the notion of "zero or more" is represented by this NFA. From the initial state, either "nothing" is accepted with the eps transition to the final state or the "more than" is accepted by going into N(s). The eps transition inside N(s) denotes that N(s) can appear again and again.
For the sake of completeness: a parenthesized regular expression (s) has the same NFA as s, namely N(s).
As you can see, the algorithm covers all the building blocks of regular expressions, denoting their translations into NFAs.
If you follow the algorithm closely, the following NFA will result for (our old friend,) the regex (a|b)*abb:
Sure, it is much larger than the NFA we saw earlier for recognizing the same regex, but this NFA was automatically generated from a regex description using Thompson's construction, rather than crafted by hand.
Let's see how this NFA was constructed:
First, it's easy to note that states 2 and 3 are the basic NFA for the regex a.
Similarly, states 4 and 5 are the NFA for b.
Can you see the a|b? It's clearly states 1,2,3,4,5,6 (without the eps transition from 6 to 1).
Parenthesizing (a|b) doesn't change the NFA.
The addition of states 0 and 7, plus the eps transition from 6 to 1, is the star on N(a|b), namely states 0 - 7 represent (a|b)*.
The rest is easy. States 8 - 10 are simply the concatenation of (a|b)* with abb.
Try to run a few strings through this NFA until you convince yourself that it indeed recognizes (a|b)*abb. Recall that a NFA recognizes a string when the string's characters can be spelled out on some path from the initial to the final state.
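If you'd rather let the computer run the strings, the sketch below simulates this 11-state NFA. It uses the same transitions that the driver code later in this article adds with add_trans. The representation (an edge list, '\0' as the eps marker) and the names kEdges, eps_closure and thompson_accepts are assumptions of this sketch, not part of the article's NFA class.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

const char kEps = '\0';  // marker for eps transitions (sketch-only choice)

struct Edge { int from; int to; char label; };

// Transitions of the Thompson NFA for (a|b)*abb, states 0 - 10.
const std::vector<Edge> kEdges = {
    {0, 1, kEps}, {0, 7, kEps}, {1, 2, kEps}, {1, 4, kEps},
    {2, 3, 'a'},  {4, 5, 'b'},  {3, 6, kEps}, {5, 6, kEps},
    {6, 1, kEps}, {6, 7, kEps}, {7, 8, 'a'},  {8, 9, 'b'}, {9, 10, 'b'},
};

// Extends a set of states with everything reachable on eps alone.
std::set<int> eps_closure(std::set<int> states) {
    std::vector<int> pending(states.begin(), states.end());
    while (!pending.empty()) {
        const int s = pending.back();
        pending.pop_back();
        for (const Edge& e : kEdges) {
            if (e.from == s && e.label == kEps && states.insert(e.to).second) {
                pending.push_back(e.to);
            }
        }
    }
    return states;
}

bool thompson_accepts(const std::string& word) {
    std::set<int> current = eps_closure({0});
    for (char input : word) {
        if (input == kEps) return false;  // never treat real input as eps
        std::set<int> next;
        for (const Edge& e : kEdges) {
            if (e.label == input && current.count(e.from) > 0) {
                next.insert(e.to);
            }
        }
        current = eps_closure(next);
    }
    return current.count(10) > 0;  // 10 is the final state
}
```

The only new ingredient compared to a NFA without eps transitions is eps_closure: after every step we also add the states we can reach "for free".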
At last, let's get our hands on some code. Now that we know the theory behind NFA-from-regex construction, it's clear that we will be doing some NFA manipulations. But how will we represent NFAs in code?
NFA is not a trivial concept, and there are full-blown implementations for general NFAs that are far too complex for our needs. My plan is to code as simple an implementation as possible - one that will be enough for our needs and nothing more. After all, the regex recognizing engine is not supposed to expose its NFAs to the outer world - for us a NFA is only an intermediate representation of a regular expression, which we want to simulate in order to "accept" or "reject" input strings.
My philosophy in such cases is the KISS principle: "Keep It Simple, Stupid". The goal is first to code the simplest implementation that fits my needs. Later, I have no problem refactoring parts of the code and inserting new features, on an as-needed basis.
A very simple NFA implementation is now presented. We will build upon it later, and for now it is enough just to demonstrate the concept. Here is the interface:
#ifndef NFA_H
#define NFA_H

#include <vector>
using namespace std;

// Convenience types and constants
typedef unsigned state;
typedef char input;
enum {EPS = -1, NONE = 0};

class NFA
{
public:
    // Constructed with the NFA size (amount of
    // states), the initial state and the final state
    NFA(unsigned size_, state initial_, state final_);

    // Adds a transition between two states
    void add_trans(state from, state to, input in);

    // Prints out the NFA
    void show(void);

private:
    bool is_legal_state(state s);

    state initial;
    state final;
    unsigned size;
    vector<vector<input> > trans_table;
};

#endif // NFA_H
As promised, the public interface is kept trivial, for now. All we can do is create a NFA object (specifying the amount of states, the start state and the final state), add transitions to it, and print it out. This NFA will then consist of states 0 .. size-1, with the given transitions (which are single characters). Note that we use only one final state for now, for the sake of simplicity. Should we need more than one, it won't be difficult to add.
A word about the implementation: I don't want to go deep into graph-theory here (if you're not familiar with the basics, a web search can be very helpful), but basically a NFA is a directed graph. It is most common to implement a graph using either a matrix or an array of linked lists. The first implementation is more speed efficient, the second is better space-wise. For our NFA I picked the matrix (vector of vectors), mostly because (in my opinion) it is simpler.
The classic matrix implementation of a graph has 1 in cell (i, j) when there is an edge between vertex i and vertex j, and 0 otherwise.
A NFA is a special graph, in the sense that we are interested not only in whether there is an edge, but also in the condition for the edge (the input that leads from one state to another in FSM terminology). Thus, our matrix holds inputs (a nickname for chars, as you can see). So, for instance, 'c' in trans_table[i][j] means that the input 'c' leads from state i to state j in our NFA.
Here is the implementation of the NFA class:
#include <iostream>
#include <string>
#include <cassert>
#include <cstdlib>
#include "nfa.h"
using namespace std;

NFA::NFA(unsigned size_, state initial_, state final_)
{
    size = size_;
    initial = initial_;
    final = final_;

    assert(is_legal_state(initial));
    assert(is_legal_state(final));

    // Initialize trans_table with an "empty graph",
    // no transitions between its states
    for (unsigned i = 0; i < size; ++i)
    {
        vector<input> v;

        for (unsigned j = 0; j < size; ++j)
        {
            v.push_back(NONE);
        }

        trans_table.push_back(v);
    }
}

bool NFA::is_legal_state(state s)
{
    // We have 'size' states, numbered 0 to size-1
    // (state is unsigned, so it can't be negative)
    return s < size;
}

void NFA::add_trans(state from, state to, input in)
{
    assert(is_legal_state(from));
    assert(is_legal_state(to));

    trans_table[from][to] = in;
}

void NFA::show(void)
{
    cout << "This NFA has " << size << " states: 0 - " << size - 1 << endl;
    cout << "The initial state is " << initial << endl;
    cout << "The final state is " << final << endl << endl;

    for (unsigned from = 0; from < size; ++from)
    {
        for (unsigned to = 0; to < size; ++to)
        {
            input in = trans_table[from][to];

            if (in != NONE)
            {
                cout << "Transition from " << from << " to " << to << " on input ";

                if (in == EPS)
                {
                    cout << "EPS" << endl;
                }
                else
                {
                    cout << in << endl;
                }
            }
        }
    }
}
The code is very simple, so you should have no problem understanding what every part of it does. To demonstrate, let's see how we would use this class to create the NFA for (a|b)*abb - the one we built using Thompson's construction earlier (only the driver code is included):
#include "nfa.h"

int main()
{
    NFA n(11, 0, 10);

    n.add_trans(0, 1, EPS);
    n.add_trans(0, 7, EPS);
    n.add_trans(1, 2, EPS);
    n.add_trans(1, 4, EPS);
    n.add_trans(2, 3, 'a');
    n.add_trans(4, 5, 'b');
    n.add_trans(3, 6, EPS);
    n.add_trans(5, 6, EPS);
    n.add_trans(6, 1, EPS);
    n.add_trans(6, 7, EPS);
    n.add_trans(7, 8, 'a');
    n.add_trans(8, 9, 'b');
    n.add_trans(9, 10, 'b');

    n.show();

    return 0;
}
This would (quite expectedly) result in the following output:
This NFA has 11 states: 0 - 10
The initial state is 0
The final state is 10
Transition from 0 to 1 on input EPS
Transition from 0 to 7 on input EPS
Transition from 1 to 2 on input EPS
Transition from 1 to 4 on input EPS
Transition from 2 to 3 on input a
Transition from 3 to 6 on input EPS
Transition from 4 to 5 on input b
Transition from 5 to 6 on input EPS
Transition from 6 to 1 on input EPS
Transition from 6 to 7 on input EPS
Transition from 7 to 8 on input a
Transition from 8 to 9 on input b
Transition from 9 to 10 on input b
As I mentioned earlier: as trivial as this implementation may seem at the moment, it is the basis we will build upon later. Presenting it in small pieces will, hopefully, make the learning curve of this difficult subject less steep for you.
Thompson's Construction tells us how to build NFAs from trivial regular expressions and then compose them into more complex NFAs. Let's start with the basics:
The most basic regular expression is just some single character, for example a. The NFA for such a regex is:
Here is the implementation:
// Builds a basic, single input NFA
NFA build_nfa_basic(input in)
{
    NFA basic(2, 0, 1);
    basic.add_trans(0, 1, in);
    return basic;
}
Just to remind you about our NFA implementation: The first line of the function creates a new (with no transitions yet) NFA of size 2 (that is: with 2 states), and sets state 0 to be the initial state and state 1 to be the final state. The second line adds a transition to the NFA that says "in moves from state 0 to state 1". That's it - a simple regex, a simple construction procedure.
Note that this procedure is suited for the construction of an eps transition as well.
The previous implementation of a simple NFA class was just a starting point and we have quite a few changes to make.
First of all, we need direct access to all of the class's data. Instead of providing get and set accessors (which I personally dislike), all of the class members (size, initial, final and trans_table) have been made public.
Recall what I told you about the internal representation inside the NFA class - it's a matrix representing the transition table of a graph. For each i and j, trans_table[i][j] is the input that takes the NFA from state i to state j. It's NONE if there's no such transition (hence a lot of space is wasted - the matrix representation, while fast, is inefficient in memory).
Several new operations were added to the NFA class for our use in the NFA building functions. Their implementation can be found in nfa.cpp (included in the source code attached to this article). For now, try to understand how they work (it's really simple stuff), later you'll see why we need them for the implementation of various Thompson Construction stages. It may be useful to have the code of nfa.cpp in front of your eyes, and to follow the code for the operations while reading these explanations. Here they are:
append_empty_state - I want to append a new, empty state to the NFA. This state will have no transitions to it and no transitions from it. If this is the transition table before the appending (a sample table with size 5 - states 0 to 4):
Then this is the table after the appending:
The shaded cells are the transitions of the original table (be they empty or not), and the white cells are the new table cells - containing NONE.
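As a rough illustration of what appending does to the table, here is a sketch that works on a bare matrix. It is not the code from nfa.cpp: the Table typedef and the free-function form are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <vector>

typedef char input;
const input NONE = 0;
typedef std::vector<std::vector<input> > Table;

// Appends one state with no transitions to it and none from it.
void append_empty_state(Table& table) {
    // One more column in every existing row: nothing leads to the new state.
    for (std::vector<input>& row : table) {
        row.push_back(NONE);
    }
    // One more row, as wide as the grown table: nothing leaves the new state.
    table.push_back(std::vector<input>(table.size() + 1, NONE));
}
```

After the call the old cells keep their values, and the new last row and last column hold only NONE.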
shift_states - I want to rename all NFA's states, shifting them "higher" by some given number. For instance, if I have 5 states numbered 0 - 4, and I want to have the same states, just named 2 - 6, I will call shift_states(2), and will get the following table:
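Shifting can be sketched in the same spirit, again on a bare matrix rather than the NFA class from nfa.cpp (the Table typedef and the free function are assumptions). A full version would presumably also have to shift the initial and final state numbers, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef char input;
const input NONE = 0;
typedef std::vector<std::vector<input> > Table;

// Renames state i to state i + shift; the first 'shift' states become empty.
void shift_states(Table& table, unsigned shift) {
    const std::size_t old_size = table.size();
    const std::size_t new_size = old_size + shift;

    Table shifted(new_size, std::vector<input>(new_size, NONE));
    for (std::size_t i = 0; i < old_size; ++i) {
        for (std::size_t j = 0; j < old_size; ++j) {
            shifted[i + shift][j + shift] = table[i][j];
        }
    }
    table.swap(shifted);
}
```

A transition that used to sit in cell (0, 1) ends up in cell (2, 3) after shift_states(table, 2), and the first two rows and columns are free for new states.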
fill_states - I want to copy the states of some other NFA into the first table cells of my NFA. For instance, if I take the shifted table from above, and fill its first two states with a new small NFA, I will get (the new NFA's states are darkest):
Note that using fill_states after shift_states is not incidental. These two operations were created to be used together - to concatenate two NFAs. You'll see how they are employed shortly.
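Filling is the simplest of the three. The sketch below copies another table into the top-left corner of ours; as before, it operates on a bare matrix and is not the code from nfa.cpp (the Table typedef and the free function are assumptions).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef char input;
const input NONE = 0;
typedef std::vector<std::vector<input> > Table;

// Copies the states of 'other' into the first cells of 'table'.
void fill_states(Table& table, const Table& other) {
    assert(other.size() <= table.size());
    for (std::size_t i = 0; i < other.size(); ++i) {
        for (std::size_t j = 0; j < other.size(); ++j) {
            table[i][j] = other[i][j];
        }
    }
}
```

Used right after a shift, this drops a small NFA into the room the shift has just made, without touching the shifted states.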
Now I will explain how the more complex operations of Thompson's Construction are implemented. You should understand how the operations demonstrated above work, and also have looked at their source code (a good example of our NFA's internal table manipulation). You may still lack the "feel" of why these operations are needed, but this will soon be covered. Just understand how they work, for now.
Here is the diagram of NFA alternation from earlier:
Given two NFAs, we must build another one that includes all the states of both NFAs plus additional, unified initial and final states. The function that implements this in nfa.cpp is build_nfa_alter. Take a look at its source code now - it is well commented and you should be able to follow all the steps with little difficulty. Note the usage of the new NFA operations to complete the task. First, the NFAs' states are shifted to make room for the full NFA. Then fill_states is used to copy the contents of nfa1 to the unified NFA. Finally, append_empty_state is used to add another state at the end - the new final state.
Here is the diagram of NFA concatenation from earlier:
Given two NFAs, we must build another one that includes all the states of both NFAs (note that nfa1's final and nfa2's initial states are overlapping). The function that implements this in nfa.cpp is build_nfa_concat. Just as in build_nfa_alter, the new NFA operations are used to construct, step by step, a bigger NFA that contains all the needed states of the concatenation.
Here is the diagram of the NFA for a* from earlier:
Although the diagram looks complex, the implementation of this construction is relatively simple, as you'll see in build_nfa_star. There's no need to shift states, because no two NFAs are joined together. There's only a creation of new initial and final states, and new eps transitions added to implement the star operation.
You might have observed that all the NFAs constructed by Thompson's Construction have some very specific behavior. For instance, all the basic building blocks for single letters are similar, and the rest of the constructions just create new links between these states to allow for the alternation, concatenation and star operations. These NFAs are also special implementation-wise. For instance, note that in our NFA implementation, the first state is always the initial, and the last state is always final. You may have noted that this is useful in several operations.
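If you don't have nfa.cpp at hand, the following sketch shows the same four constructions on a much simpler representation. It is not the implementation from nfa.cpp: the edge-list NFA struct, the EPS value and the add_shifted and accepts helpers are assumptions made for this example. It does rely on the convention just mentioned, that state 0 is initial and the last state is final.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

const char EPS = '\0';  // label for eps transitions (so '\0' can't be an input here)

struct Edge { int from, to; char sym; };

// State 0 is the initial state and state size-1 is the final state.
struct NFA {
    int size;
    std::vector<Edge> edges;
};

// Copy the edges of src into dst, renaming every state i to i + shift.
void add_shifted(NFA& dst, const NFA& src, int shift) {
    for (const Edge& e : src.edges)
        dst.edges.push_back({e.from + shift, e.to + shift, e.sym});
}

NFA build_nfa_basic(char c) {
    NFA n{2, {}};
    n.edges.push_back({0, 1, c});
    return n;
}

NFA build_nfa_alter(const NFA& a, const NFA& b) {
    NFA n{a.size + b.size + 2, {}};
    const int fin = n.size - 1;
    add_shifted(n, a, 1);
    add_shifted(n, b, 1 + a.size);
    n.edges.push_back({0, 1, EPS});                  // new initial -> a's initial
    n.edges.push_back({0, 1 + a.size, EPS});         // new initial -> b's initial
    n.edges.push_back({a.size, fin, EPS});           // a's final -> new final
    n.edges.push_back({a.size + b.size, fin, EPS});  // b's final -> new final
    return n;
}

NFA build_nfa_concat(const NFA& a, const NFA& b) {
    NFA n{a.size + b.size - 1, {}};  // a's final and b's initial overlap
    add_shifted(n, a, 0);
    add_shifted(n, b, a.size - 1);
    return n;
}

NFA build_nfa_star(const NFA& a) {
    NFA n{a.size + 2, {}};
    const int fin = n.size - 1;
    add_shifted(n, a, 1);
    n.edges.push_back({0, 1, EPS});       // enter a
    n.edges.push_back({0, fin, EPS});     // or skip it entirely
    n.edges.push_back({a.size, 1, EPS});  // repeat a
    n.edges.push_back({a.size, fin, EPS});
    return n;
}

// Add every state reachable from `states` by eps transitions.
void close_eps(const NFA& n, std::set<int>& states) {
    bool grew = true;
    while (grew) {
        grew = false;
        for (const Edge& e : n.edges)
            if (e.sym == EPS && states.count(e.from) && states.insert(e.to).second)
                grew = true;
    }
}

// Track the set of states the NFA can be in while reading the input.
bool accepts(const NFA& n, const std::string& input) {
    std::set<int> cur{0};
    close_eps(n, cur);
    for (char c : input) {
        std::set<int> next;
        for (const Edge& e : n.edges)
            if (e.sym == c && cur.count(e.from)) next.insert(e.to);
        close_eps(n, next);
        cur = next;
    }
    return cur.count(n.size - 1) > 0;
}
```

The accepts helper is there only to check the result. It follows all possible states at once, which is a slow but simple way to run an NFA.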
With these operations implemented, we now have a full NFA construction implementation in nfa.cpp! For instance, the regex (a|b)*abb can be built as follows:
    NFA a = build_nfa_basic('a');
    NFA b = build_nfa_basic('b');
    NFA alt = build_nfa_alter(a, b);
    NFA str = build_nfa_star(alt);
    NFA sa = build_nfa_concat(str, a);
    NFA sab = build_nfa_concat(sa, b);
    NFA sabb = build_nfa_concat(sab, b);

With these steps completed, sabb is the NFA representing (a|b)*abb. Note how simple it is to build NFAs this way! There's no need to specify individual transitions like we did before. In fact, it's not necessary to understand NFAs at all - just build the desired regex from its basic blocks, and that's it.
Though it has now become much simpler to construct NFAs from regular expressions, it's still not as automatic as we'd like it to be. One still has to explicitly specify the regex structure. A useful representation of structure is an expression tree. For example, the construction above closely reflects this tree structure:
In this tree: . is concatenation, | is alternation, and * is star. So, the regex (a|b)*abb is represented here in a tree form, just like an arithmetic expression.
Such trees in the world of parsing and compilation are called expression trees, or parse trees. For example, to implement an infix calculator (one that can calculate, for example: 3*4 + 5), the expressions are first turned into parse trees, and then these parse trees are walked to make the calculations.
Note that this is, as always, an issue of representation. We have the regex in a textual representation, (a|b)*abb, and we want it in NFA representation. Now we're wondering how to turn it from one representation to the other. My solution is to use an intermediate representation - a parse tree. Going from a regex to a parse tree is similar to parsing arithmetic expressions, and going from a parse tree to an NFA will now be demonstrated, using the Thompson's Construction building blocks described in this article.
A parse tree in our case is just a binary tree, since no operation has more than two arguments. Concatenation and alternation have two arguments, hence their nodes in the tree have two children. Star has one argument - hence only the left child. Chars are the tree leaves. Take a good look at the tree drawn above and you'll see this very clearly.
Take a look at the file regex_parse.cpp from the source code attached to this article. It has a lot in it, but you need to focus only on some specific things for now. First, let's look at the definition of parse_node:
    typedef enum {CHR, STAR, ALTER, CONCAT} node_type;

    // Parse node
    struct parse_node
    {
        parse_node(node_type type_, char data_, parse_node* left_, parse_node* right_) :
            type(type_),
            data(data_),
            left(left_),
            right(right_)
        {
        }

        node_type type;
        char data;
        parse_node* left;
        parse_node* right;
    };
This is a completely normal definition of a binary tree node that contains data and some type by which it is identified. Let us ignore, for the moment, the question of how such trees are built from regexes (if you're very curious - it's all in regex_parse.cpp), and instead think about how to build NFAs from such trees. It's very straightforward, since the parse tree representation is natural for regexes. Here is the code of the tree_to_nfa function from regex_parse.cpp:
    NFA tree_to_nfa(parse_node* tree)
    {
        assert(tree);

        switch (tree->type)
        {
            case CHR:
                return build_nfa_basic(tree->data);
            case ALTER:
                return build_nfa_alter(tree_to_nfa(tree->left), tree_to_nfa(tree->right));
            case CONCAT:
                return build_nfa_concat(tree_to_nfa(tree->left), tree_to_nfa(tree->right));
            case STAR:
                return build_nfa_star(tree_to_nfa(tree->left));
            default:
                assert(0);
        }
    }
Not exactly rocket science, is it? The power of recursion and trees allows us to build NFAs from parse trees in fewer than 20 lines of code...
If you've already looked at regex_parse.cpp, you surely noted that it contains quite a lot of code, much more than I've shown so far. This code handles the construction of parse trees from actual regexes (strings like (a|b)*abb).
I really hate to do this, but I won't explain how this regex-to-tree code works. You'll just have to believe me it works (or study the code - it's there!). As a note, the parsing technique I employ to turn regexes into parse trees is called Recursive Descent parsing. It's an exciting topic and there is plenty of information available on it, if you are interested.
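For the curious, here is a rough sketch of what a recursive descent parser for this tiny regex language can look like. It is not the code from regex_parse.cpp: the Node struct (a stand-in for parse_node that uses std::unique_ptr), the Parser class, the grammar in the comments and the show helper are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

enum NodeType { CHR, STAR, ALTER, CONCAT };

struct Node {
    NodeType type;
    char data;  // used by CHR nodes only
    std::unique_ptr<Node> left, right;
};
using NodePtr = std::unique_ptr<Node>;

NodePtr make_node(NodeType type, char data, NodePtr left, NodePtr right) {
    NodePtr n(new Node);
    n->type = type;
    n->data = data;
    n->left = std::move(left);
    n->right = std::move(right);
    return n;
}

class Parser {
public:
    explicit Parser(const std::string& s) : text(s), pos(0) {}

    NodePtr parse() {
        NodePtr tree = expr();
        if (pos != text.size()) throw std::runtime_error("unexpected character");
        return tree;
    }

private:
    std::string text;
    std::size_t pos;

    bool at_end() const { return pos >= text.size(); }
    char peek() const { return at_end() ? '\0' : text[pos]; }

    // expr := concat ('|' concat)*
    NodePtr expr() {
        NodePtr left = concat();
        while (peek() == '|') {
            ++pos;
            NodePtr right = concat();
            left = make_node(ALTER, 0, std::move(left), std::move(right));
        }
        return left;
    }

    // concat := rep rep*
    NodePtr concat() {
        NodePtr left = rep();
        while (!at_end() && peek() != '|' && peek() != ')') {
            NodePtr right = rep();
            left = make_node(CONCAT, 0, std::move(left), std::move(right));
        }
        return left;
    }

    // rep := atom '*'*
    NodePtr rep() {
        NodePtr node = atom();
        while (peek() == '*') {
            ++pos;
            node = make_node(STAR, 0, std::move(node), nullptr);
        }
        return node;
    }

    // atom := '(' expr ')' | any other single character
    NodePtr atom() {
        if (at_end() || peek() == '|' || peek() == ')' || peek() == '*')
            throw std::runtime_error("expected a character or '('");
        if (peek() == '(') {
            ++pos;
            NodePtr inner = expr();
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos;
            return inner;
        }
        return make_node(CHR, text[pos++], nullptr, nullptr);
    }
};

// Prefix rendering of a tree, handy for checking its shape.
std::string show(const Node* n) {
    switch (n->type) {
        case CHR:   return std::string(1, n->data);
        case STAR:  return "(* " + show(n->left.get()) + ")";
        case ALTER: return "(| " + show(n->left.get()) + " " + show(n->right.get()) + ")";
        default:    return "(. " + show(n->left.get()) + " " + show(n->right.get()) + ")";
    }
}
```

Each grammar rule becomes one function, and operator precedence falls out of which function calls which: star binds tighter than concatenation, which binds tighter than alternation.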
The N in NFA stands for non-deterministic. Our computers, however, are utterly deterministic beasts, which makes "true" simulation of an NFA impossible. But we do know how to simulate DFAs. So, what's left is to see how NFAs can be converted to DFAs.
The algorithm for constructing from an NFA a DFA that recognizes the same language is called "Subset Construction". The main idea of this algorithm is in the following observations:
Take, for example, the familiar NFA of the (a|b)*abb regex (generated automatically by Thompson's Construction, with the code from the last column):
The initial state of this NFA is 0... or is it? Take a look at the diagram, and count in how many states this NFA can be before any input is read. If you remember the previous columns where I explained how eps transitions work, you should have no trouble noticing that initially, the NFA can be in any of the states {0, 1, 2, 4, 7}, because these are the states reachable by eps transitions from the initial state.
Note: any set of states T is reachable by eps from itself by definition (the NFA doesn't have to take an eps transition; it can also stay in its current state).
Now imagine we received the input a. What happens next? In which states can the NFA be now? This should be easy to answer. Just go over all the states the NFA can be in before the input, and see where the input a leads from them. This way, a new set emerges: {1, 2, 3, 4, 6, 7, 8}. I hope you understand why: initially, the NFA can be in states {0, 1, 2, 4, 7}. But from states 0, 1 and 4 there are no transitions on a. The only transitions on a from that set are from state 2 (to state 3) and from state 7 (to state 8). However, the set {3, 8} is an incomplete answer. There are eps transitions from these states - to states {1, 2, 4, 6, 7} - so the NFA can actually be in any of the states {1, 2, 3, 4, 6, 7, 8}.
If you understand this, you understand mostly how the Subset Construction algorithm works. All that's left is the implementation details. But before we get to the implementation of the conversion algorithm itself, there are a couple of prerequisites.
Given N - an NFA and T - a set of NFA states, we would like to know which states in N are reachable from states T by eps transitions. eps-closure is the procedure that answers this question. Here is the algorithm:
    algorithm eps-closure
        inputs: N - NFA, T - set of NFA states
        output: eps-closure(T) - states reachable from T by eps transitions

        eps-closure(T) = T
        foreach state t in T
            push(t, stack)

        while stack is not empty do
            t = pop(stack)
            foreach state u with an eps edge from t to u
                if u is not in eps-closure(T)
                    add u to eps-closure(T)
                    push(u, stack)
        end

        return eps-closure(T)
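The eps-closure pseudocode maps almost line for line onto C++. Below is a minimal sketch, assuming that the NFA's eps edges are available as a map from a state to the set of states it reaches by a single eps edge. The EpsEdges type and the free function are my own choices for this example, not part of the article's NFA class.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stack>

using State = int;
using StateSet = std::set<State>;
// Hypothetical representation: eps_edges[t] holds every state u that has
// an eps edge from t to u.
using EpsEdges = std::map<State, StateSet>;

StateSet eps_closure(const EpsEdges& eps_edges, const StateSet& T) {
    StateSet closure = T;  // every state is reachable from itself
    std::stack<State> pending;
    for (State t : T) pending.push(t);

    while (!pending.empty()) {
        State t = pending.top();
        pending.pop();
        auto it = eps_edges.find(t);
        if (it == eps_edges.end()) continue;  // t has no eps edges
        for (State u : it->second)
            if (closure.insert(u).second)     // true only when u is new
                pending.push(u);
    }
    return closure;
}
```

Feeding it the eps edges of the (a|b)*abb NFA and the set {0} should give back {0, 1, 2, 4, 7}, the set of possible initial states discussed earlier.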
    algorithm dfa-simulate
        inputs: D - DFA, I - Input
        output: ACCEPT or REJECT

        s = start state of D
        i = get next input character from I

        while not end of I do
            s = state reached with input i from state s
            i = get next input character from I
        end

        if s is a final state
            return ACCEPT
        else
            return REJECT
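Here is one way the dfa-simulate pseudocode could look in C++. The DFA struct below is a hypothetical minimal representation made up for this example. The pseudocode does not say what to do when a state has no transition on the current input, so rejecting in that case is an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical minimal DFA: trans maps (state, input) to the next state.
struct DFA {
    int start;
    std::set<int> finals;
    std::map<std::pair<int, char>, int> trans;
};

// Returns true for ACCEPT and false for REJECT.
bool dfa_simulate(const DFA& d, const std::string& input) {
    int s = d.start;
    for (char c : input) {
        auto it = d.trans.find(std::make_pair(s, c));
        if (it == d.trans.end()) return false;  // assumption: no transition means REJECT
        s = it->second;
    }
    return d.finals.count(s) > 0;
}
```

The loop does a single table lookup per input character, which is exactly why DFAs are so attractive for matching.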
    algorithm subset-construction
        inputs: N - NFA
        output: D - DFA

        add eps-closure(N.start) to dfa_states, unmarked
        D.start = eps-closure(N.start)

        while there is an unmarked state T in dfa_states do
            mark(T)

            if T contains a final state of N
                add T to D.final

            foreach input symbol i in N.inputs
                U = eps-closure(N.move(T, i))
                if U is not in dfa_states
                    add U to dfa_states, unmarked
                D.trans_table(T, i) = U
        end
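The subset-construction pseudocode can be sketched in C++ as follows. The NFA and DFA structs here are simplified, hypothetical representations and not the article's classes, and add_edge is a helper invented for building test NFAs. One deliberate difference from the pseudocode: when move yields an empty set, this sketch records no transition instead of creating a "dead" DFA state.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

const char EPS = '\0';  // label for eps transitions in this sketch
using StateSet = std::set<int>;

struct NFA {
    int start;
    int final_state;
    std::set<char> inputs;                           // the alphabet, without EPS
    std::map<std::pair<int, char>, StateSet> edges;  // (state, symbol) -> targets

    // States reachable from T by exactly one transition on `symbol`.
    StateSet move(const StateSet& T, char symbol) const {
        StateSet out;
        for (int t : T) {
            auto it = edges.find(std::make_pair(t, symbol));
            if (it != edges.end()) out.insert(it->second.begin(), it->second.end());
        }
        return out;
    }

    // T plus everything reachable from it by eps transitions alone.
    StateSet eps_closure(StateSet T) const {
        std::vector<int> pending(T.begin(), T.end());
        while (!pending.empty()) {
            int t = pending.back();
            pending.pop_back();
            for (int u : move(StateSet{t}, EPS))
                if (T.insert(u).second) pending.push_back(u);
        }
        return T;
    }
};

void add_edge(NFA& n, int from, char sym, int to) {
    n.edges[std::make_pair(from, sym)].insert(to);
}

struct DFA {
    int start;
    std::set<int> finals;
    std::map<std::pair<int, char>, int> trans;
    std::vector<StateSet> subsets;  // subsets[i] is the NFA state set behind DFA state i
};

DFA subset_construction(const NFA& n) {
    DFA d;
    std::map<StateSet, int> ids;  // dfa_states: subset -> DFA state number
    d.subsets.push_back(n.eps_closure(StateSet{n.start}));
    ids[d.subsets[0]] = 0;
    d.start = 0;

    // Subsets at index >= next are the "unmarked" ones.
    for (std::size_t next = 0; next < d.subsets.size(); ++next) {
        const StateSet T = d.subsets[next];  // a copy, since the vector may grow below
        const int t_id = static_cast<int>(next);
        if (T.count(n.final_state)) d.finals.insert(t_id);

        for (char i : n.inputs) {
            StateSet U = n.eps_closure(n.move(T, i));
            if (U.empty()) continue;  // assumption: no explicit dead state
            auto found = ids.find(U);
            if (found == ids.end()) {
                const int new_id = static_cast<int>(d.subsets.size());
                found = ids.insert(std::make_pair(U, new_id)).first;
                d.subsets.push_back(U);
            }
            d.trans[std::make_pair(t_id, i)] = found->second;
        }
    }
    return d;
}
```

Numbering the DFA states in order of discovery means the position in the subsets vector doubles as the mark: everything before the loop index has been processed, everything after it is still waiting.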
    eps-closure(move({0,1,2,4,7}, a)) = eps-closure({3,8}) = {1,2,3,4,6,7,8}

    C = eps-closure(move(A, b)) = eps-closure({5}) = {1,2,4,5,6,7}

    A = {0,1,2,4,7}
    B = {1,2,3,4,6,7,8}
    C = {1,2,4,5,6,7}
    D = {1,2,4,5,6,7,9}
    E = {1,2,4,5,6,7,10}