有限状态机与正则表达式

Introduction

I want to present an important and interesting topic in computer science, the Finite State Machine (FSM). In this part we start with the basics, gaining an understanding of what FSMs are and what they can be used for. This part is very elementary, so please be patient. In subsequent parts things will become much more complicated and interesting.

Finite State Machines - What are they?

A finite state machine is a conceptual model that can used to describe how many things work. Think about a light bulb for instance. The circuit consists of a switch that can be ON or OFF, a few wires and the bulb itself. At any moment in time the bulb is in some state - it is either turned on (emits light) or turned off (no visible effect). For a more focused discussion, let's assume that we have two buttons - one for "turn on" and one for "turn off".

How would you describe the light bulb circuit? You'd probably put it like this: When it's dark and I press ON, the bulb starts emitting light. Then if I press ON again, nothing changes. If I press OFF the bulb is turned off. Then if I press OFF again, nothing changes.

This description is very simple and intuitive, but in fact it describes a state machine!

Think of it the following way: we have a machine (the bulb) with two states, ON and OFF. We have two inputs, an ON switch and an OFF switch. If we are in state ON, pressing ON changes nothing, but pressing OFF moves the machine to state OFF. If we are in state OFF, pressing OFF changes nothing, but pressing ON moves the machine to state ON.

The above is a rephrasing of the first description, just a little more formal. It is, in fact, a formal description of a state machine. Another customary way to describe state machines is with a diagram (people like diagrams and drawings more than words for some insights):

Textual descriptions can become quite wordy, and a simple diagram like this can contain a lot of information. Note how states are represented by circles, and transitions by arrows. This is almost all one needs to know, which makes diagrams very descriptive. This state machine can also be translated to (C++) code:

typedef enum{ON, OFF} bulb_state;
typedef enum{TURN_ON, TURN_OFF} 
switch_command;
......

bulb_state state;

switch_command command;
switch(state)
{
   case ON: 
       if(command == TURN_ON){}
       else if(command == TURN_OFF)
       {

          state = OFF;}
          break;
   case OFF:
       if(command == TURN_ON) 
       {

        state = ON;
       }
       else if(command == TURN_OFF){}
       break;
   default:assert(0);
}

If code such as this looks familiar, it's hardly surprising. Many of us write state machines in our code without even noticing. Most of the state machines we write are "implicit", in the sense that there isn't a single switch statement that handles the whole machine, but rather it's distributed throughout the whole program. If, additionally, the state variables don't have the word "state" in their name, guessing that a certain code is a state machine in disguise is even harder.

Many programs we write are state machines. Think about a chess game for a moment. If we write a program to play chess against human beings, it's actually a state machine in some sense. It waits for the human to move - an idle state. Once the human moves, it goes into active state of "thinking" about a move to make. A game can have an "end" state such as a victory for one side or a draw. If we think of a GUI for a chess game, the state machine is even more obvious. There is a basic state, when it's a human's move. If a human clicks on some piece, we go into the "clicked" state. In this state, a click on an empty tile may produce a move by the piece. Note that if we click on an empty tile in the "basic" state (no pieces selected), nothing happens. Can you see an obvious state machine here?

Finite?

In the beginning of this part, I said we were going to discuss FSMs. By now I hope you already know what a state machine is, but what has "finite" got to do with it?

Well, our computers are finite. There's only so much memory (even if it's quite large these days). Therefore, the applications are finite. Finite = Limited. For state machines it means that the amount of states is limited. 1 is limited, 2 is limited, but 107 is also limited, though quite large.

This point may seem banal to some of you, but it is important to emphasize. So now you know what a FSM is: a state machine with a finite number of states. It can be inferred that all state machines implemented in computer hardware and software must be FSMs.

Employing state machines for our needs

State machines can be also used explicitly. We can benefit a lot from knowingly incorporating state machines in our code. First and foremost, it's a great way to reason about difficult problems. If you see that your code or a part of it can actually be in several states, with inputs affecting these states and outputs resulting from them, you can reason about this code using a state machine. Draw a diagram; visualizing always helps. With a diagram errors can be spotted more easily. Think about all the inputs to your code - what should they change in every state? Does your code cover all possibilities (all inputs in all states)? Are some state transitions illegal?

Another use of state machines is for certain algorithms. In the next part you'll see how essential state machines are for a very common and important application: regular expressions.

Regular expressions (regexes)

Think of an identifier in C++ (such as this, temp_var2, assert etc.). How would you describe what qualifies for an identifier? Let's see if we remember... an identifier consists of letters, digits and the underscore character (_), and must start with a letter (it's possible to start with an underscore, but better not to, as these identifiers are reserved by the language).

A regular expression is a notation that allows us to define such things precisely:

letter (letter | digit | underscore) *

In regular expressions, the vertical bar | means "or", the parentheses are used to group sub expressions (just like in math) and the asterisk (*) means "zero or more instances of the previous". So the regular expression above defines the set of all C++ identifiers (a letter followed by zero or more letters, digits or underscores). Let's see some more examples:

abb*a - all words starting with a, ending with a, and having at least one b in-between. For example: "aba", "abba", "abbbba" are words that fit, but "aa" or "abab" are not.
0|1|2|3|4|5|6|7|8|9 - all digits. For example: "1" is a digit, "fx" is not.
x(y|z)*(a|b|c) - all words starting with x, then zero or more y or z, and end with a, b or c. For example: "xa", "xya", "xzc", "xyzzzyzyyyzb".
xyz - only the word "xyz".

There is another symbol we'll use: eps (usually denoted with the Greek letter Epsilon) eps means "nothing". So, for example the regular expression for "either xy or xyz" is: xy(z|eps).

People familiar with regexes know that there are more complicated forms than * and |. However, anything can be built from *, | and eps. For instance,x? (zero or one instance of x) is a shorthand for (x|eps). x+ (one or more instances of x) is a shorthand for xx*. Note also the interesting fact that * can be represented with +, namely: x* is equivalent to (x+)|eps.

[Perl programmers and those familiar with Perl syntax (Python programmers, that would include you) will recognize eps as a more elegant alternative to the numerical notation {m, n} where both m and n are zero. In this notation, x* is equivalent to x{0,} (unbound upper limit), x+ is x{1,} and all other cases can be built from these two base cases. - Ed.]

Recognizing strings with regexes

Usually a regex is implemented to solve some recognition problem. For example, suppose your application asks a question and the user should answer Yes or No. The legal input, expressed as a regex is (yes)|(no). Pretty simple and not too exciting - but things can get much more interesting.

Suppose we want to recognize the following regex: (a|b)*abb, namely all words consisting of a's and b's and ending with abb. For example: "ababb", "aaabbbaaabbbabb". Say you'd like to write the code that accepts such words and rejects others. The following function does the job:

 1 bool recognize(string str)

 2 {

 3   string::size_type len = str.length();

 4   // can't be shorter than 3 chars

 5   if (len < 3)

 6     return false;

 7   // last 3 chars must be "abb"

 8   if (str.substr(len - 3, 3) != "abb")

 9     return false;

10   // must contain no chars other than "a" and "b"

11   if (str.find_first_not_of("ab") != string::npos)

12     return false;

13   return true;

14 }

View Code

It's pretty clean and robust - it will recognize (a|b)*abb and reject anything else. However, it is clear that the techniques employed in the code are very "personal" to the regex at hand.

If we slightly change the regex to (a|b)*abb(a|b)* for instance (all sequences of a's and b's that have abb somewhere in them), it would change the algorithm completely. (We'd then probably want to go over the string, a char at a time and record the appearance of abb. If the string ends and abb wasn't found, it's a rejection, etc.). It seems that for each regex, we should think of some algorithm to handle it, and this algorithm can be completely different from algorithms for other regexes.

So what is the solution? Is there any standardized way to handle regexes? Can we even dream of a general algorithm that can produce a recognizer function given a regex? We sure can!

FSMs to the rescue

It happens so that Finite State Machines are a very useful tool for regular expressions. More specifically, a regex (any regex!) can be represented as an FSM. To show how, however, we must present two additional definitions (which actually are very logical, assuming we use a FSM for a regex).

A start state is the state in which a FSM starts.
An accepting state is a final state in which the FSM returns with some answer.

It is best presented with an example:

The start state 0 is denoted with a "Start" arrow. 1 is the accepting state (it is denoted with the double border). Now, try to figure out what regex this FSM represents.

It actually represents xy* - x and then 0 or more of y. Do you see how? Note that x leads the FSM to state 1, which is the accepting state. Adding y keeps the FSM in the accepting state. If a x appears when in state 1, the FSM moves to state 2, which is non accepting and "stuck", since any input keeps the FSM in state 2 (because xy* rejects strings where a x comes after y-s). But what happens with other letters? For simplicity we'll now assume that for this FSM our language consists of solely x and y. If our input set would be larger (say the whole lowercase alphabet), we could define that each transition not shown (for instance, on input "z" in state 0) leads us into some "unaccepting" state.

I will now present the general algorithm for figuring out whether a given FSM recognizes a given word. It's called "FSM Simulation". But first lets define an auxiliary function: move(state, input) returns the state resulting from getting input in state state. For the sample FSM above, move(0, X) is 1, move (0, Y) is 0, etc. So, the algorithm goes as follows:

state = start_state

input = get_next_input

whilenotend of input do

  state = move(state, input)

  input = get_next_input

endif state is a final state

  return"ACCEPT"elsereturn"REJECT"

The algorithm is presented in very general terms and should be well understood. Lets "run" it on the simple xy* FSM, with the input "xyy". We start from the start state - 0; Get next input: x, end of input? not yet; move(0, x) moves us to state 1; input now becomes y; not yet end of input; move(1, y) moves us to state 1; exactly the same with the second y; now it's end of input; state 1 is a final state, so "ACCEPT";

Piece of cake isn't it? Well, it really is! It's a straightforward algorithm and a very easy one for the computer to execute.

Let's go back to the regex we started with - (a|b)*abb. Here is the FSM that represents (recognizes) it:

Although it is much more complicated than the previous FSM, it is still simple to comprehend. This is the nature of FSMs - looking at them you can easily characterize the states they can be in, what transitions occur, and when. Again, note that for simplicity our alphabet consists of solely "a" and "b".

Paper and pencil is all you need to "run" the FSM Simulation algorithm on some simple string. I encourage you to do it, to understand better how this FSM relates to the (a|b)*abb regex.

Just for example take a look at the final state - 3. How can we reach it? From 2 with the input b. How can we reach 2? From 1 with input b. How can we reach 1? Actually from any state with input a. So, "abb" leads us to accept a string, and it indeed fits the regex.

Coding it the FSM way

As I said earlier, it is possible to generate code straight from any regex. That is, given a regular expression, a general tool can generate the code that will correctly recognize all strings that fit the regex, and reject the others. Let's see what it takes...

The task is indeed massive, but consider dividing it to two distinct stages:

Convert the regex to a FSM
Write FSM code to recognize strings

Hey, we already know how to do the second stage! It is actually the FSM Simulation algorithm we saw earlier. Most of the algorithm is the same from FSM to FSM (just dealing with input). The only part that changes is the "move" function, which represents the transition diagram of some FSM, and we learned how to code the transition function in the pvery basic, the technique is the same for any FSM!

Let's now write down the full code that recognizes (a|b)*abb.

 1 #include <iostream>

 2 #include <cassert>

 3 #include <string>

 4 using namespace std;

 5 typedef int     fsm_state;

 6 typedef char    fsm_input;

 7 bool is_final_state(fsm_state state)

 8 {

 9   return (state == 3) ? true : false;

10 }

11 fsm_state get_start_state(void)

12 {

13   return 0;

14 }

15 fsm_state move(fsm_state state, fsm_input input)

16 {

17   // our alphabet includes only 'a' and 'b'

18   if (input != 'a' && input != 'b')

19     assert(0);

20   switch (state)

21   {

22     case 0:

23       if (input == 'a')

24       {

25         return 1;

26       }

27       else if (input == 'b')

28       {

29         return 0;

30       }

31       break;

32     case 1:

33       if (input == 'a')

34       {

35         return 1;

36       }

37       else if (input == 'b')

38       {

39         return 2;

40       }

41       break;

42     case 2:

43       if (input == 'a')

44       {

45         return 1;

46       }

47       else if (input == 'b')

48       {

49         return 3;

50       }

51       break;

52     case 3:

53       if (input == 'a')

54       {

55         return 1;

56       }

57       else if (input == 'b')

58       {

59         return 0;

60       }

61       break;

62       default:

63         assert(0);

64   }

65 }

66 bool recognize(string str)

67 {

68   if (str == "")

69     return false;

70   fsm_state state = get_start_state();

71   string::const_iterator i = str.begin();

72   fsm_input input = *i;

73   while (i != str.end())

74   {

75     state = move(state, *i);

76     ++i;

77   }

78   if (is_final_state(state))

79     return true;

80   else

81     return false;

82 }

83 // simple driver for testing

84 int main(int argc, char** argv)

85 {

86   recognize(argv[1]) ? cout < 1 : cout < 0;

87   return 0;

88 }

View Code

Take a good look at the recognize function. You should immediately see how closely it follows the FSM Simulation algorithm. The FSM is initialized to the start state, and the first input is read. Then, in a loop, the machine moves to its next state and fetches the next input, etc. until the input string ends. Eventually, we check whether we reached a final state.

Note that this recognize function will be the same for any regex. The only functions that change are the trivial is_final_state and get_start_state, and the more complicated transition function move. But move is very structural - it closely follows the graphical description of the FSM. As you'll see later, such transition functions are easily generated from the description.

So, what have we got so far? We know how to write code that runs a state machine on a string. What don't we know? We still don't know how to generate the FSM from a regex.

DFA + NFA = FSM

FSM, as you already know, stands for Finite State Machine. A more scientific name for it is FA - Finite Automaton (plural automata). Finite Automata can be classified into several categories, but the one we need for the sake of regex recognition is the notion of determinism. Something is deterministic when it involves no chance - everything is known and can be prescribed and simulated beforehand. On the other hand, nondeterminism is about chance and probabilities. It is commonly defined as "A property of a computation which may have more than one result".

Thus, the world of FSMs can be divided to two: a deterministic FSM is called DFA (Deterministic Finite Automaton) and a nondeterministic FSM is called NFA (Nondeterministic Finite Automaton).

NFA

A nondeterministic finite automaton is a mathematical model that consists of:

A set of states S
A set of input symbols A (the input symbol alphabet)
A transition function move that maps state-symbol pairs to sets of states
A state s0 that is the start state
A set of states F that are the final states

I will now elaborate on a few fine points (trying to simplify and avoid mathematical implications).

A NFA accepts an input string X if and only if there is some path in the transition graph from the start state to some accepting (final) state, such that the edge labels along this path spell out X.

The definition of a NFA doesn't pose a restriction on the amount of states resulting in some input in some state. So, given we're in some state N it is completely legal (in a NFA) to transition to several different states given the input a.

Furthermore, epsilon (eps) transitions are allowed in a NFA. That is, there may be a transition from state to state given "no input".

I know this must sound very confusing if it's the first time you learn about NFAs, but an example I'll show a little later should make things more understandable.

DFA

By definition, a deterministic finite automaton is a special case of a NFA, in which

No state has an eps-transition
For each state S and input a, there is at most one edge labeled a leaving S.

You can immediately see that a DFA is a more "normal" FSM. In fact the FSMs we were discussing earlier are DFAs.

Recognizing regexes with DFAs and with NFAs

To make this more tolerable, consider an example comparing the DFA and the NFA for the regex (a|b)*abb we saw earlier. Here is the DFA:

And this is the NFA:

Can you see a NFA unique feature in this diagram? Look at state 0. When the input is a, where can we move? To state 0 and state 1 - a multiple transition, something that is illegal in a DFA. Take a minute to convince yourself that this NFA indeed accepts (a|b)*abb. For instance, consider the input string abababb. Recall how NFA's acceptance of a string is defined. So, is there a path in the NFA graph above that "spells out" abababb? There indeed is. The path will stay in state 0 for the first 4 characters, and then will move to states 1->2->3. Consider the input string baabab. Is there a path that spells out this string? No, there isn't, as in order to reach the final state, we must go through abb in the end, which the input string lacks.

Both NFAs and DFAs are important in computer science theory and especially in regular expressions. Here are a few points of difference between these constructs:

It is simpler to build a NFA directly from a regex than a DFA.
NFAs are more compact that DFAs. You must agree that the NFA diagram is much simpler for the (a|b)*abb regex. Due to its definition, a NFA doesn't explicitly need many of the transitions a DFA needs. Note how elegantly state 0 in the NFA example above handles the (a|b)*alternation of the regex. In the DFA, the char a can't both keep the automaton in state 0 and move it to state 1, so the many transitions on ato state 1 are required from any other state.
The compactness also shows in storage requirements. For complex regexes, NFAs are often smaller than DFAs and hence consume less space. There are even cases when a DFA for some regex is exponential in size (while an NFA is always linear) - though this is quite rare.
NFAs can't be directly simulated in the sense DFAs can. This is due to them being nondeterministic "beings", while our computers are deterministic. They must be simulated using a special technique that generates a DFA from their states "on the fly". More on this later.
The previous leads to NFAs being less time efficient than DFAs. In fact, when large strings are searched for regex matches, DFAs will almost always be preferable.

There are several techniques involving DFAs and NFAs to build recognizers from regexes:

Build a NFA from the regex. "Simulate" the NFA to recognize input.
Build a NFA from the regex. Convert the NFA to a DFA. Simulate the DFA to recognize input.
Build a DFA directly from the regex. Simulate the DFA to recognize input.
A few hybrid methods that are too complicated for our discussion.

At first, I was determined to spare you from the whole DFA/NFA discussion and just use the third - direct DFA - technique for recognizer generation. Then, I changed my mind, for two reasons. First, the distinction between NFAs and DFAs in the regex world is important. Different tools use different techniques (for instance, Perl uses NFA while lex and egrep use DFA), and it is valuable to have at least a basic grasp of these topics. Second, and more important, I couldn't help falling to the charms of the NFA-from-regex construction algorithm. It is simple, robust, powerful and complete – in one word, beautiful.

So, I decided to go for the second technique.

Construction of a NFA from a regular expression

Recall the basic building blocks of regular expressions: eps which represents "nothing" or "no input"; characters from the input alphabet (we used aand b most often here); characters may be concatenated, like this: abb; alternation a|b meaning a or b; the star * meaning "zero or more of the previous"; and grouping ().

What follows is Thompson's construction - an algorithm that builds a NFA from a regex. The algorithm is syntax directed, in the sense that it uses the syntactic structure of the regex to guide the construction process.

The beauty and simplicity of this algorithm is in its modularity. First, construction of trivial building blocks is presented.

For eps, construct the NFA:

Here i is a new start state and f is a new accepting state. It's easy to see that this NFA recognizes the regex eps.

For some a from the input alphabet, construct the NFA:

Again, it's easy to see that this NFA recognizes the trivial regex a.

Now, the interesting part of the algorithm: an inductive construction of complex NFAs from simple NFAs. More specifically, given that N(s) and N(t)are NFA's for regular expressions s and t, we'll see how to combine the NFAs N(s) and N(t) according to the combination of their regexes.

For the regular expression s|t, construct the following composite NFA N(s|t):

The eps transitions into and out of the simple NFAs assure that we can be in either of them when the match starts. Any path from the initial to the final state must pass through either N(s) or N(t) exclusively. Thus we see that this composite NFA recognizes s|t.

For the regular expression st (s and then t), construct the composite NFA NFA(st):

The composite NFA will have the start state of N(s) and the end state of N(t). The accepting (final) state of N(s) is merged with the start state of N(t). Therefore, all paths going through the composite NFA must go through N(s) and then through N(t), so it indeed recognizes N(st).

For the regular expression s*, construct the composite NFA N(s*):

Note how simply the notion of "zero or more" is represented by this NFA. From the initial state, either "nothing" is accepted with the eps transition to the final state or the "more than" is accepted by going into N(s). The eps transition inside N(s) denotes that N(s) can appear again and again.

For the sake of completeness: a parenthesized regular expression (s) has the same NFA as s, namely N(s).

As you can see, the algorithm covers all the building blocks of regular expressions, denoting their translations into NFAs.

An example

If you follow the algorithm closely, the following NFA will result for (our old friend,) the regex (a|b)*abb:

Sure, it is much larger than the NFA we saw earlier for recognizing the same regex, but this NFA was automatically generated from a regex description using Thompson's construction, rather than crafted by hand.

Let's see how this NFA was constructed:
First, it's easy to note that states 2 and 3 are the basic NFA for the regex a.

Similarly, states 4 and 5 are the NFA for b.

Can you see the a|b? It's clearly states 1,2,3,4,5,6 (without the eps transition from 6 to 1).

Parenthesizing (a|b) doesn't change the NFA

The addition of states 0 and 7, plus the eps transition from 6 to 1 is the star on NFA(a|b), namely states 0 - 7 represent (a|b)*.

The rest is easy. States 8 - 10 are simply the concatenation of (a|b)* with abb.

Try to run a few strings through this NFA until you convince yourself that it indeed recognizes (a|b)*abb. Recall that a NFA recognizes a string when the string's characters can be spelled out on some path from the initial to the final state.

Implementation of a simple NFA

At last, let's get our hands on some code. Now that we know the theory behind NFA-from-regex construction, it's clear that we will be doing some NFA manipulations. But how will we represent NFAs in code?

NFA is not a trivial concept, and there are full-blown implementations for general NFAs that are far too complex for our needs. My plan is to code as simple an implementation as possible - one that will be enough for our needs and nothing more. After all, the regex recognizing engine is not supposed to expose its NFAs to the outer world - for us a NFA is only an intermediate representation of a regular expression, which we want to simulate in order to "accept" or "reject" input strings.

My philosophy in such cases is the KISS principle: "Keep It Simple, Stupid". The goal is first to code the simplest implementation that fits my needs. Later, I have no problem refactoring parts of the code and inserting new features, on an as-needed basis.

A very simple NFA implementation is now presented. We will build upon it later, and for now it is enough just to demonstrate the concept. Here is the interface:

 1 #ifndef NFA_H

 2 #define NFA_H

 3 #include <vector>

 4 using namespace std;

 5 // Convenience types and constants

 6 typedef unsigned state;

 7 typedef char input;

 8 enum {EPS = -1, NONE = 0};

 9 class NFA

10 {

11   public:

12     // Constructed with the NFA size (amount of

13     // states), the initial state and the final state

14     NFA(unsigned size_, state initial_, state final_);

15     // Adds a transition between two states

16     void add_trans(state from, state to, input in);

17     // Prints out the NFA

18     void show(void);

19   private:

20     bool is_legal_state(state s);

21     state initial;

22     state final;

23     unsigned size;

24     vector<vector<input> > trans_table;

25 };

26 #endif // NFA_H

View Code

As promised, the public interface is kept trivial, for now. All we can do is create a NFA object (specifying the amount of states, the start state and the final state), add transitions to it, and print it out. This NFA will then consist of states 0 .. size-1, with the given transitions (which are single characters). Note that we use only one final state for now, for the sake of simplicity. Should we need more than one, it won't be difficult to add.

A word about the implementation: I don't want to go deep into graph-theory here (if you're not familiar with the basics, a web search can be very helpful), but basically a NFA is a directed graph. It is most common to implement a graph using either a matrix or an array of linked lists. The first implementation is more speed efficient, the second is better space-wise. For our NFA I picked the matrix (vector of vectors), mostly because (in my opinion) it is simpler.

The classic matrix implementation of a graph has 1 in cell (i, j) when there is an edge between vertex i and vertex j, and 0 otherwise.

A NFA is a special graph, in the sense that we are interested not only in whether there is an edge, but also in the condition for the edge (the input that leads from one state to another in FSM terminology). Thus, our matrix holds inputs (a nickname for chars, as you can see). So, for instance, 'c' in trans_table[i][j] means that the input 'c' leads from state i to state j in our NFA.

Here is the implementation of the NFA class:

 1 #include <iostream>

 2 #include <string>

 3 #include <cassert>

 4 #include <cstdlib>

 5 #include "nfa.h"

 6 using namespace std;

 7 NFA::NFA(unsigned size_, state initial_, state final_)

 8 {

 9   size = size_;

10   initial = initial_;

11   final = final_;

12   assert(is_legal_state(initial));

13   assert(is_legal_state(final));

14   // Initialize trans_table with an "empty graph",

15   // no transitions between its states

16   for (unsigned i = 0; i < size; ++i)

17   {

18     vector<input> v;

19     for (unsigned j = 0; j < size; ++j)

20     {

21       v.push_back(NONE);

22     }

23     trans_table.push_back(v);

24   }

25 }

26 bool NFA::is_legal_state(state s)

27 {

28   // We have 'size' states, numbered 0 to size-1

29   if (s < 0 || s >= size)

30     return false;

31   return true;

32 }

33 void NFA::add_trans(state from, state to, input in)

34 {

35   assert(is_legal_state(from));

36   assert(is_legal_state(to));

37   trans_table[from][to] = in;

38 }

39 void NFA::show(void)

40 {

41   cout < "This NFA has " < size < " states: 0 - " < size - 1 < endl;

42   cout < "The initial state is " < initial < endl;

43   cout < "The final state is " < final < endl < endl;

44   for (unsigned from = 0; from < size; ++from)

45   {

46     for (unsigned to = 0; to < size; ++to)

47     {

48       input in = trans_table[from][to];

49       if (in != NONE)

50       {

51         cout < "Transition from " < from < " to " < to < " on input ";

52         if (in == EPS)

53         {

54           cout < "EPS" < endl;

55         }

56         else

57         {

58           cout < in < endl;

59         }

60       }

61     }

62   }

63 }

View Code

The code is very simple, so you should have no problem understanding what every part of it does. To demonstrate, let's see how we would use this class to create the NFA for (a|b)*abb - the one we built using Thompson's construction earlier (only the driver code is included):

 1 #include "nfa.h"

 2 int main()

 3 {

 4   NFA n(11, 0, 10);

 5   n.add_trans(0, 1, EPS);

 6   n.add_trans(0, 7, EPS);

 7   n.add_trans(1, 2, EPS);

 8   n.add_trans(1, 4, EPS);

 9   n.add_trans(2, 3, 'a');

10   n.add_trans(4, 5, 'b');

11   n.add_trans(3, 6, EPS);

12   n.add_trans(5, 6, EPS);

13   n.add_trans(6, 1, EPS);

14   n.add_trans(6, 7, EPS);

15   n.add_trans(7, 8, 'a');

16   n.add_trans(8, 9, 'b');

17   n.add_trans(9, 10, 'b');

18   n.show();

19   return 0;

20 }

View Code

This would (quite expectedly) result in the following output:

This NFA has 11 states: 0 - 10 The initial state is 0 The final state is 10 Transition from 0 to 1 on input EPS Transition from 0 to 7 on input EPS Transition from 1 to 2 on input EPS Transition from 1 to 4 on input EPS Transition from 2 to 3 on input a Transition from 3 to 6 on input EPS Transition from 4 to 5 on input b Transition from 5 to 6 on input EPS Transition from 6 to 1 on input EPS Transition from 6 to 7 on input EPS Transition from 7 to 8 on input a Transition from 8 to 9 on input b Transition from 9 to 10 on input b

As I mentioned earlier: as trivial as this implementation may seem at the moment, it is the basis we will build upon later. Presenting it in small pieces will, hopefully, make the learning curve of this difficult subject less steep for you.

Implementing Thompson's Construction

Thompson's Construction tells us how to build NFAs from trivial regular expressions and then compose them into more complex NFAs. Let's start with the basics:

The simplest regular expression

The most basic regular expression is just some single character, for example a. The NFA for such a regex is:

Here is the implementation:

// Builds a basic, single input NFA

NFA build_nfa_basic(input in){

  NFA basic(2,0,1);

  basic.add_trans(0,1,in);return basic;}

Just to remind you about our NFA implementation: The first line of the function creates a new (with no transitions yet) NFA of size 2 (that is: with 2 states), and sets state 0 to be the initial state and state 1 to be the final state. The second line adds a transition to the NFA that says "in moves from state 0 to state 1". That's it - a simple regex, a simple construction procedure.

Note that this procedure is suited for the construction of an eps transition as well.

Some changes to the NFA class

The previous implementation of a simple NFA class was just a starting point and we have quite a few changes to make.

First of all, we need direct access to all of the class's data. Instead of providing get and set accessors (which I personally dislike), all of the class members (size, initial, final and trans_table) have been made public.

Recall what I told you about the internal representation inside the NFA class - it's a matrix representing the transition table of a graph. For each i andj, trans_table[i][j] is the input that takes the NFA from state i to state j. It's NONE if there's no such transition (hence a lot of space is wasted - the matrix representation, while fast, is inefficient in memory).

Several new operations were added to the NFA class for our use in the NFA building functions. Their implementation can be found in nfa.cpp (included in the source code attached to this article). For now, try to understand how they work (it's really simple stuff), later you'll see why we need them for the implementation of various Thompson Construction stages. It may be useful to have the code of nfa.cpp in front of your eyes, and to follow the code for the operations while reading these explanations. Here they are:

append_empty_state - I want to append a new, empty state to the NFA. This state will have no transitions to it and no transitions from it. If this is the transition table before the appending (a sample table with size 5 - states 0 to 4):

Then this is the table after the appending:

The shaded cells are the transitions of the original table (be they empty or not), and the white cells are the new table cells - containing NONE.

shift_states - I want to rename all NFA's states, shifting them "higher" by some given number. For instance, if I have 5 states numbered 0 - 4, and I want to have the same states, just named 2 - 6, I will call shift_states(2), and will get the following table:

fill_states - I want to copy the states of some other NFA into the first table cells of my NFA. For instance, if I take the shifted table from above, and fill its first two states with a new small NFA, I will get (the new NFA's states are darkest):

Note that using fill_states after shift_states is not incidental. These two operations were created to be used together - to concatenate two NFAs. You'll see how they are employed shortly.

Now I will explain how the more complex operations of Thompson's Construction are implemented. You should understand how the operations demonstrated above work, and also have looked at their source code (a good example of our NFA's internal table manipulation). You may still lack the "feel" of why these operations are needed, but this will soon be covered. Just understand how they work, for now.

Implementing alternation: a|b

Here is the diagram of NFA alternation from earlier:

Given two NFAs, we must build another one that includes all the states of both NFAs plus additional, unified initial and final states. The function that implements this in nfa.cpp is build_nfa_alter. Take a look at its source code now - it is well commented and you should be able to follow through all the steps with little difficulty. Note the usage of the new NFA operations to complete the task. First, the NFAs states are shifted to make room for the full NFA. fill_states is used to copy the contents of nfa1 to the unified NFA. Finally, append_empty_state is used to add another state at the end - the new final state.

Implementing concatenation: ab

Here is the diagram of NFA concatenation from earlier:

Given two NFAs, we must build another one that includes all the states of both NFAs (note that nfa1's final and nfa2's initial states are overlapping). The function that implements this in nfa.cpp is build_nfa_concat. Just as in build_nfa_alter, the new NFA operations are used to construct, step by step, a bigger NFA that contains all the needed states of the concatenation.

Implementing star: a*

Here is the diagram of the NFA for a* from earlier:

Although the diagram looks complex, the implementation of this construction is relatively simple, as you'll see in build_nfa_star. There's no need to shift states, because no two NFAs are joined together. There's only a creation of new initial and final states, and new eps transitions added to implement the star operation.

Specialty of NFAs constructed by Thompson's Construction

You might have observed that all the NFAs constructed by Thompson's Construction have some very specific behavior. For instance, all the basic building blocks for single letters are similar, and the rest of the constructions just create new links between these states to allow for the alternation, concatenation and star operations. These NFAs are also special implementation-wise. For instance, note that in our NFA implementation, the first state is always the initial, and the last state is always final. You may have noted that this is useful in several operations.

A complete NFA construction implementation

With these operations implemented, we now have a full NFA construction implementation in nfa.cpp! For instance, the regex (a|b)*abb can be built as follows:

NFA a = build_nfa_basic('a');

NFA b = build_nfa_basic('b');

NFA alt = build_nfa_alter(a, b);

NFA str = build_nfa_star(alt);

NFA sa = build_nfa_concat(str, a);

NFA sab = build_nfa_concat(sa, b);

NFA sabb = build_nfa_concat(sab, b);

With these steps completed, sabb is the NFA representing (a|b)*abb. Note how simple it's to build NFAs this way! There's no need to specify individual transitions like we did before. In fact, it's not necessary to understand NFAs at all - just build the desired regex from its basic blocks, and that's it.

Even closer to complete automation

Though it has now become much simpler to construct NFAs from regular expressions, it's still not as automatic as we'd like it to be. One still has to explicitly specify the regex structure. A useful representation of structure is an expression tree. For example, the construction above closely reflects this tree structure:

In this tree: . is concatenation, | is alternation, and * is star. So, the regex (a|b)*abb is represented here in a tree form, just like an arithmetic expression.

Such trees in the world of parsing and compilation are called expression trees, or parse trees. For example, to implement an infix calculator (one that can calculate, for example: 3*4 + 5), the expressions are first turned into parse trees, and then these parse trees are walked to make the calculations.

Note, this is, as always, an issue of representation. We have the regex in a textual representation: (a|b)*abb and we want it in NFA representation. Now we're wondering how to turn it from one representation to the other. My solution is to use an intermediate representation - a parse tree. Going from a regex to a parse tree is similar to parsing arithmetic expressions, and going from a parse tree to a NFA will be now demonstrated, using the Thompson's Construction building blocks described in this article.

From a parse tree to a NFA

A parse tree in our case is just a binary tree, since no operation has more than two arguments. Concatenation and alternations have two arguments, hence their nodes in the tree have two children. Star has one argument - hence only the left child. Chars are the tree leaves. Take a good look at the tree drawn above, you'll see this very clearly.

Take a look at the file regex_parse.cpp from the source code archive on the book's companion website. It has a lot in it, but you only need to focus only on some specific things for now. First, let's look at the definition of parse_node:

 1 typedef enum {CHR, STAR, ALTER, CONCAT} node_type;

 2 // Parse node

 3 struct parse_node

 4 {

 5   parse_node(node_type type_, char data_, parse_node* left_, parse_node* right_) :

 6     type(type_),

 7     data(data_),

 8     left(left_),

 9     right(right_)

10   {

11   }

12   node_type type;

13   char data;

14   parse_node* left;

15   parse_node* right;

16 };

View Code

This is a completely normal definition of a binary tree node that contains data and some type by which it is identified. Let us ignore, for the moment, the question of how such trees are built from regexes (if you're very curious - it's all in regex_parse.cpp), and instead think about how to build NFAs from such trees. It's very straight forward since the parse tree representation is natural for regexes. Here is the code of the tree_to_nfa function from regex_parse.cpp:

 1 NFA tree_to_nfa(parse_node* tree)

 2 {

 3   assert(tree);

 4   switch (tree->type)

 5   {

 6     case CHR:

 7       return build_nfa_basic(tree->data);

 8     case ALTER:

 9       return build_nfa_alter(tree_to_nfa(tree->left), tree_to_nfa(tree->right));

10     case CONCAT:

11       return build_nfa_concat(tree_to_nfa(tree->left), tree_to_nfa(tree->right));

12     case STAR:

13       return build_nfa_star(tree_to_nfa(tree->left));

14     default:

15       assert(0);

16   }

17 }

View Code

Not much of a rocket science, is it? The power of recursion and trees allows us to build NFAs from parse trees in just 18 lines of code...

From a regex to a parse tree

If you've already looked at regex_parse.cpp, you surely noted that it contains quite a lot of code, much more than I've show so far. This code is the construction of parse trees from actual regexes (strings like (a|b)*abb).

I really hate to do this, but I won't explain how this regex-to-tree code works. You'll just have to believe me it works (or study the code - it's there!). As a note, the parsing technique I employ to turn regexes into parse trees is called Recursive Descent parsing. It's an exciting topic and there is plenty of information available on it, if you are interested.

Converting NFAs to DFAs

The N in NFA stands for non-deterministic. Our computers, however, are utterly deterministic beasts, which makes "true" simulation of an NFA impossible. But we do know how to simulate DFAs. So, what's left is to see how NFAs can be converted to DFAs.

The algorithm for constructing from an NFA a DFA that recognizes the same language is called "Subset Construction". The main idea of this algorithm is in the following observations:

In the transition table of an NFA, given a state and an input there's a set of states we can move to. In a DFA, however, there's only one state we can move to.
Because of eps transitions, even without an input there's a set of states an NFA can be in at any given moment. Not so with the DFA, for which we always know exactly in what state it is.
Although we can't know in which state from the sets described above the NFA is, the set is known. Therefore, we will represent each set of NFA states by a DFA state. The DFA uses its state to keep track of all possible states the NFA can be in after reading each input symbol.

Take, for example, the familiar NFA of the (a|b)*abb regex (generated automatically by Thompson's Construction, with the code from the last column):

The initial state of this NFA is 0... or is it ? Take a look at the diagram, and count in how many states this NFA can be before any input is read. If you remember the previous columns where I explained how eps transitions work, you should have no trouble noticing that initially, the NFA can be in any of the states {0, 1, 2, 4, 7}, because these are the states reachable by eps transitions from the initial state.

Note: the set T is reachable by eps from itself by definition (the NFA doesn't have to take an eps transition, it can also stay in its current state).

Now imagine we received the input a. What happens next? In which states can the NFA be now? This should be easy to answer. Just go over all the states the NFA can be in before the input, and see where the input a leads from them. This way, a new set emerges: {1, 2, 3, 4, 6, 7, 8}. I hope you understand why: initially, the NFA can be in states {0, 1, 2, 4, 7}. But from states 0, 1 and 4 there are no transitions on a. The only transitions on afrom that set are from state 2 (to state 3) and from state 7 (to state 8). However, the states {3, 8} is an incomplete answer. There are eps transitions from these states - to states {1, 2, 4, 6, 7}, so the NFA can actually be in any of the states {1, 2, 3, 4, 6, 7, 8}.

If you understand this, you understand mostly how the Subset Construction algorithm works. All that's left is the implementation details. But before we get to the implementation of the conversion algorithm itself, there are a couple of prerequisites.

eps-closure

Given N - an NFA and T - a set of NFA states, we would like to know which states in N are reachable from states T by eps transitions. eps-closureis the procedure that answers this question. Here is the algorithm:

algorithm eps-closure inputs: N - NFA, T -set of NFA states output: eps-closure(T)- states reachable from T by eps transitions eps-closure(T)= T foreach state t in T push(t, stack)while stack isnot empty do t = pop(stack)foreach state u with an eps edge from t to u if u isnotin eps-closure(T) add u to eps-closure(T) push(u, stack)endreturn eps-closure(T)

This algorithm iteratively finds all the states reachable by eps transitions from the states T. First, the states T themselves are added to the output. Then, one by one the states are checked for eps transitions, and the states these transitions lead to are also added to the output, and are pushed onto the stack (in order to be checked for eps transitions). The process proceeds iteratively, until no more states can be reached with epstransitions only.

For instance, for the (a|b)*abb NFA above, eps-closure({0}) = {0, 1, 2, 4, 7}, eps-closure({8, 9}) = {8, 9}, etc.

In the attached source code, the implementation of eps-closure is in the file subset_construct.cpp, function build_eps_closure. It follows the algorithm outlined above very closely, so you should have no trouble understanding it.

move - a new NFA member function

Given T - a set of NFA states, and A - an input, we would like to know which states in the NFA are reachable from T with the input A. This operation was implemented in the function move, in nfa.cpp (member of the NFA class). The function is very simple. It traverses the set T, and looks for transitions on the given input, returning the states that can be reached. It doesn't take into account the eps transitions from those states - there's eps-closure for that.

Keeping track of the input language of an NFA

For reasons that will soon become obvious, we must keep track of the input language used in the NFA. For example, for the regex (a|b)*abb the language is {a, b}. A new member was added to the NFA class for this purpose - inputs. Take a look at the implementation in nfa.cpp to see how it's managed.

DFA implementation

Since we intend to build DFAs in this column, we need a DFA implementation. A very basic implementation was coded in dfa.h - take a look at it now. The DFA's transition table is implemented with a map, that maps (state, input) pairs to states. For example, (t1, i) will be mapped to t2 if input i in state t1 leads to state t2. Note that it's not the same representation as the one I used for the NFA. There are two reasons for this difference:

I want you to see two different implementations of a graph (which is what a transition table is, as I told you before). Both are legitimate for certain purposes.
A little bit of cheating... I know which operations I'll need from the DFA, so I'm tailoring the representation to these operations. This is a very common programming trick - one should always think about the ways a data structure will be used before he designs its exact implementation.

So take a look at dfa.h - the DFA implementation is very simple - only a transition table, a start state and a set of final states. There are also methods for showing the DFA and for simulating it...

To remind you, this is the algorithm for DFA simulation :

algorithm dfa-simulate

inputs: D - DFA, I -Input

output: ACCEPT or REJECT

  s = start state of D

  i =getnext input character from I

  whilenotend of I do

    s = state reached with input i from state s

    i =getnext input character from I

  endif s is a final state

    return ACCEPT

  elsereturn REJECT

Let's now finally learn how the DFA is built.

Subset Construction

In the Introduction, I provided some observations about following NFA states, and gave an example. You saw that while it's impossible to know in what state an NFA is at any given moment, it's possible to know the set of states it can be in. Then, we can say with certainty that the NFA is in one of the states in this set, and not in any state that's not in the set.

So, the idea of Subset Construction is to build a DFA that keeps track where the NFA can be. Each state in this DFA stands for a set of states the NFA can be in after some transition. "How is this going to help us?", you may ask yourself. Good question.

Recall how we simulate a DFA (if you don't feel confident with this material, read the 2nd part of Algorithmic Forays again). When can we say that a DFA recognizes an input string? When the input string ends, we look at the state we're left in. If this is a final state - ACCEPT, if it's not a final state - REJECT.

So, say that we have a DFA, each state of which represents a set of NFA states. Since a NFA will always "pick the correct path", we can assume that if the set contains a final state, the NFA will be in it, and the string is accepted. More formally: A DFA state D represents S - a set of NFA states. If (and only if) one or more of the states in S is a final state, then D is a final state. Therefore, if a simulation ends in D, the input string is accepted.

So, you see, it's useful to keep track of sets of NFA states. This will allow us to correctly "simulate" an NFA by simulating a DFA that represents it.

Here's the Subset Construction algorithm:

algorithm subset-construction

inputs: N - NFA

output: D - DFA

  add eps-closure(N.start) to dfa_states, unmarked

  D.start = eps-closure(N.start)while there is an unmarked state T in dfa_states do

    mark(T)if T contains a final state of N

      add T to D.finalforeach input symbol i in N.inputs

      U = eps-closure(N.move(T, i))if U isnotin dfa_states

        add U to dfa_states, unmarked

      D.trans_table(T, i)= U

  end

The result of this procedure is D - a DFA with a transition table, a start state and a set of final states (not incidentally just what we need for our DFA class...).

Here are some points to help you understand how subset-construction works:

It starts by creating the initial state for the DFA. Since, as we saw in the Introduction, an initial state is really the NFA's initial state plus all the states reachable by eps transitions from it, the DFA initial state is the eps-closure of the NFA's initial state.
A state is "marked" if all the transitions from it were explored.
A state is added to the final states of the DFA if the set it represents contains the NFA's final state
The rest of the algorithm is a simple iterative graph search. Transitions are added to the DFA transition table for each symbol in the alphabet of the regex. So the DFA transition actually represents a transition to the eps-closure in each case. Recall once again: a DFA state represents a set of states the NFA can be in after a transition.

This algorithm is implemented in the function subset_construct in subset_construct.cpp, take a look at it now. Here are some points about the implementation:

dfa_states from the algorithm is represented by two sets: marked_states and unmarked_states, with the obvious meanings.
Since DFA states are really sets of NFA states (see the state_rep typedef), something must be done about numbering them properly (we want a state to be represented by a simple number in the DFA). So, the dfa_state_num map takes care of it, with the numbers generated on-demand by gen_new_state.
The loop runs while the unmarked_states set is not empty. Then, "some" state is marked by taking it out from the unmarked set (the first member of the set is picked arbitrarily) and putting it into the marked set.

I hope an example will clear it all up. Let's take our favorite regex - (a|b)*abb, and show how the algorithm runs on the NFA created from it. Here's the NFA again:

Following the algorithm: The start state of the equivalent DFA is the eps-closure of NFA state 0, which is A = {0, 1, 2, 4, 7}. So, we enter into the loop and mark A. A doesn't contain a final state, so we don't add it to the DFA's final set. The input symbol alphabet of the regex (a|b)*abb is {a, b}, so first we compute eps-closure(move(A, a)). Let's expand this:

eps-closure(move({0,1,2,4,7}, a))= eps-closure({3,8})={1,2,3,4,6,7,8}

Let's call this set B. B is not a member of dfa_states yet, so we add it there, unmarked. We also create the DFA transition D.trans_table(A, a) = B. Now we're back to the inner loop, with the input b. Of all the states in set A, the only transition on b is from 4 to 5, so we create a new set:

C = eps-closure((move(A, b))= eps-closure({5})={1,2,4,5,6,7}

So C is added, unmarked, to dfa_states. Since all the alphabet symbols are done for A, we go back to the outer loop. Are there any unmarked states? Yes, there are B and C. So we pick B and go over the process again and again, until all the sets that are states of the DFA are marked. Then, the algorithm terminates, and D.trans_table contains all the relevant transitions for the DFA. The five different sets we get for DFA states are:

A ={0,1,2,4,7}

B ={1,2,3,4,6,7,8}

C ={1,2,4,5,6,7}

D ={1,2,4,5,6,7,9}

E ={1,2,4,5,6,7,10}

A is the initial state of the DFA, since it contains the NFA initial state 0. E is obviously the only final state of the NFA, since it's the only one containing the NFA final state 10. If you follow the whole algorithm through and take note of the DFA transitions created, this is the resulting diagram:

You can easily:

See that this is a DFA - no eps transitions, no more than a single transition on each input from a state.
Verify that it indeed recognizes the regular expression (a|b)*abb

Conclusion

We've come a long way, and finally our mission is accomplished. We've created a real regular expression engine. It's not a complete engine like the one Lex or Perl have, but it's a start. The basis has been laid, the rest is just extensions. Let's see what we have:

A regular expression is read into a parse tree (implemented in regex_parse.cpp) - I didn't actually explain how this part is done, but you can trust me (and I hope you verified!) that it works.
An NFA is built from the regular expression, using Thompson's Construction (implemented in nfa.h/cpp)
The NFA is converted to a DFA, using the Subset Construction algorithm (implemented in subset_construct.h/cpp and dfa.h)
The resulting DFA is simulated using the DFA simulation algorithm (implemented in dfa.h)

A small main function is implemented in regex_parse.cpp. It reads a regex and a string as arguments, and says whether the string fits the regex. It does it by going through the steps mentioned above, so I advise you to take a look at it and make sure that you understand the order in which things are called.

你可能感兴趣的:(正则表达式)

python实现规则引擎_规则引擎python weixin_39601511 python实现规则引擎
广告关闭回望2020，你在技术之路上，有什么收获和成长么？对于未来，你有什么期待么？云+社区年度征文，各种定制好礼等你！我正在用python编写日志收集分析应用程序，我需要编写一个“规则引擎”来匹配和处理日志消息。它需要具有以下特点：正则表达式匹配消息本身消息严重性优先级的算术比较布尔运算符我设想一个例子规则可能是这样的：(message~program:messageandseverity>=h
Regular Expression 正则表达式 Aimyon_36 Data Development 正则表达式 redis 数据库
RegularExpression前言1.基本匹配2.元字符2.1点运算符.2.2字符集2.2.1否定字符集2.3重复次数2.3.1*号2.3.2+号2.3.3?号2.4{}号2.5(...)特征标群2.6|或运算符2.7转码特殊字符2.8锚点2.8.1^号2.8.2$号3.简写字符集4.零宽度断言（前后预查）4.1?=...正先行断言4.2?!...负先行断言4.3?Thefatcatsaton
Nginx从入门到实践(三) 听你讲故事啊
动静分离动静分离是将网站静态资源（JavaScript，CSS，img等文件）与后台应用分开部署，提高用户访问静态代码的速度，降低对后台应用访问。动静分离的一种做法是将静态资源部署在nginx上，后台项目部署到应用服务器上，根据一定规则静态资源的请求全部请求nginx服务器，达到动静分离的目标。rewrite规则Rewrite规则常见正则表达式Rewrite主要的功能就是实现URL的重写，Ngin
爬虫技术抓取网站数据 Bearjumpingcandy 爬虫
爬虫技术是一种自动化获取网站数据的技术，它可以模拟人类浏览器的行为，访问网页并提取所需的信息。以下是爬虫技术抓取网站数据的一般步骤：发起HTTP请求：爬虫首先会发送HTTP请求到目标网站，获取网页的内容。解析HTML：获取到网页内容后，爬虫会使用HTML解析器解析HTML代码，提取出需要的数据。数据提取：通过使用XPath、CSS选择器或正则表达式等工具，爬虫可以从HTML中提取出所需的数据，如文
互联网 Java 工程师面试题（Java 面试题四）苹果酱0567 面试题汇总与解析 java 中间件开发语言 spring boot 后端
下面列出这份Java面试问题列表包含的主题多线程，并发及线程基础数据类型转换的基本原则垃圾回收（GC）Java集合框架数组字符串GOF设计模式SOLID抽象类与接口Java基础，如equals和hashcode泛型与枚举JavaIO与NIO常用网络协议Java中的数据结构和算法正则表达式JVM底层Java最佳实JDBCDate,Time与CalendarJava处理XMLJUnit编程现在是时候给
【无标题】正则表达式笔记 qis_qis 正则表达式笔记
作用查找特殊规则的字符串编写一个正则表达式，用来查找所有以0开头，后面跟着2-3个数字，然后是一个连字号“-”，最后是7或8位数字的字符串(像010-12345678或0376-7654321)。0\d{2,3}-\d{7,8}基本匹配区分大小写cat会匹配"cat"CAt会匹配"CAt"元字符元字符是正则表达式的基本组成元素。元字符在这里跟它通常表达的意思不一样，而是以某种特殊的含义去解释。有些
python学习第七节：正则表达式一只会敲代码的小灰灰 python学习 python 学习正则表达式
python学习第七节：正则表达式正则表达式基本上在所有开发语言中都会使用到，在python中尤为重要。当我们使用python开发爬虫程序将目标网页扒下来之后我们要从网页中解析出我们想要的信息，这个时候就需要正则表达式去进行匹配。importrere的常量re模块中有9个常量，常量的值都是int类型！（知道就行）修饰符描述re.l使匹配对大小写不敏感re.L做本地化识别(locale-aware)
Linux三剑客之grep命令详解 promise524 Linux linux 服务器 python shell bash 后端运维
grep是Linux中最常用的文本搜索工具，用于在文件或文本输出中查找与指定模式匹配的行。它支持基本正则表达式、扩展正则表达式、多文件搜索、递归搜索等多种功能，非常适合过滤、搜索和提取文本内容。1.grep的基本语法grep[选项]模式[文件...]模式：搜索的文本模式，可以是普通字符串或正则表达式。[文件...]：要搜索的文件。如果没有指定文件，grep会从标准输入中读取数据。2.常用选项-i：
Linux三剑客与管道使用许琳珊
一、管道1、什么是管道linux提供管道符“|”将两个命令隔开，管道符左边命令的输出就会作为管道符右边命令的输入2、例子echo"hello123"|grep"hello"二、正则1、什么是正则正则表达式就是记录文本规则的代码2、正则的用法常用元字符代码说明.匹配除换行符以外的任意字符\w匹配字母或数字或下划线或汉字\s匹配任意的空白符\d匹配数字\b匹配单词的开始或结束^匹配字符串的开始$匹配字
Java 正则表达式详解艾伦~耶格尔 Java初级 java 正则表达式开发语言学习
正则表达式(RegularExpression，简称regex)是一种强大的文本处理工具，可以用来匹配、搜索和替换文本中的特定模式。在Java中，正则表达式由java.util.regex包提供支持。1.理解正则表达式语法正则表达式使用特殊的字符和符号来定义匹配模式。一些常用的元字符如下：.:匹配任意单个字符*:匹配前面的字符零次或多次+:匹配前面的字符一次或多次?:匹配前面的字符零次或一次[]:
Linux三剑客-sed krb___ linux 运维服务器
前言：sed是StreamEditor（字符流）的缩写，简称流编辑器。sed是操作、过滤和转换问吧内容的强大工具。sed是一次读取一行数据常用功能包括结合正则表达式对文件实现快速增删改查，其中查询的功能中最常用的两大功能是过滤（过滤指定字符串），取行（取出指定行）sed命令语法：sed[选项][sed内置命令字符][输入文件]选项参数解释-n取消默认sed的输出，常与sed内置命令p一起使用-i直
Python基础知识进阶之正则表达式_头歌python正则表达式进阶前端陈萨龙程序员 python 学习面试
最后硬核资料：关注即可领取PPT模板、简历模板、行业经典书籍PDF。技术互助：技术群大佬指点迷津，你的问题可能不是问题，求资源在群里喊一声。面试题库：由技术群里的小伙伴们共同投稿，热乎的大厂面试真题，持续更新中。知识体系：含编程语言、算法、大数据生态圈组件（Mysql、Hive、Spark、Flink）、数据仓库、Python、前端等等。网上学习资料一大堆，但如果学到的知识不成体系，遇到问题时只是
Java中的数组和字符串 RenX000 Java SE java
文章目录数组一维数组创立默认值转型多维数组可变长参数基本格式应用字符串String类StringBuilder类裁剪正则表达式检测数组数组类型本身也是类，即使是基本类型的数组也是以对象形式存在的，并不是基本数据类型一维数组int[]array=newint[10];//创建数组时需要指定长度创立类型[]变量名称=new类型[数组大小];类型变量名称[]=new类型[数组大小];//支持C语言样式，
Linux如何使用sed命令进行文本替换 yang295242361 linux 运维服务器
在Linux中，sed（StreamEditor）是一个用于处理文本流的命令行工具，它非常适合用于执行基本的文本转换。sed可以读取输入的文本文件，根据指定的指令对文本进行处理，并将结果输出到标准输出设备。以下是如何使用sed命令进行文本替换的详细说明：1.基本语法sed命令的基本语法如下：sed's/regexp/replacement/flags'fileregexp：正则表达式，用于匹配要替
Linux 运维三剑客：grep、sed 和 awk 实战案例与命令参数详解 Lyle_Tu Linux 云计算运维运维 linux chrome 云计算服务器
在Linux运维中，grep、sed和awk是三个非常强大的文本处理工具，它们在处理文本数据时发挥着重要作用。本文将通过一些实战案例，展示这三个工具的使用方法和强大功能，并对它们的命令参数进行详解。grep：文本搜索利器grep是一个强大的文本搜索工具，它使用正则表达式来匹配文本模式。以下是grep的一些常用命令参数：-i：忽略大小写进行匹配。-v：反向查找，只打印不匹配的行。-n：显示匹配行的行
python核心编程课后习题答案--第一章 NewForMe
正则表达式1-1[bh][aiu]t;1-2\w+\w+;1-3\w+,\s\w+;1-4[A-Za-z_]+[\w_]+python有效标识符的定义：1.python中的标识符是区分大小写的。2.标示符以字母或下划线开头，可包括字母，下划线和数字。3.以下划线开头的标识符是有特殊意义的。1-5\d+(\s\w+)+1-6(1)^w{3}://.+com/?$(2)^\w+://.+?\.\w{3
Java 正则表达式南风_001
正则表达式定义了字符串的模式。正则表达式可以用来搜索、编辑或处理文本。正则表达式并不仅限于某一种语言，但是在每种语言中有细微的差别。正则表达式实例一个字符串其实就是一个简单的正则表达式，例如HelloWorld正则表达式匹配"HelloWorld"字符串。.（点号）也是一个正则表达式，它匹配任何一个字符如："a"或"1"。下表列出了一些正则表达式的实例及描述：正则表达式描述thisistext匹配
Linux shell sed 命令详解 BugBear1989
详细的sed命令详解，请参考https://my.oschina.net/u/3908182/blog/1921761一、sed命令工作机制：每次读取一行文本至“模式空间(patternspace)”中，在模式空间中完成处理；将处理结果输出至标准输出设备；语法：sed[OPTION]...{script}[input-file]...参数说明-r支持扩展正则表达式-n静默模式-escript1-e
用正则表达式过滤logcat中的多个tag的日志 fc82bb084ee7
在AndroidStudio中,在过滤器的byLogTag选项中配置.我配置了2个tagfilter方便开发,1.multi-tag-filter2.ignore-multi-tag-filter.过滤出指定tag的日志信息^(?:Watchdog|InputReader|ahking)Watchdog忽略指定tag的日志信息^(?!WifiMonitor|WifiHW)有些tag的无用log非常
Python实现对哈利波特小说单词统计胜天半月子 Python基础及应用 python 字符串列表正则表达式
文章目录要求一、打开文件正则表达式spilt()函数实例二、词频统计三、单词排序四、输出或写入文件python文件写入要求对HarryPotter5.txt英文小说进行词频统计，统计出前二十个频率最高的单词，并打印输出或写入文件一、打开文件打开文件并将单词中非单词字符用空格代替代码：#读取小说内容fp=open('HarryPotter5.txt')content=fp.read()#所有标点符号
javase笔记3----正则表达式芝奥小婷笔记
正则表达式简介正则表达式（RegularExpressions），是一个特殊的字符串，可以对普通的字符串进行校验检测等工作，校验一个字符串是否满足预设的规则。基本语法字符集合[]:表示匹配括号里的任意一个字符。[abc]:匹配a或者b或者c[^abc]:匹配任意一个字符，只要不是a,或b,或c就表示匹配成功[a-z]:表示匹配所有的小写字母的任意一个。[A-Za-z]:表示匹配所有的小写字母和大写
搜索结果关键字标红 — 正则月亮消失了.974 servlet html javascript
str是你的内容，key是关键字正则表达式匹配模式支持的三个标志（newregexp的第二个参数）g:global全文搜索，不添加则搜索到第一个匹配停止；i:ignorecase忽略大小写，默认大小写敏感；m:multiplelines多行搜索highlight(str,key){ varreg=newRegExp(`(${key})`,'gi'); v
正则表达式语法、运算符优先级 weixin_54668000 mvc
正则表达式(regularexpression)描述了一种字符串匹配的模式（pattern），可以用来检查一个串是否含有某种子串、将匹配的子串替换或者从某个串中取出符合某个条件的子串等。例如：runoo+b，可以匹配runoob、runooob、runoooooob
shell脚本——正则表达式诚诚k 正则表达式
概述正则表达式是你所定义的模式模板，Linux工具可以用它来过滤文本。Linux工具（比如sed编辑器或gawk程序）能够在处理数据时使用正则表达式对数据进行模式匹配。如果数据匹配模式，它就会被接受并进一步处理；如果数据不匹配模式，它就会被滤掉。数据流--正则表达式---（1）匹配的数据（2）滤掉的数据正则表达式（或称RegularExpression，简称RE），是用于描述字符排列和匹配模式的一
正则表达式-运算符优先级一只小棉花正则表达式正则表达式-优先级
转自：http://www.runoob.com/regexp/regexp-operator.html
【Python】正则表达式丕羽 python 正则表达式 mysql
正则表达式正则表达式,全称是RegularExpression,正则表达式,即:正确的,符合特定规则的式子.用来校验和匹配数据,正则不独属于任意的一门语言,Java,Python…都支持,且:正则规则都是一样的,不同的是写法不一样.python中正则使用步骤:#1.导包importre#2.正则校验.re.match()re.search()re.compile().sub()#3.获取匹配结果.
re模块匿隱
defmain():""""""#1.compile(正则表达式)->将正则表达式转换成正则对象"""编译后可以直接通过对象调用相关的对象方法"""re_object=re.compile(r'\d{3}')re_object.fullmatch('432')#2.fullmatch(正则表达式,字符串)->让字符串和正则表达式完全匹配，匹配成功返回匹配对象，匹配失败返回None"""应用：检测字
Python 标准库一马归一码 Python python
目录1.一些常见的标准库：2.os模块的导入和使用3.re模块的导入与调用4.math模块的导入与调用5.datetime模块的导入与调用标准库：Python本身带着的一些标准的模块库，这些模块被直接构建在解析器里，虽然不是语言内置的功能，但可以高效地调用，甚至是系统级调用也可以。1.一些常见的标准库：os模块：提供了很多与操作系统相关联的函数re模块：为高级字符串处理提供了正则表达式工具，对于复
正则表达式他@ 正则表达式 php 数据库
一：正则表达式grep-a不要忽略二进制数据。-A除了显示符合范本样式的那一行之外，并显示该行之后的内容。-b在显示符合范本样式的那一行之外，并显示该行之前的内容。-c计算符合范本样式的列数。-C或-除了显示符合范本样式的那一列之外，并显示该列之前后的内容。-d当指定要查找的是目录而非文件时，必须使用这项参数，否则grep命令将回报信息并停止动作。-e指定字符串作为查找文件内容的范本样式。-E将范
14.JS-正则表达式的反向引用 WahFung_ js笔记正则表达式 js
选择字符：|com|cn|edu---选择其中一个(含有其中一个就能匹配成功)子表达式：用()包围的就是子表达式str="((/d)(/w))"第一个子表达式：((\d)(\w))第二个子表达式：(\d)第三个子表达式：(\w)子表达式：以第一个出现的(为第一个表达式捕获：将匹配到的子表达式保存在RegExp对象中RegExp.$1：保存第一个子表达式RegExp.$2：保存第二个子表达式RegE
安装数据库首次应用 Array_06 java oracle sql
可是为什么再一次失败之后就变成直接跳过那个要求 enter full pathname of java.exe的界面这个java.exe是你的Oracle 11g安装目录中例如：【F:\app\chen\product\11.2.0\dbhome_1\jdk\jre\bin】下的java.exe 。不是你的电脑安装的java jdk下的java.exe！注意第一次，使用SQL D
Weblogic Server Console密码修改和遗忘解决方法 bijian1013 Welogic
在工作中一同事将Weblogic的console的密码忘记了，通过网上查询资料解决，实践整理了一下。一.修改Console密码打开weblogic控制台，安全领域 --> myrealm -->&n
IllegalStateException: Cannot forward a response that is already committed Cwind java Servlets
对于初学者来说，一个常见的误解是：当调用 forward() 或者 sendRedirect() 时控制流将会自动跳出原函数。标题所示错误通常是基于此误解而引起的。示例代码： protected void doPost() { if (someCondition) { sendRedirect(); } forward(); // Thi
基于流的装饰设计模式木zi_鸣设计模式
当想要对已有类的对象进行功能增强时，可以定义一个类，将已有对象传入，基于已有的功能，并提供加强功能。自定义的类成为装饰类模仿BufferedReader，对Reader进行包装，体现装饰设计模式装饰类通常会通过构造方法接受被装饰的对象，并基于被装饰的对象功能，提供更强的功能。装饰模式比继承灵活，避免继承臃肿，降低了类与类之间的关系装饰类因为增强已有对象，具备的功能该
Linux中的uniq命令被触发 linux
Linux命令uniq的作用是过滤重复部分显示文件内容，这个命令读取输入文件，并比较相邻的行。在正常情况下，第二个及以后更多个重复行将被删去，行比较是根据所用字符集的排序序列进行的。该命令加工后的结果写到输出文件中。输入文件和输出文件必须不同。如果输入文件用“- ”表示，则从标准输入读取。 AD： uniq [选项] 文件说明：这个命令读取输入文件，并比较相邻的行。在正常情况下，第二个
正则表达式Pattern 肆无忌惮_ Pattern
正则表达式是符合一定规则的表达式，用来专门操作字符串，对字符创进行匹配，切割，替换，获取。例如，我们需要对QQ号码格式进行检验规则是长度6~12位不能0开头只能是数字，我们可以一位一位进行比较，利用parseLong进行判断，或者是用正则表达式来匹配[1-9][0-9]{4,14} 或者 [1-9]\d{4,14} &nbs
Oracle高级查询之OVER (PARTITION BY ..) 知了ing oracle sql
一、rank()/dense_rank() over(partition by ...order by ...) 现在客户有这样一个需求，查询每个部门工资最高的雇员的信息，相信有一定oracle应用知识的同学都能写出下面的SQL语句： select e.ename, e.job, e.sal, e.deptno from scott.emp e, (se
Python调试矮蛋蛋 python pdb
原文地址： http://blog.csdn.net/xuyuefei1988/article/details/19399137 1、下面网上收罗的资料初学者应该够用了，但对比IBM的Python 代码调试技巧： IBM：包括 pdb 模块、利用 PyDev 和 Eclipse 集成进行调试、PyCharm 以及 Debug 日志进行调试： http://www.ibm.com/d
webservice传递自定义对象时函数为空，以及boolean不对应的问题 alleni123 webservice
今天在客户端调用方法 NodeStatus status=iservice.getNodeStatus(). 结果NodeStatus的属性都是null。进行debug之后，发现服务器端返回的确实是有值的对象。后来发现原来是因为在客户端，NodeStatus的setter全部被我删除了。本来是因为逻辑上不需要在客户端使用setter，结果改了之后竟然不能获取带属性值的
java如何干掉指针，又如何巧妙的通过引用来操作指针————>说的就是java指针百合不是茶
C语言的强大在于可以直接操作指针的地址，通过改变指针的地址指向来达到更改地址的目的,又是由于c语言的指针过于强大，初学者很难掌握， java的出现解决了c，c++中指针的问题 java将指针封装在底层，开发人员是不能够去操作指针的地址，但是可以通过引用来间接的操作：定义一个指针p来指向a的地址（&是地址符号）：
Eclipse打不开，提示“An error has occurred.See the log file ***/.log” bijian1013 eclipse
打开eclipse工作目录的\.metadata\.log文件，发现如下错误： !ENTRY org.eclipse.osgi 4 0 2012-09-10 09:28:57.139 !MESSAGE Application error !STACK 1 java.lang.NoClassDefFoundError: org/eclipse/core/resources/IContai
spring aop实例annotation方法实现 bijian1013 java spring AOP annotation
在spring aop实例中我们通过配置xml文件来实现AOP，这里学习使用annotation来实现，使用annotation其实就是指明具体的aspect,pointcut和advice。1.申明一个切面(用一个类来实现)在这个切面里,包括了advice和pointcut AdviceMethods.jav
[Velocity一]Velocity语法基础入门 bit1129 velocity
用户和开发人员参考文档 http://velocity.apache.org/engine/releases/velocity-1.7/developer-guide.html 注释 1.行级注释## 2.多行注释#* *# 变量定义使用$开头的字符串是变量定义，例如$var1, $var2, 赋值使用#set为变量赋值，例
【Kafka十一】关于Kafka的副本管理 bit1129 kafka
1. 关于request.required.acks request.required.acks控制者Producer写请求的什么时候可以确认写成功，默认是0， 0表示即不进行确认即返回。 1表示Leader写成功即返回，此时还没有进行写数据同步到其它Follower Partition中 -1表示根据指定的最少Partition确认后才返回，这个在 Th
lua统计nginx内部变量数据 ronin47 lua nginx　统计
server { listen 80; server_name photo.domain.com; location /{set $str $uri; content_by_lua ' local url = ngx.var.uri local res = ngx.location.capture(
java-11.二叉树中节点的最大距离 bylijinnan java
import java.util.ArrayList; import java.util.List; public class MaxLenInBinTree { /* a. 1 / \ 2 3 / \ / \ 4 5 6 7 max=4 pass "root"
Netty源码学习-ReadTimeoutHandler bylijinnan java netty
ReadTimeoutHandler的实现思路：开启一个定时任务，如果在指定时间内没有接收到消息，则抛出ReadTimeoutException 这个异常的捕获，在开发中，交给跟在ReadTimeoutHandler后面的ChannelHandler，例如 private final ChannelHandler timeoutHandler = new ReadTim
jquery验证上传文件样式及大小(好用) cngolon 文件上传 jquery验证
<!DOCTYPE html> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <script src="jquery1.8/jquery-1.8.0.
浏览器兼容【转】 cuishikuan css 浏览器 IE
浏览器兼容问题一：不同浏览器的标签默认的外补丁和内补丁不同问题症状：随便写几个标签，不加样式控制的情况下，各自的margin 和padding差异较大。碰到频率:100% 解决方案：CSS里 *{margin:0;padding:0;} 备注：这个是最常见的也是最易解决的一个浏览器兼容性问题，几乎所有的CSS文件开头都会用通配符*来设
Shell特殊变量：Shell $0, $#, $*, $@, $?, $$和命令行参数 daizj shell $#$?特殊变量
前面已经讲到，变量名只能包含数字、字母和下划线，因为某些包含其他字符的变量有特殊含义，这样的变量被称为特殊变量。例如，$ 表示当前Shell进程的ID，即pid，看下面的代码： $echo $$ 运行结果 29949 特殊变量列表变量含义 $0 当前脚本的文件名 $n 传递给脚本或函数的参数。n 是一个数字，表示第几个参数。例如，第一个
程序设计KISS 原则-------KEEP IT SIMPLE, STUPID! dcj3sjt126com unix
翻到一本书，讲到编程一般原则是kiss：Keep It Simple, Stupid.对这个原则深有体会，其实不仅编程如此，而且系统架构也是如此。 KEEP IT SIMPLE, STUPID! 编写只做一件事情，并且要做好的程序；编写可以在一起工作的程序，编写处理文本流的程序，因为这是通用的接口。这就是UNIX哲学.所有的哲学真正的浓缩为一个铁一样的定律，高明的工程师的神圣的“KISS 原
android Activity间List传值 dcj3sjt126com Activity
第一个Activity： import java.util.ArrayList;import java.util.HashMap;import java.util.List;import java.util.Map;import android.app.Activity;import android.content.Intent;import android.os.Bundle;import a
tomcat 设置java虚拟机内存 eksliang tomcat 内存设置
转载请出自出处：http://eksliang.iteye.com/blog/2117772 http://eksliang.iteye.com/ 常见的内存溢出有以下两种: java.lang.OutOfMemoryError: PermGen space java.lang.OutOfMemoryError: Java heap space ------------
Android 数据库事务处理 gqdy365 android
使用SQLiteDatabase的beginTransaction()方法可以开启一个事务，程序执行到endTransaction() 方法时会检查事务的标志是否为成功，如果程序执行到endTransaction()之前调用了setTransactionSuccessful() 方法设置事务的标志为成功则提交事务，如果没有调用setTransactionSuccessful() 方法则回滚事务。事
Java 打开浏览器 hw1287789687 打开网址 open浏览器 open browser 打开url 打开浏览器
使用java 语言如何打开浏览器呢? 我们先研究下在cmd窗口中,如何打开网址使用IE 打开 D:\software\bin>cmd /c start iexplore http://hw1287789687.iteye.com/blog/2153709 使用火狐打开 D:\software\bin>cmd /c start firefox http://hw1287789
ReplaceGoogleCDN：将 Google CDN 替换为国内的 Chrome 插件 justjavac chrome Google google api chrome插件
Chrome Web Store 安装地址： https://chrome.google.com/webstore/detail/replace-google-cdn/kpampjmfiopfpkkepbllemkibefkiice 由于众所周知的原因，只需替换一个域名就可以继续使用Google提供的前端公共库了。同样，通过script标记引用这些资源，让网站访问速度瞬间提速吧
进程VS.线程 m635674608 线程
资料来源： http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001397567993007df355a3394da48f0bf14960f0c78753f000 1、Apache最早就是采用多进程模式 2、IIS服务器默认采用多线程模式 3、多进程优缺点优点：多进程模式最大
Linux下安装MemCached 字符串 memcached
前提准备：1. MemCached目前最新版本为：1.4.22，可以从官网下载到。2. MemCached依赖libevent，因此在安装MemCached之前需要先安装libevent。2.1 运行下面命令，查看系统是否已安装libevent。[root@SecurityCheck ~]# rpm -qa|grep libevent libevent-headers-1.4.13-4.el6.n
java设计模式之--jdk动态代理（实现aop编程） Supanccy2013 java DAO 设计模式 AOP
与静态代理类对照的是动态代理类，动态代理类的字节码在程序运行时由Java反射机制动态生成，无需程序员手工编写它的源代码。动态代理类不仅简化了编程工作，而且提高了软件系统的可扩展性，因为Java 反射机制可以生成任意类型的动态代理类。java.lang.reflect 包中的Proxy类和InvocationHandler 接口提供了生成动态代理类的能力。 &
Spring 4.2新特性-对java8默认方法(default method)定义Bean的支持 wiselyman spring 4
2.1 默认方法(default method) java8引入了一个default medthod; 用来扩展已有的接口,在对已有接口的使用不产生任何影响的情况下,添加扩展使用default关键字 Spring 4.2支持加载在默认方法里声明的bean 2.2 将要被声明成bean的类 public class DemoService {