CS536 Reading Notes: The Scanner

The Scanner

Overview

  • use a scanner generator such as lex or jlex
  • The input to a scanner includes one regular expression for each token

Finite State Machine

  • S is the start state, each FSM has exactly one.
  • A is a final state, drawn using a double circle. A FSM may have more than one final state.
  • The finite-state machine stops when it gets stuck or when it has consumed all of the input characters.

An input string is accepted by a FSM if

* The entire string is consumed, and
* The machine ends in a final state.

Formal Definition

An FSM is a 5-tuple: (Q,Σ,δ,q,F)
  • Q is a finite set of states ({S,A,B} in the above example).
  • Σ (an uppercase sigma) is the alphabet of the machine, a finite set of characters that label the edges ({+,−,0,1,...,9} in the above example).
  • q is the start state, an element of Q (S in the above example).
  • F is the set of final states, a subset of Q ({B} in the above example).
  • δ is the state transition relation: Q×Σ→Q (i.e., it is a function that takes two arguments -- a state in Q and a character in Σ -- and returns a state in Q).

Regular Expressions

Pay attention to the precedence of operators.
"|" < "." < "*"

for example: letter(letter|digit)* is different with
letter.letter|digit*, which is (letter.letter)|(digit*).

Using Regular Expressions and FSMs to Define a Scanner

There is a theorem that says that for every regular expression, there is a finite-state machine that defines the same language, and vice versa.

你可能感兴趣的:(CS536 Reading Notes: The Scanner)