Scala - Combinator Parsing

Background

Occasionally, you may need to process a small, special-purpose language.
Essentially, you have only a few choices. One choice is to roll your own parser (and lexical analyzer). If you are not an expert, this is hard; if you are an expert, it is still time consuming.
An alternative choice is to use a parser generator. There exist quite a few of these generators; some of the better known are Yacc and Bison for parsers written in C, and ANTLR for parsers written in Java.
You will probably also need a scanner generator such as Lex, Flex, or JFlex to go with it. However, you then need to learn new tools, including their sometimes obscure error messages.
This chapter presents a third alternative. Instead of using the standalone domain-specific language of a parser generator, you will use an internal domain-specific language, or internal DSL for short. The internal DSL consists of a library of parser combinators: functions and operators defined in Scala that serve as building blocks for parsers. These building blocks map one to one to the constructions of a context-free grammar, which makes them easy to understand.
We will show this by example.

Example : Arithmetic expressions

Suppose that the grammar for arithmetic expressions is as follows.

/*
expr ::= term { "+" term | "-" term }.
term ::= factor { "*" factor | "/" factor }.
factor ::= floatingPointNumber | "(" expr ")".

|    : denotes alternative productions
{...}: denotes repetition (zero or more times)
[...]: denotes an optional occurrence
*/
Given the grammar, we can translate it into combinators.

import scala.util.parsing.combinator._

class Arith extends JavaTokenParsers {
  def expr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term)
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  // floatingPointNumber is not defined here; it is inherited from trait JavaTokenParsers, where it is defined as:
  //   def floatingPointNumber: Parser[String] =
  //     """-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?[fFdD]?""".r
  def factor: Parser[Any] = floatingPointNumber | "(" ~ expr ~ ")"
}
There is a general recipe for converting a grammar to combinator expressions:
  1. Every production becomes a method.
  2. The result of each method is Parser[Any], so you change ::= to ": Parser[Any] =". The result type can be made more precise, as shown later.
  3. In the grammar, sequential composition was implicit; in the program, it is expressed by an explicit operator: ~.
  4. Repetition is expressed as rep(...) instead of {...}. Analogously, option is expressed as opt(...) instead of [...].
  5. The period (.) at the end of each production is omitted. You can, however, write a semicolon (;) if you prefer.
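
The arithmetic grammar never uses [...], so here is a minimal sketch that applies the same rules to a made-up grammar with an optional part. The grammar, the class name NumberList and the helper accepts are hypothetical and only for illustration; decimalNumber is inherited from JavaTokenParsers. The sketch assumes the scala-parser-combinators module is on the classpath.

```scala
import scala.util.parsing.combinator._

/*
Hypothetical grammar (illustration only):
numbers ::= "[" [ number { "," number } ] "]".
number  ::= [ "-" ] decimalNumber.
*/
class NumberList extends JavaTokenParsers {
  // [...] becomes opt(...), {...} becomes rep(...), juxtaposition becomes ~
  def numbers: Parser[Any] = "[" ~ opt(number ~ rep("," ~ number)) ~ "]"
  def number: Parser[Any] = opt("-") ~ decimalNumber
}

object NumberListDemo extends NumberList {
  // true if the whole input matches the numbers production
  def accepts(input: String): Boolean = parseAll(numbers, input).successful
}
```

With these definitions, NumberListDemo.accepts("[1, -2.5, 3]") and NumberListDemo.accepts("[]") should both hold, while a trailing comma as in "[1,]" is rejected.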

Running the Parser

Now we have the grammar and have expressed it in combinator syntax. The next step is to run the parser on some input.

// running your parser
import scala.util.parsing.combinator._
object ParseExpr extends Arith {
  def main(args: Array[String]): Unit = {
    println("input : " + args(0))
    println(parseAll(expr, args(0)))
  }
}
The key call here is parseAll(expr, input), which applies the parser expr to the whole input.

To try it out, call the main method directly:

ParseExpr.main(Array[String]("2 * (3 + 7)"))
//
// input : 2 * (3 + 7)
// [1.12] parsed: ((2~List((*~(((~((3~List())~List((+~(7~List())))))~)))))~List())

ParseExpr.main(Array[String]("2 * (3 + 7))")) // deliberately malformed: one closing parenthesis too many
// 
//[1.12] failure: string matching regex `\z' expected but `)' found
//
//2 * (3 + 7))
//           ^
Calling main directly like this is something of a hack; you can also run the parser from the command line with scala ParseExpr "2 * (3 + 7)".
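
Printing the result is fine for experiments, but a program usually wants to inspect it. The following sketch matches on the result of parseAll; the names ArithCheck and check are hypothetical, the grammar simply repeats Arith so that the block stands on its own, and it assumes the scala-parser-combinators module is available.

```scala
import scala.util.parsing.combinator._

// ArithCheck and check are hypothetical names; the grammar repeats Arith.
object ArithCheck extends JavaTokenParsers {
  def expr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term)
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = floatingPointNumber | "(" ~ expr ~ ")"

  // Right(parse result) on success, Left(message with position) otherwise.
  def check(input: String): Either[String, Any] =
    parseAll(expr, input) match {
      case Success(result, _) => Right(result)
      case Failure(msg, next) => Left(s"failure at ${next.pos}: $msg")
      case Error(msg, next)   => Left(s"error at ${next.pos}: $msg")
    }
}
```

ArithCheck.check("2 * (3 + 7)") yields a Right, while the malformed input with the extra parenthesis yields a Left carrying the position and message.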


Basic regular expressions parsers

The Arith parser above uses floatingPointNumber, which is inherited from Arith's supertrait, JavaTokenParsers. That parser is itself built from a regular expression parser. The idea is that you can use any regular expression as a parser: the regular expression parses all strings it can match, and the result is the parsed string.


import scala.util.parsing.combinator._

object MyParsers extends RegexParsers { 
  val ident : Parser[String] = """[a-zA-Z_]\w*""".r
  
  // to test
  def main(args: Array[String]): Unit = {
    println("input " + args(0))
    println(parseAll(ident, args(0)))
  }
}

MyParsers.main(Array[String]("joe"))
MyParsers.main(Array[String]("-joe"))   // fail
MyParsers.main(Array[String]("joe.wang")) // fail
MyParsers.main(Array[String]("123")) // fail

Another example : JSON

/*
value ::= obj | arr | stringLiteral |
          floatingPointNumber |
          "null" | "true" | "false".
obj ::= "{" [members] "}".
arr ::= "[" [values] "]".
members ::= member {"," member}.
member ::= stringLiteral ":" value.
values ::= value {"," value}.
*/
A well-formed JSON file could look like this:
{
   "address book": { 
      "name": "John Smith",
      "address" : {
         "street": "10 Market Street",
         "city" : "San Francisco, CA",
         "zip"  : 94111 
      },
      "phone numbers": [
         "408 338-4238",
         "408 111-6892"
      ]
   }
}
With the grammar in hand, we can translate it into combinator syntax.
import java.io.FileReader 
import scala.util.parsing.combinator._

class JSON extends JavaTokenParsers {
  def value : Parser[Any] = obj | arr | stringLiteral | floatingPointNumber | "null" | "true" | "false"
  def obj : Parser[Any] = "{" ~ repsep(member, ",") ~ "}"
  def arr : Parser[Any] = "["~  repsep(value, ",") ~ "]"
  def member : Parser[Any] = stringLiteral ~ ":" ~ value 
}
The driver that runs the parser is as follows.
object ParseJSON extends JSON {
  def main(args: Array[String]): Unit = {
    val reader = new FileReader(args(0))
    println(parseAll(value, reader))
  }
}
To run the driver:
// scala ParseJSON address-book.json
ParseJSON.main(Array[String]("address-book.json")) // you might need to give the right path to the *.json file

Parser output

If you run the command, you might get results such as 

[14.2] parsed: (({~List((("address book"~:)~(({~List((("name"~:)~"John Smith"),
(("address"~:)~(({~List((("street"~:)~"10 Market Street"), (("city"~:)~"San Fran
cisco, CA"), (("zip"~:)~94111)))~})), (("phone numbers"~:)~(([~List("408 338-423
8", "408 111-6892"))~]))))~}))))~})

The output of the previous JSON program has no intuitive meaning. It seems to be a sequence composed of bits and pieces of the input, glued together with lists and ~ combinations.

Looking at the code, every production returns just a Parser[Any], which does not reflect the data model. A more natural result type for a JSON object would be Map[String, Any], a JSON array should be a List[Any], and the values "true" and "false" should become the Booleans true and false.

The ^^ operator lets you transform the result of a parser: P ^^ f applies the function f to the result of P whenever P succeeds.

Also, results such as ~(~("{", ms), "}") are hard to read. Why keep the "{" and "}" in the result at all? The combinator library provides two operators, ~> and <~, that work like ~ but keep only the result of their right or left operand, respectively.

Equipped with these operators, we can write:

import scala.util.parsing.combinator._

class JSON1 extends JavaTokenParsers { 
  def obj : Parser[Map[String, Any]] = 
    "{" ~> repsep (member, ",") <~ "}" ^^ (Map() ++ _)
  
  def arr : Parser[List[Any]] = 
     "[" ~> repsep (value, ",") <~ "]"
  
  def member : Parser[(String, Any)] = 
    stringLiteral ~ ":" ~ value ^^
      { case name ~ ":" ~value => (name, value) }
  
   def value : Parser[Any] = (
      obj 
    | arr
    | stringLiteral
    | floatingPointNumber ^^ (_.toDouble)
    | "null" ^^ (x => null)
    | "true" ^^ (x => true)
    | "false" ^^ (x => false)
    )
   
}

import java.io.FileReader

object JSON1Test extends JSON1 {
  def main(args: Array[String]): Unit = {
    val reader = new FileReader(args(0))
    println(parseAll(value, reader))
  }
}
Now, if we run it, we get:

[14.2] parsed: Map("address book" -> Map("name" -> "John Smith", "address" -> Ma
p("street" -> "10 Market Street", "city" -> "San Francisco, CA", "zip" -> 94111.
0), "phone numbers" -> List("408 338-4238", "408 111-6892")))
Much easier to read, isn't it?
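
The same operators can make the arithmetic parser compute a value instead of returning nested ~ results. This is a sketch under the assumption that every production should produce a Double; the object name ArithEval is hypothetical, and the scala-parser-combinators module is assumed to be on the classpath.

```scala
import scala.util.parsing.combinator._

// Hypothetical ArithEval: the Arith grammar, but each production returns a Double.
object ArithEval extends JavaTokenParsers {
  def expr: Parser[Double] = term ~ rep("+" ~ term | "-" ~ term) ^^ {
    // fold the (operator ~ operand) pairs from left to right
    case first ~ rest => rest.foldLeft(first) {
      case (acc, "+" ~ t) => acc + t
      case (acc, _ ~ t)   => acc - t
    }
  }

  def term: Parser[Double] = factor ~ rep("*" ~ factor | "/" ~ factor) ^^ {
    case first ~ rest => rest.foldLeft(first) {
      case (acc, "*" ~ f) => acc * f
      case (acc, _ ~ f)   => acc / f
    }
  }

  // ~> and <~ drop the parentheses from the result
  def factor: Parser[Double] =
    floatingPointNumber ^^ (_.toDouble) | "(" ~> expr <~ ")"
}
```

ArithEval.parseAll(ArithEval.expr, "2 * (3 + 7)").get evaluates to 20.0. Folding from the left keeps subtraction and division left-associative, so "10 - 4 - 3" gives 3.0.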

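As a side note, be careful with Scala's semicolon inference when you spread alternatives over several lines. If each line ends with the | operator, no semicolon is inferred, so you can write:

def value : Parser[Any] = 
  obj | 
  arr |
  stringLiteral |
  ...
If instead each line starts with |, a semicolon is implicitly inserted after the first line, and the code does not compile:
  obj; // semicolon implicitly inserted
| arr
Putting the whole expression in parentheses, as in JSON1 above, avoids the inserted semicolon and makes the code compile correctly.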

Summary of the parser combinators:

"..."          literal
"...".r        regular expression
P~Q            sequential composition
P<~Q, P~>Q     sequential composition; keep left/right only
P | Q          alternative
opt(P)         option
rep(P)         repetition
repsep(P, Q)   interleaved repetition
P ^^ f         result conversion

Why does the library use symbolic operators? Alphabetic names would take up too much visual real estate. In addition, the symbolic operators are specially chosen so that their precedence decreases in a convenient order: ~ binds more tightly than ^^, which binds more tightly than |, so typical grammar rules need few parentheses.

Suppose for a moment that we had only alphabetic operators, say andThen for sequential composition and orElse for alternative. The arithmetic parser would then look like this:

class ArithHypothetical extends JavaTokenParsers { 
  def expr : Parser[Any] = 
    term andThen rep(("+" andThen term) orElse 
         ("-" andThen term))
  def term: Parser[Any] = 
       factor andThen rep(("*" andThen factor) orElse 
         ("/" andThen factor))
  def factor : Parser[Any] = floatingPointNumber orElse ("(" andThen expr andThen ")")
}

Implementing combinator parsers

The previous sections have shown that Scala's combinator parsers provide a convenient means for constructing your own parsers.

The core of Scala's combinator parsing framework is contained in the trait scala.util.parsing.combinator.Parsers:

package scala.util.parsing.combinator

trait Parsers {
  ... // code goes here unless otherwise stated
}

// a parser is in essence just a function from some input type to a parse result
type Parser[T] = Input => ParseResult[T]

Parser input

// Reader comes from scala.util.parsing.input. It is similar to a Stream, but it also keeps track of the positions of all the elements it reads.
type Input = Reader[Elem]
Elem is an abstract type member of Parsers. For RegexParsers, Elem is fixed to Char, but it would also be possible to set Elem to some other type, such as the type of tokens returned from a separate lexer.

Parser Results

sealed abstract class ParseResult[+T] 
case class Success[T] (result : T, in :Input) extends ParseResult[T]
case class Failure (msg: String, in :Input) extends ParseResult[Nothing]

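In Success, T is the type of the result returned by a successful match, and the second parameter, in, is the input that remains after the match; it is used for chaining parsers.
Failure carries no result (its type parameter is Nothing). Instead, msg describes why the parser failed, and in is used not for chaining but for positioning the error message in the input.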

The parser class

abstract class Parser[+T] extends (Input => ParseResult[T]) { p =>
   // an unspecified method that defines
   // the behavior of this parser
   def apply(in: Input): ParseResult[T]
   def ~ ...
   def | ...
}
Because parsers are (that is, inherit from) functions, they need to define an apply method. You can see an abstract apply method in class Parser, but this is just for documentation, as the same method is in any case inherited from the parent type Input => ParseResult[T]. (Input => ParseResult[T] is an abbreviation for scala.Function1[Input, ParseResult[T]].)

Alias this

abstract class Parser[+T] extends (Input => ParseResult[T]) { p =>
A clause such as "id =>" immediately after the opening brace of a class template defines the identifier id as an alias for this in the class. It is as if you had written:
val id = this
Such an alias is also handy when you need to refer to the this of an outer class:
class Outer { outer =>
  class Inner {
    println(Outer.this eq outer) // prints: true
  }
}

Single-token parsers

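The elem method builds a parser that accepts a single input element satisfying the predicate p; kind describes what was expected and is used in the error message.

def elem(kind: String, p: Elem => Boolean) =
  new Parser[Elem] {
    def apply(in: Input) =
      if (p(in.first)) Success(in.first, in.rest)
      else Failure(kind + " expected", in)
  }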

Sequential composition

// ~ is also the name of a case class: T ~ U holds a left element of type T and a right element of type U
abstract class Parser[+T] extends (Input => ParseResult[T]) { p =>
   def ~ [U](q: => Parser[U]) = new Parser[T ~ U] {
     def apply(in: Input) = p(in) match {
       case Success(x, in1) =>
         q(in1) match {
           case Success(y, in2) => Success(new ~(x, y), in2)
           case failure => failure
         }
       case failure => failure
     }
   }
}
The other two sequential composition operators, <~ and ~>, could be defined just like ~, with only a small adjustment in how the result is computed. A more elegant technique, though, is to define them in terms of ~ as follows.
def <~ [U](q: => Parser[U]): Parser[T] = (p ~ q) ^^ { case x ~ y => x }
def ~> [U](q: => Parser[U]): Parser[U] = (p ~ q) ^^ { case x ~ y => y }

Alternative composition

def | (q: => Parser[T]) = new Parser[T] {
  def apply(in: Input) = p(in) match {
    case s1 @ Success(_, _) => s1
    case failure => q(in)
  }
}
If P and Q both fail, the failure message is determined by Q. This subtle choice is discussed later.

Dealing with recursion

Note that the q parameter of the methods ~ and | is by-name: its type is preceded by =>. This means that the actual parser argument is evaluated only when q is needed, which should only be the case after p has run. This makes it possible to write recursive parsers like the following:

def parens = floatingPointNumber | "(" ~ parens ~ ")"
If | and ~ took by-value parameters, this definition would immediately cause a stack overflow without reading anything, because the value of parens occurs in the middle of its own right-hand side.

Result conversion

def ^^ [U](f : T => U) : Parser[U] = new Parser[U] {
  def apply(in : Input) = p(in) match { 
    case Success(x, in1) => Success(f(x), in1)
    case failure => failure
  }
} // end Parser

Parsers that don't read any input

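The parsers success and failure do not consume any input: success(v) always succeeds with the result v, and failure(msg) always fails with the message msg.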

def success[T] (v : T) = new Parser[T] { 
   def apply(in :Input) = Success(v, in)
}

def failure(msg : String) = new Parser[Nothing] {
  def apply(in : Input) = Failure(msg, in)
}

Option and Repetition 

def opt[T](p : => Parser[T]): Parser[Option[T]] = (
  p ^^ Some(_)
| success(None)
)

def rep[T](p : => Parser[T]): Parser[List[T]] = (
  p ~ rep(p) ^^ { case x~xs => x :: xs } 
| success(List())
)

def repsep[T](p : => Parser[T], q : => Parser[Any]): Parser[List[T]] = (
  p ~ rep(q ~> p) ^^ { case r~rs => r :: rs } 
| success(List())
) // end Parsers
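
To see how the pieces of this section fit together, here is a self-contained miniature of the framework that needs no library at all. It is a simplified sketch, not the library's actual code: MiniParsers and the example parsers digit, number, plus and sum are hypothetical names, and Input is reduced to a string plus an offset instead of a Reader[Elem].

```scala
object MiniParsers {
  // Simplified stand-in for Reader[Char]: the whole string plus the current offset.
  final case class Input(source: String, offset: Int) {
    def atEnd: Boolean = offset >= source.length
    def first: Char = source.charAt(offset)
    def rest: Input = Input(source, offset + 1)
  }

  sealed abstract class ParseResult[+T]
  final case class Success[+T](result: T, in: Input) extends ParseResult[T]
  final case class Failure(msg: String, in: Input) extends ParseResult[Nothing]

  // Result of sequential composition.
  final case class ~[+A, +B](_1: A, _2: B)

  abstract class Parser[+T] extends (Input => ParseResult[T]) { p =>
    def ~[U](q: => Parser[U]): Parser[T ~ U] = new Parser[T ~ U] {
      def apply(in: Input): ParseResult[T ~ U] = p(in) match {
        case Success(x, in1) => q(in1) match {
          case Success(y, in2) => Success(new ~(x, y), in2)
          case f: Failure      => f
        }
        case f: Failure => f
      }
    }

    def |[U >: T](q: => Parser[U]): Parser[U] = new Parser[U] {
      def apply(in: Input): ParseResult[U] = p(in) match {
        case s @ Success(_, _) => s
        case _: Failure        => q(in) // backtrack: run q on the same input
      }
    }

    def ^^[U](f: T => U): Parser[U] = new Parser[U] {
      def apply(in: Input): ParseResult[U] = p(in) match {
        case Success(x, in1) => Success(f(x), in1)
        case f: Failure      => f
      }
    }
  }

  def elem(kind: String, p: Char => Boolean): Parser[Char] = new Parser[Char] {
    def apply(in: Input): ParseResult[Char] =
      if (!in.atEnd && p(in.first)) Success(in.first, in.rest)
      else Failure(kind + " expected", in)
  }

  def success[T](v: T): Parser[T] = new Parser[T] {
    def apply(in: Input): ParseResult[T] = Success(v, in)
  }

  def rep[T](p: => Parser[T]): Parser[List[T]] =
    (p ~ rep(p) ^^ { case x ~ xs => x :: xs }) | success(List.empty[T])

  // Example grammar: sum ::= number { "+" number }.
  val digit: Parser[Char] = elem("digit", _.isDigit)
  val number: Parser[Int] =
    digit ~ rep(digit) ^^ { case d ~ ds => (d :: ds).mkString.toInt }
  val plus: Parser[Char] = elem("'+'", _ == '+')
  val sum: Parser[Int] = number ~ rep(plus ~ number) ^^ {
    case n ~ rest => rest.foldLeft(n) { case (acc, _ ~ m) => acc + m }
  }
}
```

Applying MiniParsers.sum to Input("12+30+8", 0) returns a Success holding 50 and the input positioned at offset 7. Unlike the elem shown earlier, this version checks atEnd before reading, because the simplified Input has no end-of-input element. The recursive rep is not stack-safe on very long inputs; it mirrors the definition above rather than aiming for efficiency.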

String Literals and regular expressions

RegexParsers is defined as a subtrait of Parsers:

trait RegexParsers extends Parsers {
It works on inputs that are sequences of characters, so it fixes Elem to Char:
type Elem = Char
It also defines two methods:
implicit def literal(s: String): Parser[String] = ...
implicit def regex(r: Regex): Parser[String] = ...
Because both methods are implicit, they are applied automatically wherever a String or a Regex is given but a Parser is expected. This is why you can write string literals and regular expressions directly in a grammar: the parser "(" ~ expr ~ ")" is automatically expanded to literal("(") ~ expr ~ literal(")").
The RegexParsers trait also takes care of handling white space between symbols. By default it skips anything that matches the regular expression
protected val whiteSpace = """\s+""".r

You can override whiteSpace if you prefer a different treatment. For example, to skip no white space at all:

object MyParsers extends RegexParsers {
  override val whiteSpace = "".r
  ...
}
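
Overriding whiteSpace is also a common way to skip comments. The following sketch treats "//" line comments as white space; the object CommentSkipping and its parsers are hypothetical names, and the code assumes the scala-parser-combinators module is available.

```scala
import scala.util.parsing.combinator._

// Hypothetical example: "//" line comments are skipped like blanks and newlines.
object CommentSkipping extends RegexParsers {
  // one or more white-space characters or line comments
  override protected val whiteSpace = """(\s|//.*)+""".r

  val ident: Parser[String] = """[a-zA-Z_]\w*""".r
  def idents: Parser[List[String]] = rep(ident)
}
```

Parsing "a // first\n b" with CommentSkipping.parseAll(CommentSkipping.idents, ...) should give List("a", "b"), since the comment and the line break are consumed before the second identifier is matched.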

Lexing and parsing

The task of syntax analysis is often split into two phases:
  1. The lexer phase recognizes individual words in the input and classifies them into token classes. This phase is also called lexical analysis.
  2. The syntactical analysis phase analyzes the resulting sequence of tokens. It is sometimes just called parsing.
The parsers described in the previous section can be used for either phase, because their input elements are of the abstract type Elem.

Scala's parsing combinators provide several utility classes for lexical and syntactic analysis. They are contained in two sub-packages, one for each kind of analysis:

scala.util.parsing.combinator.lexical
scala.util.parsing.combinator.syntactical
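
As a rough illustration of the two-phase approach, the arithmetic grammar can be written on top of StandardTokenParsers from the syntactical package, which runs a standard lexer first and then parses the resulting tokens. This is a sketch only: the object name TokenArith and the helper parse are hypothetical, the standard lexer's numeric literals are plain digit sequences (so no floating-point numbers here), and the scala-parser-combinators module is assumed to be on the classpath.

```scala
import scala.util.parsing.combinator.syntactical.StandardTokenParsers

// Hypothetical token-based variant of the Arith grammar.
object TokenArith extends StandardTokenParsers {
  // tell the lexer which operator and bracket symbols to recognize
  lexical.delimiters ++= List("+", "-", "*", "/", "(", ")")

  def expr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term)
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = numericLit | "(" ~ expr ~ ")"

  // phase 1: lexical.Scanner turns characters into tokens; phase 2: expr parses the tokens
  def parse(input: String): ParseResult[Any] =
    phrase(expr)(new lexical.Scanner(input))
}
```

Here the string literals in the grammar stand for delimiter tokens rather than raw character sequences, which is why the delimiters have to be registered with the lexer first.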

Error reporting 

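How should a parser report an error when several alternatives have failed? Scala's parsing library implements a simple heuristic: among all failures, the one that occurred at the latest position in the input is chosen.

Suppose the JSON parser is given this faulty input:

{ "name" : John

You will get the following error message: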

[1.13] failure : "false" expected but identifier John found 
{ "name" : John
           ^
This message only reports the last alternative that was tried. A better error message can be engineered by adding a "catch-all" failure point as the last alternative of the value production:
def value: Parser[Any] =
  obj | arr | stringLiteral | floatingPointNumber | "null" | "true" | "false" | failure("illegal start of value")
You might now get this error message:
[1.13] failure : illegal start of value
{ "name" : John
           ^
How does this work? The combinator framework keeps a lastFailure variable:
var lastFailure: Option[Failure] = None

The field is initialized to None. It is updated in the constructor of the Failure class:

case class Failure(msg: String, in: Input) extends ParseResult[Nothing] {
   // record this failure unless a previous failure occurred at a later input position
   if (lastFailure.forall(last => !(in.pos < last.in.pos)))
      lastFailure = Some(this)
}
The field is read by the phrase method, which emits the final error message if the parser fails. Here is a simplified implementation of phrase in trait Parsers:

def phrase[T](p: Parser[T]) = new Parser[T] {
   lastFailure = None
   def apply(in: Input) = p(in) match {
      case s @ Success(out, in1) =>
        if (in1.atEnd) s
        else Failure("end of input expected", in1)
      case f: Failure =>
        lastFailure.getOrElse(f) // report the failure at the latest position
   }
}
The lastFailure variable is thus updated as a side effect, both by the constructor of Failure and by the phrase method itself.

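Backtracking versus LL(1)

The parser combinators use backtracking to choose between the alternatives of P | Q: if P fails, Q is run on the same input as P, even if P has already read some input before failing. Mainstream compilers generally avoid backtracking, mainly because of its cost.

Backtracking also imposes restrictions on how a grammar can be formulated:

   1. Avoid left-recursive productions

For example, a production such as

expr ::= expr "+" term | term

never makes progress, because expr immediately calls itself.

   2. Backtracking may be too costly

The same input can be parsed several times. Consider the production

expr ::= term "+" expr | term

applied to the input "(1 + 2) * 3", which is a legal term. The first alternative is tried and fails when it looks for the + sign after the term. Then the second alternative is tried on the same input and succeeds, so the term ends up being parsed twice.

It is therefore common to modify the grammar so that backtracking can be avoided. For arithmetic expressions, either one of the following productions works:

expr ::= term ["+" expr].

expr ::= term {"+" term}.
Many languages admit so-called LL(1) grammars. When a combinator parser is formed from such a grammar, it never backtracks.

To express the expectation that a grammar is LL(1), the library provides a new operator, ~!. This operator is like sequential composition ~, but it never backtracks to "un-read" input elements that have already been parsed. Using it, the arithmetic expression parser can be rewritten as follows: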
def expr: Parser[Any] =
  term ~! rep("+" ~! term | "-" ~! term)
def term: Parser[Any] =
  factor ~! rep("*" ~! factor | "/" ~! factor)
def factor: Parser[Any] =
  "(" ~! expr ~! ")" | floatingPointNumber

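The difference between ~ and ~! can be observed on a grammar that is deliberately not LL(1). In the following sketch the object CommitDemo and its two productions are hypothetical and exist only for illustration; the scala-parser-combinators module is assumed to be on the classpath.

```scala
import scala.util.parsing.combinator._

// Hypothetical productions, for illustration only.
object CommitDemo extends JavaTokenParsers {
  // With ~, a failure after "let" lets | fall back to the second alternative.
  def backtracking: Parser[Any] = "let" ~ ident | ident

  // With ~!, a failure after "let" is final: the second alternative is never tried.
  def committed: Parser[Any] = "let" ~! ident | ident
}
```

On the input "let", backtracking succeeds because the second alternative reads "let" as an identifier, whereas committed reports an error: once "let" has been read, the parser is committed to the first alternative.
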
Conclusion

The main downside of Scala's parser combinators is that they are not very efficient, for two reasons:
  1. Compared to parsers produced by generator tools such as Yacc or Bison, the backtracking method is not efficient.
  2. They mix parser construction and input analysis in the same set of operations. In effect, a parser is generated anew for each input that is parsed.
More modern designs represent the parser as a tree instead, which has two advantages:
  1. Parser construction is factored out from input analysis, so you can construct a parser once and use it to parse all inputs.
  2. The parser generation step can use more efficient algorithms such as LALR(1).





