Occasionally, you may need to process a small, special-purpose language.
Essentially, you have only a few choices. One choice is to roll your own parser (and lexical analyzer). If you are not an expert, this is hard; if you are an expert, it is still time-consuming.
An alternative choice is to use a parser generator; quite a few of these generators exist. Some of the better known are Yacc and Bison for parsers written in C, and ANTLR for parsers written in Java.
You will probably also need a scanner generator such as Lex, Flex, or JFlex to go with it. However, this means learning new tools, including their sometimes obscure error messages.
This chapter presents a third alternative. Instead of using the standalone domain-specific language of a parser generator, you will use an internal domain-specific language, or internal DSL for short. The internal DSL consists of a library of parser combinators: functions and operators defined in Scala that serve as building blocks for parsers. These building blocks map one to one onto the constructions of a context-free grammar, which makes them easy to understand.
Let's start with an example. Suppose the grammar for arithmetic expressions is as follows:
expr   ::= term { "+" term | "-" term }.
term   ::= factor { "*" factor | "/" factor }.
factor ::= floatingPointNumber | "(" expr ")".

|     denotes alternative productions
{...} denotes repetition (zero or more times)
[...] denotes an optional occurrence

Since we have the grammar, we can convert it into combinators:
import scala.util.parsing.combinator._

class Arith extends JavaTokenParsers {
  def expr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term)
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = floatingPointNumber | "(" ~ expr ~ ")"
  // Where is floatingPointNumber defined? Is it built in?
  // It is inherited from the trait JavaTokenParsers, where it reads:
  //   def floatingPointNumber: Parser[String] =
  //     """-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?[fFdD]?""".r
}

As you can see, there is a general rule for converting grammar productions into combinator expressions.
Now we have the grammar, and we have formalized it in combinator syntax. The next step is to run the parser.
// running your parser
import scala.util.parsing.combinator._

object ParseExpr extends Arith {
  def main(args: Array[String]) {
    println("input : " + args(0))
    println(parseAll(expr, args(0)))
  }
}

The key here is parseAll(expr, input), which applies the parser expr to the given input and requires that all of the input be consumed.
To run the main method:
ParseExpr.main(Array("2 * (3 + 7)"))
// input : 2 * (3 + 7)
// [1.12] parsed: ((2~List((*~(((~((3~List())~List((+~(7~List())))))~)))))~List())

ParseExpr.main(Array("2 * (3 + 7))"))   // deliberately ill-formed input
// [1.12] failure: string matching regex `\z' expected but `)' found
// 2 * (3 + 7))
//            ^

Calling main directly like this is a bit of a hack; you can also run the parser from the command line with scala ParseExpr "2 * (3 + 7)".
In the combinator above we used floatingPointNumber, which is inherited from Arith's supertrait, JavaTokenParsers. It is a regular-expression parser. The idea is that you can use any regular expression as a parser; the regular expression parses all the strings it can match, and its result is the parsed string.
import scala.util.parsing.combinator._

object MyParsers extends RegexParsers {
  val ident: Parser[String] = """[a-zA-Z_]\w*""".r

  // to test
  def main(args: Array[String]) {
    println("input " + args(0))
    println(parseAll(ident, args(0)))
  }
}

MyParsers.main(Array("joe"))
MyParsers.main(Array("-joe"))     // fails
MyParsers.main(Array("joe.wang")) // fails
MyParsers.main(Array("123"))      // fails
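The idea can be previewed with the plain scala.util.matching.Regex API, without the parser library: a regex parser succeeds on exactly the prefixes the regex matches, and parseAll additionally demands that the match cover the whole input. The helper names prefix and matchesAll below are hypothetical, introduced just for this sketch.

```scala
import scala.util.matching.Regex

object RegexDemo {
  val ident: Regex = """[a-zA-Z_]\w*""".r

  // one parse step: match the longest prefix of the input, if any
  def prefix(s: String): Option[String] = ident.findPrefixOf(s)

  // parseAll-like check: the matched prefix must be the entire input
  def matchesAll(s: String): Boolean = prefix(s).contains(s)
}
```

This mirrors the outcomes above: "joe" parses; "-joe" and "123" fail outright; "joe.wang" matches only the prefix "joe", so a parseAll-style check fails at the ".".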
A second example: a parser for JSON. The grammar is:

value   ::= obj | arr | stringLiteral | floatingPointNumber | "null" | "true" | "false".
obj     ::= "{" [members] "}".
arr     ::= "[" [values] "]".
members ::= member {"," member}.
member  ::= stringLiteral ":" value.
values  ::= value {"," value}.

A well-formed, valid JSON file could look like this:
{
  "address book": {
    "name": "John Smith",
    "address": {
      "street": "10 Market Street",
      "city": "San Francisco, CA",
      "zip": 94111
    },
    "phone numbers": [
      "408 338-4238",
      "408 111-6892"
    ]
  }
}

With the grammar in hand, we can transform it into combinator syntax:
import java.io.FileReader
import scala.util.parsing.combinator._

class JSON extends JavaTokenParsers {
  def value:  Parser[Any] = obj | arr | stringLiteral |
                            floatingPointNumber | "null" | "true" | "false"
  def obj:    Parser[Any] = "{" ~ repsep(member, ",") ~ "}"
  def arr:    Parser[Any] = "[" ~ repsep(value, ",") ~ "]"
  def member: Parser[Any] = stringLiteral ~ ":" ~ value
}

Its driver method is as follows:
object ParseJSON extends JSON {
  def main(args: Array[String]) {
    val reader = new FileReader(args(0))
    println(parseAll(value, reader))
  }
}

To run the driver method:
// scala ParseJSON address-book.json
ParseJSON.main(Array("address-book.json"))
// you might need to give the right path to the .json file
If you run the command, you might get results such as
[14.2] parsed: (({~List((("address book"~:)~(({~List((("name"~:)~"John Smith"), (("address"~:)~(({~List((("street"~:)~"10 Market Street"), (("city"~:)~"San Francisco, CA"), (("zip"~:)~94111)))~})), (("phone numbers"~:)~(([~List("408 338-4238", "408 111-6892"))~]))))~}))))~})
The problem with the output of this JSON program is that it carries no intuitive meaning. It is just a sequence composed of bits and pieces of the input glued together with lists and ~ combinations.
Looking at the code, each and every production returns just a Parser[Any]; the result does not reflect the domain model. A JSON object would more naturally produce a Map[String, Any], a JSON array a List[Any], and the literals "true", "false", and "null" would simply become true, false, and null.
The operator ^^ performs exactly this kind of result transformation: P ^^ f succeeds whenever P succeeds, and applies the function f to P's result.
Also, from the output, something like ~(~("{", ms), "}") is illegible. Why keep "{" or "}" in the result at all? The combinator library provides two operators, ~> and <~, which perform sequential composition but keep only the right or left result, respectively.
Now, equipped with the new operators, we can write:
import scala.util.parsing.combinator._

class JSON1 extends JavaTokenParsers {
  def obj: Parser[Map[String, Any]] =
    "{" ~> repsep(member, ",") <~ "}" ^^ (Map() ++ _)
  def arr: Parser[List[Any]] =
    "[" ~> repsep(value, ",") <~ "]"
  def member: Parser[(String, Any)] =
    stringLiteral ~ ":" ~ value ^^ { case name ~ ":" ~ value => (name, value) }
  def value: Parser[Any] = (
      obj
    | arr
    | stringLiteral
    | floatingPointNumber ^^ (_.toDouble)
    | "null"  ^^ (x => null)
    | "true"  ^^ (x => true)
    | "false" ^^ (x => false)
  )
}

import java.io.FileReader

object JSON1Test extends JSON1 {
  def main(args: Array[String]) {
    val reader = new FileReader(args(0))
    println(parseAll(value, reader))
  }
}

Now, if we run it, we get:
[14.2] parsed: Map("address book" -> Map("name" -> "John Smith", "address" -> Map("street" -> "10 Market Street", "city" -> "San Francisco, CA", "zip" -> 94111.0), "phone numbers" -> List("408 338-4238", "408 111-6892")))

Much easier to read, isn't it?
As a side note: because Scala automatically inserts semicolons at some line ends, you cannot break the alternatives of a production across lines arbitrarily. If you write

def value: Parser[Any] =
  obj | arr | stringLiteral | ...

everything is fine, but if a line ends right after an alternative, as in

obj  // semicolon implicitly inserted
| arr

the code does not compile as intended. Putting the whole expression in parentheses avoids the implicit semicolons and makes the code compile correctly.
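A minimal illustration of the parenthesization trick, using a toy function unrelated to the parser: inside parentheses, a newline is not treated as a statement terminator, so the expression may continue on the next line with a leading operator. The object and method names here are hypothetical.

```scala
object SemiDemo {
  def joined(xs: List[Int]): List[Int] = (
    xs
      ++ List(4)   // still the same expression: no semicolon inferred inside ( )
  )
}
```

Without the surrounding parentheses, the line `xs` would be a complete (and useless) statement, and `++ List(4)` would not compile.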
A summary of the parser combinators:

"..."            literal
"...".r          regular expression
P ~ Q            sequential composition
P <~ Q, P ~> Q   sequential composition; keep left/right only
P | Q            alternative
opt(P)           option
rep(P)           repetition
repsep(P, Q)     interleaved repetition
P ^^ f           result conversion
Why does the library use these symbolic operators rather than alphabetic names? Alphabetic names would take up too much visual real estate, and the symbolic operators were specially chosen so that their precedence decreases in just the right order: ~ binds tighter than ^^, which binds tighter than |.
Suppose for a moment that we had only alphabetic operators instead of the symbolic ones:

class ArithHypothetical extends JavaTokenParsers {
  def expr: Parser[Any] =
    term andThen rep(("+" andThen term) orElse ("-" andThen term))
  def term: Parser[Any] =
    factor andThen rep(("*" andThen factor) orElse ("/" andThen factor))
  def factor: Parser[Any] =
    floatingPointNumber orElse ("(" andThen expr andThen ")")
}
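The decreasing-precedence claim can be checked with a toy class whose operators merely record how an expression is grouped. The class P and its methods are hypothetical stand-ins, not the parser library; Scala determines precedence by an operator's first character, with ~ above ^ above |.

```scala
object PrecedenceDemo {
  case class P(repr: String) {
    def ~(q: P): P       = P(s"(${repr} ~ ${q.repr})")
    def ^^(f: String): P = P(s"(${repr} ^^ ${f})")
    def |(q: P): P       = P(s"(${repr} | ${q.repr})")
  }

  // same shape as a real production: P ~ Q ^^ f | R
  val grouped: P = P("a") ~ P("b") ^^ "f" | P("c")
}
```

The recorded grouping shows that without any parentheses the expression is read as ((a ~ b) ^^ f) | c, which is exactly the grouping a grammar production needs.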
The previous sections have shown that Scala's parser combinators provide a convenient means for constructing your own parsers.

The core of Scala's combinator parsing framework is contained in the trait scala.util.parsing.combinator.Parsers:
package scala.util.parsing.combinator

trait Parsers {
  ... // code goes here unless otherwise stated
}

// a parser is in essence just a function from some input type
// to a parser result
type Parser[T] = Input => ParseResult[T]
// Reader comes from scala.util.parsing.input; it is similar to a Stream,
// but it also keeps track of the positions of all the elements it reads
type Input = Reader[Elem]

As an example of Elem: in RegexParsers, Elem is fixed to Char, but it would also be possible to set Elem to some other type, such as the type of tokens returned by a separate lexer.
sealed abstract class ParseResult[+T]
case class Success[T](result: T, in: Input) extends ParseResult[T]
case class Failure(msg: String, in: Input) extends ParseResult[Nothing]
In Success, T stands for the type of result returned by a successful match, while the second parameter, in, refers to the input immediately following the match; it is used for chaining.

For Failure, extending ParseResult[Nothing] denotes that no result is returned; the second parameter, in, is used not for chaining but for positioning the error message.
abstract class Parser[+T] extends (Input => ParseResult[T]) { p =>
  // An unspecified method that defines
  // the behavior of this parser.
  def apply(in: Input): ParseResult[T]
  def ~ ...
  def | ...
  ...
}

Because parsers are (i.e., inherit from) functions, they need to define an apply method. You can see an abstract apply method in class Parser, but this is just for documentation, as the same method is in any case inherited from the parent type Input => ParseResult[T].
abstract class Parser[+T] extends (Input => ParseResult[T]) { p =>

A clause such as "id =>" immediately after the opening brace of a class template defines the identifier id as an alias for this in that class. It is as if you had written:
class Outer { outer =>
  class Inner {
    println(Outer.this eq outer) // prints: true
  }
}
The primitive parser elem accepts a single input element satisfying a predicate:

def elem(kind: String, p: Elem => Boolean) = new Parser[Elem] {
  def apply(in: Input) =
    if (p(in.first)) Success(in.first, in.rest)
    else Failure(kind + " expected", in)
}
// The results of sequential composition are built with the case class ~,
// which pairs elements of type T and U
abstract class Parser[+T] extends (Input => ParseResult[T]) { p =>
  def ~[U](q: => Parser[U]) = new Parser[T ~ U] {
    def apply(in: Input) = p(in) match {
      case Success(x, in1) =>
        q(in1) match {
          case Success(y, in2) => Success(new ~(x, y), in2)
          case failure => failure
        }
      case failure => failure
    }
  }
}

The other two sequential composition operators, <~ and ~>, could be defined just like ~, only with some small adjustment in how the result is computed. A more elegant technique, though, is to define them in terms of ~, as follows:
def <~ [U](q: => Parser[U]): Parser[T] = (p ~ q) ^^ { case x ~ y => x }
def ~> [U](q: => Parser[U]): Parser[U] = (p ~ q) ^^ { case x ~ y => y }
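The x ~ y patterns in these definitions are ordinary case-class matching: ~ is a case class of two elements, usable both as an infix type (A ~ B) and as an infix pattern (x ~ y). A standalone sketch, where this ~ is a local stand-in rather than the library's class:

```scala
// A two-parameter case class named ~ can appear infix in both
// types and patterns; infix types and patterns are left-associative.
case class ~[+A, +B](_1: A, _2: B)

object TildeDemo {
  // the nested result a parser would produce for "1" ~ "+" ~ "2"
  val res: String ~ String ~ String = new ~(new ~("1", "+"), "2")

  def show: String = res match {
    case x ~ op ~ y => s"$x $op $y"   // matches ~(~(x, op), y)
  }
}
```

This is why a result like ((1~+)~2) can be taken apart with the flat-looking pattern x ~ op ~ y.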
def | [U >: T](q: => Parser[U]) = new Parser[U] {
  def apply(in: Input) = p(in) match {
    case s1 @ Success(_, _) => s1
    case failure => q(in)
  }
}

If P and Q both fail, the failure message is determined by Q. This subtle choice is discussed later.
Note that the q parameter in methods ~ and | is passed by name: its type is preceded by =>. This means the actual parser argument will be evaluated only when q is needed, which should only be the case after p has run. This makes it possible to write recursive parsers like the following:
def parens = floatingPointNumber | "(" ~ parens ~ ")"

If | and ~ took by-value parameters, this definition would immediately cause a stack overflow without reading anything, because the value of parens occurs in the middle of its own right-hand side.
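The deferral provided by a by-name parameter can be seen in isolation with a toy example, unrelated to the parser code (the names here are all hypothetical): the argument expression runs only when it is first used, not when it is passed.

```scala
object ByNameDemo {
  val log = scala.collection.mutable.ListBuffer[String]()

  // `p` is by-name: it is evaluated each time the returned
  // function is invoked, not when byName itself is called
  def byName(p: => Int): () => Int = () => p

  def run(): (List[String], Int, List[String]) = {
    val thunk = byName { log += "evaluated"; 42 }
    val before = log.toList       // nothing has been evaluated yet
    val v = thunk()               // forces evaluation now
    (before, v, log.toList)
  }
}
```

This is the mechanism that lets a recursive parser mention itself on its own right-hand side without immediately recursing at definition time.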
def ^^ [U](f: T => U): Parser[U] = new Parser[U] {
  def apply(in: Input) = p(in) match {
    case Success(x, in1) => Success(f(x), in1)
    case failure => failure
  }
}
// end Parser
The parsers success and failure do not consume any input:
def success[T](v: T) = new Parser[T] {
  def apply(in: Input) = Success(v, in)
}
def failure(msg: String) = new Parser[Nothing] {
  def apply(in: Input) = Failure(msg, in)
}
def opt[T](p: => Parser[T]): Parser[Option[T]] = (
    p ^^ Some(_)
  | success(None)
)
def rep[T](p: => Parser[T]): Parser[List[T]] = (
    p ~ rep(p) ^^ { case x ~ xs => x :: xs }
  | success(List())
)
def repsep[T](p: => Parser[T], q: => Parser[Any]): Parser[List[T]] = (
    p ~ rep(q ~> p) ^^ { case r ~ rs => r :: rs }
  | success(List())
)
// end Parsers
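The pieces above can be assembled into a self-contained miniature of the framework. This is a sketch, not the real library code: it uses plain String input instead of Reader[Elem] and omits positions, whitespace handling, and the implicit conversions, but the definitions of ~, |, and ^^ (and lit, standing in for literal/elem) mirror the ones shown above.

```scala
object MiniParsers {

  // simplified ParseResult: the remaining input replaces Reader[Elem]
  sealed abstract class Result[+T]
  case class Success[T](result: T, rest: String) extends Result[T]
  case class Failure(msg: String, rest: String) extends Result[Nothing]

  // the pairing case class behind the ~ combinator
  case class ~[+A, +B](_1: A, _2: B)

  abstract class Parser[+T] extends (String => Result[T]) { p =>

    // sequential composition: run p, then q on the remaining input
    def ~[U](q: => Parser[U]): Parser[T ~ U] = new Parser[T ~ U] {
      def apply(in: String) = p(in) match {
        case Success(x, in1) => q(in1) match {
          case Success(y, in2) => Success(new ~(x, y), in2)
          case f: Failure      => f
        }
        case f: Failure => f
      }
    }

    // alternative: try p; on failure, try q on the same input (backtracking)
    def |[U >: T](q: => Parser[U]): Parser[U] = new Parser[U] {
      def apply(in: String) = p(in) match {
        case s @ Success(_, _) => s
        case _                 => q(in)
      }
    }

    // result conversion: transform a successful result with f
    def ^^[U](f: T => U): Parser[U] = new Parser[U] {
      def apply(in: String) = p(in) match {
        case Success(x, in1) => Success(f(x), in1)
        case fail: Failure   => fail
      }
    }
  }

  // a literal parser, standing in for the library's implicit literal(...)
  def lit(s: String): Parser[String] = new Parser[String] {
    def apply(in: String) =
      if (in.startsWith(s)) Success(s, in.substring(s.length))
      else Failure("'" + s + "' expected", in)
  }
}
```

For example, lit("1") ~ lit("+") ~ lit("2") ^^ { case x ~ _ ~ y => x.toInt + y.toInt } parses the input "1+2" into the value 3, exercising the same operator precedence as the real library.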
The trait RegexParsers is defined like this:

trait RegexParsers extends Parsers {

It fixes Elem to Char:

type Elem = Char

and it defines two implicit conversions:

implicit def literal(s: String): Parser[String] = ...
implicit def regex(r: Regex): Parser[String] = ...

Because of the implicit modifier, these conversions are applied automatically. This is why you can write string literals and regular expressions directly in your grammar: a parser such as "(" ~ expr ~ ")" is automatically expanded to literal("(") ~ expr ~ literal(")").
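How the implicit lifting works can be sketched with a hypothetical Lit class (not the real library): whenever a plain string appears where a Lit is expected, or where a method such as ~ is called on it, the compiler inserts the conversion.

```scala
import scala.language.implicitConversions

object ImplicitDemo {
  // toy parser-like value that just records what it is composed of
  case class Lit(desc: String) {
    def ~(q: Lit): Lit = Lit(desc + " ~ " + q.desc)
  }

  // plays the role of RegexParsers' implicit literal conversion
  implicit def literal(s: String): Lit = Lit("\"" + s + "\"")

  // "(" is converted to literal("("); ")" is converted as the argument of ~
  val p: Lit = "(" ~ Lit("expr") ~ ")"
}
```

The recorded description shows that the expression was silently rewritten to literal("(") ~ Lit("expr") ~ literal(")").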
RegexParsers also skips white space between symbols, using the regular expression:

protected val whiteSpace = """\s+""".r

If you do not want this behavior, you can override whiteSpace:

object MyParsers extends RegexParsers {
  override val whiteSpace = "".r
  ...
}
The task of syntax analysis is often split into two phases.

The lexer phase recognizes individual words in the input and classifies them into token classes; this phase is also called lexical analysis.

The syntactic analysis phase combines the tokens into larger structures; it is sometimes simply called parsing.

The parsers described in the previous sections can be used for either phase, because their input elements are of the abstract type Elem.
Scala's parsing combinators provide several utility classes for lexical and syntactic analysis. They are contained in two sub-packages, one for each kind of analysis:

scala.util.parsing.combinator.lexical
scala.util.parsing.combinator.syntactical
Which error should be reported when an input fails to parse? Scala's parsing library implements a simple heuristic: among all failures, the one that occurred at the latest position in the input is chosen.

Suppose we give the JSON parser the illegal input:

{ "name" : John

You will get the following message:

[1.13] failure: "false" expected but identifier John found

{ "name" : John
           ^

The message mentions "false" because it is the last alternative of the value production to fail at that position. A better error message can be engineered by adding a "catch-all" failure as the last alternative:
def value: Parser[Any] =
  obj | arr | stringLiteral | floatingPointNumber |
  "null" | "true" | "false" | failure("illegal start of value")

With this change you get the error message:

[1.13] failure: illegal start of value

{ "name" : John
           ^

How does this happen? Internally, the combinator library keeps a lastFailure variable that records the failure at the latest input position:
var lastFailure : Option[Failure] = None
The field is initialized to None and updated in the constructor of the Failure class:
case class Failure(msg: String, in: Input)
    extends ParseResult[Nothing] {
  if (!lastFailure.isDefined || lastFailure.get.in.pos <= in.pos)
    lastFailure = Some(this)
}

It is used by the phrase method, which emits the final error message if the parser failed. Here is a slightly simplified rendition of phrase in trait Parsers:

def phrase[T](p: Parser[T]) = new Parser[T] {
  lastFailure = None
  def apply(in: Input) = p(in) match {
    case s @ Success(out, in1) =>
      if (in1.atEnd) s
      else Failure("end of input expected", in1)
    case f: Failure =>
      lastFailure
  }
}

Note that lastFailure is updated as a side effect, both by the constructor of Failure and by the phrase method itself.
These combinators parse by backtracking: when one alternative fails, the input position is reset and the next alternative is tried. Mainstream compilers, by contrast, usually do not use backtracking. Why not? Backtracking imposes restrictions on the grammar:

1. The grammar must avoid left recursion. A production such as

   expr ::= expr "+" term | term

   would never make progress: expr calls itself immediately without consuming any input, so it recurses forever.

2. Backtracking is potentially costly, because the same input can be parsed several times. Consider

   expr ::= term "+" expr | term

   If the first alternative fails only after the term has been read, the parser backtracks and parses the same term a second time.
So it is common to modify the grammar so that backtracking is avoided; e.g., either of the following works:

expr ::= term ["+" expr].
expr ::= term {"+" term}.

Grammars formulated this way are so-called LL(1) grammars; a parser for such a grammar never needs to backtrack. The combinator library lets you express this expectation with the ~! operator, which works like ~ but commits to a parse: it never backtracks to alternatives that were tried earlier:

def expr: Parser[Any] = term ~! rep("+" ~! term | "-" ~! term)
def term: Parser[Any] = factor ~! rep("*" ~! factor | "/" ~! factor)
def factor: Parser[Any] = "(" ~! expr ~! ")" | floatingPointNumber
A downside of combinator parsing in Scala: it is not very efficient compared to hand-written or generated parsers.