People sometimes ask if there's a C++ library function that given a string like "2+(3*4)" will return its value as a number. There's isn't such a function, but writing one is possible. This document illustrates how it can be done in about 50 lines of code using flex and bison. These programs generate source code to parse text that's structured (a computer program, for example, or a configuration file, or a maths expression). They're not trivial to use, but they're less error-prone than hand-crafted alternatives. The minimum of details will beexplained here - with luck you should be able to adapt the programs in smallways for your own purposes.
Before introducing the example, I'll briefly deal with parsing in general, and with"regular expressions".
If you've ever written a program that reads text lines that are in particular formats, then you've written a sort of parser. Compilers include more sophisticated parsers. In general the first stage in parsing is "lexical analysis" (or "scanning") - reading in a string of charactersand outputing a stream of strings and tokens. In this phase white spaceis usually deal with (as are comments), and keywords are replaced by symbols. lexor flex produce code for a function that does this, called yylex().
The second phase builds these elements (often known as lexemes) into expressions or phrases and"makes sense" of them. Yacc ("yet another compiler-compiler") and bison (so named because it isn't a Yak) produce code for a function called yyparse() that does this.
Both phases need to detect errors. That's not so hard, but it's difficultto produce error reports that are useful to humans, which is where hand-crafted parsers sometimes have an edge over auto-generated ones.
Sophisticated parsers come in several types. You don't need to know about the technical details, but a little of the vocabulary is usefulto know about, in case things go wrong. When a grammar is described as LR(1) thefirst L means that the input is read from Left to right - i.e, from the start of the file (or line) to the end. The R means that the parser produces right-most derivations. The alternative to this is L (left-most derivations)
The number in brackets after the LR indicates how far the parser needs to look aheadbefore being able to understand the token it's on. bison can onlyunderstand grammars where looking one token ahead removes ambiguities.
bison essentially produces LR(1) parsers. Not all grammarscan be dealt with by such parsers ("C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities" - Meta-Compilation for C++)
Regular Expressions are used with flex.You can think of Regular Expressions as a more sophisticated formof wildcard characters.On the Unix command line, ? matches a single character,and * matches 0 or more characters, so if you wanted to list allthe files whose names have an x as the 4th letter, you couldtype ls ???x*. Regular Expressions use a different notation andcan express more complex patterns. Here's a list of the features used on this page
Combining these concepts leads to Regular Expressions like the following
[0-9]+("."[0-9]+)?
This matches one or digits followed (optionally, hence thefinal ?) by a "." and one or digits - i.e. it matches positive real numbers.
The aim of this example is to write a calculator that willunderstand +, -, *, /, and ^ (power)and give them the correct precedence, understanding brackets. Numbers can beintegers, or reals, and "scientific notation" can be used. Similar code to that shown here is on many web pages.
header
tokens
%%
rules
%%
user-defined functions
bison file formatWe'll deal with the 2nd stage first. Put the following into a specification file called calc.y
%{ #include <math.h> #include <stdio.h> #include <stdlib.h> #define YYSTYPE double %} %token NUMBER %token PLUS MINUS TIMES DIVIDE POWER %token LEFT RIGHT %token END %left PLUS MINUS %left TIMES DIVIDE %left NEG %right POWER %start Input %% Input: | Input Line ; Line: END | Expression END { printf("Result: %f\n", $1); } ; Expression: NUMBER { $$=$1; } | Expression PLUS Expression { $$=$1+$3; } | Expression MINUS Expression { $$=$1-$3; } | Expression TIMES Expression { $$=$1*$3; } | Expression DIVIDE Expression { $$=$1/$3; } | MINUS Expression %prec NEG { $$=-$2; } | Expression POWER Expression { $$=pow($1,$3); } | LEFT Expression RIGHT { $$=$2; } ; %% int yyerror(char *s) { printf("%s\n", s); } int main() { if (yyparse()) fprintf(stderr, "Successful parsing.\n"); else fprintf(stderr, "error found.\n"); }
When the specification file is processed using
bison -d calc.y
it produces calc.tab.h and calc.tab.c. The '-d' makes bisonproduce the header file - flex needs the information (a shared list of tokens) in this.
header
regular expressions
%%
rules
%%
user-defined functions
flex file formatflex is what reads the input and processes it ready for bisonto use. flex and bison need to agree on some token definitions - these are generated for lex in calc.tab.h.Below is the definition file for flex. Put the code into a file called calc.lex
%{ #define YYSTYPE double #include "calc.tab.h" #include <stdlib.h> %} white [ \t]+ digit [0-9] integer {digit}+ exponent [eE][+-]?{integer} real {integer}("."{integer})?{exponent}? %% {white} { } {real} { yylval=atof(yytext); return NUMBER; } "+" return PLUS; "-" return MINUS; "*" return TIMES; "/" return DIVIDE; "^" return POWER; "(" return LEFT; ")" return RIGHT; "\n" return END;
Do
flex -o calc.lex.c calc.lex
to produce calc.lex.c.At the end of this you should have the C source code for a simple calculator. The code isn't easy to read - it's not meant to be. Tocompile it do
gcc -o calc calc.lex.c calc.tab.c -lfl -lm
Then type
./calc
and type in mathematical expressions. E.g. typing 4e2+(2^0.5*3/2)should give you 402.121320. Then try typing in 7+8) and see if you geta useful error message. Press Ctrl and C to quit.
Makefiles are often used in Unix to save time when compiling source code, andthey're useful here, because producing a final calculator program requires quite a few stages. If you put thefollowinginto a file called Makefile
calc: calc.y calc.lex bison -d calc.y flex -o calc.lex.c calc.lex gcc -o calc calc.lex.c calc.tab.c -lfl -lm
(the indented lined must use tabs - spaces aren't good enough) thentyping make will save you time whenever you want a new version of the program.
There are many ways to extend the example
extern char* yytext; int yyerror(char *s) { printf("%s - %s\n", s, yytext); }
extern char* yytext; extern int yylineno; int yyerror(char *s) { printf("%s on line %d - %s\n", s, yylineno, yytext); }and put
%option yylinenoin calc.lex (in the section where white etc are defined)you should see the line numbers.
./calc < filenameif you have a text file called filename with a few lines of arithmetic expressions in it,but neater would be the ability to do
./calc filenameyylex() reads input from yyin so if you set this using
extern FILE* yyin; yyin = fopen("filename", "r");in main, it should work, but you can make it friendlier and more flexible.
flex and bison are the successors of lex and yacc, with which they're mostly back-compatible. They're available onUnix/Linux machines, often being installed by default. On Macs they'rebundled in with Xcode. flex++ and bison++ exist for C++ users - though they can use flex and bison as above. There are C++ options in flex and bison but I've had trouble getting them to work.