BNF, Grammars, the AE Language
Getting back to the theme of the course: we want to investigate programming languages, and we want to do that using a programming language.
The first thing when we design a language is to specify the language. For this we use BNF (Backus-Naur Form). For example, here is the definition of a simple arithmetic language:
| <AE> + <AE>
| <AE> - <AE>
Explain the different parts. Specifically, this is a mixture of low-level (concrete) syntax definition with parsing.
We use this to derive expressions in some language. We start with
<AE>
, which should be one of these:
- a number
<num>
- an
<AE>
, the text “+
”, and another<AE>
- the same but with “
-
”
<num>
is a terminal: when we reach it in the derivation, we’re done.
<AE>
is a non-terminal: when we reach it, we have to continue with one
of the options. It should be clear that the +
and the -
are things
we expect to find in the input — because they are not wrapped in
<>
s.
We could specify what <num>
is (turning it into a <NUM>
non-terminal):
| <AE> + <AE>
| <AE> - <AE>
<NUM> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
| <NUM> <NUM>
But we don’t — why? Because in Racket we have numbers as primitives and we want to use Racket to implement our languages. This makes life a lot easier, and we get free stuff like floats, rationals etc.
To use a BNF formally, for example, to prove that 1-2+3
is a valid
<AE>
expression, we first label the rules:
| <AE> + <AE> (2)
| <AE> - <AE> (3)
and then we can use them as formal justifications for each derivation step:
<AE> + <AE> ; (2)
<AE> + <num> ; (1)
<AE> - <AE> + <num> ; (3)
<AE> - <AE> + 3 ; (num)
<num> - <AE> + 3 ; (1)
<num> - <num> + 3 ; (1)
1 - <num> + 3 ; (num)
1 - 2 + 3 ; (num)
This would be one way of doing this. Alternatively, we can can visualize the derivation using a tree, with the rules used at the nodes.
These specifications suffer from being ambiguous: an expression can be
derived in multiple ways. Even the little syntax for a number is
ambiguous — a number like 123
can be derived in two ways that result
in trees that look different. This ambiguity is not a “real” problem
now, but it will become one very soon. We want to get rid of this
ambiguity, so that there is a single (= deterministic) way to derive all
expressions.
There is a standard way to resolve that — we add another non-terminal to the definition, and make it so that each rule can continue to exactly one of its alternatives. For example, this is what we can do with numbers:
<DIGIT> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Similar solutions can be applied to the <AE>
BNF — we either
restrict the way derivations can happen or we come up with new
non-terminals to force a deterministic derivation trees.
As an example of restricting derivations, we look at the current grammar:
| <AE> + <AE>
| <AE> - <AE>
and instead of allowing an <AE>
on both sides of the operation, we
force one to be a number:
| <num> + <AE>
| <num> - <AE>
Now there is a single way to derive any expression, and it is always
associating operations to the right: an expression like 1+2+3
can only
be derived as 1+(2+3)
. To change this to left-association, we would
use this:
| <AE> + <num>
| <AE> - <num>
But what if we want to force precedence? Say that our AE syntax has addition and multiplication:
| <AE> + <AE>
| <AE> * <AE>
We can do that same thing as above and add new non-terminals — say one for “products”:
| <AE> + <AE>
| <PROD>
<PROD> ::= <num>
| <PROD> * <PROD>
Now we must parse any AE expression as additions of multiplications (or
numbers). First, note that if <AE>
goes to <PROD>
and that goes to
<num>
, then there is no need for an <AE>
to go to a <num>
, so this
is the same syntax:
| <PROD>
<PROD> ::= <num>
| <PROD> * <PROD>
Now, if we want to still be able to multiply additions, we can force them to appear in parentheses:
| <PROD>
<PROD> ::= <num>
| <PROD> * <PROD>
| ( <AE> )
Next, note that <AE>
is still ambiguous about additions, which can be
fixed by forcing the left hand side of an addition to be a factor:
| <PROD>
<PROD> ::= <num>
| <PROD> * <PROD>
| ( <AE> )
We still have an ambiguity for multiplications, so we do the same thing and add another non-terminal for “atoms”:
| <PROD>
<PROD> ::= <ATOM> * <PROD>
| <ATOM>
<ATOM> ::= <num>
| ( <AE> )
And you can try to derive several expressions to be convinced that derivation is always deterministic now.
But as you can see, this is exactly the cosmetics that we want to avoid — it will lead us to things that might be interesting, but unrelated to the principles behind programming languages. It will also become much much worse when we have a real language rather such a tiny one.
Is there a good solution? — It is right in our face: do what Racket does — always use fully parenthesized expressions:
| ( <AE> + <AE> )
| ( <AE> - <AE> )
To prevent confusing Racket code with code in our language(s), we also change the parentheses to curly ones:
| { <AE> + <AE> }
| { <AE> - <AE> }
But in Racket everything has a value — including those +
s and
-
s, which makes this extremely convenient with future operations that
might have either more or less arguments than 2 as well as treating
these arithmetic operators as plain functions. In our toy language we
will not do this initially (that is, +
and -
are second order
operators: they cannot be used as values). But since we will get to it
later, we’ll adopt the Racket solution and use a fully-parenthesized
prefix notation:
| { + <AE> <AE> }
| { - <AE> <AE> }
(Remember that in a sense, Racket code is written in a form of already-parsed syntax…)
Simple Parsing
On to an implementation of a “parser”:
Unrelated to what the syntax actually looks like, we want to parse it as soon as possible — converting the concrete syntax to an abstract syntax tree.
No matter how we write our syntax:
3+4
(infix),3 4 +
(postfix),+(3,4)
(prefix with args in parens),(+ 3 4)
(parenthesized prefix),
we always mean the same abstract thing — adding the number 3
and the
number 4
. The essence of this is basically a tree structure with an
addition operation as the root and two leaves holding the two numerals.
With the right data definition, we can describe this in Racket as the
expression (Add (Num 3) (Num 4))
where Add
and Num
are
constructors of a tree type for syntax, or in a C-like language, it
could be something like Add(Num(3),Num(4))
.
Similarly, the expression (3-4)+7
will be described in Racket as the
expression:
Important note: “expression” was used in two different ways in the above — each way corresponds to a different language, and the result of evaluating the second “expression” is a Racket value that represents the first expression.
To define the data type and the necessary constructors we will use this:
[Num Number]
[Add AE AE]
[Sub AE AE])
-
Note — Racket follows the tradition of Lisp which makes syntax issues almost negligible — the language we use is almost as if we are using the parse tree directly. Actually, it is a very simple syntax for parse trees, one that makes parsing extremely easy.
[This has an interesting historical reason… Some Lisp history — M-expressions vs. S-expressions, and the fact that we write code that is isomorphic to an AST. Later we will see some of the advantages that we get by doing this. See also “The Evolution of Lisp”, section 3.5.1. Especially the last sentence:
Therefore we expect future generations of Lisp programmers to continue to reinvent Algol-style syntax for Lisp, over and over and over again, and we are equally confident that they will continue, after an initial period of infatuation, to reject it. (Perhaps this process should be regarded as a rite of passage for Lisp hackers.)
And an interesting & modern counter-example of this here.]
To make things very simple, we will use the above fact through a double-level approach:
-
we first “parse” our language into an intermediate representation — a Racket list — this is mostly done by a modified version of Racket’s
read
function that uses curly{}
braces instead of round()
parens, -
then we write our own
parse
function that will parse the resulting list into an instance of theAE
type — an abstract syntax tree (AST).
This is achieved by the following simple recursive function:
;; parses s-expressions into AEs
(define (parse-sexpr sexpr)
(cond [(number? sexpr) (Num sexpr)]
[(and (list? sexpr) (= 3 (length sexpr)))
(let ([make-node
(match (first sexpr)
['+ Add]
['- Sub]
[else (error 'parse-sexpr "unknown op: ~s"
(first sexpr))])
#| the above is the same as:
(cond [(equal? '+ (first sexpr)) Add]
[(equal? '- (first sexpr)) Sub]
[else (error 'parse-sexpr "unknown op: ~s"
(first sexpr))])
|#])
(make-node (parse-sexpr (second sexpr))
(parse-sexpr (third sexpr))))]
[else (error 'parse-sexpr "bad syntax in ~s" sexpr)]))
This function is pretty simple, but as our languages grow, they will
become more verbose and more difficult to write. So, instead, we use a
new special form: match
, which is matching a value and binds new
identifiers to different parts (try it with “Check Syntax”). Re-writing
the above code using match
:
;; parses s-expressions into AEs
(define (parse-sexpr sexpr)
(match sexpr
[(number: n) (Num n)]
[(list '+ left right)
(Add (parse-sexpr left) (parse-sexpr right))]
[(list '- left right)
(Sub (parse-sexpr left) (parse-sexpr right))]
[else (error 'parse-sexpr "bad syntax in ~s" sexpr)]))
And finally, to make it more uniform, we will combine this with the function that parses a string into a sexpr so we can use strings to represent our programs:
;; parses a string containing an AE expression to an AE
(define (parse str)
(parse-sexpr (string->sexpr str)))
The match
Form
The syntax for match
is
[pattern result-expr]
...)
The value is matched against each pattern, possibly binding names in the process, and if a pattern matches it evaluates the result expression. The simplest form of a pattern is simply an identifier — it always matches and binds that identifier to the value:
[x x]) ; evaluates to the list
Another simple pattern is a quoted symbol, which matches that symbol. For example:
['x "yes"]
[else "no"])
will evaluate to "yes"
if foo
is the symbol x
, and to "no"
otherwise. Note that else
is not a keyword here — it happens to be a
pattern that always succeeds, so it behaves like an else clause except
that it binds else
to the unmatched-so-far value.
Many patterns look like function application — but don’t confuse them
with applications. A (list x y z)
pattern matches a list of exactly
three items and binds the three identifiers; or if the “arguments” are
themselves patterns, match
will descend into the values and match them
too. More specifically, this means that patterns can be nested:
[(list x y z) (+ x y z)]) ; evaluates to 6
(match (list 1 2 3)
[(cons x (list y z)) (+ x y z)]) ; matches the same shape (also 6)
(match '((1) (2) 3)
[(list (list x) (list y) z) (+ x y z)]) ; also 6
As seen above, there is also a cons
pattern that matches a non-empty
list and then matches the first part against the head for the list and
the second part against the tail of the list.
In a list
pattern, you can use ...
to specify that the previous
pattern is repeated zero or more times, and bound names get bound to the
list of respective matching. One simple consequent is that the (list hd tl ...)
pattern is exactly the same as (cons hd tl)
, but being able
to repeat an arbitrary pattern is very useful:
[(list (list x y) ...) (list x y)])
'((1 3 5 7) (2 4 6 8))
A few more useful patterns:
_ -- matches anything, but does not bind
(number: n) -- matches any number and binds it to `n'
(symbol: s) -- same for symbols
(string: s) -- strings
(sexpr: s) -- S-expressions (needed sometimes for Typed Racket)
(and pat1 pat2) -- matches both patterns
(or pat1 pat2) -- matches either pattern (careful with bindings)
Note that the foo:
patterns are all specific to our #lang pl
, they
are not part of #lang racket
or #lang typed/racket
.
The patterns are tried one by one in-order, and if no pattern matches the value, an error is raised.
Note that ...
in a list
pattern can follow any pattern, including
all of the above, and including nested list patterns.
Here are a few examples — you can try them out with #lang pl untyped
at the top of the definitions window. This:
[(list (symbol: syms) ...) syms])
matches x
against a pattern that accepts only a list of symbols, and
binds syms
to those symbols. If you want to match only a list of, say,
one or more symbols, then just add one before the ...
-ed pattern
variable:
[(list (symbol: sym) (symbol: syms) ...) syms])
;; same as:
(match x
[(cons (symbol: sym) (list (symbol: syms) ...)) syms])
which will match such a non-empty list, where the whole list (on the
right hand side) is (cons sym syms)
.
Here’s another example that matches a list of any number of lists, where
each of the sub-lists begins with a symbol and then has any number of
numbers. Note how the n
and s
bindings get values for a list of all
symbols and a list of lists of the numbers:
(match x
[(list (list (symbol: s) (number: n) ...) ...)
(list 'symbols: s 'numbers: n)]))
> (foo (list (list 'x 1 2 3) (list 'y 4 5)))
'(symbols: (x y) numbers: ((1 2 3) (4 5)))
Here is a quick example for how or
is used with two literal
alternatives, how and
is used to name a specific piece of data, and
how or
is used with a binding:
(match x
[(list (or 1 2 3)) 'single]
[(list (and x (list 1 _)) 2) x]
[(or (list 1 x) (list 2 x)) x]))
> (foo (list 3))
'single
> (foo (list (list 1 99) 2))
'(1 99)
> (foo (list 1 10))
10
> (foo (list 2 10))
10
The define-type
Form
The class language that we’re using, #lang pl
, is based on Typed
Racket: a statically-typed dialect of Racket. It is not exactly the
same as Typed Racket — it is restricted in many ways, and extended in
a few ways. (You should therefore try to avoid looking at the Typed
Racket documentation and expect things to be the same in #lang pl
.)
The most important extension is define-type
, which is the construct we
will be using to create new user-defined types. In general, such
definitions looks like what we just used:
[Num Number]
[Add AE AE]
[Sub AE AE])
This defines a new type called AE
, an AE?
predicate for this type,
and a few variants for this type: Num
, Add
, and Sub
in this
case. Each of these variant names is a constructor, taking in arguments
with the listed types, where these types can include the newly defined
type itself in (the very common) case we’re defining a recursive type.
The return type is always the newly defined type, AE
here.
To summarize, this definition gives us a new AE
type, and three
constructors, as if we wrote the following type declarations:
(: Num : Number -> AE)
(: Add : AE AE -> AE)
(: Sub : AE AE -> AE)
The newly defined types are known as “disjoint unions”, since values in these types are disjoint — there is no overlap between the different variants. As we will see, this is what makes this such a useful construct for our needs: the compiler knows about the variants of each newly defined type, which will make it possible for it to complain if we extend a type with more variants but not update all uses of the type.
Furthermore, since the return types of these constructors are all the
new type itself, there is no way for us to write code that expects
just one of these variants. We will use a second form, cases
, to
handle these values.
The cases
Form
A define-type
declaration defines only what was described above: one
new type name and a matching predicate, and a few variants as
constructor functions. Unlike HtDP, we don’t get predicates for each of
the variants, and we don’t get accessor functions for the fields of the
variants.
The way that we handle the new kind of values is with cases
: this is a
form that is very similar to match
, but is specific to instances of
the user-defined type.
Many students find it confusing to distinguish
match
andcases
since they are so similar. Try to remember thatmatch
is for primitive Racket values (we’ll mainly use them for S-expression values), whilecases
is for user-defined values. The distinction between the two forms is unfortunate, and doesn’t serve any purpose. It is just technically difficult to unify the two.
For example, code that handles AE
values (as defined above) can look
as follows:
[(Num n) "a number"]
[(Add l r) "an addition"]
[(Sub l r) "a subtraction"])
As you can see, we need to have patterns for each of the listed variants (and the compiler will throw an error if some are missing), and each of these patterns specifies bindings that will get the field values contained in a given variant object.
We can also use nested patterns:
[(Num n) "a number"]
[(Add (Num m) (Num n)) "a simple addition"]
[(Add l r) "an addition"]
[(Sub (Num m) (Num n)) "a simple subtraction"]
[(Sub l r) "a subtraction"])
but this is a feature that we will not use too often.
The final clause in a cases
form can be an else
clause, which serves
as a fallback in case none of the previous clauses matched the input
value. However, using an else
like this is strongly discouraged!
The problem with using it is that it effectively eliminates the
advantage in getting the type-checker to complain when a type definition
is extended with new variants. Using these else
clauses, we can
actually mimic all of the functionality that you expect in HtDP-style
code, which demonstrates that this is equivalent to HtDP-style
definitions. For example:
;; identifies instances of the `Add` variant
(define (Add? ae)
(cases ae
[(Add l r) #t]
[else #f]))
(: Add-left : AE -> AE)
;; get the left-hand subexpression of an addition
(define (Add-left ae)
(cases ae
[(Add l r) l]
[else (error 'Add-left "expecting an Add value, got ~s" ae)]))
...
Important reminder: this is code that you should not write!
Doing so will lead to code that is more fragile than just using cases
,
since you’d be losing the protection the compiler gives you in the form
of type errors on occurrences of cases
that need to be updated when a
type is extended with new variants. You would therefore end up writing a
bunch of boiler-plate code only to end up with lower-quality code. The
core of the problem is in the prevalent use of else
which gives up
that protection.
In these examples the else
clause is justified because even if AE
is
extended with new variants, functions like Add?
and Add-left
should
not be affected and treat the new variants as they treat all other
non-Add
instances. (And since else
is inherent to these functions,
using them in our code is inherently a bad idea.) We will, however, have
a few (very few!) places where we’ll need to use else
— but this
will always be done only on some specific functionality rather than a
wholesale approach of defining a different interface for user-defined
types.
Semantics (= Evaluation)
Back to BNF — now, meaning.
An important feature of these BNF specifications: we can use the derivations to specify meaning (and meaning in our context is “running” a program (or “interpreting”, “compiling”, but we will use “evaluating”)). For example:
| <AE1> + <AE2> ; <AE> evaluates to the sum of evaluating
; <AE1> and <AE2>
| <AE1> - <AE2> ; ... the subtraction of <AE2> from <AE1>
(... roughly!)
To do this a little more formally:
b. eval(<AE1> + <AE2>) = eval(<AE1>) + eval(<AE2>)
c. eval(<AE1> - <AE2>) = eval(<AE1>) - eval(<AE2>)
Note the completely different roles of the two +
s and -
s. In fact,
it might have been more correct to write:
b. eval("<AE1> + <AE2>") = eval("<AE1>") + eval("<AE2>")
c. eval("<AE1> - <AE2>") = eval("<AE1>") - eval("<AE2>")
or even using a marker to denote meta-holes in these strings:
b. eval("$<AE1> + $<AE2>") = eval("$<AE1>") + eval("$<AE2>")
c. eval("$<AE1> - $<AE2>") = eval("$<AE1>") - eval("$<AE2>")
but we will avoid pretending that we’re doing that kind of string
manipulation. (For example, it will require specifying what does it mean
to return <num>
for $<num>
(involves string->number
), and the
fragments on the right side mean that we need to specify these as
substring operations.)
Note that there’s a similar kind of informality in our BNF
specifications, where we assume that <foo>
refers to some terminal or
non-terminal. In texts that require more formal specifications (for
example, in RFC specifications), each literal part of the BNF is usually
double-quoted, so we’d get
An alternative popular notation for eval(X)
is ⟦X⟧
:
b. [[<AE1> + <AE2>]] = [[<AE1>]] + [[<AE2>]]
c. [[<AE1> - <AE2>]] = [[<AE1>]] - [[<AE2>]]
Is there a problem with this definition? Ambiguity:
Depending on the way the expression is parsed, we can get either a
result of 2
or -4
:
= eval(1) - eval(2) + eval(3) [c]
= 1 - 2 + 3 [a,a,a]
= 2
eval(1 - 2 + 3) = eval(1) - eval(2 + 3) [c]
= eval(1) - (eval(2) + eval(3)) [a]
= 1 - (2 + 3) [a,a,a]
= -4
Again, be very aware of confusing subtleties which are extremely important: We need parens around a sub-expression only in one side, why? — When we write:
we have two expressions, but one stands for an input syntax, and one stands for a real mathematical expression.
In a case of a computer implementation, the syntax on the left is (as always) an AE syntax, and the real expression on the right is an expression in whatever language we use to implement our AE language.
Like we said earlier, ambiguity is not a real problem until the actual
parse tree matters. With eval
it definitely matters, so we must not
make it possible to derive any syntax in multiple ways or our evaluation
will be non-deterministic.
Quick exercise:
We can define a meaning for <digit>
s and then <num>
s in a similar
way:
eval(0) = 0
eval(1) = 1
eval(2) = 2
...
eval(9) = 9
eval(<digit>) = <digit>
eval(<digit> <NUM>) = 10*eval(<digit>) + eval(<NUM>)
Is this exactly what we want? — Depends on what we actually want…
-
First, there’s a bug in this code — having a BNF derivation like
<NUM> ::= <digit> | <digit> <NUM>is unambiguous, but makes it hard to parse a number. We get:
eval(123) = 10*eval(1) + eval(23)
= 10*1 + 10*eval(2) + eval(3)
= 10*1 + 10*2 + 3
= 33Changing the order of the last rule works much better:
<NUM> ::= <digit> | <NUM> <digit>and then:
eval(<NUM> <digit>) = 10*eval(<NUM>) + eval(<digit>) -
As a concrete example see how you would make it work with
107
, which demonstrates why compositionality is important. -
Example for free stuff that looks trivial: if we were to define the meaning of numbers this way, would it always work? Think an average language that does not give you bignums, making the above rules fail when the numbers are too big. In Racket, we happen to be using an integer representation for the syntax of integers, and both are unlimited. But what if we wanted to write a Racket compiler in C or a C compiler in Racket? What about a C compiler in C, where the compiler runs on a 64 bit machine, and the result needs to run on a 32 bit machine?
Side-note: Compositionality
The example of
being a language that is easier to write an evaluator for leads us to an
important concept — compositionality. This definition is easier to
write an evaluator for, since the resulting language is compositional:
the meaning of an expression — for example 123
— is composed out
of the meaning of its two parts, which in this BNF are 12
and 3
.
Specifically, the evaluation of <NUM> <digit>
is 10 *
the evaluation
of the first, plus the evaluation of the second. In the <digit> <NUM>
case this is more difficult — the meaning of such a number depends not
only on the meaning of the two parts, but also on the <NUM>
syntax:
eval(<digit>) * 10^length(<NUM>) + eval(<NUM>)
This this case this can be tolerable, since the meaning of the
expression is still made out of its parts — but imperative programming
(when you use side effects) is much more problematic since it is not
compositional (at least not in the obvious sense). This is compared to
functional programming, where the meaning of an expression is a
combination of the meanings of its subexpressions. For example, every
sub-expression in a functional program has some known meaning, and these
all make up the meaning of the expression that contains them — but in
an imperative program we can have a part of the code be x++
— and
that doesn’t have a meaning by itself, at least not one that contributes
to the meaning of the whole program in a direct way.
(Actually, we can have a well-defined meaning for such an expression:
the meaning is going from a world where x
is a container of some value
N, to a world where the same container has a different value N+1. You
can probably see now how this can make things more complicated. On an
intuitive level — if we look at a random part of a functional program
we can tell its meaning, so building up the meaning of the whole code is
easy, but in an imperative program, the meaning of a random part is
pretty much useless.)