Lecture #3, Tuesday, January 17th
=================================

- Sidenote on Types
- Side-note: Names are important
- BNF, Grammars, the AE Language
- Simple Parsing
- The `match` Form
- The `define-type` Form
- The `cases` Form

------------------------------------------------------------------------

## Sidenote on Types

> Note: this is all just a side note for a particularly hairy example.
> You don't need to follow all of this to write code in this class!
> Consider this section a kind of an extra type-related puzzle to read
> through, and maybe get back to it much later, after we cover
> typechecking.

Types can become interestingly complicated when dealing with higher-order functions. Specifically, the nature of the type system used by Typed Racket gives it one important weakness: it often fails to infer types when higher-order functions operate on polymorphic functions.

For example, consider how `map` receives a function and a list of some type, and applies the function over this list to accumulate its output, so it's a polymorphic function with the following type:

    map : (A -> B) (Listof A) -> (Listof B)

But Racket's `map` is actually more flexible than that: it can take more than a single list input, in which case it will apply the function on the first element in all lists, then the second and so on.
Narrowing our vision to the two-input-lists case, the type of `map` then becomes:

    map : (A B -> C) (Listof A) (Listof B) -> (Listof C)

Now, here's a hairy example --- what is the type of this function:

    (define (foo x y)
      (map map x y))

Begin with what we know --- both `map`s, call them `map1` and `map2`, have the double- and single-list types of `map` respectively. Here they are, with different names for types:

    ;; the first `map', consumes a function and two lists
    map1 : (A B -> C) (Listof A) (Listof B) -> (Listof C)

    ;; the second `map', consumes a function and one list
    map2 : (X -> Y) (Listof X) -> (Listof Y)

Now, we know that `map2` is the first argument to `map1`, so the type of `map1`'s first argument should be the type of `map2`:

    (A B -> C) = (X -> Y) (Listof X) -> (Listof Y)

From here we can conclude that

    A = (X -> Y)

    B = (Listof X)

    C = (Listof Y)

If we use these equations in `map1`'s type, we get:

    map1 : ((X -> Y) (Listof X) -> (Listof Y))
           (Listof (X -> Y))
           (Listof (Listof X))
           -> (Listof (Listof Y))

Now, `foo`'s two arguments are the 2nd and 3rd arguments of `map1`, and its result is `map1`'s result, so we can now write our "estimated" type of `foo`:

    (: foo : (Listof (X -> Y))
             (Listof (Listof X))
             -> (Listof (Listof Y)))
    (define (foo x y)
      (map map x y))

This should help you understand why, for example, this will cause a type error:

    (foo (list add1 sub1 add1)
         (list 1 2 3))

and why this is valid:

    (foo (list add1 sub1 add1)
         (map list (list 1 2 3)))

***But...!*** There's a big "but" here, which is that weakness of Typed Racket that was mentioned. If you try to actually write such a definition in `#lang pl` (which is based on Typed Racket), you will first find that you need to explicitly list the type variables that are needed to make it into a generic type. So the above becomes:

    (: foo : (All (X Y)
               (Listof (X -> Y))
               (Listof (Listof X))
               -> (Listof (Listof Y))))
    (define (foo x y)
      (map map x y))

But not only does that not work --- it throws an obscure type error.
That error is actually due to TR's weakness: it's a result of not being able to infer the proper types. In such cases, TR has two mechanisms to "guide it" in the right direction. The first one is `inst`, which is used to instantiate a generic (= polymorphic) type with some actual type. The problem here is with the second `map` since that's the polymorphic function that is given to a higher-order function (the first `map`). If we provide the types to instantiate this, it will work fine:

    (: foo : (All (X Y)
               (Listof (X -> Y))
               (Listof (Listof X))
               -> (Listof (Listof Y))))
    (define (foo x y)
      (map (inst map Y X) x y))

Now, you can use this definition to run the above example:

    (foo (list add1 sub1 add1)
         (list (list 1) (list 2) (list 3)))

This example works fine, but that's because we wrote the list argument explicitly. If you try to use the exact example above,

    (foo (list add1 sub1 add1)
         (map list (list 1 2 3)))

you'd run into the same problem again, since this also uses a polymorphic function (`list`) with a higher-order one (`map`). Indeed, an `inst` can make this one work too:

    (foo (list add1 sub1 add1)
         (map (inst list Number) (list 1 2 3)))

The second facility is `ann`, which can be used to annotate an expression with the type that you expect it to have.

    (define (foo x y)
      (map (ann map ((X -> Y) (Listof X) -> (Listof Y)))
           x y))

(Note: this is not type casting! It's using a different type which is also applicable for the given expression, and having the type checker validate that this is true. TR does have a similar `cast` form, which is used for related but different cases.)

This tends to be more verbose than `inst`, but is sometimes easier to follow, since the expected type is given explicitly. The thing about `inst` is that it's kind of "applying" a polymorphic `(All (A B) ...)` type, so you need to know the order of the `A B` arguments, which is why in the above we use `(inst map Y X)` rather than `(inst map X Y)`.
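Setting the types aside, it can help to see what `(map map x y)` actually computes. The following is a minimal sketch in plain, untyped `#lang racket` (an assumption made here so that no annotations are needed; it is not the `#lang pl` code from the text), using the same inputs as the valid example above.

```racket
#lang racket

;; Untyped version of `foo` from the text: the outer `map` walks both
;; lists in parallel and calls the inner `map` on each pair, that is,
;; (map f lst) for every function f and its matching sub-list lst.
(define (foo x y)
  (map map x y))

;; Each function is applied to the elements of its own sub-list.
(foo (list add1 sub1 add1)
     (list (list 1) (list 2) (list 3)))
;; => '((2) (1) (4))

;; `(map list (list 1 2 3))` builds the same nested list as above.
(foo (list add1 sub1 add1)
     (map list (list 1 2 3)))
;; => '((2) (1) (4))
```

This also shows why the second argument has to be a list of lists: each function needs a whole list to map over.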
> Again, remember that this is all not something that you need to know.
> We will have a few (very rare) cases where we'll need to use `inst`,
> and in each of these, you'll be told where and how to use it.

------------------------------------------------------------------------

## Side-note: Names are important

An important "discovery" in computer science is that we *don't* need names for every intermediate sub-expression --- for example, in almost any language we can write something like:

    s = (-b + sqrt(b^2 - 4*a*c)) / (2*a)

instead of

    x₁ = b * b
    y₁ = 4 * a
    y₂ = y₁ * c
    x₂ = x₁ - y₂
    x₃ = sqrt(x₂)
    y₃ = -b
    x₄ = y₃ + x₃
    y₄ = 2 * a
    s  = x₄ / y₄

Such languages are put in contrast to assembly languages, and were all put under the generic label of "high level languages".

(Here's an interesting idea --- why not do the same for function values?)

------------------------------------------------------------------------

# BNF, Grammars, the AE Language

Getting back to the theme of the course: we want to investigate programming languages, and we want to do that *using* a programming language.

The first thing when we design a language is to specify the language. For this we use BNF (Backus-Naur Form). For example, here is the definition of a simple arithmetic language:

    <AE> ::= <num>
           | <AE> + <AE>
           | <AE> - <AE>

Explain the different parts. Specifically, this is a mixture of low-level (concrete) syntax definition with parsing.

We use this to derive expressions in some language. We start with `<AE>`, which should be one of these:

* a number `<num>`
* an `<AE>`, the text "`+`", and another `<AE>`
* the same but with "`-`"

`<num>` is a terminal: when we reach it in the derivation, we're done. `<AE>` is a non-terminal: when we reach it, we have to continue with one of the options. It should be clear that the `+` and the `-` are things we expect to find in the input --- because they are not wrapped in `<>`s.

We could specify what `<num>` is (turning it into a `<NUM>` non-terminal):

    <AE> ::= <NUM>
           | <AE> + <AE>
           | <AE> - <AE>

    <NUM> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
            | <NUM> <NUM>

But we don't --- why?
Because in Racket we have numbers as primitives and we want to use Racket to implement our languages. This makes life a lot easier, and we get free stuff like floats, rationals etc.

To use a BNF formally, for example, to prove that `1-2+3` is a valid `<AE>` expression, we first label the rules:

    <AE> ::= <num>        (1)
           | <AE> + <AE>  (2)
           | <AE> - <AE>  (3)

and then we can use them as formal justifications for each derivation step:

    <AE>
    <AE> + <AE>           ; (2)
    <AE> + <num>          ; (1)
    <AE> - <AE> + <num>   ; (3)
    <AE> - <AE> + 3       ; (num)
    <num> - <AE> + 3      ; (1)
    <num> - <num> + 3     ; (1)
    1 - <num> + 3         ; (num)
    1 - 2 + 3             ; (num)

This would be one way of doing this. Alternatively, we can visualize the derivation using a tree, with the rules used at the nodes.

These specifications suffer from being ambiguous: an expression can be derived in multiple ways. Even the little syntax for a number is ambiguous --- a number like `123` can be derived in two ways that result in trees that look different. This ambiguity is not a "real" problem now, but it will become one very soon. We want to get rid of this ambiguity, so that there is a single (= deterministic) way to derive all expressions.

There is a standard way to resolve that --- we add another non-terminal to the definition, and make it so that each rule can continue to exactly one of its alternatives. For example, this is what we can do with numbers:

    <NUM> ::= <DIGIT> | <DIGIT> <NUM>

    <DIGIT> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Similar solutions can be applied to the `<AE>` BNF --- we either restrict the way derivations can happen or we come up with new non-terminals to force deterministic derivation trees.

As an example of restricting derivations, we look at the current grammar:

    <AE> ::= <num>
           | <AE> + <AE>
           | <AE> - <AE>

and instead of allowing an `<AE>` on both sides of the operation, we force one to be a number:

    <AE> ::= <num>
           | <num> + <AE>
           | <num> - <AE>

Now there is a single way to derive any expression, and it is always associating operations to the right: an expression like `1+2+3` can only be derived as `1+(2+3)`. To change this to left-association, we would use this:

    <AE> ::= <num>
           | <AE> + <num>
           | <AE> - <num>

But what if we want to force precedence?
Say that our AE syntax has addition and multiplication:

    <AE> ::= <num>
           | <AE> + <AE>
           | <AE> * <AE>

We can do the same thing as above and add new non-terminals --- say one for "products":

    <AE>   ::= <num>
             | <AE> + <AE>
             | <PROD>

    <PROD> ::= <num>
             | <PROD> * <PROD>

Now we must parse any AE expression as additions of multiplications (or numbers). First, note that if `<AE>` goes to `<PROD>` and that goes to `<num>`, then there is no need for an `<AE>` to go to a `<num>`, so this is the same syntax:

    <AE>   ::= <AE> + <AE>
             | <PROD>

    <PROD> ::= <num>
             | <PROD> * <PROD>

Now, if we want to still be able to multiply additions, we can force them to appear in parentheses:

    <AE>   ::= <AE> + <AE>
             | <PROD>

    <PROD> ::= <num>
             | <PROD> * <PROD>
             | ( <AE> )

Next, note that `<AE>` is still ambiguous about additions, which can be fixed by forcing the left hand side of an addition to be a factor:

    <AE>   ::= <PROD> + <AE>
             | <PROD>

    <PROD> ::= <num>
             | <PROD> * <PROD>
             | ( <AE> )

We still have an ambiguity for multiplications, so we do the same thing and add another non-terminal for "atoms":

    <AE>   ::= <PROD> + <AE>
             | <PROD>

    <PROD> ::= <ATOM> * <PROD>
             | <ATOM>

    <ATOM> ::= <num>
             | ( <AE> )

And you can try to derive several expressions to be convinced that derivation is always deterministic now.

But as you can see, this is exactly the cosmetics that we want to avoid --- it will lead us to things that might be interesting, but unrelated to the principles behind programming languages. It will also become much much worse when we have a real language rather than such a tiny one.

Is there a good solution? --- It is right in our face: do what Racket does --- always use fully parenthesized expressions:

    <AE> ::= <num>
           | ( <AE> + <AE> )
           | ( <AE> - <AE> )

To prevent confusing Racket code with code in our language(s), we also change the parentheses to curly ones:

    <AE> ::= <num>
           | { <AE> + <AE> }
           | { <AE> - <AE> }

But in Racket *everything* has a value --- including those `+`s and `-`s, which makes this extremely convenient with future operations that might have either more or fewer arguments than 2 as well as treating these arithmetic operators as plain functions. In our toy language we will not do this initially (that is, `+` and `-` are second order operators: they cannot be used as values).
But since we will get to it later, we'll adopt the Racket solution and use a fully-parenthesized prefix notation:

    <AE> ::= <num>
           | { + <AE> <AE> }
           | { - <AE> <AE> }

(Remember that in a sense, Racket code is written in a form of already-parsed syntax...)

------------------------------------------------------------------------

# Simple Parsing

On to an implementation of a "parser":

Unrelated to what the syntax actually looks like, we want to parse it as soon as possible --- converting the concrete syntax to an abstract syntax tree.

No matter how we write our syntax:

- `3+4` (infix),
- `3 4 +` (postfix),
- `+(3,4)` (prefix with args in parens),
- `(+ 3 4)` (parenthesized prefix),

we always mean the same abstract thing --- adding the number `3` and the number `4`. The essence of this is basically a tree structure with an addition operation as the root and two leaves holding the two numerals. With the right data definition, we can describe this in Racket as the expression `(Add (Num 3) (Num 4))` where `Add` and `Num` are constructors of a tree type for syntax, or in a C-like language, it could be something like `Add(Num(3),Num(4))`.

Similarly, the expression `(3-4)+7` will be described in Racket as the expression:

    (Add (Sub (Num 3) (Num 4)) (Num 7))

Important note: "expression" was used in two *different* ways in the above --- each way corresponds to a different language, and the result of evaluating the second "expression" is a Racket value that *represents* the first expression.

To define the data type and the necessary constructors we will use this:

    (define-type AE
      [Num Number]
      [Add AE AE]
      [Sub AE AE])

* Note --- Racket follows the tradition of Lisp which makes syntax issues almost negligible --- the language we use is almost as if we are using the parse tree directly. Actually, it is a very simple syntax for parse trees, one that makes parsing extremely easy.

  [This has an interesting historical reason... Some Lisp history --- *M-expressions* vs.
  *S-expressions*, and the fact that we write code that is isomorphic to an AST. Later we will see some of the advantages that we get by doing this. See also "*The Evolution of Lisp*", section 3.5.1. Especially the last sentence:

  > Therefore we expect future generations of Lisp programmers to
  > continue to reinvent Algol-style syntax for Lisp, over and over and
  > over again, and we are equally confident that they will continue,
  > after an initial period of infatuation, to reject it. (Perhaps this
  > process should be regarded as a rite of passage for Lisp hackers.)

  And an interesting & modern *counter*-example of this [here](https://ts-ast-viewer.com/#code/DYUwLgBAghC8EEYDcAoFYCeAHE06NSA).]

To make things very simple, we will use the above fact through a double-level approach:

* we first "parse" our language into an intermediate representation --- a Racket list --- this is mostly done by a modified version of Racket's `read` function that uses curly `{}` braces instead of round `()` parens,

* then we write our own `parse` function that will parse the resulting list into an instance of the `AE` type --- an abstract syntax tree (AST).

This is achieved by the following simple recursive function:

    (: parse-sexpr : Sexpr -> AE)
    ;; parses s-expressions into AEs
    (define (parse-sexpr sexpr)
      (cond [(number? sexpr) (Num sexpr)]
            [(and (list? sexpr) (= 3 (length sexpr)))
             (let ([make-node
                    (match (first sexpr)
                      ['+ Add]
                      ['- Sub]
                      [else (error 'parse-sexpr "unknown op: ~s"
                                   (first sexpr))])
                    #| the above is the same as:
                    (cond [(equal? '+ (first sexpr)) Add]
                          [(equal? '- (first sexpr)) Sub]
                          [else (error 'parse-sexpr "unknown op: ~s"
                                       (first sexpr))])
                    |#])
               (make-node (parse-sexpr (second sexpr))
                          (parse-sexpr (third sexpr))))]
            [else (error 'parse-sexpr "bad syntax in ~s" sexpr)]))

This function is pretty simple, but as our languages grow, such parsers will become more verbose and more difficult to write.
So, instead, we use a new special form: `match`, which matches a value against patterns and binds new identifiers to its different parts (try it with "Check Syntax"). Re-writing the above code using `match`:

    (: parse-sexpr : Sexpr -> AE)
    ;; parses s-expressions into AEs
    (define (parse-sexpr sexpr)
      (match sexpr
        [(number: n) (Num n)]
        [(list '+ left right)
         (Add (parse-sexpr left) (parse-sexpr right))]
        [(list '- left right)
         (Sub (parse-sexpr left) (parse-sexpr right))]
        [else (error 'parse-sexpr "bad syntax in ~s" sexpr)]))

And finally, to make it more uniform, we will combine this with the function that parses a string into a sexpr so we can use strings to represent our programs:

    (: parse : String -> AE)
    ;; parses a string containing an AE expression to an AE
    (define (parse str)
      (parse-sexpr (string->sexpr str)))

------------------------------------------------------------------------

# The `match` Form

The syntax for `match` is

    (match value
      [pattern result-expr]
      ...)

The value is matched against each pattern, possibly binding names in the process, and when a pattern matches, the corresponding result expression is evaluated. The simplest form of a pattern is simply an identifier --- it always matches and binds that identifier to the value:

    (match (list 1 2 3)
      [x x]) ; evaluates to the list

Another simple pattern is a quoted symbol, which matches that symbol. For example:

    (match foo
      ['x "yes"]
      [else "no"])

will evaluate to `"yes"` if `foo` is the symbol `x`, and to `"no"` otherwise. Note that `else` is not a keyword here --- it happens to be a pattern that always succeeds, so it behaves like an else clause except that it binds `else` to the unmatched-so-far value.

Many patterns look like function application --- but don't confuse them with applications. A `(list x y z)` pattern matches a list of exactly three items and binds the three identifiers; or if the "arguments" are themselves patterns, `match` will descend into the values and match them too.
More specifically, this means that patterns can be nested:

    (match (list 1 2 3)
      [(list x y z) (+ x y z)]) ; evaluates to 6
    (match (list 1 2 3)
      [(cons x (list y z)) (+ x y z)]) ; matches the same shape (also 6)
    (match '((1) (2) 3)
      [(list (list x) (list y) z) (+ x y z)]) ; also 6

As seen above, there is also a `cons` pattern that matches a non-empty list and then matches the first part against the head of the list and the second part against the tail of the list.

In a `list` pattern, you can use `...` to specify that the previous pattern is repeated zero or more times, and bound names get bound to the list of the respective matches. One simple consequence is that the `(list hd tl ...)` pattern is exactly the same as `(cons hd tl)`, but being able to repeat an arbitrary pattern is very useful:

    > (match '((1 2) (3 4) (5 6) (7 8))
        [(list (list x y) ...) (list x y)])
    '((1 3 5 7) (2 4 6 8))

A few more useful patterns:

    id              -- matches anything, binds `id' to it
    _               -- matches anything, but does not bind
    (number: n)     -- matches any number and binds it to `n'
    (symbol: s)     -- same for symbols
    (string: s)     -- strings
    (sexpr: s)      -- S-expressions (needed sometimes for Typed Racket)
    (and pat1 pat2) -- matches both patterns
    (or pat1 pat2)  -- matches either pattern (careful with bindings)

Note that the `foo:` patterns are all specific to our `#lang pl`, they are not part of `#lang racket` or `#lang typed/racket`.

The patterns are tried one by one *in-order*, and if no pattern matches the value, an error is raised.

Note that `...` in a `list` pattern can follow *any* pattern, including all of the above, and including nested list patterns.

Here are a few examples --- you can try them out with `#lang pl untyped` at the top of the definitions window. This:

    (match x
      [(list (symbol: syms) ...) syms])

matches `x` against a pattern that accepts only a list of symbols, and binds `syms` to those symbols.
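The `foo:` patterns exist only in `#lang pl`. For experimenting in plain `#lang racket`, the list-of-symbols pattern above can be approximated with the standard `(? predicate pattern)` form of Racket's `match`. This is only a sketch of a rough equivalent, not how `#lang pl` defines its patterns, and `symbol-list` is a hypothetical name used only here.

```racket
#lang racket
;; `match` is available in `#lang racket` (it comes from racket/match).

;; Plain-Racket stand-in for the pattern (list (symbol: syms) ...):
;; (? symbol? syms) matches a value that satisfies `symbol?` and binds
;; it, and the trailing `...` collects all such matches into a list.
(define (symbol-list x)
  (match x
    [(list (? symbol? syms) ...) syms]
    [_ #f])) ; anything that is not a list of symbols

(symbol-list '(a b c)) ; => '(a b c)
(symbol-list '(a 1 c)) ; => #f
(symbol-list '())      ; => '()
```

The fallback `_` clause is added here only so that the sketch returns `#f` instead of raising a match error on other inputs.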
If you want to match only a list of, say, one or more symbols, then just add one before the `...`-ed pattern variable:

    (match x
      [(list (symbol: sym) (symbol: syms) ...) syms])
    ;; same as:
    (match x
      [(cons (symbol: sym) (list (symbol: syms) ...)) syms])

which will match such a non-empty list, where the whole list (on the right hand side) is `(cons sym syms)`.

Here's another example that matches a list of any number of lists, where each of the sub-lists begins with a symbol and then has any number of numbers. Note how the `s` and `n` bindings get values for a list of all symbols and a list of lists of the numbers:

    > (define (foo x)
        (match x
          [(list (list (symbol: s) (number: n) ...) ...)
           (list 'symbols: s 'numbers: n)]))
    > (foo (list (list 'x 1 2 3) (list 'y 4 5)))
    '(symbols: (x y) numbers: ((1 2 3) (4 5)))

Here is a quick example for how `or` is used with two literal alternatives, how `and` is used to name a specific piece of data, and how `or` is used with a binding:

    > (define (foo x)
        (match x
          [(list (or 1 2 3)) 'single]
          [(list (and x (list 1 _)) 2) x]
          [(or (list 1 x) (list 2 x)) x]))
    > (foo (list 3))
    'single
    > (foo (list (list 1 99) 2))
    '(1 99)
    > (foo (list 1 10))
    10
    > (foo (list 2 10))
    10

------------------------------------------------------------------------

# The `define-type` Form

The class language that we're using, `#lang pl`, is based on *Typed Racket*: a statically-typed dialect of Racket. It is not exactly the same as Typed Racket --- it is restricted in many ways, and extended in a few ways. (You should therefore try to avoid looking at the Typed Racket documentation and expecting things to be the same in `#lang pl`.)

The most important extension is `define-type`, which is the construct we will be using to create new user-defined types.
In general, such definitions look like what we just used:

    (define-type AE
      [Num Number]
      [Add AE AE]
      [Sub AE AE])

This defines a *new type* called `AE`, an `AE?` predicate for this type, and a few *variants* for this type: `Num`, `Add`, and `Sub` in this case. Each of these variant names is a constructor, taking in arguments with the listed types, where these types can include the newly defined type itself in the (very common) case that we're defining a recursive type. The return type is always the newly defined type, `AE` here.

To summarize, this definition gives us a new `AE` type, and three constructors, as if we wrote the following type declarations:

* `(: Num : Number -> AE)`
* `(: Add : AE AE -> AE)`
* `(: Sub : AE AE -> AE)`

The newly defined types are known as *"disjoint unions"*, since values in these types are disjoint --- there is no overlap between the different variants. As we will see, this is what makes this such a useful construct for our needs: the compiler knows about the variants of each newly defined type, which will make it possible for it to complain if we extend a type with more variants but do not update all uses of the type.

Furthermore, since the return types of these constructors are all the new type itself, there is *no way* for us to write code that expects just *one* of these variants. We will use a second form, `cases`, to handle these values.

------------------------------------------------------------------------

# The `cases` Form

A `define-type` declaration defines *only* what was described above: one new type name and a matching predicate, and a few variants as constructor functions. Unlike HtDP, we don't get predicates for each of the variants, and we don't get accessor functions for the fields of the variants.

The way that we handle the new kind of values is with `cases`: this is a form that is very similar to `match`, but is specific to instances of the user-defined type.
> Many students find it confusing to distinguish `match` and `cases`
> since they are so similar. Try to remember that `match` is for
> primitive Racket values (we'll mainly use them for S-expression
> values), while `cases` is for user-defined values. The distinction
> between the two forms is unfortunate, and doesn't serve any purpose.
> It is just technically difficult to unify the two.

For example, code that handles `AE` values (as defined above) can look as follows:

    (cases some-ae-value
      [(Num n)   "a number"]
      [(Add l r) "an addition"]
      [(Sub l r) "a subtraction"])

As you can see, we need to have patterns for each of the listed variants (and the compiler will throw an error if some are missing), and each of these patterns specifies bindings that will get the field values contained in a given variant object.

We can also use nested patterns:

    (cases some-ae-value
      [(Num n)               "a number"]
      [(Add (Num m) (Num n)) "a simple addition"]
      [(Add l r)             "an addition"]
      [(Sub (Num m) (Num n)) "a simple subtraction"]
      [(Sub l r)             "a subtraction"])

but this is a feature that we will not use too often.

The final clause in a `cases` form can be an `else` clause, which serves as a fallback in case none of the previous clauses matched the input value. However, using an `else` like this is ***strongly discouraged!*** The problem with using it is that it effectively eliminates the advantage in getting the type-checker to complain when a type definition is extended with new variants. Using these `else` clauses, we can actually mimic all of the functionality that you expect in HtDP-style code, which demonstrates that this is equivalent to HtDP-style definitions. For example:

    (: Add? : AE -> Boolean)
    ;; identifies instances of the `Add` variant
    (define (Add? ae)
      (cases ae
        [(Add l r) #t]
        [else #f]))

    (: Add-left : AE -> AE)
    ;; get the left-hand subexpression of an addition
    (define (Add-left ae)
      (cases ae
        [(Add l r) l]
        [else (error 'Add-left "expecting an Add value, got ~s" ae)]))

    ...
***Important reminder:*** this is code that ***you should not write!*** Doing so will lead to code that is more fragile than just using `cases`, since you'd be losing the protection the compiler gives you in the form of type errors on occurrences of `cases` that need to be updated when a type is extended with new variants. You would therefore end up writing a bunch of boiler-plate code only to end up with lower-quality code.

The core of the problem is in the prevalent use of `else` which gives up that protection. In these examples the `else` clause is justified because even if `AE` is extended with new variants, functions like `Add?` and `Add-left` should not be affected and treat the new variants as they treat all other non-`Add` instances. (And since `else` is inherent to these functions, using them in our code is inherently a bad idea.)

We will, however, have a few (*very few!*) places where we'll need to use `else` --- but this will always be done only on some specific functionality rather than a wholesale approach of defining a different interface for user-defined types.
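As a contrast to the `Add?`-style helpers, here is a sketch of a function that names every variant in its `cases` form and has no `else` clause. `count-adds` is a hypothetical example that is not part of the course code; it assumes `#lang pl` and repeats the `AE` definition from above so that it stands alone.

```racket
#lang pl

(define-type AE
  [Num Number]
  [Add AE AE]
  [Sub AE AE])

(: count-adds : AE -> Number)
;; counts the `Add` nodes in an AE (hypothetical example)
(define (count-adds ae)
  (cases ae
    ;; every variant is listed, and there is no `else` fallback
    [(Num n)   0]
    [(Add l r) (+ 1 (count-adds l) (count-adds r))]
    [(Sub l r) (+ (count-adds l) (count-adds r))]))

(count-adds (Add (Num 1) (Sub (Num 2) (Add (Num 3) (Num 4)))))
;; => 2
```

Because no `else` is used, extending `AE` with a new variant should make this `cases` incomplete, which, as described above, the compiler reports as an error. That is the protection a wholesale `else` would hide.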