2010-01-19  - Lists & Recursion (contd.)
            - Names are Important
            - Note on Types
            - BNF, Grammars, the Simple AE Language
            - Simple Parsing
            - The `match' Form

========================================================================

When you have some common value that you need to use in several places,
it is bad to duplicate it.  For example:

    (define (how-many a b c)
      (cond [(> (* b b) (* 4 a c)) 2]
            [(= (* b b) (* 4 a c)) 1]
            [(< (* b b) (* 4 a c)) 0]))

What's bad about it?

  * It's longer than necessary, which will eventually make your code
    less readable.

  * It's slower -- by the time you reach the last case, you have
    evaluated the two products three times.

  * It's more prone to bugs -- the above code is short enough, but what
    if it was longer so you don't see the three occurrences on the same
    page?  Will you remember to fix all places when you debug the code
    months after it was written?

In general, the ability to use names is probably the most fundamental
concept in computer science -- the fact that makes computer programs
what they are.

We already have a facility to name values: function arguments.  We
could split the above function into two like this:

    (define (how-many-helper b^2 4ac) ; note the identifier name!
      (cond [(> b^2 4ac) 2]
            [(= b^2 4ac) 1]
            [else 0]))

    (define (how-many a b c)
      (how-many-helper (* b b) (* 4 a c)))

But instead of the awkward solution of coming up with a new function
just for its names, we have a facility to bind local names -- `let'.
In general, the syntax for a `let' special form is

    (let ([id expr] ...)
      expr)

For example,

    (let ([x 1] [y 2]) (+ x y))

But note that the bindings are done "in parallel", for example, try
this:

    (let ([x 1] [y 2])
      (let ([x y] [y x])
        (list x y)))

(The inner `x' and `y' expressions both see the outer bindings, so the
result is the list (2 1).)

Using this for the above problem:

    (define (how-many a b c)
      (let ([b^2 (* b b)]
            [4ac (* 4 a c)])
        (cond [(> b^2 4ac) 2]
              [(= b^2 4ac) 1]
              [else 0])))

========================================================================

Some notes on writing code (also see the style-guide in the handouts
section)

*** Code quality will be graded too in this course!

  * Use abstractions whenever possible, as said above.  This is bad:

        (define (how-many a b c)
          (cond ((> (* b b) (* 4 a c)) 2)
                ((= (* b b) (* 4 a c)) 1)
                ((< (* b b) (* 4 a c)) 0)))

        (define (what-kind a b c)
          (cond ((= a 0) 'degenerate)
                ((> (* b b) (* 4 a c)) 'two)
                ((= (* b b) (* 4 a c)) 'one)
                ((< (* b b) (* 4 a c)) 'none)))

  * But don't over abstract:

        (define one 1)
        (define two "two")

  * Always do test cases (show coverage tool), you might want to comment
    them, but you should always make sure your code works.

  * Do not under-document, but also don't over-document.

  * INDENTATION!  (Let DrScheme decide for you, and get used to its
    rules)  --> This is part of the culture that was mentioned last
    time, but it's done this way for good reason: decades of
    programming experience have shown this to be the most readable
    format.

  * As a general rule, `if' should be either all on one line, or the
    condition on the first and each consequent on a separate line.
    Similarly for `define' -- either all on one line or a newline after
    the object that is being defined (either an identifier or an
    identifier with arguments).

  * Another general rule: you should never have white space after an
    open-paren, or before a close paren (white space includes
    newlines).  Also, before an open paren there should be either
    another open paren or white space, and the same goes for after a
    closing paren.
  * Use the tools that are available to you: for example, use `cond'
    instead of nested `if's (definitely do not force the indentation to
    make a nested `if' look like its C counterpart -- remember to let
    DrScheme indent for you).

    Another example -- do not use `(+ 1 (+ 2 3))' instead of
    `(+ 1 2 3)' (this might be needed in *extremely* rare situations,
    only when you know your calculus and have extensive knowledge about
    round-off errors).

    Another example -- do not use `(cons 1 (cons 2 (cons 3 null)))'
    instead of `(list 1 2 3)'.

    Also -- don't write things like:

        (if x #t y) --same-as--> (or x y)
        (if x y #f) --same-as--> (and x y)
        (if x #f #t) --same-as--> (not x)

    (The first one is exactly the same only when `x' is a boolean:
    `or' returns the value of `x' itself when it is not false.)

  * Use these as examples for spotting many of these issues:

        (define (interest x)
          (* x (cond
                 [(and (> x 0) (<= x 1000)) 0.04]
                 [(and (> x 1000) (<= x 5000)) 0.045]
                 [else 0.05])))

        (define (how-many a b c)
          (cond ((> (* b b) (* (* 4 a) c))
                 2)
                ((< (* b b) (* (* 4 a) c))
                 0)
                (else
                 1)))

        (define (what-kind a b c)
          (if (equal? a 0) 'degenerate
            (if (equal? (how-many a b c) 0) 'zero
              (if (equal?
                   (how-many a b c) 1) 'one
                'two)
            )
          )
        )

        (define (interest deposit)
          (cond
            [(< deposit 0) "invalid deposit"]
            [(and (>= deposit 0) (<= deposit 1000)) (* deposit 1.04) ]
            [(and (> deposit 1000) (<= deposit 5000)) (* deposit 1.045)]
            [(> deposit 5000) (* deposit 1.05)]))

        (define (interest deposit)
          (if (< deposit 1001) (* 0.04 deposit)
              (if (< deposit 5001) (* 0.045 deposit)
                  (* 0.05 deposit))))

        (define (what-kind a b c)
          (cond ((= 0 a) 'degenerate)
                (else
                 (cond ((> (* b b)(*(* 4 a) c)) 'two)
                       (else
                        (cond ((= (* b b)(*(* 4 a) c)) 'one)
                              (else 'none)))))));

========================================================================
>>> Names are Important

Sidenote: An important "discovery" in computer science is that we
*don't* need names for every intermediate sub-expression -- for example,
in almost any language we can write the equivalent of:

    s = (-b + sqrt(b^2 - 4*a*c)) / 2a

instead of

    x = b * b
    y = 4 * a
    y = y * c
    x = x - y
    x = sqrt(x)
    y = -b
    x = y + x
    y = 2 * a
    s = x / y

Such languages are put in contrast to assembly languages, and were all
put under the generic label of "high level languages".

(Here's an interesting idea -- why not do the same for function values?)

========================================================================

The fact that in Scheme we can use functions as values is very useful
-- for example, `map', `foldl' & `foldr', and many more.

Example:

    ;; every?: (A -> Boolean) (Listof A) -> Boolean
    ;; Returns false if any element of lst fails the given pred, true if
    ;; all pass pred.
    (define (every? pred lst)
      (or (null? lst)
          (and (pred (car lst))
               (every? pred (cdr lst)))))

========================================================================
>>> Note on Types

Types can become interesting when dealing with higher-order functions.
For example, `map' receives a function and a list of some type, and
applies the function over this list to accumulate its output, so its
type is:

    ;; map : (A -> B) (Listof A) -> (Listof B)

Actually, `map' can use more than a single list, it will apply the
function on the first element in all lists, then the second and so on.
So the type of `map' with two lists can be described as:

    ;; map : (A B -> C) (Listof A) (Listof B) -> (Listof C)

Here's a hairy example -- what is the type of this function:

    (define (foo x y)
      (map map x y))

Begin with what we know -- both `map's, call them `map1' and `map2',
have the double- and single-list types of `map' respectively, here they
are, with different names for types:

    ;; the first `map', consumes a function and two lists
    map1 : (A B -> C) (Listof A) (Listof B) -> (Listof C)

    ;; the second `map', consumes a function and one list
    map2 : (X -> Y) (Listof X) -> (Listof Y)

Now, we know that `map2' is the first argument to `map1', so the type of
`map1's first argument should be the type of `map2':

    (A B -> C)  =  (X -> Y) (Listof X) -> (Listof Y)

From here we can conclude that

    A = (X -> Y)
    B = (Listof X)
    C = (Listof Y)

If we use these equations in `map1's type, we get:

    map1 : ((X -> Y) (Listof X) -> (Listof Y))
           (Listof (X -> Y))
           (Listof (Listof X))
           -> (Listof (Listof Y))

Now, `foo's two arguments are the 2nd and 3rd arguments of `map1', and
its result is `map1's result, so we can now write the type of `foo':

    ;; foo : (Listof (X -> Y))
    ;;       (Listof (Listof X))
    ;;       -> (Listof (Listof Y))
    (define (foo x y)
      (map map x y))

This should help you understand why, for example, this will cause a type
error:

    (foo (list add1 sub1 add1)
         (list 1 2 3))

and why this is valid:

    (foo (list add1 sub1 add1)
         (map list (list 1 2 3)))

========================================================================
>>> BNF, Grammars, the Simple AE Language

Getting back to the theme of the course: we want to investigate
programming languages, and we want to do that *using* a
programming language.

The first thing when we design a language is to specify the language.
For this we use BNF (Backus-Naur Form).  For example, here is the
definition of a simple arithmetic language:

    <AE> ::= <num>
           | <AE> + <AE>
           | <AE> - <AE>

Explain the different parts.  Specifically, this is a mixture of
low-level (concrete) syntax definition with parsing.

We use this to derive expressions in some language.  We start with
<AE>, which should be one of these:

  * a number <num>
  * an <AE>, the text "+", and another <AE>
  * the same but with "-"

<num> is a terminal: when we reach it in the derivation, we're done.
<AE> is a non-terminal: when we reach it, we have to continue with one
of the options.  It should be clear that the "+" and the "-" are things
we expect to find in the input -- because they are not wrapped in <>s.

We could specify what <num> is (turning it into a non-terminal):

    <NUM> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
            | <NUM> <NUM>

But we don't -- why?  Because in Scheme we have numbers as primitives
and we want to use Scheme to implement our languages.  This makes life
a lot easier, and we get free stuff like floats, rationals etc.

For example, we can use this to prove that "1-2+3" is a valid <AE>
expression (the numbers on the right refer to the three <AE>
alternatives, in order):

    <AE>
    <AE> + <AE>           ; (2)
    <AE> + <num>          ; (1)
    <AE> - <AE> + <num>   ; (3)
    <AE> - <AE> + 3       ; (num)
    <num> - <AE> + 3      ; (1)
    <num> - <num> + 3     ; (1)
    1 - <num> + 3         ; (num)
    1 - 2 + 3             ; (num)

This would be one way of doing this.  Instead, we can visualize the
derivation using a tree, with the rules used at every node.

(Leave this on -- later show that this removes some confusion but not
all.)

These specifications suffer from being ambiguous: an expression can be
derived in multiple ways.  Even the little syntax for a number is
ambiguous -- a number like "123" can be derived in two ways that result
in trees that look different.  This ambiguity is not a "real" problem
now, but it will become one very soon.  We want to get rid of this
ambiguity, so that there is a single (= deterministic) way to derive all
expressions.
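
To get a feeling for why the ambiguity will matter, here is a minimal
sketch in plain Racket.  It assumes that the two derivation trees for
"1-2+3" are written as nested prefix lists, and it uses a hypothetical
helper, `tree-value', that computes the number each tree stands for.
Neither name is part of the course code.

```racket
#lang racket

;; Two derivation trees for the same text "1-2+3", as nested lists.
(define tree-left  '(+ (- 1 2) 3))   ; grouped as (1-2)+3
(define tree-right '(- 1 (+ 2 3)))   ; grouped as 1-(2+3)

;; tree-value : hypothetical helper that computes the value of a tree
(define (tree-value t)
  (cond [(number? t) t]
        [(eq? '+ (first t))
         (+ (tree-value (second t)) (tree-value (third t)))]
        [(eq? '- (first t))
         (- (tree-value (second t)) (tree-value (third t)))]
        [else (error 'tree-value "bad tree: ~s" t)]))

(tree-value tree-left)   ; 2
(tree-value tree-right)  ; -4
```

The same input text leads to two different numbers, so once we attach
meaning to derivations we need exactly one tree per expression.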
There is a standard way to resolve that -- we add another non-terminal
to the definition, and make it so that each rule can continue to exactly
one of its alternatives.  For example, this is what we can do with
numbers:

    <NUM>   ::= <DIGIT> | <DIGIT> <NUM>

    <DIGIT> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Similar solutions can be applied to the <AE> BNF -- we either restrict
the way derivations can happen or we come up with new non-terminals to
force deterministic derivation trees.

As an example of restricting derivations, we look at the current
grammar:

    <AE> ::= <num>
           | <AE> + <AE>
           | <AE> - <AE>

and instead of allowing an <AE> on both sides of the operation, we force
one to be a number:

    <AE> ::= <num>
           | <num> + <AE>
           | <num> - <AE>

Now there is a single way to derive any expression, and it is always
associating operations to the right: an expression like "1+2+3" can only
be derived as "1+(2+3)".  To change this to left-association, we would
use this:

    <AE> ::= <num>
           | <AE> + <num>
           | <AE> - <num>

But what if we want to force precedence?  Say that our AE syntax has
addition and multiplication:

    <AE> ::= <num>
           | <AE> + <AE>
           | <AE> * <AE>

We can do the same thing as above and add new non-terminals -- say one
for "factors":

    <AE>  ::= <num>
            | <AE> + <AE>
            | <FAC>

    <FAC> ::= <num>
            | <FAC> * <FAC>

Now we must parse any AE expression as additions of multiplications (or
numbers).  First, note that if <AE> goes to <FAC> and that goes to
<num>, then there is no need for an <AE> to go to a <num>, so this is
the same syntax:

    <AE>  ::= <AE> + <AE>
            | <FAC>

    <FAC> ::= <num>
            | <FAC> * <FAC>

Now, if we want to still be able to multiply additions, we can force
them to appear in parentheses:

    <AE>  ::= <AE> + <AE>
            | <FAC>

    <FAC> ::= <num>
            | <FAC> * <FAC>
            | ( <AE> )

Next, note that AE is still ambiguous about additions, which can be
fixed by forcing the left hand side of an addition to be a factor:

    <AE>  ::= <FAC> + <AE>
            | <FAC>

    <FAC> ::= <num>
            | <FAC> * <FAC>
            | ( <AE> )

We still have an ambiguity for multiplications, so we do the same thing
and add another non-terminal for "atoms":

    <AE>   ::= <FAC> + <AE>
             | <FAC>

    <FAC>  ::= <ATOM> * <FAC>
             | <ATOM>

    <ATOM> ::= <num>
             | ( <AE> )

And you can try to derive several expressions to be convinced that
derivation is always deterministic now.
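
One way to convince yourself is to turn the last grammar into code.
The following is only a sketch in plain Racket, under a few
assumptions that are not part of the notes: the input is already a
list of tokens (numbers, the symbols `+' and `*', and the made-up
symbols `open' and `close' standing for the two parentheses), and
trees are built as nested prefix lists.  It uses `match', which is
described later in these notes.  There is one function per
non-terminal, and each one has at most one way to continue.

```racket
#lang racket

;; <ATOM> ::= <num> | ( <AE> )
;; Each function returns two values: the tree and the unused tokens.
(define (parse-atom toks)
  (match toks
    [(cons (? number? n) more) (values n more)]
    [(cons 'open more)
     (let-values ([(ae more2) (parse-ae more)])
       (match more2
         [(cons 'close more3) (values ae more3)]
         [_ (error 'parse-atom "missing close paren before ~s" more2)]))]
    [_ (error 'parse-atom "bad syntax in ~s" toks)]))

;; <FAC> ::= <ATOM> * <FAC> | <ATOM>
(define (parse-fac toks)
  (let-values ([(atom more) (parse-atom toks)])
    (match more
      [(cons '* more2)
       (let-values ([(fac more3) (parse-fac more2)])
         (values (list '* atom fac) more3))]
      [_ (values atom more)])))

;; <AE> ::= <FAC> + <AE> | <FAC>
(define (parse-ae toks)
  (let-values ([(fac more) (parse-fac toks)])
    (match more
      [(cons '+ more2)
       (let-values ([(ae more3) (parse-ae more2)])
         (values (list '+ fac ae) more3))]
      [_ (values fac more)])))

;; parse-tokens : hypothetical entry point, insists on using all tokens
(define (parse-tokens toks)
  (let-values ([(ae more) (parse-ae toks)])
    (if (null? more)
      ae
      (error 'parse-tokens "leftover tokens: ~s" more))))

(parse-tokens '(1 + 2 * 3))            ; (+ 1 (* 2 3))
(parse-tokens '(open 1 + 2 close * 3)) ; (* (+ 1 2) 3)
```

Note how the multiplication ends up deeper in the tree than the
addition unless parentheses say otherwise, which is the precedence that
the grammar was designed to force.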
But as you can see, this is exactly the cosmetics that we want to avoid
-- it will lead us to things that might be interesting, but unrelated to
the principles behind programming languages.  It will also become much
much worse when we have a real language rather than such a tiny one.

Is there a good solution? -- It is right in our face: do what Scheme
does -- always use fully parenthesized expressions:

    <AE> ::= <num>
           | ( <AE> + <AE> )
           | ( <AE> - <AE> )

To prevent confusing Scheme code with code in our language(s), we also
change the parentheses to curly ones:

    <AE> ::= <num>
           | { <AE> + <AE> }
           | { <AE> - <AE> }

But in Scheme *everything* has a value -- including those `+'s and
`-'s, which makes this extremely convenient with future operations that
might have either more or fewer arguments than 2 as well as treating
these arithmetic operators as plain functions.  In our toy language we
will not do this initially (that is, `+' and `-' are second order
operators: they cannot be used as values), but since we will get to it
later, we'll adopt the Scheme solution and use a fully-parenthesized
prefix notation:

    <AE> ::= <num>
           | { + <AE> <AE> }
           | { - <AE> <AE> }

(Remember that in a sense, Scheme code is written in a form of
already-parsed syntax...)

========================================================================
>>> Simple Parsing

Implementing a "parser"

Unrelated to what the syntax actually looks like, we want to parse it as
soon as possible -- converting the concrete syntax to an abstract syntax
tree.

No matter how we write our syntax:

  - 3+4     (infix),
  - 3 4 +   (postfix),
  - +(3,4)  (prefix with args in parens),
  - (+ 3 4) (parenthesized prefix),

we always mean the same abstract thing -- adding the number 3 and the
number 4.  The essence of this is basically a tree structure with an
addition operation as the root and two leaves holding the two numerals.
With the right data definition, we can describe this in Scheme as the
expression

    (Add (Num 3) (Num 4))

Similarly, the expression (3-4)+7 will be described in Scheme as the
expression:

    (Add (Sub (Num 3) (Num 4)) (Num 7))

Important note: "expression" was used in two *different* ways in the
above -- each way corresponds to a different language.

To define the data type and the necessary constructors we will use this:

    (define-type AE
      [Num (n Number)]
      [Add (lhs AE) (rhs AE)]
      [Sub (lhs AE) (rhs AE)])

  * Note -- Scheme follows the tradition of Lisp which makes syntax
    issues almost negligible -- the language we use is almost as if we
    are using the parse tree directly.  Actually, it is a very simple
    syntax for parse trees, one that makes parsing extremely easy.

    (This has an interesting historical reason...  Some Lisp history --
    M-expressions vs. S-expressions, and the fact that we write code
    that is isomorphic to an AST.  Later we will see some of the
    advantages that we get by doing this.)

To keep things simple, we will use the above fact through a two-level
approach:

  * we first "parse" our language into an intermediate representation
    -- a Scheme list -- this is mostly done by a modified version of
    Scheme's `read' that uses curly braces "{}"s instead of round
    parens "()"s,

  * then we write our own `parse' function that will parse the
    resulting list into an instance of the AE type -- an abstract
    syntax tree (AST).

This is achieved by the following simple recursive function:

    (: parse-sexpr : Sexpr -> AE)
    ;; to convert s-expressions into AEs
    (define (parse-sexpr sexpr)
      (cond [(number? sexpr) (Num sexpr)]
            [(and (list? sexpr) (= 3 (length sexpr)))
             (let ([make-node
                    (match (first sexpr)
                      ['+ Add]
                      ['- Sub]
                      [else (error 'parse-sexpr "don't know about ~s"
                                   (first sexpr))])
                    #| the above is the same as:
                    (cond [(equal? '+ (first sexpr)) Add]
                          [(equal?
                                   '- (first sexpr)) Sub]
                          [else (error 'parse-sexpr "don't know about ~s"
                                       (first sexpr))])
                    |#])
               (make-node (parse-sexpr (second sexpr))
                          (parse-sexpr (third sexpr))))]
            [else (error 'parse-sexpr "bad syntax in ~s" sexpr)]))

This function is pretty simple, but as our languages grow, it will
become more verbose and more difficult to write.  So, instead, we use a
new special form: `match', which matches a value against patterns and
binds new identifiers to its different parts (try it with "Check
Syntax").  Re-writing the above code using `match':

    (: parse-sexpr : Sexpr -> AE)
    ;; to convert s-expressions into AEs
    (define (parse-sexpr sexpr)
      (match sexpr
        [(number: n) (Num n)]
        [(list '+ left right)
         (Add (parse-sexpr left) (parse-sexpr right))]
        [(list '- left right)
         (Sub (parse-sexpr left) (parse-sexpr right))]
        [else (error 'parse-sexpr "bad syntax in ~s" sexpr)]))

To make things less confusing, we will combine this with the function
that parses a string into a sexpr so we can use strings to represent our
programs:

    (: parse : String -> AE)
    ;; parses a string containing an AE expression to an AE
    (define (parse str)
      (parse-sexpr (string->sexpr str)))

========================================================================
>>> The `match' Form

The syntax for `match' is

    (match value
      [pattern result-expr]
      ...)

The value is matched against each pattern in order, possibly binding
names in the process, and the result expression of the first pattern
that matches is evaluated.

The simplest form of a pattern is simply an identifier -- it always
matches and binds that identifier to the value:

    (match (list 1 2 3)
      [x x]) ; evaluates to the list

Another simple pattern is a quoted symbol, which matches that symbol.
For example:

    (match foo
      ['x "yes"]
      [else "no"])

will evaluate to "yes" if `foo' is the symbol `x', and to "no"
otherwise.  Note that `else' is not a keyword here -- it happens to be
a pattern that always succeeds, so it behaves like an else clause except
that it binds `else' to the unmatched-so-far value.
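
The `else' binding can be observed directly.  The following is a small
sketch that assumes plain Racket's `match' (from `#lang racket') rather
than the course language; `classify' is a made-up name used only for
this demonstration.

```racket
#lang racket

;; classify : hypothetical demo function
;; In the last clause `else' is an ordinary identifier pattern, so it
;; is bound to the value being matched and can be used in the result.
(define (classify v)
  (match v
    ['x "yes"]
    [else (format "no: ~s" else)]))

(classify 'x)   ; "yes"
(classify 42)   ; "no: 42"
```

A practical consequence is that the order of clauses matters: any
clause that comes after an identifier pattern can never be reached.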
Many patterns look like function application -- but don't confuse them
with applications.  A `(list x y z)' pattern matches a list of exactly
three items and binds the three identifiers; or if the "arguments" are
themselves patterns, `match' will descend into the values and match them
too.  This means that the patterns can be nested:

    (match (list 1 2 3)
      [(list x y z) (+ x y z)]) ; evaluates to 6

    (match '((1) (2) 3)
      [(list (list x) (list y) z) (+ x y z)]) ; evaluates to 6

There is also a `cons' pattern that matches a non-empty list and then
matches the first part against the head of the list and the second part
against the tail of the list.

In a `list' pattern, you can use `...' to specify that the previous
pattern is repeated zero or more times, and bound names get bound to the
list of respective matches.  One simple consequence is that the
`(list hd tl ...)' pattern is exactly the same as `(cons hd tl)', but
being able to repeat an arbitrary pattern is very useful:

    (match '((1 2) (3 4) (5 6) (7 8))
      [(list (list x y) ...) (append x y)])
    ; evaluates to (1 3 5 7 2 4 6 8)

A few more useful patterns:

    id              -- matches anything, binds `id' to it
    _               -- matches anything, but does not bind
    (number: n)     -- matches any number and binds it to `n'
    (symbol: s)     -- same for symbols
    (string: s)     -- strings
    (sexpr: s)      -- S-expressions (needed sometimes for Typed Scheme)
    (and pat1 pat2) -- matches both patterns
    (or pat1 pat2)  -- matches either pattern (careful with bindings)

If no pattern matches the value, an error is raised.

Here is a quick example for how `or' is used with literal alternatives,
how `and' is used to name a specific piece of data, and how `or' is
used with a binding:

    > (define (foo x)
        (match x
          [(list (or 1 2 3)) 'single]
          [(list (and x (list 1 _)) 2) x]
          [(or (list 1 x) (list 2 x)) x]))
    > (foo (list 3))
    single
    > (foo (list (list 1 99) 2))
    (1 99)
    > (foo (list 1 10))
    10
    > (foo (list 2 10))
    10

========================================================================
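
To experiment with these patterns outside of the course language, here
is a sketch of `parse-sexpr' in plain Racket.  It makes several
substitutions that are assumptions of this sketch and not the course
code: transparent structs stand in for the `define-type' definition of
AE, the stock `(? number? n)' pattern replaces `(number: n)', `_'
replaces `else' in the last clause, and the type declaration is
dropped.

```racket
#lang racket

;; Plain structs standing in for the course's `define-type' AE.
;; #:transparent makes instances print and compare structurally.
(struct Num (n)       #:transparent)
(struct Add (lhs rhs) #:transparent)
(struct Sub (lhs rhs) #:transparent)

;; parse-sexpr : converts an s-expression into an AE struct tree
(define (parse-sexpr sexpr)
  (match sexpr
    [(? number? n) (Num n)]
    [(list '+ left right)
     (Add (parse-sexpr left) (parse-sexpr right))]
    [(list '- left right)
     (Sub (parse-sexpr left) (parse-sexpr right))]
    [_ (error 'parse-sexpr "bad syntax in ~s" sexpr)]))

(parse-sexpr '(+ (- 3 4) 7))
;; evaluates to (Add (Sub (Num 3) (Num 4)) (Num 7))
```

Anything that is not a number or a three-element list starting with
`+' or `-' falls through to the last clause and raises the "bad syntax"
error.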