2010-04-16

  - Typing Data, Type Soundness
  - Type soundness
  - Explicit polymorphism

========================================================================
>>> Typing Data, Type Soundness

An important concept that we have avoided so far is user-defined types.
This issue exists in any language, including the ones we did so far, but
it is even more important in a typed context. Specifically, we talked
about typing recursive code, but we could also consider typing recursive
data. For example, consider a `length' function in an extension of the
language that has `empty?' and `rest':

  {rec {length : ???
         {fun {l : ???} : Number
           {if {empty? l}
             0
             {+ 1 {call length {rest l}}}}}}
    {call length {NumCons 1 {NumCons 2 {NumCons 3 NumEmpty}}}}}

Since adding all of these new functions as built-ins is getting messy,
we want our language to have a form for defining new kinds of data. In
this example, we want to be able to define the `NumList' type for lists
of numbers. So we extend the language with a new `with-type' form that
defines new types, using variants in a way that is similar to our own
language:

  {with-type {NumList [NumEmpty]
                      [NumCons {fst : Number}
                               {rst : ???}]}
    {rec {length : (NumList -> Number)
           {fun {l : NumList} : Number
             ...}}
      ...}}

We assume here that the `NumList' definition provides us with a number
of "new builtins" -- the `NumEmpty' and `NumCons' constructors -- and we
assume also a `cases' form that can be used to both test a value and
access its components (with the constructors serving as patterns).

The question is what the "???" in the above should be filled with.
Clearly, recursive data types are very common and we need to support
them. Therefore, the scope of `with-type' should be similar to `rec',
except that it works at the type level: the new type is available for
its own definition.
This is the complete code now:

  {with-type {NumList [NumEmpty]
                      [NumCons {fst : Number}
                               {rst : NumList}]}
    {rec {length : (NumList -> Number)
           {fun {l : NumList} : Number
             {cases l
               {{NumEmpty} 0}
               {{NumCons x r} {+ 1 {call length r}}}}}}
      {call length {NumCons 1 {NumCons 2 {NumCons 3 {NumEmpty}}}}}}}

(Note that in the course language we could do just that, and in
addition, the `Rec' type constructor can be used to make up recursive
types.)

An important property of this type is that it is "well founded": you
don't get stuck in some kind of type-level infinite loop. To see that
this holds in this example, note that some of the variants are
self-referential (`NumCons'), but at least one of them is not
(`NumEmpty') -- if there wasn't any simple variant, then we would have
no way to construct instances of this type to begin with!

Consider also the case of a lazy language, where a type with no simple
variant can still have instances, for example:

  {with-type {NumList [NumCons {fst : Number}
                               {rst : NumList}]}
    {rec {ones : NumList {NumCons 1 ones}}
      ...}}

========================================================================
>> Judgments for Recursive Types

If we want to have a language that is basically similar to the language
that we use for implementing evaluators, then -- as seen above -- we'd
use a similar `cases' expression. The question now is how we would
type-check such expressions. In this case, we want to verify this:

  G |- {cases l {{NumEmpty} 0} {{NumCons x r} {+ 1 {length r}}}}
     : Number

Similarly to the judgment for `if' expressions, we need to require that
the two result expressions are numbers.

  G |- 0 : Number
  G |- {+ 1 {length r}} : Number
  --------------------------------------------------------
  G |- {cases l {{NumEmpty} 0} {{NumCons x r} {+ 1 {length r}}}}
     : Number

But this will not work -- we have no type for `r' here, so we can't
prove the second subgoal.
We need to consider the `NumList' type definition, or more
specifically, the `NumCons' variant -- and from there, we know that
{NumCons x r} is a pattern that matches `NumList' values that are a
result of this variant constructor *and* that it binds `x' and `r' to
the two fields, which have the declared types. This means that we need
to extend G in this rule so we're able to prove the two subgoals:

  G |- 0 : Number
  G[x:=Number; r:=NumList] |- {+ 1 {length r}} : Number
  --------------------------------------------------------
  G |- {cases l {{NumEmpty} 0} {{NumCons x r} {+ 1 {length r}}}}
     : Number

The last bit that we're missing here is that we can actually use `l'
for this expression. In this specific case, we need to know that it is
a `NumList':

  G |- l : NumList
  G |- 0 : Number
  G[x:=Number; r:=NumList] |- {+ 1 {length r}} : Number
  --------------------------------------------------------
  G |- {cases l {{NumEmpty} 0} {{NumCons x r} {+ 1 {length r}}}}
     : Number

But why `NumList' and not some other defined type? This judgment is
therefore doing a little more work: it will "look" at the variants that
are mentioned in the branches, find the type that defines them, then use
that type as the subgoal. Furthermore, to make the type checker more
useful, it can check that we have complete coverage of the variants, and
that no variant is used twice:

  G |- l : NumList
  (also need to show that NumEmpty and NumCons are all of
   the variants of NumList, with no repetition.)
  G |- 0 : Number
  G[x:=Number; r:=NumList] |- {+ 1 {length r}} : Number
  --------------------------------------------------------
  G |- {cases l {{NumEmpty} 0} {{NumCons x r} {+ 1 {length r}}}}
     : Number

Note that this is a different route than the one taken in the textbook
-- in there, there is a `type-case' expression with the type name
mentioned explicitly -- for example:
{type-case l NumList {{NumEmpty} 0} ...}. This is essentially the same
as having each defined type come with its own `cases' expression.
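The bookkeeping that this rule performs can be sketched in code. The following is only a minimal Python sketch under stated assumptions, not the course's implementation: the names `TYPE_DEFS`, `find_type_of_variants`, and `check_cases`, and the dictionary representation of type definitions, are all made up for illustration. To keep it short, each branch body is represented by a function from the extended environment to the body's type, standing in for a recursive call to the type checker.

```python
# Hypothetical table of type definitions:
#   type name -> variant name -> list of field types
TYPE_DEFS = {
    "NumList": {
        "NumEmpty": [],
        "NumCons": ["Number", "NumList"],
    },
}

def find_type_of_variants(variant_names):
    """Find the defined type whose variants are exactly the given names."""
    if len(set(variant_names)) != len(variant_names):
        raise TypeError("a variant is used twice")
    for type_name, variants in TYPE_DEFS.items():
        if set(variants) == set(variant_names):
            return type_name
    raise TypeError("branches do not cover the variants of any defined type")

def check_cases(env, scrutinee_type, branches):
    """Type-check a `cases` expression.

    `branches` is a list of (variant, binders, body_checker) triples, where
    body_checker(env) returns the type of the branch body in `env`.
    """
    type_name = find_type_of_variants([v for v, _, _ in branches])
    # Subgoal: G |- l : NumList
    if scrutinee_type != type_name:
        raise TypeError(f"expected {type_name}, got {scrutinee_type}")
    result_types = set()
    for variant, binders, body_checker in branches:
        field_types = TYPE_DEFS[type_name][variant]
        if len(binders) != len(field_types):
            raise TypeError(f"wrong number of fields for {variant}")
        # Extend G with the pattern variables: G[x:=Number; r:=NumList]
        extended = {**env, **dict(zip(binders, field_types))}
        result_types.add(body_checker(extended))
    if len(result_types) != 1:
        raise TypeError("branches have different types")
    return result_types.pop()

# The body of `length`:
#   {cases l {{NumEmpty} 0} {{NumCons x r} {+ 1 {length r}}}}
def cons_body(env):
    # {+ 1 {length r}} is a Number only if `r` is a NumList
    if env["r"] != "NumList":
        raise TypeError("length expects a NumList")
    return "Number"

length_cases_type = check_cases(
    {"l": "NumList"},
    "NumList",
    [("NumEmpty", [], lambda env: "Number"),
     ("NumCons", ["x", "r"], cons_body)],
)
```

In this sketch, leaving out the `NumCons` branch or listing `NumEmpty` twice raises a `TypeError` before any branch body is examined, which corresponds to the coverage side condition in the rule.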
Our rule needs to do a little more work, but overall it is a little
easier to use.

Note about representation: a by-product of our type checker is that
whenever we have a `NumList' value, we know that it *must* be an
instance of either `NumEmpty' or `NumCons'. Therefore, we could
represent such values as a wrapped value container, with a single bit
that distinguishes the two. This is in contrast to (untyped) Scheme,
where we always need to distinguish all values.

========================================================================
>> "Runaway" Instances

Consider this code:

  {with-type {NumList [NumEmpty] ...}
    {NumEmpty}}

We now know how to type check it, but what about the type of this whole
expression? The obvious choice would be `NumList':

  {with-type {NumList [NumEmpty] ...}
    {NumEmpty}}
  : NumList

But there is a subtle problem here: the expression evaluates to a
`NumList', but we can no longer use this value, since we're out of the
scope of the `NumList' type definition! In other words, we would
typecheck a program that is pretty much useless.

Even if we were to allow such a value to flow to a different context
with a `NumList' type definition, we wouldn't want the two to be
confused -- following the principle of lexical scope, we'd want each
type definition to be unique to its own scope even if it has the same
concrete name. (In fact, we might want to have a new type even if the
value goes outside of this scope and back in. Struct definitions in PLT
Scheme have exactly this property -- they're "generative" -- which means
that each "call" to `define-struct' creates a new type, so:

  (define (two-foos)
    (define (foo x)
      (define-struct foo (x))
      (make-foo x))
    (list (foo 1) (foo 2)))

returns two instances of two *different* `foo' types!)
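The same generative behavior can be observed in other languages. As a rough analogy only (Python is dynamically typed, and the names `make_foo`, `a`, and `b` are made up for this sketch), a class definition inside a function body creates a new class on every call:

```python
def make_foo(x):
    # The class statement is executed on every call, so each call
    # creates a brand new class that happens to be named `Foo`.
    class Foo:
        def __init__(self, x):
            self.x = x
    return Foo(x)

a = make_foo(1)
b = make_foo(2)

# Both classes carry the same concrete name ...
same_name = type(a).__name__ == type(b).__name__
# ... but they are two distinct class objects,
same_type = type(a) is type(b)
# so `b` is not an instance of the class that `a` belongs to.
b_is_a_foo = isinstance(b, type(a))
```

Here `same_name` is true while `same_type` and `b_is_a_foo` are both false: the two values have types that share a name and nothing else.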

One way to resolve this is to just forbid the type from escaping the
scope of its definition -- so we would forbid the type of the expression
from being `NumList', which makes

  {with-type {NumList [NumEmpty] ...}
    {NumEmpty}}
  : NumList

invalid. But that's not enough -- what about returning a compound value
that *contains* an instance of `NumList'? For example -- what if we
return a vector holding a `NumList' instance? Obviously, we would need
to extend this restriction: the resulting type should not mention the
defined type *at all* -- not even in a vector. This is actually easy to
do: if the overall expression is type-checked in the surrounding lexical
scope, then it is type-checked in the surrounding type environment (G),
and that environment has nothing in it about `NumList'.

Note that this is, very roughly speaking, what our course language does:
`define-type' can only define new types when it is used at the
top-level.

This works fine with the above assumption that such a value would be
completely useless -- but there are aspects of such values that can be
used. Such types are close to things that are known as "existential
types", and they can be useful to define opaque values that you can do
nothing with except pass them around, and only code in a specific
lexical context can actually use them. For example, you could lump
together the value with a function that can work on this value. If it
wasn't for the `define-type' top-level restriction, we could write the
following:

  (: foo : Integer -> (List ??? (??? -> Integer)))
  (define (foo x)
    (define-type FOO
      [Foo (n Integer)])
    (list (Foo x)
          (lambda (f)
            (cases f
              [(Foo n) (* n n)]))))

There is nothing that we can do with the resulting `Foo' instance (we
don't even have a way to name its type) -- but in the result of the
above function we also get a function that can work on such values, even
ones from different calls:

  ((second (foo 1)) (first (foo 2))) -> 4

Since values of this kind are related to hiding information, they're
useful (among other things) when talking about module systems, where you
want to have a local scope for a piece of code with bindings that are
not available outside it.

========================================================================
>>> Type soundness

Having a type checker is obviously very useful -- but to be able to rely
on it, we need to provide some kind of a formal account of the kind of
guarantees that we get by using one. Specifically, we want a program
that type-checks to be guaranteed to never fail with a type error. (Such
type errors in Scheme result in an exception -- but in C they can result
in anything!) In this context we have a specific meaning for "fail with
a type error", but these failures can be very different based on the
kind of properties that your type checker verifies.

This property of a type system is called "soundness": a *sound* type
system is one that will never allow such errors for type-checked code:

  For any program `p', if we can type-check `p : t', then `p' will
  evaluate to a value that has type `t'.

The importance of this can be seen in that it is the *only* connection
between the type system and code execution. Without it, a type system is
a bunch of syntactic rules that are completely disconnected from how the
program runs.

But this statement isn't exactly what we need -- it states a property
that is too strong: what if execution gets stuck in an infinite loop?
We need to revise it:

  For any program `p', if we can type-check `p : t', and if `p'
  terminates and returns `v', then `v' has type `t'.

But there are still problems with this. Some programs evaluate to a
value, some get stuck in an infinite loop, and some ... throw an error.
Even with type checking, there are still cases when we get runtime
errors. For example, in practically all statically typed languages the
length of a list is not encoded in its type, so {first null} would throw
an error. (It's possible to encode more information like that in types,
but there is a downside to this too: putting more information in the
type system means that things can get less flexible, and/or it becomes
more difficult to write programs since you're moving towards proving
more facts about them.)

Even if we were to encode list lengths in the type, we would still have
runtime errors: opening a missing file, writing to a read-only file,
fetching a non-existent url, etc, so we must find some way to account
for these errors. Some "solutions" are:

* For all cases where an error should be raised, just return some value
  (of the appropriate type). For example, (first l) could return 0 if
  the list is empty; (substring "foo" 10 20) would return "huh?", etc.
  It seems like a dangerous way to resolve the issue, but in fact that's
  what most C library calls do: return some bogus value (for example,
  malloc() returns NULL when there is no available memory), and possibly
  set some global flag that specifies the exact error. (The main problem
  with this is that C programmers often don't check all of these
  conditions, leading to propagating undetected errors further down --
  and all of this is a very rich source of security issues.)

* For all cases where an error should be raised, just get stuck in an
  infinite loop. This approach is obviously impractical -- but it is
  actually popular in some theoretical circles. The reason for that is
  that theory people will often talk about "domains", and to express
  facts about computation on these domains, they're extended with a
  "bottom" value that represents a diverging computation. Since
  introducing such a value is costly in terms of the work that it
  requires, adding one more such value for errors takes more effort than
  re-using the same "bottom" value.

* Raise an exception. This works out better than the above two
  extremes, and it is the approach taken by practically all modern
  languages.

So, assuming exceptions, we need to further refine what it means for a
type system to be sound:

  For any program `p', if we can type-check `p : t', and if `p'
  terminates without exceptions and returns `v', then `v' has type `t'.

An important thing to note here is that languages can have very
different ideas about where to raise an exception. For example, Scheme
implementations often have a trivial type-checker and throw runtime
exceptions when there is a type error. On the other hand, there are
systems that express much more in their type system, leaving less room
for runtime exceptions.

A soundness proof ties together a particular type system with the
statement that it is sound. As such, it is where you tie the knot
between type checking (which happens at the syntactic level) and
execution (dealing with runtime values). These are two things that are
usually separate -- we've seen throughout the course many examples of
things that could be done only at runtime, and things that should happen
completely at the syntax level. `eval' is the important "semantic
function" that connects the two worlds (`compile' also did this, when we
converted our evaluator to a compiler) -- and in here, it is the
soundness proof that makes this connection.

To demonstrate the kind of differences between the two sides, consider
an `if' expression -- when it is executed, only one branch is evaluated,
and the other is irrelevant, but when we check its type, *both* sides
need to be verified. The same goes for a function whose execution gets
stuck in an infinite loop: the type checker will not get into this loop
since it is not executing the code, it only scans the (finite) syntax.

The bottom line here is that type soundness is really a claim that the
type system provides some guarantees about the runtime behavior of
programs, and its proof demonstrates that these guarantees do hold. A
fundamental problem with the type system of C and C++ is that it is not
sound: these languages *have* a type system, but it does not provide
such runtime guarantees. (In fact, C is even worse in that it really has
two type systems: there is the system that C programmers usually
interact with, which has a conventional set of types -- including even
higher-order function types; and there are the machine-level types,
which talk only about various bit lengths of data. For example, using
"%s" in a printf() format string will blindly copy characters from the
address pointed to by the argument until it reaches a 0 character --
even if the actual argument is really a floating point number or a
function.)

Note that people often talk about "strongly typed languages". This term
is often meaningless in that different people take it to mean different
things: it is sometimes used for a language that "has a static type
checker", or a language that "has a non-trivial type checker", and
sometimes it means that a language has a sound type system. For most
people, however, it means some vague idea like "a language like C or
Pascal or Java", without a more concrete definition of the term.
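
The difference between checking and running an `if' expression that was described in this section can be seen in a small sketch. This is only an illustration under stated assumptions: the tuple representation of expressions and the names `typecheck`, `evaluate`, and `type_of_value` are made up here, and the language has just numbers, booleans, `+', and `if'.

```python
def typecheck(expr):
    """Return "Number" or "Boolean", or raise TypeError."""
    tag = expr[0]
    if tag == "num":
        return "Number"
    if tag == "bool":
        return "Boolean"
    if tag == "+":
        if typecheck(expr[1]) != "Number" or typecheck(expr[2]) != "Number":
            raise TypeError("+ expects two numbers")
        return "Number"
    if tag == "if":
        _, cond, then, other = expr
        if typecheck(cond) != "Boolean":
            raise TypeError("if expects a boolean condition")
        # The checker always looks at *both* branches.
        then_type, else_type = typecheck(then), typecheck(other)
        if then_type != else_type:
            raise TypeError("if branches must have the same type")
        return then_type
    raise TypeError(f"unknown expression: {tag}")

def evaluate(expr):
    tag = expr[0]
    if tag in ("num", "bool"):
        return expr[1]
    if tag == "+":
        return evaluate(expr[1]) + evaluate(expr[2])
    if tag == "if":
        _, cond, then, other = expr
        # The evaluator runs only *one* branch.
        return evaluate(then) if evaluate(cond) else evaluate(other)
    raise ValueError(f"unknown expression: {tag}")

def type_of_value(value):
    # bool is tested first: in Python, True and False are also ints
    return "Boolean" if isinstance(value, bool) else "Number"

good = ("if", ("bool", True), ("num", 1), ("+", ("num", 2), ("num", 3)))
bad = ("if", ("bool", True), ("num", 1), ("+", ("num", 2), ("bool", False)))

# For a well-typed program, the value has the type that was predicted.
good_type = typecheck(good)
good_value = evaluate(good)
sound_here = type_of_value(good_value) == good_type

# `bad` runs without trouble, since its ill-typed branch is never
# evaluated ...
bad_value = evaluate(bad)
# ... but the type checker rejects it, because it inspects both branches.
try:
    typecheck(bad)
    bad_rejected = False
except TypeError:
    bad_rejected = True
```

The `sound_here' check is the soundness statement tested on a single program; a soundness proof establishes it for every program that the checker accepts.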

========================================================================
>>> Explicit polymorphism

(Given from the book)

========================================================================