By the beginning of the 20th century, mathematicians found that they had a yuuge problem: the attempt to rigorously formulate the foundations of mathematics had failed miserably. The very concept of a set (in the so-called naive set theory) was found to be self-contradictory, as demonstrated by Bertrand Russell's paradox. The problem comes from the way sets were defined: as collections of objects satisfying some condition (usually written in set-builder notation), and this condition could refer to the very set being defined. Obviously, the way the concept of a set was defined needed to be made more restrictive in order to avoid variants of Russell's and other paradoxes.
One approach (but not the only one) was to coerce mathematical objects into hierarchies (i.e. we may have boojums, sets of boojums, sets of sets of boojums, sets of pairs of boojums, etc.), thus preventing self-reference. The mathematical formulation of this became known as type theory.
Type theory turned out to be a nice fit for λ-calculus, and so it was only natural to design programming languages that incorporate it as a basic programming concept: something we now call "records", "objects", "structs", or "classes". (The more modern object-oriented formulations also throw in some syntactic sugar in the form of member methods.)
It is important to understand that the purpose of type theory is to eliminate the paradoxes that arise when we are overly clever about the way we specify predicates in our set builders. The use of types in programming has nothing to do with that: when writing programs we are not at all concerned with the correctness of all possible programs (i.e. with ensuring that every program produces a result instead of some of them looping indefinitely).
We cannot even say that the types we use in programming are free of self-reference: there are common cases where we must have circular dependencies between types. That's why we have forward declarations.
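A minimal C++ sketch of such a circular dependency (the Department/Employee names are purely illustrative):

```cpp
#include <vector>

struct Employee;  // forward declaration: breaks the circular dependency

struct Department {
    std::vector<Employee*> members;  // Department refers to Employee...
};

struct Employee {
    Department* dept = nullptr;  // ...and Employee refers back to Department
};
```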
So why is the type-theoretical view of types used so widely in practical programming?
For one, the constructive notion of types (where we build higher-level types out of collections of elements of lower-level types) is quite easy to learn and understand, and is quite intuitive. Secondly, conventional type systems do offer some degree of information hiding and abstraction. And last (and in the author's opinion not least) is horrible inertia: the choice of a readily available mathematical concept was quite reasonable when the computing pioneers were inventing the first programming languages. Since then the shortcomings of that choice have become painfully obvious, to the point that some programming language designers have chosen to dismiss the whole idea of static typing completely, notably in the so-called scripting languages, which have recently become the most popular kind of programming language. (This was facilitated by the tremendous growth of computing power, which masks many of the programming sins these languages positively encourage.)
Type theory-inspired type systems are the second most important mechanism in conventional programming for expressing engineering abstraction, after the notion of subroutines (inspired by λ-calculus). And, just as in the case of λ-calculus, type theory provides a conceptual framework that does not match the reality of programming. We are again in spherical-cow territory.
The use of types in programming is not remotely as neat as the theory: common programming languages conflate several very different concepts. Types are used to represent the implementation of value storage and the methods of accessing and modifying values; the constraints on how a value may be used; the parametrization for polymorphism; and the semasiology of values, i.e. their meaning to the programmer. (This somewhat obscure term is chosen to explicitly distinguish the notion of the meaning of a value from the notion of its semantics as used in computer science, which is limited to the computation process itself.)
The implementation aspect of a type in programming languages is a declaration by the programmer that a specific hardware-supported encoding should be used to store and manipulate the values. That's what all those char, int, long, double, etc. declarations are. It quickly became apparent that explicitly specifying the types of all values would be completely impractical, and so all existing high-level programming languages rely on automatic type derivation for intermediate values. The declarations of compound types (arrays, structures, etc.) also mandate the presence (and, in some languages, the specific order) of all their elementary or compound fields in memory; the ordering and placement of the fields may have a dramatic effect on the program's performance because of data caching by CPUs. The implementation specifiers in the types are also used to control where in RAM the value will be stored, be it the heap, the stack, or a statically mapped data segment.
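The effect of field ordering is easy to see in C++: the same fields, declared in a different order, produce structures of different sizes because of alignment padding (the exact sizes below assume a typical 64-bit ABI and are implementation-defined):

```cpp
#include <cstdint>
#include <cstdio>

struct Padded {      // typically 24 bytes on a 64-bit ABI
    char    tag;     // 1 byte, then 7 bytes of padding
    double  value;   // 8 bytes, must be 8-byte aligned
    int32_t count;   // 4 bytes, then 4 bytes of trailing padding
};

struct Reordered {   // typically 16 bytes: the same fields, reordered
    double  value;   // 8 bytes
    int32_t count;   // 4 bytes
    char    tag;     // 1 byte, then 3 bytes of trailing padding
};

int main() {
    std::printf("Padded: %zu bytes, Reordered: %zu bytes\n",
                sizeof(Padded), sizeof(Reordered));
}
```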
So far so good; it all makes a lot of practical sense, but it has no relationship whatsoever to type theory. The implementation aspect of types in programming languages was historically first, with type composition coming only later.
The second important practical aspect of types in programming languages is checking the constraints they place on individual values, and using these constraints to optimize code. The most common constraint is on the kind of values a variable can contain: integer numbers, Boolean values, strings, etc. The second is range specification (usually implicit, though some languages like Pascal allow explicit range specifications). Additionally, the constraints may include access controls (i.e. public vs private members of classes) and whether the value is a constant or can be changed at run time. For array types the constraints may include dimensionality and dimension sizes, and for structures they may manifest as requiring a type identifier match or satisfying parent-child relationship predicates in the case of type inheritance.
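A small C++ sketch of two such constraints, constancy and access control, both rejected at compile time (the Account class is purely illustrative):

```cpp
class Account {
public:
    explicit Account(int opening) : balance_(opening) {}
    int balance() const { return balance_; }  // read-only access path
private:
    int balance_;  // access constraint: invisible outside the class
};

int main() {
    const int limit = 100;   // constancy constraint on the value
    // limit = 200;          // rejected: assignment to a const variable
    Account a(50);
    // a.balance_ = 10;      // rejected: 'balance_' is private
    return a.balance() > limit ? 1 : 0;
}
```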
Constraint checking is important for detecting programming errors: it essentially allows users of abstract interfaces to spend less mental effort and time on making sure that their uses of an interface match the interface definition. This, however, is undermined by the inflexibility of the compositional construction of types in modern programming languages, which usually forces interface designers to compromise between usability and type safety.
Having constraints on the possible types of values that a variable or a subroutine argument may take allows disambiguating between alternative implementations of a subroutine: instead of having the subroutine name reflect the types of its arguments, we can have polymorphic subroutines selected by their signatures (which combine the name and the argument type constraints), as the sketch below illustrates. (The next step after subroutine polymorphism is generic programming with templates, but that is a topic for a future post.)
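In C++ this is function overloading; the print name and main driver here are just for illustration:

```cpp
#include <cstdio>
#include <string>

// One name, several signatures: the compiler picks the implementation
// from the argument types, so names like printInt/printString are not needed.
void print(int n)                { std::printf("int: %d\n", n); }
void print(double x)             { std::printf("double: %f\n", x); }
void print(const std::string& s) { std::printf("string: %s\n", s.c_str()); }

int main() {
    print(42);                    // selects print(int)
    print(3.14);                  // selects print(double)
    print(std::string("hello"));  // selects print(const std::string&)
}
```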
Finally, programmers use types (particularly type names) to indicate what values mean to the programmer. From the computer's point of view, a number of apples is an integer and a number of oranges is an integer, and you can add them with abandon. For a programmer, doing so would in many cases be an error. With some degree of syntactic ugliness and verbosity we could define classes Apples and Oranges which contain the integer counters, define arithmetic operations on these classes, and thus achieve detection of apples-plus-oranges programming errors for these classes. Doing so is usually so burdensome for simple values that nobody bothers, leaving program correctness to luck and prayer. Structures and classes, however, are already named, so this is often used to enforce semasiologic correctness: if the routine placeCall takes an AddressBook as an argument and we try to pass it a value of type BankAccount, the compiler will rightfully complain.
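A hedged C++ sketch of the Apples/Oranges idea described above, wrapping the raw counters in distinct types so that mixing fruits fails to compile:

```cpp
// Distinct wrapper types make apples-plus-oranges a compile-time error.
struct Apples {
    int count;
    Apples operator+(Apples o) const { return Apples{count + o.count}; }
};

struct Oranges {
    int count;
    Oranges operator+(Oranges o) const { return Oranges{count + o.count}; }
};

int main() {
    Apples  a{2}, b{3};
    Oranges c{5};
    Apples  d = a + b;    // fine: apples plus apples
    // Apples e = a + c;  // rejected: no operator+(Apples, Oranges)
    return d.count;
}
```

Note the verbosity: two wrapper structs and two operators just to protect a pair of integer counters, which is exactly why nobody bothers for simple values.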
This type-based support for semasiological constraint checking is quite limited: it mainly works with simple hierarchy-based ontologies, and any attempt to represent a less trivial relationship between types again requires non-trivial hackery. It falls to the programmer to keep track of the meaning of the values to ensure that the program does what the programmer intended.
The boundary between semasiological meaning and program semantics is somewhat blurry, and some aspects of value types traditionally considered outside the scope of the programming language do affect computation. For example, transposing a large matrix in memory is an expensive operation, so we may want the compiler to perform transposition elimination by changing the iteration order in downstream matrix operations. Another example is the sortedness of an array (complicated by the fact that an order is always associated with a specific predicate). There is no natural way to express something like "this array of integers is sorted in ascending order" in a conventional programming language, so nobody does it, preferring to keep that information in one's head (and, depending on the professionalism of the coder, in comments).
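The closest a conventional language comes is smuggling the property into a type name: a wrapper whose constructor is the only way to obtain a "sorted" value. A C++ sketch of this workaround (SortedAscending and contains are hypothetical names); note that the invariant is established by the constructor, not proven by the compiler:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// "Sorted in ascending order" lives only in the type name; the
// constructor establishes the invariant, the compiler takes it on faith.
class SortedAscending {
public:
    explicit SortedAscending(std::vector<int> v) : data_(std::move(v)) {
        std::sort(data_.begin(), data_.end());  // establish the invariant
    }
    const std::vector<int>& get() const { return data_; }
private:
    std::vector<int> data_;  // private, so the invariant cannot be broken
};

// Downstream code can now demand sortedness in its signature:
bool contains(const SortedAscending& xs, int key) {
    return std::binary_search(xs.get().begin(), xs.get().end(), key);
}
```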
The conflation of these completely different aspects of types is one of the reasons why modern programming seems to be stuck at a rather low abstraction ceiling. The existence of this problem is recognized, but fixing it within a single type system seems to be elusive, to the point that there are proposals (notably by Gilad Bracha) to ditch the built-in rigid type system and replace it with optional, pluggable domain-specific type systems.
It may be more fruitful to stop treating types as strictly composable objects, type theory-style, and to start looking at them as arbitrary annotations representing constraints which can be (at least partially) computed at compile time. We will return to this in future posts, after the discussion of generic programming provides some background.