Biological life on Earth is astonishing. The planet is full of living organisms of incredible complexity and robustness. Just some numbers: on average a human body contains about 30 trillion cells (gut flora adds roughly 40 trillion more), 86 billion of which are neurons forming some 150 trillion connections; besides humans, there are an estimated 1 trillion species currently living on Earth, not counting not-exactly-living things like viruses. Living organisms are constantly defending themselves both from directed attacks (by pathogens, predators, and such) and from impressive levels of damage caused by environmental factors and by-products of their own biochemical machinery.
What is truly impressive is that all that panoply is encoded in essentially the same programming language, aka the DNA/RNA code. Even more impressive is that this programming language is both very low-level (it is basically a biochemical machine code) and very compact: a full human genome contains about 3.08 billion base pairs, each encoding 2 bits, or roughly 735 MiB total - about a CD-ROM's worth! Of that, only about 2% actually encodes proteins; the rest is non-coding DNA, about half of which is junk (remnants of retroviruses, transposons, and the like). The other non-coding DNA usually has some biological function, mostly as modifiers of the expression of coding regions, but the information density in non-coding regions is not high: the genomes of multicellular organisms compress by about 75% with lossless compressors (and by about 99% with lossy probabilistic compressors such as GReEn).
A back-of-the-envelope estimate of the useful information content of human DNA would be about 20-25 MiB, and the total number of human genes is about 20,000, which works out to roughly 1 KiB per gene on average. Conventional programming languages like C++ average about 30 bytes per line of code, so this would be equivalent to about 700-800 KLOC - something a smallish group of programmers can write in a few years. And that relatively small amount of code is sufficient to encode an organism which is not only versatile and well-equipped for obtaining and digesting a wide variety of foods and for defending itself from a nearly infinite variety of pathogens, parasites, and predators, but which also has general intelligence sufficient to sustain a secondary, memetic evolution.
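Just to show where these numbers come from, here is the arithmetic spelled out as a small C++ snippet; the 20-25 MiB "useful information" figure and the 30-bytes-per-line average are the rough assumptions made above, not measured values:

```cpp
#include <cstdio>

int main() {
    // Raw genome size: ~3.08 billion base pairs, 2 bits each.
    constexpr double base_pairs = 3.08e9;
    constexpr double raw_mib    = base_pairs * 2.0 / 8.0 / (1 << 20);        // ~735 MiB

    // Assumed "useful" information after discounting junk and redundancy.
    constexpr double useful_mib   = 22.5;      // rough midpoint of the 20-25 MiB estimate
    constexpr double genes        = 20000.0;
    constexpr double kib_per_gene = useful_mib * 1024.0 / genes;             // ~1.15 KiB per gene

    // Equivalent size in lines of C++-like code at ~30 bytes per line.
    constexpr double kloc = useful_mib * 1024.0 * 1024.0 / 30.0 / 1000.0;    // ~785 KLOC

    std::printf("raw: %.0f MiB, per gene: %.2f KiB, equivalent: %.0f KLOC\n",
                raw_mib, kib_per_gene, kloc);
}
```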
Impressive, isn't it? There are certainly lessons to be learned from Mother Nature.
Now, just to get it out of the way up front: there are significant differences between solid-state semiconductor computers and water-based biochemical ones. The most important is the degree of physical randomness inherent in the operation of circuits versus signal transduction pathways. Our machines are designed to have very low error rates in their signal circuits, while molecular interactions in biology are driven by mechanical collisions of rather floppy protein molecules pushed around by the Brownian motion of water molecules. A degree of reliability in biological signal transduction is achieved mostly by averaging over the huge numbers of molecules involved; where such averaging is impossible because the relevant molecules exist in single copies per cell (such as the DNA in chromosomes), evolution has created rather elaborate error-correction and repair mechanisms instead. If anything, this makes programming biochemical machinery much more challenging.
The structure of biological programs is also very different: they are the products of evolution, and the outputs of evolutionary algorithms are notoriously incomprehensible, noisy, and unstructured. Everything designed by humans is much more structured and hierarchical by necessity, owing to the limitations of human cognition which we discussed before. That said, it is well known that it is quite possible to write functionally identical programs in a variety of styles, from "clean" and easy to understand to "hacky" and obfuscated; so we can hope that architectural lessons from evolved systems can be transferred to designed systems, stylistic differences notwithstanding.
So, how does life do it? (We will only look at multicellular eukaryotes, just to avoid discussing the differences between them and other taxa.)
The first observation is that biological programs do not provide exact plans for the organisms they encode. Instead they are generative: they provide a set of rules (encoded in signal transduction pathways) which tell cells how to specialize and migrate to the correct places after dividing. This process starts after the first division of a zygote into blastomere cells and uses various chemical gradients as a kind of coordinate system during early embryonic development, followed by the use of paracrine and juxtacrine inter-cell signaling to guide the detailed development of organs.
The lesson here is that to implement a functional generative system one needs a way to propagate information about "what" and "where" between the components being generated. Just specifying the interfaces is not enough. The process of generation is not one-shot but iterative, with increasing degrees of differentiation (or decreasing cell potency levels, in biological lingo).
The process of generating an organism also involves a lot of programmed cell death - apoptosis. The closest equivalent in computer programming would be dead code elimination combined with other optimization techniques. The take-away here is that it is a very good idea to perform optimization as an integral part of generation, interleaving optimization with specialization.
A significant portion of genes encode the components of the basic biochemical machinery necessary for individual cells to function; this machinery is essentially the same for all plants and animals - a (possibly hyperbolic) claim by Prof. Steve Jones is that humans and bananas have 50% of their genes in common. These common genes are important because breaking them usually results in unviable or seriously crippled organisms, so changes in them are strongly selected against. The closest analog in computer programming is low-level libraries (libc and such): a small change in the semantics or interfaces of the functions they provide (for example, swapping the source and destination arguments of strcpy) is likely to break pretty much every program in an operating system.
The genes encoding higher-order features (such as additional signal transduction pathways and receptors), which generate complex body plans during development and are necessary for operating complex organs, vary much more between species. Another trend is the increased role of regulation of gene expression in more complex species - i.e. the proteins are increasingly "standardized", while the controls which adjust the production rate of specific proteins depending on the environment and the internal state of the cell grow in importance.
These observations certainly match what we see in software: the low-level libraries quickly ossify, and any change in them has to be very carefully considered. The higher-level functions are more mutable, and a successful software product often sees quite a lot of changes there. What is missing in software is an equivalent of expression regulation (there are some crude techniques which can provide a bit of it - we will discuss generic programming in the next post). What biology has and computer programming lacks is a way to control the use of basic building blocks in complex ways, depending on the local environment and the interactions of a block with other blocks.
Another interesting aspect of generating functioning cellular machinery from genes is how much the information in the genes gets transformed during gene expression (by necessity, the outline here is significantly simplified). First, the expression of different parts of DNA is promoted or repressed by a variety of mechanisms: the fact that a gene is there doesn't mean it's going to be expressed at all! These mechanisms are controlled by intracellular signals through signal transduction pathways. The result of DNA transcription is short-ish RNA strands containing a copy of the information from the transcribed parts of the DNA.
The second step is RNA processing, in which the RNA strands are modified. The most important of these modifications is splicing, which removes parts of the gene (known as introns), discards them, and then joins the remaining parts (the exons) into an edited whole. The interesting part is that RNA splicing is not necessarily fixed: different parts of the RNA can be excised during splicing, producing different proteins in the end (this is known as alternative splicing). Just like transcription, splicing can be controlled by biochemical signals.
The resulting RNA needs to be transported from the nucleus to the proper place in the cell; this process can be controlled by tags on the RNA which specify the destination.
The next major step is translation, which reads the linear genetic code from mRNA and builds a linear chain of amino acids by reading groups of 3 RNA bases (codons) and using a fixed 64-entry translation table to select the amino acid to be attached to the end of the chain. The resulting protein is still non-functional, because the function of a protein depends on the way it is folded into a three-dimensional blob. This process (called protein folding) is driven by the chain seeking a local minimum of its energy; however, the energy landscape is non-trivial, and to help proteins find the right minimum (and thus fold properly) external help is often needed. This help is provided by specialized proteins called chaperones. The abundance of chaperone proteins may control the rate of production of functional proteins (and the abundance of chaperones can in turn be controlled by environmental factors; for example, many chaperones are classified as heat shock proteins, which are produced by cells in response to stressful conditions).
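To make the translation step concrete, here is a minimal sketch of the codon-reading loop in C++. Only a handful of the 64 codon-to-amino-acid entries are filled in for illustration; a real table would list all of them:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Abbreviated codon table: RNA triplet -> one-letter amino acid code
// ('*' marks a stop codon). The full table has 64 entries.
const std::unordered_map<std::string, char> kCodonTable = {
    {"AUG", 'M'},                                            // methionine, also the start codon
    {"UUU", 'F'}, {"UUC", 'F'},                              // phenylalanine
    {"GGU", 'G'}, {"GGC", 'G'}, {"GGA", 'G'}, {"GGG", 'G'},  // glycine
    {"UAA", '*'}, {"UAG", '*'}, {"UGA", '*'},                // stop codons
};

// Translate an mRNA string by reading it three bases (one codon) at a time,
// stopping at a stop codon or at an unknown/incomplete codon.
std::string translate(const std::string& mrna) {
    std::string protein;
    for (size_t i = 0; i + 3 <= mrna.size(); i += 3) {
        auto it = kCodonTable.find(mrna.substr(i, 3));
        if (it == kCodonTable.end() || it->second == '*') break;
        protein += it->second;
    }
    return protein;
}

int main() {
    std::cout << translate("AUGUUUGGAUAA") << "\n";  // prints "MFG"
}
```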
The number of possible configurations of even a modestly-sized amino acid chain is astronomical, so no exhaustive search for the energetically optimal configuration is possible - it would take longer than the age of the Universe, while in reality folding happens in microseconds to milliseconds (this is known as Levinthal's paradox). Note how similar this is to the searches in very high-dimensional spaces common in modern machine learning; but while ML practitioners can try a large variety of heuristic optimization algorithms, protein sequences are selected by evolution so that their energy landscapes form funnels which quickly guide the folding to the desired outcome (see Anfinsen's dogma).
Finally, the proteins need to be transported to their proper places via protein transport and translocation, directed by signal peptides attached to the N-terminus of the amino acid chain. This process can be modified through a variety of mechanisms which chemically alter the signal peptide.
The main lesson here is that the end product of gene expression (a functional protein located at the proper place in the cell) is vastly different from the original gene: the "code" for the proteins is subject to significant transformation. The multi-stage expression process is controlled or influenced by a panoply of biochemical signals coming off complex signal transduction pathways, and it can be modified by external signals at every stage (though it appears that most of the controls act at the earliest stages). Some parts of the expression process are computationally intensive, and so the evolutionary "design" favored structures which are amenable to a simple, simulated-annealing-like search.
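As an aside, this is roughly what such a simulated-annealing-like search looks like in code - a generic C++ sketch over an arbitrary energy function. The toy quadratic landscape and all parameter values are made up for illustration; this is not a model of actual protein folding:

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

// Minimal simulated annealing over a vector of real-valued parameters.
// `energy` is whatever cost function we are trying to minimize.
template <typename EnergyFn>
std::vector<double> anneal(std::vector<double> state, EnergyFn energy,
                           double temp = 1.0, double cooling = 0.999,
                           int steps = 100000) {
    std::mt19937 rng(42);
    std::normal_distribution<double> perturb(0.0, 0.1);
    std::uniform_real_distribution<double> accept(0.0, 1.0);

    double current = energy(state);
    for (int i = 0; i < steps; ++i, temp *= cooling) {
        // Propose a small random change to one coordinate.
        auto candidate = state;
        candidate[rng() % candidate.size()] += perturb(rng);
        double e = energy(candidate);
        // Always accept improvements; accept worse states with probability
        // exp(-dE/T), which shrinks as the temperature cools.
        if (e < current || accept(rng) < std::exp((current - e) / temp)) {
            state = std::move(candidate);
            current = e;
        }
    }
    return state;
}

int main() {
    // Toy energy landscape: a paraboloid with its minimum at the origin.
    auto quadratic = [](const std::vector<double>& x) {
        double s = 0.0;
        for (double v : x) s += v * v;
        return s;
    };
    auto best = anneal({5.0, -3.0, 2.0}, quadratic);
    std::printf("final energy: %g\n", quadratic(best));
}
```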
So far, the mechanisms described above are sufficient for building an organism which can grow from a single cell, has some fixed instinctive behaviors, and reacts to environmental circumstances in a "pre-programmed" manner. However, in real life each organism is also forced to defend itself from a nearly infinite variety of pathogens. Because the pathogens are extraneous to the organism (and most of them can mutate and evolve quickly), there is no way to pre-program effective defenses against most of them (the generic innate immune defenses, such as inflammation, can only go so far). The organism needs to somehow learn which pathogens are attacking it, and learn how to recognize them (or the cells infected by them) efficiently. This is done by an ingenious mechanism called the adaptive immune system.
Adaptive immunity needs a way to create biochemical sensors (receptors) with very high specificity to particular species or strains of pathogens - at a molecular level (i.e. inaccessible to neural learning). These sensors are known as antibodies: Y-shaped protein complexes whose antigen-binding sites can take a nearly infinite variety of mechanical configurations. In effect, they act like locks "opened" by molecules specific to pathogens (similar to malware signatures in anti-virus software). The challenge is how to produce this variety from a rather small number of genes, and how to keep only those antibodies which are effective.
The high variety of antibodies is generated by a genetic mechanism called V(D)J recombination of the genes encoding immunoglobulins. Parts of these genes are randomly permuted and excised within individual somatic cells (additional randomness is thrown in by abnormally high mutation rates in some parts of these genes, a phenomenon known as somatic hypermutation). This creates a population of functionally similar cells (B- and T-lymphocytes), each carrying its own variant of an antigen receptor (B cells secrete theirs as antibodies). The population of these cells is pruned against "self" antigens (for T cells, in the thymus), ensuring that the immune system doesn't attack the organism itself. Then the lymphocytes which encounter the antigens they recognize activate (and proliferate), while those which remain dormant eventually die - so what we have here is evolution inside the individual organism, a process known as clonal selection. After the infection is cleared, some memory cells which recognize the pathogen remain, speeding up the immune response next time - a phenomenon we know as acquired immunity.
So here we have yet another lesson: if the system needs to adapt to unpredictable requirements, it may make sense to generate parts of it randomly, and select the best combination.
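A crude software analogue of this generate-and-select strategy, sketched in C++: generate a large random population of candidate "detectors", score them against the signature we need to recognize, and keep only the best few. The bit-string representation and the bit-matching score are stand-ins chosen for brevity, not anything biologically meaningful:

```cpp
#include <algorithm>
#include <bitset>
#include <cstdio>
#include <random>
#include <vector>

constexpr size_t kBits = 32;
using Pattern = std::bitset<kBits>;

// Score a candidate detector by how many bits it shares with the target
// "signature" we need to recognize (a stand-in for binding affinity).
int score(const Pattern& detector, const Pattern& target) {
    return static_cast<int>(kBits - (detector ^ target).count());
}

int main() {
    std::mt19937 rng(1);
    std::uniform_int_distribution<unsigned long> bits;
    const Pattern target(0xDEADBEEF);  // the "pathogen" we must learn to recognize

    // Step 1: generate a large random population of candidate detectors.
    std::vector<Pattern> population(10000);
    for (auto& p : population) p = Pattern(bits(rng));

    // Step 2: select the few that happen to match the target best.
    std::partial_sort(population.begin(), population.begin() + 5, population.end(),
                      [&](const Pattern& a, const Pattern& b) {
                          return score(a, target) > score(b, target);
                      });

    for (int i = 0; i < 5; ++i)
        std::printf("detector %d matches %d/%zu bits\n",
                    i, score(population[i], target), kBits);
}
```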
Surprisingly, quite a lot of how life works has analogies in how software tool chains work. Optimizing compilers certainly perform non-trivial code transformations, etc. The major difference is that biological "programs" are mostly not descriptions of what to build, but rather descriptions of how to build. These descriptions make use of complex controls on the expression of genes and the subsequent transformations. The cells (which all start with the same genome) are driven to specialize to fit specific needs. The detailed plan of the organism is not something fixed by the "programmer", but rather the result of code generators (stem cells) reacting both to outside stimuli and to their relations with other generators. The transformation from the code in genes to the actual "implementation" involves searches over huge multidimensional landscapes, and sometimes uses randomness and evolutionary algorithms to find the needed solution. The most striking aspect of biology is that there just isn't any clear-cut "universal" method for generating organisms from their genetic descriptions (nothing like the use of β-reduction for abstraction unwrapping in programming); instead, a huge diversity of domain-specific methods is used to solve specific problems.
The tools we use for creating software are much more restrictive, possibly due to the misguided pursuit of "cleanliness" and the emphasis on precise control over the resulting machine code. There are some relatively recent advances in the direction of generative programming (notably generic programming in C++), but these fall way short of what is needed to release programmers from the tyranny of details. (Generic programming and its limitations will be the topic of the next post.)
August 28, 2017
August 2, 2017
The Spherical Cow in Vacuum, Part 2 - Type Theory
By the beginning of the 20th century mathematicians found that they had a yuuge problem: the attempt to rigorously formulate the foundations of mathematics had failed miserably. The very concept of a set (in the so-called naive set theory) was found to be self-contradictory, as demonstrated by Bertrand Russell's paradox. The problem comes from the way sets were defined: as collections of objects satisfying some condition (usually written in set-builder notation), and these conditions could refer to the very set being defined. Obviously, the way the concept of a set was defined needed to be made more restrictive in order to avoid variants of Russell's and other paradoxes.
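For reference, the paradox fits in one line: define the set of all sets which do not contain themselves, and ask whether it contains itself.

```latex
R = \{\, x \mid x \notin x \,\}
\quad\Longrightarrow\quad
R \in R \iff R \notin R
```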
One (but not the only) approach was to coerce mathematical objects into hierarchies (i.e. we may have boojums, sets of boojums, sets of sets of boojums, sets of pairs of boojums, etc.), thus preventing self-reference. The mathematical formulation of this became known as type theory.
Type theory turned out to be a nice fit for λ-calculus, and so it was only natural for programming language designers to incorporate it as a basic programming concept: something we now call "records", "objects", "structs", or "classes". (The more modern object-oriented formulations also throw in some syntactic sugar in the form of member methods.)
It is important to understand that the purpose of type theory is to eliminate the paradoxes which arise when we are overly clever with the way we specify predicates in our set builders. The use of the notion of types in programming has nothing to do with that: when writing programs we are not at all concerned with the correctness of all possible programs (i.e. with proving that all programs produce a result instead of some of them looping indefinitely).
We cannot even say that the types we use in programming are non-self-referential: there are common cases where we have to have circular dependencies between types. That's why we have forward declarations.
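A familiar C++ illustration: two types which refer to each other, which is only expressible thanks to a forward declaration.

```cpp
#include <vector>

struct Employee;  // forward declaration breaks the circular dependency

struct Department {
    const char* name;
    std::vector<Employee*> staff;   // Department refers to Employee...
};

struct Employee {
    const char* name;
    Department* department;         // ...and Employee refers back to Department.
};
```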
So why is the type-theoretical view of types used so widely in practical programming?
For one, the constructive notion of types (where we build higher-level types out of collections of elements of lower-level types) is quite easy to learn and quite intuitive. Secondly, conventional type systems do offer some degree of information hiding and abstraction. And last (and in the author's opinion not least) is horrible inertia: the choice of a readily available mathematical concept was quite reasonable when the computing pioneers were inventing the first programming languages. Since then the shortcomings of that choice have become painfully obvious, to the point that some programming language designers have chosen to dismiss the whole idea of static typing completely - notably in the so-called scripting languages, which have recently become the most popular kind of programming languages. (This was facilitated by the tremendous growth of computing power, which masks many of the programming sins these languages positively encourage.)
The type theory-inspired type systems of conventional programming are the second most important mechanism for expressing engineering abstraction, after the notion of subroutines (inspired by λ-calculus). And, just as in the case of λ-calculus, type theory provides a conceptual framework that doesn't match the reality of programming. We're again in spherical-cow territory.
The use of types in programming is not remotely as neat as the theory: common programming languages conflate several very different concepts. Types are used to represent the implementation of value storage and the methods of accessing and modifying values, the constraints on how values may be used, the parametrization for polymorphism, and the semasiology of the values - i.e. their meaning to the programmer. (This somewhat obscure term is chosen to explicitly distinguish the notion of the meaning of a value from the notion of its semantics as used in computer science, which is limited to the computation process itself.)
The implementation aspect of a type in programming languages is a declaration by the programmer that a specific hardware-supported encoding should be used to store and manipulate the values - that's what all those char, int, long, double, etc. declarations are. It quickly became apparent that it would be completely impractical to explicitly specify the types of all values, so all existing high-level programming languages rely on automatic type derivation for intermediate values. The declarations of compound types (arrays, structures, etc.) also mandate the presence (and, in some languages, the specific order) of all their elementary or compound fields in memory - and the ordering and placement of fields may have a dramatic effect on the program's performance because of data caching by CPUs. The implementation specifiers in the types are also used to control where in RAM a value will be stored - be it heap, stack, or a statically mapped data segment.
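A small C++ illustration of how much the declared field order matters for the in-memory layout (the exact sizes depend on the platform's alignment rules; the comments assume a typical 64-bit ABI with 8-byte alignment for double):

```cpp
#include <cstdio>

// Fields ordered badly: each char forces padding before the following double.
struct Padded {
    char   tag1;   // 1 byte + 7 bytes of padding
    double value;  // 8 bytes
    char   tag2;   // 1 byte + 7 bytes of trailing padding
};

// Same fields, reordered so the small members share one padded region.
struct Packed {
    double value;  // 8 bytes
    char   tag1;   // 1 byte
    char   tag2;   // 1 byte + 6 bytes of trailing padding
};

int main() {
    // Typically prints 24 and 16 on a 64-bit platform.
    std::printf("Padded: %zu bytes, Packed: %zu bytes\n",
                sizeof(Padded), sizeof(Packed));
}
```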
So far so good - it all makes a lot of practical sense, but it has no relationship whatsoever to type theory. The implementation aspect of types in programming languages came first historically, with type composition arriving only later.
The second important practical aspect of types in programming languages is checking the constraints they place on individual values, and using these constraints to optimize code. The most common constraint is on the kind of values a variable can contain: integer numbers, Boolean values, strings, etc. The second is range specification (usually implicit, though some languages like Pascal allow explicit range specifications). Additionally, the constraints may include access controls (i.e. public vs private members of classes) and whether a value is a constant or can be changed at run time. For array types the constraints may include dimensionality and dimension sizes, and for structures they may manifest as requiring a type identifier match and the satisfaction of parent-child relationship predicates in the case of type inheritance.
Constraint checking is important for detecting programming errors: essentially, it allows users of abstract interfaces to spend less mental effort and time on making sure that their uses of an interface match the interface definition. This, however, is undermined by the inflexibility of the compositional construction of types in modern programming languages, which usually forces interface designers to compromise between usability and type safety.
Having constraints on the possible types of values a variable or a subroutine argument may hold also allows disambiguation between alternative implementations of a subroutine: instead of having the subroutine name reflect the types of its arguments, we can have polymorphic subroutines selected based on their signatures (which combine the name and the argument type constraints). (The next step after subroutine polymorphism is generic programming with templates, but this is a topic for a future post.)
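In C++ this is ordinary function overloading: the compiler picks the implementation whose signature matches the argument types.

```cpp
#include <cstdio>
#include <string>

// Same name, different signatures: the argument types select the implementation.
void describe(int n)                { std::printf("integer: %d\n", n); }
void describe(double x)             { std::printf("floating point: %f\n", x); }
void describe(const std::string& s) { std::printf("string: %s\n", s.c_str()); }

int main() {
    describe(42);                      // calls describe(int)
    describe(3.14);                    // calls describe(double)
    describe(std::string("banana"));   // calls describe(const std::string&)
}
```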
Finally, programmers use types (particularly type names) to indicate what values mean to the programmer. From the computer's point of view the number of apples is an integer and the number of oranges is an integer, and you can add them with abandon. For the programmer, doing so would in many cases be an error. With some degree of syntactic ugliness and verbosity we could define classes Apples and Oranges which contain the integer counters, define arithmetic operations on these classes, and achieve detection of the apples-plus-oranges kind of programming errors for these classes. Doing so is usually so burdensome for simple values that nobody bothers, leaving program correctness to luck and prayer. Structures and classes, however, are already named, so this is often used to enforce semasiological correctness: if the routine placeCall takes an AddressBook as an argument and we try to pass it a value of type BankAccount, the compiler will rightfully complain.
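A sketch of what such wrapper classes could look like in C++, using the names from the example above; the boilerplate grows quickly once the full set of operators is needed, which is exactly why nobody bothers for simple values:

```cpp
struct Apples {
    int count;
    Apples operator+(Apples other) const { return {count + other.count}; }
};

struct Oranges {
    int count;
    Oranges operator+(Oranges other) const { return {count + other.count}; }
};

int main() {
    Apples  a{2}, b{3};
    Oranges c{5};

    Apples total = a + b;   // fine: 5 apples
    // Apples oops = a + c; // compile error: no operator+(Apples, Oranges)
    (void)total; (void)c;
}
```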
This type-based support for semasiological constraint checking is quite limited: it mainly works with simple hierarchy-based ontologies, and any attempt to represent a less trivial relationship between types again requires non-trivial hackery. It falls to the programmer to keep track of the meaning of values in order to ensure that the program does what is intended.
The boundary between semasiological meaning and program semantics is somewhat blurry, and some aspects of value types traditionally considered outside the scope of the programming language do affect the computation. For example, transposing a large matrix in memory is an expensive operation, so we may want the compiler to perform transposition elimination by changing the iteration order in downstream matrix operations. Another example is the sorted-ness of an array (complicated by the fact that an ordering is always associated with a specific predicate). There is no natural way to express something like "this array of integers is sorted in ascending order" in a conventional programming language, so nobody does it, preferring to keep that information in one's head (and, depending on the professionalism of the coder, in comments).
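The closest a mainstream language gets is the kind of wrapper-type hackery sketched below - a hypothetical SortedVector, not a standard facility. The "sorted" property lives only in the type's name and in the discipline of its methods; the compiler itself cannot see or verify the invariant:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical wrapper: the invariant "elements are in ascending order" is
// maintained by construction and by restricting mutation to insert().
class SortedVector {
public:
    explicit SortedVector(std::vector<int> data) : data_(std::move(data)) {
        std::sort(data_.begin(), data_.end());
    }

    void insert(int value) {
        data_.insert(std::upper_bound(data_.begin(), data_.end(), value), value);
    }

    // Downstream code can rely on the invariant, e.g. use binary search.
    bool contains(int value) const {
        return std::binary_search(data_.begin(), data_.end(), value);
    }

private:
    std::vector<int> data_;  // the compiler knows nothing about its ordering
};

int main() {
    SortedVector v({3, 1, 2});
    v.insert(5);
    return v.contains(2) ? 0 : 1;  // the invariant makes binary search valid
}
```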
The conflation of these completely different aspects of types is one of the reasons why modern programming seems to be stuck at a rather low abstraction ceiling. The existence of this problem is recognized, but fixing it within a single type system seems to be elusive - to the point that there are proposals (notably by Gilad Bracha) to ditch the built-in rigid type system and replace it with optional, pluggable, domain-specific type systems.
It may be more fruitful to stop treating types as strictly composable objects, type-theory-style, and to start looking at them as arbitrary annotations representing constraints which can be (at least partially) checked at compile time. We will return to this in future posts, after the discussion of generic programming provides some background.
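As a small taste of what "constraints computed at compile time" could look like with today's tools, here is a hedged sketch (assuming C++17); it only covers the case where the data happens to be a compile-time constant:

```cpp
#include <array>
#include <cstddef>

// A constraint expressed as an ordinary predicate...
constexpr bool is_sorted_ascending(const int* data, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        if (data[i - 1] > data[i]) return false;
    return true;
}

// ...which the compiler evaluates when the data is known at compile time.
constexpr std::array<int, 4> kThresholds = {1, 3, 7, 12};
static_assert(is_sorted_ascending(kThresholds.data(), kThresholds.size()),
              "threshold table must be sorted");

int main() {}  // nothing to run: the interesting check happens at compile time
```

When the data is only known at run time, the same predicate can be reused in a run-time assert - that is the "at least partially" part.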