August 28, 2017

How life is doing it

Biological life on Earth is astonishing.  It is full of living organisms of incredible complexity and robustness (just some numbers: on average a human body contains about 30 trillion cells, with the gut flora adding roughly 40 trillion more; 86 billion of those cells are neurons forming some 150 trillion connections; besides humans, there are an estimated 1 trillion species currently living on Earth, not counting not-exactly-living things like viruses).  Living organisms are constantly defending themselves both from directed attacks (by pathogens, predators, and such) and from impressive levels of damage caused by environmental factors and by-products of their own biochemical machinery.

What is truly impressive is that all that panoply is encoded in essentially the same programming language, aka the DNA/RNA code. Even more impressive is that this programming language is both very low-level (it is basically biochemical machine code) and very compact: the full human genome contains about 3.08 billion base pairs, each encoding 2 bits, or roughly 730 MiB in total - about a CD-ROM's worth!  Of that, only about 2% actually encodes proteins; the rest is non-coding DNA, about half of which is junk (such as remnants of retroviruses and transposons).  The other non-coding DNA usually has some biological function, mostly as modifiers of the expression of coding regions, but the information density in non-coding regions is not large: the genomes of multi-cellular organisms compress by about 75% with general-purpose lossless compressors (and by over 99% with reference-based probabilistic compressors such as GReEn).

A back-of-envelope estimate of the useful information content of human DNA would thus be about 20-25 MiB, and with the total number of human genes at about 20,000, that works out to roughly 1 KiB per gene on average. Conventional programming languages like C++ average about 30 bytes per line of code, so this would be equivalent to roughly 700-800 KLOCs - something a smallish group of programmers can write in a few years.  And that relatively small amount of code is sufficient to encode an organism that is not only versatile - well-equipped to obtain and digest a wide variety of foods and to defend itself from a nearly infinite variety of pathogens, parasites, and predators - but also possesses general intelligence sufficient to sustain a secondary, memetic evolution.
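(Just to make the arithmetic explicit, here is a tiny C++ sketch reproducing the numbers above; all the inputs are the rough estimates quoted in the text, not measured values.)

    // genome_estimate.cpp - back-of-envelope arithmetic from the text.
    // All inputs are rough estimates quoted above, not precise measurements.
    #include <cstdio>

    int main() {
        const double base_pairs    = 3.08e9;   // size of the human genome
        const double bits_per_bp   = 2.0;      // A/C/G/T = 2 bits
        const double raw_bytes     = base_pairs * bits_per_bp / 8.0;

        const double useful_mib    = 22.5;     // midpoint of the 20-25 MiB estimate
        const double genes         = 20000.0;  // approximate human gene count
        const double bytes_per_loc = 30.0;     // typical for languages like C++

        std::printf("raw genome size:  %.0f MiB\n", raw_bytes / (1 << 20));
        std::printf("per gene:         %.2f KiB\n", useful_mib * 1024 / genes);
        std::printf("equivalent KLOCs: %.0f\n",
                    useful_mib * (1 << 20) / bytes_per_loc / 1000);
    }

Running this prints roughly 734 MiB of raw genome, about 1.15 KiB per gene, and about 790 KLOCs - consistent with the estimates above.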

Impressive, isn't it?  There are certainly lessons to be learned from Mother Nature.

Now, just to get it out of the way up front: there are significant differences between solid-state semiconductor and water-based biochemical computers.   The most important is the degree of physical randomness inherent in the operation of circuits versus signal transduction pathways.  The machines are designed to have very low error rates in their signal circuits, while molecular interactions in biology are driven by mechanical collisions of rather floppy protein molecules pushed around by the Brownian motion of water molecules.  A degree of reliability in biological signal transduction is achieved mostly by averaging over the huge numbers of molecules involved (such averaging is impossible when the relevant molecules exist in single copies per cell, such as the DNA in chromosomes, so evolution has created rather elaborate error correction and repair mechanisms for those).  If anything, this makes programming biochemical machinery much more challenging.
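To get some intuition for how much reliability averaging over many unreliable carriers can buy, here is a toy Monte Carlo sketch; the 30% per-molecule error rate and the majority-vote rule are invented purely for illustration:

    // noisy_vote.cpp - toy model: reliability from averaging many unreliable carriers.
    #include <cstdio>
    #include <random>

    // Probability that a majority vote of n carriers (each wrong with
    // probability p_err) reports the wrong value, estimated by simulation.
    double majority_error(int n, double p_err, int trials, std::mt19937& rng) {
        std::bernoulli_distribution flip(p_err);
        int wrong = 0;
        for (int t = 0; t < trials; ++t) {
            int bad = 0;
            for (int i = 0; i < n; ++i) bad += flip(rng);
            if (bad * 2 > n) ++wrong;          // the majority got it wrong
        }
        return double(wrong) / trials;
    }

    int main() {
        std::mt19937 rng(42);
        for (int n : {1, 11, 101, 1001})
            std::printf("%5d molecules -> error rate %.5f\n",
                        n, majority_error(n, 0.3, 100000, rng));
    }

A single "molecule" is wrong 30% of the time; a hundred of them voting together are practically never wrong.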

The structure of biological programs is also very different: they are the products of evolution, and the outputs of evolutionary algorithms are notoriously incomprehensible, noisy, and unstructured. Everything designed by humans is much more structured and hierarchical by necessity, owing to the limitations of human cognition which we discussed before.  That said, it is well known that it is quite possible to write functionally identical programs in a variety of styles, from "clean" and easy to understand to "hacky" and obfuscated; so we can hope that architectural lessons from evolved systems can be transferred to designed systems, stylistic differences notwithstanding.

So, how does life do it?  (We will look only at multicellular eukaryotes, to avoid a discussion of the differences between them and other taxa.)

The first observation is that biological programs do not provide exact plans for the organisms they encode.  Instead they are generative: they provide a set of rules (encoded in signal transduction pathways) which tell cells how to specialize and migrate to the correct places after dividing.  This process starts after the first division of the zygote into blastomere cells and uses various chemical gradients as a kind of coordinate system during early embryonic development, followed by paracrine and juxtacrine inter-cell signaling to guide the detailed development of organs.
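A classic textbook simplification of this idea is Wolpert's "French flag" model: cells read a morphogen concentration and pick a fate based on thresholds.  Here is a minimal sketch; the gradient shape, the thresholds, and the fate names are invented for illustration:

    // french_flag.cpp - toy version of the "French flag" model: cells read a
    // morphogen concentration gradient and specialize based on thresholds.
    #include <cmath>
    #include <cstdio>
    #include <string>

    std::string specialize(double morphogen) {
        if (morphogen > 0.6) return "head";     // hypothetical fates
        if (morphogen > 0.3) return "trunk";
        return "tail";
    }

    int main() {
        const int cells = 10;
        for (int i = 0; i < cells; ++i) {
            double position  = double(i) / (cells - 1);    // 0 = source end
            double morphogen = std::exp(-2.0 * position);  // decaying gradient
            std::printf("cell %2d  morphogen %.2f  fate: %s\n",
                        i, morphogen, specialize(morphogen).c_str());
        }
    }

The same "program" runs in every cell; the cell's position in the gradient decides what it becomes.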

The lesson here is that to implement a functional generative system one needs a way to propagate information about "what" and "where" between the components being generated.  Just specifying the interfaces is not enough.  The process of generation is not one-shot but iterative, with increasing degrees of differentiation (or decreasing cell potency, in biological lingo).

The process of generating an organism also involves a lot of programmed cell death - apoptosis. The closest equivalent in computer programming would be dead code elimination combined with other optimization techniques.  The take-away here is that it is a very good idea to perform optimization as an integral part of generation, interleaving optimization with specialization.
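A very rough software sketch of this interleaving might look as follows (the "tissue", the survival rule, and the numbers are all made up; the point is only that pruning happens inside the generation loop, not as a final cleanup pass):

    // grow_and_prune.cpp - sketch of interleaving generation with elimination,
    // loosely analogous to apoptosis during development (e.g. the removal of
    // the webbing between digits). Purely illustrative.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Cell { int x; bool digit; };   // "digit" cells get a survival signal

    int main() {
        std::vector<Cell> tissue;
        for (int round = 0; round < 3; ++round) {
            // Generation step: every position grows a new cell; only every
            // third column is fated to become part of a "digit".
            for (int x = 0; x < 12; ++x)
                tissue.push_back({x, x % 3 == 0});

            // Pruning step (the "apoptosis" analog): cells without a survival
            // signal are removed right away, interleaved with generation.
            tissue.erase(std::remove_if(tissue.begin(), tissue.end(),
                                        [](const Cell& c) { return !c.digit; }),
                         tissue.end());

            std::printf("round %d: %zu cells survive\n", round, tissue.size());
        }
    }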

A significant portion of genes encode components of the basic biochemical machinery necessary for individual cells to function; this machinery is essentially the same for all plants and animals - a (possibly hyperbolic) claim by Prof. Steve Jones is that humans and bananas have 50% of their genes in common.  These common genes are important because their breakage usually results in nonviable or seriously crippled organisms, so changes in these genes are strongly selected against.  The closest analog in computer programming is low-level libraries (libc and such): a small change in the semantics or interfaces of the functions provided by these libraries (for example, swapping the source and destination arguments of strcpy) would likely break pretty much every program in an operating system.
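For the record, here is the frozen contract in question (a trivial C++ snippet; the commented-out line shows the kind of change that would break every caller):

    // The contract of the low-level "common gene": destination first, source
    // second.  Every caller in the system bakes this order in, so the
    // interface can never change without breaking everything built on top.
    #include <cstdio>
    #include <cstring>

    int main() {
        char buf[16];
        std::strcpy(buf, "hello");    // (dst, src) - the frozen interface
        // std::strcpy("hello", buf); // swapped arguments: undefined behavior,
                                      // and every existing caller would break
        std::printf("%s\n", buf);
    }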

The genes encoding higher-order features (such as additional signal transduction pathways and receptors), which generate complex body plans during development and are necessary for operating complex organs, vary much more between species.  Another trend is the increased role of regulation of gene expression in more complex species - i.e., the proteins themselves become increasingly "standardized", while the controls which adjust the production rates of specific proteins depending on the environment and the internal state of the cell grow in importance.

These observations certainly match what we see in software: the low-level libraries quickly ossify, and any change in them has to be very carefully considered.  The higher-level functions are more mutable, and a successful software product often sees quite a lot of change there.  What is missing in software is an equivalent of expression regulation (there are some crude techniques which can be used to get some of that in computer code - we will discuss generic programming in the next post).   What biology has and computer programming does not is a way to control the use of basic building blocks in complex ways, depending on the local environment and the block's interactions with other blocks.
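About the closest a language like C++ gets today is making a component's "expressed" form depend on a compile-time description of its environment.  A crude, hypothetical sketch (all the names here are invented):

    // context_policy.cpp - a crude software stand-in for expression regulation:
    // the same "gene" (component) is expressed differently depending on the
    // local "environment" it is instantiated in.
    #include <cstdio>

    struct LowMemoryEnvironment      { static constexpr int buffer_size = 64;    };
    struct HighThroughputEnvironment { static constexpr int buffer_size = 65536; };

    // The "gene": one description, whose expressed form depends on the context.
    template <typename Environment>
    struct Channel {
        char buffer[Environment::buffer_size];
        void report() const {
            std::printf("channel expressed with %d-byte buffer\n",
                        Environment::buffer_size);
        }
    };

    int main() {
        Channel<LowMemoryEnvironment>().report();
        Channel<HighThroughputEnvironment>().report();
    }

This is static and coarse compared to what a cell does continuously at run time, which is rather the point.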

Another interesting aspect of generating functioning cellular machinery from genes is how much the information in the genes gets transformed during gene expression (by necessity, the outline here is significantly simplified).  First, the expression of different parts of the DNA is promoted or repressed by a variety of mechanisms: the fact that a gene is present doesn't mean it's going to be expressed at all!  These mechanisms are controlled by intracellular signals through signal transduction pathways.  The result of DNA transcription is relatively short RNA strands containing a copy of the information from the transcribed parts of the DNA.

The second step is RNA processing, which sees the RNA strands being modified. The most important of these modifications is splicing, which removes and discards parts of the gene (known as introns) and then joins the remaining parts (the exons) into an edited whole.   The interesting part is that RNA splicing is not necessarily fixed: different parts of the RNA can be excised during splicing, thus producing different proteins in the end (this is known as alternative splicing).  Just like transcription, the process of splicing can be controlled by biochemical signals.
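A toy model of splicing as a data transformation might look like this (the sequences, the exon layout, and the inclusion masks are invented; the real spliceosome is of course driven by biochemical signals rather than a boolean vector):

    // splicing.cpp - toy model of RNA splicing: introns are discarded and a
    // (possibly alternative) subset of exons is joined together.
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Segment { std::string seq; bool is_exon; };

    // Join the exons selected by `include` (one flag per exon, in order),
    // dropping all introns - a crude stand-in for the spliceosome.
    std::string splice(const std::vector<Segment>& pre_mrna,
                       const std::vector<bool>& include) {
        std::string mature;
        size_t exon_index = 0;
        for (const auto& s : pre_mrna) {
            if (!s.is_exon) continue;                 // introns are excised
            if (include[exon_index++]) mature += s.seq;
        }
        return mature;
    }

    int main() {
        std::vector<Segment> pre = {
            {"AUGGCU", true}, {"GUAAGU", false},      // exon 1, intron 1
            {"CCAGAA", true}, {"GUGAGU", false},      // exon 2, intron 2
            {"UGGUAA", true},                         // exon 3
        };
        // Two alternative splice variants of the same pre-mRNA:
        std::printf("variant A: %s\n", splice(pre, {true, true,  true}).c_str());
        std::printf("variant B: %s\n", splice(pre, {true, false, true}).c_str());
    }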

The resulting RNA then needs to be transported from the nucleus to the proper place in the cell; this process can be controlled by destination tags present on the RNA.

The next major step is translation, which reads the linear genetic code from the mRNA and builds a linear chain of amino acids by reading groups of 3 RNA bases (codons) and selecting the proper amino acid to attach to the end of the chain (the 64-entry translation table is essentially fixed).   The resulting chain is still non-functional, because the function of a protein depends on the way it is folded into a three-dimensional shape.  This process (called protein folding) is driven by the chain seeking a local minimum of its energy; however, the energy landscape is non-trivial, and external help is often needed for proteins to find the right minimum (and thus fold properly) - this help is provided by specialized proteins called chaperones.  The abundance of chaperone proteins can control the rate of production of functional proteins (and the abundance of chaperones can in turn be controlled by environmental factors; for example, many chaperones are classified as heat shock proteins, which are produced by cells in response to stressful conditions).
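As a data transformation, translation is the most "computer-like" step of the lot; here is a toy sketch (only a handful of the 64 real codon entries are included, and the transcript is a made-up string - the "variant A" output from the splicing sketch above):

    // translation.cpp - toy ribosome: read the mRNA three bases (one codon) at
    // a time and look each codon up in the genetic code table.
    #include <cstdio>
    #include <string>
    #include <unordered_map>

    int main() {
        // Abridged genetic code table; the real one has 64 entries.
        const std::unordered_map<std::string, std::string> codon_table = {
            {"AUG", "Met"}, {"GCU", "Ala"}, {"CCA", "Pro"},
            {"GAA", "Glu"}, {"UGG", "Trp"}, {"UAA", "STOP"},
        };

        const std::string mrna = "AUGGCUCCAGAAUGGUAA";   // made-up transcript
        std::string protein;

        for (size_t i = 0; i + 3 <= mrna.size(); i += 3) {
            const std::string& aa = codon_table.at(mrna.substr(i, 3));
            if (aa == "STOP") break;                     // release the chain
            protein += aa + "-";
        }
        std::printf("amino acid chain: %s\n", protein.c_str());
    }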

The number of possible configurations of even a modestly-sized amino acid chain is astronomical, so no exhaustive search for the energetically optimal configuration is possible - it would take longer than the age of the Universe, while in reality folding happens in microseconds to milliseconds (this is known as Levinthal's paradox).  Note how similar this is to the searches in very high-dimensional spaces common in modern machine learning; but while ML practitioners can try a large variety of heuristic optimization algorithms, protein sequences are selected by evolution so that their energy landscapes form funnels which quickly guide the folding toward the desired outcome (see Anfinsen's dogma).
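To see why the shape of the landscape matters more than the cleverness of the search, here is a toy annealing-style walk on an invented one-dimensional "energy" function - rugged at small scales, but with an overall funnel toward the minimum:

    // folding_funnel.cpp - toy illustration of a funnel-shaped energy landscape:
    // a simple annealing-style walk finds the minimum quickly despite the
    // small-scale ruggedness.
    #include <cmath>
    #include <cstdio>
    #include <random>

    double energy(double x) {
        return x * x + 0.5 * std::sin(20.0 * x);   // funnel + ruggedness
    }

    int main() {
        std::mt19937 rng(1);
        std::normal_distribution<double> step(0.0, 0.1);
        std::uniform_real_distribution<double> unit(0.0, 1.0);

        double x = 5.0, temperature = 2.0;
        for (int i = 0; i < 2000; ++i) {
            double candidate = x + step(rng);
            double delta = energy(candidate) - energy(x);
            // Accept downhill moves always, uphill moves with Boltzmann probability.
            if (delta < 0 || unit(rng) < std::exp(-delta / temperature))
                x = candidate;
            temperature *= 0.995;                  // slow cooling
        }
        std::printf("found minimum near x = %.3f, energy = %.3f\n", x, energy(x));
    }

On a funnel like this, almost any local search works; on a golf-course landscape with a single narrow hole, nothing short of exhaustive search would.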

Finally, the proteins need to be transported to their proper places via protein transport and translocation, directed by signal peptides attached to the N-terminus of the amino acid chain.  This process can be modified through a variety of mechanisms which chemically alter the signal peptide.

The main lesson here is that the end product of gene expression (a functional protein located at the proper place in the cell) is vastly different from the original gene: the "code" for the protein is subject to significant transformation.  The multi-stage expression process is controlled or modified at every stage by a panoply of biochemical signals coming off complex signal transduction pathways (though it appears that most of the controls act at the earliest stages).   Some parts of the expression process are computationally intensive, and so the "design" by evolution favored structures which are amenable to a simple, simulated-annealing-like search.

So far, the mechanisms described above are sufficient for building an organism which can grow from a single cell, has some fixed instinctive behaviors, and reacts to environmental circumstances in a "pre-programmed" manner. However, in real life each organism is also forced to defend itself from a nearly infinite variety of pathogens.  Because the pathogens are external to the organism (and most of them can mutate and evolve quickly), there is no way to pre-program effective defenses against most of them (the generic innate immune defenses such as inflammation can only go so far). The organism needs to somehow learn which pathogens are attacking it, and learn how to recognize them (or the cells infected by them) efficiently.  This is done by an ingenious mechanism called the adaptive immune system.

Adaptive immunity needs some way to create biochemical sensors (receptors) which have very high specificity to particular species or strains of pathogens - at a molecular level (i.e. inaccessible to neural learning).  These sensors are known as antibodies, and are basically Y-shaped protein complexes which can have a nearly infinite variety of mechanical configurations at their antigen-binding sites.  In effect, they act like locks "opened" by molecules specific to pathogens (similar to malware signatures in computer anti-virus software).  The challenge is how to produce this variety from a rather small number of genes, and how to keep only those antibodies which are effective.

The high variety of antibodies is generated by a genetic mechanism called V(D)J recombination of the genes encoding immunoglobulins and similar receptors.  Segments of these genes are randomly recombined and excised within individual somatic cells (additional randomness is thrown in by abnormally high mutation rates in some parts of these genes, a phenomenon known as somatic hypermutation).  This creates a population of functionally similar lymphocytes, each carrying its own variant of the receptor (T-lymphocytes carry such receptors on their surface; antibodies proper are secreted by B-lymphocytes). The population of these cells is pruned against "self" antigens (for T cells this happens in the thymus), thus ensuring that the immune system doesn't attack the organism itself.  Then the lymphocytes which encounter the antigens they recognize activate (and proliferate), while those which stay dormant eventually die - so this works as evolution inside the individual organism, a process known as clonal selection.  After the infection is cleared, some memory cells recognizing the pathogen remain to speed up subsequent immune responses, a phenomenon we know as acquired immunity.
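A toy model of the whole generate-randomly-then-select strategy might look like this (string "receptors", the exact-match rule, and all the numbers are invented for illustration):

    // clonal_selection.cpp - toy model of "generate randomly, then select":
    // receptors are random strings, self-reactive clones are deleted, and
    // clones whose receptor matches the pathogen are expanded.
    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <string>
    #include <vector>

    // Generate a random 4-letter "receptor" over a toy alphabet.
    std::string random_receptor(std::mt19937& rng) {
        static const char alphabet[] = "ABCD";
        std::uniform_int_distribution<int> pick(0, 3);
        std::string r;
        for (int i = 0; i < 4; ++i) r += alphabet[pick(rng)];
        return r;
    }

    int main() {
        std::mt19937 rng(7);
        const std::string self = "AAAA", pathogen = "CADB";

        // 1. Generate a diverse repertoire (a stand-in for V(D)J recombination
        //    plus somatic hypermutation).
        std::vector<std::string> repertoire;
        for (int i = 0; i < 2000; ++i) repertoire.push_back(random_receptor(rng));

        // 2. Negative selection: remove clones that recognize "self".
        repertoire.erase(std::remove(repertoire.begin(), repertoire.end(), self),
                         repertoire.end());

        // 3. Clonal selection: clones matching the pathogen activate and expand
        //    into memory cells; the rest eventually die off.
        std::vector<std::string> memory;
        for (const auto& r : repertoire)
            if (r == pathogen) memory.push_back(r);

        std::printf("repertoire after negative selection: %zu clones\n",
                    repertoire.size());
        std::printf("clones recognizing the pathogen:     %zu\n", memory.size());
    }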

So here we have yet another lesson: if a system needs to adapt to unpredictable requirements, it may make sense to generate parts of it randomly and select the combinations that work best.

Surprisingly, quite a lot of how life works has analogies in how software tool chains work: optimizing compilers certainly perform non-trivial code transformations, and so on.  The major difference is that biological "programs" are mostly not descriptions of what to build, but rather descriptions of how to build.  These descriptions make use of complex controls on the expression of genes and the subsequent transformations. The cells (which all start with the same genome) are driven to specialize to fit specific needs.  The detailed plan of the organism is not something fixed by the "programmer", but rather the result of code generators (stem cells) reacting both to outside stimuli and to their relations with other generators.   The transformation from the code in genes to the actual "implementation" involves searches in huge multidimensional landscapes, and sometimes uses randomness and evolutionary algorithms to find the needed solution.   The most striking aspect of biology is that there just isn't any clear-cut "universal" method for generating organisms from their genetic descriptions (nothing like the use of β-reduction for unwrapping abstractions in programming); instead, a huge diversity of domain-specific methods is used to solve specific problems.

The tools we use for creating software are much more restrictive, possibly due to a misguided pursuit of "cleanliness" and an emphasis on precise control of the resulting machine code.  There have been some relatively recent advances in the direction of generative programming (notably generic programming in C++), but these fall way short of what is needed to free programmers from the tyranny of details.  (Generic programming and its limitations will be the topic of the next post.)
