July 12, 2017

The Spherical Cow in Vacuum, Part 1 - Lambda Calculus

There is an old joke about a farmer who figured that the egghead boffins who can split an atom could surely figure out a way to get his cows to produce more milk.  So he contacted the physics department of a famous university and sweet-talked them into agreeing to consider his problem.  The intrepid team of scientists eagerly took on the challenge and spent countless hours arguing, writing equations on chalkboards, and publishing papers.  Days passed, then months, and then the team announced the breakthrough - they had made huge progress!  When the farmer heard about it, he promptly inquired as to what the solution was and how he could make use of it at his farm.  "Not so fast..." replied the boffins.  "Our solution only applies to spherical cows in vacuum."

As the joke goes, it's only partially a joke.  One of the pitfalls of doing science is that you can build a nice theory which seems to fit your problem quite well... all the simple problems have solutions, but when you get to the interesting ones, it all breaks down.  Pesky details ignored by the theory suddenly become major issues, and the nice theoretical framework now gets in the way of finding the right way of thinking about the problem at hand.

The mathematical theory most fundamental to practical computing is lambda calculus (often written as "λ-calculus").   It was created in the early 1930s by Alonzo Church as part of his life-long work on the foundations of mathematical logic (specifically, he came up with it in order to show that the Entscheidungsproblem is not solvable).  It has had an enormous influence on computer science and programming practice, from the first programming languages of the late 1950s all the way to the contemporary fad of functional programming (which can be described as a thin veneer of syntactic sugar on top of λ-calculus).  It is beautiful, and some expressions have even gained cult status among the cognoscenti (see the Y combinator).
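
(For the curious: here is a minimal sketch of that cult expression transcribed into Python - strictly illustrative, and using the call-by-value "Z" variant, since Python evaluates arguments eagerly and the classic Y would loop forever.)

    # The Z combinator: a call-by-value variant of Church's Y combinator.
    # It produces a fixed point of f, allowing recursion without self-reference.
    Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

    # Factorial defined without ever naming itself:
    fact = Z(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))
    print(fact(5))  # 120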

It is also the biggest spherical cow in vacuum one could find. Let me explain.

The central concept of λ-calculus is abstraction (denoted by the letter λ).  It is easily recognizable in the notion of a subroutine found in pretty much every general-purpose programming language.  The abstraction contains the body (a formula) it wraps, plus the bound named variable(s) used in the formula and exposed by the abstraction as its arguments (in programming parlance, the formal parameter(s) of the subroutine).
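
For instance, the λ-calculus term λx. x + 1 maps directly onto an anonymous function or a subroutine declaration (a sketch in Python, chosen here purely for familiarity):

    # The abstraction λx. x + 1: the body (x + 1) wraps the bound
    # variable x, which is exposed as the formal parameter.
    successor = lambda x: x + 1

    # The very same abstraction as a conventional subroutine declaration.
    def successor_subroutine(x):
        return x + 1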

An abstraction can be applied to specific argument values using an operation called β-reduction, which basically replaces every mention of the bound variable (parameter) with the formula of the specific value.  (There are some other details - such as renaming variables to avoid name capture - but these are not important for the purpose of this discussion.)
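
A minimal sketch of β-reduction over a toy term representation (the encoding is made up for illustration, and, per the caveat above, capture-avoiding renaming is deliberately omitted):

    # Toy λ-terms as tuples:
    # ('var', name), ('lam', name, body), ('app', function, argument)
    def substitute(term, name, value):
        # Replace free occurrences of `name` in `term` with `value`.
        kind = term[0]
        if kind == 'var':
            return value if term[1] == name else term
        if kind == 'lam':
            if term[1] == name:      # inner λ re-binds the name;
                return term          # stop substituting under it
            return ('lam', term[1], substitute(term[2], name, value))
        return ('app', substitute(term[1], name, value),
                       substitute(term[2], name, value))

    def beta_reduce(app):
        # One β-reduction step: (λx. body) arg  ->  body[x := arg]
        _, (_, x, body), arg = app
        return substitute(body, x, arg)

    # (λx. x x) y  reduces to  y y
    term = ('app', ('lam', 'x', ('app', ('var', 'x'), ('var', 'x'))), ('var', 'y'))
    print(beta_reduce(term))  # ('app', ('var', 'y'), ('var', 'y'))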

These λ-calculus operations of abstraction and application are directly reproduced in pretty much every conventional programming language, where they serve as the primary mechanism for abstraction (which corresponds to a subroutine declaration) and abstraction unwrapping (I deliberately use a non-standard term here to avoid more specific connotations) - which corresponds to either a subroutine call or in-lining of the subroutine's code (the semantics is the same, but these choices offer different trade-offs between executable code size and performance).  For brevity, I'll omit closures - they are just syntactic sugar over plain old calling of subroutines by pointers.
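
To substantiate that last remark, here is a sketch of the desugaring (hypothetical - no particular compiler is being described): a closure is just a plain subroutine plus an explicitly passed environment record, called through a pointer.

    # Sugared: make_adder returns a closure capturing n.
    def make_adder(n):
        def add(x):
            return x + n
        return add

    # Desugared: a plain subroutine taking the environment explicitly...
    def add_impl(env, x):
        return x + env['n']

    # ...and the "closure" is merely a (code pointer, environment) pair.
    adder = (add_impl, {'n': 3})
    code, env = adder
    print(make_adder(3)(4))  # 7
    print(code(env, 4))      # 7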

In λ-calculus, expressions have no side effects and are pure:  they only produce a result value when evaluated, and that value will be the same whenever all the variable values are the same.  And here is where the nice theoretical construct starts to clash with reality.

Execution of any computer code has side effects.  Let that sink in.

Some of these side effects are intrinsic: they change the state of the computer in ways directly accessible to the program (the functional programming style attempts to minimize these).  But the more interesting side effects are extrinsic: the time it takes to run the code, the amount of heat generated by the computation, data being stored in and evicted from caches, network packets being sent, changes in pixel luminosity on displays, etc., etc.   Because programs run strictly within the confines of the CPU and RAM - which are not directly accessible from the outside - a program without extrinsic side effects would be a null operation.  The sole purpose of running programs on computers is to generate extrinsic side effects.

The problem is: λ-calculus (and the operations of subroutine declaration and calling in programming languages) does not admit the existence of side effects.  It is not possible to describe side effects or do anything about them within this formalism.  (Of course, λ-calculus is Turing-complete, and so any kind of non-interactive program can be converted into it - but this is a ridiculously low bar for fitness for any practical purpose, as anybody who has ever tried to write a slightly non-trivial program for a Turing machine or played with Church numerals surely knows.)
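
(To get a taste of how low that bar is, here is a sketch of Church numerals in Python - illustrative only: a numeral n is represented as the function that applies its argument n times.)

    zero = lambda f: lambda x: x                          # apply f zero times
    succ = lambda n: lambda f: lambda x: f(n(f)(x))       # one more application
    add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

    to_int = lambda n: n(lambda k: k + 1)(0)              # back to a plain int

    two   = succ(succ(zero))
    three = succ(two)
    print(to_int(add(two)(three)))  # 5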

From the point of view of programming ideal abstract machines there is no difference, for example, between bubble sort and quicksort: they compute the same function and are therefore equivalent.  On real computers, of course, there's a world of difference.
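
The difference is easy to demonstrate (a sketch - Python's built-in sorted(), a heavily tuned sort, stands in for quicksort here, and the absolute numbers will vary from machine to machine):

    import random, time

    def bubble_sort(a):
        # O(n^2) comparison sort - computes the same function as sorted().
        a = list(a)
        for i in range(len(a)):
            for j in range(len(a) - 1 - i):
                if a[j] > a[j + 1]:
                    a[j], a[j + 1] = a[j + 1], a[j]
        return a

    data = [random.random() for _ in range(5000)]

    t0 = time.perf_counter(); bubble_sort(data); t1 = time.perf_counter()
    t2 = time.perf_counter(); sorted(data);      t3 = time.perf_counter()

    # Identical results, wildly different extrinsic side effects (time, heat).
    print(f"bubble: {t1 - t0:.3f}s   built-in: {t3 - t2:.3f}s")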

There's another name for the side effects which are not formally representable in the language: abstraction leaks.  And they are the root of the problem with modern software development, as we discussed in the previous post.

So here we have it: by using a seemingly perfect theoretical concept to represent abstraction and abstraction unwrapping in our programming languages we have painted ourselves into a corner - because that particular theoretical concept does not fit the reality of computation on physical computers.  Part of this is the confusion between mathematical and engineering abstraction; part of it is inertia and a lack of reflection on the basic tenets of the discipline.

We obviously need to find a way to represent side effects (both intrinsic and extrinsic) within a programming language if we are ever going to overcome the abstraction level ceiling.  Prof. Gregor Kiczales offered an outline of a design attempting to do just that in his 1992 article, but no practical implementation or design proposal ever came out of it.

The question is: what would the right mathematical structure, and the corresponding practical representation of the engineering abstraction, be?  Stay tuned.  We're not done with sacred spherical bovines yet (see Part 2).

July 11, 2017

The Tale of Two Abstractions

Observing the software industry for a few decades from the inside is both depressing and instructive: what never ceases to astonish is how little actual progress is made by the efforts of huge numbers of really smart and talented people.  Especially jarring is the contrast with the progress in hardware: most people now carry small devices which have more computing power than the supercomputers of the 1980s (plus an array of sensors, an impressively detailed and bright display, and an always-on radio data connection with more bandwidth than the trunk links of the biggest Internet backbones of the mid-1990s).  But the software which runs on them is seriously unimpressive.  In fact, compared with the feats of software engineering of the 1970s - like landing on the Moon under the guidance of a 16-bit 1MHz computer with 19 instructions and 4KiB of magnetic core RAM, or designing software for Air Traffic Control which is still running the critical infrastructure on computers so old that finding spare parts is a huge challenge - it seems that we've got a serious case of regression.  It's not really a regression in skills, just an effect of being able to get away with crappy software design because hardware is so powerful and reliable - and a result of our inability to cope with ever-increasing complexity.

Complexity is Nemesis.  It's the programmer's enemy № 1.  The human ability to deal with complexity is quite limited, both in terms of the ability to understand non-spatial structures (an attempt to mentally visualize a simple tesseract - not a projection, but the tesseract as it is, with equal-length straight edges and 90° angles - is going to be futile), let alone structures without any notion of continuous space, such as computer programs, and in terms of very limited short-term/active memory (4±1 units).  Meanwhile, software is the most complex of all human artifacts, by far.  How do we manage to build it at all?

The key to creating large and complex systems is abstraction: when we can think of a complex system as containing blocks which may be complex inside, but which perform conceptually simple functions and have relatively simple interfaces, we can focus our limited attention and memory on the currently relevant block at the abstraction level we're working at.  Splitting the design into blocks also allows us to divide the labor of designing these blocks among many people.  This is the daily bread of a software architect.  The principles of good software architecture are well understood (though often ignored), but the real issue is the gap.

Architecture is informal and somewhat nebulous (and when trying to steer system design towards sounder architecture, the author of this post has been told to stop wasting time on philosophy by the let's-get-hacking geeks - on many occasions).  The code, however, is very specific.  This does not make architectural decisions any less consequential: it's just that there's a gap between intention and implementation, and the more complex the system is, the wider the gap.  So far, attempts to close or at least reduce that gap have been unsuccessful (the dismal failure of CASE is a topic worthy of a separate discussion).  We just don't seem to be able to raise the level of abstraction of our programming systems to the point where the intentions of a software architect could be trivially converted into working code for any project of a meaningful size.  When creating software we are still mired in the tar pit of details.  It's as if there's an impenetrable abstraction level ceiling in our software development environments.  To understand how that ceiling came to be, we'll need to re-examine our notion of abstraction.

To start with, there are two quite different notions of abstraction: the first is mathematical, and the other is common in engineering.  Abstraction in mathematics is, basically, stripping away details of objects in order to arrive at a structure which has an interesting set of algebraic properties.  The most familiar example of this is the natural numbers (the set of natural numbers is usually designated ℕ).  The notion of natural numbers abstracts away all properties of discrete natural objects (such as apples or oranges) to leave only their countability, which doesn't seem like much, but gives rise to arithmetic.  Arithmetic is very useful in practice, its power coming from the ability to use the results of abstract calculations to predict the results of manipulating any countable objects.  Adding one apple to three apples gives four apples.  Adding one orange to three oranges gives four oranges, etc.  Note, however, that arithmetic fails when other properties are mixed in: adding one apple to three oranges gives us neither four apples nor four oranges.  A major part of mathematical reasoning is figuring out when a specific abstraction is applicable, and when it no longer gives meaningful answers.
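
The apples-and-oranges caveat is easy to make concrete (a sketch; the Quantity class is invented for illustration): the abstraction of countability holds only while the units agree.

    from dataclasses import dataclass

    @dataclass
    class Quantity:
        count: int
        unit: str

        def __add__(self, other):
            # Addition is meaningful only within a single abstraction (unit).
            if self.unit != other.unit:
                raise TypeError(f"cannot add {other.unit} to {self.unit}")
            return Quantity(self.count + other.count, self.unit)

    print(Quantity(3, "apples") + Quantity(1, "apples"))  # Quantity(count=4, unit='apples')
    try:
        Quantity(3, "oranges") + Quantity(1, "apples")
    except TypeError as e:
        print(e)  # cannot add apples to oranges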

The engineering abstraction is also about stripping away details, but it aims to facilitate approximate, imprecise reasoning about larger blocks of the system.  It's a cognitive aid, not a search for algebraic structures.  No engineer ever believes that the details are irrelevant; it's just that he chooses to defer thinking about them in order to avoid being cognitively overwhelmed by the complexity.

All software objects are engineering abstractions: they run on physical computers (rather than mathematical abstractions like the Turing machine), and because of physical laws (such as the limit on the speed of light) they have physical properties, too.  Real programs take time to execute, a side effect of their execution is the release of heat (for both technological and fundamental reasons), they always have non-zero error and failure rates, etc., etc., etc.  These physical properties combine to create the properties of higher-level software objects, and those are not trivial.

These abstracted-out properties are not inconsequential - a problem first described by Prof. Gregor Kiczales, and then formulated by Joel Spolsky as the Law of Leaky Abstractions: "All non-trivial abstractions, to some degree, are leaky."  Spolsky seems to consider leakiness to be an emergent property of complex software.  It is not: it is the inescapable consequence of the nature of engineering abstractions.  The very same phenomenon is found in every engineering discipline, although it is much less pronounced there (but we still have to test the designs of all complex machinery to detect overlooked "reality leaks").

The only reason this law comes as a surprise to software engineers is that they (and the practitioners of the discipline of Computer Science in general) tend to think of software in terms of mathematical objects, thus conflating two very different notions of abstraction.  This confusion exists at the very core of how we write programs: the mathematical idea of abstraction is fundamental to all existing programming languages.

The law of leaky abstractions also has a non-trivial consequence.  Imagine that we're building software running directly on a physical computer.  We have a pretty good idea of how it works, and thus we can select algorithms and data structures to make our program efficient on this computer.  Now, let's abstract common operations such as reading and writing disk blocks, adding some cache and a file system on top.  Now the application programmer can use these more abstract operations (instead of figuring out the specifics of a particular disk controller) - but we have also introduced new non-trivial details (cache behavior, file fragmentation, etc.).  Then we write a database - which by necessity will be aware of these details.  And then one nice day we get a shiny new SSD - which, while offering the same "abstract" I/O, has radically different performance characteristics (and wants to be explicitly told when information is no longer needed, aka the trim command).   Some database queries are much faster, some are just as slow as they were before.  We cannot just blindly put more load on the application because we made storage faster.  Instead, we have a huge amount of work to adapt the whole stack to take advantage of the new technology.  It may actually be simpler to rewrite everything from scratch.  Hence the new "framework" doing pretty much the same work as the old one, but differently enough that all users now need to rewrite their applications.
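
The leak is easy to observe through the very same "abstract" file API (a sketch; the file name and sizes are placeholders, the ratio depends entirely on the device underneath, and the OS page cache - itself another leak - will further muddy the numbers):

    import os, random, time

    PATH = "testfile.bin"           # placeholder path
    BLOCK, BLOCKS = 4096, 25_000    # ~100 MiB of 4 KiB blocks

    with open(PATH, "wb") as f:     # create the test file
        f.write(os.urandom(BLOCK * BLOCKS))

    def read_blocks(offsets):
        t0 = time.perf_counter()
        with open(PATH, "rb") as f:
            for off in offsets:
                f.seek(off)
                f.read(BLOCK)
        return time.perf_counter() - t0

    sequential = [i * BLOCK for i in range(BLOCKS)]
    scattered = random.sample(sequential, len(sequential))

    # Same API, same data; on a spinning disk the scattered pattern is far
    # slower, on an SSD nearly identical - the "leak" in action.
    print(f"sequential: {read_blocks(sequential):.2f}s  "
          f"random: {read_blocks(scattered):.2f}s")
    os.remove(PATH)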

The problem with abstraction leaks is that they are kept within the mind of a programmer: there is no way to write code which would adapt to the changing leaks.  Pile more layers on top, and the whole becomes intractably rigid, unable to cope with changes in a reasonable manner.

This phenomenon of specific "values" of abstraction leaks being baked into successive layers of software is the reason why we can't work our way up to high abstraction levels.  And we have to bake these values in when we write software because there's no way to express or handle the leaks explicitly (that's why they are "leaks").  Let's call it the Law of Abstraction Level Ceiling:
Abstraction leaks create an abstraction level ceiling.
The situation looks fairly hopeless... we need engineering abstractions to cope with cognitive overload, and this results in an upper limit on the abstraction level at which we can program.

Fortunately, there is an escape: the whole mess with leaky abstractions is created by the use of mathematical concepts which are inappropriate for dealing with engineering abstractions precisely because they were developed for dealing with ideal abstract machines.  The physical reality of engineering abstractions seems to be amenable to mathematical description as well - it's just that the description matching reality may not be the same as the nice theory we have in our minds.  Planets do not move in perfect circles, but the real law governing their motion is not that complicated either, and in the vast majority of cases we can get by with an astonishingly simple approximation.

What exactly we got wrong half a century ago, and what we can do about it, will be the subject of further discussion in our future posts concerning spherical bovines in evacuated spaces.

July 5, 2017

Programming like it is still Middle Ages, Part 2 - Software Factory Pipe Dreams

Keeping in mind the discussion of modes of production (in part 1), we're going to take a closer look at the poster child of post-industrialism: modern software development.

It is hardly a secret that the shiny facade of high technology hides some truly ugly problems.  First of all, the success rate of software projects is abysmally low: a relatively recent study found that as many as 68% of all software/IT projects fail.  The quality of software (although much harder to quantify) is universally poor, especially where security is concerned.  The incredible growth of CPU performance and of the amounts of memory and data storage is not matched by visible improvements in software responsiveness and usability.  To top it off, a significant fraction of software developers cannot even be described as competent programmers, despite being able to negotiate rather impressive wages - indicating both an increasingly dire shortage of software developers and the fundamental brokenness of hiring and management practices.

To get some insight into the nature of the problem (the necessary first step in being able to come up with a serviceable solution to it) we need to take a closer look at how software is produced.  At the risk of sounding like Captain Obvious, I would like to state that each mode of production has its own best methods of organizing it.  These methods are evolved products of selective pressures (those which worked were imitated, those which didn't failed and disappeared).  Being confused about which mode of production is involved leads to attempts to organize and manage production in ways which will be consistently deleterious or even catastrophic.

The common wisdom is that high-tech in general and software development in particular are a shiny new production mode: the wonderfully vague "post-industrial".  This unfortunately yields no practical insight into what it is: what is the relation of labor, capital, and inputs in this mode of production?  Let's try to characterize these based on what we see in reality.

We need to keep in mind that the production cycle in software is rather long (many months or years), and this somewhat complicates teasing out which factors are capital and which are inputs.  The depreciation time of some nominally capital factors (such as computers) is close enough to the length of the production cycle to consider them consumable inputs instead.

Labor: Software development is very labor-intensive, and the budgets of high-tech companies are dominated by labor costs.  A modern software project can easily run from 100K to 10M lines of code (however flawed this metric is, it at least offers some way of coming up with order-of-magnitude estimates).  A recent estimate of an average software developer's productivity comes to something like 325 to 750 lines of code per month (Jones, Capers and Bonsignour, Olivier. The Economics of Software Quality.).  For a small 2-year 100K LOC project this implies a staff of 6-13 developers; for a 10-year 10M LOC project the estimate would be around 110-260 developers (in reality the headcount will be much higher, because productivity falls rapidly with increasing complexity; a staff of 1500-3000 engineers for a project on this scale would be closer to reality).
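
(The back-of-the-envelope arithmetic behind these figures, as a sketch using the productivity range cited above:)

    def staff_estimate(loc, years, productivity=(325, 750)):
        # Order-of-magnitude headcount from total LOC and schedule.
        months = years * 12
        return tuple(round(loc / (months * p)) for p in reversed(productivity))

    print(staff_estimate(100_000, 2))      # (6, 13)    developers
    print(staff_estimate(10_000_000, 10))  # (111, 256) developers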

Besides hours spent, labor in high-tech contributes personal knowledge and experience, which are costly to acquire.  That is, the "talent" contributes capital in addition to labor - which is tacitly recognized by the structure of compensation in the high-tech industry, which commonly grants equity in the business (in the form of stock options) to engineers as a part of the package, in order to create an incentive for the "human capital" to stay longer with the company.

Capital: The capital goods used in the production of software are mostly in the form of office buildings, institutional knowledge, and acquired intellectual property (IP).  Offices used in software development are no different from pretty much any other office space, which is practically a commodity; most software businesses rent.  Of course, a start-up company needs financial capital in order to pay labor and other costs before it becomes profitable, but mature software companies do not need outside financial capital to sustain their production.

Acquired IP is not as important as it appears to be, because the code itself is not of much practical use without people who understand it.  Gaining an understanding of somebody else's code to the point of being able to significantly modify and successfully maintain it long-term requires effort comparable to writing the code from scratch (especially if the code quality isn't great and supporting documentation is lacking).  As it happens, the IP critical for actual software development is mostly in the public domain and/or free.  (Why this is so is outside the scope of this post.)

The only really important capital good necessary for software development is knowledge created in-house, usually as a byproduct of creating the software.  Unfortunately, most companies are not good at promoting the accumulation and preservation of this knowledge, and often create perverse incentives which inhibit knowledge transcription and sharing (again, this is a separate discussion) - a situation eerily resembling the reluctance of medieval artisans to share their knowledge.

Inputs: The inputs in software development (besides trifles such as coffee and the electricity to run computers in the office) are quickly depreciating computer hardware, licensed IP, on-line product service fees, and rented office space.  These are relatively minor lines in the budgets.

The final observation about the production of software is that while software costs next to nothing to replicate, producing it is always custom work.  Where production is concerned, all software is bespoke - no two items are the same.  In fact, there's not much standardization in software engineering: there are some standards for programming languages and network protocols, but practical implementations are never 100% compliant, and a common practice among vendors is to deliberately break compatibility by creatively "extending" the standards for the purpose of customer lock-in.

Just as there is a notable paucity of software object standardization, there is also a notable lack of standards for professional qualification.  This leads to the strange spectacle of academia churning out massive numbers of "computer science" graduates who lack the skills and knowledge to program in the real world.  This knowledge is then acquired through on-the-job training, which ranges from autodidactic learning to apprenticeship.

All of the above strongly matches the artisanal mode of production: the most advanced and complex artifacts of human civilization are, in fact, produced in a way not much different from that of the artisans of old firing clay pots and forging horseshoes.  In no way is it "industrial", let alone "post-industrial".  We program like it's still the Middle Ages.

This, however, does not stop business people from trying to impose management methods developed for industrial production (after all, this is what they are taught in business schools) onto software development, with little success.  No amount of wishful thinking can turn an occupation requiring advanced abstract reasoning into blue-collar work.  The "software factory" is still a pipe dream, by now firmly associated with outsourced production of junk software on the cheap.  It may actually be better to re-create the social structures which evolved during the Middle Ages to support artisanal production.  Yes, guilds (a historical tidbit: guilds created modern academia by establishing the early secular universities).

Another feature of artisanal production is its unpredictability: the lack of standards and the high variance in professional quality (at least the guilds provided some guarantees about the performance of their members) make planning software development an exercise in futility and outright fraud - quite a lot of software engineering projects are delivered "on time" by means of cutting corners, changing the definition of what is going to be delivered, and compromising on the quality and future maintainability of the delivered code.

The problem with the artisanal mode of production is that it does not scale well, and no management fads are going to change this basic fact.  We cannot deal with the crisis in software production without changing the nature of software development: we must figure out why our attempts to mechanize software production and to standardize software objects have been such abject failures (and not for lack of trying by very smart people) and find ways to overcome this obstacle.  This will be the main theme of many future posts in this blog.