Viewpoint

The Premature Obituary of Programming

Why deep learning will not replace programming.

Deep learning (DL) has arrived, not only for natural language, speech, and image processing but also for coding, which I refer to as deep programming (DP). DP is used to detect similar programs, find relevant code, translate programs from one language to another, discover software defects, and synthesize programs from a natural language description. Large transformer language models [10] are now being applied to programs with encouraging results. Just as DL is enabled by the enormous amount of textual and image data available on the Internet, DP is enabled by the vast amount of code available in open source repositories such as GitHub, as well as the ability to reuse libraries via modern package managers such as npm and pip. Two trail-blazing transformer-based DP systems are OpenAI's Codex [8] and DeepMind's AlphaCode [18]. The former is used in the GitHub Copilot project [14] and integrates with development environments to automatically suggest code to developers. The latter generates code to solve problems presented at coding competitions. Both achieve amazing results. Multiple efforts are under way to establish code repositories for benchmarking DP, such as CodeXGLUE [19] and CodeNet [20].

The advent of DP systems has led to a few sensational headlines declaring that in the not-too-distant future coding will be done by computers, not humans [1]. As DL technologies get even better and more code is deposited into public repositories, programmers will be replaced by specification writers outlining what code they want in natural language and presto, the code appears. This Viewpoint argues that while DP will influence software engineering and programming, its effects will be more incremental than the current hype suggests. To get away from the hype, I provide a careful analysis of the problem. I also argue that for DP to broaden its influence, it needs to take a more multidisciplinary approach, incorporating techniques from software engineering, program synthesis, and symbolic reasoning, to name just a few. Note I do not argue with the premise that DL will be used to solve many problems that are solved today by traditional programming methods [16] and that software engineering will evolve to make such systems robust [17]. In this Viewpoint, I am addressing the orthogonal question of using DL to synthesize programs themselves.

DP models are built by training on millions of lines of code. This code encapsulates known programming and engineering techniques for solving problems. But software is forever evolving, for the following reasons.

New machine and network architectures. The most dramatic changes in programming come about due to changes in underlying hardware and communication technologies. Think of Cobol for the mainframe, Visual Basic for UI/event-based client-server programs, and Java for distributed programming. A good example of how languages emerge from architectural changes can be found in Alex Aiken's PLDI 2021 keynote address that describes the evolution of new programming paradigms with the advent of specialized and heterogeneous processors for high performance computing [2]. As new architectures evolve, new sorts of programs are needed to take advantage of these architectures, but these programs do not yet exist and therefore cannot be learned. How many quantum computing programs are being produced by DL today?

New programming frameworks for solving problems. Change in software paradigms also occurs with the advent of new programming frameworks. Programs written today make use of digital currency frameworks, social networks, and the Internet of Things (for example, smart home appliances). These programs did not exist 15 years ago. DL itself is new; libraries and frameworks have been created to support this new programming paradigm, such as PyTorch and TensorFlow. These frameworks require intimate domain knowledge to leverage and fine-tune them. Until new programming frameworks become widespread, with well-known and often used access patterns, it is very hard for DP to make accurate use of them.

Changes to the real world. The planet continually faces new challenges due to physical phenomena causing disruptions in climate, the depletion of natural resources, and demographic changes, for example. Advances in science and technology also affect the way we live—consider the tremendous changes in the last decade to healthcare, finance, travel, and entertainment. New solutions based upon code are constantly created to address the changing real world. Although snippets of these programs may be learned, the solution must model phenomena never seen before in other programs and manipulate newly created devices, and therefore cannot be learned by DP methods in use today.

DP will not be able to generate programs that deal with new machine architectures, new programming frameworks, or new real-world problems. The APIs, patterns, and programming techniques will not exist in the code repositories DP is trained on. There are other pragmatic reasons that limit the applicability of DP.

Massive computational power. The models built by DP are huge, containing 12 billion [8], 41 billion [18], or even 137 billion [4] parameters. The AlphaCode paper states "Both sampling and training from our model required hundreds of petaFLOPS days" and the Codex paper reports similar figures. Only the largest organizations have access to such computing power. Not only is building DP models compute intensive; so is sampling, which is done each time a program is synthesized. The cost of building and running large DL models has led some researchers to declare that further improvements in DL are becoming unsustainable [23]. This implies segments of the market that cannot afford this massive investment are unlikely to benefit much from DP.


Specifying a program is not easy. Underlying the assumption that AI will replace programming is the belief that it is much easier to specify a program via natural language than to write the program. One can argue, as Dijkstra did, that using natural language to specify a program is too imprecise for programs of any complexity [11]. Although the domain chosen by AlphaCode, programming competitions, includes difficult problems, their specifications are relatively simple. This contrasts with many real-world systems that are orders of magnitude more complex to specify. Furthermore, in the "messy" real world some requirements are best determined by trial and error.

Codex and AlphaCode generate many candidate programs for each specification. To eliminate most of them, they rely on the specification to include input/output pairs. These pairs are used to test the candidate programs and weed out those that do not produce the correct output for a given input. But many programs do not have simple input/output descriptions. Consider a program to "check the software running on each server, and if that software can be updated to a new version, and if the server configuration satisfies the requirements of the new version, update the server with the new version." While tests are developed for programs and one could ideally use them to weed out incorrect programs, these tests are often not available in detail at the start of the project, and the tests themselves may need to be programmed.
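
To make the filtering step concrete, the following is a minimal Python sketch of the general idea: candidate programs emitted by a model are executed against the specification's input/output pairs, and only those that reproduce every expected output survive. The candidate sources, the solve entry point, and the toy specification are illustrative assumptions, not the actual Codex or AlphaCode pipeline.

```python
# Minimal sketch of input/output-based filtering of model-generated candidates.
from typing import Callable

# Hypothetical candidates a DP model might emit for "return the sum of a list."
CANDIDATES = [
    "def solve(xs):\n    return sum(xs)",      # correct
    "def solve(xs):\n    return max(xs)",      # wrong semantics
    "def solve(xs):\n    return sum(xs[1:])",  # off-by-one
]

# Input/output pairs taken from the (toy) problem specification.
IO_PAIRS = [(([1, 2, 3],), 6), (([0, 0],), 0), (([5],), 5)]


def load(source: str) -> Callable:
    """Compile a candidate and return its solve function."""
    namespace: dict = {}
    exec(source, namespace)  # a real pipeline would sandbox this
    return namespace["solve"]


def passes_all(fn: Callable) -> bool:
    """Keep only candidates that reproduce every expected output."""
    for args, expected in IO_PAIRS:
        try:
            if fn(*args) != expected:
                return False
        except Exception:
            return False
    return True


surviving = [src for src in CANDIDATES if passes_all(load(src))]
print(f"{len(surviving)} of {len(CANDIDATES)} candidates survive the I/O filter")
```

Even this toy version exhibits the limitation just discussed: it only helps when the specification can be reduced to checkable input/output pairs.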

Program rigidity. As anyone who has spent hours or days debugging a program knows, a single misplaced character can wreak havoc. A program cannot be almost right; it must be exactly right. This is different from writing an essay, or even making a complex argument. A few flaws here and there do not change the meaning of the composition. When AI synthesizes a new textual or visual work from millions of examples, and even puts them together in novel ways, it does not need to adhere to a strict set of rules that govern the exact format of the composition. Programs, however, are not forgiving. That is why both Codex and AlphaCode use post-processing to weed out bad programs. And this is part of the reason why I advocate integrating software engineering techniques into DP (see "Unifying software engineering and DP").

Programming as a social endeavor. In complex projects, the final program is far from the initial idea sketched on paper. The free flow of ideas and knowledge sharing across teams cause component boundaries and APIs to change. Initial algorithms are found to be a poor fit for the target environment and new algorithms must be invented.

Stale repositories. In studies of the genetic makeup of populations, there are known mechanisms that dilute the genetic variation in a population, such as the founder effect and genetic drift [13]. Similarly, if DP becomes the dominant mechanism for creating new programs, the code repositories used to train DP will become stale. They will lack the variation that comes from a vast set of programmers populating these repositories, each with their own unique problem-solving techniques and idiosyncratic programming styles. As the genetic variation in repositories diminishes, it will be harder for DP to discover new patterns and idioms required to adapt to changing circumstances. It remains to be seen whether techniques such as genetic algorithms can circumvent this issue (see footnote a).

Some domains are more likely than others to benefit from DP. Software is pervasive today across many realms. Not all of these domains will benefit equally from DP. In this regard it is useful to consider Model-Driven Development (MDD), which appeared in the 1980s and 1990s and continues to live on in low-code/no-code environments. It too promised to replace programming with automatic code generation. Although it never lived up to its hype, it has proven a successful methodology in some areas [9]. It is likely DP will follow a similar trajectory (see footnote b).

There are three main factors that will determine the success of DP in a particular domain: significant financial incentive to develop and maintain DP models, tolerable cost of making a mistake [7], and large code repositories to train on. The Copilot paradigm, incorporating DP into widely used integrated development environments (IDEs), is a scenario likely to mature rapidly. Its wide applicability creates a large market, and DP can be trained on open source repositories. Since code creation is a collaborative effort between DP and programmers, the code will be more easily trusted, and the result easily integrates into existing software development processes. AI-infused IDEs will relieve programmers of many of the tedious aspects of programming, such as finding the exact APIs and libraries to use, automatically generating test cases, and finding bugs. Their ability to populate function bodies with correct—or mostly correct—code will allow programmers to become more efficient and innovative.


Another area ripe for DP is creating framework-specific programs, such as programs built upon packages in widespread use (for example, ERP, CRM, SCM, and e-commerce software), where the domain is much narrower and available skills are limited. Based upon intimate knowledge of the data model and the framework APIs, DP will be able to synthesize programs to perform domain-specific tasks. This is comparable to the successful application of MDD to Domain Specific Languages [6]. When such packages are used in large corporations, there exist financial incentives to support DP applied to these domains.

On the other hand, applications tightly integrated with company-specific systems are unlikely to benefit from DP. In this case there is limited data to train on, as the software embodies corporate-specific APIs and usage patterns. Furthermore, organizations often have their own standards for building systems to facilitate reuse and integration across applications [6] and to guarantee regulatory compliance. Until it is cheap enough to train DP on relatively small code bases with highly accurate results, DP will not apply to this domain. The impressive results of AlphaCode were achieved with careful data curation and required much DL expertise. It is doubtful that organizations will be able to curate their code repositories as carefully, and they often lack the skills required.

Areas where the cost of making a mistake is dramatic (such as healthcare, national security, and regulated environments) will also not adopt DP for code synthesis. Nonetheless, these are ripe areas for using DP to find bugs and validate the correctness of code.

While Codex and AlphaCode are great engineering achievements, further advances are required to make them useful in practice. I advocate a multidisciplinary approach with the objective of limiting the amount of training data required and improving the accuracy of the results. Avenues of research in this direction are already appearing, as I now discuss.

Unifying software engineering and DP. We can use traditional software engineering techniques to evaluate and improve DP-generated code. DP systems already filter out incorrect programs by running test cases provided in the input and checking whether the runtime exceeds some predefined limit [8, 18]. Much more can be done, such as performing code scans and penetration tests to find vulnerabilities, using bug-finding tools to detect errors, and profiling the synthesized code to find bottlenecks. This feedback can then be used to update the code to improve security and robustness. Not only will such capabilities benefit the specific synthesized programs, they will also help fine-tune the underlying DP engine to improve code generation in the future.
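
As an illustration of how such a feedback loop could be wired together, the sketch below subjects a hypothetical DP-generated candidate to a syntax check, a naive AST scan standing in for a vulnerability scan, and a few unit tests, producing a report that could be returned to the developer or used to fine-tune the model. The names and checks are assumptions for illustration; a production pipeline would invoke real scanners, bug finders, and profilers, and would sandbox the candidate code.

```python
# Minimal sketch (Python 3.9+) of feeding software-engineering checks back
# into a DP pipeline. The checks below are naive stand-ins for real tools.
import ast


def static_findings(source: str) -> list[str]:
    """Naive stand-in for a code scan: flag syntax errors and risky calls."""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        return [f"syntax error: {e}"]
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in {"eval", "exec"}:
                findings.append(f"risky call to {node.func.id} at line {node.lineno}")
    return findings


def test_failures(source: str, tests) -> list[str]:
    """Run simple predicate-style tests against the candidate's namespace."""
    namespace: dict = {}
    exec(source, namespace)  # a real pipeline would sandbox this
    failures = []
    for name, check in tests:
        try:
            if not check(namespace):
                failures.append(f"test failed: {name}")
        except Exception as e:
            failures.append(f"test errored: {name}: {e}")
    return failures


# Hypothetical DP-generated candidate and its tests.
candidate = "def normalize(s):\n    return s.strip().lower()"
tests = [("strips and lowercases", lambda ns: ns["normalize"]("  Ab ") == "ab")]

report = static_findings(candidate) + test_failures(candidate, tests)
print(report or "no findings; candidate accepted")
```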

Integrating search-based program synthesis with DP. An active area of research is search-based program synthesis [5], which, in its simplest form, uses brute force to search the space of possible programs that satisfy specific syntactic and semantic constraints. While this is computationally expensive, searching an exponential space of potential solutions, techniques are being developed to make it more scalable [3]. Using DP to rapidly generate snippets of code and using search-based synthesis to integrate these snippets would extend the reach of DP code generation on the one hand and scale search-based program synthesis on the other. One recent work [15] pursuing this approach uses large pretrained language models (such as GPT-3 and Codex) to find small Python programs invoking Pandas APIs that meet a textual description of the desired behavior, and then uses program analysis and synthesis, driven by input/output examples, to correct syntactic and semantic errors in the DP-generated program. Additional breakthroughs are needed in program synthesis to make these techniques viable for large real-world programs.
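
The toy sketch below shows the search-based half of such a combination: a brute-force enumeration over a tiny expression grammar, filtered by input/output examples. In a combined system, a DP model would propose promising operators and snippets to seed and prune this search; here the grammar is fixed and purely illustrative.

```python
# Toy enumerative (brute-force) program search over a tiny expression grammar.
import itertools

TERMINALS = ["x", "1", "2"]  # the input variable and small constants
OPERATORS = ["+", "-", "*"]

# Specification as input/output examples; we are looking for f(x) = 2*x + 1.
EXAMPLES = [(0, 1), (1, 3), (4, 9)]


def expressions(depth: int):
    """Enumerate expression strings up to the given nesting depth."""
    if depth == 0:
        yield from TERMINALS
        return
    yield from expressions(depth - 1)
    for left, op, right in itertools.product(
        expressions(depth - 1), OPERATORS, expressions(depth - 1)
    ):
        yield f"({left} {op} {right})"


def satisfies(expr: str) -> bool:
    """Check the candidate expression against every input/output example."""
    return all(eval(expr, {"x": x}) == y for x, y in EXAMPLES)


# Search the exponentially growing space, shallowest candidates first.
for depth in range(3):
    match = next((e for e in expressions(depth) if satisfies(e)), None)
    if match is not None:
        print("found:", match)  # for example, ((2 * x) + 1)
        break
```

Even at this scale the space grows explosively with depth, which is precisely why pruning and seeding it with learned suggestions is attractive.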

Combining symbolic reasoning with DP. A new area of AI research is Neuro-Symbolic AI [21], which combines a rules-based approach with DL. One work that demonstrates its potential benefit combines DP with Inductive Logic Programming (ILP) [12] to extend ILP to deal with noisy data. More speculatively, one can use symbolic reasoning to complement DL and synthesize programs. For instance, say we have a Bayesian-like network that can infer desired states in a house, such as "if it is nighttime and it is not the bedroom then lights should be on when someone is in the room (95%)." When installing sensors in a smart home that can detect "people are in the room" and smart lights that can be turned on and off digitally, the system could reason about the desired state using the Bayesian network and use DL to create a program to turn on the lights when someone enters the room and turn them off when they leave.
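
A speculative sketch of that smart-home scenario, with the probabilistic network simplified to a deterministic rule: the symbolic layer decides the desired state, and the small event handler is the piece one would ask DP to synthesize. All device and sensor interfaces here are hypothetical.

```python
# Speculative sketch: a symbolic rule decides the desired state; the handler is
# the glue code a DP system would be asked to generate. All APIs are made up.

def desired_lights_on(is_night: bool, room: str, occupied: bool) -> bool:
    """Symbolic rule: at night, lights should be on in occupied non-bedroom rooms."""
    return is_night and room != "bedroom" and occupied


class SmartLight:
    """Stand-in for a real smart-light device API."""
    def __init__(self) -> None:
        self.on = False

    def set(self, on: bool) -> None:
        self.on = on


def on_occupancy_event(light: SmartLight, is_night: bool, room: str, occupied: bool) -> None:
    """The handler a DP system would synthesize from the rule and the device API."""
    light.set(desired_lights_on(is_night, room, occupied))


light = SmartLight()
on_occupancy_event(light, is_night=True, room="living room", occupied=True)
print("lights on?", light.on)  # True: someone entered at night
on_occupancy_event(light, is_night=True, room="living room", occupied=False)
print("lights on?", light.on)  # False: the room emptied
```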

For software engineers and AI researchers, many opportunities exist to leverage DL for synthesizing code and building more robust programs. It will require new paradigms that go beyond transformer-based DP systems. It will most likely become a collaborative effort, where the AI suggests, clarifies, and challenges the programmer. It will learn from multiple sources (such as Stack Overflow, code reviews, change management systems, agile standups, and even user manuals) and not just from code repositories. Based on what we know today, programmers need not worry about their jobs. DP will not replace programming. Its aim should be to increase the productivity of software development and thereby make up for the significant shortage of programmers today [22].

    1. AI Will Replace Coders By 2040, Warn Academics; https://bit.ly/3FyfQM9

    2. Aiken, A. Programming Tomorrow's Machines. Keynote address, PLDI 2021; https://bit.ly/3uXOJFi

    3. Alur, R. et al. Search-based program synthesis. Commun. ACM 61, 12 (Dec. 2018), 84–93.

    4. Austin, J. et al. Program Synthesis with Large Language Models; https://bit.ly/3Fygc5r

    5. Bodik, R. and Jobstmann, B. Algorithmic program synthesis: Introduction. International Journal on Software Tools for Technology Transfer 15, (2013), 397–411.

    6. Boh, W.F. and Yellin, D.M. Using Enterprise Architecture Standards in Managing Information Technology. Journal of Management Information Systems 23, 3 (2006), 163–207.

    7. Brooks, R. A human in the loop. IEEE Spectrum (Oct. 2021), 48–49.

    8. Chen, M. et al. Evaluating Large Language Models Trained on Code. arXiv, July 2021; https://bit.ly/3j40AyR

    9. Crofts, N. Whatever happened to model driven development? (Oct. 2020); https://bit.ly/3uYo2k4

    10. Devlin, J. et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, In Proceedings of NAACL-HLT, Minneapolis, MN, 2019.

    11. Dijkstra, E.W. On the foolishness of 'natural language programming'; https://bit.ly/3V5ZP5J

    12. Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61, 1 (2018), 1–64.

    13. Founder Effect; https://bit.ly/2LxOwjK

    14. GitHub Copilot; https://bit.ly/3YIdxPx

    15. Jain, N. et al. Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering, (2022), 1219–1231.

    16. Karpathy, A. Software 2.0. Medium, (Nov. 2017); https://bit.ly/3uYkdv6

    17. Kastner, C. Machine learning in production/AI engineering; https://bit.ly/3FAv5US

    18. Li, Y. et al. Competition-level code generation with AlphaCode. arXiv (2022); https://bit.ly/3hzrka4

    19. Lu, S. et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. In Proceedings of the NeurIPS Datasets and Benchmarks 2021.

    20. Puri, R. et al. CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks. In Proceedings of the NeurIPS Datasets and Benchmarks 2021.

    21. Susskind, Z. et al. Neuro-Symbolic AI: An emerging class of AI workloads and their characterization. arXiv (2021); https://bit.ly/3BHnRxk

    22. The software developer shortage in the U.S. and the global tech talent shortage in 2022. DAXX (Jan. 2022); https://bit.ly/3uZ9rVu

    23. Thompson, N.C. et al. Deep learning's diminishing returns. IEEE Spectrum 55, (Oct. 2021), 51–55.

    a. Another issue to be dealt with is how DP will guard against malware found in code repositories. One can imagine nefarious actors purposely populating repositories with malicious code to be used for training models and thereby causing DP to produce insecure or even harmful code. There are multiple defenses against this, such as using DL to detect and remove malicious code from repositories, but this needs more exploration before programs produced by DP will be trusted.

    b. DP actually suffers from many of the same issues that plagued widespread adoption of MDD: careful specification of the problem is required, a rigorous process must be enforced (in the case of DP, continuous training of the model), and the life-cycle of the generated code is difficult to manage. The latter spawned much work on how to maintain consistency between the code and the model. This has yet to be addressed by DP.

    I would like to thank Michael Elhadad, Assaf Marron, Vugranam Sreedhar, Mark Wegman, and Natan Yellin for comments on an earlier version of this Viewpoint. I also would like to thank the referees and editors who made valuable suggestions that improved the content and presentation of this Viewpoint.
