In other contexts, I have written at length about the cultural and technical divergence of the data analytics (aka machine learning and big data) and high-performance computing (aka big iron) communities. I have euphemistically called them "twins separated at birth." (See HPC, Big Data and the Peloponnesian War and Scientific Clouds: Blowin’ in the Wind). Like all twins, they share technical DNA and innate behaviors, despite their superficial differences. After all, in a time long, long ago, they were once united by their use of BSD UNIX and Sun workstations for software development.
Since then, both communities have successfully built scalable infrastructures using high-performance, low cost x86 hardware and a rich suite of (mostly) open source software tools. Both have addressed ecosystem deficiencies by developing special-purpose software libraries and tools (e.g., SLURM and Zookeeper for resource management and MPI and Hadoop for parallelism), and both have optimized hardware for their problem domains (e.g., Open Compute hardware building block standardization, FPGAs for search and machine learning, and GPU accelerators for computational science).
Like many of you, I have seen this evolution firsthand, as a card-carrying geek in both the HPC and cloud computing worlds. One of the reasons I went to Microsoft was to bring HPC ideas and applications to the nascent world of cloud computing. While at Microsoft, I led a research team to explore energy-efficient cloud hardware designs and new programming models, and I launched a public-private partnership between Microsoft and the National Science Foundation on cloud applications. Now that I am back in academia, I am seeking to bring cloud computing ideas back to HPC.
In that spirit, Jack Dongarra and I recently co-authored an article for the Communications of the ACM on the twin ecosystems of HPC and big data and the challenges facing both. Entitled Exascale Computing and Big Data, the article examines the commonalities and differences, and discusses many of the unresolved issues associated with resilience, programmability, scalability, and post-Dennard hardware futures. Most importantly, the article makes an impassioned plea for hardware and software integration and cultural convergence.
The possibilities for this convergence are legion. The algorithms underlying deep machine learning would benefit from the parallelization and data movement minimization techniques commonly used in HPC applications and libraries. Similarly, the approaches to failure tolerance and systemic resilience common in cloud software have broad applicability to high-performance computing. Both domains face growing energy constraints on the maximum size of feasible systems, necessitating shared focus on domain-specific architectural optimizations that maximize operations per joule.
Perhaps most important of all, there is increasing overlap of application domains. New generations of scientific instruments and sensors are producing unprecedented volumes of observational data, and intelligent, in situ algorithms are increasingly required to reduce raw data and identify important phenomena in real time. To see this, one need look no further than applications of machine learning to astronomy, which now include automated object identification. Conversely, client plus cloud services are increasingly model-based, with rich physics, image processing and context that depend on parallel algorithms to meet real-time needs; augmented reality applications are one such exemplar.
The explosive growth of Docker and containerized software management speaks to the need for lightweight, flexible software configuration management for increasingly complex and rich software environments. My hope is that we can develop a unified hardware/software ecosystem that leverages the technical and social strengths of each community. Each would benefit from the experiences and insights of the other. It is past time for the twins to have a family reunion.
Dear Prof. Reed,
My name is Eitan Zahavi and I am both a PhD student and one of Mellanox founders. I find your analysis very interesting and I tend to agree with most of it.
One area where the two communities differ is in the use of tight synchronization between the tasks of an application. An example would be the implementation of the MapReduce shuffle (on Hadoop, for example), which utilizes many TCP ports between every Mapper and Reducer, while an equivalent MPI-based application would probably use MPI_Alltoallv, which for larger messages ensures that no endpoint congestion occurs (two sources sending to the same destination).
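To make the contrast concrete, here is a toy Python simulation of the pairwise (rotation) schedule commonly used inside MPI_Alltoallv implementations for large messages. It is a sketch only, not real MPI or Hadoop code, and the function names are hypothetical; it simply checks the property described above: at every step, each destination endpoint receives from exactly one source, whereas an unscheduled all-pairs TCP shuffle can direct many senders at one receiver simultaneously.

```python
# Toy model of a pairwise all-to-all schedule: with p ranks, at step k
# rank i sends its block to rank (i + k) mod p and receives from
# rank (i - k) mod p, so each endpoint handles one message per step.

def pairwise_alltoall_schedule(p):
    """Return a list of steps; each step is a list of (src, dst) pairs."""
    return [[(i, (i + k) % p) for i in range(p)] for k in range(p)]

def endpoint_congestion_free(steps):
    """True if, in every step, no destination is targeted by two sources."""
    return all(len({dst for _, dst in step}) == len(step) for step in steps)

if __name__ == "__main__":
    p = 8
    steps = pairwise_alltoall_schedule(p)
    # Every (src, dst) pair is covered exactly once across all steps...
    assert ({pair for step in steps for pair in step}
            == {(i, j) for i in range(p) for j in range(p)})
    # ...and no step ever has two sources sending to the same destination.
    assert endpoint_congestion_free(steps)
    print("schedule covers all pairs with no endpoint congestion")
```

The naive shuffle corresponds to issuing all p² transfers at once; the scheduled version trades that for p ordered steps, which is one reason tightly synchronized HPC collectives behave so differently on the network than loosely coupled big-data frameworks.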
Do you think such differences are inherent? What is their cause? Will they be resolved?
A more general difference is the network technology being used: mostly Ethernet for big data and specialized interconnection networks for HPC. Do you see any chance that these communities will converge on that front too?