The Naiads in Greek mythology are the nymphs of fresh water. They are unpredictable and a bit scary, like big data, whose size has been exploding and continues to double every two years. Novel systems that process this data tsunami have been the focus of much research and development over the last decade. Many such big data processing systems are programmed as a dataflow, in which smaller programs with local state (nodes) are composed into larger dataflows through well-defined interfaces (edges). These dataflows are then scaled to huge inputs through data parallelism (the execution of one node in the dataflow is scaled out across many servers), task parallelism (independent nodes in the dataflow are executed at the same time), and pipelining (a node later in the dataflow can start its work on partial output from its predecessors).
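To make the idea of pipelining concrete, here is a minimal single-process sketch. It is not taken from any particular system: the node names `read_lines` and `to_upper`, and the `log` list used to record execution order, are hypothetical. Each node is a Python generator, and the edge between two nodes is simply the iterator passed from one to the next.

```python
from typing import Iterable, Iterator, List


def read_lines(lines: Iterable[str], log: List[str]) -> Iterator[str]:
    """Upstream node: emits one record at a time."""
    for line in lines:
        log.append(f"read:{line}")
        yield line


def to_upper(records: Iterable[str], log: List[str]) -> Iterator[str]:
    """Downstream node: consumes each record as soon as it is available."""
    for record in records:
        log.append(f"upper:{record}")
        yield record.upper()


def run_pipeline(lines: Iterable[str], log: List[str]) -> List[str]:
    # The edge between the two nodes is the generator handed to to_upper.
    return list(to_upper(read_lines(lines, log), log))


if __name__ == "__main__":
    events: List[str] = []
    print(run_pipeline(["a", "b"], events))
    print(events)
```

Because generators are lazy, the recorded events interleave (`read:a`, `upper:a`, `read:b`, `upper:b`): the downstream node works on the first record before the upstream node has produced the second. A real system would run the nodes on different threads or machines, but the ordering principle is the same.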
The best-known class of such dataflow systems is based on the map-reduce pattern, which enables large-scale batch processing. These systems can process terabytes of data for preprocessing and cleaning, data transformation, model training and evaluation, and report generation, achieving high throughput while keeping the computation fault tolerant across hundreds of machines.
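The map-reduce pattern itself can be illustrated with a small word-count sketch. This is a toy, in-memory illustration under stated assumptions, not the implementation of any specific system: the function names (`map_partition`, `shuffle`, `reduce_group`, `word_count`) are hypothetical, a thread pool stands in for a cluster of machines, and fault tolerance is not modeled at all.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple


def map_partition(lines: List[str]) -> List[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one partition."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(partitions: List[List[str]]) -> Dict[str, int]:
    # Data parallelism: the same map function runs on every partition.
    # The thread pool is only a stand-in for many servers.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_partition, partitions))
    groups = shuffle(pair for part in mapped for pair in part)
    return dict(reduce_group(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count([["a b a"], ["b c"]]))
```

The map function sees only its own partition and the reduce function sees only the values for one key, which is what lets such a computation be spread over many machines and lets a failed piece be rerun in isolation.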