Enabling Java For High-Performance Computing

Exploiting distributed shared memory and remote method invocation adapts pure Java for large-scale, distributed-memory parallel machines and their applications.

Java has become increasingly popular as a general-purpose programming language. Current Java implementations focus mainly on the portability and interoperability required for Internet-centric client/server computing. Key to Java’s success is its intermediate “bytecode” representation, which can be exchanged and executed by Java Virtual Machines (JVMs) on almost any computing platform. However, along with that popularity has come an increasing need for an efficient execution mode. For sequential execution, just-in-time compilers improve application performance [4]. But high-performance computing applications typically require multiple-processor systems, so efficient interprocessor communication is also needed, in addition to efficient sequential execution.

As an OO language, Java uses method invocation as its main communication concept; for example, inside a single JVM, concurrent threads of control can communicate through synchronized method invocations. On a shared-memory multiprocessor (SMP), this approach allows a limited form of parallelism by mapping threads to different physical processors. For distributed-memory systems, Java offers the concept of a remote method invocation (RMI). With RMI, the method invocation, along with its parameters and results, is transferred across a network to and from the serving object on a remote JVM (see the sidebar “Remote Method Invocation”).
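
To make the RMI model concrete, the following minimal sketch uses the standard java.rmi API; the Account interface, the registry name, and the client class are hypothetical, and the server-side setup (exporting and binding the object) is omitted.

    import java.rmi.Naming;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // A remote interface: its methods can be invoked from another JVM,
    // so each must declare RemoteException.
    interface Account extends Remote {
        void deposit(int amount) throws RemoteException;
        int balance() throws RemoteException;
    }

    // Client side: look up the serving object by name and invoke methods
    // on it; parameters and results are serialized by the RMI runtime.
    class AccountClient {
        public static void main(String[] args) throws Exception {
            Account account = (Account) Naming.lookup("rmi://server/account");
            account.deposit(100);
            System.out.println("balance = " + account.balance());
        }
    }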

With these built-in concepts for concurrency and distributed-memory communication, Java provides a unique opportunity for a widely accepted general-purpose language, with a large base of existing code and programmers, to also suit the needs of parallel (high-performance) computing. Unfortunately, Java is not yet widely perceived by programmers as such, due to the inefficiency of early implementations based on bytecode interpretation. Here, we provide evidence of Java’s usefulness for parallel computing by describing efficient implementation techniques, showing that the combination of powerful compilers and efficient runtime systems yields Java execution environments able to exploit the computational power of distributed-memory parallel computers, scaling to system sizes unreachable by pure shared-memory approaches.

Suited for Parallel Computing

A major Java advantage is that it provides communication mechanisms inside the language environment, whereas other languages, such as Fortran and C++, require external mechanisms like message passing. Although programming interfaces between Java and the Message Passing Interface (MPI) standard have been developed [5], the MPI message-passing style of communication is difficult to integrate cleanly with Java’s OO model, especially as MPI assumes a single-program, multiple-data (SPMD) programming model quite different from Java’s multithreading model. One of our purposes here is to show that, with efficient compilers and runtime systems, pure Java is a platform well suited for parallel computing. We explore two approaches for achieving this goal.

The first allows truly parallel execution of multithreaded Java programs on distributed memory platforms. This idea is implemented in the Hyperion system,1 which compiles multithreaded bytecode for execution on a “distributed virtual machine,” allowing threads on different nodes to share objects—the purest approach to Java-based parallel computing on distributed-memory systems. Hyperion completely hides the distributed-memory environment from the application programmer, allowing access of any object from any machine.

The second approach gives programmers the explicit notion of shared objects, so they indicate which objects will be shared among multiple threads; communication between threads is reduced to method invocations on the shared objects. This approach is implemented in the Manta system.2 Manta statically compiles Java source code to executable programs; its runtime system provides highly efficient RMI, as well as a similar mechanism called replicated method invocation (RepMI), allowing for more efficient use of object locality.

The basic implementation techniques we outline here for both Hyperion and Manta ultimately yield efficient parallel execution of Java programs on distributed-memory platforms. We also provide performance data for the respective communication operations, discuss the suitability of the approaches to parallel programming, and compare their promise and limitations for Java-centric parallel computing.

Hyperion: Transparent distributed multithreading. Hyperion allows a Java programmer to view a cluster of processors as executing a single JVM [1]. In Java, concurrency is exposed to the programmer through threads sharing a common address space. The standard library provides facilities for starting threads and switching control between them; the Java memory model specifies how threads may interact with the common memory (see the sidebar “Java Multithreading”). It is thus possible to map a multithreaded Java program directly onto a cluster: the threads are spread across the processing nodes to deliver actual concurrent execution and load balancing, and the Java memory model is implemented by a distributed shared memory (DSM) substrate, so the language’s original semantic model remains unchanged.
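
The following sketch (our own illustration, not Hyperion code) shows the constructs involved: threads started through the standard library and a shared object whose synchronized methods provide the mutual exclusion and visibility guarantees of the Java memory model. Under Hyperion, such an unmodified program runs with its threads spread across the cluster.

    // A shared object: synchronized methods give mutual exclusion and
    // the visibility guarantees defined by the Java memory model.
    class Counter {
        private int value;
        public synchronized void inc() { value++; }
        public synchronized int get() { return value; }
    }

    class CounterDemo {
        public static void main(String[] args) throws InterruptedException {
            final Counter counter = new Counter();
            Thread[] workers = new Thread[4];
            for (int i = 0; i < workers.length; i++) {
                // Each thread increments the shared counter; under Hyperion,
                // the threads may execute on different cluster nodes.
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        for (int n = 0; n < 1000; n++) counter.inc();
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) t.join(); // wait for all threads to finish
            System.out.println(counter.get()); // prints 4000
        }
    }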

For efficient sequential execution, Hyperion compiles Java bytecode into optimized native code through a two-step process: Java bytecode is first translated to C, and a C compiler is then used to generate native code for the processors of the cluster. Using a C compiler for generating native code delivers the benefits of platform-specific compiler optimizations while keeping the system itself platform-independent.

Portability across networked systems has always been a major objective of the Hyperion design. The runtime system is therefore built on top of a portable environment called DSM-PM2, which extends the multithreaded library PM2 [9] with a DSM facility. DSM-PM2 provides lightweight multithreading, high-performance communication, and page-based DSM. It is portable across a spectrum of high-performance networks, including the Scalable Coherent Interface (SCI), Myrinet, and Gigabit Ethernet, and can be used with most common communication interfaces, including the standard TCP protocol, MPI, and the Virtual Interface Architecture (VIA).

The central aspect of Hyperion’s design is the management of distributed-object memory (see Figure 1). Hyperion’s programming model has to provide the illusion of a uniformly accessible, shared-object memory independent of physical object locations. According to the specification of the Java memory model, each thread in Hyperion is conceptually equipped with a local cache, interacting with a common main memory. Caching greatly improves performance if the application exhibits temporal locality, accessing a cached object multiple times before the cache is invalidated.

Several systems provide a transparent cluster implementation of Java, including Java/DSM [12], cJVM [2], and JESSICA [6]. However, these other systems are based on interpreted bytecode rather than on native compilation, resulting in less efficient program execution; Hyperion is the only one combining full transparency with natively compiled code. While the systems differ in their approaches to implementing the Java memory model, as well as in the types of applications they target, together they demonstrate the potential of using Java to efficiently utilize clusters.

The net result of Hyperion’s implementation techniques is efficient execution of unmodified Java programs on a range of distributed clusters. This flexibility is a major incentive for Java users in search of high performance; Table 1 lists the timings of local, cached, and remote elementary DSM operations, measured with Hyperion on a cluster of 200MHz Pentium Pros communicating over Myrinet.

The first two lines in the table give the times in microseconds to access local, cached, and remote objects on this platform. Remote access times include the cost of transferring the page containing the object; the page size is 4,096 bytes. A remote read operation includes detecting the absence of the object and transmitting the request (114 microseconds, 30%); transferring the page across the network (134 microseconds, 37%); and additional Hyperion-level processing (122 microseconds, 33%). Writing to a cached copy of an object involves recording the modifications for later consistency updates, adding 0.48 microseconds to the cost of reading. Writing to a remote object is more expensive than reading it because a remote write must also transmit the modification back to the object’s home location.

The last line in the table gives the times to perform remote and local synchronization, that is, to enter and exit a Java monitor. In the remote case, the lock accessed by the monitor is on a different node.

Manta: Efficiently shared objects. Instead of providing a shared-memory programming model, as in the philosophy behind Hyperion, Manta requires the programmer to store shared data in remote objects that can be accessed using Java’s RMI. Manta’s programming model is the same as the one in standard RMI, along with a simple extension allowing the programmer to improve locality by indicating which objects should be replicated. Manta implements this programming model in a highly efficient way, using a completely new implementation of Java and RMI [8]. Manta uses a native offline Java compiler that statically compiles Java source code into executable binary code. Using a static compiler allows aggressive, time-consuming optimizations. Manta’s fast RMI implementation consists of three components:

  • Lightweight RMI protocol. This protocol is implemented completely in C, avoiding the layering overhead of other RMI systems that invoke low-level C routines from Java code via the slow Java Native Interface (JNI). It minimizes the overhead of thread switching, buffer management, data format conversions (byte swapping), and copying.
  • Object serialization. The Manta compiler generates specialized serialization routines for serializable argument classes, avoiding the overhead of the runtime type inspection typical of most other Java systems (a sketch of such a routine follows this list).
  • Efficient communication software. Manta is implemented on top of the Panda communication library [3, 8], which provides message passing, remote procedure call (RPC), and broadcasting. On Myrinet, Panda uses a highly optimized low-level communication substrate; on Ethernet, Panda uses the standard UDP protocol.
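
To illustrate the second component, here is a hand-written analogue of what a generated serialization routine might look like; the Point class and its serializer are our own hypothetical example, not Manta output. Instead of inspecting the class by reflection on every call, as general-purpose serializers do, the generated code reads and writes the fields directly.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    class Point implements java.io.Serializable {
        int x, y;
    }

    // Specialized serializer: the field accesses are compiled in,
    // so no runtime type inspection is needed.
    class PointSerializer {
        static void write(DataOutput out, Point p) throws IOException {
            out.writeInt(p.x);
            out.writeInt(p.y);
        }
        static Point read(DataInput in) throws IOException {
            Point p = new Point();
            p.x = in.readInt();
            p.y = in.readInt();
            return p;
        }
    }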

The RMI implementation described here is compatible with the Java language specification but uses a different communication protocol. Manta uses additional mechanisms to interoperate with other JVMs [8] (see Figure 2). A parallel Java program compiled with Manta’s native compiler runs on a Pentium-based cluster. Application processes use Manta’s fast RMI protocol to communicate with one another. They can also communicate with applications running on standard JVMs using the standard RMI protocol. They can even exchange bytecode with the applications, as required for polymorphic RMIs [11]. For the purpose of bytecode exchange in Java programs, the Manta compiler also generates bytecode, which can be sent to remote JVMs; the Manta runtime system contains a compiler to process incoming bytecode from a JVM. The result is that Manta provides efficient sequential code, fast communication, interoperability with JVMs, and polymorphic RMIs. Manta’s RMI combines the efficiency of a C-like RPC and the flexibility of Java RMI.

The JavaParty project [10] implemented similar optimizations to Java RMI but without interoperability to standard JVMs. Because JavaParty’s RMI is implemented in pure Java, it is also less efficient than Manta’s RMI.

Even with all the optimizations performed by Manta, method invocations on shared objects are much slower than sequential method invocations, that is, invocations on normal Java objects not declared to be remote. Even within the same address space, accessing a remote object is costly in terms of runtime overhead. Manta addresses this problem through the concept of replicated method invocation (RepMI) [7], whereby shared objects are replicated across the processes of a parallel application. The RepMI advantage is that methods that do not modify a replicated object (read-only methods) can be performed on the local copy. Such methods are recognized by the Manta compiler and executed without communication, resulting in completion times close to those of sequential method invocation. Manta also provides a mechanism for replicating collections of objects, including trees and other graphs.

To obtain high performance, RepMI implements methods that modify a replicated object (write methods) using an update protocol with function shipping—the same approach used in the Orca shared-object language and system [3]. This protocol updates all copies of a replicated object by broadcasting the write operation and performing the operation on all replicas. The broadcast protocol is provided by the Panda library [3] using totally ordered broadcasting so all replicas are updated consistently.
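
The following sketch shows what a replicated object might look like; the Replicated marker interface is hypothetical (Manta’s actual RepMI notation differs), but it conveys the division into read-only methods, executed locally, and write methods, broadcast to all replicas.

    // Hypothetical marker interface for classes whose instances
    // are to be replicated across all processes.
    interface Replicated {}

    class Minimum implements Replicated {
        private int value = Integer.MAX_VALUE;

        // Read-only method: recognized by the compiler and executed
        // on the local replica, without communication.
        public int get() { return value; }

        // Write method: shipped to all replicas by totally ordered
        // broadcast, so every copy applies the same update.
        public synchronized void set(int v) {
            if (v < value) value = v;
        }
    }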

Table 2 lists the timings of local, remote, and replicated method invocations, measured with Manta on a Myrinet cluster with 200MHz Pentium Pros. The remote write method costs 41 microseconds of runtime. Calling a remote read method requires additional serialization of the result data and costs a total of 42 microseconds to complete. A parameterless invocation of the underlying Panda RPC protocol takes 31 microseconds.

Example applications. We evaluated Hyperion and Manta with two small (in terms of code size) example applications, measured on two clusters with identical processors (200MHz Pentium Pros) and networks (Myrinet). We compared application execution times against a state-of-the-art commercial just-in-time compiler, the IBM JIT 1.3.0. The first application is All-pairs Shortest Paths (ASP), which computes the shortest path between every pair of nodes in a graph using a parallel version of Floyd’s algorithm. ASP uses a distance matrix divided row-wise among the available processors; at the beginning of iteration k, all processors need the value of the kth row of the matrix.

For the shared-memory version of ASP used by Hyperion, a single thread is allocated on each processor. Each thread owns a contiguous block of rows of the graph’s shared distance matrix. On each iteration, each thread fetches the necessary row, updates its own rows, then synchronizes to wait for all other threads to finish the iteration. Figure 3 (top) shows that the program performs well on small clusters with limited numbers of nodes; the cluster available to Hyperion has only eight nodes. However, having all threads request the current row separately is likely to limit the cluster’s ability to use more nodes efficiently. This situation might best be addressed by extending Hyperion’s programming interface with methods for collective communication among the threads of a thread group.
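
One such worker thread might look as follows (an illustrative sketch with assumed names, not the actual benchmark code); under Hyperion, reading row k transparently fetches the page holding it if no cached copy is available, and a barrier provides the per-iteration synchronization.

    import java.util.concurrent.CyclicBarrier;

    // One ASP worker: owns rows lo..hi-1 of the shared distance matrix.
    class AspWorker extends Thread {
        private final int[][] dist;
        private final int lo, hi;
        private final CyclicBarrier barrier;

        AspWorker(int[][] dist, int lo, int hi, CyclicBarrier barrier) {
            this.dist = dist; this.lo = lo; this.hi = hi; this.barrier = barrier;
        }

        public void run() {
            try {
                int n = dist.length;
                for (int k = 0; k < n; k++) {
                    int[] rowK = dist[k]; // remote fetch under the DSM
                    for (int i = lo; i < hi; i++)
                        for (int j = 0; j < n; j++)
                            if (dist[i][k] + rowK[j] < dist[i][j])
                                dist[i][j] = dist[i][k] + rowK[j];
                    barrier.await(); // wait for all threads each iteration
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }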

In the RMI version, each row of the distance matrix is stored in an object implementing the interface java.rmi.Remote, making it accessible to threads on remote nodes. The processor owning the row for the next iteration stores it in its remotely accessible object. Because each machine has to fetch each row for itself, each row is sent across the network multiple times (as with Hyperion), causing high overhead on the machine that owns the row. The replicated ASP implementation uses replicated objects for the rows: whenever a processor writes a row into its object, the new row is forwarded to all machines, and each processor can then read it locally. Figure 3 (top) shows that the RMI version performs well on up to 16 nodes; on more nodes, the overhead of sending the rows becomes prohibitive. With 64 nodes, the RMI version completes after 38 seconds, while the RepMI variant needs only 18 seconds. The difference is due to the efficient broadcast in Manta’s runtime system.

The other example application is the Traveling Salesperson Problem (TSP), which searches for the shortest route visiting all cities in a given set. We use a branch-and-bound algorithm, pruning large parts of the search space by ignoring partial routes already longer than the current best solution. The program is parallelized by dynamically distributing the search space over the different nodes. It tracks the best solution found so far; each node needs an up-to-date copy of this solution to avoid unnecessary work, so it reads the value frequently, while updates happen only infrequently.

The Hyperion shared-memory version, as in ASP, uses a single thread per node. The object containing the hitherto best solution is protected by a monitor. The program scales well on small clusters due to Hyperion’s lightweight implementation of its DSM primitives and the application’s favorable ratio of local computation to remote data access (see Figure 3, bottom).

In an RMI version of TSP, the overhead involved in frequently reading a single, remote Minimum object, containing the hitherto best solution, would result in poor performance. Instead, a manually optimized version has to be used in which the threads read the Minimum value from a local variable. When a thread finds a better Minimum value, it invokes an updating RMI on all peer threads, which, for this purpose, have to be remote objects. In contrast, the replicated version of TSP is straightforward and intuitive: the global Minimum object implements the replication interface, and all changes to it are forwarded automatically. Each node can invoke the object’s read method locally, only slightly slower than reading a variable directly. While being as simple as the Hyperion version, the replicated version on 64 nodes completes in 31 seconds, almost as quickly as the very complex, manually optimized RMI version, which needs 28 seconds.
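
The replicated version’s pruning loop can be sketched as follows, reusing the Minimum object from the earlier example (a simplified, sequential illustration; the real program also distributes subtrees over the nodes and closes the route back to the start city).

    // Simplified branch-and-bound over a distance matrix; 'visited'
    // marks the cities on the partial route ending at 'city'.
    class TspSolver {
        private final int[][] dist;
        private final Minimum min; // best route length found so far

        TspSolver(int[][] dist, Minimum min) { this.dist = dist; this.min = min; }

        void search(int city, boolean[] visited, int count, int len) {
            if (len >= min.get()) return; // prune: frequent local read
            if (count == dist.length) {
                min.set(len);             // rare write, broadcast to all
                return;
            }
            for (int next = 0; next < dist.length; next++) {
                if (!visited[next]) {
                    visited[next] = true;
                    search(next, visited, count + 1, len + dist[city][next]);
                    visited[next] = false;
                }
            }
        }
    }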

Conclusion

With efficient implementations like the ones provided by Hyperion and Manta, Java represents an unprecedented opportunity for exploiting a widely accepted general-purpose language for high-performance computing. Moreover, Java provides a unique way of rapidly prototyping parallel applications. Starting with a single JVM, parallel applications can be developed based on multithreading. On a small scale (four to eight nodes), a JVM enables truly parallel thread execution on a multiprocessor machine with shared memory. For larger numbers of CPUs, Hyperion-like systems provide transparent execution of multithreaded programs on distributed systems.

However, allowing Hyperion programmers to view a cluster as a black box is a two-edged sword. They can abstract away from the internal structure of the cluster, with its individual nodes and private memories, but efficient parallel execution results only if each thread predominantly references data that is local or locally cached. If this local-data property does not hold, the communication costs of accessing remote data severely limit the performance improvement obtainable by spreading the threads across a cluster’s multiple nodes. Such multithreaded Java programs can then be converted into programs making explicit use of shared objects and replicated objects. The conversion requires the programmer to determine which objects will be shared or replicated and to adapt the program to access them through RMI. Given a high-performance implementation of RMI, as with Manta, such programs can obtain high efficiencies even on large-scale, distributed-memory machines.

Figures

Figure 1. Management of distributed-object memory in Hyperion.

Figure 2. Interoperability of Manta and Sun RMI.

Figure 3. Application execution times with Hyperion, RMI, and RepMI; the cluster available to Hyperion has only eight nodes.

Tables

Table 1. Completion times (in microseconds) of elementary DSM operations on a Pentium Pro/Myrinet cluster.

Table 2. Completion times (in microseconds) of read and write operations on a Pentium Pro/Myrinet cluster.

References

    1. Antoniu, G., Bougé, L., Hatcher, P., MacBeth, M., McGuigan, K., and Namyst, R. The Hyperion system: Compiling multithreaded Java bytecode for distributed execution. Parallel Comput. 27, 10 (Sept. 2001), 1279–1297.

    2. Aridor, Y., Factor, M., Teperman, A., Eilam, T., and Schuster, A. Transparently obtaining scalability for Java applications on a cluster. J. Parall. Distrib. Comput. 60, 10 (Oct. 2000), 1159–1193.

    3. Bal, H., Bhoedjang, R., Hofman, R., Jacobs, C., Langendoen, K., Rühl, T., and Kaashoek, M. Performance evaluation of the Orca shared object system. ACM Trans. Comput. Syst. 16, 1 (Feb. 1998), 1–40.

    4. Burke, M., Choi, J.-D., Fink, S., Grove, D., Hind, M., Sarkar, V., Serrano, M., Sreedhar, V., Srinivasan, H., and Whaley, J. The Jalapeño dynamic optimizing compiler for Java. In Proceedings of the ACM 1999 Java Grande Conference (San Francisco, June). ACM Press, New York, 1999, 129–141.

    5. Carpenter, B., Getov, V., Judd, G., Skjellum, A., and Fox, G. MPJ: MPI-like message passing for Java. Concurrency: Pract. Exper. 12, 11 (Sept. 2000), 1019–1038.

    6. Ma, M., Wang, C.-L., and Lau, F. JESSICA: Java-enabled single-system-image computing architecture. J. Parall. Distrib. Comput. 60, 10 (Oct. 2000), 1194–1222.

    7. Maassen, J., Kielmann, T., and Bal, H. Efficient replicated method invocation in Java. In Proceedings of the ACM 2000 Java Grande Conference (San Francisco, June). ACM Press, New York, 2000, 88–96.

    8. Maassen, J., van Nieuwpoort, R., Veldema, R., Bal, H., and Plaat, A. An efficient implementation of Java's remote method invocation. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'99) (Atlanta, May). ACM Press, New York, 1999, 173–182.

    9. Namyst, R., and Méhaut, J.-F. PM2 Parallel multithreaded machine: A computing environment for distributed architectures. In Proceedings of Parallel Computing (ParCo'95) (Ghent, Belgium, Sept.). Elsevier Science Publishers, Amsterdam, The Netherlands, 1995, 279–285.

    10. Philippsen, M., Haumacher, B., and Nester, C. More efficient serialization and RMI for Java. Concurrency: Pract. Exper. 12, 7 (May 2000), 495–518.

    11. Waldo, J. Remote procedure calls and Java remote method invocation. IEEE Concurrency 6, 3 (July–Sept. 1998), 5–7.

    12. Yu, W., and Cox, A. Java/DSM: A platform for heterogeneous computing. Concurrency: Pract. Exper. 9, 11 (Nov. 1997), 1213–1224.

Footnotes

    The work on Manta is supported in part by a Universitair Stimulerings Funds grant from Vrije Universiteit, Amsterdam, The Netherlands.

    The Hyperion-PM2 collaboration is supported by the U.S. National Science Foundation and French Institut National de Recherche en Informatique et en Automatique through the U.S.-France Cooperative Research program based at the University of New Hampshire and the ENS Lyon, France.

    1Hyperion was designed and implemented at the University of New Hampshire by a group led by Phil Hatcher, in collaboration with Mark MacBeth and Keith McGuigan. The Hyperion-PM2 interface was built at ENS Lyon, France, by a group led by Luc Bougé, in collaboration with Gabriel Antoniu, who is also the principal author of PM2's DSM support; other members of the PM2 team included Raymond Namyst, Jean-François Méhaut, Olivier Aumage, and Vincent Danjean.

    2Manta was designed and implemented at Vrije Universiteit, Amsterdam, The Netherlands, by a group led by Henri Bal, in cooperation with Thilo Kielmann, Jason Maassen, Rob van Nieuwpoort, Ronald Veldema, Rutger Hofman, Ceriel Jacobs, and Aske Plaat. The lower-level Panda and Myrinet communication software was developed by Raoul Bhoedjang, Tim Rühl, Rutger Hofman, Ceriel Jacobs, and Kees Verstoep.
