High-Performance Mobile System-on-Chip Clusters

Unfortunately, the events of recent days have taken us away from the stable and predictable development of mankind. The desire for world domination and immoderate consumption on the part of financial groups has led to an open military conflict. This conflict has already divided the world, and it appears that in the near future the world will split into two parts, each living on its own. This may also affect the international scientific community; nevertheless, ACM remains one of the platforms still open for communication. We would therefore like to publish a discussion post on paths to import substitution, a road that Russia will travel first and China probably afterwards.

The modern reality of international relations is that in the coming years Russia will have only limited access to modern high-tech products. This has already affected, first of all, powerful processors for data centers and high-performance workstations. It can be assumed that similar measures will soon affect China as well, as the history of sanctions against Chinese high-tech companies, with Huawei foremost among them, clearly shows.

Meanwhile, simply stating the problem is not enough; we must look for a quick way out of the situation using the resources that are available. In this regard, the idea of building a high-performance cluster-type computing system to replace Intel and AMD processors seems promising. China produces its own mobile systems-on-chip, which are used in tablets and modern smartphones. Although these chips are not the most powerful, their production is completely independent and does not rely on the patent rights of American or European companies.

To build cluster-type computing systems, one can use well-established cluster technology: the Linux operating system and software based on MPI for communication between nodes and OpenMP for parallelism within a chip. Interconnecting boards that carry mobile systems-on-chip (or systems-on-module) requires a dedicated network with high throughput and low latency. High-performance interconnects such as PCI Express 3.0 and later, USB 4.0, and InfiniBand (or, for example, the Russian-designed Angara ES8430 interconnect) would naturally be preferred. However, adopting these technologies would require a long time to develop new boards, which is unacceptable. Where suitable wired connections are lacking, one can try wireless protocols. These could provide switching between individual processors at speeds of up to hundreds of Gbps, but the network latency would be unacceptably large, exceeding 100 milliseconds, and would greatly reduce cluster performance. Wireless communication can therefore serve only as a supplement to wired links. In any case, the full range of available system-on-chip hardware must be considered in order to find the optimal balance between chip performance and the interconnect.
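To see why a latency above 100 ms rules out wireless links as the main interconnect, consider a minimal sketch of one bulk-synchronous step, assuming each step performs a fixed amount of local computation followed by a few neighbour exchanges, each costing one full network latency (message size, bandwidth, and any overlap of computation with communication are ignored). The 10 ms of computation and four exchanges per step are hypothetical values chosen for illustration; 25 µs is the USB small-packet latency cited in this post.

```python
def step_efficiency(compute_s: float, exchanges: int, latency_s: float) -> float:
    """Share of a bulk-synchronous step spent computing.

    Simplified model (an assumption): every exchange costs one full latency;
    message size, link bandwidth, and compute/communication overlap are ignored.
    """
    communication_s = exchanges * latency_s
    return compute_s / (compute_s + communication_s)


COMPUTE_S = 10e-3            # hypothetical local work per step: 10 ms
EXCHANGES = 4                # hypothetical: four neighbours, as in a 2D mesh
WIRED_LATENCY_S = 25e-6      # USB small-packet latency cited in the text
WIRELESS_LATENCY_S = 100e-3  # lower bound quoted for wireless links

if __name__ == "__main__":
    for name, latency in (("wired", WIRED_LATENCY_S), ("wireless", WIRELESS_LATENCY_S)):
        eff = step_efficiency(COMPUTE_S, EXCHANGES, latency)
        print(f"{name:8s} latency {latency * 1e6:>9.0f} us -> efficiency {eff:.3f}")
```

Under these assumptions the wired link keeps about 99% of each step for computation, while with the wireless link only about 2.4% of the step is spent computing and the rest is spent waiting on the network.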

One of the most accessible options for connecting individual multi-core systems-on-chip is the USB 3.0 family of interfaces. Transfer rates on these ports are at least 5 Gb/s, which is more than sufficient for building a cluster. Network latency is worse than with InfiniBand: for small packets it is about 25 µs versus 2–3 µs, but as packet size increases, this difference evens out. In addition, USB switches are much cheaper, and the technology enjoys wider software support. Given the design of USB switches, almost any cluster topology can be implemented. One must, however, take into account a peculiarity of the USB protocol: the network must always contain both master (host) and slave (device) devices.
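The observation that the latency gap evens out for larger packets follows from the standard latency-bandwidth (Hockney) model, in which sending n bytes takes roughly alpha + 8n/B seconds. The sketch below assumes a 5 Gb/s USB link with 25 µs latency and takes 2.5 µs (the midpoint of 2–3 µs) for InfiniBand, then computes what share of a USB transfer is due to its extra latency. It deliberately ignores InfiniBand's higher bandwidth, so it isolates the latency effect only.

```python
USB_LATENCY_S = 25e-6
IB_LATENCY_S = 2.5e-6      # midpoint of the 2-3 us range quoted for InfiniBand
USB_BANDWIDTH_BPS = 5e9    # nominal USB 3.0 rate; protocol overhead ignored


def transfer_time_s(size_bytes: int, latency_s: float, bandwidth_bps: float) -> float:
    """Hockney model: fixed start-up latency plus size divided by bandwidth."""
    return latency_s + size_bytes * 8 / bandwidth_bps


def extra_latency_share(size_bytes: int) -> float:
    """Fraction of a USB transfer attributable to its latency excess over InfiniBand."""
    total = transfer_time_s(size_bytes, USB_LATENCY_S, USB_BANDWIDTH_BPS)
    return (USB_LATENCY_S - IB_LATENCY_S) / total


if __name__ == "__main__":
    for size in (64, 4_096, 65_536, 1_048_576):
        print(f"{size:>9d} B: extra latency is {extra_latency_share(size):6.1%} of transfer time")
```

With these figures the latency penalty accounts for roughly 90% of a 64-byte transfer but only about 1.3% of a 1 MiB transfer.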

Given that available mobile systems-on-chip deliver on the order of 100 Gflops, a performance of several teraflops is quite achievable for small clusters of high-performance systems-on-chip. Using standard open operating systems such as Linux will make it much easier to run custom applications and will allow such systems to be deployed in the near future. Such clusters could also be heterogeneous, combining different systems-on-chip for different tasks (or, for example, FPGAs serving as specialized accelerators that can be reconfigured on the fly for specific tasks).
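The arithmetic behind "several teraflops" is simple, but real clusters rarely sustain their peak, and a heterogeneous cluster should divide work according to each chip's speed. A minimal sketch, assuming a sustained-to-peak efficiency factor (the 0.5 below is hypothetical) and hypothetical per-board Gflops figures, estimates the number of boards needed for a target and splits a batch of equal-cost work items in proportion to board performance using the largest-remainder method.

```python
import math


def boards_needed(target_gflops: float, board_gflops: float, efficiency: float) -> int:
    """Boards required if each one sustains board_gflops * efficiency."""
    return math.ceil(target_gflops / (board_gflops * efficiency))


def split_work(total_items: int, board_gflops: list[float]) -> list[int]:
    """Static load balancing: share items in proportion to board speed.

    The largest-remainder method keeps the shares integral while making
    them sum exactly to total_items.
    """
    total_speed = sum(board_gflops)
    exact = [total_items * g / total_speed for g in board_gflops]
    shares = [math.floor(x) for x in exact]
    leftover = total_items - sum(shares)
    # Hand the remaining items to the boards with the largest fractional parts.
    by_remainder = sorted(range(len(exact)), key=lambda i: exact[i] - shares[i], reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical: 3 Tflops target, 100 Gflops boards, 50% sustained efficiency.
    print(boards_needed(3_000, 100, 0.5))
    # Hypothetical mixed cluster of three different systems-on-chip.
    print(split_work(1_000, [120.0, 100.0, 80.0]))
```

This prints 60 and [400, 333, 267]. A static proportional split like this only balances well when work items have roughly uniform cost; otherwise a dynamic scheme would be needed.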

To improve performance, load must be balanced across systems-on-module, chips, and the cores of each chip. Two technologies appear promising. The first is efficient parallelization of computing processes, which will be determined by the topology connecting the individual devices. This can be either a classic mesh topology or another topology with better topological parameters, such as circulants, whose suitability as a topological basis for computing clusters has been shown in [1]. The second is routing based on the virtual coordinate system proposed in [2]. Additional research will be needed to optimize the network topology and the virtual coordinate system built on it in order to speed up parallel computational algorithms. Ways to solve this problem are being actively explored.
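The advantage of circulants over a mesh can be checked directly by computing the diameter and mean hop distance of both graphs with breadth-first search. The sketch below compares a 4-by-4 mesh with the 16-node circulant C(16; 1, 4), in which node i is linked to nodes i ± 1 and i ± 4 modulo 16, so both have node degree at most 4. The generator pair (1, 4) is an illustrative choice, not one taken from [1], and the routing scheme of [2] is not modelled.

```python
from collections import deque
from itertools import product


def circulant(n: int, gens: tuple[int, ...]) -> dict[int, set[int]]:
    """Undirected circulant C(n; gens): node v links to v +/- s (mod n)."""
    return {v: {(v + s) % n for s in gens} | {(v - s) % n for s in gens} for v in range(n)}


def mesh(rows: int, cols: int) -> dict[int, set[int]]:
    """2D mesh without wraparound; node (r, c) is numbered r * cols + c."""
    adj = {}
    for r, c in product(range(rows), range(cols)):
        nbrs = set()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                nbrs.add(rr * cols + cc)
        adj[r * cols + c] = nbrs
    return adj


def hops_from(adj: dict[int, set[int]], src: int) -> dict[int, int]:
    """Breadth-first search: hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter_and_mean(adj: dict[int, set[int]]) -> tuple[int, float]:
    """Largest and average hop count over all ordered pairs of distinct nodes."""
    n = len(adj)
    total, diameter = 0, 0
    for v in adj:
        dist = hops_from(adj, v)
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    return diameter, total / (n * (n - 1))


if __name__ == "__main__":
    for name, graph in (("4x4 mesh", mesh(4, 4)), ("C(16; 1, 4)", circulant(16, (1, 4)))):
        d, m = diameter_and_mean(graph)
        print(f"{name}: diameter {d}, mean distance {m:.2f}")
```

For the same 16 nodes the mesh has diameter 6 and mean distance of about 2.67 hops, while the circulant has diameter 3 and mean distance 2.00 hops.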

Many other problems remain, such as power supply for the entire system, heat removal, compact layout of the cluster components, and cluster reliability, but they are less significant and quite solvable.

In conclusion, we would like to summarize the main points of the proposed solution. To organize a high-performance cluster based on mobile systems-on-chip, it is necessary to:

Select the type of systems-on-chip that will form the basis of a compact computing cluster.
Decide on the type of hardware solution and telecommunications protocol for networking individual components of a computing cluster.
Determine the network topology.
Select the operating system and parallelization libraries under which the cluster will operate.
Decide on a list of application software.

Note that the proposed approach is intended for rapid implementation as a pilot project. During this implementation, software solutions, new data exchange protocols, and computing technologies will be worked out. Later, the cluster design can be refined (for example, by launching production of a new motherboard hosting several chips connected by a common bus).

Another problem is the operating system for the cluster. Android, which in practice depends on proprietary components and services, should be replaced by Linux or other open-source operating systems. Replacing Android is a paramount task and a separate area of activity, to which software companies that have now lost their contracts due to sanctions should redirect their attention.

References

[1] Yuefan Deng, Meng Guo, Alexandre F. Ramos, Xiaolong Huang, Zhipeng Xu, and Weifeng Liu. 2020. Optimal low-latency network topologies for cluster performance enhancement. J. Supercomput. 76, 12 (December 2020), 9558–9584.
[2] Aleksandr Romanov, Nikolay Myachin, and Andrei Sukhov. 2021. Fault-Tolerant Routing in Networks-on-Chip Using Self-Organizing Routing Algorithms. In IECON Proceedings (Industrial Electronics Conference), IEEE, 1–6.

Andrei Sukhov is a professor at HSE University, Moscow, Russia, and a senior member of the ACM; e-mail: asukhov@acm.org. Aleksandr Romanov is an associate professor and head of the CAD Laboratory at HSE University, Moscow, Russia; e-mail: a.romanov@hse.ru