Application architects at Melbourne University have seen massive performance boosts and far better resource utilisation after using OpenStack tools to rearchitect their high-performance computing (HPC) environment to function more like a cloud computing platform.
Called Spartan, the project grew out of escalating demand for computing resources from the university’s many scientific users, who for years have run a variety of computationally-intensive applications on large HPC clusters built from commodity Linux servers.
Where cloud environments are based on large numbers of virtual machines (VMs) running on moderately powered commodity servers, HPC environments spread computational tasks across large numbers of computing cores and speed the flow of data between nodes over specialised interconnects.
“Cloud systems primarily exist for their ease of management, their flexibility, and for being built on virtualised hardware,” Lev Lafayette, HPC support and training officer with the University of Melbourne, told this week’s OpenStack Australia Day in Melbourne.
“However, clouds are not high performance – and they can have very poor performance compared to our bare-metal HPC partitions. But their flexibility is worth the small overheads.”
HPC links large numbers of conventional CPUs over low-latency interconnects such as InfiniBand and 40Gbps Ethernet, with nodes based on Nvidia Tesla K80 graphics processing units (GPUs) offering massive additional computing capacity.
This architecture had delivered average latency of around 19 microseconds (µsec) on the typical HPC node – with 276 computing cores and 21GB of RAM per core – compared with 60µsec in a cloud partition running 400 VMs on more than 3000 cores.
Those differences are crucial for high-end HPC workloads involving hundreds of gigabytes of data.
“HPC systems aren’t for the cloud,” compute integration specialist Dr David Perry said.
“They’re managed services, and each is a little different and optimised for its own environment.”
Despite the platform’s power, a recent analysis of submitted workloads showed that around 75 percent of users’ tasks were able to run on a single HPC node – meaning they weren’t taking advantage of the expensive HPC interconnects and overall architecture.
Rather than keep blindly investing in those interconnects, Lafayette said, the team began exploring ways to bridge the two usage models – leveraging OpenStack tools to overlay the HPC environment with a more flexible hybrid, cloud-like architecture.
“Your best option is to build a system that is proportional to the usage you have,” Lafayette said.
“And then you can incrementally upgrade at a later date.”
The HPC team drew on a range of commonly used tools to build an abstraction layer over the HPC environment, including the Slurm workload manager, which distributes tasks across available resources; Git version control; Gerrit code review, which supports paired systems administration; and Puppet configuration management.
Rounding out the stack was heavy use of Nova, the OpenStack compute service, which provisions and decommissions virtual machines on demand.
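To give a rough sense of the kind of on-demand provisioning Nova enables, the Python sketch below uses the openstacksdk library to boot and later delete a compute instance; the cloud, image, flavour and network names are placeholders rather than details of the Spartan deployment.

```python
import openstack

# Connect using a named entry in clouds.yaml ("spartan" is a placeholder name)
conn = openstack.connect(cloud="spartan")

# Look up an image, flavour and network -- all illustrative names
image = conn.compute.find_image("ubuntu-16.04")
flavor = conn.compute.find_flavor("m1.large")
network = conn.network.find_network("private")

# Provision a VM on demand...
server = conn.compute.create_server(
    name="compute-node-demo",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(f"{server.name} is {server.status}")

# ...and decommission it once it is no longer needed
conn.compute.delete_server(server)
```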
This platform allowed single-node computing jobs to be allocated within the virtualised cloud partition, while conventional HPC tooling shuttled more complex, multi-node jobs onto the bare-metal HPC partition.
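That split might be expressed in ordinary Slurm job submissions along the lines of the sketch below, assuming a VM-backed partition called “cloud” and a bare-metal partition called “physical”; the partition, module and program names are illustrative rather than a record of Spartan’s actual configuration.

```python
import subprocess

# A single-node job steered to the VM-backed partition
SINGLE_NODE_JOB = """#!/bin/bash
#SBATCH --job-name=one-node-demo
#SBATCH --partition=cloud
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=02:00:00
module load Python/3.5.2
srun python analyse.py
"""

# A multi-node MPI job aimed at the bare-metal partition with the fast interconnect
MULTI_NODE_JOB = """#!/bin/bash
#SBATCH --job-name=mpi-demo
#SBATCH --partition=physical
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12
#SBATCH --time=12:00:00
module load OpenMPI/1.10.2
srun ./simulation
"""

# sbatch accepts a job script on standard input when no file is named
for script in (SINGLE_NODE_JOB, MULTI_NODE_JOB):
    result = subprocess.run(["sbatch"], input=script, text=True,
                            capture_output=True, check=True)
    print(result.stdout.strip())  # e.g. "Submitted batch job 12345"
```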
As well as offering users access to a managed hybrid HPC-cloud environment, the team has worked with users to find other ways of optimising their code to boost performance.
One beneficial approach has been to push for new applications to be provided as source code and built within the new environment, using tools like EasyBuild and Singularity.
This approach, which has already been used on more than 1000 applications, has frequently delivered performance improvements of 25 percent or more.
Some applications have sped up by a factor of 10 or more “because they are being built from source rather than the commonly available package,” Lafayette said.
“It may take you a fair bit of time to install a package this way, but the first time someone submits a 30-day job you’ve gotten that time back.”
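Builds of this kind are typically driven from an EasyBuild “easyconfig” file; a minimal sketch is below, with an illustrative easyconfig name rather than one of Spartan’s actual builds.

```python
import subprocess

# Illustrative easyconfig name; EasyBuild resolves and builds any missing
# dependencies from source when --robot is given
easyconfig = "HPL-2.2-foss-2016b.eb"

# Dry run first to see what would be built...
subprocess.run(["eb", easyconfig, "--robot", "--dry-run"], check=True)

# ...then the real build, which installs the result as a loadable module
subprocess.run(["eb", easyconfig, "--robot"], check=True)
```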
Heavy use of virtualisation has also paid dividends by allowing the team to maintain multiple versions of key applications – which can be essential in an academic environment, where haphazard version changes can compromise the reproducibility of research results.
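One simple way of holding those versions steady, sketched below with illustrative module names, is to pin exact toolchain and application versions in the job script so that a re-run months later uses the same software stack.

```python
import subprocess

# Pinning exact module versions (names are illustrative) keeps the software
# stack identical each time the analysis is re-run
REPRODUCIBLE_JOB = """#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=04:00:00
module purge
module load GCC/6.2.0 R/3.4.0
srun Rscript model.R
"""

subprocess.run(["sbatch"], input=REPRODUCIBLE_JOB, text=True, check=True)
```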
The idea may seem heretical to HPC purists, but by revisiting the overall architecture the Melbourne University team has been able to deliver a fully containerised architecture, running within cloud virtual machines on top of the HPC environment.
“As far as the user is concerned,” Perry said, “they don’t necessarily know they’re using a supercomputer.”
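A containerised job of that sort might look like the short sketch below, where the Singularity image and the command it runs are placeholders; the container carries its own libraries, so the same image can be run wherever the scheduler places the job.

```python
import subprocess

# Run an application from a pre-built container image (image name and
# command are placeholders); the container's bundled libraries insulate
# the job from the host's software environment
subprocess.run(
    ["singularity", "exec", "bioinformatics-tools.simg",
     "bwa", "mem", "reference.fa", "reads.fq"],
    check=True,
)
```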
Future plans include the ability to burst onto the Microsoft Azure cloud platform; expansion in the use of GPUs for raw computing power; use of Thespian for testing; and the addition of new architectures using additional VMs.
The architecture “allows our users to dynamically change their system environment as they need,” Lafayette said.
“They can have the consistency when they’re doing a research project, then switch between modules and recompile with particular extensions. They get the best of both worlds there.”