Carnegie Mellon University School of Computer Science

High-Performance Computing (HPC)

Our Unix Engineering group supports privately owned, high-performance computing (HPC) clusters for various research and instructional groups. The HPC service is designed to minimize support cost and effort while maximizing the resources available to researchers.

To support many clusters effectively, we've established the following practices:

  • Common physical cluster architecture
  • Standardized operating system and deployment framework
  • Standardized job scheduler
  • Tiered support model for resources

Cluster Architecture

The HPC clusters that we manage consist of the following common set of components:

Head Node (required)

  • Used to install and configure all additional cluster resources
  • Provides shared storage across the cluster
  • Defines the network layout for the cluster
  • Acts as a login node for users
  • Acts as resource scheduler for the entire cluster

Storage Nodes / NAS Servers (required for persistent storage)

  • Provide persistent storage space for the cluster
  • ZFS is used for data integrity and snapshot capabilities (see the sketch after this list)
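
As a rough illustration of the snapshot capability, the sketch below creates and lists ZFS snapshots on a storage node. The dataset name "tank/cluster-home" is a placeholder, and actual pool layouts and snapshot schedules vary from cluster to cluster.

    # Rough sketch: create and list ZFS snapshots on a storage node.
    # The dataset name "tank/cluster-home" is a placeholder, not a real layout.
    import subprocess
    from datetime import datetime

    def snapshot_dataset(dataset="tank/cluster-home"):
        """Create a snapshot of the dataset named after the current time."""
        name = f"{dataset}@{datetime.now():%Y-%m-%d-%H%M}"
        subprocess.run(["zfs", "snapshot", name], check=True)
        return name

    def list_snapshots(dataset="tank/cluster-home"):
        """Return the existing snapshots for the dataset."""
        result = subprocess.run(
            ["zfs", "list", "-t", "snapshot", "-r", dataset],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print("created", snapshot_dataset())
        print(list_snapshots())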

Compute Nodes

  • Fungible resources - interchangeable - no persistent data on these machines
  • Scalable - easy to add in order to increase computational capacity
  • Inexpensive (relatively) - these do not need to be as reliable (or as expensive) as the head node because they are interchangeable and do not store data locally

Premium Compute Nodes (GPU nodes / Distributed storage)

  • Fungible resources - interchangeable - persistent data replicated across multiple machines
  • Scalable - easy to add in order to increase computational capacity
  • Relatively inexpensive - these do not need to be as reliable as the head node, but they typically contain scarce resources (GPUs)

Cluster Rack / Power

  • Customer-owned
  • Houses all cluster components (rack)
  • Eliminates per-system machine room charges
  • UPS required for critical systems (head node, networking, storage servers)
  • No UPS for compute nodes or GPU nodes

Network

  • Private gigabit Ethernet (required)
      • Management of cluster resources
      • Used to deploy nodes
      • Provides access to nodes and data across the cluster
      • NAT access between cluster nodes and the public internet
  • Fast networking (optional)
      • Faster data access
      • Reduced latency for MPI jobs (see the example after this list)
      • 10Gbit Ethernet supported
      • InfiniBand supported
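
To give a sense of why low-latency interconnects matter, here is a minimal MPI example in which each rank reports the node it landed on. It assumes the mpi4py package and an MPI library are installed on the cluster, which is not guaranteed on every system.

    # Minimal MPI example: each rank reports its hostname and rank 0 gathers
    # the results. Assumes the mpi4py package and an MPI library are installed
    # on the cluster, which is not guaranteed everywhere.
    import socket
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Gather every rank's hostname on rank 0 to show which nodes took part.
    hosts = comm.gather(socket.gethostname(), root=0)
    if rank == 0:
        print(f"{size} ranks ran on: {sorted(set(hosts))}")

A program like this is normally launched with mpirun or through the job scheduler (see the Slurm example later on this page); the interconnect determines how quickly ranks on different nodes can exchange data.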

Operating System and Deployment Framework

Previously, we standardized on the Rocks cluster distribution as our deployment framework. Rocks is built on top of CentOS, which is derived from Red Hat Enterprise Linux. While we still support Rocks clusters running CentOS 7, we are in the process of phasing Rocks out.

We currently deploy clusters using our in-house Hydra framework. Hydra is a cluster deployment and configuration management framework developed by SCS Unix Engineering using Ansible. It is designed specifically to fit the needs of SCS HPC cluster users and the systems administrators who maintain these clusters. Its design was heavily influenced by Unix Engineering’s decade of experience building and maintaining Rocks clusters for SCS, and the project was created in response to the extremely long development cycles and lack of maintenance updates for Rocks.

Hydra clusters currently run CentOS 7 or CentOS 8. Hydra significantly improves our ability to manage existing cluster resources and to deploy and customize new resources in a consistent, speedy, and reliable manner. Hydra reduces the downtime required for a cluster upgrade from more than a week to less than a day.
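
Hydra itself is internal to SCS, but the sketch below illustrates the general Ansible-based approach: re-applying a configuration playbook to a single node. The playbook, inventory, and host names are hypothetical and are not Hydra's actual files.

    # Rough illustration of the Ansible-based approach only: re-apply a
    # configuration playbook to a single compute node. The playbook,
    # inventory, and host names are hypothetical, not Hydra's actual files.
    import subprocess

    def reprovision_node(node, playbook="site.yml", inventory="hosts.ini"):
        """Run the configuration playbook against one cluster node."""
        subprocess.run(
            [
                "ansible-playbook",
                "-i", inventory,    # inventory describing the cluster
                "--limit", node,    # restrict the run to a single host
                playbook,
            ],
            check=True,
        )

    if __name__ == "__main__":
        reprovision_node("compute-0-3")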

Job Schedulers

For resource allocation, task distribution, and scheduling within clusters, SCS Unix Engineering supports the Slurm Workload Manager.

Running a job scheduler on the cluster allows tasks to be distributed to nodes continuously as resources become available, permitting better resource utilization. It also allows jobs to continue running even when individual compute nodes are unavailable. Job schedulers also allow fair-share and prioritization policies to be applied across the cluster, so researchers can safely share resources knowing that they can access them when they need to.
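
As an example of interacting with the scheduler, the sketch below submits a single command as a Slurm batch job by calling sbatch. The partition name, GPU request, limits, and the "python train.py" command are placeholders; what is actually available depends on how a given cluster is configured.

    # Sketch: submit a command as a Slurm batch job by calling sbatch.
    # The partition name, GPU request, and limits below are placeholders;
    # what is actually available depends on the cluster's configuration.
    import getpass
    import subprocess

    def submit(command, partition="gpu", gpus=1, cpus=4, time="1:00:00"):
        """Submit a single command as a batch job and return its job ID."""
        result = subprocess.run(
            [
                "sbatch",
                "--parsable",               # print only the job ID
                f"--partition={partition}",
                f"--gres=gpu:{gpus}",       # GPUs live on the premium nodes
                f"--cpus-per-task={cpus}",
                f"--time={time}",
                f"--wrap={command}",        # wrap a shell command in a job
            ],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    if __name__ == "__main__":
        print("submitted job", submit("python train.py"))
        # List this user's queued and running jobs.
        print(subprocess.run(["squeue", "-u", getpass.getuser()],
                             capture_output=True, text=True).stdout)

Once submitted, a job waits in the queue until the scheduler finds free resources that satisfy the request, subject to the cluster's fair-share and priority policies.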

OS / Application Containers

SCS Unix Engineering does not support Docker on the HPC clusters we manage. A number of compatibility and security issues with running Docker in a shared environment preclude us from providing support. We do support Singularity for running containers on HPC clusters.
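
As a brief illustration, the sketch below pulls a Docker Hub image into a Singularity image file and runs a command inside it. The image and command are arbitrary examples; users would typically run these steps on a compute node or inside a scheduled job.

    # Sketch: pull a Docker Hub image into a Singularity image file and run a
    # command inside it. The image and command are arbitrary examples.
    import subprocess

    IMAGE = "ubuntu.sif"

    # Convert a Docker image into a Singularity image file (no root required).
    subprocess.run(["singularity", "pull", IMAGE, "docker://ubuntu:22.04"],
                   check=True)

    # Run a command inside the container as the submitting user.
    subprocess.run(["singularity", "exec", IMAGE, "cat", "/etc/os-release"],
                   check=True)

Because Singularity runs containers as the submitting user rather than as root, it fits the shared, scheduler-managed environment described above.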

Tiered Support Levels

High Priority (cluster infrastructure)

This includes resources that are critical to the operation of the cluster as a whole. This generally means the cluster's head node, login nodes, and storage servers. We work to respond quickly when there are problems with these systems. 

Elevated Priority (distributed storage and scarce or complex resources)

These systems contain scarce resources (persistent data, GPUs, etc) that require higher levels of attention when they experience problems. We prioritize these systems at a higher level when troubleshooting but do not treat the availability of individual nodes in this tier as urgent.

Normal Priority (standard compute resources)

These systems are the regular compute nodes in a cluster. Cluster compute nodes are fungible, interchangeable resources and should not contain any persistent data. Support for these systems is "best effort" as time permits.