Carnegie Mellon University School of Computer Science

High-Performance Computing (HPC)

Our Unix Engineering group supports privately owned, high-performance computing (HPC) clusters for various research and instructional groups. The HPC service is designed to minimize support cost and effort while maximizing the resources available to researchers.

To support many clusters effectively, we've established the following practices:

  • Common physical cluster architecture
  • Standardized operating system and deployment framework
  • Standardized job scheduler
  • Tiered support model for resources

Cluster Architecture

The HPC clusters that we manage consist of the following common set of components:

Head Node (required)

  • Used to install and configure all additional cluster resources
  • Provides shared storage across the cluster
  • Defines the network layout for the cluster
  • Acts as a login node for users
  • Acts as resource scheduler for the entire cluster

Storage Nodes / NAS Servers (required for persistent storage)

  • Provide persistent storage space for the cluster
  • ZFS is used for data integrity and snapshot capabilities (see the sketch after this list)
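
As a rough illustration of the snapshot capability, the sketch below creates and lists ZFS snapshots on a storage node. The dataset name "tank/cluster-home" is a placeholder, and actual pool layouts and snapshot schedules vary from cluster to cluster.

    # Rough sketch: create and list ZFS snapshots on a storage node.
    # The dataset name "tank/cluster-home" is a placeholder, not a real layout.
    import subprocess
    from datetime import datetime

    def snapshot_dataset(dataset="tank/cluster-home"):
        """Create a snapshot of the dataset named after the current time."""
        name = f"{dataset}@{datetime.now():%Y-%m-%d-%H%M}"
        subprocess.run(["zfs", "snapshot", name], check=True)
        return name

    def list_snapshots(dataset="tank/cluster-home"):
        """Return the existing snapshots for the dataset."""
        result = subprocess.run(
            ["zfs", "list", "-t", "snapshot", "-r", dataset],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print("created", snapshot_dataset())
        print(list_snapshots())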

Compute Nodes

  • Fungible resources - interchangeable - no persistent data on these machines
  • Scalable - easy to add in order to increase computational capacity
  • Inexpensive (relatively) - these do not need to be as reliable (or as expensive) as the head node because they are interchangeable and do not store data locally

Premium Compute Nodes (GPU nodes / Distributed storage)

  • Fungible resources - interchangeable - persistent data replicated across multiple machines
  • Scalable - easy to add in order to increase computational capacity
  • Relatively inexpensive - these do not need to be as reliable as the head node, but they typically contain scarce resources (GPUs)

Cluster Rack / Power

  • Customer-owned
  • Houses all cluster components (rack)
  • Eliminates per-system machine room charges
  • UPS required for critical systems (head node, networking, storage servers)
  • No UPS for compute nodes or GPU nodes

Network

  • Private gigabit Ethernet (required)
      • Management of cluster resources
      • Used to deploy nodes
      • Provides access to nodes and data across the cluster
      • NAT access between cluster nodes and the public internet
  • Fast networking (optional)
      • Faster data access
      • Reduced latency for MPI jobs (see the example after this list)
      • 10Gbit Ethernet supported
      • InfiniBand supported
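
To give a sense of why low-latency interconnects matter, here is a minimal MPI example in which each rank reports the node it landed on. It assumes the mpi4py package and an MPI library are installed on the cluster, which is not guaranteed on every system.

    # Minimal MPI example: each rank reports its hostname and rank 0 gathers
    # the results. Assumes the mpi4py package and an MPI library are installed
    # on the cluster, which is not guaranteed everywhere.
    import socket
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Gather every rank's hostname on rank 0 to show which nodes took part.
    hosts = comm.gather(socket.gethostname(), root=0)
    if rank == 0:
        print(f"{size} ranks ran on: {sorted(set(hosts))}")

A program like this is normally launched with mpirun or through the job scheduler (see the Slurm example later on this page); the interconnect determines how quickly ranks on different nodes can exchange data.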

Operating System and Deployment Framework

Previously, we standardized on the Rocks cluster distribution as our deployment framework. Rocks is built on top of CentOS, which is derived from Red Hat Enterprise Linux. While we still support Rocks clusters running CentOS 7, we are in the process of phasing Rocks out.

We currently deploy clusters using our in-house Hydra framework. Hydra is a cluster deployment and configuration management framework developed by SCS Unix Engineering using Ansible. It is designed specifically to fit the needs of SCS HPC cluster users and the systems administrators who maintain these clusters. Its design was heavily influenced by Unix Engineering’s decade of experience building and maintaining Rocks clusters for SCS, and the project was created in response to the extremely long development cycles and lack of maintenance updates for Rocks.

Hydra clusters currently run CentOS 7 or CentOS 8. Hydra significantly improves our ability to manage existing cluster resources and to deploy and customize new resources in a consistent, speedy, and reliable manner. Hydra reduces the downtime required for a cluster upgrade from more than a week to less than a day.
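
Hydra itself is internal to SCS, but the sketch below illustrates the general Ansible-based approach: re-applying a configuration playbook to a single node. The playbook, inventory, and host names are hypothetical and are not Hydra's actual files.

    # Rough illustration of the Ansible-based approach only: re-apply a
    # configuration playbook to a single compute node. The playbook,
    # inventory, and host names are hypothetical, not Hydra's actual files.
    import subprocess

    def reprovision_node(node, playbook="site.yml", inventory="hosts.ini"):
        """Run the configuration playbook against one cluster node."""
        subprocess.run(
            [
                "ansible-playbook",
                "-i", inventory,    # inventory describing the cluster
                "--limit", node,    # restrict the run to a single host
                playbook,
            ],
            check=True,
        )

    if __name__ == "__main__":
        reprovision_node("compute-0-3")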

Job Schedulers

For resource allocation, task distribution, and scheduling within clusters, SCS Unix Engineering supports the Slurm Workload Manager.

Running a job scheduler on the cluster allows tasks to be distributed to nodes continuously as resources become available, permitting better resource utilization. It also allows jobs to continue running even when individual compute nodes are unavailable. Job schedulers also allow fair-share and prioritization policies to be applied across the cluster, so researchers can safely share resources knowing that they can access them when they need to.
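
As an example of interacting with the scheduler, the sketch below submits a single command as a Slurm batch job by calling sbatch. The partition name, GPU request, limits, and the "python train.py" command are placeholders; what is actually available depends on how a given cluster is configured.

    # Sketch: submit a command as a Slurm batch job by calling sbatch.
    # The partition name, GPU request, and limits below are placeholders;
    # what is actually available depends on the cluster's configuration.
    import getpass
    import subprocess

    def submit(command, partition="gpu", gpus=1, cpus=4, time="1:00:00"):
        """Submit a single command as a batch job and return its job ID."""
        result = subprocess.run(
            [
                "sbatch",
                "--parsable",               # print only the job ID
                f"--partition={partition}",
                f"--gres=gpu:{gpus}",       # GPUs live on the premium nodes
                f"--cpus-per-task={cpus}",
                f"--time={time}",
                f"--wrap={command}",        # wrap a shell command in a job
            ],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    if __name__ == "__main__":
        print("submitted job", submit("python train.py"))
        # List this user's queued and running jobs.
        print(subprocess.run(["squeue", "-u", getpass.getuser()],
                             capture_output=True, text=True).stdout)

Once submitted, a job waits in the queue until the scheduler finds free resources that satisfy the request, subject to the cluster's fair-share and priority policies.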

OS / Application Containers

SCS Unix Engineering does not support Docker on the HPC clusters we manage. A number of compatibility and security issues with running Docker in a shared environment preclude us from providing support. We do support Singularity for running containers on HPC clusters.
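
As a brief illustration, the sketch below pulls a Docker Hub image into a Singularity image file and runs a command inside it. The image and command are arbitrary examples; users would typically run these steps on a compute node or inside a scheduled job.

    # Sketch: pull a Docker Hub image into a Singularity image file and run a
    # command inside it. The image and command are arbitrary examples.
    import subprocess

    IMAGE = "ubuntu.sif"

    # Convert a Docker image into a Singularity image file (no root required).
    subprocess.run(["singularity", "pull", IMAGE, "docker://ubuntu:22.04"],
                   check=True)

    # Run a command inside the container as the submitting user.
    subprocess.run(["singularity", "exec", IMAGE, "cat", "/etc/os-release"],
                   check=True)

Because Singularity runs containers as the submitting user rather than as root, it fits the shared, scheduler-managed environment described above.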

Tiered Support Levels

High Priority (cluster infrastructure)

This includes resources that are critical to the operation of the cluster as a whole. This generally means the cluster's head node, login nodes, and storage servers. We work to respond quickly when there are problems with these systems. 

Elevated Priority (distributed storage and scarce or complex resources)

These systems contain scarce resources (persistent data, GPUs, etc) that require higher levels of attention when they experience problems. We prioritize these systems at a higher level when troubleshooting but do not treat the availability of individual nodes in this tier as urgent.

Normal Priority (standard compute resources)

These systems are the regular compute nodes in a cluster. Cluster compute nodes are fungible, interchangeable resources and should not contain any persistent data. Support for these systems is "best effort" as time permits.