HPC Cluster Overview

Ace: Development & Teaching Cluster

Development

The primary function of the Ace cluster is to provide a robust development environment for writing and debugging code, running test simulations, and intensive analysis/visualization of results from Turing.

Good examples of intended use of Ace include:

  • Developing Python code using PyCharm
  • Compiling code using gfortran/gcc/ifort/icc
  • Debugging applications using gdb
  • Writing and testing job submission scripts for use on Turing
  • Running TensorBoard
  • Testing code parallelization and single-node scalability
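
For example, a quick compile-and-debug cycle on one of the development machines might look like the following sketch (the source file and program names are placeholders):

    # compile a small test program with debugging symbols and no optimization
    gcc -g -O0 -o example example.c

    # step through it interactively with the GNU debugger
    gdb ./example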

Development Hardware

DIRECT LOGIN DEVELOPMENT SYSTEMS CURRENTLY OFFLINE FOR RECONFIGURATION

The current configuration of Ace includes three development machines available for direct login:

ace-dev01 and ace-dev02

  • 64 CPU cores with 96 GB of RAM
  • These machines are meant for interactive development and debugging of CPU-based codes and scripts.

ace-viz01

  • 16 CPU cores with 128 GB of RAM
  • 6 NVIDIA K20 GPUs
  • This machine is meant for interactive development and debugging of GPU-based codes and scripts. It also serves as the primary Virtual Network Computing (VNC) machine for remote visualization of graphically intensive applications.

DEVELOPMENT MACHINE FAIR USE POLICY

These machines ARE NOT controlled by a batch manager, and are intended for short (no longer than ~30 minutes) test simulations or visualizations.

Any user processes found to be running for extended periods of time will be killed automatically. Exceptions include GNU Screen sessions and text editors such as Emacs.

If you need to run test simulations for longer than 30 minutes, you can submit these test jobs to the Ace batch queue to be run on the compute nodes. Jobs can be run in the Ace batch queue for up to 24 hours.
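
As a sketch, a longer test could be wrapped in a minimal Slurm batch script (the job name, resource amounts, and executable below are placeholders) and submitted with sbatch:

    #!/bin/bash
    #SBATCH --job-name=ace_test        # placeholder job name
    #SBATCH --time=02:00:00            # 2 hours of walltime
    #SBATCH --ntasks=1                 # a single task
    #SBATCH --cpus-per-task=4          # 4 CPU cores for that task
    #SBATCH --mem=8G                   # 8 GB of memory

    ./my_test_program                  # placeholder executable

The script is then submitted from a login or development node with, for example, sbatch ace_test.sh (the file name is also a placeholder).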

Compute Nodes

There are also currently seven compute nodes available for up to two days of compute time. Compute nodes on both Ace and Turing are managed by the Slurm batch manager. Extensive documentation of the policies in place for compute nodes is provided elsewhere.

The compute nodes on Ace include a variety of resources meant to enable users to prepare for larger research simulations on Turing. This includes some nodes connected with InfiniBand, as well as 24 NVIDIA K20 GPUs.

The most recent configuration of the system is:

Nodes             # CPUs   Memory   Constraints (-C)    GRES
compute-1-01      32       128 GB   K20                 gpu:6
compute[02-09]    40       128 GB   K20, IB, E5-2680    gpu:2
compute[10-11]    80       128 GB   none                none
compute[12-13]    40       128 GB   none                none
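
To illustrate how the Constraints (-C) and GRES columns map onto a job request, a hypothetical interactive session on a K20 node with two GPUs could be started with a standard srun invocation such as:

    # request an interactive shell on a node with the K20 feature and 2 GPUs for 30 minutes
    srun -C K20 --gres=gpu:2 --time=00:30:00 --pty bash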

Ace Slurm Partitions

There are currently two partitions available on Ace for general use: normal and debug. The normal and debug partitions span all available nodes.

General Use Partition Configuration (normal):

  • Maximum of 12 hours of walltime per job submission
  • Maximum of 40 CPU cores per job submission
  • Maximum of 2 GPUs per job submission
  • Maximum of 25 simultaneous job submissions per user

Additional resources can be used through the debug partition, which allows larger test jobs to be run for up to 30 minutes.

Debug Partition Configuration (debug):

  • Maximum of 1 hour of walltime per job submission
  • Maximum of 20 CPU cores per job submission
  • Maximum of 20 simultaneous job submissions per user
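
For example, a hypothetical job script that stays within the debug limits above might contain (the executable name is a placeholder):

    #!/bin/bash
    #SBATCH --partition=debug          # larger, short-lived test jobs
    #SBATCH --time=00:30:00            # well within the debug walltime limit
    #SBATCH --ntasks=20                # up to 20 CPU cores in debug

    srun ./my_parallel_test            # placeholder parallel executable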

Turing: Research Cluster

The primary purpose of Turing is for performing computationally intensive simulations related to graduate and postdoctoral research. Undergraduate research assistants and certain MQP projects are also eligible to use Turing if their computational needs exceed what is available on Ace.

Compute Nodes

Turing is currently composed of 56 compute nodes with a variety of resources. The heterogeneous nature of Turing is intentional, and important to meet the diverse needs of the many computational science projects at WPI.

When users log into Turing, a summary of the most recent configuration of Turing is printed. The login printout always reflects the most up-to-date configuration and may differ from what is listed below.
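
If you want to query the live configuration yourself, a generic Slurm command (not specific to Turing) that lists each node's CPUs, memory, features, and GRES is:

    # one line per node: hostname, CPU count, memory (MB), features/constraints, GRES
    sinfo -N -o "%n %c %m %f %G"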

The most recent configuration of the system is:

Nodes               # CPUs   Memory   Constraints (-C)   GRES
compute-0-[01-04]   20       128 GB   K40, E5-2680       gpu:2
compute-0-[05-24]   20       128 GB   K20, E5-2680       gpu:2
compute-0-[25-26]   40       256 GB   K80, E5-2698       gpu:4
compute-0-27        16       512 GB   P100, E5-2667      gpu:4
compute-0-28        16       512 GB   V100, E5-2667      gpu:4
compute-0-[29-38]   36       256 GB   E5-2695            none
compute-0-[39-46]   48       256 GB   8168               none
compute-0-[47-54]   40       192 GB   6148               none
compute-0-55        40       192 GB   6148, K20          gpu:2
compute-0-56        40       192 GB   6148, P100         gpu:2

All nodes on Turing are connected with FDR InfiniBand.
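
As on Ace, the Constraints (-C) and GRES columns translate directly into Slurm options. For instance, a hypothetical job that needs the V100 node with one GPU could include the directives:

    #SBATCH -C V100                    # constrain the job to the V100 node
    #SBATCH --gres=gpu:1               # request one GPU on that node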

Turing Slurm Partitions

There are currently two partitions available on Turing for general use: short and long. Both partitions span all compute nodes, but the long partition is configured to "float" and will only ever consume 38 nodes. This ensures there are always compute nodes available for jobs to run within 24 hours. There is no debug partition on Turing; all debug and development jobs should be run on Ace whenever possible.

short Partition Configuration:

  • Maximum of 1 day of walltime per job submission
  • Maximum of 8 nodes per job submission
  • Maximum of 2 GPUs per node, per job submission
  • Maximum of 50 simultaneous job submissions per user
  • Maximum of 288 CPUs per user across all job submissions
  • Maximum of 1.8 TB of memory per user across all job submissions

long Partition Configuration:

  • Maximum of 7 days (168 hours) of walltime per job submission
  • Maximum of 4 nodes per job submission
  • Maximum of 2 GPUs per job submission
  • Maximum of 25 simultaneous job submissions per user
  • Maximum of 180 CPUs per user across all job submissions
  • Maximum of 1.25 TB of memory per user across all job submissions
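
Putting the short-partition limits together, a minimal multi-node GPU job script might look like the following sketch (all names and resource amounts are placeholders chosen to stay within the limits above):

    #!/bin/bash
    #SBATCH --partition=short          # 1-day walltime limit
    #SBATCH --time=12:00:00            # 12 hours of walltime
    #SBATCH --nodes=2                  # at most 8 nodes per job in short
    #SBATCH --ntasks-per-node=20       # placeholder task layout
    #SBATCH --gres=gpu:2               # at most 2 GPUs per node

    # launch an MPI program across the allocated nodes (placeholder name)
    srun ./my_mpi_simulation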

TURING RESOURCE USE POLICY

RESOURCE LIMITS ARE IDENTICAL FOR ALL USERS AND HAVE BEEN CONFIGURED TO ALLOW FOR THE HIGHEST THROUGHPUT OF JOBS, WHILE MAINTAINING THE MOST FLEXIBILITY FOR THE VARIED WORKLOADS FROM ALL WPI COMPUTATIONAL SCIENCE RESEARCH AREAS.

In some cases, resource limits can be modified for specific simulations. This requires verifiable proof that both the user and the code base are capable of using the requested resources. Supporting evidence includes, but is not limited to:

  • Scalability curves for greater than 8 compute nodes
  • Proven inability to checkpoint in fewer than 7 days
  • Proof of efficient, multi-GPU algorithm design