HPC Cluster Overview

Ace: Development & Teaching Cluster

Development

The primary function of the Ace cluster is to provide a robust development environment for writing and debugging code, running test simulations, and intensive analysis/visualization of results from Turing.

Good examples of intended use of Ace include:

  • Developing Python code using PyCharm
  • Compiling code using gfortran/gcc/ifort/icc
  • Debugging applications using gdb
  • Writing and testing job submission scripts for use on Turing
  • Running TensorBoard
  • Testing code parallelization and single-node scalability
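
For example, a quick compile-and-debug cycle on one of the development machines might look like the following sketch (the source file and program names are placeholders):

    # compile a small test program with debugging symbols and no optimization
    gcc -g -O0 -o example example.c

    # step through it interactively with the GNU debugger
    gdb ./example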

Development Hardware

DIRECT LOGIN DEVELOPMENT SYSTEMS CURRENTLY OFFLINE FOR RECONFIGURATION

The current configuration of Ace includes three development machines available for direct login:

ace-dev01 and ace-dev02

  • 64 CPU cores with 96 GB of RAM
  • These machines are meant for interactive development and debugging of CPU-based codes and scripts.

ace-viz01

  • 16 CPU cores with 128 GB of RAM
  • 6 NVIDIA K20 GPUs
  • This machine is meant for interactive development and debugging of GPU-based codes and scripts. It also serves as the primary Virtual Network Computing (VNC) machine for remote visualization of graphically intensive applications.

DEVELOPMENT MACHINE FAIR USE POLICY

These machines ARE NOT controlled by a batch manager, and are intended for short (no longer than ~30 minutes) test simulations or visualizations.

Any user processes found to be running for extended periods of time will be killed automatically. Exceptions include GNU Screen sessions and text editors such as Emacs.

If you need to run test simulations for longer than 30 minutes, you can submit these test jobs to the Ace batch queue to be run on the compute nodes. Jobs can be run in the Ace batch queue for up to 24 hours.
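
As a sketch, a longer test could be wrapped in a minimal Slurm batch script (the job name, resource amounts, and executable below are placeholders) and submitted with sbatch:

    #!/bin/bash
    #SBATCH --job-name=ace_test        # placeholder job name
    #SBATCH --time=02:00:00            # 2 hours of walltime
    #SBATCH --ntasks=1                 # a single task
    #SBATCH --cpus-per-task=4          # 4 CPU cores for that task
    #SBATCH --mem=8G                   # 8 GB of memory

    ./my_test_program                  # placeholder executable

The script is then submitted from a login or development node with, for example, sbatch ace_test.sh (the file name is also a placeholder).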

Compute Nodes

There are also currently seven compute nodes available for up to two days of compute time. Compute nodes on both Ace and Turing are managed by the Slurm batch manager. Extensive documentation of the policies in place for compute nodes is provided elsewhere.

The compute nodes on Ace include a variety of resources meant to enable users to prepare for larger research simulations on Turing. This includes some nodes connected with InfiniBand, as well as 24 NVIDIA K20 GPUs.

The most recent configuration of the system is:

Nodes             # CPUs   Memory   Constraints (-C)    GRES
compute-1-01      32       128 GB   K20                 gpu:6
compute[02-09]    40       128 GB   K20, IB, E5-2680    gpu:2
compute[10-11]    80       128 GB   none                none
compute[12-13]    40       128 GB   none                none
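
To illustrate how the Constraints (-C) and GRES columns map onto a job request, a hypothetical interactive session on a K20 node with two GPUs could be started with a standard srun invocation such as:

    # request an interactive shell on a node with the K20 feature and 2 GPUs for 30 minutes
    srun -C K20 --gres=gpu:2 --time=00:30:00 --pty bash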

Ace Slurm Partitions

There are currently two partitions available on Ace for general use: normal and debug. The normal and debug partitions span all available nodes.

General Use Partition Configuration (normal):

  • Maximum of 12 hours of walltime per job submission
  • Maximum of 40 CPU cores per job submission
  • Maximum of 2 GPUs per job submission
  • Maximum of 25 simultaneous job submissions per user

Additional resources can be used through the debug partition, which allows larger test jobs to be run for up to 30 minutes.

Debug Partition Configuration (debug):

  • Maximum of 1 hour of walltime per job submission
  • Maximum of 20 CPU cores per job submission
  • Maximum of 20 simultaneous job submissions per user
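
For example, a hypothetical job script that stays within the debug limits above might contain (the executable name is a placeholder):

    #!/bin/bash
    #SBATCH --partition=debug          # larger, short-lived test jobs
    #SBATCH --time=00:30:00            # well within the debug walltime limit
    #SBATCH --ntasks=20                # up to 20 CPU cores in debug

    srun ./my_parallel_test            # placeholder parallel executable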

Turing: Research Cluster

The primary purpose of Turing is for performing computationally intensive simulations related to graduate and postdoctoral research. Undergraduate research assistants and certain MQP projects are also eligible to use Turing if their computational needs exceed what is available on Ace.

Compute Nodes

Turing is currently composed of 56 compute nodes with a variety of resources. The heterogeneous nature of Turing is intentional, and important to meet the diverse needs of the many computational science projects at WPI.

When users log into Turing, a summary of the most recent configuration of Turing is printed. The login printout always reflects the most up-to-date configuration and may differ from what is listed below.
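
If you want to query the live configuration yourself, a generic Slurm command (not specific to Turing) that lists each node's CPUs, memory, features, and GRES is:

    # one line per node: hostname, CPU count, memory (MB), features/constraints, GRES
    sinfo -N -o "%n %c %m %f %G"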

The most recent configuration of the system is:

Nodes               # CPUs   Memory   Constraints (-C)   GRES
compute-0-[01-04]   20       128 GB   K40, E5-2680       gpu:2
compute-0-[05-24]   20       128 GB   K20, E5-2680       gpu:2
compute-0-[25-26]   40       256 GB   K80, E5-2698       gpu:4
compute-0-27        16       512 GB   P100, E5-2667      gpu:4
compute-0-28        16       512 GB   V100, E5-2667      gpu:4
compute-0-[29-38]   36       256 GB   E5-2695            none
compute-0-[39-46]   48       256 GB   8168               none
compute-0-[47-54]   40       192 GB   6148               none
compute-0-55        40       192 GB   6148, K20          gpu:2
compute-0-56        40       192 GB   6148, P100         gpu:2

All nodes on Turing are connected with FDR InfiniBand.
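
As on Ace, the Constraints (-C) and GRES columns translate directly into Slurm options. For instance, a hypothetical job that needs the V100 node with one GPU could include the directives:

    #SBATCH -C V100                    # constrain the job to the V100 node
    #SBATCH --gres=gpu:1               # request one GPU on that node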

Turing Slurm Partitions

There are currently two partitions available on Turing for general use: short and long. Both partitions span all compute nodes, but the long partition is configured to "float" and will only ever consume 38 nodes. This ensures there are always compute nodes available for jobs to run within 24 hours. There is no debug partition on Turing; all debug and development jobs should be run on Ace whenever possible.

short Partition Configuration:

  • Maximum of 1 day of walltime per job submission
  • Maximum of 8 nodes per job submission
  • Maximum of 2 GPUs per node, per job submission
  • Maximum of 50 simultaneous job submissions per user
  • Maximum of 288 CPUs per user across all job submissions
  • Maximum of 1.8 TB of memory per user across all job submissions

long Partition Configuration:

  • Maximum of 7 days (168 hours) of walltime per job submission
  • Maximum of 4 nodes per job submission
  • Maximum of 2 GPUs per job submission
  • Maximum of 25 simultaneous job submissions per user
  • Maximum of 180 CPUs per user across all job submissions
  • Maximum of 1.25 TB of memory per user across all job submissions
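
Putting the short-partition limits together, a minimal multi-node GPU job script might look like the following sketch (all names and resource amounts are placeholders chosen to stay within the limits above):

    #!/bin/bash
    #SBATCH --partition=short          # 1-day walltime limit
    #SBATCH --time=12:00:00            # 12 hours of walltime
    #SBATCH --nodes=2                  # at most 8 nodes per job in short
    #SBATCH --ntasks-per-node=20       # placeholder task layout
    #SBATCH --gres=gpu:2               # at most 2 GPUs per node

    # launch an MPI program across the allocated nodes (placeholder name)
    srun ./my_mpi_simulation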

TURING RESOURCE USE POLICY

RESOURCE LIMITS ARE IDENTICAL FOR ALL USERS AND HAVE BEEN CONFIGURED TO ALLOW FOR THE HIGHEST THROUGHPUT OF JOBS, WHILE MAINTAINING THE MOST FLEXIBILITY FOR THE VARIED WORKLOADS FROM ALL WPI COMPUTATIONAL SCIENCE RESEARCH AREAS.

In some cases, resource limits can be modified for specific simulations. This requires verifiable proof that both the user and the code base are capable of using the requested resources. Supporting evidence includes, but is not limited to:

  • Scalability curves for greater than 8 compute nodes
  • Proven inability to checkpoint in fewer than 7 days
  • Proof of efficient, multi-GPU algorithm design