Properties of High Performance Computing
As opposed to using a single lab computer to process your data, you can connect to the Turing HPC cluster. The cluster is built from many nodes serving a variety of purposes. When you send a job to the cluster, it is often completed using multiple physical computers in the cluster. Jobs are processed in a “wide” manner rather than a “tall” one: many large jobs run simultaneously and finish quickly, instead of small jobs that each take a long time to complete. The cluster is therefore highly recommended for operating on large quantities of data (for example, over 100 GB), and is sometimes the only viable hardware for completing a job (think 10 TB of data). Turing’s hardware specifications can be found here.
Many users across WPI’s campus and departments submit jobs to Turing. To ensure everyone’s jobs are completed as efficiently as possible, a scheduler organizes how jobs are executed. To interface with Turing, you set up a job script that tells the scheduler what resources are needed, and Turing determines how to execute it among all the currently running jobs. Users connect to Turing through the login node and send their job scripts to the scheduler, which runs them automatically from the queue.
Node Types in Turing
Turing is composed of three node types:
- Head node: Manages running everything; inaccessible to users
- Login node: Designed for users to login, manipulate files, prepare jobs, etc. and not for running jobs
- Compute node: Runs the scheduled jobs through job allocation, therefore cannot be used directly
The scheduler switches between these nodes as necessary; whenever possible, you should not choose your own node.
Turing uses the Simple Linux Utility for Resource Management (SLURM) scheduler. SLURM aggregates the list of jobs that need to be run, knows what resources are available, and then coordinates what runs where. It handles starting the jobs by running their scripts.
How to Access Turing
To gain access to Turing, you must first request an account by filling out the Turing Account Request form. Once you receive confirmation of your account creation, you are ready to connect to Turing.
To actually connect to Turing, log in using ssh to turing.wpi.edu with your WPI username and password. For example, open a terminal window/command prompt and type the following, replacing gompei with your WPI username:
ssh gompei@turing.wpi.edu
Your terminal prompt should now show something like login-01. Now you are connected to Turing!
If you would like to exit your ssh connection to Turing, type exit and press enter. You will see
Connection to turing.wpi.edu closed.
and you will be returned to control of your local terminal window.
Preparing a Job to Run on Turing
As Turing is organized using the SLURM scheduler, for each job you submit you must write a SLURM job script. There are a few approaches you can take when writing your job script, based on the task you seek to complete:
- Make a job script that accepts parameters, so you can submit the same file multiple times with different parameter values
- Generate a set of job scripts using another process such as an additional script, Python, etc. and then submit all of them
- Note: there is a limited number of available submit slots to avoid overloading the queue, so be reasonable.
- Package a set of smaller tasks into a single job script that loops over each of them
- Use a job script wrapper that handles the details for you (sabaqus, serialbox, etc.).
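As a sketch of the script-generation approach, the short shell script below writes one job script per input. The dataset names and my_script_name.py are hypothetical placeholders, and the sbatch call is left commented out so the sketch runs anywhere:

```shell
#!/bin/bash
# Sketch: generate one SLURM job script per dataset, then (on Turing)
# submit each one. Dataset names and my_script_name.py are placeholders.
for dataset in alpha beta gamma; do
  cat > "job_${dataset}.sh" <<EOF
#!/bin/bash
#SBATCH -J "${dataset}"
#SBATCH -p short
#SBATCH -t 01:00:00
python my_script_name.py --dataset ${dataset}
EOF
  # sbatch "job_${dataset}.sh"   # uncomment on Turing to submit
done
```

Each generated file is a complete, submittable SLURM script, so the same template serves any number of parameter values.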
A quick start user guide to SLURM and the corresponding scripts can be found in the Slurm Documentation tab in the left sidebar.
Example SLURM Script
Here we present an example SLURM script for running a job named Example Job:
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 10
#SBATCH --mem=5g
#SBATCH -J "Example Job"
#SBATCH -p short
#SBATCH -t 12:00:00
#SBATCH --gres=gpu:2
#SBATCH -C A100|V100

python my_script_name.py
What does each of these fields mean?
#SBATCH directives are the core of a SLURM script: they tell the scheduler what resources the job needs, and SLURM then runs the job with those resources without requiring user interaction.
-N 1 requests 1 node for the job
-n 10 requests 10 tasks (CPU cores) for the job
--mem=5g specifies that 5 gigabytes of memory are requested for the job, so at least 5 GB of RAM are allocated
-J "Example Job" specifies the job name that will be used to identify the job in the queue
-p short specifies that the job will be run on the queue partition named short, for jobs that should complete quickly
-t 12:00:00 specifies that the job can take a maximum of 12 hours
--gres=gpu:2 specifies that the job will use 2 GPUs per node
-C A100|V100 specifies that the job must run on nodes with NVIDIA A100 or V100 GPUs
python my_script_name.py executes your script (here, a script named my_script_name.py) when the job runs
Running Your Job on Turing
Once you have completed your SLURM script, save it as a .sh file. For example, if we want to run a job with a submission SLURM script named run.sh, we would run the command
sbatch run.sh
in the Turing terminal window. When your job is next up in the scheduler queue, it will be run.
To check running jobs, use the command
squeue -u $USER
which will show the status of only your own jobs.
Using GPUs and CUDA Drivers
If your job involves training a neural network on a large dataset, it is highly recommended that you use CUDA to accelerate the training process. To access the CUDA libraries, first load them in your SLURM script using:
module load cuda
Or, you may wish to use a specific CUDA version or add specific CUDA libraries, which you can load by using the following lines instead:
module load cuda11.6/blas/
module load cuda11.6/fft/
module load cuda11.6/toolkit/
The folder names after cuda11.6 indicate different CUDA 11.6 libraries that you may require to run your job:
cuda11.6/blas/ loads the CUDA Basic Linear Algebra Subroutines library for matrix and vector operations
cuda11.6/fft/ loads the CUDA Fast Fourier Transform library for signal and image processing
cuda11.6/toolkit/ loads the entire CUDA toolkit necessary for using NVIDIA GPUs
To complete the neural network’s training process, you may require specific minimum hardware to ensure the job is handled properly. To do so, add the following line to your SLURM script
#SBATCH -C A100|V100
alongside the other #SBATCH lines to specify that this job must be run on an NVIDIA A100 or V100 GPU.
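Putting these pieces together, a GPU training job script might look like the following sketch (the job name and train.py are hypothetical placeholders):

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -J "train-net"        # hypothetical job name
#SBATCH -p short
#SBATCH -t 12:00:00
#SBATCH --gres=gpu:2          # request 2 GPUs per node
#SBATCH -C A100|V100          # run only on A100 or V100 nodes

module load cuda              # or a specific version, e.g. cuda11.6/toolkit/
python train.py               # hypothetical training script
```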
Academic and Research Software Modules and Virtual Environments
Turing provides access to hundreds of versions of academic and research software, which you can use by loading the corresponding modules.
To view available modules, run the following command in your Turing terminal:
module avail [search]
Once you determine which software modules you would like to use, you can load them using:
module load <module name>
For example, to load MATLAB:
module load matlab
This edits the environment variables in your Turing session so that the loaded modules become the software you use. You can put module load lines after the #SBATCH lines in your SLURM script to ensure your job requests access to the correct software.
If a software module you would like to use is unavailable or you would like access to a newer version that has yet to be installed, please make a request to ARC by sending an email to ARCweb@wpi.edu.
Python Virtual Environments
For Python specifically, you can set up a virtual environment, which acts similarly to a personal module but for using Python packages. In these virtual environments, you can install whichever Python packages you want using pip, and the result exists as a directory that you can load into sessions. See the following example for how you can create the virtual environment myenv:
$ module load python/<preferred version, in this case python3>
$ python3 -m venv myenv
$ source myenv/bin/activate
$ pip install <whichever Python package you want>
Once the virtual environment is created, you can now run the source command to load it into a new session and use your Python packages such as NumPy or SciPy.
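The same steps can be exercised as a runnable sketch, here using a throwaway temp directory and skipping pip (--without-pip) so it works offline; it assumes python3 is already on your PATH (on Turing you would module load it first):

```shell
#!/bin/bash
# Sketch of the venv workflow in a throwaway directory.
tmp=$(mktemp -d)
python3 -m venv --without-pip "$tmp/myenv"   # --without-pip keeps this fast/offline
source "$tmp/myenv/bin/activate"
python -c 'import sys; print(sys.prefix)'    # inside the venv, sys.prefix is the myenv path
deactivate
```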
The main terminal commands used to interact with Turing’s scheduler environment include:
sbatch <job_file> is the primary tool for submitting jobs; it runs asynchronously
- Can take any of the job options to override the option in the job file
- Returns a job ID number which can be used for later reference
srun initiates MPI tasks inside a batch job (if you are using MPI) and runs synchronously
- When running multiple tasks within a larger batch allocation, initiates those tasks
squeue [-u $USER] shows the queued jobs and their status
- The -u option limits what is shown to your own jobs
- States of each job:
- PD: Pending (waiting in queue)
- R: Running
- Reasons for pending:
- (Resources): Job is next in line and just needs resources to free up
- (Priority): Job is not yet at the front of the line
- (JobHeld*): The job has been held and is prevented from starting
- (StartTime): The job’s requested start time has not yet been reached
scancel <job_id> cancels one or more jobs
scontrol show job <job_id> shows detailed information about a job
- If the job is close to starting, may include the time and nodes on which it is scheduled to run
- Useful for double-checking if SLURM understood the specifications you set in your job request
- Useful for figuring out how long your job is going to have to wait
- Can also be used to hold or release a job, or to suspend or resume a job
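A typical session strings these commands together; run.sh and the job ID 12345 below are hypothetical, and these commands only work on a machine running SLURM:

```shell
sbatch run.sh            # submit the job; prints the assigned job ID
squeue -u $USER          # check whether the job is PD (pending) or R (running)
scontrol show job 12345  # inspect the job's requested resources and start time
scancel 12345            # cancel the job if something went wrong
```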
Last Updated on October 25, 2023