Basic Commands

The ARC-managed HPC clusters use a batch manager called Slurm. Extensive documentation for Slurm can be found at http://slurm.schedmd.com/slurm.html

Some of the most basic and useful Slurm commands are:

sinfo   - report the state of partitions and nodes
squeue  - list jobs in the queue
sbatch  - submit a batch job script
scancel - cancel a pending or running job

To see the full documentation for any of these commands (e.g. sinfo), type:

man sinfo

You can run the command `sinfo` to see the available partitions (queues) and compute resources.
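
For more detail, sinfo accepts additional flags; for example, to print a node-oriented, long-format listing:

sinfo -N -l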

All jobs must be run through the batch manager. Any jobs found running on the compute nodes outside of the queueing system will be killed.

Once you have submitted a job, you can check its status with the command:

squeue
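
By default, squeue lists every job in the queue. To show only your own jobs, filter by username:

squeue -u $USER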

To kill a submitted job, type:

scancel JOBID

Here, JOBID is the job ID shown in the output of the squeue command.
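
scancel can also cancel jobs in bulk; for example, to cancel all of your own jobs at once:

scancel -u $USER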


Direct Job Submission

Submission of jobs through a batch manager such as Slurm requires the user to specify the required resources, either on the command line or through directive lines at the top of a job script.

Examples of resources that can be requested are wall time, number of CPUs, number of compute nodes, memory, and number of GPUs per node.
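
These map to Slurm flags such as -t (wall time), -n (number of tasks), -N (number of nodes), --mem (memory per node), and --gres=gpu:N (GPUs per node). For example, a hypothetical request for two nodes with 8 GB of memory and one GPU each (myprog.x is a placeholder program name):

srun -N 2 --mem=8G --gres=gpu:1 -t 1:00:00 ./myprog.x

(The exact GPU flag can vary with the cluster's Slurm version and configuration; --gpus-per-node is a newer alternative to --gres.)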

Running your program directly on a single compute node could look like:

srun -p shared -N 1 -t 12:00 -n 1 ./gompbot.x >> gompbot.out &

This would run your compiled program (e.g. gompbot.x) on a single node (-N 1) in serial (a single task, -n 1) in the shared partition (-p shared), with a wall-time limit of twelve minutes (Slurm reads -t 12:00 as minutes:seconds), appending any output to gompbot.out. The ampersand sends the job to the background so you can continue to work in the terminal or monitor your output.
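
Because the job runs in the background, you can follow the output file as it is written with a standard tail:

tail -f gompbot.out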



Batch Job Submissions

Multiple jobs, or a single job that performs many steps, can be submitted through the use of a batch script.

Instead of passing each resource flag on the command line, the requests go at the top of the script, each on a line preceded by #SBATCH.


#!/bin/bash
# Request one node
#SBATCH -N 1
# Run four tasks on that node
#SBATCH --ntasks-per-node 4
# Submit to the shared partition
#SBATCH -p shared
# Write standard output to gompbot.out
#SBATCH -o gompbot.out
# Request a one-hour wall-time limit
#SBATCH -t 1:00:00

# -l prefixes each output line with its task number
srun -l gompbot.x
srun -l /bin/pwd
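
Assuming the script above is saved as, for example, gompbot.sh (the filename is arbitrary), submit it with:

sbatch gompbot.sh

sbatch prints the ID of the submitted job, which you can then monitor with squeue or cancel with scancel.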

Important: Options supplied on the command line will override any options specified within the script.
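
For example, continuing with the hypothetical gompbot.sh, the following overrides the script's one-hour limit and requests two hours instead:

sbatch -t 2:00:00 gompbot.sh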