Example of How to Run TensorFlow Batch Jobs

To use TensorFlow, first you need to load the module:

module load tensorflow/0.11

This automatically loads the software that TensorFlow depends on (CUDA, cuDNN, etc.).

The CUDA module exports the relevant environment variables (CUDA_ROOT and LD_LIBRARY_PATH) for you when it is loaded.
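
To confirm that the module set up the environment, the two variables named above can be checked from Python. This is only a minimal sketch: the helper name missing_cuda_vars is hypothetical, and it looks at nothing beyond CUDA_ROOT and LD_LIBRARY_PATH.

```python
import os

# Variables the CUDA module is described as exporting.
EXPECTED_VARS = ("CUDA_ROOT", "LD_LIBRARY_PATH")


def missing_cuda_vars(environ=None):
    """Return the expected variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in EXPECTED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_cuda_vars()
    if missing:
        print("Not set (was the module loaded?): " + ", ".join(missing))
    else:
        print("CUDA environment variables are set.")
```

An empty list means both variables are present in the current shell session.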

The second way to submit a job is with a batch script, which is submitted to the queue and runs on a compute node.

Batch jobs are meant for production-ready code and simulations.

You can submit a job to the queue using a shell script (e.g. tensorflow.sh) to call the relevant program, such as:

sbatch tensorflow.sh

The flags for the requested resources are placed at the top of the script like this:


#!/bin/bash
#SBATCH -N 1
#SBATCH -p exclusive
#SBATCH --gres=gpu:2
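
The placement at the top matters: sbatch stops reading #SBATCH lines once it reaches the first command in the script. The sketch below imitates that rule in Python so a script can be checked before it is submitted. The function name sbatch_directives is hypothetical, and this is an approximation for illustration, not the scheduler's actual parser.

```python
def sbatch_directives(script_text):
    """Collect the option strings from the leading #SBATCH lines."""
    options = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#SBATCH"):
            options.append(stripped[len("#SBATCH"):].strip())
        elif stripped and not stripped.startswith("#"):
            # First real command reached: later directives are not read.
            break
    return options


example = """#!/bin/bash
#SBATCH -N 1
#SBATCH -p exclusive
#SBATCH --gres=gpu:2

python tf_bash.py
#SBATCH -t 1:00:00
"""

print(sbatch_directives(example))
```

In this example the -t line is not reported, because it appears after the python command.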

The example shell script below calls a program named tf_bash.py:


#!/bin/bash
#SBATCH -N 1
#SBATCH -p shared
#SBATCH -t 1:00:00
#SBATCH --gres=gpu:1

python tf_bash.py >> tf_test.log

You can also submit a Python program (e.g. tf_direct.py) to the batch manager directly, without the extra shell script. Below is a simple TensorFlow program that can be submitted using the command sbatch tf_direct.py.


#!/usr/bin/env python
#SBATCH -N 1
#SBATCH -p exclusive
#SBATCH -o tf_test.out
#SBATCH -t 1:00:00
#SBATCH --gres=gpu:2

import tensorflow as tf

hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()

print(sess.run(hello))

a = tf.constant(10)
b = tf.constant(32)

print(sess.run(a + b))   

At the top of the program, you will see the resource requests as #SBATCH comments. These can be put either at the top of the job script, or passed to the sbatch command when submitting, e.g.:

sbatch -N 1 -p exclusive -o tf_test.out -t 1:00:00 --gres=gpu:2 tf_direct.py

If you pass them to sbatch on the command line, there is no need to include them at the top of the program.
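
When many jobs are submitted with different resource requests, the command line can be assembled programmatically. The following is a minimal sketch that uses only the sbatch flags shown above; the function name build_sbatch_command is hypothetical, and the actual submission call is left commented out so the example can run anywhere.

```python
import subprocess


def build_sbatch_command(script, nodes, partition, walltime, gpus, output=None):
    """Assemble an sbatch argument list from resource requests."""
    cmd = ["sbatch", "-N", str(nodes), "-p", partition]
    if output:
        cmd += ["-o", output]
    cmd += ["-t", walltime, "--gres=gpu:{}".format(gpus), script]
    return cmd


if __name__ == "__main__":
    cmd = build_sbatch_command("tf_direct.py", 1, "exclusive", "1:00:00", 2,
                               output="tf_test.out")
    print(" ".join(cmd))
    # Uncomment on a machine where sbatch is available:
    # subprocess.check_call(cmd)
```

Passing the arguments as a list avoids shell quoting problems if a file name contains spaces.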


Example of How to Run Abaqus Batch Jobs

To use Abaqus, first you need to load the module:


module load abaqus/2016

Abaqus can run on either a single compute node, or across multiple nodes using MPI. For some large problems, the only way to run a simulation is by decomposing the system into domains for execution on separate compute nodes.

ARC has created an Abaqus job submission program for the cluster that automates resource requests, variable settings, and Abaqus flags.

After loading the Abaqus module, the command sabaqus is available. Users can now execute the command sabaqus --help to see options for the cluster and Abaqus jobs:


gompei@ace:~/work$ sabaqus --help
Usage: sabaqus [options]

Options:
  -h, --help                           show this help message and exit
  -i INPF, --input=INPF                Abaqus input file (INPUT.inp)
  -l LOGF, --logfile=LOGF              Log file for Abaqus output
  -s SCR, --scratch=SCR                Abaqus scratch directory
  -p PART, --partition=PART            Partition (queue) for resource request
  -E ERR, --error=ERR                  Error file for Abaqus
  -t TIME, --walltime=TIME             Job run time (HOURS:MINUTES)
  -m MEM, --memory=MEM                 Abaqus memory (in GB)
  -N NODES, --nodes=NODES              Number of nodes
  -C CONST, --constraint=CONST         Hardware constraint (Slurm Feature)
  -n NTASKS, --ntasks-per-node=NTASKS  Number of tasks per node
  -g NGPUS, --gpus-per-node=NGPUS      Number of GPUs per node

The submission program simplifies job submission, so that users can quickly submit new Abaqus simulations.
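
As an illustration, a sabaqus command line can be put together from the long options listed in the help output. This is a sketch under stated assumptions: the function name build_sabaqus_command, the input file model.inp, and the partition name shared are placeholders, and which options sabaqus requires or how it combines them is not covered here.

```python
def build_sabaqus_command(input_file, partition, walltime, nodes=1,
                          ntasks_per_node=None, gpus_per_node=None,
                          memory_gb=None):
    """Assemble a sabaqus argument list from the documented long options."""
    cmd = [
        "sabaqus",
        "--input=" + input_file,
        "--partition=" + partition,
        "--walltime=" + walltime,  # HOURS:MINUTES, per the help text
        "--nodes=" + str(nodes),
    ]
    # Optional settings are added only when a value is given.
    if ntasks_per_node is not None:
        cmd.append("--ntasks-per-node=" + str(ntasks_per_node))
    if gpus_per_node is not None:
        cmd.append("--gpus-per-node=" + str(gpus_per_node))
    if memory_gb is not None:
        cmd.append("--memory=" + str(memory_gb))
    return cmd


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(build_sabaqus_command("model.inp", "shared", "1:00",
                                         ntasks_per_node=20,
                                         gpus_per_node=2)))
```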

Suggestions or modifications to the sabaqus program are welcome.

If you choose to write your own submission script, below are some important details to keep in mind.

Abaqus requires a couple of special environment variable settings and flags in order to run properly on the cluster.

The first is to unset a Slurm-specific environment variable (a known issue with Abaqus on HPC clusters):


unset SLURM_GTIDS

The second setting is to pass the interactive keyword to the Abaqus command:


abq2016 cpus=40 gpus=2 mp_mode=mpi input=INPUT_FILENAME job=INPUT_FILENAME scratch=/PATH/TO/ABAQUS/SCRATCH memory="80 gb" interactive

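Without the interactive keyword, Abaqus starts the analysis in the background and returns control right away, so the batch script ends and the job exits the queue immediately after starting.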
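
Both settings can also be applied from a Python wrapper instead of a shell script. The sketch below is illustrative only: the function names abaqus_environment and abaqus_command are hypothetical, the scratch path is the same placeholder used above, and the call that would launch Abaqus is commented out.

```python
import os
import subprocess


def abaqus_environment(environ=None):
    """Return a copy of the environment without SLURM_GTIDS."""
    env = dict(os.environ if environ is None else environ)
    env.pop("SLURM_GTIDS", None)  # same effect as 'unset SLURM_GTIDS'
    return env


def abaqus_command(job, scratch, cpus, gpus, memory_gb):
    """Build the abq2016 argument list, ending with 'interactive'."""
    return [
        "abq2016",
        "cpus={}".format(cpus),
        "gpus={}".format(gpus),
        "mp_mode=mpi",
        "input={}".format(job),
        "job={}".format(job),
        "scratch={}".format(scratch),
        "memory={} gb".format(memory_gb),  # one argument, like memory="80 gb"
        "interactive",
    ]


if __name__ == "__main__":
    cmd = abaqus_command("INPUT_FILENAME", "/PATH/TO/ABAQUS/SCRATCH", 40, 2, 80)
    print(cmd)
    # Uncomment inside a batch job where abq2016 is available:
    # subprocess.check_call(cmd, env=abaqus_environment())
```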