Slurm, like many other schedulers, provides the ability to submit a single job that in turn submits many sub jobs, each very similar to the others, yet independent. This feature in Slurm is called a job array.
With a job array, one can submit thousands of jobs with a single submission. The way this works is that each sub job in the array is assigned a number unique within the array, typically taken from a range defined in the batch script or on the command line at submission time. This range can be sequential, or it can have gaps in it:
Type | Example | Explanation |
---|---|---|
Sequential | 1-1000 | Process 1000 jobs with unique IDs 1 through 1000 |
Gapped | 1-10,15 | Process 11 jobs with the unique IDs 1 through 10 and 15 |
Strided | 1-10:2 | Process jobs 1 through 10 in steps of 2: 1, 3, 5, 7, 9 |
Limit Concurrent Jobs | 1-10%1 | Process jobs 1 through 10, but only allow 1 job to run at a time |
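Any of these range forms can be supplied either as an #SBATCH directive in the batch script or directly on the sbatch command line. The lines below are only a sketch; dowork.slurm is the example script developed later on this page:

#SBATCH --array=1-10,15 ## gapped range set inside the batch script
sbatch --array=1-10:2 dowork.slurm ## strided range given at submission time
sbatch --array=1-10%1 dowork.slurm ## only one sub job allowed to run at a time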
Within each sub job that makes up the job array, the number assigned to it is provided in the Slurm environment variable SLURM_ARRAY_TASK_ID. As this is a plain number, it can be used as is when whatever you plan to process is very uniform, such as files following a regular naming convention: foo_1.txt, foo_2.txt, etc.
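For instance, if the inputs really were named foo_1.txt, foo_2.txt, and so on, each sub job could build its input file name directly from the task ID. This is only a sketch; myprog and its -i/-o options are placeholders for your own application:

## inside the batch script: pick the input that matches this sub job's ID
infile=foo_${SLURM_ARRAY_TASK_ID}.txt
./myprog -i ${infile} -o ${infile}.out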
As this is not how most users name their files, job arrays by themselves are limited in utility. To work around this limitation, four additional command-line tools can make a job array accept arbitrary filenames, paths, or even strings.
Commands we will be using:
- ls
- pwd
- sed
- wc
The first thing we need to do is create what we will call a listing file. This file will contain one filename or file path per line. We can use ls to do this, since ls can print a directory's contents with one file per line. Note that in this example we assume the directory contains only the files we are interested in and no subdirectories.
To create the listing file, navigate to the directory containing the files you are interested in, then run the following command to see what would be generated:
ls -1 -d `pwd`/*
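The output might look something like the following; these directory and file names are purely illustrative:

/home/user/project/sample_001.dat
/home/user/project/sample_002.dat
/home/user/project/sample_003.dat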
What should be printed to the screen is the full path to every file in the current directory. If this list looks like what you want to run through your job array, redirect the output to a file:
ls -1 -d `pwd`/* > ~/new_input.lst
This will create a new file in our home directory named "new_input.lst", which will contain one file path per line. Now we need to determine how large an array we will be working with. For this, we can use the word count command (wc), specifically its line-count mode, which counts the number of newline characters it sees in the file:
wc -l ~/new_input.lst
1344 new_input.lst
This tells us that we have 1344 lines in our file, so our job array range would be 1-1344.
For submission, our job script could take the following form:
#!/bin/bash
#SBATCH --job-name=dowork
#SBATCH --array=1-10
#SBATCH --partition=shared
#SBATCH --time=0-01:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=6400                  ## max amount of memory per node you require
#SBATCH --error=dowork-%A_%a.err    ## %A - filled with the job ID, %a - with the array task ID
#SBATCH --output=dowork-%A_%a.out   ## %A - filled with the job ID, %a - with the array task ID

## Useful for remote notification
##SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE,TIME_LIMIT_80
##SBATCH --mail-user=user@test.org

module purge

fid=$(sed -n "$SLURM_ARRAY_TASK_ID"p ~/new_input.lst)
./dowork -i ${fid} -o ${fid}.out
In the above example, if we were to submit the script as is using sbatch, it would only process the first 10 lines of our new_input.lst file. We can either modify the script so that the --array option uses our whole range, 1-1344, or override it at submission time on the command line:
[user@login001 ~]$ sbatch --array=1-1344 dowork.slurm
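If you would rather not look up the line count and type the range by hand, the upper bound can also be taken from the listing file at submission time. This is just a convenience sketch reusing the wc command from earlier:

sbatch --array=1-$(wc -l < ~/new_input.lst) dowork.slurm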
Exploring the dowork.slurm script, we see that we are in fact utilizing the unique ID provided to each sub job, SLURM_ARRAY_TASK_ID, in coordination with the sed command. The sed command seeks to line N of new_input.lst, where N is SLURM_ARRAY_TASK_ID, prints only that line, and the result is saved in the variable named fid. We then use the fid variable when executing our application "dowork": the file path stored in fid is passed as the input (-i), and the output (-o) is written to a new file in the same directory as the input file, with the exact same name except that ".out" is appended.
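To see what the sed command does for a single sub job, you can run it by hand on a login node. For the sub job whose SLURM_ARRAY_TASK_ID is 3, for example, it reduces to printing only line 3 of the listing file:

sed -n "3p" ~/new_input.lst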