Scheduler & Jobs


  1. What is a Slurm account and what is the default Slurm account

    Slurm has the concept of accounts, which are used to allow and deny users access to different partitions, or to allocations if set up as a credit scheduler.  While Mana does not use accounts to set up credit allocations, we do use them to limit which partitions users can access.

    When a user account is created on Mana, it is automatically added to the "uh" account, and this is set as the default account the user will use.  For faculty who purchase or lease nodes, a new account is created, which must be used in order to access the newly created private partition.


  2. How do I set the Slurm account my batch or interactive job should use

    Both srun and sbatch have a parameter that allows users to set the Slurm account a job should be submitted to.  If this is omitted, the job goes to the default account set for the user, which is typically the "uh" account on Mana.

    -A, --account=<account>    Charge resources used by this job to specified account.

    [user@node-0002 ~]$ srun -A private_account -p sandbox -c 2 --mem=6G -t 60 --pty /bin/bash
    -- OR -- 
    [user@node-0002 ~]$ cat job2.slurm 
    #!/bin/bash 
    #SBATCH --job-name=dowork 
    #SBATCH --account=private_account
    #SBATCH --partition=kill-shared 
    #SBATCH --time=0-01:00:00 
    #SBATCH --cpus-per-task=1 
    #SBATCH --mem=6400 ## max amount of memory per node you require 
    #SBATCH --error=dowork-%A_%a.err ## %A - filled with jobid 
    #SBATCH --output=dowork-%A_%a.out ## %A - filled with jobid

    module purge
    ./work
    
    
  3. How do I start an interactive or batch job using Slurm

    Please see the following video for the basic usage of Slurm to submit jobs on Mana.
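
    For quick reference, here is a minimal pair of examples, one interactive and one batch, using the sandbox and kill-shared partitions referenced elsewhere in this FAQ (adjust the partition, time, core, and memory values to fit your work):

    [user@login002 ~]$ srun -p sandbox -c 2 --mem=6G -t 60 --pty /bin/bash
    -- OR --
    [user@login002 ~]$ cat job.slurm
    #!/bin/bash
    #SBATCH --job-name=dowork
    #SBATCH --partition=kill-shared
    #SBATCH --time=0-01:00:00
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=6400 ## max amount of memory per node you require

    module purge
    ./work
    [user@login002 ~]$ sbatch job.slurm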

  4. Which partition should I submit my job to?  What is the difference between each partition

    Please see the following video in which we discuss the different partitions on Mana and outline the different attributes of each partition.
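
    As a quick supplement to the video, Slurm's own sinfo command will list the partitions your account can see, along with their time limits and node states:

    [user@login002 ~]$ sinfo -s          ## summary view, one line per partition
    [user@login002 ~]$ sinfo -p sandbox  ## details for a single partition, e.g. sandbox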

  5. I am trying to run something simple in an interactive session and it keeps dying with messages that the process was killed, out of memory, etc. What am I doing wrong

    When we see reports of this error from users, our first question is whether you requested memory along with your interactive job. By default, most partitions will only provide 512 MB of RAM, regardless of the number of cores requested. Users need to request memory for their interactive job by providing the --mem option. For example:

    srun -p sandbox -c 2 --mem=6G -t 60 --pty /bin/bash

  6. I tried to submit a job using Slurm and it says the command is not found. How do I submit jobs on Mana

    First verify that you are on a login node for the cluster. The login node is accessible at uhhpc.its.hawaii.edu. If you are logged in to one of the DTN nodes (hpc-dtn1.its.hawaii.edu or hpc-dtn2.its.hawaii.edu), note that these machines are designed for data transfer; they only have access to user storage locations and do not have access to the Slurm job scheduler.
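
    A quick way to check where you are and whether the scheduler commands are available (hostname and which are standard shell utilities; the exact hostname you see will differ):

    [user@login002 ~]$ hostname       ## should report a login node, not a DTN node
    [user@login002 ~]$ which sbatch   ## on a login node this prints the path to sbatch; on the DTNs the command will not be found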


  7. Why do the nodes not allow me to access all the cores by default

    You may notice that if you ask for 20 cores per task in a job, the scheduler will place your job on the large memory (lmem) node. The reason for this is that the scheduler now reserves a single core on each node for the daemon processes that run on the node. Simple tests have shown that when the last core on a node is used, the runtime actually increases. This performance hit may not occur for all applications, so you may wish to do your own testing to determine whether using all the cores, or all but one core, on a node is optimal for your jobs.

  8. How do I tell the scheduler to allow me to use all the cores on a node

    If you are testing whether using all the cores is a benefit for your job, Slurm provides a parameter that allows you to override the one-core reservation on the node. A user would need to supply "-S0" or "--core-spec=0" in their srun command or sbatch script and request all the cores on the node for the job. Using the core-spec option places the node into exclusive mode, so this flag should only be used when you want to use all the cores on a node and removed when you are using less than the maximum number of cores on a node. A sketch of a batch script using the override follows.
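
    The partition, core count, and memory values below are placeholders; adjust them to the node type you are targeting:

    #!/bin/bash
    #SBATCH --job-name=dowork
    #SBATCH --partition=kill-shared  ## placeholder; use the partition appropriate for your work
    #SBATCH --core-spec=0            ## override the one-core reservation (node runs in exclusive mode)
    #SBATCH --time=0-01:00:00
    #SBATCH --cpus-per-task=20       ## placeholder; request all the cores available on the target node
    #SBATCH --mem=6400

    module purge
    ./work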


  9. How do I run MPI jobs on the cluster

    Each user home directory should contain a directory named "examples", which holds template submission scripts to use. You may notice that for the MPI examples we have two different scripts: one shows how to set up properly for the QDR InfiniBand (IB) network, while the other is for systems that do not have access to the IB network.
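
    The templates in your "examples" directory are the authoritative versions for Mana; purely as a generic sketch of the overall shape of an MPI batch script (the partition, task count, memory, and program name below are placeholders):

    #!/bin/bash
    #SBATCH --job-name=mpi_job
    #SBATCH --partition=kill-shared   ## placeholder; choose the partition appropriate for your work
    #SBATCH --ntasks=40               ## number of MPI ranks
    #SBATCH --mem-per-cpu=3200        ## placeholder memory per rank, in MB
    #SBATCH --time=0-01:00:00

    module purge
    ## load your MPI module here, as shown in the Mana example scripts
    srun ./my_mpi_program             ## placeholder program name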


  10. My MPI job does something very odd and fails soon after submission. What am I doing wrong

    If you have an MPI job and you are getting odd failures soon after you start it, please check whether you are submitting the job from a compute node via an interactive session or from the login node. If you are submitting it via an interactive session from a compute node, you can first try adding the following parameter to see if it fixes your problem:

    #SBATCH --distribution="*:*:*"

    Setting the distribution parameter as shown returns it to the defaults, instead of using the setting Slurm generated for the interactive session. If this does not fix your submission problem, please email us at UH-HPC-Help@lists.hawaii.edu so we can investigate your problem.


  11. My MPI job submitted via an interactive session is really slow and shows low CPU utilization.  What could be wrong

    Slurm sets several environment variables when a job is initialized, including interactive sessions.  These same environment variables can be used to override the defaults Slurm would otherwise use.  As a result, depending on how the interactive session is started, this can lead to a sub-optimal assignment of CPU cores to your MPI tasks.  To correct this, you can add the following in your job script, before you start the MPI process:

    unset SLURM_CPU_BIND_VERBOSE SLURM_CPU_BIND_LIST SLURM_CPU_BIND_TYPE SLURM_CPU_BIND

    This unsets the offending Slurm environment variables so that Slurm falls back to its usual defaults. A short sketch of where the line sits in a script follows.
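
    In a batch script, the unset goes just before the MPI launch, for example (the program name is a placeholder):

    ## clear the CPU-binding settings inherited from the interactive session
    unset SLURM_CPU_BIND_VERBOSE SLURM_CPU_BIND_LIST SLURM_CPU_BIND_TYPE SLURM_CPU_BIND
    srun ./my_mpi_program   ## placeholder program name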
     

  12. What do I need to do to test an MPI job from within an interactive session

    With newer versions of Slurm (version 20.11), changes have been made that alter the behavior of certain cases when using Slurm.  One of these cases is how srun works with resource allocation.  In the past, Slurm would gladly run a job step (srun) within an interactive session when you requested the same resources.  With the newer version of Slurm, this is no longer the default behavior.  Instead, users need to explicitly instruct Slurm to allow a resource overlap.  To do this, when an srun is run within an interactive session started by srun, one needs to make sure to include the "--overlap" flag on the second srun.

    [user@login002 ~]$ srun -I30 -p sandbox -c 1 -n 5 --mem=6g -t 60 --pty /bin/bash
    [user@node-0002 ~]$ srun --overlap -n ${SLURM_NTASKS} whoami
    
    
  13. I do not want my job to be requeued when preempted in the kill partitions or when the administrators requeue jobs for maintenance.  How do I prevent this

    Slurm provides a flag that can be passed to sbatch on the command line or within the job script which will tell the scheduler not to requeue and restart your job.  The flag you would add is the '--no-requeue' flag.

    --no-requeue Specifies that the batch job should never be requeued under any circumstances. Setting this option will prevent system administrators from being able to restart the job (for example, after a scheduled downtime), recover from a node failure, or be requeued upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.  

    Source: https://slurm.schedmd.com/sbatch.html

    [user@login002 ~]$ sbatch --no-requeue job.slurm
    -- OR --
    [user@node-0002 ~]$ cat job2.slurm
    #!/bin/bash
    #SBATCH --job-name=dowork
    #SBATCH --no-requeue
    #SBATCH --partition=kill-shared
    #SBATCH --time=0-01:00:00
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=6400 ## max amount of memory per node you require
    #SBATCH --error=dowork-%A_%a.err ## %A - filled with jobid
    #SBATCH --output=dowork-%A_%a.out ## %A - filled with jobid
    
    module purge
    ./work