What is it
The Slurm Workload Manager (formerly known as the Simple Linux Utility for Resource Management, or SLURM) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
It provides three key functions:
- allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
- providing a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes, and
- arbitrating contention for resources by managing a queue of pending jobs.
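As a small illustration of the first two functions, a single srun call both requests an allocation and launches a command inside it (a minimal sketch; the resource numbers are arbitrary and assume Slurm is already set up as described below):
# Ask for 1 node, 1 task and 2 CPUs for up to 5 minutes, then run "hostname" on the allocated node
srun --nodes=1 --ntasks=1 --cpus-per-task=2 --time=00:05:00 hostname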
How to do it
Install packages
sudo apt install munge slurm-wlm
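As a quick sanity check that the packages landed (assuming the Ubuntu/Debian slurm-wlm layout used here), both the node daemon and the client tools should report a version:
slurmd -V        # prints the packaged Slurm version
sinfo --version  # the client tools report the same version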
Create configuration files
Print the node's hardware parameters in slurm.conf format:
sudo slurmd -C
NodeName=unit32-xavier CPUs=8 Boards=1 SocketsPerBoard=4 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=15822
The NodeName line goes into the COMPUTE NODES section of slurm.conf below.
You can open /usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html in your browser to generate a configuration file.
Save the generated configuration as /etc/slurm-llnl/slurm.conf:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=unit32-xavier #<YOUR-HOST-NAME>
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/builtin
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStoragePass=/var/run/munge/global.socket.2
ClusterName=unit32-xavier #<YOUR-HOST-NAME>
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
#SlurmdDebug=4
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# COMPUTE NODES
NodeName=unit32-xavier CPUs=8 Boards=1 SocketsPerBoard=4 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=15822
PartitionName=long Nodes=unit32-xavier Default=YES MaxTime=INFINITE State=UP
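If slurmctld or slurmd later refuses to start with permission or missing-directory errors, make sure the spool, state, log and PID directories referenced above exist and are writable by the slurm user. The Ubuntu packages normally create them, so this is only a fallback:
# Create the directories named in slurm.conf and hand them to the slurm user
sudo mkdir -p /var/lib/slurm-llnl/slurmctld /var/lib/slurm-llnl/slurmd /var/log/slurm-llnl /var/run/slurm-llnl
sudo chown -R slurm:slurm /var/lib/slurm-llnl /var/log/slurm-llnl /var/run/slurm-llnl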
Enable and start daemons
Enable and start the manager slurmctld
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
If you see Failed to start Slurm controller daemon, check the cause with sudo slurmctld -Dvvv.
After that you'll probably see something like Slurmctld has been started with "ClusterName=unit32-xavier", but read "testclusternode" from the state files in StateSaveLocation.
In that case, delete the file /var/lib/slurm-llnl/slurmctld/clustername and start slurmctld again.
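The corresponding commands (the path follows the StateSaveLocation set in slurm.conf above):
# Remove the stale cluster name left over from a previous test configuration
sudo rm /var/lib/slurm-llnl/slurmctld/clustername
sudo systemctl restart slurmctld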
Enable and start the agent slurmd
sudo systemctl enable slurmd
sudo systemctl start slurmd
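To confirm both daemons came up, check their systemd status and the log files configured in slurm.conf:
sudo systemctl status slurmctld slurmd --no-pager
sudo tail -n 20 /var/log/slurm-llnl/slurmctld.log /var/log/slurm-llnl/slurmd.log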
Start the agent munge
sudo systemctl start munge
munge -h (or munge --help) shows the help page.
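To verify that munge authentication actually works, the standard self-test is to encode a credential and decode it straight away; the STATUS field should read Success:
munge -n | unmunge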
How to check status
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
long* up infinite 1 idle unit32-xavier
scontrol show node
NodeName=unit32-xavier Arch=aarch64 CoresPerSocket=2
CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=unit32-xavier NodeHostName=unit32-xavier Version=17.11
OS=Linux 4.9.140-tegra #1 SMP PREEMPT Tue Apr 28 14:06:23 PDT 2020
RealMemory=15822 AllocMem=0 FreeMem=12095 Sockets=4 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=long
BootTime=2020-07-10T17:54:58 SlurmdStartTime=2020-07-10T17:55:06
CfgTRES=cpu=8,mem=15822M,billing=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
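If the node ever shows DOWN or DRAINED here instead of IDLE (for example after slurmd stopped uncleanly), it can be returned to service with scontrol; the node name matches this setup:
sudo scontrol update NodeName=unit32-xavier State=RESUME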
Run a test job
Create a test.sh file:
#!/bin/sh
# Sleep for 20 seconds, then print the current time (this becomes the job's output)
sleep 20
date +%T
Make the test.sh file executable:
chmod +x test.sh
Submit the test.sh script:
sbatch test.sh
Check the status with sinfo and squeue:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
long* up infinite 1 alloc unit32-xavier
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4 long test.sh root PD 0:00 1 (Resources)
3 long test.sh root R 0:11 1 unit32-xavier
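For more detail on a single job, including why a pending (PD) job is still waiting, scontrol can print its full record; job ID 4 is the pending job from the squeue output above:
scontrol show job 4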
When the job finishes, its output is written to slurm-<JOB ID>.out in the directory you submitted from:
cat slurm-<JOB ID>.out
09:18:36
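test.sh relies on the partition defaults; the same request can also pin down resources and the output file name with #SBATCH directives inside the script. A sketch with an arbitrary job name, time limit and output pattern:
#!/bin/sh
#SBATCH --job-name=clock        # arbitrary job name shown by squeue
#SBATCH --partition=long        # the partition defined in slurm.conf
#SBATCH --ntasks=1
#SBATCH --time=00:01:00         # wall-clock limit for the job
#SBATCH --output=clock-%j.out   # %j expands to the job ID
sleep 20
date +%T
Submit it with sbatch as before; squeue will then show the job name instead of the script name.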