What is it
The Slurm Workload Manager (formerly known as the Simple Linux Utility for Resource Management, or SLURM) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
It provides three key functions:
- allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
- providing a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes, and
- arbitrating contention for resources by managing a queue of pending jobs.
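As a small illustration of the first two functions, a single srun call both requests an allocation and launches a command inside it (a minimal sketch; the resource numbers are arbitrary and assume Slurm is already set up as described below):
# Ask for 1 node, 1 task and 2 CPUs for up to 5 minutes, then run "hostname" on the allocated node
srun --nodes=1 --ntasks=1 --cpus-per-task=2 --time=00:05:00 hostname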
How to do it
Install packages
sudo apt install munge slurm-wlm
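As a quick sanity check that the packages landed (assuming the Ubuntu/Debian slurm-wlm layout used here), both the node daemon and the client tools should report a version:
slurmd -V        # prints the packaged Slurm version
sinfo --version  # the client tools report the same version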
Create configuration files
Print the node's hardware parameters in slurm.conf format:
sudo slurmd -C
NodeName=unit32-xavier CPUs=8 Boards=1 SocketsPerBoard=4 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=15822
The NodeName line goes into the COMPUTE NODES section of slurm.conf below.
You can open /usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html in your browser to generate a configuration file.
Save the generated configuration as /etc/slurm-llnl/slurm.conf:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=unit32-xavier #<YOUR-HOST-NAME>
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/builtin
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
#AccountingStoragePass=/var/run/munge/global.socket.2
ClusterName=unit32-xavier #<YOUR-HOST-NAME>
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
#SlurmdDebug=4
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# COMPUTE NODES
NodeName=unit32-xavier CPUs=8 Boards=1 SocketsPerBoard=4 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=15822
PartitionName=long Nodes=unit32-xavier Default=YES MaxTime=INFINITE State=UP
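If slurmctld or slurmd later refuses to start with permission or missing-directory errors, make sure the spool, state, log and PID directories referenced above exist and are writable by the slurm user. The Ubuntu packages normally create them, so this is only a fallback:
# Create the directories named in slurm.conf and hand them to the slurm user
sudo mkdir -p /var/lib/slurm-llnl/slurmctld /var/lib/slurm-llnl/slurmd /var/log/slurm-llnl /var/run/slurm-llnl
sudo chown -R slurm:slurm /var/lib/slurm-llnl /var/log/slurm-llnl /var/run/slurm-llnl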
Enable and start daemons
Enable and start the manager slurmctld
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
If you see Failed to start Slurm controller daemon, check the cause with sudo slurmctld -Dvvv.
After that you'll probably see something like Slurmctld has been started with "ClusterName=unit32-xavier", but read "testclusternode" from the state files in StateSaveLocation.
In that case, delete the file /var/lib/slurm-llnl/slurmctld/clustername and start slurmctld again.
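The corresponding commands (the path follows the StateSaveLocation set in slurm.conf above):
# Remove the stale cluster name left over from a previous test configuration
sudo rm /var/lib/slurm-llnl/slurmctld/clustername
sudo systemctl restart slurmctld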
Enable and start the agent slurmd
sudo systemctl enable slurmd
sudo systemctl start slurmd
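To confirm both daemons came up, check their systemd status and the log files configured in slurm.conf:
sudo systemctl status slurmctld slurmd --no-pager
sudo tail -n 20 /var/log/slurm-llnl/slurmctld.log /var/log/slurm-llnl/slurmd.log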
Start the agent munge
sudo systemctl start munge
munge -h (or munge --help) shows the help page.
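To verify that munge authentication actually works, the standard self-test is to encode a credential and decode it straight away; the STATUS field should read Success:
munge -n | unmunge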
How to check status
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
long* up infinite 1 idle unit32-xavier
scontrol show node
NodeName=unit32-xavier Arch=aarch64 CoresPerSocket=2
CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=unit32-xavier NodeHostName=unit32-xavier Version=17.11
OS=Linux 4.9.140-tegra #1 SMP PREEMPT Tue Apr 28 14:06:23 PDT 2020
RealMemory=15822 AllocMem=0 FreeMem=12095 Sockets=4 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=long
BootTime=2020-07-10T17:54:58 SlurmdStartTime=2020-07-10T17:55:06
CfgTRES=cpu=8,mem=15822M,billing=8
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
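If the node ever shows DOWN or DRAINED here instead of IDLE (for example after slurmd stopped uncleanly), it can be returned to service with scontrol; the node name matches this setup:
sudo scontrol update NodeName=unit32-xavier State=RESUME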
Run a test job
Create a test.sh file:
#!/bin/sh
# Sleep for 20 seconds, then print the current time (this becomes the job's output)
sleep 20
date +%T
Make the test.sh file executable:
chmod +x test.sh
Submit the test.sh script:
sbatch test.sh
Check the status with sinfo and squeue:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
long* up infinite 1 alloc unit32-xavier
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4 long test.sh root PD 0:00 1 (Resources)
3 long test.sh root R 0:11 1 unit32-xavier
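For more detail on a single job, including why a pending (PD) job is still waiting, scontrol can print its full record; job ID 4 is the pending job from the squeue output above:
scontrol show job 4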
When the job finishes, its output is written to slurm-<JOB ID>.out in the directory you submitted from:
cat slurm-<JOB ID>.out
09:18:36
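test.sh relies on the partition defaults; the same request can also pin down resources and the output file name with #SBATCH directives inside the script. A sketch with an arbitrary job name, time limit and output pattern:
#!/bin/sh
#SBATCH --job-name=clock        # arbitrary job name shown by squeue
#SBATCH --partition=long        # the partition defined in slurm.conf
#SBATCH --ntasks=1
#SBATCH --time=00:01:00         # wall-clock limit for the job
#SBATCH --output=clock-%j.out   # %j expands to the job ID
sleep 20
date +%T
Submit it with sbatch as before; squeue will then show the job name instead of the script name.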