I am experiencing a significant performance discrepancy when running the same VASP job through the Slurm scheduler compared to running it directly with mpirun. I am hoping for some insights or advice on how to resolve this issue.
System, Slurm, and VASP information:
Code:
$ inxi -CMmS
System:
Host: x13dai-t Kernel: 6.5.0-18-generic arch: x86_64 bits: 64 Desktop: GNOME
v: 42.9 Distro: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Machine:
Type: Unknown System: Supermicro product: Super Server v: 0123456789
serial: 0123456789
Mobo: Supermicro model: X13DAI-T v: 1.01 serial: WM23AS002622
UEFI: American Megatrends LLC. v: 2.1 date: 12/14/2023
Memory:
System RAM: total: 512 GiB available: 503.52 GiB used: 15.5 GiB (3.1%)
Array-1: capacity: 6 TiB note: check slots: 16 modules: 16
EC: Single-bit ECC
Device-1: P1-DIMMA1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-2: P1-DIMMB1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-3: P1-DIMMC1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-4: P1-DIMMD1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-5: P1-DIMME1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-6: P1-DIMMF1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-7: P1-DIMMG1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-8: P1-DIMMH1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-9: P2-DIMMA1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-10: P2-DIMMB1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-11: P2-DIMMC1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-12: P2-DIMMD1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-13: P2-DIMME1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-14: P2-DIMMF1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-15: P2-DIMMG1 type: DDR5 size: 32 GiB speed: 4800 MT/s
Device-16: P2-DIMMH1 type: DDR5 size: 32 GiB speed: 4800 MT/s
CPU:
Info: 2x 48-core model: Intel Xeon Platinum 8488C bits: 64 type: MT MCP SMP
cache: L2: 2x 96 MiB (192 MiB)
Speed (MHz): avg: 876 min/max: 800/3800 cores: 1: 800 2: 800 3: 800 4: 800
5: 800 6: 800 7: 800 8: 800 9: 800 10: 800 11: 800 12: 800 13: 800 14: 800
15: 800 16: 800 17: 800 18: 800 19: 800 20: 800 21: 800 22: 800 23: 800
24: 795 25: 800 26: 800 27: 800 28: 800 29: 800 30: 800 31: 800 32: 800
33: 800 34: 800 35: 800 36: 800 37: 887 38: 800 39: 800 40: 3100 41: 800
42: 800 43: 2222 44: 800 45: 2500 46: 800 47: 800 48: 800 49: 800 50: 800
51: 800 52: 800 53: 800 54: 800 55: 800 56: 800 57: 800 58: 800 59: 800
60: 800 61: 800 62: 800 63: 800 64: 800 65: 800 66: 800 67: 800 68: 800
69: 800 70: 800 71: 800 72: 800 73: 800 74: 800 75: 800 76: 800 77: 800
78: 800 79: 800 80: 800 81: 800 82: 800 83: 800 84: 800 85: 800 86: 800
87: 800 88: 800 89: 800 90: 800 91: 800 92: 800 93: 800 94: 800 95: 800
96: 800 97: 800 98: 800 99: 800 100: 800 101: 800 102: 800 103: 800
104: 800 105: 800 106: 800 107: 800 108: 800 109: 800 110: 800 111: 800
112: 800 113: 800 114: 800 115: 800 116: 2400 117: 800 118: 800 119: 800
120: 800 121: 800 122: 800 123: 800 124: 800 125: 800 126: 800 127: 800
128: 800 129: 800 130: 800 131: 3800 132: 2400 133: 1200 134: 800 135: 800
136: 800 137: 800 138: 800 139: 800 140: 800 141: 800 142: 2500 143: 801
144: 800 145: 800 146: 800 147: 800 148: 800 149: 800 150: 800 151: 800
152: 800 153: 800 154: 800 155: 800 156: 800 157: 800 158: 800 159: 800
160: 800 161: 800 162: 800 163: 800 164: 800 165: 800 166: 800 167: 800
168: 800 169: 800 170: 800 171: 800 172: 800 173: 800 174: 800 175: 1021
176: 800 177: 800 178: 800 179: 800 180: 800 181: 800 182: 1500 183: 800
184: 800 185: 800 186: 800 187: 800 188: 800 189: 800 190: 800 191: 800
192: 800
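(That is, two 48-core Xeon Platinum 8488C sockets: 96 physical cores and 192 hardware threads in total, of which the job below uses 36 MPI ranks.)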
VASP version: 6.4.3
Job Submission Script:
Code:
#!/usr/bin/env bash
#SBATCH -N 1
#SBATCH -D .
#SBATCH --output=%j.out
#SBATCH --error=%j.err
##SBATCH --time=2-00:00:00
#SBATCH --ntasks=36
#SBATCH --mem=64G
echo '#######################################################'
echo "date = $(date)"
echo "hostname = $(hostname -s)"
echo "pwd = $(pwd)"
echo "sbatch = $(which sbatch | xargs realpath -e)"
echo ""
echo "WORK_DIR = $WORK_DIR"
echo "SLURM_SUBMIT_DIR = $SLURM_SUBMIT_DIR"
echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"
echo "SLURM_NTASKS = $SLURM_NTASKS"
echo "SLURM_NTASKS_PER_NODE = $SLURM_NTASKS_PER_NODE"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
echo "SLURM_JOBID = $SLURM_JOBID"
echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
echo "SLURM_NNODES = $SLURM_NNODES"
echo "SLURMTMPDIR = $SLURMTMPDIR"
echo '#######################################################'
echo ""
module purge > /dev/null 2>&1
module load vasp
# VASP is prone to overflowing the default stack limit, hence:
ulimit -s unlimited
# No explicit -n: mpirun picks up the 36-task Slurm allocation on its own.
mpirun vasp_std
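To help pinpoint where the ranks land in each case, here is a diagnostic variant of the launch step I am considering. It assumes Intel MPI from the oneAPI toolchain, and the pinning setting is an untested guess on my part rather than anything verified on this machine:
Code:
# Show which CPUs Slurm actually confines the job to (cgroup mask):
grep Cpus_allowed_list /proc/self/status
# Make each rank report the cores it is pinned to at startup:
export I_MPI_DEBUG=4
# Assumed tuning: pin one rank per physical core, spread across both sockets:
export I_MPI_PIN_PROCESSOR_LIST=allcores:map=scatter
mpirun vasp_std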
When running the job through Slurm:
Code:
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$ grep LOOP OUTCAR
LOOP: cpu time 14.4893: real time 14.5049
LOOP: cpu time 14.3538: real time 14.3621
LOOP: cpu time 14.3870: real time 14.3568
LOOP: cpu time 15.9722: real time 15.9018
LOOP: cpu time 16.4527: real time 16.4370
LOOP: cpu time 16.7918: real time 16.7781
LOOP: cpu time 16.9797: real time 16.9961
LOOP: cpu time 15.9762: real time 16.0124
LOOP: cpu time 16.8835: real time 16.9008
LOOP: cpu time 15.2828: real time 15.2921
LOOP+: cpu time 176.0917: real time 176.0755
For comparison, when running the same job directly with mpirun on the same node:
Code:
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$ module load vasp
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$ module list
Currently Loaded Modules:
1) lmod
2) oneapi/2023.2.0
3) hdf5/1.14.3-oneapi.2023.2.0
4) wannier90/develop-serial-oneapi.2023.2.0
5) dftd4/main-oneapi.2023.2.0
6) vasp/6.4.3-oneapi-oneapi.2023.2.0
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$ mpirun -n 36 vasp_std
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$ grep LOOP OUTCAR
LOOP: cpu time 9.0072: real time 9.0074
LOOP: cpu time 9.0515: real time 9.0524
LOOP: cpu time 9.1896: real time 9.1907
LOOP: cpu time 10.1467: real time 10.1479
LOOP: cpu time 10.2691: real time 10.2705
LOOP: cpu time 10.4330: real time 10.4340
LOOP: cpu time 10.9049: real time 10.9055
LOOP: cpu time 9.9718: real time 9.9714
LOOP: cpu time 10.4511: real time 10.4470
LOOP: cpu time 9.4621: real time 9.4584
LOOP+: cpu time 110.0790: real time 110.0739
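In other words, the total LOOP+ wall time is 176.08 s through Slurm versus 110.07 s with direct mpirun, i.e. the Slurm run is roughly 1.6x (about 60%) slower despite using the same 36 ranks on the same node.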
The attached archive contains the test example used above.
Thank you for your time and assistance.
Best regards,
Zhao