Queries about input and output files, running specific calculations, etc.
#1 by mike_foster (Newbie) » Fri Jan 27, 2023 6:48 pm
Hi,
I have been experiencing large variation in run times when using ML-generated force fields. Rerunning the same job multiple times can sometimes take 2-3 times longer. I'm building with the Intel compilers, MKL, and Intel MPI. My guess is that it has something to do with MPI communication/allocation at run time, but I can't identify any differences between the runs. Any ideas why this happens and/or how to fix it? Thanks for any help.
I'm running short test calculations on 4 nodes, each with 48 CPUs.
Run times for 3 identical runs (Elapsed time (sec) from OUTCAR): 415.773, 613.946, 904.526
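For reference, these values can be collected with a short script; a minimal sketch, assuming the repeat runs sit in directories named run1, run2, run3 and only grepping the "Elapsed time (sec):" line that VASP writes near the end of OUTCAR:
Code:
# Collect the "Elapsed time (sec)" reported at the end of each OUTCAR.
# The run1/run2/run3 directory layout is an assumption for illustration.
import re
from pathlib import Path

for outcar in sorted(Path(".").glob("run*/OUTCAR")):
    text = outcar.read_text(errors="ignore")
    match = re.search(r"Elapsed time \(sec\):\s*([0-9.]+)", text)
    if match:
        print(f"{outcar.parent.name}: {float(match.group(1)):10.3f} s")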
INCAR:
ENCUT = 10
NCORE = 12
ISYM = 0
IBRION = 0
NSW = 2000
POTIM = 2.0
NBLOCK = 10
MDALGO = 3
LANGEVIN_GAMMA = 26
LANGEVIN_GAMMA_L = 10
PMASS = 100
TEBEG = 400
TEEND = 400
ISIF = 3
ML_LMLFF = T
ML_ISTART = 2
ML_WTSIF = 2
RANDOM_SEED = 4786233 0 0
#2 by andreas.singraber (Global Moderator) » Mon Jan 30, 2023 1:08 pm
Hello!
Welcome to the VASP forum! That is indeed a confusing result; there should not be significant variation in the timings. However, it is hard to tell what is going wrong without further information. Could you please provide a complete set of input files according to the forum posting guidelines? The ML_FF file is probably too large to send, but please add the ML_LOGFILE that was created in your last training step (ML_ISTART = 0, 1, or 3). Thank you!
Best,
Andreas Singraber
#3 by mike_foster (Newbie) » Tue Jan 31, 2023 1:08 pm
Thanks for the reply. Attached are the ML_LOGFILE from the last training step and the input/output files from a short run. I have experienced this problem with other systems as well; the issue is general, not specific to this particular system (ML_FF).
#4 by andreas.singraber (Global Moderator) » Mon Feb 06, 2023 5:27 pm
Hello again,
sorry for the delay! Thank you for providing the input and output files. I suspect that the provided ML_ISTART = 2 run does not scale well to such a large number of MPI processes. You are using 192 MPI ranks to handle the workload generated by the 257 atoms in the POSCAR file, so most ranks get only a single atom to work on, and some get two. Because ML force fields are far less computationally demanding than ab initio calculations, the MPI ranks have too little work and spend most of their time waiting for the next communication step. The total run time then depends heavily on the communication speed and on how well the MPI ranks stay synchronized, which is why you see so much variation in the timings.
I would suggest trying a much lower number of MPI processes. Ideally, benchmark how the timings develop starting from a serial run, then 2 cores, 4, 8, and so on, until you find a good compromise between speed and the CPU resources deployed. Please let me know if anything about the parallelization efficiency remains unclear. Also note that the upcoming VASP release 6.4 will come with a major performance gain for the ML prediction mode.
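A minimal benchmark driver could look like the sketch below; the mpirun launcher, the vasp_std binary name, the rank counts, and the inputs/ directory layout are assumptions for illustration, not anything VASP prescribes:
Code:
# Strong-scaling sketch: run the same ML prediction job with increasing
# rank counts and record the wall time of each run.
import shutil
import subprocess
import time
from pathlib import Path

INPUT_DIR = Path("inputs")              # prepared INCAR, POSCAR, POTCAR, KPOINTS, ML_FF
RANK_COUNTS = [1, 2, 4, 8, 16, 24, 48]  # extend according to node/core availability

for nranks in RANK_COUNTS:
    workdir = Path(f"scaling_np{nranks}")
    if workdir.exists():
        shutil.rmtree(workdir)
    shutil.copytree(INPUT_DIR, workdir)

    start = time.perf_counter()
    subprocess.run(["mpirun", "-np", str(nranks), "vasp_std"],
                   cwd=workdir, check=True)
    elapsed = time.perf_counter() - start
    print(f"{nranks:4d} ranks: {elapsed:10.1f} s")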
All the best,
Andreas Singraber
#5 by mike_foster (Newbie) » Thu Feb 09, 2023 1:24 am
Glad to hear that there will be a performance improvement in 6.4. I understand your point regarding the ratio of atoms to CPUs; I should do more testing. I did a few tests in the past and noticed a speed-up with more (maybe too many) CPUs, but then I noticed timing inconsistencies. As I said, I need to do more testing to be sure, but if the new version is coming out soon, I might just wait.
#6 by mike_foster (Newbie) » Tue Feb 28, 2023 9:54 pm
I'm still experiencing run-time variations, now with VASP 6.4.0. I ran 5 calculations at each of several CPU/node counts. I have done this with both ML_MODE = REFIT and REFITFULL and get run-time variations in both cases (REFIT is much faster). The table below is for REFIT mode on a system with 256 atoms running for 5000 steps (ML_OUTBLOCK = 10; ML_OUTPUT_MODE = 0).
Time (sec) for runs 1-5
nodes  cpus   run 1   run 2   run 3   run 4   run 5
  1      12    2529    1607    1663     809     816
  1      24     550     536    1821     541     535
  1      48     352     351     607     355     349
  2      96     202     244     203     203     205
  4     192     227     155     154     153     619
  6     288     118     275     118     187     117
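For what it's worth, a small sketch to summarize a table like this one: median time, the max/min spread among the five repeats, and speedup relative to the 12-core median (the 12-core baseline is itself noisy, so the numbers mainly illustrate the spread):
Code:
# Summarize the repeat timings above: median, spread, and speedup vs. 12 cores.
from statistics import median

timings = {  # cores -> five repeat wall times in seconds (from the table above)
    12:  [2529, 1607, 1663, 809, 816],
    24:  [550, 536, 1821, 541, 535],
    48:  [352, 351, 607, 355, 349],
    96:  [202, 244, 203, 203, 205],
    192: [227, 155, 154, 153, 619],
    288: [118, 275, 118, 187, 117],
}

base = median(timings[12])
print(f"{'cores':>6} {'median(s)':>10} {'max/min':>8} {'speedup':>8}")
for cores, runs in sorted(timings.items()):
    med = median(runs)
    print(f"{cores:>6} {med:>10.0f} {max(runs)/min(runs):>8.2f} {base/med:>8.2f}")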
#7 by mike_foster (Newbie) » Wed Mar 01, 2023 2:17 pm
Attached is an image of the data table; it's hard to read above.
#8 by alex (Hero Member) » Thu Mar 02, 2023 7:54 am
Hello Mike,
is your job alone on the node when you are not using all of the machine's cores? These simulations are memory-heavy, and if you have to compete for memory bandwidth, that could explain some of the delays.
Hth,
alex
#9 by mike_foster (Newbie) » Thu Mar 02, 2023 12:37 pm
Yes, only my job is on the node. It should not be a memory issue; the nodes have 192 GB of memory, and when I logged onto a node during one of the jobs, memory usage was low. If no one else is experiencing this problem, maybe it's related to my VASP build and libraries (Intel compilers and MKL 19.1; Intel MPI 2019). Maybe I should try building with Open MPI.