Hello!
I did quite a bit of testing with the data you provided, looking both at performance and at the agreement between the native VASP code and the LAMMPS integration. Here are my results:
Performance
You mentioned that your simulation on a 128-core AMD EPYC 7763 takes more than 10 min to complete 100 time steps, i.e. the performance there is roughly 6 s/timestep (for 1680 atoms).
At first I did not have the same machine available for testing, so I had to use a smaller 48-core AMD EPYC 7643. However, the architecture and per-core speed are very similar, so the timings can be compared to some extent. The performance on a single core of the AMD EPYC 7643 is about 1.6 s/timestep (depending a bit on the compiler). I gradually increased the number of cores and found that, given the system size of 1680 atoms, it makes little sense to go beyond 12-16 cores. The best result I achieved was roughly 0.17 s/timestep on 16 cores with the following toolchain: GCC 11.2.0 together with OpenMPI 4.1.2 and AOCL 3.1.
Later, I could run the same simulations on a larger machine resembling yours (dual-socket AMD EPYC 7773X = 128 cores). The situation there was similar: using more than 16 cores reduced the parallel efficiency (= actual speedup / ideal speedup) to below 60%.
A little more can be squeezed out with hybrid MPI+OpenMP parallelization, i.e. combining MPI tasks with OpenMP threads. To use it you have to compile all codes with OpenMP support and set up the environment variables of the simulation correctly. Since this is rather cumbersome and the benefits are marginal, I would not recommend hybrid parallelization at this point (we are currently working on improvements in this direction). However, I still provide a table of the timings I obtained with different task/thread splits:
Code:
#                                     NA=1680                         NA=13440
# STEPS                                   100                               20
# ATOMS                                  1680                            13440
# STIME                             2.755E+02                        4.162E+02
# SPERF                             1.640E-03                        1.548E-03
#-------------------------------|--------------------------------|--------------------------------
#  NC    NP    NT               |       PERF  SPEEDUP     PEFF   |       PERF  SPEEDUP     PEFF
#                               |                                |
#-------------------------------|--------------------------------|--------------------------------
    1     1     1                  1.640E-03     1.00   100.00      1.548E-03     1.00   100.00
    8     8     1                  2.498E-04     6.56    82.04      2.136E-04     7.25    90.60
   16     8     2                  1.561E-04    10.50    65.66      1.385E-04    11.18    69.88
   16    16     1                  1.425E-04    11.51    71.91      1.150E-04    13.46    84.12
   32     8     4                  1.069E-04    15.35    47.96      1.020E-04    15.18    47.45
   32    16     2                  8.775E-05    18.69    58.40      7.362E-05    21.03    65.73
   32    32     1                  1.012E-04    16.21    50.65      6.306E-05    24.55    76.73
   64     8     8                  8.422E-05    19.47    30.42      8.378E-05    18.48    28.88
   64    16     4                  6.307E-05    26.00    40.63      5.364E-05    28.87    45.11
   64    32     2                  7.148E-05    22.94    35.85      4.063E-05    38.11    59.55
   64    64     1                  1.089E-04    15.06    23.53      3.460E-05    44.75    69.93
  128     8    16                  7.641E-05    21.46    16.77      8.008E-05    19.34    15.11
  128    16     8                  5.488E-05    29.88    23.34      4.340E-05    35.67    27.87
  128    32     4                  6.471E-05    25.34    19.80      2.945E-05    52.58    41.08
  128    64     2                  1.016E-04    16.14    12.61      2.483E-05    62.36    48.72
  128   128     1                  1.699E-04     9.65     7.54      3.183E-05    48.64    38.00
Here you can see that I tested two different system sizes: the smaller one with 1680 atoms and a larger one with 8x1680 = 13440 atoms. The total simulation time on a single core is listed as STIME (in seconds). To make the numbers comparable I divide by the number of atoms and time steps and call this the "performance", i.e. SPERF = STIME / (ATOMS * STEPS), in seconds per atom and time step. The table then lists the performance (PERF), speedup (SPEEDUP) and parallel efficiency (PEFF) for different combinations of MPI tasks (NP) and OpenMP threads (NT), where the total number of cores used is NC = NP * NT.
One can see that for the smaller system the parallel efficiency already drops below 60% at 32 cores. The important takeaway message is: there is simply not enough work to distribute over even more cores, so the parallelization overhead becomes too large. For the larger system, where there is more work, we still see a decent speedup of almost 45 at 64 cores, i.e. roughly 70% parallel efficiency.
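Just to make the relation between these columns explicit, here is a small sketch (in Python, purely for illustration) that reproduces the speedup and efficiency of the 64-core / 64-task row of the larger system from the raw performance numbers:
Code:
# Values taken from the table above (larger system, NC = 64, NP = 64, NT = 1)
sperf = 1.548e-03   # serial performance SPERF = STIME / (ATOMS * STEPS), in s/(atom*step)
perf  = 3.460e-05   # parallel performance PERF on NC cores, same unit
nc    = 64          # total number of cores used, NC = NP * NT

speedup = sperf / perf            # about 44.7
peff    = 100.0 * speedup / nc    # parallel efficiency in percent, about 70 %

print(f"speedup = {speedup:.2f}, parallel efficiency = {peff:.2f} %")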
Overall, I want to stress the following points:
There is definitely something incorrectly set up in your run with 128 cores; maybe there is an issue with your MPI environment. Did you apply correct core pinning? Do you know whether your machines are configured with 4 or 1 NUMA domains (you could post the output of the lscpu command)? For example, to obtain the 64-core (no threading) numbers for the large system in the table above, I have to launch LAMMPS like this to get the pinning to the NUMA domains right:
Code:
mpirun -np 64 --report-bindings --map-by ppr:32:socket:pe=2 -x OMP_NUM_THREADS=1 -x OMP_PLACES=cores -x OMP_PROC_BIND=close ../../lammps-gcc-11.2.0-aocl/src/lmp_mpi -in lammps.in
Note that this is OpenMPI, not Intel's oneAPI.
As with all parallel codes, the ML prediction does not scale to an arbitrary number of cores. Do not just throw any number of CPUs at the problem; test beforehand whether the parallel efficiency is still acceptable (I would usually accept down to 60-70%). Otherwise you will waste energy and compute resources! I admit that the scaling behaviour of the code is not particularly great at the moment, but we are currently working on further improvements. Note that it does not matter whether the Fortran code or the C++ reimplementation is used; both VASP and LAMMPS show the same typical scaling.
So far I have only mentioned the GCC+OpenMPI+AOCL toolchain. In my tests this combination gave the best results on the AMD EPYC architecture. I also tried toolchains including Intel's oneAPI (all-Intel and GCC+Intel MKL), but the results were worse than those with OpenMPI and the AOCL libraries.
You mentioned that you have already switched to another machine (40-core Intel) and that the scaling is much better there. Which machines are those exactly (CPU, memory)? What timings did you get there? Are they comparable to the ones I reported for AMD EPYC above?
To me it seems that the current system size of 1680 atoms is not an ideal fit for the 128-core EPYC machines you have available. If you increase the system size as I did above (8x1680 = 13440 atoms, e.g. a 2x2x2 replication of the cell), you could properly fill a node and at the same time improve the sampling (larger system => better statistics); see the quick estimate below. Or maybe it is better to stick with the smaller system and continue on the better-suited Intel machines?
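For planning purposes, the performance numbers from the table translate directly back into wall time per MD step if you multiply them by the number of atoms. A quick sketch with two example rows from the table (again just illustrative Python):
Code:
# time per MD step = PERF * number of atoms
perf_small_16cores = 1.425e-04    # 1680 atoms,  NP = 16, NT = 1
perf_large_64cores = 3.460e-05    # 13440 atoms, NP = 64, NT = 1

print(f"1680 atoms on 16 cores:  {perf_small_16cores * 1680:.2f} s/step")
print(f"13440 atoms on 64 cores: {perf_large_64cores * 13440:.2f} s/step")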
Comparison between the Fortran and C++ implementations
Besides the performance testing, I created separate MD trajectories with LAMMPS for your system. I then took individual snapshots from a trajectory and recomputed the energies, forces and stresses with VASP, with and without the C++ implementation. I computed the RMSE and maximum errors between the different sets and found the same behaviour as in previous tests on other systems: in VASP 6.5.1 there is a minor deviation between the Fortran and C++ codes due to slightly different implementations (in the next release these deviations will be reduced further). For your case, the differences in VASP 6.5.1 between the VASP Fortran (ML_LIB = .FALSE.), VASP C++ (ML_LIB = .TRUE.) and LAMMPS C++ codes are shown in this table:
Code:
==============================================================================
epatom.dat [meV/atom]       LAMMPS C++                  VASP Fortran
------------------------------------------------------------------------------
VASP Fortran                1.1E-04 / 1.3E-04           ---
VASP C++                    1.2E-08 / 2.0E-08           1.1E-04 / 1.3E-04
==============================================================================
==============================================================================
forces.dat [meV/A]          LAMMPS C++                  VASP Fortran
------------------------------------------------------------------------------
VASP Fortran                2.8E-03 / 3.8E-02           ---
VASP C++                    1.9E-09 / 1.3E-08           2.8E-03 / 3.8E-02
==============================================================================
==============================================================================
stress.dat [kbar]           LAMMPS C++                  VASP Fortran
------------------------------------------------------------------------------
VASP Fortran                4.1E-05 / 1.7E-04           ---
VASP C++                    4.4E-06 / 1.1E-05           4.1E-05 / 1.7E-04
==============================================================================
The first value is always the RMSE, the second the maximum error. As you can see, the maximum deviation in the forces between LAMMPS and VASP Fortran is 0.038 meV/A, which is roughly a factor of 2500 below your reported ERR of 96.7 meV/A between ab initio and the ML force field. Hence, I conclude that it is extremely unlikely that these differences have anything to do with the diverging outcome of your MD simulations. The origin must lie somewhere else, for example in differently set up thermostats, etc. Did you gain any new insights into this issue?
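In case you want to reproduce this comparison on your side, the error measures are straightforward to compute. Here is a minimal sketch; the file names and the plain-text format are just placeholder assumptions, adapt them to however you export the forces from the two codes:
Code:
import numpy as np

# Placeholder file names: one row per atom, columns = force components (x, y, z),
# same atom order in both files.
f_lammps = np.loadtxt("forces_lammps.dat")   # forces predicted by the LAMMPS C++ code
f_vasp   = np.loadtxt("forces_vasp.dat")     # forces predicted by the VASP Fortran code

diff = f_lammps - f_vasp
rmse = np.sqrt(np.mean(diff**2))             # root-mean-square error over all components
emax = np.max(np.abs(diff))                  # maximum absolute deviation

print(f"RMSE = {rmse:.1e}, MAX = {emax:.1e}")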
All the best,
Andreas Singraber