Hello!
I did quite a bit of testing with the data you provided, looking both at performance and at the agreement between the native VASP code and the LAMMPS integration. Here are my results:
Performance
You mentioned that your simulation on a 128-core AMD EPYC 7763 takes more than 10 min to complete 100 time steps, i.e. the performance there is roughly 6 s/timestep (for 1680 atoms).
At first I did not have the same machine available for testing, so I had to use a smaller 48-core AMD EPYC 7643. However, the architecture and per-core speed are very similar, so the timings can be compared to some extent. The performance on a single core of the AMD EPYC 7643 is about 1.6 s/timestep (depending a bit on the compiler). I gradually increased the number of cores and found that, given the system size of 1680 atoms, it makes little sense to go beyond 12-16 cores. The best result I achieved was roughly 0.17 s/timestep on 16 cores with the following toolchain: GCC 11.2.0 together with OpenMPI 4.1.2 and AOCL 3.1.
Later, I could run the same simulations on a larger machine resembling yours (dual-socket AMD EPYC 7773X = 128 cores). The situation there was similar: using more than 16 cores reduced the parallel efficiency (= actual speedup / ideal speedup) to below 60%.
A little more can be squeezed out with hybrid MPI+OpenMP parallelization, i.e. combining MPI tasks with OpenMP threads. To use it you have to compile all codes with OpenMP support and set up the environment variables of the simulation correctly. Since this is rather cumbersome and the benefits are marginal, I would not recommend hybrid parallelization at this point (we are currently working on improvements in this direction). However, I still provide a table of the timings I obtained with different task/thread splits:
Code:
#                                     NA=1680                         NA=13440
# STEPS                                   100                               20
# ATOMS                                  1680                            13440
# STIME                             2.755E+02                        4.162E+02
# SPERF                             1.640E-03                        1.548E-03
#-------------------------------|--------------------------------|--------------------------------
#  NC    NP    NT               |       PERF  SPEEDUP     PEFF   |       PERF  SPEEDUP     PEFF
#                               |                                |
#-------------------------------|--------------------------------|--------------------------------
    1     1     1                  1.640E-03     1.00   100.00      1.548E-03     1.00   100.00
    8     8     1                  2.498E-04     6.56    82.04      2.136E-04     7.25    90.60
   16     8     2                  1.561E-04    10.50    65.66      1.385E-04    11.18    69.88
   16    16     1                  1.425E-04    11.51    71.91      1.150E-04    13.46    84.12
   32     8     4                  1.069E-04    15.35    47.96      1.020E-04    15.18    47.45
   32    16     2                  8.775E-05    18.69    58.40      7.362E-05    21.03    65.73
   32    32     1                  1.012E-04    16.21    50.65      6.306E-05    24.55    76.73
   64     8     8                  8.422E-05    19.47    30.42      8.378E-05    18.48    28.88
   64    16     4                  6.307E-05    26.00    40.63      5.364E-05    28.87    45.11
   64    32     2                  7.148E-05    22.94    35.85      4.063E-05    38.11    59.55
   64    64     1                  1.089E-04    15.06    23.53      3.460E-05    44.75    69.93
  128     8    16                  7.641E-05    21.46    16.77      8.008E-05    19.34    15.11
  128    16     8                  5.488E-05    29.88    23.34      4.340E-05    35.67    27.87
  128    32     4                  6.471E-05    25.34    19.80      2.945E-05    52.58    41.08
  128    64     2                  1.016E-04    16.14    12.61      2.483E-05    62.36    48.72
  128   128     1                  1.699E-04     9.65     7.54      3.183E-05    48.64    38.00
Here you can see that I tested two different system sizes: the smaller one with 1680 atoms and a larger one with 8x1680 = 13440 atoms. The total simulation time on a single core is listed as STIME (in seconds). To make the numbers comparable I divide by the number of atoms and time steps and call this the "performance", i.e. SPERF = STIME / (ATOMS * STEPS), in seconds per atom and time step. The table then lists the performance (PERF), speedup (SPEEDUP) and parallel efficiency (PEFF) for different combinations of MPI tasks (NP) and OpenMP threads (NT), where the total number of cores used is NC = NP * NT.
One can see that for the smaller system the parallel efficiency already drops below 60% at 32 cores. The important takeaway message is: there is simply not enough work to distribute over even more cores, so the parallelization overhead becomes too large. For the larger system, where there is more work, we still see a decent speedup of almost 45 at 64 cores, i.e. roughly 70% parallel efficiency.
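Just to make the relation between these columns explicit, here is a small sketch (in Python, purely for illustration) that reproduces the speedup and efficiency of the 64-core / 64-task row of the larger system from the raw performance numbers:
Code:
# Values taken from the table above (larger system, NC = 64, NP = 64, NT = 1)
sperf = 1.548e-03   # serial performance SPERF = STIME / (ATOMS * STEPS), in s/(atom*step)
perf  = 3.460e-05   # parallel performance PERF on NC cores, same unit
nc    = 64          # total number of cores used, NC = NP * NT

speedup = sperf / perf            # about 44.7
peff    = 100.0 * speedup / nc    # parallel efficiency in percent, about 70 %

print(f"speedup = {speedup:.2f}, parallel efficiency = {peff:.2f} %")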
Overall, I want to stress the following points:
There is definitely something incorrectly set up in your run with 128 cores; maybe there is an issue with your MPI environment. Did you apply correct core pinning? Do you know whether your machines are configured with 4 or 1 NUMA domains (you could post the output of the lscpu command)? For example, to obtain the 64-core (no threading) numbers for the large system in the table above, I have to launch LAMMPS like this to get the pinning to the NUMA domains right:
Code:
mpirun -np 64 --report-bindings --map-by ppr:32:socket:pe=2 -x OMP_NUM_THREADS=1 -x OMP_PLACES=cores -x OMP_PROC_BIND=close ../../lammps-gcc-11.2.0-aocl/src/lmp_mpi -in lammps.in
Note that this is OpenMPI, not Intel's oneAPI.
As with all parallel codes, the ML prediction does not scale to an arbitrary number of cores. Do not just throw any number of CPUs at the problem; test beforehand whether the parallel efficiency is still acceptable (I would usually accept down to 60-70%). Otherwise you will waste energy and compute resources! I admit that the scaling behaviour of the code is not particularly great at the moment, but we are currently working on further improvements. Note that it does not matter whether the Fortran code or the C++ reimplementation is used; both VASP and LAMMPS show the same typical scaling.
So far I have only mentioned the GCC+OpenMPI+AOCL toolchain. In my tests this combination gave the best results on the AMD EPYC architecture. I also tried toolchains including Intel's oneAPI (all-Intel and GCC+Intel MKL), but the results were worse than those with OpenMPI and the AOCL libraries.
You mentioned that you have already switched to another machine (40-core Intel) and that the scaling is much better there. Which machines are those exactly (CPU, memory)? What timings did you get there? Are they comparable to the ones I reported for AMD EPYC above?
To me it seems that the current system size of 1680 atoms is not an ideal fit for the 128-core EPYC machines you have available. If you increase the system size as I did above (8x1680 = 13440 atoms, e.g. a 2x2x2 replication of the cell), you could properly fill a node and at the same time improve the sampling (larger system => better statistics); see the quick estimate below. Or maybe it is better to stick with the smaller system and continue on the better-suited Intel machines?
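For planning purposes, the performance numbers from the table translate directly back into wall time per MD step if you multiply them by the number of atoms. A quick sketch with two example rows from the table (again just illustrative Python):
Code:
# time per MD step = PERF * number of atoms
perf_small_16cores = 1.425e-04    # 1680 atoms,  NP = 16, NT = 1
perf_large_64cores = 3.460e-05    # 13440 atoms, NP = 64, NT = 1

print(f"1680 atoms on 16 cores:  {perf_small_16cores * 1680:.2f} s/step")
print(f"13440 atoms on 64 cores: {perf_large_64cores * 13440:.2f} s/step")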
Comparison between the Fortran and C++ implementations
Besides the performance testing, I created separate MD trajectories with LAMMPS for your system. I then took individual snapshots from a trajectory and recomputed the energies, forces and stresses with VASP, with and without the C++ implementation. I computed the RMSE and maximum errors between the different sets and found the same behaviour as in previous tests on other systems: in VASP 6.5.1 there is a minor deviation between the Fortran and C++ codes due to slightly different implementations (in the next release these deviations will be reduced further). For your case, the differences in VASP 6.5.1 between the VASP Fortran (ML_LIB = .FALSE.), VASP C++ (ML_LIB = .TRUE.) and LAMMPS C++ codes are shown in this table:
Code:
==============================================================================
epatom.dat [meV/atom]       LAMMPS C++                  VASP Fortran
------------------------------------------------------------------------------
VASP Fortran                1.1E-04 / 1.3E-04           ---
VASP C++                    1.2E-08 / 2.0E-08           1.1E-04 / 1.3E-04
==============================================================================
==============================================================================
forces.dat [meV/A]          LAMMPS C++                  VASP Fortran
------------------------------------------------------------------------------
VASP Fortran                2.8E-03 / 3.8E-02           ---
VASP C++                    1.9E-09 / 1.3E-08           2.8E-03 / 3.8E-02
==============================================================================
==============================================================================
stress.dat [kbar]           LAMMPS C++                  VASP Fortran
------------------------------------------------------------------------------
VASP Fortran                4.1E-05 / 1.7E-04           ---
VASP C++                    4.4E-06 / 1.1E-05           4.1E-05 / 1.7E-04
==============================================================================
The first value is always the RMSE, the second the maximum error. As you can see, the maximum deviation in the forces between LAMMPS and VASP Fortran is 0.038 meV/A, which is roughly a factor of 2500 below your reported ERR of 96.7 meV/A between ab initio and the ML force field. Hence, I conclude that it is extremely unlikely that these differences have anything to do with the diverging outcome of your MD simulations. The origin must lie somewhere else, for example in differently set up thermostats, etc. Did you gain any new insights into this issue?
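In case you want to reproduce this comparison on your side, the error measures are straightforward to compute. Here is a minimal sketch; the file names and the plain-text format are just placeholder assumptions, adapt them to however you export the forces from the two codes:
Code:
import numpy as np

# Placeholder file names: one row per atom, columns = force components (x, y, z),
# same atom order in both files.
f_lammps = np.loadtxt("forces_lammps.dat")   # forces predicted by the LAMMPS C++ code
f_vasp   = np.loadtxt("forces_vasp.dat")     # forces predicted by the VASP Fortran code

diff = f_lammps - f_vasp
rmse = np.sqrt(np.mean(diff**2))             # root-mean-square error over all components
emax = np.max(np.abs(diff))                  # maximum absolute deviation

print(f"RMSE = {rmse:.1e}, MAX = {emax:.1e}")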
All the best,
Andreas Singraber