PLUGINS_STRUCTURE_errors

Problems running VASP: crashes, internal errors, "wrong" results.


thomas_pigeon
Newbie
Posts: 1
Joined: Thu Feb 20, 2025 12:00 pm

PLUGINS_STRUCTURE_errors

#1 Post by thomas_pigeon » Thu Feb 20, 2025 3:34 pm

I compiled VASP 6.5.0 with the Python plugins option, using two different compilers (gcc and fpp); see the attached makefile.include.
I run VASP on a node with two AMD EPYC™ Milan 7763 processors (64 cores, 2.45 GHz, 256 MB cache).
The plugin is only used to change the atomic positions at every step, through Python code that runs Langevin dynamics using an integrator from ASE adapted for the plugin.
Depending on the ML_MODE and ML_LMLFF tags in the INCAR, I obtain two types of errors, with both the gcc and the fpp builds.
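
To make the setup concrete, the kind of per-step position update described above can be sketched as follows. This is a minimal Langevin step in plain Python under stated assumptions: it is not the ASE integrator and does not use the VASP plugin API, and the function name, argument layout and unit conventions are purely illustrative.

```python
import math
import random


def langevin_step(positions, velocities, forces, masses, dt, friction, kT, rng):
    """Advance all atoms by one simplified Langevin step.

    Hypothetical helper for illustration only. Scheme: force kick,
    then exact Ornstein-Uhlenbeck friction/noise update, then drift.
    All quantities are assumed to be in one consistent unit system.
    Returns (new_positions, new_velocities) as lists of [x, y, z].
    """
    c1 = math.exp(-friction * dt)  # velocity damping factor over one step
    new_positions, new_velocities = [], []
    for pos, vel, frc, m in zip(positions, velocities, forces, masses):
        # Noise amplitude chosen so that velocities relax towards kT/m
        sigma = math.sqrt(kT / m * (1.0 - c1 * c1))
        p_new, v_new = [], []
        for x, v, f in zip(pos, vel, frc):
            v = v + dt * f / m                          # kick from the force
            v = c1 * v + sigma * rng.gauss(0.0, 1.0)    # friction + thermal noise
            v_new.append(v)
            p_new.append(x + dt * v)                    # drift with updated velocity
        new_positions.append(p_new)
        new_velocities.append(v_new)
    return new_positions, new_velocities


if __name__ == "__main__":
    rng = random.Random(0)
    pos, vel = langevin_step(
        positions=[[0.0, 0.0, 0.0]],
        velocities=[[0.0, 0.0, 0.0]],
        forces=[[0.1, 0.0, 0.0]],
        masses=[1.0],
        dt=0.5, friction=0.01, kT=0.025, rng=rng,
    )
    print(pos, vel)
```

In the real run the forces would come from VASP at each call and the new positions would be handed back through the plugin; how that exchange is wired up is not shown here.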

With ML_LMLFF=.FALSE., the dynamics (through the plugin) runs for 4500 steps (out of 10 000) and then fails with the following error:

Code:

slurmstepd-topaze1701: error: Detected 1 oom_kill event in StepId=7485027.0. Some of the step tasks have been OOM Killed.
srun: error: topaze1701: task 64: Out Of Memory
slurmstepd-topaze1701: error:  mpi/pmix_v4: _errhandler: topaze1701 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.7485027.0:64]
slurmstepd-topaze1701: error: *** STEP 7485027.0 ON topaze1701 CANCELLED AT 2025-02-19T23:19:21 ***
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
slurmstepd-topaze1701: error:  mpi/pmix_v4: _errhandler: topaze1701 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.7485027.0:0]
+ exit 0

With ML_LMLFF=.TRUE. and ML_MODE = train, I do not obtain any error and can run the dynamics (through the plugin) for 10 000 steps.
In that particular case, ML_CTIFOR was set to a high value so that no DFT calls are made and only force-field evaluations are performed.

With ML_LMLFF=.TRUE. and ML_MODE = run, the VASP execution stops before calling the Python interface but after writing the first energy and forces to the OUTCAR.
I obtain the following error (repeated many times):

Code:

[topaze1150:3973629:0:3973629] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:3973629) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000005aa353 rc_add_()  ???:0
 2 0x00000000004cd81b plugins_mp_plugins_structure_()  ???:0
 3 0x0000000001eff5f1 MAIN__()  ???:0
 4 0x000000000041fba2 main()  ???:0
 5 0x000000000003ad85 __libc_start_main()  ???:0
 6 0x000000000041faae _start()  ???:0
=================================
[topaze1150:3973585:0:3973585] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:3973585) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000584695 map_forward_()  ???:0
 2 0x000000000058921d fftbrc_plan_mpi_()  ???:0
 3 0x000000000058d33b fft3d_mpi_()  ???:0
 4 0x0000000000590e98 fft3d_()  ???:0
 5 0x00000000004cd838 plugins_mp_plugins_structure_()  ???:0
 6 0x0000000001eff5f1 MAIN__()  ???:0
 7 0x000000000041fba2 main()  ???:0
 8 0x000000000003ad85 __libc_start_main()  ???:0
 9 0x000000000041faae _start()  ???:0
=================================
[topaze1150:3973645:0:3973645] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1268000007f)
==== backtrace (tid:3973645) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000584695 map_forward_()  ???:0
 2 0x000000000058921d fftbrc_plan_mpi_()  ???:0
 3 0x000000000058d33b fft3d_mpi_()  ???:0
 4 0x0000000000590e98 fft3d_()  ???:0
 5 0x00000000004cd838 plugins_mp_plugins_structure_()  ???:0
 6 0x0000000001eff5f1 MAIN__()  ???:0
 7 0x000000000041fba2 main()  ???:0
 8 0x000000000003ad85 __libc_start_main()  ???:0
 9 0x000000000041faae _start()  ???:0
=================================
Last edited by manuel_engel1 on Thu Feb 20, 2025 4:34 pm, edited 1 time in total.
Reason: Put errors in code blocks to improve readability

manuel_engel1
Global Moderator
Posts: 188
Joined: Mon May 08, 2023 4:08 pm

Re: PLUGINS_STRUCTURE_errors

#2 Post by manuel_engel1 » Fri Feb 21, 2025 1:26 pm

Hello,

Thank you kindly for the report. After talking with our ML and plugin experts, I am able to come back with a partial answer.

In the case where ML_LMLFF=True and ML_MODE=run, there is indeed a problem as some of the DFT quantities are not allocated. When running with the VASP plugin, these non-allocated quantities are accessed, causing the segmentation fault you see. We are already working on a fix for this issue.

Why the first case runs out of memory is still a bit unclear to me. It might be due to an unrelated bug, or it might be something more benign. This still needs to be investigated.
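
One way to narrow down the out-of-memory case would be to log the memory footprint from the Python side every few steps and check whether it grows steadily before step 4500. The sketch below uses only the standard library; the `MemoryLog` class and the idea of calling `record` from the plugin's per-step function are illustrative assumptions, not part of VASP. It also assumes Linux, where `ru_maxrss` is reported in KiB, and that the Python interpreter runs inside the VASP rank so the number covers the whole process.

```python
import resource


def peak_rss_mib():
    """Peak resident set size of the current process in MiB.

    Assumes Linux, where ru_maxrss is given in KiB.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


class MemoryLog:
    """Hypothetical helper: sample peak RSS every `every` steps."""

    def __init__(self, every=100):
        self.every = every
        self.samples = []  # list of (step, peak RSS in MiB)

    def record(self, step):
        # Only sample on multiples of `every` to keep the overhead negligible
        if step % self.every == 0:
            self.samples.append((step, peak_rss_mib()))

    def growth_mib(self):
        """Difference between the last and the first sample (0.0 if fewer than two)."""
        if len(self.samples) < 2:
            return 0.0
        return self.samples[-1][1] - self.samples[0][1]


if __name__ == "__main__":
    log = MemoryLog(every=2)
    for step in range(10):
        log.record(step)
    for step, mib in log.samples:
        print(f"step {step}: peak RSS {mib:.1f} MiB")
    print(f"growth: {log.growth_mib():.1f} MiB")
```

Because this tracks the peak rather than the current usage, a flat curve suggests no leak in that rank, while a steady climb points to memory that is never released. It does not tell whether the growth originates in the Python code or in the Fortran side.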

Kind regards

Manuel
VASP developer

manuel_engel1
Global Moderator
Posts: 188
Joined: Mon May 08, 2023 4:08 pm

Re: PLUGINS_STRUCTURE_errors

#3 Post by manuel_engel1 » Fri Feb 21, 2025 2:16 pm

We have now started to investigate the issue with ML_LMLFF=False that you described first. We suspect that it could be caused by a memory leak. Could you please tell us exactly what compiler and library versions you used to build VASP?

In particular, we are interested in the exact version numbers of

  • the Fortran compiler

  • the MPI library

  • the HDF5 library (if used)

  • ScaLAPACK/LAPACK

This information would be greatly appreciated.
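
As a convenience, the version strings for the items above could be gathered with a small script like the one below. It is only a sketch: the command names (`gfortran`, `mpirun`, `h5cc`) are assumptions about the toolchain and should be replaced by whatever compiler and wrappers were actually used for the build, ideally run in the same module environment as the job.

```python
import shutil
import subprocess
import sys


def tool_version(cmd):
    """Return the first non-empty output line of `cmd`, or "not found".

    `cmd` is a list such as ["gfortran", "--version"].
    """
    if shutil.which(cmd[0]) is None:
        return "not found"
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some tools print their version to stderr
        text=True,
        check=False,
    )
    for line in result.stdout.splitlines():
        if line.strip():
            return line.strip()
    return "no output"


# Assumed command names; adjust to the toolchain actually used for the build.
COMMANDS = {
    "Fortran compiler": ["gfortran", "--version"],
    "MPI": ["mpirun", "--version"],
    "HDF5": ["h5cc", "-showconfig"],
    "Python": [sys.executable, "--version"],
}

if __name__ == "__main__":
    for name, cmd in COMMANDS.items():
        print(f"{name}: {tool_version(cmd)}")
```

ScaLAPACK and LAPACK usually have no version command of their own, so for those the module name or the library path from makefile.include is the more reliable source.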

Manuel
VASP developer
