Segfault in VASP 5.4.1 when MPI rank count is not 1, 2, 4, or 8
Posted: Wed Oct 28, 2015 8:14 pm
Hi,
I have run into the following strange behavior.
When I run a particular build of VASP 5.4.1 with 1, 2, 4, or 8 MPI ranks, a test job completes fine. With any other rank count, the code segfaults.
Debugging with 3 ranks gives the following stack trace (output from the ranks is interleaved):
[0-2] (mpigdb) bt
[0-2] #0 vhamil (wdes1=Cannot access memory at address 0x16260
[0,2] ) at hamil.F:794
[1]
[0,2] #1 hamil::hamiltmu (wdes1=Cannot access memory at address 0x16260
[1] ) at hamil.F:794
[0,2] ) at hamil.F:794
[1] #1 hamil::hamiltmu (wdes1=Cannot access memory at address 0x16260
[0,2] #2 0x0000000000e13325 in david::eddav (hamiltonian=Cannot access memory at address 0x16260
[1]
[0,2] ) at davidson.F:419
[1] ) at hamil.F:794
[0,2] #3 0x0000000000e3ae43 in elmin (hamiltonian=Cannot access memory at address 0x16260
[1] #2 0x0000000000e13325 in david::eddav (hamiltonian=Cannot access memory at address 0x16260
[0,2] ) at electron.F:418
[1] ) at davidson.F:419
[0,2] #4 0x00000000014c96e3 in vamp () at main.F:2994
[1] #3 0x0000000000e3ae43 in elmin (hamiltonian=Cannot access memory at address 0x16260
[0,2] #5 0x000000000040ba6e in main ()
[1] ) at electron.F:418
[1] #4 0x00000000014c96e3 in vamp () at main.F:2994
[1] #5 0x000000000040ba6e in main ()
It seems like something is wrong in wdes1, but I can't tell what.
Build: Intel MPI 5.0.3, Intel Fortran compiler 15.0.3, MKL 15.3.187, ScaLAPACK enabled
Test case: 4x4x2 Gamma-centered k-point mesh; 832 bands (automatically changed to 834 for the 3-rank case)
INCAR:
ISTART = 0
ICHARG = 2
ENCUT = 300
ISMEAR = 0
SIGMA = 0.01
LMAXMIX = 4
ADDGRID = .TRUE.
PREC = Accurate
NELM = 10
NELMIN = 3
EDIFF = 1E-5
LORBIT = 11
NBANDS = 832
LOPTICS = .TRUE.
LWAVE = .FALSE.
LCHARG = .FALSE.
LREAL = On
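For reference, the band-count adjustment mentioned in the test case (832 changed to 834 at 3 ranks) can be reproduced with a short sketch. It assumes that NBANDS is rounded up to the next multiple of the MPI rank count, which is consistent with the 3-rank case here but is not taken from the VASP source; the helper name `padded_nbands` is hypothetical.

```python
def padded_nbands(nbands: int, nranks: int) -> int:
    """Round nbands up to the next multiple of nranks.

    Assumption: this mirrors how NBANDS is adjusted when NPAR/NCORE
    are left at their defaults; it is not copied from VASP itself.
    """
    return -(-nbands // nranks) * nranks  # ceiling division, then scale back up


if __name__ == "__main__":
    requested = 832  # NBANDS from the INCAR above
    for nranks in range(1, 9):
        padded = padded_nbands(requested, nranks)
        note = "unchanged" if padded == requested else "padded"
        print(f"{nranks} ranks: NBANDS = {padded} ({note})")
```

Under that assumption, among 1 to 8 ranks only 1, 2, 4, and 8 leave NBANDS at 832, which are the same rank counts that run cleanly. Whether the padding is actually related to the segfault is not established.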
Would appreciate any pointers, including what to try next in GDB.
Thanks in advance,
Chris