VASP linear response problem keeps failing. Out of memory?
Posted: Sat Nov 09, 2013 5:55 pm
I'm running linear response problem III on a spin-polarized system of 42 atoms (Fe36N6 supercell). The computation keeps failing without telling me exactly what went wrong.
I suspect that the job is running out of memory. What puzzles me, however, is that this happens after 12 hours or more, and that there is no indication of memory problems in the form of error messages. The job just fails with a segmentation fault. I would have guessed that if the job failed to allocate the memory it requested, it would know that and be able to print it in the output.
I'm running it on my local workstation (32 GB memory) and on a cluster (single node, 8 processors, 32 GB memory).
Any ideas of how to debug this problem?
I'm not specifying parallelization (NPAR), as that's not supported with linear response problems. VASP fails if I try that.
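To test the memory suspicion, one thing I could do is log the resident memory of the VASP ranks while the job runs and see whether it creeps toward 32 GB before the crash. Below is a minimal sketch of such a logger, assuming a Linux machine with /proc and assuming the executable name starts with "vasp" (as in the traceback). The function names and the log file name vasp_mem.log are made up for illustration; nothing here is part of VASP.

```python
import os
import time


def parse_status(text):
    """Return (name, rss_kb) parsed from the contents of /proc/<pid>/status."""
    name, rss_kb = "", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            parts = line.split(None, 1)
            name = parts[1].strip() if len(parts) > 1 else ""
        elif line.startswith("VmRSS:"):
            # Format is "VmRSS:   <number> kB"
            rss_kb = int(line.split()[1])
    return name, rss_kb


def vasp_memory(proc_root="/proc", prefix="vasp"):
    """Map PID -> resident memory in kB for processes whose name starts with prefix."""
    usage = {}
    if not os.path.isdir(proc_root):
        return usage
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # process exited between listing and reading
        if name.startswith(prefix):
            usage[int(entry)] = rss_kb
    return usage


def log_memory(interval=60, logfile="vasp_mem.log"):
    """Append one line per interval with total and per-rank resident memory."""
    while True:
        usage = vasp_memory()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(logfile, "a") as fh:
            fh.write("%s total_kB=%d per_rank=%s\n"
                     % (stamp, sum(usage.values()), usage))
        time.sleep(interval)
```

Calling log_memory() from a second terminal on the same node while the job runs would leave a time series to compare against the moment of the crash. If the total stays well below the installed memory right up to the segmentation fault, running out of RAM is probably not the cause.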
[INCAR]
ISMEAR = 1
VOSKOWN = 1
ISPIN = 2
MAGMOM = 36*3 6*0.5
PREC = HIGH
EDIFF = 1E-05
LCHARG = .FALSE.
LWAVE = .FALSE.
RANDOM_SEED = 1
IBRION = 8
[KPOINTS]
K-Points
0
Auto
45 ! Length
[Console output]
running on 6 total cores
distrk: each k-point on 6 cores, 1 groups
distr: one band on 1 cores, 6 groups
using from now: INCAR
vasp.5.3.3 18Dez12 (build Aug 09 2013 13:42:53) complex
POSCAR found type information on POSCAR Fe N
POSCAR found : 2 types and 42 ions
scaLAPACK will be used
WARNING: for PREC=h ENMAX is automatically increase by 25 %
this was not the case for versions prior to vasp.4.4
WARNING: for PREC=h ENMAX is automatically increase by 25 %
this was not the case for versions prior to vasp.4.4
LDA part: xc-table for Pade appr. of Perdew
generate k-points for: 6 6 5
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ...
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
DAV: 1 0.304592402548E+04 0.30459E+04 -0.12490E+05 26568 0.160E+03
DAV: 2 0.987209509215E+02 -0.29472E+04 -0.27687E+04 26568 0.371E+02
DAV: 3 -0.337402588496E+03 -0.43612E+03 -0.35865E+03 28626 0.169E+02
DAV: 4 -0.389066717707E+03 -0.51664E+02 -0.47665E+02 39024 0.567E+01
DAV: 5 -0.390885138501E+03 -0.18184E+01 -0.17954E+01 35520 0.126E+01 0.729E+01
DAV: 6 -0.357217883508E+03 0.33667E+02 -0.44892E+02 32604 0.985E+01 0.532E+01
DAV: 7 -0.341369795827E+03 0.15848E+02 -0.67836E+01 31938 0.436E+01 0.239E+01
DAV: 8 -0.343553785063E+03 -0.21840E+01 -0.14914E+01 33168 0.782E+00 0.125E+01
DAV: 9 -0.342857664658E+03 0.69612E+00 -0.21096E+00 37692 0.554E+00 0.351E+00
DAV: 10 -0.342966148829E+03 -0.10848E+00 -0.66436E-01 30054 0.247E+00 0.145E+00
DAV: 11 -0.342967212784E+03 -0.10640E-02 -0.70143E-02 34548 0.695E-01 0.491E-01
DAV: 12 -0.342972814370E+03 -0.56016E-02 -0.30386E-02 33516 0.474E-01 0.414E-01
DAV: 13 -0.342972013272E+03 0.80110E-03 -0.10009E-03 34116 0.103E-01 0.245E-01
DAV: 14 -0.342971841916E+03 0.17136E-03 -0.14985E-03 37620 0.752E-02 0.763E-02
DAV: 15 -0.342971872052E+03 -0.30136E-04 -0.15219E-04 27000 0.398E-02 0.347E-02
DAV: 16 -0.342971875239E+03 -0.31864E-05 -0.85503E-06 16662 0.850E-03
1 F= -.34297188E+03 E0= -.34297948E+03 d E =0.228175E-01 mag= 87.5534
Linear response reoptimize wavefunctions to high precision
DAV: 1 -0.342971877135E+03 -0.18958E-05 -0.61269E-06 36432 0.698E-03
DAV: 2 -0.342971877147E+03 -0.12173E-07 -0.12102E-07 26946 0.136E-03
DAV: 3 -0.342971877147E+03 -0.18190E-09 -0.10727E-09 14460 0.995E-05
Linear response DOF= 4
Linear response progress:
Degree of freedom: 1/ 4
generate k-points for: 6 6 5
N E dE d eps ncg rms rms(c)
RMM: 1 -0.171116802305E+00 -0.17112E+00 -0.12199E-01172816 0.754E-01
RMM: 2 -0.164793427051E+00 0.63234E-02 -0.41623E-03 99925 0.281E-01 0.829E-01
RMM: 3 -0.169079119482E+00 -0.42857E-02 -0.72983E-03117291 0.252E-01 0.111E+00
RMM: 4 -0.171211599611E+00 -0.21325E-02 -0.85577E-03 93259 0.366E-01 0.122E+00
RMM: 5 -0.164667327575E+00 0.65443E-02 -0.21559E-03 92782 0.191E-01 0.236E-01
RMM: 6 -0.164582709631E+00 0.84618E-04 -0.26977E-04 94502 0.645E-02 0.141E-01
RMM: 7 -0.164622748492E+00 -0.40039E-04 -0.89592E-05 99835 0.384E-02 0.136E-01
RMM: 8 -0.164585936436E+00 0.36812E-04 -0.21273E-06 96321 0.310E-02 0.488E-02
RMM: 9 -0.164633815294E+00 -0.47879E-04 0.43418E-05110349 0.151E-02 0.511E-02
RMM: 10 -0.164632097392E+00 0.17179E-05 0.55945E-05 98532 0.105E-02 0.128E-02
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
vasp 0000000000E62448 Unknown Unknown Unknown
vasp 000000000113FBD7 Unknown Unknown Unknown
vasp 0000000000473791 Unknown Unknown Unknown
vasp 00000000004420DC Unknown Unknown Unknown
libc.so.6 00002B6BA955AEAD Unknown Unknown Unknown
vasp 0000000000441FB9 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 13725 on node wheezy2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------