Memory issues for larger systems
Posted: Thu Nov 01, 2012 2:58 am
Hello,
I run my VASP jobs on a node with four AMD Opteron processors of 16 cores each (4 x 16 = 64 cores in total) and 64 GB of memory.
Most calculations run fine on this node, but recently I had to run a larger system, a gold surface with nearly 400 Au atoms. The job started crashing after a certain number of SCF steps, before completing the first optimization cycle.
From the UNIX log file I figured out that the job runs out of memory after some time.
I ran the same job several times in parallel, using 25, 36, and 48 cores, and monitored how the code used memory in each case (a minimal sketch of the script I used to add this up follows the numbers below).
What I noticed is that each core uses only about 1 GB of RAM, irrespective of how many cores I pick and how much free memory the node has.
For example,
If I run the job with 25 cores, it uses nearly 24-26 GB.
If I run the job with 36 cores, it uses nearly 35-37 GB.
If I run the job with 48 cores, it uses nearly 46-48 GB.
...
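This is the minimal sketch I mean for adding up memory use across the ranks while the job is running. It is my own helper, not part of VASP; it assumes a Linux /proc filesystem and that the executable is called "vasp" (adjust the name to your binary):

import os

def total_rss_gb(name="vasp"):
    """Sum VmRSS over all /proc/<pid>/ entries whose command name matches `name`."""
    total_kb = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/comm" % pid) as f:
                if name not in f.read():
                    continue
            with open("/proc/%s/status" % pid) as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        total_kb += int(line.split()[1])  # VmRSS is reported in kB
        except (IOError, OSError):
            continue  # process exited while we were reading it
    return total_kb / (1024.0 * 1024.0)

if __name__ == "__main__":
    # e.g. with 36 ranks at roughly 1 GB each this prints about 36 GB
    print("total RSS: %.1f GB" % total_rss_gb())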
Ultimately, my question is:
How can I get VASP to use the free memory available on the node to finish the job, instead of using just 1 GB per core and crashing after some time?
I am attaching the VASP log file below. It only shows that the job crashed with a very generic UNIX error (signal 9), which by itself does not tell you that the job ran out of memory.
I had to look in the UNIX log file to find that the job did run out of memory.
running on 36 nodes
distr: one band on 6 nodes, 6 groups
vasp.5.2.12 11Nov11 complex
POSCAR found type information on POSCAR C
POSCAR found : 1 types and 512 ions
LDA part: xc-table for Ceperly-Alder, standard interpolation
POSCAR, INCAR and KPOINTS ok, starting setup
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
WARNING: small aliasing (wrap around) errors must be expected
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
resort distribution
FFT: planning ...( 14 )
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
RMM: 1 0.317052588125E+05 0.31705E+05 -0.71846E+05 5136 0.126E+03
RMM: 2 0.100362415386E+05 -0.21669E+05 -0.24838E+05 5136 0.307E+02
RMM: 3 0.236375640794E+04 -0.76725E+04 -0.10883E+05 5136 0.245E+02
RMM: 4 -0.240864746685E+04 -0.47724E+04 -0.47982E+04 5136 0.197E+02
RMM: 5 -0.429003846826E+04 -0.18814E+04 -0.16881E+04 5136 0.126E+02
RMM: 6 -0.505042959014E+04 -0.76039E+03 -0.64604E+03 5136 0.985E+01
RMM: 7 -0.532618844136E+04 -0.27576E+03 -0.26137E+03 5136 0.606E+01
RMM: 8 -0.545307140609E+04 -0.12688E+03 -0.11234E+03 5136 0.449E+01
RMM: 9 -0.554774843934E+04 -0.94677E+02 -0.92157E+02 12242 0.271E+01
RMM: 10 -0.555209501201E+04 -0.43466E+01 -0.51133E+01 13531 0.406E+00
RMM: 11 -0.555241417275E+04 -0.31916E+00 -0.19683E+00 12445 0.114E+00
RMM: 12 -0.555244832743E+04 -0.34155E-01 -0.28424E-01 13107 0.285E-01 0.109E+02
RMM: 13 -0.527612620274E+04 0.27632E+03 -0.27108E+02 10277 0.146E+01 0.592E+01
RMM: 14 -0.518459214017E+04 0.91534E+02 -0.41872E+02 10297 0.199E+01 0.874E+00
RMM: 15 -0.518386255369E+04 0.72959E+00 -0.10694E+01 10917 0.406E+00 0.122E+00
RMM: 16 -0.518388532905E+04 -0.22775E-01 -0.13082E+00 11857 0.893E-01 0.102E+00
RMM: 17 -0.518389338419E+04 -0.80551E-02 -0.12588E-01 10309 0.388E-01 0.641E-01
RMM: 18 -0.518387677956E+04 0.16605E-01 -0.42371E-02 10363 0.160E-01 0.270E-01
RMM: 19 -0.518389993534E+04 -0.23156E-01 -0.55966E-02 10280 0.162E-01 0.217E-01
RMM: 20 -0.518390873521E+04 -0.87999E-02 -0.13270E-02 10327 0.103E-01 0.245E-01
RMM: 21 -0.518393224705E+04 -0.23512E-01 -0.24408E-02 10292 0.130E-01 0.136E-01
RMM: 22 -0.518393962678E+04 -0.73797E-02 -0.95333E-04 7987 0.406E-02 0.996E-02
RMM: 23 -0.518395248316E+04 -0.12856E-01 -0.31120E-03 10272 0.510E-02 0.346E-02
RMM: 24 -0.518395844731E+04 -0.59641E-02 -0.56841E-04 7197 0.165E-02 0.260E-02
RMM: 25 -0.518395976503E+04 -0.13177E-02 -0.88745E-05 6256 0.112E-02 0.847E-03
RMM: 26 -0.518395999067E+04 -0.22564E-03 -0.24774E-05 6170 0.628E-03 0.532E-03
RMM: 27 -0.518396020845E+04 -0.21778E-03 -0.13539E-05 5559 0.455E-03 0.270E-03
RMM: 28 -0.518396027706E+04 -0.68612E-04 -0.76265E-06 4584 0.362E-03
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
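For reference, this is roughly how I check the kernel log for the OOM killer after such a crash. Again, only a minimal sketch of my own, assuming dmesg is readable by the user; the exact wording of the kernel messages varies between systems:

import subprocess

def find_oom_messages():
    """Return kernel log lines that look like OOM-killer activity."""
    log = subprocess.check_output(["dmesg"]).decode("utf-8", "replace")
    keywords = ("out of memory", "oom-killer", "killed process")
    return [line for line in log.splitlines()
            if any(k in line.lower() for k in keywords)]

if __name__ == "__main__":
    for line in find_oom_messages():
        print(line)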
Let me know if you have any suggestions.
Thank you in advance.