Race condition issue in large-scale parallel jobs?

d-farrell2

#1 Post by d-farrell2 » Fri Nov 14, 2008 6:15 pm

I was running a scaling study of vasp.4.6.34 (5Dec07, gamma-only) on a BG/P (Blue Gene/P) with a system of 144 ions (110592 plane waves, 240 bands), going from 8 to 1024 processors.

The runs on 8 to 256 processors went fine, with no major issues. At 512 and 1024 processors, however, the code would start to run and then hang until it was killed or timed out.

One odd aspect of the 512-processor case is that the stdout file and the OSZICAR don't match:

stdout:


 POSCAR, INCAR and KPOINTS ok, starting setup
 WARNING: wrap around errors must be expected
 FFT: planning ... 1
 reading WAVECAR
 prediction of wavefunctions initialized - no I/O
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.134616312131E+04   -0.13462E+04   -0.15113E+05   240   0.804E+02
RMM:   2     0.678906973816E+02    0.14141E+04   -0.31181E+04   240   0.241E+02
RMM:   3    -0.373161624624E+03   -0.44105E+03   -0.87906E+03   240   0.156E+02
RMM:   4    -0.665909002304E+03   -0.29275E+03   -0.23943E+03   240   0.871E+01
RMM:   5    -0.724955435073E+03   -0.59046E+02   -0.57775E+02   240   0.439E+01
RMM:   6    -0.737867636598E+03   -0.12912E+02   -0.14283E+02   240   0.218E+01
RMM:   7    -0.740535632918E+03   -0.26680E+01   -0.38981E+01   240   0.117E+01
RMM:   8    -0.741178187819E+03   -0.64255E+00   -0.10515E+01   240   0.598E+00
RMM:   9    -0.741330045880E+03   -0.15186E+00   -0.34026E+00   593   0.345E+00
RMM:  10    -0.741360945514E+03   -0.30900E-01   -0.35481E-01   551   0.671E-01
RMM:  11    -0.741359545649E+03    0.13999E-02   -0.11029E-02   475   0.155E-01
RMM:  12    -0.741359528547E+03    0.17101E-04   -0.97017E-04   461   0.335E-02    0.450E+01
RMM:  13    -0.661558370867E+03    0.79801E+02   -0.28370E+02   512   0.187E+01    0.180E+01
RMM:  14    -0.652731303590E+03    0.88271E+01   -0.16616E+01   539   0.540E+00    0.111E+01
RMM:  15    -0.651277021638E+03    0.14543E+01   -0.53115E+00   482   0.407E+00    0.178E+00
RMM:  16    -0.651156842750E+03    0.12018E+00   -0.97604E-01   523   0.129E+00    0.856E-01
RMM:  17    -0.651150359979E+03    0.64828E-02   -0.62397E-02   505   0.360E-01    0.304E-01
RMM:  18    -0.651164434310E+03   -0.14074E-01   -0.37067E-02   481   0.308E-01    0.288E-01
RMM:  19    -0.651161916093E+03    0.25182E-02   -0.78862E-03   491   0.113E-01    0.473E-02
RMM:  20    -0.651162051035E+03   -0.13494E-03   -0.13765E-03   469   0.518E-02    0.504E-02
RMM:  21    -0.651161956873E+03    0.94162E-04   -0.13006E-04   326   0.177E-02
   1 T=  2000. E= -.61419337E+03 F= -.65116196E+03 E0= -.65116196E+03  EK= 0.36969E+02 SP= 0.00E+00 SK= 0.00E+00
 bond charge predicted
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.651138583236E+03   -0.65114E+03   -0.21660E+00   480   0.311E+00    0.455E-01
RMM:   2    -0.651132473309E+03    0.61099E-02   -0.31310E-02   542   0.367E-01    0.255E-01
RMM:   3    -0.651132118441E+03    0.35487E-03   -0.45908E-03   525   0.127E-01    0.991E-02
RMM:   4    -0.651132065402E+03    0.53039E-04   -0.74368E-04   476   0.401E-02
   2 T=  2000. E= -.61416348E+03 F= -.65113207E+03 E0= -.65113207E+03  EK= 0.36969E+02 SP= 0.00E+00 SK= 0.00E+00


Meanwhile the OSZICAR looked like:


N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.134616312131E+04   -0.13462E+04   -0.15113E+05   240   0.804E+02
RMM:   2     0.678906973816E+02    0.14141E+04   -0.31181E+04   240   0.241E+02
RMM:   3    -0.373161624624E+03   -0.44105E+03   -0.87906E+03   240   0.156E+02
RMM:   4    -0.665909002304E+03   -0.29275E+03   -0.23943E+03   240   0.871E+01
RMM:   5    -0.724955435073E+03   -0.59046E+02   -0.57775E+02   240   0.439E+01
RMM:   6    -0.737867636598E+03   -0.12912E+02   -0.14283E+02   240   0.218E+01
RMM:   7    -0.740535632918E+03   -0.26680E+01   -0.38981E+01   240   0.117E+01
RMM:   8    -0.741178187819E+03   -0.64255E+00   -0.10515E+01   240   0.598E+00
RMM:   9    -0.741330045880E+03   -0.15186E+00   -0.34026E+00   593   0.345E+00
RMM:  10    -0.741360945514E+03   -0.30900E-01   -0.35481E-01   551   0.671E-01
RMM:  11    -0.741359545649E+03    0.13999E-02   -0.11029E-02   475   0.155E-01
RMM:  12    -0.741359528547E+03    0.17101E-04   -0.97017E-04   461   0.335E-02    0.450E+01
RMM:  13    -0.661558370867E+03    0.79801E+02   -0.28370E+02   512   0.187E+01    0.180E+01
RMM:  14    -0.652731303590E+03    0.88271E+01   -0.16616E+01   539   0.540E+00    0.111E+01
RMM:  15    -0.651277021638E+03    0.14543E+01   -0.53115E+00   482   0.407E+00    0.178E+00
RMM:  16    -0.651156842750E+03    0.12018E+00   -0.97604E-01   523   0.129E+00    0.856E-01
RMM:  17    -0.651150359979E+03    0.64828E-02   -0.62397E-02   505   0.360E-01    0.304E-01
RMM:  18    -0.651164434310E+03   -0.14074E-01   -0.37067E-02   481   0.308E-01    0.288E-01
RMM:  19    -0.651161916093E+03    0.25182E-02   -0.78862E-03   491   0.113E-01    0.473E-02
RMM:  20    -0.651162051035E+03   -0.13494E-03   -0.13765E-03   469   0.518E-02    0.504E-02
RMM:  21    -0.651161956873E+03    0.94162E-04   -0.13006E-04   326   0.177E-02
   1 T=  2000. E= -.61419337E+03 F= -.65116196E+03 E0= -.65116196E+03  EK= 0.36969E+02 SP= 0.00E+00 SK= 0.00E+00
       N       E                     dE             d eps       ncg     rms          rms(c)
RMM:   1    -0.651138583236E+03   -0.65114E+03   -0.21660E+00   480   0.311E+00    0.455E-01
RMM:   2    -0.651132473309E+03    0.61099E-02   -0.31310E-02   542   0.367E-01    0.255E-01
RMM:   3    -0.651132118441E+03    0.35487E-03   -0.45908E-03   525   0.127E-01    0.991E-02
RMM:   4    -0.651132065402E+03    0.53039E-04   -0.74368E-04   476   0.401E-02



So it appears possible that some of the processes exited the electronic loop for the 2nd ionic step while the rest did not: stdout shows the summary line for ionic step 2, but the OSZICAR stops after the 4th electronic iteration of that step, and the OUTCAR doesn't show anything past that iteration either.

Any ideas what could be going on here?
Any ideas on what to check to see whether a race condition is showing up? Or is it as simple as spreading the system across too many processors? I see similar behavior with larger systems, so I doubt the latter.
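
To pin down exactly where the two files part ways without comparing them by eye, something like the following could help. It is only a minimal sketch: the function names `iteration_lines` and `first_divergence` are made up for illustration, and it assumes that electronic steps start with `RMM:` or `DAV:` and that ionic-step summary lines look like `1 T= ...`, as in the output above.

```python
from itertools import zip_longest


def iteration_lines(text):
    """Return the electronic-step and ionic-step lines of a log,
    with runs of whitespace collapsed so indentation does not matter."""
    kept = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # Electronic steps, e.g. "RMM:   4   -0.651E+03 ..."
        is_electronic = tokens[0] in ("RMM:", "DAV:")
        # Ionic-step summaries, e.g. "   2 T=  2000. E= ..."
        is_ionic = len(tokens) > 1 and tokens[0].isdigit() and tokens[1] == "T="
        if is_electronic or is_ionic:
            kept.append(" ".join(tokens))
    return kept


def first_divergence(stdout_text, oszicar_text):
    """Return (index, stdout_line, oszicar_line) for the first step line
    that differs, or None if both logs contain the same step lines.
    A missing line on either side is reported as None."""
    a = iteration_lines(stdout_text)
    b = iteration_lines(oszicar_text)
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i, x, y
    return None
```

Called with the contents of the captured stdout file and the OSZICAR, it returns the first step line present in one file but not the other. Lines such as "bond charge predicted" are ignored on purpose, since they only ever go to stdout.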

d-farrell2

#2 Post by d-farrell2 » Mon Nov 17, 2008 2:51 pm

After some further testing, it appears that this issue is related to ScaLAPACK: turning it off in the INCAR seemed to allow the jobs to run, though much, much more slowly.

Has anyone ever run into this? It would be nice to be able to run on more than 256 processors.
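
For anyone wanting to try the same workaround, the INCAR switch is assumed here to be `LSCALAPACK = .FALSE.` (the post above does not name the tag). The helper below is a rough sketch for flipping a tag across many INCAR files in a scaling study; `set_incar_tag` is a hypothetical name, and it assumes one tag per line, so INCARs that pack several tags on one line with `;` would need more care.

```python
def set_incar_tag(incar_text, tag, value):
    """Return INCAR text with `tag` set to `value`.

    An existing entry for the tag is replaced (duplicates are dropped);
    if the tag is absent, a new line is appended at the end.
    Assumes one tag per line.
    """
    out = []
    found = False
    for line in incar_text.splitlines():
        # Ignore trailing comments (# or !) when reading the tag name.
        body = line.split("#", 1)[0].split("!", 1)[0]
        name = body.split("=", 1)[0].strip().upper()
        if "=" in body and name == tag.upper():
            if not found:
                out.append(f"{tag} = {value}")
                found = True
            # Any further entries for the same tag are skipped.
        else:
            out.append(line)
    if not found:
        out.append(f"{tag} = {value}")
    return "\n".join(out) + "\n"
```

For example, `set_incar_tag(text, "LSCALAPACK", ".FALSE.")` leaves every other line of the INCAR untouched.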
