
DFT calculations crashing with "EDDDAV error" when performed on too many cpus

Posted: Wed May 04, 2022 7:02 am
by kdoblhoff
Dear Vasp community,

I am performing relatively small calculations (in terms of ecut and k-points). Since I want to use them for subsequent RPA calculations, which need a considerable amount of memory, I would like to run them on a relatively large number of CPUs (64-128). However, in doing so, I get the following type of error (sometimes in the first iteration, sometimes later):

Code: Select all

--------------------------------------- Iteration      1(   1)  ---------------------------------------


    POTLOK:  cpu time      0.0126: real time      0.0164
    SETDIJ:  cpu time      0.1947: real time      0.1968
 -----------------------------------------------------------------------------
|                                                                             |
|     EEEEEEE  RRRRRR   RRRRRR   OOOOOOO  RRRRRR      ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     EEEEE    RRRRRR   RRRRRR   O     O  RRRRRR       #       #       #      |
|     E        R   R    R   R    O     O  R   R                               |
|     E        R    R   R    R   O     O  R    R      ###     ###     ###     |
|     EEEEEEE  R     R  R     R  OOOOOOO  R     R     ###     ###     ###     |
|                                                                             |
|     Error EDDDAV: Call to ZHEGV failed. Returncode = 7 1 8                  |
|                                                                             |
|       ---->  I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <----       |
|                                                                             |
 -----------------------------------------------------------------------------
This error does not occur when I run on only 32 CPUs, so there is nothing intrinsically wrong with my calculations (the results are also very reasonable in that case). I see this occurring for different systems (bulk, surfaces, ...) and on different builds/machines. Is this normal behavior, and what is its origin? Can it be avoided, other than by reducing the number of CPUs on which I run VASP? Reducing the CPU count is a disadvantage in computational cost, as I need the larger number of CPUs to get the required memory in the subsequent RPA correlation-energy calculations.

Thank you and best regards,
Katharina

Re: DFT calculations crashing with "EDDDAV error" when performed on too many cpus

Posted: Wed May 04, 2022 8:12 am
by andreas.singraber
Dear Katharina,

this is certainly not the expected behavior! Before we can start investigating, could you please attach the input and output files according to the forum posting guidelines? Please add the runs that fail on 64 cores and, if possible, one that finishes successfully (e.g., on 32 cores). It would also be good to know what kind of nodes you are using.

Thank you!

Best,
Andreas Singraber

Re: DFT calculations crashing with "EDDDAV error" when performed on too many cpus

Posted: Wed May 04, 2022 7:27 pm
by kdoblhoff
Dear Andreas Singraber,
Thank you for your reply.
I attach the input and output of a calculation for which I had both a failing and a finishing run lying around. It is probably not the simplest system for which I have seen this issue, but it is what I had at hand. I am aware that the ecut is too low for a "reasonable" calculation and that, geometry-wise, it is only a test system, but I would have expected something that runs on 32 CPUs to also run on 128 CPUs.

The calculations in the directories "crashing" and "finishing" should be identical apart from the fact that one was run on 32 CPUs and the other on 128 CPUs (the crashing one is the one run on 128 CPUs).

The following is a description of the nodes on which I have been running this job. I remember, though, that I have had the same issue for very small bulk calculations on an EPYC machine with 32 CPUs per node.
Node flavor: hcn
Lenovo node type: ThinkSystem SR645
CPU: 2x AMD Rome 7H12
CPU SKU: 64 cores/socket, 2.6 GHz, 280 W
CPUs/node: 128
DIMMs: 16x 64 GiB, 3200 MHz, DDR4
Total memory/node (per CPU): 1 TiB (8 GiB)

Thank you for having a look,
Best regards,
Katharina

Re: DFT calculations crashing with "EDDDAV error" when performed on too many cpus

Posted: Fri May 13, 2022 9:59 am
by andreas.singraber
Dear Katharina,

I had a closer look at your problem and, although I do not yet have a full answer regarding the origin of the error, I have some suggestions for you:

1.) You mentioned that you intend to use more nodes because of their memory and not so much because of their compute power (additional cores). Hence, I would suggest reflecting this in your submit script settings: you can limit the number of MPI tasks allowed to run on one node with the SLURM ntasks-per-node option. For example, consider nodes with 32 cores each and assume you need 4 nodes to fulfill the memory requirements. If you write a SLURM job script with

Code: Select all

...
#SBATCH --nodes=4
...
you will get 4 nodes with 128 cores in total, and running VASP with srun will translate to mpirun -np 128. If you instead provide ntasks-per-node like this:

Code: Select all

...
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
...
you will still get 4 nodes, but only 8 MPI tasks will run on each node. Therefore, VASP will be started on 32 cores in total but has the memory of all 4 nodes available.
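
Put together, a complete submit script could look like the following sketch (the job name and the vasp_std launch line are placeholders; adapt them to your site's setup):

Code: Select all

#!/bin/bash
#SBATCH --job-name=dft-for-rpa     # placeholder job name
#SBATCH --nodes=4                  # reserve the memory of 4 full nodes ...
#SBATCH --ntasks-per-node=8        # ... but start only 8 MPI tasks per node

# srun inherits the task layout from the SBATCH lines: 4 x 8 = 32 MPI ranks,
# with the combined memory of all 4 nodes available to them.
srun vasp_std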

2.) The numerical problems in your calculation seem to come from an unlucky choice of the NCORE tag. You chose NCORE = 64, probably because you followed the recommendation to pick a value up to the number of cores per socket. However, in combination with your small system this seems to lead to numerical instabilities. I tested different values and found that anything with NCORE >= 33 leads to problems. Now, in the finishing example you sent there were only 32 cores in total, so NCORE was automatically reset from 64 to 32 and everything worked fine. With 128 cores, however, the setting NCORE = 64 was accepted and the numerics failed. I will talk to my colleagues to find out the actual reason for this, but in the meantime I would recommend choosing a different parallelization pattern with a lower NCORE value.
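
Such a scan can be scripted with a small shell loop over NCORE values. This is only a sketch: the directory and file names are placeholders, and it assumes the INCAR already contains an NCORE line and that vasp_std is launched via srun:

Code: Select all

#!/bin/bash
# Scan several NCORE values and check which runs hit the EDDDAV error.
# Assumes a directory "run_template" holding INCAR/KPOINTS/POSCAR/POTCAR.
for ncore in 4 8 16 32 33 64; do
    cp -r run_template "run_ncore_${ncore}"
    cd "run_ncore_${ncore}" || exit 1
    sed -i "s/^NCORE.*/NCORE = ${ncore}/" INCAR
    srun vasp_std > stdout.log 2>&1
    # The failing runs print the EDDDAV/ZHEGV error box to stdout.
    if grep -q "EDDDAV" stdout.log; then
        echo "NCORE = ${ncore}: failed (EDDDAV)"
    else
        echo "NCORE = ${ncore}: ok"
    fi
    cd ..
done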

Please have a look at the parallelization documentation and our newest video on YouTube about this topic:

https://youtu.be/KzIuL_e0zz8 (in particular 11:44 and 31:33)

If you run VASP on a single core, you can check the number of bands in the OUTCAR file (just do a dry run with ALGO = None and NELM = 1). Search for NBANDS; it will show you that with your settings there are 37 bands. That is not a lot of work to parallelize, and certainly not something that would scale well to 128 cores. I would suggest trying NCORE = 4 and KPAR = 2 on 32 cores as a starting point. Also, use the SLURM suggestions from 1.) to get the memory needed for the later VASP steps.
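
A minimal sketch of such a dry run, using only the tags mentioned above (keep the rest of your INCAR as it is):

Code: Select all

ALGO = None   ! set up the calculation but skip the electronic optimization
NELM = 1      ! allow only a single electronic step

Code: Select all

grep NBANDS OUTCAR   # reports NBANDS = 37 for these inputs

The suggested starting point then goes into the INCAR of the production run as NCORE = 4 and KPAR = 2.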

I hope that helps you in setting up the parallelization!

All the best,

Andreas