
G0W0R calculation crashing

Posted: Tue May 28, 2024 2:09 pm
by bprobinson102
VASP team,

I am currently doing some benchmarking using low-scaling GW (G0W0R) due to my interest in finite-temperature properties. I have tested k-point meshes of 5x5x5, 7x7x7, and 9x9x9 as well as ENCUT values of 300, 500, and 800 eV.

I am currently having issues with the attached calculation, where I am using a 9x9x9 k-point mesh with ENCUT=800 eV. All prior calculations have run fine. I believe the error may be memory-related, but I am not entirely sure.

Any thoughts/input would be great. If you have any further questions or need more information please let me know.

Brian

Re: G0W0R calculation crashing

Posted: Tue May 28, 2024 2:57 pm
by henrique_miranda
Maybe you can grep for the memory usage in the calculations that finished:

Code: Select all

$ grep 'memory' OUTCAR

That way, you can get an idea of how much memory you are actually using.
In the `gw.OUTCAR` that you shared, it looks like the code crashes before writing the memory usage of the GW calculation.

Re: G0W0R calculation crashing

Posted: Tue May 28, 2024 10:28 pm
by bprobinson102
I would expect the memory usage to be about 230 GB/rank when NTAUPAR=8, which is too much. However, if NTAUPAR=2, it should be closer to 63 GB/rank, which should be enough, yet I still get the same error.

Re: G0W0R calculation crashing

Posted: Wed May 29, 2024 8:00 am
by henrique_miranda
I think this is a memory problem.
I tried to grep for memory in the OUTCAR you shared, but I get:

Code: Select all

$ grep memory gw.OUTCAR
 total amount of memory used by VASP MPI-rank0    47703. kBytes
 available memory per node:  106.78 GB, setting MAXMEM to  109337
 total amount of memory used by VASP MPI-rank0   226674. kBytes
This is still an incomplete OUTCAR file. The memory estimation is done afterward.
My suggestion to try and get a better idea of the memory usage is to reduce the size of the calculation and grep for memory.
For example, I modified your files so that I could run on my local machine: I used the default ENCUT=308.450 eV and reduced the KPOINTS mesh to 3x3x3.
Then I grep for memory:

Code: Select all

$ grep memory OUTCAR
 total amount of memory used by VASP MPI-rank0    31753. kBytes
 available memory per node:    6.60 GB, setting MAXMEM to    6756
 total amount of memory used by VASP MPI-rank0    35056. kBytes
 estimated memory requirement per rank   1371.8 MB, per node   5487.3 MB
 memory high mark on MPI-rank0 inside Response functions allocated    85927. kBytes
 memory high mark on MPI-rank0 inside RESPONSE_SUPER  1103737. kBytes
 memory high mark on MPI-rank0 inside RESPONSE_SUPER   154051. kBytes
 memory high mark on MPI-rank0 inside SIGMA_SUPER  1293329. kBytes
 memory high mark on MPI-rank0 inside SIGMA_SUPER  1131900. kBytes
 memory high mark on MPI-rank0 inside SIGMA_SUPER  1314821. kBytes
 memory high mark on MPI-rank0 inside SIGMA_SUPER  1153392. kBytes
 total amount of memory used by VASP MPI-rank0    83410. kBytes
 total amount of memory used by VASP MPI-rank0    31775. kBytes
                   Maximum memory used (kb):     1481000.
                   Average memory used (kb):          N/A
In this case, I need about 5 GB of memory per node.
I monitored the calculation with htop, and that estimate was indeed accurate.
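If you want to check this yourself without htop, something along these lines logs the resident memory of the running VASP process (the binary name vasp_std below is an assumption; adjust it to your build):

Code: Select all

# log the resident memory (RSS, in kB) of each VASP rank every 10 seconds
# the binary name vasp_std is an assumption; adjust it to your build
while pgrep -x vasp_std > /dev/null; do
    ps -C vasp_std -o pid=,rss=
    sleep 10
done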

Now, you mentioned that you ran some calculations with lower k-point meshes and ENCUT values: do those calculations crash as well?
If you grep for the memory usage of those calculations, what do you get?
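For example, something like the loop below collects the peak memory of each finished run in one go (the directory names are only placeholders for however you organised the calculations):

Code: Select all

# placeholder directory names; replace them with your own calculation folders
for d in k5x5x5_encut300 k5x5x5_encut500 k9x9x9_encut500; do
    echo "== $d =="
    grep "Maximum memory used" "$d/OUTCAR"
done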

Re: G0W0R calculation crashing

Posted: Wed May 29, 2024 3:21 pm
by bprobinson102
All other calculations ran successfully, with the memory requirements below.

k-point mesh, ENCUT: memory per node
5x5x5, 300 eV: 6 GB
5x5x5, 500 eV: 20 GB
5x5x5, 800 eV: 35 GB
9x9x9, 300 eV: 34 GB
9x9x9, 500 eV: 57 GB

Thanks,
Brian

Re: G0W0R calculation crashing

Posted: Fri May 31, 2024 2:14 pm
by bprobinson102
Adding to the list:

k-point mesh, ENCUT: memory per node
9x9x9, 600 eV: 62 GB
9x9x9, 700 eV: 81 GB
9x9x9, 750 eV: 92 GB

Re: G0W0R calculation crashing

Posted: Fri May 31, 2024 2:46 pm
by henrique_miranda
Indeed, there seems to be a problem in a routine that should return an error; instead, it accesses memory out of bounds, leading to a segmentation fault.
So this is not a memory issue but a bug.

I fixed this in the code in scala.F:

Code: Select all

-         IF ( .NOT. PRESENT( INFO ) ) THEN
+         IF ( PRESENT( INFO ) ) THEN
Then I get the following error message:

Code: Select all

 -----------------------------------------------------------------------------
|                                                                             |
|     EEEEEEE  RRRRRR   RRRRRR   OOOOOOO  RRRRRR      ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     EEEEE    RRRRRR   RRRRRR   O     O  RRRRRR       #       #       #      |
|     E        R   R    R   R    O     O  R   R                               |
|     E        R    R   R    R   O     O  R    R      ###     ###     ###     |
|     EEEEEEE  R     R  R     R  OOOOOOO  R     R     ###     ###     ###     |
|                                                                             |
|     One or more MPI groups includes at least one CPU that does not          |
|     carry data. This would cause ScaLAPACK to crash and is due to the       |
|     automatically selected processor grid for ScaLAPACK.                    |
|     You can influence the processor grid (NPROW, NPCOL) by changing         |
|     NTAUPAR or NBANDS or both such that (NPCOL/NTAUPAR-1)*MB < NBANDS       |
|     as well as (NPROW/NTAUPAR-1)*MB < NBANDS. Values for this run are:      |
|     NTAUPAR = 8                                                             |
|     NBANDS  = 256                                                           |
|     NPCOL   = 4                                                             |
|     NPROW   = 8                                                             |
|     MB      = 64                                                            |
|                                                                             |
|       ---->  I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <----       |
|                                                                             |
 -----------------------------------------------------------------------------
 
The exact values in your case might be different.
You might be able to fix your calculation by changing NBANDS or NTAUPAR.
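For instance, NTAUPAR can be set explicitly in the INCAR (the value 2 below is only an illustration, not a recommendation for your system):

Code: Select all

# force a smaller number of imaginary-time groups; 2 is just an example value
echo "NTAUPAR = 2" >> INCAR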

Regarding NBANDS, my suggestion would be to include all the possible bands, which you can find by inspecting the OUTCAR file:

Code: Select all

$ grep "maximum number of plane-waves:" OUTCAR
You will see that this number increases with ENCUT, so instead of converging ENCUT and NBANDS separately, you could simply converge ENCUT and always use the maximum number of bands.
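As a concrete sketch (assuming the standard OUTCAR line format, with the plane-wave count as the last field):

Code: Select all

# take NBANDS from the maximum number of plane-waves of a finished run
NPW=$(grep "maximum number of plane-waves:" OUTCAR | awk '{print $NF}')
echo "NBANDS = ${NPW}" >> INCAR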

Of course, when you include more bands the memory requirements will increase as well, so perhaps start with a lower ENCUT but a larger number of bands.
Hope this helps, and thank you for bringing this bug to my attention!

Re: G0W0R calculation crashing

Posted: Mon Jun 03, 2024 2:42 pm
by bprobinson102
Thanks for catching that, I will check out your suggestions!