My Community

Posted: **Fri Jan 28, 2022 12:14 pm**

I'm running supercell calculations of a 2D TMD system with a large number of atoms and bands, NBANDS > 1000.

For NBANDS up to around 1000, I'm able to perform SCF calculations with the OpenACC GPU port, both for gamma-point-only calculations and for k-point sampling with KPAR = #kpoints. However, when I try larger supercells (NBANDS = 2000-3000), the calculations go nowhere. No matter how many nodes I use, the calculation does not get further than "entering main loop"; no SCF steps are performed. I have tried both the MPI-only version and the OpenMP version of the code.

What are the recommended settings for large calculations with one or only a few k-points, using the OpenACC GPU port?

Best,
Jonathan Backman

Posted: **Fri Jan 28, 2022 3:04 pm**

Hi,

Please provide all the files according to the forum guidelines.

Posted: **Fri Jan 28, 2022 3:47 pm**

Thanks for your quick answer.

Sorry, I thought there could be some general recommendations for performance using the GPU port. I have now attached my input files.

input.zip

Best,
Jonathan

Posted: **Tue Feb 01, 2022 9:53 am**

Your system is indeed quite big but as long as you have a sufficient amount of memory on the GPU it should work.

You can try to optimize your calculation and see if you are able to run it.
If only the gamma point is calculated, you should definitely use the gamma point version of the code.
Also, I am not sure I understand the motivation for using so many unoccupied states. Do you actually need them?
The energy cutoff in your INCAR is double the default one, so this is another parameter that can be optimized.

Posted: **Tue Feb 01, 2022 12:12 pm**

I'm using the gamma point version of the code when possible. I need the unoccupied states to get an accurate Wannierization of the system. I have converged the unit cell system with respect to the energy cutoff.

I do not run out of memory and I can always add more nodes if that was a problem. The problem is that adding more nodes does not give any speedup, since the calculation stands still.

Are there no recommended settings on how to optimize the OpenACC GPU port beyond using KPAR = #kpoints?
wiki/index.php/OpenACC_GPU_port_of_VASP

Best,
Jonathan

Posted: **Tue Feb 01, 2022 1:42 pm**

In principle, KPAR and NSIM are the flags that should give the means to distribute the load over GPUs.
How many GPUs are you running the calculation on? It is also possible that this job is just not big enough and you see no speedup when you increase the number of nodes.

Posted: **Tue Feb 01, 2022 2:09 pm**

As I mentioned in the original question, the calculation does not get further than this being displayed:
entering main loop
N E dE d eps ncg rms rms(c)

meaning no SCF steps are performed. I tried with up to 100 GPUs and it never goes anywhere. If I run with about 1000 bands, the calculation has no problem even with as few as 8 GPUs per k-point.

I have KPAR = 1 (gamma only) and NSIM = 8.
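
A quick way to confirm this symptom from the stdout is to count the SCF iteration lines that follow the "entering main loop" message. The sketch below is only an illustration: the function name `count_scf_steps` is hypothetical, the sample iteration line is made up, and it assumes iteration lines start with a tag such as `DAV:` or `RMM:`.

```python
import re

# Assumption: SCF iteration lines in the VASP stdout start with a tag such as
# "DAV:" or "RMM:" followed by the iteration number.
SCF_LINE = re.compile(r"^\s*(DAV|RMM):\s+\d+")


def count_scf_steps(stdout_text: str) -> int:
    """Count SCF iteration lines printed after 'entering main loop'.

    Returns -1 if the main loop was never entered.
    """
    _, marker, tail = stdout_text.partition("entering main loop")
    if not marker:
        return -1
    return sum(1 for line in tail.splitlines() if SCF_LINE.match(line))


# Stdout of a run that stalls right after the table header.
hung = (
    " entering main loop\n"
    "       N       E                     dE             d eps       ncg     rms          rms(c)\n"
)
# Made-up sample iteration line, for illustration only.
running = hung + "DAV:   1     0.123456789012E+04    0.12346E+04   -0.56789E+04  4096   0.123E+03\n"

print(count_scf_steps(hung))     # 0: main loop entered, no SCF step yet
print(count_scf_steps(running))  # 1
```

A result of 0 that does not change over time matches the behaviour described above: the main loop is entered but no SCF step ever completes.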

Posted: **Tue Feb 01, 2022 3:10 pm**

What version of the code are you running?
I tried to reproduce this issue on a node with 3 GPUs but I don't see any problem and the calculation proceeds as it should.

Can you please provide stdout and the OUTCAR files from the calculation that hangs up?

Posted: **Tue Feb 01, 2022 4:54 pm**

Thanks for having a look at it!

I have attached data from different runs, using 8 nodes and 20 nodes. There is one GPU on each node. As one can see in the VASP.err from the 8 node run, there is not enough memory; this is, however, not a problem in the 20 node run. I also tried the 20 node run with 1 or 12 OpenMP threads/CPU. I stopped the 20 node runs for these tests, but when I let them continue nothing more happens.

outdata.zip

Is it correct (in the stdout named output) that only one GPU should be detected? "OpenACC runtime initialized ... 1 GPUs detected"

Posted: **Mon Feb 07, 2022 10:18 am**

Thank you for sending the files.
It looks strange that in the out20nodesNoOpenMP directory the job summary only reports 2 nodes and a single GPU, but otherwise the output looks reasonable.
Regarding the "OpenACC runtime initialized" line, it should detect the number of GPUs on the node, so it is correct that it only found 1 GPU in your case.

Could you also provide the versions of the libraries and compilers as well as makefile.include that you used for compiling VASP?

Posted: **Mon Feb 07, 2022 4:00 pm**

I have attached an archive with the files used when compiling.

vasp_compile.zip

The code is compiled using easybuild. The file VASP-6.3.0-CrayNvidia-21.05-acc-easybuild-devel shows which modules and environment variables are used.

Some examples of the versions:
PrgEnv-nvidia: it loads the module nvidia/21.3, providing the NVIDIA HPC SDK (version 21.3) with compilers, NCCL and QD libraries
cudatoolkit (default CRAY_CUDATOOLKIT_VERSION: 11.0.2_3.38-8.1__g5b73779)
Wannier90: version 3.1.0
cray-hdf5 (default version: 1.12.0.0)
intel (default INTEL_VERSION: 2021.2.0)

This information should also be in the easybuild-devel file.

Posted: **Tue Feb 15, 2022 1:35 pm**

Since I was not able to reproduce this issue on our machines, I asked Nvidia for help. I was told that the environment at Piz Daint requires the code to be compiled with the cc60 flag, which you correctly did. However, at runtime one also needs to set the target accelerator architecture to the Tesla P100 GPU, which is done by loading the craype-accel-nvidia60 module.

Could you please load this module in your slurm script and try the calculation again?
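
To verify from inside the job that the module is really active, a small pre-flight check can help. This is a minimal sketch, assuming the module system on the cluster exports the colon-separated `LOADEDMODULES` environment variable (as Environment Modules does); the helper name `is_module_loaded` and the example strings are hypothetical.

```python
import os
from typing import Optional


def is_module_loaded(name: str, loaded: Optional[str] = None) -> bool:
    """Return True if `name` appears in a colon-separated LOADEDMODULES string.

    If `loaded` is not given, the LOADEDMODULES environment variable is used
    (assumption: the module system sets it).
    """
    if loaded is None:
        loaded = os.environ.get("LOADEDMODULES", "")
    entries = [entry for entry in loaded.split(":") if entry]
    # Entries may carry a version suffix, e.g. "nvidia/21.3".
    return any(entry == name or entry.startswith(name + "/") for entry in entries)


# Illustrative module lists, not taken from a real job.
with_accel = "PrgEnv-nvidia:nvidia/21.3:craype-accel-nvidia60"
without_accel = "PrgEnv-nvidia:nvidia/21.3"

print(is_module_loaded("craype-accel-nvidia60", with_accel))     # True
print(is_module_loaded("craype-accel-nvidia60", without_accel))  # False
```

Calling `is_module_loaded("craype-accel-nvidia60")` without a second argument at the start of the job checks the live environment instead of the example strings.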

Posted: **Wed Feb 16, 2022 1:22 am**

I believe this is the same problem that I have, see the post forum/viewtopic.php?f=7&t=18381.

/Daniel

Posted: **Fri Feb 18, 2022 2:27 pm**

I've read about issues with OpenMP and Nvidia 22.1/21.11 for the Q-E code. Nvidia confirmed that there is a bug in those 2 versions, so they recommend either switching to 21.9 (should be OK) or waiting for 22.2 (it seems to be available now).

sergey

Posted: **Wed Feb 23, 2022 6:31 am**

I have tested compiling VASP with NVHPC version 22.2, but the problem still persists.
However, version 21.2 is working!

/Daniel


Large number of bands using OpenACC GPU port
