Large number of bands using OpenACC GPU port
-
- Newbie
- Posts: 24
- Joined: Thu Nov 26, 2020 10:27 am
Large number of bands using OpenACC GPU port
I'm running supercell calculations of a 2D TMD system with a large number of atoms and bands, NBANDS > 1000.
For NBANDS up to around 1000, I'm able to perform SCF calculations with the OpenACC GPU port, using either a gamma-point calculation or k-point sampling with KPAR = #kpoints. However, when I try larger supercells (NBANDS = 2000-3000), the calculations go nowhere. No matter how many nodes I use, the calculation does not get further than "entering main loop"; no SCF steps are performed. I have tried both the MPI-only version and the OpenMP version of the code.
What are the recommended settings for large calculations with one or a few k-points, using the OpenACC GPU port?
Best,
Jonathan Backman
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
-
- Newbie
- Posts: 24
- Joined: Thu Nov 26, 2020 10:27 am
Re: Large number of bands using OpenACC GPU port
Thanks for your quick answer.
Sorry, I thought there could be some general recommendations for performance using the GPU port. I have now attached my input files.
Best,
Jonathan
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: Large number of bands using OpenACC GPU port
Your system is indeed quite big but as long as you have a sufficient amount of memory on the GPU it should work.
You can try to optimize your calculation and see if you are able to run it.
If only the gamma point is calculated, you should definitely use the gamma point version of the code.
Also, I am not sure I understand the motivation for using so many unoccupied states. Do you actually need them?
The energy cutoff in your INCAR is double the default value, so this is another parameter that can be optimized.
-
- Newbie
- Posts: 24
- Joined: Thu Nov 26, 2020 10:27 am
Re: Large number of bands using OpenACC GPU port
I'm using the gamma point version of the code when possible. I need the unoccupied states to get an accurate wannierization of the system. I have converged the unit cell system with respect to the energy cutoff.
I do not run out of memory, and I can always add more nodes if that were the problem. The problem is that adding more nodes does not give any speedup, since the calculation stands still.
Are there no recommended settings on how to optimize the OpenACC GPU port beyond using KPAR = #kpoints?
wiki/index.php/OpenACC_GPU_port_of_VASP
Best,
Jonathan
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: Large number of bands using OpenACC GPU port
In principle, KPAR and NSIM are the flags that should give the means to distribute the load over GPUs.
How many GPUs are you running the calculation on? It is also possible that this job is just not big enough and you see no speedup when you increase the number of nodes.
-
- Newbie
- Posts: 24
- Joined: Thu Nov 26, 2020 10:27 am
Re: Large number of bands using OpenACC GPU port
As I mentioned in the original question, the calculation does not get further than displaying this:
entering main loop
N E dE d eps ncg rms rms(c)
meaning no SCF steps are performed. I tried with up to 100 GPUs and it never goes anywhere. If I run with about 1000 bands, the calculation has no problem even with as few as 8 GPUs/k-point.
I have KPAR = 1 (gamma only) and NSIM = 8.
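A quick way to confirm this symptom from the job's stdout is to count electronic-step lines after the "entering main loop" marker. The sketch below is not part of VASP: `count_scf_steps` is a hypothetical helper, and it assumes each step is echoed as a line beginning with a three-letter algorithm tag (such as DAV or RMM), a colon, and the step number.

```python
import re

# Assumption: every electronic step is printed as e.g. "DAV:   1  ...",
# i.e. three uppercase letters, a colon, then the step counter.
STEP_LINE = re.compile(r"^\s*[A-Z]{3}:\s*\d+")


def count_scf_steps(stdout_text: str) -> int:
    """Count electronic-step lines printed after 'entering main loop'."""
    _, marker, tail = stdout_text.partition("entering main loop")
    if not marker:
        return 0  # the run never reached the main loop
    return sum(1 for line in tail.splitlines() if STEP_LINE.match(line))
```

A result of 0 for a job that has been running for a long time matches the behaviour described here: the header is printed but no step follows.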
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: Large number of bands using OpenACC GPU port
What version of the code are you running?
I tried to reproduce this issue on a node with 3 GPUs but I don't see any problem and the calculation proceeds as it should.
Can you please provide stdout and the OUTCAR files from the calculation that hangs up?
-
- Newbie
- Posts: 24
- Joined: Thu Nov 26, 2020 10:27 am
Re: Large number of bands using OpenACC GPU port
Thanks for having a look at it!
I have attached data from different runs, using 8 nodes and 20 nodes. There is one GPU on each node. As one can see in the VASP.err from the 8-node run, there is not enough memory; this is, however, not a problem in the 20-node run. I also tried the 20-node run with 1 or 12 OpenMP threads/CPU. I stopped the 20-node runs for these tests, but when I let them continue nothing more happens. Is it correct (in the stdout file named output) that only one GPU should be detected? "OpenACC runtime initialized ... 1 GPUs detected"
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: Large number of bands using OpenACC GPU port
Thank you for sending the files.
It looks strange that in the out20nodesNoOpenMP directory the job summary only reports 2 nodes and a single GPU, but otherwise the output looks reasonable.
Regarding the "OpenACC runtime initialized" line, it reports the number of GPUs detected on the node, so it is correct that it only found 1 GPU in your case.
Could you also provide the versions of the libraries and compilers as well as makefile.include that you used for compiling VASP?
-
- Newbie
- Posts: 24
- Joined: Thu Nov 26, 2020 10:27 am
Re: Large number of bands using OpenACC GPU port
I have attached a file with the files used when compiling.
The code is compiled using EasyBuild. The file VASP-6.3.0-CrayNvidia-21.05-acc-easybuild-devel shows which modules and environment variables are used. Some examples of the versions:
PrgEnv-nvidia: it loads the module nvidia/21.3, providing the NVIDIA HPC SDK (version 21.3) with compilers, NCCL and QD libraries
cudatoolkit (default CRAY_CUDATOOLKIT_VERSION: 11.0.2_3.38-8.1__g5b73779)
Wannier90: version 3.1.0
cray-hdf5 (default version: 1.12.0.0)
intel (default INTEL_VERSION: 2021.2.0)
This information should also be in the easybuild-devel file.
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: Large number of bands using OpenACC GPU port
Since I was not able to reproduce this issue on our machines, I asked the Nvidia people for help. I was told that the specifics of the environment at Piz Daint require the code to be compiled with the cc60 flag, which you correctly did. However, at runtime one also needs to set the target accelerator architecture to the Tesla P100 GPU, which is done by loading the craype-accel-nvidia60 module.
Could you please load this module in your Slurm script and try the calculation again?
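To check from inside the job that the module really is loaded at runtime, one option is to inspect the LOADEDMODULES environment variable. This is a minimal sketch, assuming the cluster's module system exports LOADEDMODULES as a colon-separated list (Environment Modules and Lmod both do); `missing_modules` is a hypothetical helper name, not something provided by VASP or the Cray environment.

```python
import os


def missing_modules(required, loaded=None):
    """Return the names in `required` that are absent from LOADEDMODULES."""
    if loaded is None:
        # Assumption: the module system exports LOADEDMODULES.
        loaded = os.environ.get("LOADEDMODULES", "")
    entries = [e for e in loaded.split(":") if e]
    missing = []
    for name in required:
        # An entry may carry a version suffix, e.g. "nvidia/21.3".
        if not any(e == name or e.startswith(name + "/") for e in entries):
            missing.append(name)
    return missing
```

Calling `missing_modules(["craype-accel-nvidia60"])` at the start of the job and aborting when the returned list is non-empty would catch a missing module before the run starts.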
-
- Newbie
- Posts: 38
- Joined: Sat Feb 13, 2016 4:39 pm
- License Nr.: 20-0400 5-1605
Re: Large number of bands using OpenACC GPU port
I believe this is the same problem that I have, see the post forum/viewtopic.php?f=7&t=18381.
/Daniel
-
- Newbie
- Posts: 24
- Joined: Tue Nov 12, 2019 7:55 am
Re: Large number of bands using OpenACC GPU port
I've read about issues with OpenMP and Nvidia 22.1/21.11 for the Q-E code. Nvidia confirmed that there is a bug in those two versions, so they recommend either switching to 21.9 (should be OK) or waiting for 22.2 (it seems to be available now).
sergey
-
- Newbie
- Posts: 38
- Joined: Sat Feb 13, 2016 4:39 pm
- License Nr.: 20-0400 5-1605
Re: Large number of bands using OpenACC GPU port
I have tried compiling VASP with NVHPC version 22.2, but the problem still persists.
However, version 21.2 is working!
/Daniel