Optimizing on the AMD EPYC

cuikun_lin
Newbie
Posts: 4
Joined: Wed Mar 03, 2021 10:18 pm

Optimizing on the AMD EPYC

#1 Post by cuikun_lin » Wed Mar 03, 2021 11:55 pm

We have a cluster of 240 nodes connected via a 200 Gbps InfiniBand network. Each node has 128 AMD EPYC 7702 cores. For many applications (such as LAMMPS) it is important to bind the MPI tasks to the L3 cache domains (32 MPI tasks per node) and then have each MPI task spawn 4 threads to get good performance. Here is how the binding is done for LAMMPS (two different ways).

METHOD 1

Code:

mpirun -np 256 --bind-to core --map-by hwthread -use-hwthread-cpus -mca btl vader,self lmp -var r 1000 -in in.rhodo -sf omp
METHOD 2

Code:

export OMP_NUM_THREADS=4
mpirun --mca btl self,vader --map-by l3cache lmp -var r 1000 -in in.rhodo -sf omp
Can something similar be done for VASP? We built VASP with OpenMP support, but we cannot get the binding and threading to work.
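
To make the question concrete, this is the analogue of METHOD 2 that we would like to get working with the VASP binary (a sketch of the goal, not a working setup on our side; vasp_std is the executable from our OpenMP-enabled build):

Code:

# Desired analogue of METHOD 2: one MPI rank per L3 cache domain,
# 4 OpenMP threads per rank (does not bind/thread as expected for us)
export OMP_NUM_THREADS=4
mpirun --mca btl self,vader --map-by l3cache vasp_std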

Thanks

merzuk.kaltak
Administrator
Posts: 282
Joined: Mon Sep 24, 2018 9:39 am

Re: Optimizing on the AMD EPYC

#2 Post by merzuk.kaltak » Thu Mar 04, 2021 1:01 pm

Currently we have AMD EPYC chips only on nodes connected via a 1 Gbps network.
As such, we cannot yet test multi-node processor pinning and thread launching in practice.

Concerning MPI + OpenMP on a single node:
So far I have tried only pure MPI parallelization on an EPYC 7402P, where the mpirun option "--map-by core" described in this AMD tuning guide was sufficient.
MPI + OpenMP threading is explained in general on our wiki page. The idea is that the threads launched by an MPI rank should stay on the same node as that rank, or (even better) on the same socket.
In the case of EPYC chips, "same socket" would ideally even mean the same chiplet, i.e. within one L3 cache domain. A sketch of such a launch follows below.
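
As a minimal sketch (assuming Open MPI and an OpenMP-enabled VASP build; the rank and thread counts are illustrative for a 128-core node and untested here), pinning one rank per L3 cache domain with four threads each would look roughly like this:

Code:

# One MPI rank per L3 cache domain, 4 OpenMP threads per rank;
# OMP_PLACES/OMP_PROC_BIND keep each rank's threads on its own cores
export OMP_NUM_THREADS=4
export OMP_PLACES=cores
export OMP_PROC_BIND=close
mpirun -np 32 --map-by l3cache:PE=4 --bind-to core --report-bindings vasp_std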

cuikun_lin
Newbie
Posts: 4
Joined: Wed Mar 03, 2021 10:18 pm

Re: Optimizing on the AMD EPYC

#3 Post by cuikun_lin » Sun Mar 07, 2021 2:56 pm

I ran VASP with four different mpirun setups. Here are the associated timings.

Code:

time -p mpirun vasp_std
286.583 seconds

time -p mpirun --bind-to core vasp_std
287.094 seconds

(1) time -p mpirun --map-by core --report-bindings --mca pml ucx --mca osc ucx \
--mca coll_hcoll_enable 1 -x UCX_NET_DEVICES=mlx5_2:1 -x \
HCOLL_MAIN_IB=mlx5_2:1 vasp_std
10222.636 seconds

(2) mpirun -np 32 --map-by l3cache:PE=4 --bind-to core \
             -x OMP_NUM_THREADS=4 -x OMP_STACKSIZE=512m \
             -x KMP_AFFINITY=verbose,granularity=fine,compact,1,0 \
             vasp_std
415.790 seconds
I am not sure why mpirun invocation (1), suggested in the AMD tuning guide for the 7002-series processors, performs so badly. When I run the threaded version (2), what should I see in top? I expected to see 32 MPI tasks, each using approximately 400% CPU, but I didn't.
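
For reference, one way to inspect the placement for a run like (2) is sketched below (the rank and thread counts mirror invocation (2); OMP_DISPLAY_AFFINITY requires an OpenMP 5.0-capable runtime):

Code:

# --report-bindings prints each rank's binding; OMP_DISPLAY_AFFINITY
# makes the OpenMP runtime print each thread's affinity at startup
mpirun -np 32 --map-by l3cache:PE=4 --bind-to core --report-bindings \
       -x OMP_NUM_THREADS=4 -x OMP_DISPLAY_AFFINITY=true vasp_std
# in top, each of the 32 vasp_std processes should sit near 400% CPU;
# 'top -H' shows the individual threads as separate lines instead
top -H -u $USER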

Thanks

thda0531
Newbie
Posts: 5
Joined: Tue Apr 22, 2008 7:00 am
License Nr.: 18

Re: Optimizing on the AMD EPYC

#4 Post by thda0531 » Tue Mar 16, 2021 9:42 am

Hi cuikun_lin,

sorry to be completely off topic, but could you share your makefile.include settings with us for building VASP efficiently on AMD EPYC?

Thank you in advance.

cuikun_lin
Newbie
Posts: 4
Joined: Wed Mar 03, 2021 10:18 pm

Re: Optimizing on the AMD EPYC

#5 Post by cuikun_lin » Sat Mar 20, 2021 12:39 am

Thda0531,

All the heavy lifting was done by our HPCC staff, who did a lot of work on optimizing the clusters; I believe they are still tuning them for more efficient CPU time.
For the makefile, I tried different templates from the VASP wiki, and they work very well when you follow the instructions there.
For AMD compiler options such as -O1, -O2, -O3, and -Ofast, please see this document:
https://www.amd.com/system/files/docume ... essors.pdf
In my limited tests of the GNU build, -O3 and -Ofast are both pretty good. NERSC has also done extensive benchmark tests; please see the following document:
https://www.nersc.gov/assets/Uploads/Co ... 90212.pptx
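
As an illustration only (a hypothetical fragment, not our production file; -march=znver2 targets the EPYC 7002 series and needs GCC 9 or newer), the optimization line in a GNU-toolchain makefile.include would look roughly like this:

Code:

# hypothetical makefile.include fragment for GCC on EPYC 7002 ("Rome");
# use -march=native if the compiler predates znver2 support
OFLAG = -O3 -march=znver2 -mtune=znver2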
Hope this will help.
