My Community

Posted: **Thu Apr 11, 2024 2:43 pm**

Hi there,

This is a continuation of the discussion here, as quoted below:

michael_wolloch wrote: ↑Thu Apr 11, 2024 7:03 am
Dear Zhao,

this post has gotten a bit far from the original question for my taste. If you want to discuss benchmarking and the intricacies of process pinning, I would suggest making a new post in the "users for users" section.

What confuses me is: why does -bind-to core not lead to a significant reduction in computational efficiency compared to -genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core?
You are mixing openMPI and intelMPI command line arguments here. Without going into detail, it is important to know where the processes end up. Use -genv I_MPI_DEBUG=4 for intelMPI and --report-bindings for OpenMPI to check.

Regarding Michael's comments above, I have the following questions:

1. Intel MPI also has the -bind-to option, as shown below:

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
Copyright 2003-2023, Intel Corporation.

$ mpirun --help | grep -- -bind-to
    -bind-to                         process binding

So I think both -bind-to core and -genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core can be used with Intel MPI to achieve the same purpose. Am I right?

2. I checked the process pinning that these two options produce with Intel MPI, as follows:

werner@X10DAi:~/Public/hpc/vasp/benchmark/amd/Cr72_3x3x3K_350eV_10DAV$ mpirun -genv I_MPI_DEBUG=4 -bind-to core -np 4 vasp_std
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: tcp
[0] MPI startup(): File "" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/2023.2.0/mpi/2021.10.0/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       41239    X10DAi     {0,1,2,3,4,5,6,7,8,9,10,44,45,46,47,48,49,50,51,52,53,54}
[0] MPI startup(): 1       41240    X10DAi     {11,12,13,14,15,16,17,18,19,20,21,55,56,57,58,59,60,61,62,63,64,65}
[0] MPI startup(): 2       41241    X10DAi     {22,23,24,25,26,27,28,29,30,31,32,66,67,68,69,70,71,72,73,74,75,76}
[0] MPI startup(): 3       41242    X10DAi     {33,34,35,36,37,38,39,40,41,42,43,77,78,79,80,81,82,83,84,85,86,87}
 running    4 mpi-ranks, on    1 nodes
 distrk:  each k-point on    4 cores,    1 groups
 distr:  one band on    4 cores,    1 groups
 vasp.6.4.2 20Jul23 (build Feb 29 2024 20:51:29) complex

werner@X10DAi:~/Public/hpc/vasp/benchmark/amd/Cr72_3x3x3K_350eV_10DAV$ mpirun -genv I_MPI_DEBUG=4 -genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core -np 4 vasp_std
[0] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)
[0] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.18.0-impi
[0] MPI startup(): libfabric provider: tcp
[0] MPI startup(): File "" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/2023.2.0/mpi/2021.10.0/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       42573    X10DAi     {0,44}
[0] MPI startup(): 1       42574    X10DAi     {1,45}
[0] MPI startup(): 2       42575    X10DAi     {22,66}
[0] MPI startup(): 3       42576    X10DAi     {23,67}
 running    4 mpi-ranks, on    1 nodes
 distrk:  each k-point on    4 cores,    1 groups
 distr:  one band on    4 cores,    1 groups
 vasp.6.4.2 20Jul23 (build Feb 29 2024 20:51:29) complex

Does this indicate that both can be used with Intel MPI to accomplish the same task? If both are correct usage, why does the second method reduce the computational efficiency by half?

Regards,
Zhao

Posted: **Fri Apr 12, 2024 7:58 am**

Dear Zhao,

As you can see from the output you provided, even if "-bind-to core" is a valid option for Intel MPI, its result differs dramatically from that of "-genv I_MPI_PIN 1 -genv I_MPI_PIN_DOMAIN core". In the first case, each process is bound to 11 cores (e.g. 0-10) and the 11 corresponding logical cores (44-54), while in the second case each process is bound to a single core and its corresponding logical core (e.g. 0 and 44).
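To compare the two pinning results without reading the lists by eye, the "Pin cpu" table printed with I_MPI_DEBUG=4 can be summarized with a short script. The following is only a minimal Python sketch: the function name `parse_pin_table` and the regular expression are illustrative assumptions, and it expects the comma-separated row layout shown in the output above (other Intel MPI versions may print the table differently).

```python
import re

# Matches pin-table rows in the layout quoted in this thread, e.g.
# [0] MPI startup(): 0       42573    X10DAi     {0,44}
# Assumption: CPU ids are printed as a plain comma-separated list.
PIN_ROW = re.compile(r"MPI startup\(\):\s+(\d+)\s+(\d+)\s+(\S+)\s+\{([\d,\s]*)\}")


def parse_pin_table(debug_output):
    """Return {rank: sorted logical CPU ids} for every pin-table row found."""
    pins = {}
    for line in debug_output.splitlines():
        match = PIN_ROW.search(line)
        if match is None:
            continue  # header, banner, or VASP output line
        rank = int(match.group(1))
        cpus = [int(tok) for tok in match.group(4).split(",") if tok.strip()]
        pins[rank] = sorted(cpus)
    return pins


# Rows copied from the I_MPI_PIN_DOMAIN core run above.
NARROW = """\
[0] MPI startup(): Rank    Pid      Node name  Pin cpu
[0] MPI startup(): 0       42573    X10DAi     {0,44}
[0] MPI startup(): 1       42574    X10DAi     {1,45}
[0] MPI startup(): 2       42575    X10DAi     {22,66}
[0] MPI startup(): 3       42576    X10DAi     {23,67}
"""

# Rank 0 row copied from the -bind-to core run above.
WIDE = ("[0] MPI startup(): 0       41239    X10DAi     "
        "{0,1,2,3,4,5,6,7,8,9,10,44,45,46,47,48,49,50,51,52,53,54}")

if __name__ == "__main__":
    for label, text in (("-bind-to core", WIDE), ("I_MPI_PIN_DOMAIN core", NARROW)):
        for rank, cpus in parse_pin_table(text).items():
            print(f"{label}: rank {rank} -> {len(cpus)} logical CPUs {cpus}")
```

With the rows from this thread, rank 0 gets 22 logical CPUs in the first run and 2 in the second, which is the difference described above.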

Please consult the available resources:
https://www.intel.com/content/www/us/en ... nning.html

You can also look at Intel's pinning simulator:
https://www.intel.com/content/www/us/en ... lator.html

Check your hardware using e.g. "lscpu" and "lscpu -e" to figure out which core is associated with which shared L3 cache, NUMA domain, socket, and so on.
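As a rough illustration of that hardware check, the Python sketch below groups logical CPUs by physical core. It assumes comma-separated text in the form produced by `lscpu -p=CPU,CORE,SOCKET`; the function name `group_cpus_by_core` is hypothetical, and the sample rows are made up for illustration (only chosen to agree with the pairs 0/44 and 22/66 seen in the pin table above), not taken from a real `lscpu` run on this machine.

```python
from collections import defaultdict


def group_cpus_by_core(lscpu_parse_output):
    """Map (socket, core) -> sorted logical CPU ids.

    Expects lines of the form "CPU,CORE,SOCKET"; lines starting with '#'
    are treated as header comments and skipped.
    """
    cores = defaultdict(list)
    for line in lscpu_parse_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cpu, core, socket = (int(field) for field in line.split(",")[:3])
        cores[(socket, core)].append(cpu)
    return {key: sorted(cpus) for key, cpus in cores.items()}


# Illustrative sample only (hypothetical values, not real lscpu output).
SAMPLE = """\
# CPU,Core,Socket
0,0,0
1,1,0
22,22,1
23,23,1
44,0,0
45,1,0
66,22,1
67,23,1
"""

if __name__ == "__main__":
    for (socket, core), cpus in sorted(group_cpus_by_core(SAMPLE).items()):
        print(f"socket {socket}, core {core}: logical CPUs {cpus}")
```

On a real node, the input text could be captured with `subprocess.run(["lscpu", "-p=CPU,CORE,SOCKET"], capture_output=True, text=True).stdout`, assuming the installed `lscpu` supports that column list.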

If testing different numbers of ranks (and threads) on the same node/chip, please also note that you have to control clock frequency and limit boost, so that you do not run at lower clock speeds for higher core counts.

Posted: **Fri Apr 19, 2024 9:00 am**

Hi again,

I just noticed that I forgot to link the available information on parallelization and on which processes VASP expects to be placed together. Please check out the parallelization category on the wiki for more information.

Cheers, Michael

Posted: **Fri Apr 26, 2024 4:24 am**

Dear Michael,

As you have explained, benchmarking can be quite complex, particularly when it involves process pinning. Thank you for providing such a wealth of useful reference material. I will try to continue my studies to grasp the proper parameter settings relevant to my specific situation.

See here for the related discussion.

Regards,
Zhao

Process pinning with intelmpi.