Take a look, here are some important parameters which can improve parallelised calculation!

Posted: Mon May 03, 2010 10:51 pm
by vasp16888
Dear all VASP users,
Some time ago, in order to improve the efficiency of parallelised calculations on a large system, I posted a thread titled "How to improve parallelised calculation ?". After several exchanges with Danny and Alex, they suggested that I test the parameters NPAR, NSIM and LPLANE.
The test case is Tb0.25Dy0.75Fe2, which has 2 Tb atoms, 6 Dy atoms and 16 Fe atoms in the unit cell; the k-point mesh is automatic, 7*7*7. I used 2 nodes (24 cores, i.e. 12 cores per node), connected by Infiniband (20GB/S).
With the default values for NPAR, NSIM and LPLANE, the calculation takes approximately 3000 seconds.
I tested NPAR = 1, 2, 4, 6, 8, 12, 24, each combined with NSIM = 1, 2, 4, 6, 8, 12, 24, which gives 7*7 = 49 cases. Here are the timings in seconds:
======================================================================================
NSIM\NPAR     NPAR=1     NPAR=2     NPAR=4     NPAR=6     NPAR=8    NPAR=12    NPAR=24
NSIM=1     10712.798   1533.092   2302.704   2220.371   2470.454   2834.889   2941.860
NSIM=2     10889.813   1940.413   2429.192   2239.376   2515.769   2891.733   2948.944
NSIM=4     10622.640   1917.540   2221.515   2385.977   2502.756   2929.271   3033.390
NSIM=6     10836.125   2111.760   2393.558   2395.906   2623.324   2913.558   2990.683
NSIM=8     11168.838   2107.752   2378.309   2296.263   2668.595   2934.627   3109.094
NSIM=12    11148.837   2056.108   2279.254   2339.886   2624.820   2934.643   3207.204
NSIM=24    10512.837   1967.503   2253.869   2260.493   2626.288   3016.769   3725.165
======================================================================================
NPAR = 2, NSIM = 1 is the fastest combination. Increasing NSIM gives no clear gain: for NPAR = 2, every NSIM > 1 is slower than NSIM = 1. NPAR should be increased with care, because the best value depends on the specific case.
I also tested LPLANE = .TRUE., which saves approximately a further 50% of the calculation time.
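
As a cross-check of the table, here is a minimal Python sketch that searches the grid for the fastest run and compares it with the approximate default timing of 3000 seconds. The timings are copied from the table; the names `TIMES`, `DEFAULT_TIME` and `fastest` are only illustrative.

```python
# Wall times in seconds, copied from the benchmark table (rows: NSIM, columns: NPAR).
NPAR_VALUES = [1, 2, 4, 6, 8, 12, 24]
TIMES = {
    1:  [10712.798, 1533.092, 2302.704, 2220.371, 2470.454, 2834.889, 2941.860],
    2:  [10889.813, 1940.413, 2429.192, 2239.376, 2515.769, 2891.733, 2948.944],
    4:  [10622.640, 1917.540, 2221.515, 2385.977, 2502.756, 2929.271, 3033.390],
    6:  [10836.125, 2111.760, 2393.558, 2395.906, 2623.324, 2913.558, 2990.683],
    8:  [11168.838, 2107.752, 2378.309, 2296.263, 2668.595, 2934.627, 3109.094],
    12: [11148.837, 2056.108, 2279.254, 2339.886, 2624.820, 2934.643, 3207.204],
    24: [10512.837, 1967.503, 2253.869, 2260.493, 2626.288, 3016.769, 3725.165],
}
DEFAULT_TIME = 3000.0  # approximate time with default settings, as stated in the post


def fastest(times, npar_values):
    """Return (seconds, nsim, npar) for the fastest run in the grid."""
    return min(
        (seconds, nsim, npar)
        for nsim, row in times.items()
        for npar, seconds in zip(npar_values, row)
    )


best_time, best_nsim, best_npar = fastest(TIMES, NPAR_VALUES)
speedup = DEFAULT_TIME / best_time

print(f"fastest: NPAR={best_npar}, NSIM={best_nsim}, {best_time:.3f} s")
print(f"speedup over the default settings: {speedup:.2f}x")
```

The speedup figure is only as accurate as the rough 3000-second default timing.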

My questions are as follows:
(1) The VASP user guide says that increasing NSIM should improve performance, but my data do not show this. Which one is correct?
(2) Since NPAR = 2, NSIM = 1 is fastest, I thought that calculating one band per node might be optimal, so I expected a further improvement with 3 nodes and NPAR = 3, NSIM = 1. However, the calculation time is 1515 seconds, only 18 seconds faster. Why?
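
The numbers in question (2) can also be expressed as a speedup and a relative parallel efficiency when going from 2 to 3 nodes. This is a minimal sketch that takes the 2-node run as the reference; the function name `relative_efficiency` is hypothetical.

```python
def relative_efficiency(t_ref, nodes_ref, t_new, nodes_new):
    """Speedup of the new run over the reference, and that speedup
    divided by the ideal (linear) speedup nodes_new / nodes_ref."""
    speedup = t_ref / t_new
    ideal = nodes_new / nodes_ref
    return speedup, speedup / ideal


# 2 nodes: 1533.092 s (NPAR = 2, NSIM = 1); 3 nodes: 1515 s (NPAR = 3, NSIM = 1)
speedup, efficiency = relative_efficiency(1533.092, 2, 1515.0, 3)

print(f"speedup from 2 to 3 nodes: {speedup:.3f} (ideal: 1.5)")
print(f"relative parallel efficiency: {efficiency:.0%}")
```

An efficiency well below 100% means the third node adds cores without reducing the run time proportionally.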

I hope my results will be useful for all VASP users. I am looking forward to everybody's replies to my questions.
Thanks:)
Hui





Posted: Tue May 04, 2010 6:47 am
by alex
1) Benchmark results only apply to the specific case you tested. Similar systems behave similarly. So there is no right or wrong, just better or worse.
3) NPAR decreases communication and increases the numerics on one core. Probably you have somehow a double minimum.

alex

Posted: Tue May 04, 2010 9:17 am
by Danny
In addition, bear in mind that your problem (24 atoms) might become small for a large number of cores (3 nodes = 36 cores).
If you look at the scaling of VASP with respect to the number of cores/nodes used, you'll see it is linear at first (as we wish), but at some point it starts to level off and you get no further improvement.

In your case, test on 1 node; normally your optimum should be at NPAR=1, NSIM=1
(again a grid, but you can make it smaller: NPAR = 1, 2, 4, 6, 12; NSIM = 1, 2, 4, 6).
So the fact that you find only 18 seconds difference for 3 nodes might be because you have already reached the best performance for that system.

Keep in mind that this behavior differs from machine to machine, and probably setup to setup.
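
The smaller grid suggested above could be generated with a short Python sketch like the one below. It only produces the two scanned INCAR tag lines for each run; the run labels (`npar1_nsim1` etc.) and the helper `incar_lines` are hypothetical, and the rest of the INCAR is assumed to stay unchanged between runs.

```python
from itertools import product

# Reduced benchmark grid for a single node, as suggested in the post.
NPAR_GRID = [1, 2, 4, 6, 12]
NSIM_GRID = [1, 2, 4, 6]


def incar_lines(npar, nsim):
    """INCAR tag lines for one benchmark run (only the two scanned tags)."""
    return f"NPAR = {npar}\nNSIM = {nsim}\n"


# One entry per (NPAR, NSIM) combination, keyed by a hypothetical run label.
runs = {
    f"npar{npar}_nsim{nsim}": incar_lines(npar, nsim)
    for npar, nsim in product(NPAR_GRID, NSIM_GRID)
}

print(f"{len(runs)} benchmark runs")
print(runs["npar2_nsim4"])
```

Each entry would be appended to a copy of the common INCAR before the corresponding job is submitted.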

Danny

Posted: Tue May 04, 2010 6:11 pm
by vasp16888
[quote="alex"]1) Benchmark results only apply to the specific case you tested. Similar systems behave similarly. So there is no right or wrong, just better or worse.
3) NPAR decreases communication and increases the numerics on one core. Probably you have somehow a double minimum.

alex[/quote]


Hi Alex, thanks for your reply.
You said: NPAR decreases communication and increases the numerics on one core. Probably you have somehow a double minimum.

Are you saying we should find a balance between decreasing communication and increasing the numerics on one core? Or something else?
About NSIM: what role do you think it plays in the calculation? (The manual says NSIM bands are optimized at the same time, which suggests that a bigger NSIM should be faster, but my results do not show this. Is this only true for this special case?)
Please make it clearer, thanks:)

Posted: Tue May 04, 2010 6:26 pm
by vasp16888
[quote="Danny"]In addition, bear in mind that your problem (24 atoms) might become small for a large number of cores (3 nodes = 36 cores).
If you look at the scaling of VASP with respect to the number of cores/nodes used, you'll see it is linear at first (as we wish), but at some point it starts to level off and you get no further improvement.

In your case, test on 1 node; normally your optimum should be at NPAR=1, NSIM=1
(again a grid, but you can make it smaller: NPAR = 1, 2, 4, 6, 12; NSIM = 1, 2, 4, 6).
So the fact that you find only 18 seconds difference for 3 nodes might be because you have already reached the best performance for that system.

Keep in mind that this behavior differs from machine to machine, and probably setup to setup.

Danny[/quote]

My questions are still not fully answered:
(1) What role does NSIM play in the calculation? (The manual says NSIM bands are optimized at the same time, so a bigger NSIM should be faster, but my results do not show this.)

(2) As Alex said, NPAR decreases the communication and increases the numerics on each core. But how can we find the best NPAR for a given system without testing all NPAR values?
(We do not want to test every NPAR for every large system.)
For comparison: I used WIEN2k for some time, and there I can get the approximate number of plane waves during init_lapw, so I can make an approximate guess about how many cores I need, whether to use k-point parallelisation or MPI parallelisation, etc.

So, please tell me your thoughts about the above 2 questions. Thanks in advance:)

Posted: Wed May 05, 2010 8:35 am
by Danny
Regarding NPAR: the communication in VASP is fairly constant across different calculations, so if you benchmark one decent system (the test job you just used), you know how it behaves for all calculations (with some exceptions, e.g. very small jobs). Your current setting probably arranges the communication so that the major communication happens only within a node, and it reduces the communication between nodes. In general the latter is slower, so your choice of NPAR = #nodes reduces it, which increases performance.
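
The rule of thumb above (NPAR = number of nodes) could be sketched in Python as follows. The helper names are hypothetical, and the requirement that NPAR divides the total core count evenly is an assumption that is not stated in this thread. Treat the output as a starting point for benchmarking, not as a guaranteed optimum.

```python
def npar_candidates(total_cores):
    """NPAR values that divide the core count evenly (assumed requirement)."""
    return [n for n in range(1, total_cores + 1) if total_cores % n == 0]


def npar_starting_point(nodes, cores_per_node):
    """Suggest NPAR = number of nodes, as proposed in the post above."""
    total_cores = nodes * cores_per_node
    return {
        "NPAR": nodes,
        # cores that share the work on one band: total cores / NPAR
        "cores_per_band": total_cores // nodes,
        # other values one might still benchmark against the suggestion
        "other_candidates": [n for n in npar_candidates(total_cores) if n != nodes],
    }


# The machine from the first post: 2 nodes with 12 cores each.
suggestion = npar_starting_point(nodes=2, cores_per_node=12)
print(suggestion)
```

With NPAR equal to the node count, each band is handled by the cores of a single node, which matches the argument that the major communication then stays within a node.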

NSIM is, as far as I understand, more connected to matrix operations, which are done outside VASP (MKL etc. take care of these). I would guess it also depends on the cache of your system (and the blocksize you set in your makefile). So there might be systems where you don't gain anything by setting NSIM higher. At least VASP seems to behave quite consistently on your system, so you have a good starting point for running jobs.

Danny

Posted: Thu May 06, 2010 10:28 pm
by vasp16888
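[quote="Danny"]Regarding NPAR: the communication in VASP is fairly constant across different calculations, so if you benchmark one decent system (the test job you just used), you know how it behaves for all calculations (with some exceptions, e.g. very small jobs). Your current setting probably arranges the communication so that the major communication happens only within a node, and it reduces the communication between nodes. In general the latter is slower, so your choice of NPAR = #nodes reduces it, which increases performance.

NSIM is, as far as I understand, more connected to matrix operations, which are done outside VASP (MKL etc. take care of these). I would guess it also depends on the cache of your system (and the blocksize you set in your makefile). So there might be systems where you don't gain anything by setting NSIM higher. At least VASP seems to behave quite consistently on your system, so you have a good starting point for running jobs.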

Danny[/quote]


Hi Danny, my cache is 512 KB and the memory is 16 GB/node. Is this big enough?
I mean, if I want to treat 100 atoms with VASP, how big a cache would be reasonable?
Thanks:)