1. Introduction
The present paper was prompted by some outstanding results that we recently obtained in a sequence of numerical and computational experiments applying parallel algorithms already available in the literature (the DVS-BDDC [1-8]). They are extraordinary because they contradict the generally accepted belief that in parallel computation the acceleration, or speedup, cannot be greater than the number of processors [9-22]. For example, in our numerical experiments using 400 processors in parallel we achieved a speedup of 29,278, which is 73.2 times greater than the maximum acceleration that such a belief allows.
In agreement with such a belief, the speedup goal sought in most, probably all, research that has been carried out in domain decomposition methods (DDM) up to now [9-20] is equal to the number of processors used. Since our results show that considerably larger speedups are feasible, we conclude that the speedup goal sought so far is too modest and restrictive; hence, it should be replaced by larger and more ambitious performance goals in future DDM research. To this end, we resort to the divide and conquer strategy, which is probably the most basic algorithmic paradigm for solving boundary-value problems of PDEs by parallel computation [23]. Furthermore, we formulate it in a manner that yields precise and clearly defined quantitative performance goals, to be called DC-goals, which are larger, yet realistic. The adequacy of the modified framework so obtained is verified by satisfactorily incorporating in it the outstanding results just mentioned.
It should be mentioned, before leaving this Section, that according to Gustafson's law the speedup is bounded above by the number of processors, a bound that is attained when linear speedup is achieved. However, beyond such limits, superlinear speedup may occur for a variety of reasons (see [24]); although it is not frequent, when it does occur it enhances the value of the software possessing it.
The paper is organized as follows. Section 2 presents some background material on the Derived-Vector Space (DVS) approach to DDM and on the DVS-BDDC algorithm [1-8]. The outstanding performance results that prompted this article are introduced and explained in Section 3. An inconsistency of standard approaches to DDM that such results exhibit is pointed out and discussed in Section 4, while Section 5 introduces some measures of performance whose conspicuous feature is that they are defined with respect to a performance goal.
The ideas and results contained in Sections 3 to 5 are then used in Sections 6 and 7 to show that both the concept of ideal parallel performance and the belief that the ideal parallel speedup is p lack firm bases. The "divide and conquer" algorithmic paradigm ([23], p. v), the DC-paradigm, is recalled and revised in Section 9, and a quantitative DC-performance goal, adequate for use in future DDM research, is derived from it. There, it is also shown that in the examples here discussed the latter performance goal is larger than p by a big factor; indeed, in the examples here treated, the DC-speedup goal is close to p^2, and the factor we are referring to is close to p^2/p = p. We recall that p is the number of processors used; hence, the factor is large when the number of processors is large.
When the extraordinary numerical and computational results that prompted this paper are incorporated in the DC-framework, they look completely normal, as shown in Section 10, since their DC-efficiencies, for p ≠ 1, range from 70.3% to 20.0%. Sections 11 and 12 are devoted to exhibiting the severe restrictions that the belief in the relation S(p,n) ≤ p has imposed on software developed under that assumption. Finally, Section 13 states this paper's conclusions.
2. Some background
Ismael Herrera and some of his coworkers have been working on domain decomposition methods (DDM) since 2002, when he organized and hosted the Fourteenth International Conference on Domain Decomposition Methods [23]. In their work on DDM [1-8], they pointed out that it is extremely inconvenient to use coarse meshes in which some of the nodes are shared by several subdomains because, when this is done, the system matrix is not block-diagonal. This, in turn, defeats one of the main objectives of the DDM strategy: processing in different processors the degrees of freedom belonging to different subdomains of the coarse mesh. Standard methods (i.e., methods that follow the canons prevailing at present) share this handicap and, to overcome it, we introduced the derived-vector space methodology (DVS methodology) [1-8].
The algebraic setting in which the DVS methodology was built is the derived-vector space. Briefly, the derived-vector space construction consists of the following steps [2]:
i) Firstly, the partial differential equation is discretized by means of a standard procedure on a fine mesh. This yields a system of linear discrete equations and a set of original nodes; when a coarse mesh is introduced, some of these nodes are shared by several subdomains;
ii) The original nodes are replaced by derived nodes. The nodes of this latter class have the property that each one of them belongs to one and only one of the subdomains. The whole set of derived nodes is decomposed into non-overlapping subsets, with the property that there is a one-to-one correspondence between such subsets and the subdomains of the coarse mesh;
iii) Then, the linear space of functions defined on the derived nodes constitutes the derived-vector space, which is provided with an algebraic structure suitable for effectively carrying out the developments required for constructing the DVS methodology;
iv) The concept of non-overlapping discretization is introduced [2]. The most significant and conspicuous property of such discretizations is that their application yields block-diagonal systems of equations;
v) A non-overlapping discretization, equivalent to the standard discretization applied in i), is used. This permits transforming the original system of discrete equations into another whose system matrix is block-diagonal (see the sketch following this list).
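To see concretely why a block-diagonal system matrix serves the parallel-processing objective, the following minimal sketch (ours, illustrative only, and not the DVS construction itself) solves each diagonal block independently, one block per process, with no communication between them; the block sizes and random matrices are assumptions made for the example.

```python
# Illustrative only: why a block-diagonal system matrix is convenient for
# parallel processing. Each diagonal block plays the role of one subdomain.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def solve_block(args):
    """Solve one diagonal block A_k x_k = b_k, independently of all others."""
    A_k, b_k = args
    return np.linalg.solve(A_k, b_k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, m = 4, 100        # subdomains (blocks) and unknowns per block (assumed)
    # Diagonally dominant random blocks standing in for local system matrices.
    blocks = [m * np.eye(m) + rng.standard_normal((m, m)) for _ in range(p)]
    rhs = [rng.standard_normal(m) for _ in range(p)]

    # Because the global matrix is block-diagonal, the p local solves are
    # fully independent and can be assigned to p different processors.
    with ProcessPoolExecutor(max_workers=p) as pool:
        x_blocks = list(pool.map(solve_block, zip(blocks, rhs)))
```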
Up to now, the DVS methodology has produced four DVS algorithms (see [2] for further details): DVS-FETI-DP, DVS-BDDC, DVS-PRIMAL and DVS-DUAL. The first two were obtained by mimicking the well-known FETI-DP and BDDC procedures in the derived-vector space; the big and very significant difference is that such procedures are applied after the differential equations have been subjected to a non-overlapping discretization, so that the discrete system of linear equations we start with is block-diagonal. The other two DVS algorithms, DVS-PRIMAL and DVS-DUAL, were produced by completing the theoretical framework (again, see [2]). So far, only the DVS-BDDC algorithm has been numerically tested; in 2016, preliminary computational experiments were published, which proved that the DVS-BDDC was fully competitive with the top DDM algorithms then available [1]. However, at that time we had not yet obtained the extraordinary results we are now reporting.
3. The outstanding results
More recently, in 2018, the authors developed a more careful code of the DVS-BDDC algorithm and tested it through a set of numerical experiments, obtaining the exceptional results that are presented and discussed in this Section. They are objectively outstanding because, for example, when the number of processors used is 400 the acceleration produced is 73.2 times 400; hence, in this application, the DVS-BDDC algorithm produces an acceleration 73.2 times larger than the largest possible according to canonical theory (i.e., theory that follows the canons prevailing at present).
More specifically, the computational experiments here reported consisted in treating a well-posed 2D problem for the Laplace differential operator on the highly parallel supercomputer "Miztli" of the National Autonomous University of Mexico (UNAM), using successively 1, 16, 25, 64, 256 and 400 processors. The notation used to report the numerical and computational results so obtained is given next:

$$S(p,n) \equiv \frac{T(1,n)}{T(p,n)} \qquad (3.1)$$

where T(p,n) denotes the execution time when p processors are applied to a problem of size n. Here, the size of the problem is equal to the number of degrees of freedom, which in turn is equal to the number of nodes of the fine mesh. In general, the execution time and the speedup are functions of the pair (p,n). In the set of numerical experiments here reported the size of the problem is kept fixed and equal to 10^6; i.e., n = 10^6.
The very impressive results of the numerical experiments are given in Table 1 (everywhere in this paper times are given in seconds), where the fifth column gives the speedup as a multiple of p, the number of processors, which in the standard theory of domain decomposition methods is thought to be an insurmountable speedup. However, in the set of experiments we are reporting, the speedup is much greater than the standard theory foresees when p ≠ 1; indeed, it exceeds such an upper bound by a large factor: 10.28, 15.02, 28.58, 57.18 and 73.2, when the number of processors is 16, 25, 64, 256 and 400, respectively. Observe that the factor increases with the number of processors, which is an enhancing feature. The last column is included here only for later use.
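To make the arithmetic behind these factors reproducible, the following minimal sketch recomputes the speedups and the S(p,n)/p multiples from the measured execution times (the T_DVS(p,n) column of Table 4); the script is ours and merely transcribes those timings.

```python
# Minimal sketch: recompute S(p,n) = T(1,n)/T(p,n) and the multiple S(p,n)/p
# from the measured DVS-BDDC execution times (seconds), n = 10**6.
times = {1: 29_278, 16: 178, 25: 78, 64: 16, 256: 2, 400: 1}  # p -> T(p,n)

t1 = times[1]                              # serial execution time T(1,n)
for p, t in times.items():
    speedup = t1 / t                       # S(p,n), Eq. (3.1)
    print(f"p={p:4d}  S(p,n)={speedup:9.1f}  S(p,n)/p={speedup / p:6.2f}")
```

For p = 16, 25, 64, 256 and 400 this prints the multiples 10.28, 15.02, 28.59, 57.18 and 73.20, matching the factors quoted above up to rounding.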
4. An inconsistency of standard DDM
The standard definition of efficiency is

$$E_S(p,n) \equiv \frac{S(p,n)}{p} \qquad (4.1)$$

and it is usually expressed as a percentage. The subscript S used here stands for "standard"; it is included for clarity, since alternative definitions will be introduced later.
Table 2, which follows, has been derived from Table 1 by expressing its last column in terms of the standard efficiency, E_S(p,n). By inspection of Table 2, where percentages much greater than 100%, such as 1,028%, 1,502%, 2,858%, 5,718% and 7,320%, occur, it is seen that the standard efficiency is not adequate for expressing the superlinear results of the numerical experiments we are reporting.
5. Revisiting the measures of performance
In this Section we define some measures of parallel-software performance that will be used in the sequel. As usual, such measures are based on the execution time required for completing a task; the shorter the better. According to Eq.(3.1), the notation T(p,n) means the execution time when the number of processors is p; in particular, T(1,n) is the execution time when only one processor is applied.
For the sake of clarity, we recall the definition of the speedup (or acceleration):

$$S(p,n) \equiv \frac{T(1,n)}{T(p,n)} \qquad (5.1)$$
The main objective in using a parallel computer is to get a simulation to finish faster than it would on one processor. Furthermore, let us take the position of a software designer who intends to develop software that performs well; so, he defines a performance goal he intends to achieve. The following two procedures for specifying such a goal will be considered: fixing the execution-time goal, T_G(p,n), or fixing the speedup goal, S_G(p,n). Assume either one of them has been specified; then the relative efficiency (relative to a performance goal) is defined by

$$E(p,n) \equiv \frac{S(p,n)}{S_G(p,n)} \qquad (5.2)$$

when S_G(p,n) is given, or

$$E(p,n) \equiv \frac{T_G(p,n)}{T(p,n)} \qquad (5.3)$$

when T_G(p,n) is given.
These two manners of defining the relative efficiency are equivalent if and only if

$$S_G(p,n)\,T_G(p,n) = T(1,n) \qquad (5.4)$$

Hence

$$S_G(p,n) = \frac{T(1,n)}{T_G(p,n)} \quad \text{and} \quad T_G(p,n) = \frac{T(1,n)}{S_G(p,n)} \qquad (5.5)$$
The first one of these equalities can be used to obtain S G (p,n) when T G (p,n) is given, and the second one, conversely.
According to Eq.(5.2),

$$E(p,n) < 1 \;\Longleftrightarrow\; S(p,n) < S_G(p,n) \qquad (5.6)$$
Here the symbol ⇔ stands for logical equivalence, i.e., "if and only if". Actually, when we choose a goal we do not know whether it is achievable, but the initial state satisfies S(p,n) < S_G(p,n), since S_G(p,n) is a desired state. Hence, at the beginning 1 - E(p,n) > 0, and this quantity may be taken as a distance to the goal. However, it can also happen that our developments lead to a speedup S(p,n) > S_G(p,n), since generally we do not know beforehand whether the speedup S_G(p,n) is an upper bound of those possible. When that happens, E(p,n) > 1.
Conversely, a corresponding argument can be made if the execution time and Eq.(5.3) are used to define the parallel efficiency. The main difference is that, in such a case, T(p,n) > T_G(p,n) at the beginning, and T(p,n) < T_G(p,n) is an indication that the goal has been exceeded.
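The definitions of Eqs. (5.1)-(5.5) translate directly into code. The following sketch (the helper names are ours, purely illustrative) checks, with the p = 16 measurements of Section 3 and an example speedup goal, that the two routes to the relative efficiency coincide when S_G(p,n) and T_G(p,n) are linked by Eq. (5.4):

```python
# Sketch of the Section 5 definitions; function names are illustrative.

def speedup(t1: float, tp: float) -> float:
    """Eq. (5.1): S(p,n) = T(1,n) / T(p,n)."""
    return t1 / tp

def efficiency_from_speedup_goal(t1: float, tp: float, s_goal: float) -> float:
    """Eq. (5.2): E(p,n) = S(p,n) / S_G(p,n)."""
    return speedup(t1, tp) / s_goal

def efficiency_from_time_goal(tp: float, t_goal: float) -> float:
    """Eq. (5.3): E(p,n) = T_G(p,n) / T(p,n)."""
    return t_goal / tp

t1, tp = 29_278.0, 178.0        # T(1,n) and T(16,n) from the experiments
s_goal = 233.9                  # an example speedup goal (the DC goal of Sec. 9)
t_goal = t1 / s_goal            # Eq. (5.5): T_G(p,n) = T(1,n) / S_G(p,n)

e1 = efficiency_from_speedup_goal(t1, tp, s_goal)
e2 = efficiency_from_time_goal(tp, t_goal)
assert abs(e1 - e2) < 1e-12     # the two definitions agree, per Eq. (5.4)
print(f"E(16, 10**6) = {e1:.1%}")   # about 70.3%
```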
6. The concept of "ideal parallel speedup"
In the literature on scientific parallel computing and on domain decomposition methods for the numerical solution of partial differential equations, the notion of "ideal parallel speedup" is used when defining an absolute efficiency. However, its definition lacks precision. When S_A(p,n) is the ideal parallel speedup, the relation

$$S(p,n) \le S_A(p,n) \qquad (6.1)$$

holds whenever S(p,n) is the acceleration obtained in a parallel computation. If we try to make this notion rigorous, we could say that S_A(p,n) is a supremum; but what is never made clear is of what set S_A(p,n) is the supremum.
Even so, when S_A(p,n) is the ideal parallel speedup, the absolute parallel efficiency is defined to be

$$E_A(p,n) \equiv \frac{S(p,n)}{S_A(p,n)} \qquad (6.2)$$

The subscript A used here stands for "absolute".
However, if we do not know for sure that Eq.(6.1) holds whenever S(p,n) is the acceleration obtained in a parallel computation, this is a risky definition. Indeed, if that is the case and there is an execution for which

$$S(p,n) > S_A(p,n) \qquad (6.3)$$

then we would claim that such an S(p,n) is not achievable, and we would be satisfied with an acceleration close to S_A(p,n), even if S_A(p,n) is much smaller than S(p,n).
7. The international DDM research goal
Generally, it is thought that Eq.(6.1) holds with S_A(p,n) = p; i.e.,

$$S(p,n) \le p \qquad (7.1)$$
Hence, the standard definition of efficiency of Eq.(4.1) becomes

$$E_S(p,n) = \frac{S(p,n)}{p} \qquad (7.2)$$
Comparing this equation with Eq.(5.2), it is seen that Eq.(7.2) implies that the speedup goal sought by DDM research worldwide is

$$S_S(p,n) = p \qquad (7.3)$$
Here, we have written S_S(p,n) for the speedup goal of standard DDM research. In view of the discussions here presented, this goal is too modest, and more ambitious goals should be sought in the future.
8. The relative DVS efficiency of standard approaches
In this Section we carry out a simple exercise in which we compute the relative efficiency of standard approaches when the speedup goal is that achieved by the DVS-BDDC algorithm in the numerical experiments here reported. The notation here adopted for such a relative efficiency is E_S^DVS(p,n).
Applying the definition of Eq.(5.2), we get

$$E_S^{DVS}(p,n) = \frac{S_S(p,n)}{S_{DVS}(p,n)} = \frac{p}{S_{DVS}(p,n)} \qquad (8.1)$$
Inspecting the results of our numerical experiments reported in the last column of Table 1, in view of Eq.(7.1) it is seen that the relative efficiency of standard approaches with respect to DVS-BDDC is only 9.7%, 6.7%, 3.5%, 1.7% and 1.4% in these experiments. Hence, the conclusion of this Section is that the speedup goals sought in DDM research worldwide up to now are too small and should be revised.
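The percentages just quoted follow mechanically from Eq. (8.1) and the speedups of Table 1; a short sketch (ours) reproduces them:

```python
# Relative efficiency of the standard goal S_S(p,n) = p with respect to the
# measured DVS-BDDC speedups, Eq. (8.1): E_S^DVS(p,n) = p / S_DVS(p,n).
s_dvs = {16: 164.5, 25: 375.4, 64: 1_829, 256: 14_639, 400: 29_278}
for p, s in s_dvs.items():
    print(f"p={p:4d}  E_S^DVS = {p / s:.1%}")  # 9.7%, 6.7%, 3.5%, 1.7%, 1.4%
```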
9. The speedup goal of the Divide and Conquer framework
As a starting point of this Section, we recall the divide and conquer algorithmic paradigm [23], which is frequently considered the leitmotiv of domain decomposition methods [21]. The divide and conquer strategy (DC-strategy) consists in dividing the domain of definition of the scientific or engineering model into small pieces and then sending each one of them to a different processor. If p is the number of subdomains of the domain decomposition, the size of each piece is approximately equal to n/p; hence, smaller than n when p > 1 and much smaller than n when p is large.
This is the procedure used by domain decomposition methods for reducing the size of the problems treated by each processor. It constitutes an application of the DC-strategy. Of course, for the divide and conquer strategy to be most effective it is necessary and sufficient that each one of the local problems be independent of all the others. Such a condition (each local problem being independent of all the others) is seldom fulfilled in practice, and it will be referred to as the DC-paradigm. Adopting the DC-paradigm as a guide in the development of software implies striving to construct algorithms in which the local problems are as independent of each other as possible. Thereby, we mention that the DVS methodology, which in the numerical experiments here reported has been so effective, was developed following the DC-paradigm.
Since the approximate size of each local problem is n/p, when all of them are independent T(1,n/p) would be the approximate execution time for each one of them, and when the computation is carried out in parallel it is also the global execution time. Therefore, in the DC-framework we define the execution-time goal (the DC-execution-time goal), denoted by T_DC(p,n), as

$$T_{DC}(p,n) \equiv T(1,n/p) \qquad (9.1)$$
Correspondingly, the speedup goal for the DC-approach is defined to be

$$S_{DC}(p,n) \equiv \frac{T(1,n)}{T_{DC}(p,n)} = \frac{T(1,n)}{T(1,n/p)} \qquad (9.2)$$
and the DC-efficiency is given by

$$E_{DC}(p,n) \equiv \frac{S(p,n)}{S_{DC}(p,n)} = \frac{T_{DC}(p,n)}{T(p,n)} \qquad (9.3)$$
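Under the independence assumption, the DC goals are computable from just two measured quantities, T(1,n) and T(1,n/p). A minimal sketch (our helper names, with the p = 16 measurements of Tables 1 and 4 plugged in as an example):

```python
# Sketch of the divide and conquer performance goals, Eqs. (9.1)-(9.3).

def dc_goals(t1_full: float, t1_local: float, tp: float):
    """t1_full = T(1,n); t1_local = T(1,n/p); tp = measured T(p,n)."""
    t_dc = t1_local                 # Eq. (9.1): T_DC(p,n) = T(1, n/p)
    s_dc = t1_full / t_dc           # Eq. (9.2): S_DC(p,n) = T(1,n) / T(1,n/p)
    e_dc = (t1_full / tp) / s_dc    # Eq. (9.3): E_DC(p,n) = S(p,n) / S_DC(p,n)
    return t_dc, s_dc, e_dc

# p = 16, n = 10**6: T(1,n) = 29,278 s, T(1,n/16) = 125.15 s, T(16,n) = 178 s.
t_dc, s_dc, e_dc = dc_goals(29_278.0, 125.15, 178.0)
print(f"T_DC = {t_dc} s, S_DC = {s_dc:.1f}, E_DC = {e_dc:.1%}")
# -> T_DC = 125.15 s, S_DC = 233.9, E_DC = 70.3%
```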
In Table 3, to illustrate the divide and conquer concepts, they have been computed under the conditions of the numerical experiments that prompted this paper. The first and second columns (counted from left to right) contain the number of processors and the degrees of freedom of the local problems, respectively. The third column gives the DC-execution-time goals of the local problems, which were obtained through numerical experiments; for each p, only one of the local problems was solved numerically (and only one of the processors was used). Once T_DC(p,n) was known, S_DC(p,n) was computed applying the formulas above. The local solvers used in our numerical experiments were banded LU decompositions, whose algorithmic complexity turned out to be p^2; it is given in the fifth column. An interesting fact, in the numerical experiments here reported, is that this algorithmic complexity approximates S_DC(p,n); the last column of Table 3 gives the corresponding relative errors, in percentage, associated with such an approximation.
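A plausible explanation of this fact, under assumptions we make explicit: for a naturally ordered 2D grid with n unknowns, a banded LU solver has bandwidth b ≈ √n and factorization cost O(n b^2) = O(n^2). Under this cost model,

$$T(1,n) \propto n^2 \quad\Longrightarrow\quad S_{DC}(p,n) = \frac{T(1,n)}{T(1,n/p)} \approx \frac{n^2}{(n/p)^2} = p^2,$$

which is precisely the p^2 law that Table 3 reports S_DC(p,n) to approximate.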
10. Incorporating the outstanding results in the DC-framework
In this Section the results of our numerical and computational experiments contained in Table 1 are incorporated in the DC-framework; Table 4, which follows, was built in this manner. The seventh column of Table 4 gives the DVS efficiency relative to the divide and conquer performance goal, E_DC(p,n); the last column gives it relative to the complexity of LU, p^2.
| p | p^2 | T_DVS(p,n) | S_DVS(p,n) | T_DC(p,n) | S_DC(p,n) | S_DVS(p,n)/S_DC(p,n) | S_DVS(p,n)/p^2 |
|---|-----|------------|------------|-----------|-----------|----------------------|----------------|
| 1 | 1 | 29,278 | 1 | 29,278 | 1 | 100% | 100% |
| 16 | 256 | 178 | 164.5 | 125.15 | 233.9 | 70.3% | 64.3% |
| 25 | 625 | 78 | 375.4 | 51.45 | 596.1 | 63.0% | 60.1% |
| 64 | 4,096 | 16 | 1,829 | 7.90 | 3,706 | 49.4% | 44.7% |
| 256 | 65,536 | 2 | 14,639 | 0.55 | 53,233 | 27.5% | 22.3% |
| 400 | 160,000 | 1 | 29,278 | 0.20 | 146,390 | 20.0% | 18.3% |
By inspection of this table, it is seen that the superlinear results that prompted this paper look perfectly normal when they are displayed in the DC-framework. This shows that the DC-framework is adequate for accommodating the outstanding numerical and computational results that we have obtained using the DVS-BDDC algorithm.
11. Restrictions on parallel performance imposed by the standard framework
The assumption S(p,n) ≤ p = S_S(p,n) is limitative, and in this Section, together with the next one, we explore more thoroughly the restrictions on parallel performance that it imposes.
To start with, the standard speedup goal, p, and the DC-speedup goal, S_DC(p,n), corresponding to the set of experiments we have been discussing are compared. Their ratios are shown in Table 5, where the values of S_DC(p,n) are taken from Table 4.
| p | S_S(p,n) | S_DC(p,n) | S_DC(p,n)/S_S(p,n) = S_DC(p,n)/p | S_S(p,n)/S_DC(p,n) = p/S_DC(p,n) |
|---|----------|-----------|----------------------------------|----------------------------------|
| 1 | 1 | 1 | 1 | 1 |
| 16 | 16 | 233.9 | 14.6 | 0.0685 |
| 25 | 25 | 596.1 | 23.8 | 0.0420 |
| 64 | 64 | 3,706 | 57.9 | 0.0173 |
| 256 | 256 | 53,233 | 207.9 | 0.0048 |
| 400 | 400 | 146,390 | 365.0 | 0.0027 |
By inspection of Table 5, it is seen that the standard speedup goals are much smaller than the DC-speedup goals, and probably too conservative and restrictive.
Table 6, which follows, shows the bounds on performance for any software that satisfies the restriction S(p,n) ≤ p. The last column of this table shows that such an assumption severely limits the DC-efficiency one can hope for when any of the standard methods is applied, including BDDC and FETI-DP [22].
| p | T_S(p,n) ≥ T(1,n)/p | S_S(p,n) ≤ p | T_DC(p,n) | S_DC(p,n) | E_S^DC(p,n) ≤ p/S_DC(p,n) |
|---|---------------------|--------------|-----------|-----------|---------------------------|
| 1 | - | - | 29,278 | 1 | 100% |
| 16 | T_S(16,10^6) ≥ 1,830 | S_S(16,10^6) ≤ 16 | 125.15 | 233.9 | E_S^DC(16,10^6) ≤ 6.85% |
| 25 | T_S(25,10^6) ≥ 1,171.1 | S_S(25,10^6) ≤ 25 | 51.45 | 596.1 | E_S^DC(25,10^6) ≤ 4.20% |
| 64 | T_S(64,10^6) ≥ 457.5 | S_S(64,10^6) ≤ 64 | 7.90 | 3,706 | E_S^DC(64,10^6) ≤ 1.73% |
| 256 | T_S(256,10^6) ≥ 114.40 | S_S(256,10^6) ≤ 256 | 0.55 | 53,233 | E_S^DC(256,10^6) ≤ 0.48% |
| 400 | T_S(400,10^6) ≥ 73.20 | S_S(400,10^6) ≤ 400 | 0.20 | 146,390 | E_S^DC(400,10^6) ≤ 0.27% |

(Note that S(p,n) ≤ p is equivalent to T(p,n) ≥ T(1,n)/p, which is why the execution-time bounds in the second column are lower bounds.)
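The bounds in Table 6 follow mechanically from the restriction S(p,n) ≤ p; a short sketch (ours) reproduces both the execution-time bounds and the DC-efficiency bounds:

```python
# Bounds implied by S(p,n) <= p (Table 6): T(p,n) >= T(1,n)/p and
# E_S^DC(p,n) <= p / S_DC(p,n); the S_DC values are taken from Table 4.
t1 = 29_278.0                                    # T(1,n), n = 10**6
s_dc = {16: 233.9, 25: 596.1, 64: 3_706, 256: 53_233, 400: 146_390}
for p, s in s_dc.items():
    print(f"p={p:4d}  T(p,n) >= {t1 / p:8.1f} s   E_S^DC <= {p / s:.2%}")
```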
12. Additional comparisons
To have a clearer appreciation of the relevance of the limitations imposed by the standard framework, which were established in Section 11, a direct comparison with the results obtained using the DVS-BDDC, which are given in Table 4, can help. Such a comparison is highlighted in Table 7.
| p | 16 | 25 | 64 | 256 | 400 |
|---|----|----|----|-----|-----|
| E_DC(p,n) | 70.3% | 63.0% | 49.4% | 27.5% | 20.0% |
| Bound for E_S^DC(p,n) | 6.85% | 4.20% | 1.73% | 0.48% | 0.27% |
In summary, for all the numerical and computational experiments here discussed, the efficiency one can hope to obtain using standard software is only a small fraction of that which is obtained when the DVS-BDDC algorithm is applied.
From all the above discussion, we draw the conclusion that adopting the definition S_S(p,n) = p, as is usually done in domain decomposition methods, is too conservative and drastically hinders the performance of methods developed within such a framework.
13. Conclusions
This paper communicates the outstanding results of numerical experiments in which the DVS-BDDC algorithm [2] yields superlinear speedups that exceed the number of processors by a large factor; 73.2 is the largest obtained in such experiments. From the results and the analysis here presented, the following conclusions are drawn:
1. The belief that the speedup (or acceleration) is always less than or equal to p (the number of processors) is incorrect; accelerations much larger than p are not only feasible, but have been achieved using the DVS-BDDC algorithm;
2. The performance goal that research on DDM has pursued up to now, besides being too small, has been very restrictive for the software developed in that framework; and
3. The divide and conquer framework here introduced is, by far, more adequate for accommodating the superlinear behavior of domain decomposition methods here reported.
Based on these conclusions, it is recommended that the divide and conquer framework be adopted in future research on the applications of parallel computation to the solution of partial differential equations. Then, the performance goal is defined in terms of the execution-time goal,

$$T_{DC}(p,n) \equiv T(1,n/p)$$

or of the speedup goal,

$$S_{DC}(p,n) \equiv \frac{T(1,n)}{T(1,n/p)}$$

or of the divide and conquer efficiency:

$$E_{DC}(p,n) \equiv \frac{S(p,n)}{S_{DC}(p,n)}$$