We have benchmarked our implementation with numerous inputs and machine configurations; the results of some of these experiments are shown below. The system environment in which the parallel program has been executed consists of (various subsets of) the following machines:
We have applied the CASA function pacPlot to four algebraic curves represented by the randomly generated bivariate polynomials listed in Figure * and have measured the execution times of the corresponding resultant computations (using the sequential resultant function provided by Maple and our parallel version). The results are listed in Figure * and visualized in Figure *.
Name     Tot.Deg.  Value
9-11-3   14        23x^4y^8 - 17x^6y^6 + 22x^8 + 84x^7y^7 + 20x^4y^6 - 49x^3y^8 - 80x^4y^2 - 5x^3y - 92x^3y^2 - 47y^8 + 5xy^5
10-7-2   13        55x^8y^2 + 66x^9y^4 + 27y^3 + 31x^2y^5 - 75x^2y^6 - 70y^8 - 97x^2y^9
10-8-2   18        -22x^9y^9 - 26x^2y^5 + 85x^8y - 50x^2y^6 - 74x^6y^4 - 31x^8y^7 - 172x^7y^7
12-7-2   20        99xy^9 - 12x^6y^2 - 26x^2y^11 - 47x^8y^11 - 61x^5y^8 - 90x^9y^11 + 94x^2y
                                    Time (s)
Processors   Performance   9-11-3   10-7-2   10-8-2   12-7-2
Sequential       1            634      214       84     1267
 1               0.73         958      317      122     1920
 2               1.73         386      135       53      746
 3               2.73         251       91       35      477
 4               3.46         192       72       36      370
 8               6.48         112       47       25      204
12               9.66          88       43       25      143
16              12.1           73       35       26      111
20              15.34          73       29       24       93
24              17.99          73       29       24       84
The largest speedup was 15 (compared to a sequential execution time of 1267 seconds) with 24 processors, which corresponds to an efficiency of more than 80% (relative to the total system performance). All four examples achieved an efficiency of at least 50% with 8 processors, and three examples still achieved an efficiency of at least 50% with 20 processors. The shortest-running example (sequential execution time of 84 seconds) achieved a maximum speedup of 3.5 with 8 processors (50% efficiency).
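For reference, the quoted figures follow the usual definitions: speedup is sequential time divided by parallel time, and efficiency is speedup divided by the aggregate system performance (the second column of the table above, in single-processor equivalents). A small sketch, with our own helper names rather than anything from Distributed Maple:

```python
def speedup(t_seq: float, t_par: float) -> float:
    """Classical speedup: sequential time over parallel time."""
    return t_seq / t_par

def efficiency(t_seq: float, t_par: float, performance: float) -> float:
    """Speedup normalized by aggregate system performance
    (in single-processor equivalents, as in the timing table)."""
    return speedup(t_seq, t_par) / performance

# Curve 12-7-2 on 24 processors: 1267 s sequential, 84 s parallel,
# aggregate performance 17.99.
print(round(speedup(1267, 84), 1))            # -> 15.1
print(round(efficiency(1267, 84, 17.99), 2))  # -> 0.84, i.e. more than 80%
```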
[Figure *: three diagrams of a program run, titled "Machines", "Tasks", and "Utilization"]
The visualization of a program run shown in Figure * (curve 12-7-2, 24 processors) gives an impression of the dynamic behavior of the algorithm (the diagrams have been produced from a trace file generated by the scheduler on issuing the Maple command dist[trace]). The left diagram shows the load of each machine, the middle diagram displays all tasks, and the right diagram shows the machine utilization during the computation of the resultant.
We see that a very short initial phase, during which only one machine is active (the sequential computation of four modular resultants), is followed by a long phase in which all machines compute modular resultants in parallel. This is followed by a shorter phase in which only some of the machines are active, combining the coefficients of the modular resultants by the Chinese Remainder Algorithm. Since tasks have been calibrated to run for about 10 seconds, there is not enough parallelism to keep all machines busy during this phase.
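The scheme described above can be sketched as follows. This is an illustrative Python reimplementation for univariate integer polynomials, not the Distributed Maple code itself (function names and prime choices are ours): each modular resultant is an independent task, computed as a Sylvester-matrix determinant over GF(p), and the Chinese Remainder Algorithm recombines the modular images into the integer result.

```python
def sylvester_resultant_mod(f, g, p):
    """Resultant of f and g (coefficient lists, highest degree first)
    modulo the prime p, as the determinant of the Sylvester matrix."""
    m, n = len(f) - 1, len(g) - 1
    size = m + n
    # Build the Sylvester matrix mod p: n shifted rows of f, m rows of g.
    M = [[0] * size for _ in range(size)]
    for i in range(n):
        for j, c in enumerate(f):
            M[i][i + j] = c % p
    for i in range(m):
        for j, c in enumerate(g):
            M[n + i][i + j] = c % p
    # Gaussian elimination over GF(p); determinant = product of pivots.
    det = 1
    for col in range(size):
        piv = next((r for r in range(col, size) if M[r][col]), None)
        if piv is None:
            return 0
        if piv != col:
            M[col], M[piv] = M[piv], M[col]
            det = -det % p          # row swap flips the sign
        det = det * M[col][col] % p
        inv = pow(M[col][col], p - 2, p)  # Fermat inverse (p prime)
        for r in range(col + 1, size):
            factor = M[r][col] * inv % p
            for c in range(col, size):
                M[r][c] = (M[r][c] - factor * M[col][c]) % p
    return det

def crt_combine(residues, primes):
    """Chinese Remainder combination, mapped to the symmetric range
    so that negative integer resultants are recovered correctly."""
    x, mod = 0, 1
    for r, p in zip(residues, primes):
        t = (r - x) * pow(mod, -1, p) % p
        x += mod * t
        mod *= p
    return x if x <= mod // 2 else x - mod

# Example: res(x^2 - 1, x - 2) = (1 - 2) * (-1 - 2) = 3.
primes = [101, 103, 107]   # in practice, many machine-word-sized primes
images = [sylvester_resultant_mod([1, 0, -1], [1, -2], p) for p in primes]
print(crt_combine(images, primes))  # -> 3
```

In the parallel version, each call to sylvester_resultant_mod is an independently schedulable task, while the recombination step admits far less parallelism, which matches the phase structure visible in the trace diagrams.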
The fact that some of the Chinese Remainder tasks seem to execute significantly longer than the other tasks can be attributed to exactly these tasks running on Linux PCs: while the raw (integer) processing power of these machines is greater than that of all the other machines, their communication bandwidth (external sockets and/or internal pipes) is apparently much lower. Since the Chinese Remaindering tasks receive very large input arguments, they start execution on these PCs later than on other machines and/or need more time to transfer their results. This is also the reason why we choose to run the initial Maple kernel of a Distributed Maple session on an (otherwise slower) Silicon Graphics workstation; this gives much better overall speedups. We are currently investigating whether installing newer versions of the Linux operating system kernel may help to overcome this communication bottleneck.
It is interesting to note that our implementation is based on parallelization constructs that correspond to those of the parallel computer algebra library PACLIB, in which a similar parallel version of the modular resultant algorithm was implemented in C [13]. Likewise, this parallel algorithm was implemented in the para-functional language pD, which was compiled to C code using the PACLIB runtime system [5]. The PACLIB implementation on a shared-memory multiprocessor (16 Intel 386 processors at 20 MHz) achieved a speedup of 11.5 for a problem whose sequential execution time was 180 seconds [13]. The pD implementation achieved a speedup of 11.5 for a problem whose sequential execution time was 160 seconds.
We may compare these results with the resultant computation that arises in the plotting of curve 10-7-2, where the sequential Maple call takes 214 seconds and the parallel implementation in Distributed Maple on a heterogeneous computer network achieves a speedup of 8 with 16 processors and of 10 with 24 processors. We are not aware of any other reports in the literature on speedups gained by parallel resultant computation in distributed environments.