
Parallel Resultant Computation

As discussed in Section *, one of the most time-consuming components of the CASA function pacPlot is the computation of multivariate polynomial resultants. In this section, we describe the problem, a sequential solution algorithm, a parallel version of this algorithm, and our implementation of this algorithm in Distributed Maple. We conclude by listing benchmark results that demonstrate the actual efficiency of the parallel solution.

Problem Statement

Let $A = \sum_{i=0}^{m} A_i x^i$ and $B = \sum_{i=0}^{n} B_i x^i$ be non-zero polynomials over an integral domain $I$, i.e.,

$$A, B \in I[x].$$

The Sylvester matrix of $A$ and $B$ is the $(m+n) \times (m+n)$ matrix

$$\begin{pmatrix}
A_m & A_{m-1} & \cdots & A_0 & & \\
 & \ddots & \ddots & & \ddots & \\
 & & A_m & A_{m-1} & \cdots & A_0 \\
B_n & B_{n-1} & \cdots & B_0 & & \\
 & \ddots & \ddots & & \ddots & \\
 & & B_n & B_{n-1} & \cdots & B_0
\end{pmatrix}$$

whose upper part consists of $n$ rows of the coefficients of $A$ and whose lower part consists of $m$ rows of the coefficients of $B$ (all entries not shown are zero). The resultant of $A$ and $B$ is the determinant of this matrix, i.e.,

$$\operatorname{Resultant}(A, B) \in I.$$
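
As a concrete check of this definition, the following Maple fragment (with polynomials chosen purely for illustration) compares the determinant of the Sylvester matrix, as constructed by the LinearAlgebra package, with Maple's built-in resultant; it should return true:

   with(LinearAlgebra):
   A := 2*x^3 + x - 1:                 # m = 3
   B := x^2 + 3*x + 2:                 # n = 2
   S := SylvesterMatrix(A, B, x):      # (m+n) x (m+n) Sylvester matrix
   evalb(Determinant(S) = resultant(A, B, x));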

In the context of the plotting of algebraic curves, we are interested in the special case where $A$ and $B$ are polynomials in $r$ variables over the integers, i.e., given

$$A, B \in \mathbb{Z}[x_1, \ldots, x_{r-1}][x_r]$$

we want to find

$$\operatorname{Resultant}(A, B) \in \mathbb{Z}[x_1, \ldots, x_{r-1}].$$

For plotting algebraic curves in two dimensions, we have to solve this problem for $r = 2$.
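
For instance (a toy example, not one of our benchmark inputs), eliminating $y$ from the polynomials describing the unit circle and the line $y = x$ yields a univariate polynomial whose roots are the x-coordinates of their common points:

   A := x^2 + y^2 - 1:       # unit circle
   B := y - x:               # line y = x
   resultant(A, B, y);       # 2*x^2 - 1, roots x = +-1/sqrt(2)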

Sequential Algorithm

Collins [12] describes an efficient sequential algorithm for multivariate polynomial resultant computation based on a modular approach: we use various prime numbers $p$ to map the coefficients of $A$ and $B$ into modular domains $\mathbb{Z}_p = \{0, \ldots, p-1\}$, i.e., we compute polynomials $a = \operatorname{Map}_p(A)$, $b = \operatorname{Map}_p(B)$ where

$$a, b \in \mathbb{Z}_p[x_1, \ldots, x_{r-1}][x_r].$$

We may only use primes $p$ such that $\deg(a) = \deg(A)$ and $\deg(b) = \deg(B)$, where $\deg(P)$ denotes the degree of polynomial $P$ in the main variable. Using modular arithmetic in $\mathbb{Z}_p$, we compute the modular resultant $c := \operatorname{Resultant}_p(a, b)$, i.e.,

$$c \in \mathbb{Z}_p[x_1, \ldots, x_{r-1}].$$

We then apply the Chinese Remainder Algorithm to lift the result from $\mathbb{Z}_p$ to $\mathbb{Z}_{P \cdot p}$, i.e.,

$$\operatorname{ChineseRemainder}(C, c, P, p) \in \mathbb{Z}_{P \cdot p}[x_1, \ldots, x_{r-1}]$$

where $P$ is the product of the primes used so far and $C$ is the combination of the corresponding modular resultants:

$$C \in \mathbb{Z}_P[x_1, \ldots, x_{r-1}].$$

The algorithm iterates until $P$ exceeds a predetermined bound $cb$ for the size of the coefficients of the resultant and then returns $C$. The algorithm is sketched below:
C := Resultant(A, B)
   m := degree(A); n := degree(B)
   cb := CoeffBound(A, B)
   C := 0; P := 1
   while P <= cb do
      p := a new prime number
      if A_m mod p <> 0 and B_n mod p <> 0 then
         a := Map_p(A); b := Map_p(B)
         c := Resultant_p(a, b)
         C := ChineseRemainder(C, c, P, p)
         P := P*p
      end
   end
end Resultant
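
A minimal executable Maple rendering of this sketch is given below; it is an illustration only, with the coefficient bound cb passed in as an argument, Map_p realized implicitly by mod, Resultant_p by Maple's inert Resultant evaluated over $\mathbb{Z}_p$, and the lifting by chrem (which combines polynomials coefficient-wise):

   ModularResultant := proc(A, B, x, cb)
     local p, c, C, P;
     C := 0; P := 1; p := 1;
     while P <= cb do
       p := nextprime(p);
       # use p only if it preserves the leading coefficients
       if lcoeff(A, x) mod p <> 0 and lcoeff(B, x) mod p <> 0 then
         c := Resultant(A, B, x) mod p;    # modular resultant over Z_p
         if P = 1 then
           C := c; P := p
         else
           C := chrem([C, c], [P, p]);     # lift from Z_P to Z_{P*p}
           P := P * p
         end if
       end if
     end do;
     mods(C, P)    # return the symmetric representation
   end proc: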

Parallel Algorithm

Various parallel versions of the modular algorithm have been investigated and implemented on shared memory machines [13] and on workstation networks [14]. Their common idea is to compute the resultants in the individual modular domains in parallel; they differ in their approaches to efficiently combining the modular resultants to yield the integer resultant. We apply the idea of [13], where the sequential structure of the combination phase is maintained but the individual coefficients of the resultant are computed in parallel.

The parallel algorithm starts, for each modular domain, a task that maps A and B into this domain and computes the coefficients of the modular resultant (up to a predetermined degree bound db). The task results are collected, and for each resultant coefficient a task is started that iteratively combines the corresponding modular coefficients. These results are in turn collected to build the overall result.

C := ResultantParallel(A, B)
   m := degree(A); n := degree(B)
   cb := CoeffBound(A, B)
   db := DegreeBound(A, B)
   C := 0; P := 1; T := [ ]; L := [ ]
   while P <= cb do
      p := a new prime number
      if A_m mod p <> 0 and B_n mod p <> 0 then
         t := start(ResultantModular, p, db, A, B)
         P := P*p; T := T || [t]; L := L || [p]
      end
   end
   R := WaitForAll(T)
   T := [ ]
   for i in [0...db] do
      t := start(ChineseRemainderCoeff, L,
         [ R[j]_i | j in [1...length(R)] ])
      T := T || [t]
   end
   C := WaitForAll(T)
end ResultantParallel
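
In Distributed Maple, the start and WaitForAll operations of this sketch translate into dist[start] and (mapped) dist[wait] calls. The following is a sketch of this translation for $r = 2$, not our actual implementation; it assumes that the task functions have been made known to all kernels of the session and that the bounds cb and db are supplied by the caller:

   # task body: all coefficients (in x) of the modular resultant mod p
   ResultantModular := proc(p, db, A, B, x, y)
     local c;
     c := Resultant(A, B, y) mod p;
     [seq(coeff(c, x, i), i = 0 .. db)]
   end proc:

   # task body: combine the images cs (one per prime in L) of one coefficient
   ChineseRemainderCoeff := proc(L, cs)
     chrem(cs, L)
   end proc:

   ResultantParallel := proc(A, B, x, y, cb, db)
     local p, P, T, L, R, i, j, t;
     P := 1; T := []; L := []; p := 1;
     while P <= cb do
       p := nextprime(p);
       if lcoeff(A, y) mod p <> 0 and lcoeff(B, y) mod p <> 0 then
         t := dist[start](ResultantModular, p, db, A, B, x, y);
         P := P * p; T := [op(T), t]; L := [op(L), p]
       end if
     end do;
     R := map(dist[wait], T);                 # collect the modular images
     # one combination task per resultant coefficient
     T := [seq(dist[start](ChineseRemainderCoeff, L,
               [seq(R[j][i+1], j = 1 .. nops(R))]), i = 0 .. db)];
     add(mods(dist[wait](T[i+1]), P) * x^i, i = 0 .. db)
   end proc: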
The structure of the algorithm is illustrated in Figure *.
[Figure: Parallel Resultant Computation]

Implementation

We have implemented the above algorithm for $r = 2$ using the Distributed Maple constructs dist[start] and dist[wait]. Furthermore, we have replaced the Maple function resultant/bivariate (which is called by Maple for computing the resultants of bivariate polynomials) by a version that uses our parallel implementation if the degrees of both input polynomials in the main variable are greater than five and the degree of one input polynomial is greater than ten. All other cases are still solved by sequential methods.
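
The dispatch just described can be pictured as follows (a sketch only: BivariateResultant is an illustrative name, and CoeffBound and DegreeBound stand for the abstract bound computations of the algorithm above):

   BivariateResultant := proc(A, B, x, y)
     local m, n;
     m := degree(A, y); n := degree(B, y);
     if m > 5 and n > 5 and (m > 10 or n > 10) then
       # parallel modular algorithm from the previous section
       ResultantParallel(A, B, x, y, CoeffBound(A, B), DegreeBound(A, B))
     else
       resultant(A, B, y)    # all other cases: sequential Maple
     end if
   end proc: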

The implementation differs from the above algorithm in that each task computes multiple modular resultants or, respectively, multiple coefficients of the integer resultant. By adjusting the number of elements computed by a task, we can effectively control its grain size.

For estimating the execution time of a "ResultantModular" task we use the complexity bound [12]

$$\left(1 + \max(\deg_r(A), \deg_r(B))\right) \cdot \prod_{i=1}^{r}\left(1 + \max(\deg_i(A), \deg_i(B))\right) \cdot \sum_{i=1}^{r}\left(1 + \max(\deg_i(A), \deg_i(B))\right)$$

where $\deg_i(P)$ denotes the degree of $P$ in the $i$-th variable. Based on this bound (multiplied by an experimentally determined constant that represents the processor speed), our implementation adjusts the number of modular images per task such that each task is expected to execute for roughly 10 seconds. Smaller grain sizes unduly increase the execution overhead, while larger grain sizes may cause significant load imbalances.
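
This grain size computation may be sketched in Maple as follows (ImagesPerTask and speed are illustrative names; vars lists the variables with the main variable $x_r$ last, and speed is assumed to convert the bound into seconds per modular image):

   ImagesPerTask := proc(A, B, vars, speed)
     local d, cost;
     d := [seq(1 + max(degree(A, v), degree(B, v)), v in vars)];
     cost := d[-1] * mul(e, e in d) * add(e, e in d);   # the bound above
     max(1, floor(10.0 / (speed * cost)))   # aim at tasks of ~10 seconds
   end proc: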

We also adjust the number of coefficients computed per "ChineseRemainderCoeff" task such that it processes, for any input, a constant number of modular coefficients. We have experimentally determined a value for this constant (1800) such that the execution time of the task roughly matches the execution time of a "ResultantModular" task.
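
Since each resultant coefficient has one modular image per prime, a task that combines k coefficients processes k times the number of primes modular coefficients; the chunk size is therefore chosen as sketched below (CoeffsPerTask is an illustrative name):

   CoeffsPerTask := proc(nprimes, target)
     # target modular coefficients per task; 1800 in our implementation
     max(1, floor(target / nprimes))
   end proc: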

The first four modular resultants are actually computed sequentially in order to derive an improved value for the degree bound such that fewer coefficients have to be considered in the rest of the algorithm.

Benchmark Results

We have benchmarked our implementation with numerous inputs and machine configurations; the results of some of these experiments are shown below. The system environment in which the parallel program has been executed consists of (various subsets of) a heterogeneous pool of Linux PCs, Silicon Graphics workstations, and an SGI Origin multiprocessor.

The whole environment comprises 24 processors whose total performance is 18 times that of a Linux PC at 333 MHz, which serves as the reference architecture for sequential execution. The Origin multiprocessor on the university campus (20 km from the site of our institute) is connected to the local area Ethernet of our institute (in which all other machines reside) by a high-speed ATM line.

We have applied the CASA function pacPlot to four algebraic curves represented by the randomly generated bivariate polynomials listed in Figure * and have measured the execution times of the corresponding resultant computations (using the sequential resultant function provided by Maple and our parallel version). The results are listed in Figure * and visualized in Figure *.

Name     Tot. Deg.  Value
9-11-3   14         23x^4y^8 - 17x^6y^6 + 22x^8 + 84x^7y^7 + 20x^4y^6 - 49x^3y^8 - 80x^4y^2 - 5x^3y - 92x^3y^2 - 47y^8 + 5xy^5
10-7-2   13         55x^8y^2 + 66x^9y^4 + 27y^3 + 31x^2y^5 - 75x^2y^6 - 70y^8 - 97x^2y^9
10-8-2   18         -22x^9y^9 - 26x^2y^5 + 85x^8y - 50x^2y^6 - 74x^6y^4 - 31x^8y^7 - 172x^7y^7
12-7-2   20         99xy^9 - 12x^6y^2 - 26x^2y^11 - 47x^8y^11 - 61x^5y^8 - 90x^9y^11 + 94x^2y

Benchmark Curves
Processors   Performance   Time (s)
                           9-11-3   10-7-2   10-8-2   12-7-2
Sequential   1             634      214      84       1267
1            0.73          958      317      122      1920
2            1.73          386      135      53       746
3            2.73          251      91       35       477
4            3.46          192      72       36       370
8            6.48          112      47       25       204
12           9.66          88       43       25       143
16           12.1          73       35       26       111
20           15.34         73       29       24       93
24           17.99         66       29       24       84

Benchmark Times
[Figure: Benchmark Results]

The largest speedup was 15 with 24 processors (a sequential execution time of 1267 seconds versus 84 seconds in parallel), which corresponds to an efficiency of more than 80% considering the total system performance (15/17.99 ≈ 83%). All four examples achieved an efficiency of at least 50% with 8 processors; three examples achieved an efficiency of at least 50% with 20 processors. The shortest-running example (sequential execution time of 84 seconds) achieved a maximum speedup of 3.5 with 8 processors (50% efficiency).

[Figure: Program Visualization (panels: Machines, Tasks, Utilization)]

The visualization of a program run shown in Figure * (curve 12-7-2, 24 processors) gives an impression of the dynamic behavior of the algorithm (the diagrams have been produced from a trace file generated by the scheduler on issuing the Maple command dist[trace]). The left figure shows the load of each machine, the middle figure displays all tasks, and the right figure shows the machine utilization during the computation of the resultant.

We see that a very short initial phase in which only one machine is active (the sequential computation of four modular resultants) is followed by a long phase in which all machines are active computing modular resultants in parallel. This phase is followed by a shorter phase in which only part of the machines are active, combining the coefficients of the modular resultants by the Chinese Remainder Algorithm. Since tasks have been calibrated to run for about 10 seconds, there is not enough parallelism to keep all machines active in this phase.

The fact that some of the Chinese Remainder tasks seem to execute significantly longer than other tasks can be attributed to the fact that exactly these tasks run on Linux PCs: while the raw (integer) processing power of these machines is higher than that of all other machines, their communication bandwidth (external sockets and/or internal pipes) is apparently much lower. Since the Chinese Remaindering tasks receive very large input arguments, they start execution on these PCs later than on other machines and/or need more time to transfer their results. This is also the reason why we chose to run the initial Maple kernel of a Distributed Maple session on an (otherwise slower) Silicon Graphics workstation; this gives much better overall speedups. We are currently investigating whether the installation of newer versions of the Linux operating system kernel may help to overcome this communication bottleneck.

It is interesting to note that our implementation is based on parallelization constructs that correspond to those of the parallel computer algebra library PACLIB, where a similar parallel version of the modular resultant algorithm was implemented in C [13]. Likewise, this parallel algorithm was implemented in the para-functional language pD, which was compiled to C code using the PACLIB runtime system [5]. The PACLIB implementation on a shared memory multiprocessor (16 Intel 386 processors at 20 MHz) achieved a speedup of 11.5 for a problem whose sequential execution time was 180 seconds [13]. The pD implementation achieved a speedup of 11.5 for a problem whose sequential execution time was 160 seconds.

We may compare these results with the resultant computation that arises in the plotting of curve 10-7-2, where the sequential Maple call takes 214 seconds and the parallel implementation in Distributed Maple on a heterogeneous computer network achieves a speedup of 8 with 16 processors and of 10 with 24 processors, respectively. We are not aware of any other reports in the literature on speedups gained by parallel resultant computation in distributed environments.


