Chapter 8. Loop Nest Optimization #pragma Directives

Table 8-1 contains an alphabetical list of the #pragma directives discussed in this chapter, along with a brief description of each and the compiler versions in which the directive is supported.

Table 8-1. Loop Nest Optimization #pragma Directives

#pragma

Short Description

Compiler Versions

#pragma aggressive inner loop fission

Tells the compiler to fission inner loops into as many loops as possible.

7.0 and later

#pragma blocking size

Sets the blocksize of the specified loop, if it is involved in a blocking for the primary (or secondary) cache.

7.0 and later

#pragma fission

Tells the compiler to fission the enclosing specified levels of loops after this directive.

7.0 and later

#pragma fissionable

Disables validity testing for loop fission.

7.0 and later

#pragma fusable

Disables validity testing for loop fusion.

7.0 and later

#pragma fuse

Tells the compiler to fuse the following n loops, which must be immediately adjacent.

7.0 and later

#pragma ivdep

Liberalizes dependence analysis. This applies only to inner loops. Given two memory references, where at least one is loop variant, ignore any loop-carried dependences between the two references.

6.0 and later

#pragma no blocking

Prevents the compiler from involving this loop in cache blocking.

7.0 and later

#pragma no fission

Keeps the following loop from being fissioned. Its innermost loops, however, are allowed to be fissioned.

7.0 and later

#pragma no fusion

Keeps the following loop from being fused with other loops.

7.0 and later

#pragma no interchange

Prevents the compiler from involving the loop directly following this directive (or any loop nested within this loop) in an interchange.

7.0 and later

#pragma prefetch

Specifies prefetching for each level of the cache. Scope: entire function containing the directive.

7.1 and later

#pragma prefetch_manual

Specifies whether manual prefetches (through #pragma directives) should be respected or ignored. Scope: entire function containing the directive.

7.1 and later

#pragma prefetch_ref

Generates a prefetch and connects it to the specified reference (if possible).

7.0 and later

#pragma prefetch_ref_disable

Disables prefetching for the specified reference in the current loop nest.

7.1 and later

#pragma unroll

Suggests to the compiler that a specified number of copies of the loop body be added to the inner loop. If the loop following this directive is an inner loop, then it indicates standard unrolling (version 7.2 and later). If the loop following this directive is not innermost, then outer loop unrolling (unroll and jam) is performed (version 7.0 and later).

7.0 and later


#pragma aggressive inner loop fission

The #pragma aggressive inner loop fission directive instructs the compiler to fission inner loops into as many loops as possible.

The syntax of the #pragma aggressive inner loop fission directive is as follows:

#pragma aggressive inner loop fission

The #pragma aggressive inner loop fission directive must be followed by an inner loop and has no effect if that loop is no longer inner after loop interchange.
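The effect of inner loop fission can be pictured with a hand-written before-and-after pair. The following is only a sketch: the function names are hypothetical, the `__sgi` guard is an assumption used to keep the directive away from compilers that do not recognize it, and the second function shows what a fissioned result could look like rather than the code the compiler actually generates.

```c
#include <stddef.h>

/* Original form: the inner loop carries two independent statements. */
void update_single(double *a, double *b, const double *c,
                   size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
#ifdef __sgi  /* assumption: guard so other compilers never see the directive */
#pragma aggressive inner loop fission
#endif
        for (size_t j = 0; j < cols; j++) {
            a[i * cols + j] = 2.0 * c[i * cols + j];
            b[i * cols + j] = c[i * cols + j] + 1.0;
        }
    }
}

/* Hand-written picture of the fissioned result: one inner loop per statement. */
void update_fissioned(double *a, double *b, const double *c,
                      size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++)
            a[i * cols + j] = 2.0 * c[i * cols + j];
        for (size_t j = 0; j < cols; j++)
            b[i * cols + j] = c[i * cols + j] + 1.0;
    }
}
```

Both functions compute the same values; only the loop structure differs.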

#pragma blocking size

The #pragma blocking size directive sets the blocksize of the specified loop.

The syntax of the #pragma blocking size directive is as follows:

#pragma blocking size [n1, n2]

If the specified loop is involved in blocking for the primary cache, it is given a blocksize of n1; if it is involved in blocking for the secondary cache, it is given a blocksize of n2. The compiler tries to include this loop within such a block. If a blocking size of 0 is specified, the loop is not stripped; instead, the entire loop is placed inside the block.

Example 8-1. #pragma blocking size

In the following code, the compiler makes 20 × 20 blocks when blocking:

void amat (double **x, double **y, double **z, int n, int m, int mm)
{
  int i, j, k;

  /* z = z + x * y, with the j and i loops each blocked in strips of 20 */
  for (k = 0; k < n; k++)
  {
    #pragma blocking size (20)
    for (j = 0; j < m; j++)
    {
      #pragma blocking size (20)
      for (i = 0; i < mm; i++)
        z[i][k] = z[i][k] + x[i][j] * y[j][k];
    }
  }
}
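To make the idea of a 20 × 20 block concrete, the following sketch writes the blocking out by hand. It is an illustration under stated assumptions, not the code the compiler produces: the matrices are stored as flat row-major arrays, and the names `amat_naive`, `amat_blocked`, and `BLOCK` are hypothetical.

```c
#include <stddef.h>

#define BLOCK 20  /* block size, matching the value used in Example 8-1 */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Unblocked reference: z (mm x n) += x (mm x m) * y (m x n), row-major. */
void amat_naive(const double *x, const double *y, double *z,
                size_t n, size_t m, size_t mm)
{
    for (size_t k = 0; k < n; k++)
        for (size_t j = 0; j < m; j++)
            for (size_t i = 0; i < mm; i++)
                z[i * n + k] += x[i * m + j] * y[j * n + k];
}

/* Same computation with the j and i loops stripped into BLOCK-sized pieces.
   Each (jj, ii) pair selects one block of x that is reused for every k. */
void amat_blocked(const double *x, const double *y, double *z,
                  size_t n, size_t m, size_t mm)
{
    for (size_t jj = 0; jj < m; jj += BLOCK)
        for (size_t ii = 0; ii < mm; ii += BLOCK)
            for (size_t k = 0; k < n; k++)
                for (size_t j = jj; j < min_size(jj + BLOCK, m); j++)
                    for (size_t i = ii; i < min_size(ii + BLOCK, mm); i++)
                        z[i * n + k] += x[i * m + j] * y[j * n + k];
}
```

The `min_size` calls handle loop bounds that are not multiples of the block size.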


#pragma no blocking

The #pragma no blocking directive prevents the compiler from involving this loop in cache blocking.

The syntax of the #pragma no blocking directive is as follows:

#pragma no blocking

#pragma fission

The #pragma fission directive instructs the compiler to fission the enclosing n levels of loops after this directive.

The syntax of the #pragma fission directive is as follows:

#pragma fission [n]

The default for n is 1. The compiler performs a validity test unless #pragma fissionable is also specified. The compiler does not reorder statements.
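A possible use of the directive with n = 2 is sketched below. Several things here are assumptions rather than facts taken from this chapter: that the directive is written at the split point inside the loop body, that the argument is written in parentheses as in Example 8-1, and that the `__sgi` guard is a suitable way to hide the directive from other compilers. The function names are hypothetical, and the second function is a hand-written picture of the result.

```c
#include <stddef.h>

/* A two-level nest with a candidate split point between its two statements. */
void nest_original(double *a, double *b, const double *c,
                   size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            a[i * cols + j] = 0.5 * c[i * cols + j];
#ifdef __sgi  /* assumption: guard for compilers without this directive */
#pragma fission (2)
#endif
            b[i * cols + j] = a[i * cols + j] + c[i * cols + j];
        }
    }
}

/* Both enclosing levels are split; statement order is preserved. */
void nest_fissioned(double *a, double *b, const double *c,
                    size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            a[i * cols + j] = 0.5 * c[i * cols + j];

    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            b[i * cols + j] = a[i * cols + j] + c[i * cols + j];
}
```

The split is valid here because each element of b reads only the element of a with the same index, which the first nest has already written.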

#pragma fissionable

The #pragma fissionable directive disables validity testing for loop fissioning.

The syntax of the #pragma fissionable directive is as follows:

#pragma fissionable

#pragma no fission

The #pragma no fission directive instructs the compiler not to fission the loop directly following this directive. Any inner loops, however, are allowed to be fissioned.

The syntax of the #pragma no fission directive is as follows:

#pragma no fission

#pragma fuse

The #pragma fuse directive instructs the compiler to fuse the specified number of immediately adjacent loops.

The syntax of the #pragma fuse directive is as follows:

#pragma fuse [num, level]

The loops to be fused must immediately follow the #pragma fuse directive.

The default value for num is 2. Fusion is attempted on each pair of adjacent loops and the level, by default, is determined by the maximal perfectly nested loop levels of the fused loops, although partial fusion is allowed. Iterations may be peeled as needed during fusion; the limit of this peeling is 5 or the number specified by the -LNO:fusion_peeling_limit option. No fusion is done for non-adjacent outer loops.

When the #pragma fusable directive is present, no validity test is done and the fusion is done up to the maximal common levels.
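The following sketch illustrates fusion and peeling by hand. It assumes the parenthesized argument form used in Example 8-1 and an `__sgi` guard to hide the directive from other compilers; all function names are hypothetical, and the fused versions are hand-written pictures of a possible result, not compiler output.

```c
#include <stddef.h>

/* Two immediately adjacent loops with identical bounds. */
void separate_loops(double *a, double *b, const double *c, size_t n)
{
    size_t i;
#ifdef __sgi  /* assumption: guard for compilers without this directive */
#pragma fuse (2)
#endif
    for (i = 0; i < n; i++)
        a[i] = c[i] + 1.0;
    for (i = 0; i < n; i++)
        b[i] = a[i] * 2.0;
}

/* Fused form: one pass over the data instead of two. */
void fused_loops(double *a, double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = c[i] + 1.0;
        b[i] = a[i] * 2.0;
    }
}

/* If the second loop ran only to n - 1, the common iterations could still be
   fused, with the last iteration of the first loop peeled off. */
void fused_with_peel(double *a, double *b, const double *c, size_t n)
{
    size_t common = n > 0 ? n - 1 : 0;
    for (size_t i = 0; i < common; i++) {
        a[i] = c[i] + 1.0;
        b[i] = a[i] * 2.0;
    }
    if (n > 0)
        a[n - 1] = c[n - 1] + 1.0;  /* peeled iteration */
}
```

In `fused_with_peel`, a single peeled iteration is enough, which is well within the default peeling limit of 5.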

#pragma fusable

The #pragma fusable directive disables validity testing for loop fusing.

The syntax of the #pragma fusable directive is as follows:

#pragma fusable

#pragma no fusion

The #pragma no fusion directive instructs the compiler that the loop following this directive should not be fused with other loops.

The syntax of the #pragma no fusion directive is as follows:

#pragma no fusion

#pragma no interchange

The #pragma no interchange directive prevents the compiler from involving the next loop in an interchange. This directive also applies to any loop nested within the indicated loop.

The syntax of the #pragma no interchange directive is as follows:

#pragma no interchange

The pragma directive statement must immediately precede the loop to which it applies.
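As a minimal placement sketch, the directive below pins a loop order that already walks a row-major array with unit stride. The function name `sum_rows` is hypothetical, and the `__sgi` guard is an assumption used to hide the directive from compilers that do not recognize it.

```c
#include <stddef.h>

/* Sums each row of a row-major rows x cols array into row_sum. */
void sum_rows(const double *a, size_t rows, size_t cols, double *row_sum)
{
#ifdef __sgi  /* assumption: guard for compilers without this directive */
#pragma no interchange
#endif
    for (size_t i = 0; i < rows; i++) {
        row_sum[i] = 0.0;
        for (size_t j = 0; j < cols; j++)   /* unit-stride inner loop */
            row_sum[i] += a[i * cols + j];
    }
}
```

Because the directive also covers nested loops, the inner j loop stays innermost.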

#pragma ivdep

The #pragma ivdep directive instructs the compiler to liberalize dependence analysis.

The syntax of the #pragma ivdep directive is as follows:

#pragma ivdep

Given two memory references, where at least one is loop variant, this directive instructs the compiler to ignore any loop-carried dependences between the two references. The #pragma ivdep directive applies only to inner loops. If #pragma ivdep is used on a loop that has an inner loop, the compiler ignores it.

Example 8-2. #pragma ivdep

The following are some examples of the use of #pragma ivdep:

  • ivdep does not break the dependence because b[k] is not loop variant:

    #pragma ivdep   
    for (i = 0; i < n; i++)   
      b[k] = b[k] + a[i];

  • ivdep breaks the dependence, but the compiler warns the user that it is breaking an obvious dependence:

    #pragma ivdep   
    for (i = 0; i < n; i++)   
      a[i] = a[i-1] + 3.0;

  • ivdep breaks the dependence:

    #pragma ivdep   
    for (i = 0; i < n; i++)   
      a[b[i]] = a[b[i]] + 3.0;

  • ivdep does not break the dependence on a[i] because it is within an iteration:

    #pragma ivdep   
    for (i = 0; i < n; i++)   
    {   
      a[i] = b[i]; 
      c[i] = a[i] + 3.0; 
    }   

If -OPT:cray_ivdep=TRUE is specified, ivdep instructs the compiler to use Cray semantics and break all backward dependences:

  • ivdep breaks the dependence but the compiler warns the user that it is breaking an obvious dependence:

    #pragma ivdep   
    for (i = 0; i < n; i++)   
    {   
      a[i] = a[i - 1] + 3.0;   
    }   

  • ivdep does not break the dependence, because it is from the load to the store, and the load comes lexically before the store:

    #pragma ivdep   
    for (i = 0; i < n; i++)   
    {   
      a[i] = a[i + 1] + 3.0;   
    }   

To break all dependences, specify the following: -OPT:liberal_ivdep=TRUE.
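For the indirect case a[b[i]], breaking the dependence is harmless only when no two iterations touch the same element of a, that is, when b contains no repeated index. The following sketch shows one way to check that condition before relying on the directive. The helper `indices_are_distinct` and the function `scatter_add` are hypothetical names, and the `__sgi` guard is an assumption used to hide the directive from other compilers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns true when every b[i] lies in [0, range) and no value repeats. */
bool indices_are_distinct(const int *b, size_t n, size_t range)
{
    bool ok = true;
    unsigned char *seen = calloc(range ? range : 1, 1);
    if (seen == NULL)
        return false;  /* treat allocation failure as "cannot confirm" */
    for (size_t i = 0; i < n && ok; i++) {
        if (b[i] < 0 || (size_t)b[i] >= range || seen[b[i]])
            ok = false;
        else
            seen[b[i]] = 1;
    }
    free(seen);
    return ok;
}

/* The loop from the example above; the caller is responsible for making
   sure that b holds distinct indices. */
void scatter_add(double *a, const int *b, size_t n)
{
#ifdef __sgi  /* assumption: guard for compilers without this directive */
#pragma ivdep
#endif
    for (size_t i = 0; i < n; i++)
        a[b[i]] = a[b[i]] + 3.0;
}
```

If b does contain duplicates, the loop carries a real dependence, and a compiler told to ignore it may reorder the updates.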


#pragma prefetch

The #pragma prefetch directive specifies prefetching for each level of the cache.

The syntax of the #pragma prefetch directive is as follows:

#pragma prefetch [n1, n2] 

n1 controls the level 1 cache; n2 controls level 2. n1 and n2 can have the following values:

  • 0: prefetching is off (default for all processors except R10000)

  • 1: prefetching is on but conservative (default at -O3 when prefetch is on)

  • 2: prefetching on and aggressive

The scope of this directive is the entire function that contains it.

#pragma prefetch_manual

The #pragma prefetch_manual directive instructs the compiler as to whether manual prefetches (through #pragma directives) should be respected or ignored.

The syntax of the #pragma prefetch_manual directive is as follows:

#pragma prefetch_manual [n]

n can have a value of 0 (the compiler ignores manual prefetches; this is the default for all processors except R10000) or 1 (the compiler respects manual prefetches; default at -O3 for R10000 and beyond).

The scope of this directive is the entire function that contains it.

#pragma prefetch_ref

The #pragma prefetch_ref directive generates a prefetch and connects it to the specified reference (if possible).

The syntax of the #pragma prefetch_ref directive is as follows:

#pragma prefetch_ref = ref [, stride = num1 [, num2]] 
[, level = [lev1][, lev2]] 
[, kind = {rd|wr}] 
[, size = sz] 

ref is the object you want prefetched.

Table 8-2 describes each of the possible #pragma prefetch_ref clauses. These clauses are optional.

Table 8-2. Clauses for #pragma prefetch_ref

Clause

Effect

Default Value

stride

Prefetches every num iteration(s) of this loop.

1

level

Specifies the level in the memory hierarchy to prefetch. The possible values for level are 1 (prefetch from L2 to L1 cache) and 2 (prefetch from memory to L1 cache).

2

kind

Specifies read or write.

write

size

Specifies the size (in KB) of the object referenced in this loop. Must be a constant.

N/A

The #pragma prefetch_ref directive instructs the compiler to take the following actions:

  • Generate a prefetch and connect to the specified object (if possible).

  • Search for references in the current loop-nest that match the supplied object.

    • If such a reference is found, then the prefetch for that object is scheduled relative to the prefetch node, based on the miss latency for the specified level of the cache.

    • If no such reference is found, the prefetch is generated at the start of the loop body.

  • Ignore all references by the automatic prefetcher (if enabled) to this variable in this loop-nest.

  • Have the automatic prefetcher (if enabled) use the supplied size (if specified) in its volume analysis for this object.

This directive has no scope; it just generates a prefetch.
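The idea behind a strided manual prefetch can be sketched with a hand-written analog. This example does not use the directive itself: it assumes a GCC-compatible compiler and uses its `__builtin_prefetch` built-in to request a read prefetch once every few iterations, a fixed distance ahead of the current element. The names `PREFETCH_STRIDE`, `PREFETCH_AHEAD`, and `sum_with_prefetch` are hypothetical, and the distances are placeholders, not tuned values.

```c
#include <stddef.h>

#define PREFETCH_STRIDE 8   /* hypothetical: issue one prefetch per 8 iterations */
#define PREFETCH_AHEAD  64  /* hypothetical: prefetch distance, in elements */

#if defined(__GNUC__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)  /* 0 = read access */
#else
#define PREFETCH_READ(p) ((void)0)  /* no-op where the built-in is missing */
#endif

/* Roughly the intent of a read prefetch on a[i + 64] with a stride of 8. */
double sum_with_prefetch(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i % PREFETCH_STRIDE == 0 && i + PREFETCH_AHEAD < n)
            PREFETCH_READ(&a[i + PREFETCH_AHEAD]);
        total += a[i];
    }
    return total;
}
```

A prefetch is only a hint, so the function returns the same sum whether or not any prefetch is issued.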

#pragma prefetch_ref_disable

The #pragma prefetch_ref_disable directive explicitly disables prefetching for the specified reference (in the current loop nest).

The syntax of the #pragma prefetch_ref_disable directive is as follows:

#pragma prefetch_ref_disable = ref [, size = num]

  • ref is the object for which you want to disable prefetching.

  • num specifies the size (in KB) of the object referenced in this loop (optional). The size must be a constant. This explicitly disables the prefetching of all references to object ref in the current loop nest. If enabled, the auto-prefetcher runs but ignores ref. The size is used for volume analysis.

The scope of this directive is the entire function containing it.

#pragma unroll

The #pragma unroll directive suggests to the compiler the type of unrolling that should be done.

The syntax of the #pragma unroll directive is as follows:

#pragma unroll [n]

This directive instructs the compiler to add n-1 copies of the loop body to the inner loop. If the loop that this directive immediately precedes is an inner loop, then it indicates standard unrolling (version 7.2 and later). If the loop that this directive immediately precedes is not innermost, then outer loop unrolling (unroll and jam) is performed (version 7.0 and later).

The value of n must be at least 1. If it is 1, then unrolling is not performed.


Caution: The #pragma unroll directive works only on loops that are legal to unroll. Loops are often not unrollable in C because of potential aliasing. In these cases, you may want to use restrict pointers or the option -OPT:alias=disjoint (see the C Language Reference Manual for more information on restrict pointers). When -OPT:alias=disjoint is specified, distinct pointer expressions are assumed to point to distinct, non-overlapping objects.

-OPT:alias=disjoint is unsafe and may cause existing C programs to fail in obscure ways, so it should be used with extreme care.
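The restrict-pointer alternative can be sketched as follows. The function names are hypothetical, the parenthesized argument follows the form used in Example 8-3, and the `__sgi` guard is an assumption used to hide the directive from other compilers. The second function is a hand-written picture of an inner loop unrolled by 4, including the cleanup loop needed when the trip count is not a multiple of 4.

```c
#include <stddef.h>

/* restrict promises that dst and src do not overlap, which removes the
   aliasing doubt that can make the loop illegal to unroll. */
void add_arrays(double *restrict dst, const double *restrict src, size_t n)
{
#ifdef __sgi  /* assumption: guard for compilers without this directive */
#pragma unroll (4)
#endif
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* Hand-unrolled equivalent: four copies of the body, then a cleanup loop. */
void add_arrays_unrolled(double *restrict dst, const double *restrict src,
                         size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     += src[i];
        dst[i + 1] += src[i + 1];
        dst[i + 2] += src[i + 2];
        dst[i + 3] += src[i + 3];
    }
    for (; i < n; i++)  /* remaining 0 to 3 iterations */
        dst[i] += src[i];
}
```

Unlike -OPT:alias=disjoint, the restrict qualifier applies only to the pointers that carry it, so the promise stays local to this function.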


Example 8-3. #pragma unroll

The following code samples show the effect of using #pragma unroll. The code in Sample 1 becomes Sample 2, not Sample 3:

  • Sample 1:

    #pragma unroll (2) 
    for (i = 0; i < 10; i++) 
    { 
      for (j = 0; j < 10; j++)
      {
        a[i][j] = a[i][j] + b[i][j]; 
      } 
    } 

  • Sample 2:

    for (i = 0; i < 10; i += 2) 
    { 
      for (j = 0; j < 10; j++) 
      { 
        a[i][j] = a[i][j] + b[i][j]; 
        a[i+1][j] = a[i+1][j] + b[i+1][j]; 
      } 
    } 

  • Sample 3:

    for (i = 0; i < 10; i += 2) 
    { 
      for (j = 0; j < 10; j++) 
        a[i][j] = a[i][j] + b[i][j]; 
      for (j = 0; j < 10; j++) 
      {
        a[i+1][j] = a[i+1][j] + b[i+1][j]; 
      }
    } 

    The #pragma unroll directive is attached to the given loop, so that if an interchange is performed, the corresponding loop is still unrolled. That is, Sample 1 is equivalent to the following:

    #pragma interchange
    for (j = 0; j < 10; j++)
    {
      #pragma unroll (2)
      for (i = 0; i < 10; i++)
      a[i][j] = a[i][j] + b[i][j];
    }
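Sample 2 works without extra code because the outer trip count (10) is a multiple of the unroll factor (2). For a general row count, a hand-written unroll and jam also needs a cleanup loop for the leftover row. The following is a sketch of that shape under stated assumptions: the names `COLS`, `add_reference`, and `add_unroll_and_jam` are hypothetical, and the code is not meant to represent what the compiler generates.

```c
#include <stddef.h>

#define COLS 10  /* hypothetical fixed row length, as in the samples above */

/* Straightforward nest, equivalent to Sample 1 without the directive. */
void add_reference(double a[][COLS], double b[][COLS], size_t rows)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] = a[i][j] + b[i][j];
}

/* Outer loop unrolled by 2 with both copies jammed into one inner loop
   (the Sample 2 shape), followed by a cleanup loop for an odd last row. */
void add_unroll_and_jam(double a[][COLS], double b[][COLS], size_t rows)
{
    size_t i = 0;
    for (; i + 2 <= rows; i += 2) {
        for (size_t j = 0; j < COLS; j++) {
            a[i][j]     = a[i][j]     + b[i][j];
            a[i + 1][j] = a[i + 1][j] + b[i + 1][j];
        }
    }
    for (; i < rows; i++)  /* runs at most once */
        for (size_t j = 0; j < COLS; j++)
            a[i][j] = a[i][j] + b[i][j];
}
```

With an even row count the cleanup loop never executes, and the function reduces to the Sample 2 form.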