Writing Schedules

Below are some tips for writing Halide schedules.

Schedule

While the algorithm defines the output pixels, the schedule determines how the computation runs on the specified hardware. These decisions can be used to improve the runtime, code size, or resource utilization of the end application.

Schedule Structure

The schedule typically follows a structure where the output is scheduled first, and the schedule then proceeds "backwards" toward the input. This is done because some of the scheduling primitives, such as compute_at, refer to their consumer (output) Func.

Another quality to note is that the Halide scheduling language is not executed sequentially: all of the scheduling calls apply to the program as a whole, and the order of the calls does not matter much.

Furthermore, while the Halide scheduling language has its own syntax, it is embedded in C++. This means that you can use ordinary C++ constructs such as printf and if statements to control how the schedule is constructed.
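
For example, a minimal sketch of using plain C++ to parameterize a schedule (the use_wide_config flag and the unroll factor are hypothetical; output and x refer to the pipeline's output Func and its innermost variable):

      // hypothetical: choose an unroll factor from an ordinary C++ variable
      int unrollFactor = use_wide_config ? 4 : 1;
      printf("scheduling output with unroll factor %d\n", unrollFactor);
      if (unrollFactor > 1) {
        output.unroll(x, unrollFactor);
      }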

Targets

After the application algorithm, one should specify the schedule for different hardware targets. Typically, each hardware target requires its own schedule due to the different metrics and constraints of the hardware. For example, a CPU and a custom CGRA accelerator have very different memory hierarchies and sizes, so the corresponding tiling will use different sizes.

Each of these schedules usually exist in an if/else if/else block:

  if (get_target().has_feature(Target::Clockwork)) {  ...
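
A skeleton of such a block might look like the following (the CUDA branch is just an illustrative second target):

  if (get_target().has_feature(Target::Clockwork)) {
    // Clockwork (CGRA) schedule
    ...
  } else if (get_target().has_feature(Target::CUDA)) {
    // GPU schedule
    ...
  } else {
    // CPU schedule
    ...
  }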

Clockwork Target Schedules

The Clockwork target schedule is the one used to target the CGRA. The memory file is sent to Clockwork, while the compute is codegen'ed into a json file. Some of the scheduling primitives are new for hardware, and some existing primitives (such as unroll) have a different meaning for hardware targets.

Bounding Image Sizes

The image sizes should be bounded, since the hardware needs to know the sizes at compile time. This allows the memories to have a determined size. To bound the output width and height, use bound like:

      const int outputWidth  = 64;
      const int outputHeight = 128;
      output.bound(x, 0, outputWidth);
      output.bound(y, 0, outputHeight);

Sometimes, input sizes must also be bounded if they cannot be determined from the output size. For example, the number of input channels for a DNN layer does not depend on the output, so an input bound is needed.
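
As a sketch, assuming the input has been wrapped in a Func (here called hw_input) with a channel variable z, the channel extent could be bounded like:

      const int inputChannels = 16;   // hypothetical channel count
      hw_input.bound(z, 0, inputChannels);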

Defining Hardware Boundaries

The hardware accelerator is part of an application that runs on the host processor. Scheduling primitives are used to define the transfers to and from the accelerator. These primitives are stream_to_accelerator and hw_accelerate. stream_to_accelerator is used on each input. It takes no arguments, and it is simply checked in the compiler to ensure that the correct accelerator input is defined when creating the accelerator.

The output is defined by hw_accelerate(compute_var, store_var). The Func that has this scheduling call is the output of the accelerator. The first argument defines which variable corresponds to a single cycle on the hardware (it is otherwise unused). The second argument, store_var, is the first loop that is performed outside the accelerator. For example:

       hw_output
         .tile(x, y, xo, yo, xi, yi, 64, 128)
         .hw_accelerate(xi, xo);
       hw_input
         .accelerator_input();

In this example, the loops xi and yi are of size 64 and 128 respectively, and are executed on the CGRA. The outer loops, xo and yo, are performed outside the accelerator. This effectively means that the accelerator is called xo * yo times to calculate the full output.

Splitting, Reordering, and Tiling

There are scheduling primitives to split loops and determine the loop order. split is used to take a single loop variable and construct two nested loops from it. For example:

      output_glb
         .split(w, w_glb, w_cgra, 8);

In this example, the w loop is decomposed into a w_cgra loop of size 8 and a w_glb loop that covers the rest of the loop iteration. Every 8 iterations of w_cgra increments w_glb once.

To determine the order of the loop variables (including split variables), use reorder. This lists the loops from the innermost loop variable to the outermost. For example:

       output_glb
         .tile(x, y, x_glb,y_glb, x_cgra,y_cgra, tilesize_x,tilesize_y)
         .split(w, w_glb, w_cgra, k_oc)
         // reorder from inner to outermost
         .reorder(w_cgra, x_cgra, y_cgra,
                  w_glb, x_glb, y_glb);

We first split loops using tile and split. Afterwards, we have 6 separate loop variables for the function output_glb. The reorder determines the loop ordering of these 6 variables.

Finally, there is tile, which is just syntactic sugar for split and reorder. This scheduling primitive was created since splitting and reordering an image into tiles is a common process. Two equivalent schedules would thus be:

       output_glb
         .tile(x, y, x_glb,y_glb, x_cgra,y_cgra, tilesize_x,tilesize_y);

and:

       output_glb
         .split(x, x_glb, x_cgra, tilesize_x)
         .split(y, y_glb, y_cgra, tilesize_y)
         .reorder(x_cgra, y_cgra, x_glb, y_glb);

Duplicating Compute

Duplicating compute is useful to perform a computation in fewer cycles. This is done by performing multiple iterations of a loop in a single cycle. In hardware, this is done by duplicating compute resources. Similar to HLS semantics, we use unroll to express this hardware duplication.

One of the common uses of hardware duplication is unrolling a reduction domain so that it is performed in a single cycle.

   RDom win(0, blockSize, 0, blockSize);
   blur(x, y) += kernel(win.x, win.y) * hw_input(x+win.x, y+win.y);

   blur.update()
      .unroll(win.x)
      .unroll(win.y);

This example shows a reduction domain win used to perform a convolution. This computation would normally be performed in blockSize * blockSize iterations. We can get a handle on the update stage (the adder chain) using blur.update(), and then fully unroll it using unroll to duplicate the compute so that it is performed in a single cycle.

We can further unroll computation to perform multiple pixels per cycle. We typically want to unroll the innermost dimension to allow for the memories to input the subsequent values more easily. For example:

   blur.update()
      .unroll(win.x)
      .unroll(win.y)
      .unroll(x, 3, TailStrategy::RoundUp);

This schedule unrolls the reduction domain to perform it in a single cycle, and then the innermost loop, x, is unrolled by 3 to produce 3 pixels/cycle. The TailStrategy::RoundUp is used to increase the loop size in case the image width (the extent of x) is not divisible by 3.

Note that even after unrolling the compute, the rest of the application should be equally unrolled. Duplicating hardware in one compute kernel is usually not helpful if the rates of the other kernels are not similarly increased. Furthermore, even if all of the compute kernels are unrolled, the initializations and memory transfers also need to support the same rate. These schedules will thus commonly also include lines like below:

   // unroll the initialization
   blur.unroll(x, 3, TailStrategy::RoundUp);

   // unroll the output and input memories
   hw_output.unroll(xi, 3, TailStrategy::RoundUp);
   hw_input.unroll(x, 3, TailStrategy::RoundUp);

Creating Memories

In addition to compute, it is important to schedule the memories properly. Without any scheduling, all compute is by default inlined. This means that no reuse is captured and lots of compute resources are created. For most applications, it is worthwhile to create memories anywhere you can, because the memory resources are usually abundant and the reduction in compute resources is significant.

A memory can be created for any Func, but it is only useful (and becomes a memory) when there is some sort of reuse. This can be seen in the Halide algorithm when there is an offset in the index, or a reduction domain is used. For example:

   blur_y(x, y) = (blur_x(x, y) + blur_x(x, y+1) + blur_x(x, y+2))/3;

   blur(x, y) += kernel(win.x, win.y) * hw_input(x+win.x, y+win.y);

For these two, blur_x, kernel, and hw_input should be stored in a memory. To schedule these Funcs in memory tiles, use compute_at and store_at:

   blur_x.compute_at(hw_output, xo).store_at(hw_output, xo);

   hw_input.compute_at(hw_output, xo).store_at(hw_output, xo);

The arguments specify the Func and loop variable at which this memory should be created. Effectively, a memory is created at the specified loop level, and any reuse within that level will be captured. For a hardware accelerator without hierarchy, this is commonly the output Func (the one that is accelerated) and the store (second) variable. Note that due to the requirements of store_at and compute_at, usually only compute_at needs to be specified, since the Func will automatically be stored at the same level.

As a further clarification of this process, it may seem odd that the compute level is the same as the store level. Usually in Halide this means that the entire computation would be performed at the specified level before the consumer continues, which would typically lead to a much larger memory at each level than necessary, as well as serial code. Clockwork, however, prefers this specification and will instead create the line buffers that are expected. In effect, all memories are therefore scheduled this way, and Clockwork fixes the compute level based on its own analysis.

More Hardware Boundaries

Although hw_accelerate and stream_to_accelerator are the basic ways of specifying the boundary, there are more complications internally. In src/Func.cpp, one will find that stream_to_accelerator() is simply syntactic sugar for .in() as well as is_accelerator_input. In effect, this call creates the memory hierarchy needed for Clockwork. To have more control over these input and output labels, one can find accelerator_input(), accelerator_output(), and accelerate_call_output(). Not all of these scheduling primitives are fully fleshed out, but they can be used to specify more advanced application structures, such as creating memory hierarchies and using multiple hw_accelerate calls in a single application.

Creating Memory Hierarchy

Creating a memory hierarchy is a matter of specifying a series of copies from larger memories to smaller memories. In effect, the computations between them are NOPs, and tiling is then used to specify the differing sizes.

Halide provides the means of performing these copies using .in(). The generated Func's name will end with global_wrapper, and no computation will occur between the two Funcs. When creating the memory hierarchy, one should create a hierarchy at both the input and output, and align the copies using compute_at.

An example of a memory hierarchy using three levels:

      hw_output.in().compute_root();

      hw_output.in()
        .tile(x, y, xo, yo, xi, yi, glbWidth, glbHeight)
        .hw_accelerate(xi, xo);
      hw_output.in().store_in(MemoryType::GLB);

      Var xii, yii, xio, yio;
      hw_output
        .tile(x, y, xio, yio, xii, yii, tileWidth, tileHeight);
      hw_output.compute_at(hw_output.in(), xo);

      blur.compute_at(hw_output, xio);

      blur_unnormalized.update()
        .unroll(win.x, blockSize)
        .unroll(win.y, blockSize);
      blur_unnormalized.compute_at(hw_output, xio);

      hw_input.in().in().compute_at(hw_output, xio); // represents the mem tile

      hw_input.in().compute_at(hw_output.in(), xo);
      hw_input.in().store_in(MemoryType::GLB);

      hw_input.compute_root()
        .accelerator_input();

Here you'll notice that the global buffer level at the input and output is labelled using store_in(MemoryType::GLB). The accelerator output is the GLB hw_output.in(), meaning that the input GLB, hw_input.in(), is computed at this same level: hw_input.in().compute_at(hw_output.in(), xo). The inner compute on the CGRA has its own tiling, hw_output.tile(), as well as its own input memory tile that is computed at the CGRA output: hw_input.in().in().compute_at(hw_output, xio). This effectively creates a three-level memory hierarchy with the levels: host, GLB, and memory tile.

CPU Target Schedules

CPU schedules are different from Clockwork and hardware targets. The memory sizes are different, and scheduling primitives such as parallel and vectorize play a key role in improving runtimes. Furthermore, compute_at and store_at play a far more complex and important role to ensure that values fit in different cache sizes. There are many tradeoffs to consider when trying to achieve the fastest CPU implementation for any given machine.
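
As a minimal sketch (reusing the blur pipeline's Func names; the tile and vector sizes are placeholders that would be tuned per machine), a CPU schedule might look like:

      output
        .tile(x, y, xo, yo, xi, yi, 64, 64)
        .vectorize(xi, 8)
        .parallel(yo);
      blur_x
        .compute_at(output, xo)
        .vectorize(x, 8);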

Debugging Schedules

There are several places to look to identify whether the schedule is performing how you expect. By modifying src/Lower.cpp, one can output the HalideIR at any stage (I recommend especially the final stage before codegen). In addition, the generated code (bin/<app>_memory.cpp and bin/<app>_compute.h) can help identify what calls are being created in the hardware.
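
For example, a hypothetical one-line addition in src/Lower.cpp (placed after whichever lowering pass you want to inspect) could print the IR using Halide's internal debug stream:

   // hypothetical addition in src/Lower.cpp; 's' is the Stmt being lowered
   debug(0) << "Lowering result before codegen:\n" << s << "\n\n";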

An example generated loopnest is below:

if (!(_halide_buffer_is_bounds_query(input.buffer) || _halide_buffer_is_bounds_query(output.buffer))) {
  assert((_halide_buffer_get_type(input.buffer) == (uint32)67585), halide_error_bad_type("Input buffer input", _halide_buffer_get_type(input.buffer), (uint32)67585))
  ...
  assert((_halide_buffer_get_host(output.buffer) != reinterpret((void *), (uint64)0)), halide_error_host_is_null("Output buffer output"))
  realize hw_input.stencil([0, 64], [0, 64]) {
    produce hw_input.stencil {
      for (hw_input.s0.y, 0, 64) {
        for (hw_input.s0.x, 0, 64) {
          hw_input.stencil(hw_input.s0.x, hw_input.s0.y) = uint16(input[(((_halide_buffer_get_stride(input.buffer, 1)*hw_input.s0.y) - (_halide_buffer_get_min(input.buffer, 0) + (_halide_buffer_get_min(input.buffer, 1)*_halide_buffer_get_stride(input.buffer, 1)))) + hw_input.s0.x)])
        }
      }
    }
    consume hw_input.stencil {
      realize hw_output_global_wrapper.glb.stencil([0, 62], [0, 62]) {
        produce hw_output_global_wrapper.glb.stencil {
          produce _hls_accelerator.hw_output_global_wrapper {
            produce _hls_target.hw_output_global_wrapper {
// This represents the input global buffer
              realize hw_input_global_wrapper.glb.stencil([0, 64], [0, 64]) {
                produce hw_input_global_wrapper.glb.stencil {
                  for (hw_input_global_wrapper.s0.y, 0, 64) {
                    for (hw_input_global_wrapper.s0.x.x, 0, 64) {
                      hw_input_global_wrapper.glb.stencil(hw_input_global_wrapper.s0.x.x, hw_input_global_wrapper.s0.y) = hw_input.stencil(hw_input_global_wrapper.s0.x.x, hw_input_global_wrapper.s0.y)
                    }
                  }
                }
// This section is for the CGRA MEMs and PEs
                consume hw_input_global_wrapper.glb.stencil {
                  realize hw_output.stencil([0, 62], [0, 62]) {
                    produce hw_output.stencil {
                      realize hw_input_global_wrapper_global_wrapper.stencil([0, 64], [0, 64]) {
                        produce hw_input_global_wrapper_global_wrapper.stencil {
                          for (hw_input_global_wrapper_global_wrapper.s0.y, 0, 64) {
                            for (hw_input_global_wrapper_global_wrapper.s0.x.x, 0, 64) {
                              hw_input_global_wrapper_global_wrapper.stencil(hw_input_global_wrapper_global_wrapper.s0.x.x, hw_input_global_wrapper_global_wrapper.s0.y) = hw_input_global_wrapper.glb.stencil(hw_input_global_wrapper_global_wrapper.s0.x.x, hw_input_global_wrapper_global_wrapper.s0.y)
                            }
                          }
                        }
                        consume hw_input_global_wrapper_global_wrapper.stencil {
                          realize blur_unnormalized.stencil([0, 62], [0, 62]) {
                            produce blur_unnormalized.stencil {
                              for (blur_unnormalized.s0.y, 0, 62) {
                                for (blur_unnormalized.s0.x.x, 0, 62) {
                                  blur_unnormalized.stencil(blur_unnormalized.s0.x.x, blur_unnormalized.s0.y) = (uint16)0
                                }
                              }
                              for (blur_unnormalized.s1.y, 0, 62) {
                                for (blur_unnormalized.s1.x.x, 0, 62) {
                                  blur_unnormalized.stencil(blur_unnormalized.s1.x.x, blur_unnormalized.s1.y) = ((hw_input_global_wrapper_global_wrapper.stencil(blur_unnormalized.s1.x.x, blur_unnormalized.s1.y)*(uint16)24) + (blur_unnormalized.stencil(blur_unnormalized.s1.x.x, blur_unnormalized.s1.y) + ((hw_input_global_wrapper_global_wrapper.stencil((blur_unnormalized.s1.x.x + 1), blur_unnormalized.s1.y)*(uint16)30) + ((hw_input_global_wrapper_global_wrapper.stencil((blur_unnormalized.s1.x.x + 2), blur_unnormalized.s1.y)*(uint16)24) + ((hw_input_global_wrapper_global_wrapper.stencil(blur_unnormalized.s1.x.x, (blur_unnormalized.s1.y + 1))*(uint16)30) + ((hw_input_global_wrapper_global_wrapper.stencil((blur_unnormalized.s1.x.x + 1), (blur_unnormalized.s1.y + 1))*(uint16)37) + ((hw_input_global_wrapper_global_wrapper.stencil((blur_unnormalized.s1.x.x + 2), (blur_unnormalized.s1.y + 1))*(uint16)30) + ((hw_input_global_wrapper_global_wrapper.stencil(blur_unnormalized.s1.x.x, (blur_unnormalized.s1.y + 2))*(uint16)24) + ((hw_input_global_wrapper_global_wrapper.stencil((blur_unnormalized.s1.x.x + 2), (blur_unnormalized.s1.y + 2))*(uint16)24) + (hw_input_global_wrapper_global_wrapper.stencil((blur_unnormalized.s1.x.x + 1), (blur_unnormalized.s1.y + 2))*(uint16)30))))))))))
                                }
                              }
                            }
                            consume blur_unnormalized.stencil {
                              realize blur.stencil([0, 62], [0, 62]) {
                                produce blur.stencil {
                                  for (blur.s0.y, 0, 62) {
                                    for (blur.s0.x.x, 0, 62) {
                                      blur.stencil(blur.s0.x.x, blur.s0.y) = (blur_unnormalized.stencil(blur.s0.x.x, blur.s0.y)/(uint16)256)
                                    }
                                  }
                                }
                                consume blur.stencil {
                                  for (hw_output.s0.y.yii, 0, 62) {
                                    for (hw_output.s0.x.xii.xii, 0, 62) {
                                      hw_output.stencil(hw_output.s0.x.xii.xii, hw_output.s0.y.yii) = blur.stencil(hw_output.s0.x.xii.xii, hw_output.s0.y.yii)
                                    }
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    }
// This represents the output global buffer
                    consume hw_output.stencil {
                      for (hw_output_global_wrapper.s0.y.yi, 0, 62) {
                        for (hw_output_global_wrapper.s0.x.xi.xi, 0, 62) {
                          hw_output_global_wrapper.glb.stencil(hw_output_global_wrapper.s0.x.xi.xi, hw_output_global_wrapper.s0.y.yi) = hw_output.stencil(hw_output_global_wrapper.s0.x.xi.xi, hw_output_global_wrapper.s0.y.yi)
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
        consume hw_output_global_wrapper.glb.stencil {
          assert(((0 <= _halide_buffer_get_min(output.buffer, 1)) && ((_halide_buffer_get_extent(output.buffer, 1) + _halide_buffer_get_min(output.buffer, 1)) <= 62)), halide_error_explicit_bounds_too_small("y", "output", 0, 61, _halide_buffer_get_min(output.buffer, 1), ((_halide_buffer_get_extent(output.buffer, 1) + _halide_buffer_get_min(output.buffer, 1)) + -1)))
          assert(((0 <= _halide_buffer_get_min(output.buffer, 0)) && ((_halide_buffer_get_extent(output.buffer, 0) + _halide_buffer_get_min(output.buffer, 0)) <= 62)), halide_error_explicit_bounds_too_small("x", "output", 0, 61, _halide_buffer_get_min(output.buffer, 0), ((_halide_buffer_get_extent(output.buffer, 0) + _halide_buffer_get_min(output.buffer, 0)) + -1)))
          produce output {
            for (output.s0.y, 0, 62) {
              for (output.s0.x, 0, 62) {
                output[((_halide_buffer_get_stride(output.buffer, 1)*output.s0.y) + output.s0.x)] = uint8(hw_output_global_wrapper.glb.stencil(output.s0.x, output.s0.y))
              }
            }
          }
        }
      }
    }
  }
}

Above is a Gaussian blur with a memory hierarchy, targeted for a CGRA. The sections are marked with comments based on their intention. The loop nest can be analyzed for the code intended for the CGRA with its memories, as well as the global buffer input and output. The above is the default printout of the HalideIR, in a form that is close to the C++ implementation.

Extending the Scheduling Language

While most of the functionality for the scheduling language already exists in Halide, one might find that extending the scheduling language is needed. This might be for additional functionality or additional syntactic sugar. These modifications will start in src/Func.cpp.
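
As a rough, hypothetical sketch (the primitive name and the bookkeeping are placeholders, and the real entry point would also need a declaration in src/Func.h), a new scheduling primitive follows the same pattern as the existing ones in src/Func.cpp:

    // hypothetical sketch of a new scheduling primitive in src/Func.cpp
    Func &Func::my_new_primitive() {
        invalidate_cache();
        // record the directive in this Func's schedule here, or desugar it
        // into existing calls such as .in(), as stream_to_accelerator does
        return *this;
    }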
