Writing Applications

Below are some tips for writing Halide applications and some decisions for scheduling to hardware.

Algorithm

The algorithm defines the output pixels that are desired in the Halide application. All basic C++ operators are overloaded to apply to Halide functions. The supported operators are included in Supported Operators.

Indexing

Image processing applications typically define pixels based on the local pixels from a previous function. Therefore, values for an indexed function are defined based on the x and y coordinate of other functions. This example performs hot pixel suppression on Bayer images, where color channels are two pixels away in all directions.

Expr max_value = max(max(input(x-2, y), input(x+2, y)),
                 max(input(x, y-2), input(x, y+2)));
Expr min_value = min(min(input(x-2, y), input(x+2, y)),
                 min(input(x, y-2), input(x, y+2)));

Func denoised("denoised");
denoised(x, y) = clamp(input(x, y), min_value, max_value);

Reduction Domain

A common pattern used in image processing is convolution of an image with filter weights. This can be done using a reduction domain. A single line of code can define a sum of products:

Func input, kernel, convolution;

// Define x and y loops from -1 of length 3
RDom win(-1, 3, -1, 3);

kernel(x, y) = x + y;
convolution(x, y) = 0;  // set initial value of 0
convolution(x, y) += input(x + win.x, y + win.y) *
                     kernel(win.x, win.y);

The default schedule for a reduction domain is to have two loops, but in hardware, this would create two counters. We can unroll these loops to duplicate the multipliers and adders, and remove the counters.

convolution.update(0)
           .unroll(win.x).unroll(win.y);

Select

Conditional statements do not translate well into hardware. Therefore, it is recommended to use a version of a ternary or switch operator. In Halide, this is provided using the select statement. This generates a mux in hardware. Below shows how a select statement can be used for non-maximal suppression.

Expr is_max = in(x, y) > in(x-1, y-1) && in(x, y) > in(x, y-1) &&
    in(x, y) > in(x+1, y-1) && in(x, y) > in(x-1, y) &&
    in(x, y) > in(x+1, y) && in(x, y) > in(x-1, y+1) &&
    in(x, y) > in(x, y+1) && in(x, y) > in(x+1, y+1);
hw_output(x, y) = select( is_max, in(x, y), 0);

Array of Funcs

Code can become a bit tedious due to writing all of the indices as above. Normally one would replace these with RDoms, but when intermediate values are needed or the reduction operator is not defined (such as max), instead an array of Funcs can be used. One can preserve the DAG structure while making it more readable using an array of Funcs.

Func segment;
Func largest_value[16];
for (int i=0; i<16; ++i) {
  if (i==0) {
    largest_value[i](x,y) = segment(x,y,0);
  } else {
    largest_value[i](x,y) = max(segment(x,y,i), largest_value[i-1](x,y));
  }
}

Linebuffers

A linebuffer is a memory element that has the minimum amount of size to store the working set during execution. For example, for 3x3 convolution, two rows and 3 pixels are needed from the input image during execution. The Halide compiler will insert a linebuffer when specified in the schedule with a linebuffer size to match the stencil size of the kernel. This is done in the schedule with store_at().

FIFOs

When there is convergence of Funcs in the application, it is necessary that the hardware is constructed with delays through each parts match. When a DAG diverges into separate Funcs, a FIFO is needed when the delays are mismatched from different linebuffer sizes. Clockwork should generate these FIFOs properly, so the user does not have to do anything.

Defining Accelerator Bounds

A hardware accelerator is created using some scheduling calls. The output is defined by the function and loop level. Each of the inputs must be denoted as well. An example is below, which has inputs hw_input and hw_kernel, and output hw_output with loop xo being the first loop outside the accelerator.

hw_output.tile(x,y, xo,yo, xi,yi, imgsize, imgsize)
  .hw_accelerate(xi, xo);

hw_input.stream_to_accelerator();
hw_kernel.stream_to_accelerator();

Generator Parameters

All of the applications are written as generators. One of the main purposes of this is so that similar applications can be encapsulated by a single application. Parameters can be used to alter single integers (such as kernel size) or choose different schedules.

    GeneratorParam<int> ksize{"ksize", 3};    // default: 3

To alter the value during execution, use HALIDE_GEN_ARGS with the new value.

make compare HALIDE_GEN_ARGS="ksize=5"

results matching ""

    No results matching ""