Simulator (run_tbg.csh)
The simulator takes a CGRA bitstream plus an input image and produces an output image. The output image should match what actual CGRA hardware would produce.
run_tbg.csh is the driver for the simulator. Its main purpose is to call the Test Bench Generator (TBG). Given a bsa bitstream from the Assembler, plus verilog from the CGRA Generator, TBG builds a custom testbench. Making heavy use of TBG scripts, run_tbg.csh invokes the generated testbench using a given input image, and thereby generates the output image and/or other collateral.
In more detail, run_tbg.csh currently does the following:
- verifies and echoes the current github branch;
- verifies and echoes essential details of the target CGRA design (e.g. memheight);
- cleans (removes comments from) the bitstream config file and verifies its validity;
- reorders the bitstream to prevent lockup (see here and the "Reordering the bitstream" section below);
- optionally calls the CGRA Generator to (re)build a verilator-friendly verilog from scratch;
- calls the TBG process_input script to shape the input image according to the DELAY parameter;
- uses the generated verilog, plus the bitstream, to build a verilator testbench;
- runs the testbench on the input image to make an output image;
- calls the TBG process_output script to shape the output image according to the DELAY parameter.
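The cleaning and validity check in the list above can be pictured roughly as follows. This is a minimal Python sketch, not the actual run_tbg.csh logic. It assumes that comments start with '#' (the real comment syntax is not documented here) and that each config line is an address word followed by a data word, both eight hex digits, as in the `00080101 00200003` lines of the sample output. The name clean_bitstream is hypothetical.

```python
import re

# Assumed line format: 8-hex-digit address, whitespace, 8-hex-digit data.
CONFIG_LINE = re.compile(r"^[0-9A-Fa-f]{8}\s+[0-9A-Fa-f]{8}$")


def clean_bitstream(lines, comment_marker="#"):
    """Strip comments and blank lines, then validate what remains.

    comment_marker is an assumption; adjust it to the real bitstream syntax.
    """
    cleaned = []
    for lineno, raw in enumerate(lines, start=1):
        # Drop everything from the comment marker to the end of the line.
        text = raw.split(comment_marker, 1)[0].strip()
        if not text:
            continue  # blank or comment-only line
        if not CONFIG_LINE.match(text):
            raise ValueError(f"line {lineno}: not an address/data pair: {raw!r}")
        addr, data = text.split()
        cleaned.append(f"{addr} {data}")
    return cleaned
```

Raising on the first malformed line mirrors the "verify before simulating" intent: a bad config word is cheaper to catch here than after a long verilator build.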
IO configuration
Along with the bitstream, run_tbg needs an IO-configuration descriptor to set up inputs and outputs, e.g. something that says that the sixteen pads on the north side of the chip are being used as a 16-bit input bus. This configuration file should be supplied by whoever created the bitstream, e.g. PNR.
Here is a sample IO config file.
% less io/2in2out.json
{
  "reset_in_pad": {
    "pad_bus" : "pads_W_0",
    "bits": {
      "0": { "pad_bit":"0" }
    },
    "mode": "reset",
    "width": 1
  },
  "io16in_in_arg_1_0_0": {
    "pad_bus" : "pads_N_0",
    "mode": "in",
    "width": 16
  },
  "io16_out_0_0": {
    "pad_bus" : "pads_E_0",
    "mode": "out",
    "width": 16
  },
  "io1_out_0_0": {
    "pad_bus" : "pads_S_0",
    "bits": {
      "0": { "pad_bit":"0" }
    },
    "mode": "out",
    "width": 1
  }
}
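To make the structure of this file concrete, here is a small Python sketch that loads an IO config and lists each port with its pad bus, mode and width. It is only an illustration: the function name summarize_io is made up, the set of modes is simply what appears in the sample above (the real tools may accept more), and the check on "bits" is a guess at a sensible sanity rule rather than something TBG is known to enforce.

```python
import json

# Two entries copied from the sample IO config above.
SAMPLE = """
{
  "reset_in_pad": {
    "pad_bus": "pads_W_0",
    "bits": {"0": {"pad_bit": "0"}},
    "mode": "reset",
    "width": 1
  },
  "io16in_in_arg_1_0_0": {
    "pad_bus": "pads_N_0",
    "mode": "in",
    "width": 16
  }
}
"""

# Modes seen in the sample file; an assumption, not an exhaustive list.
KNOWN_MODES = {"in", "out", "reset"}


def summarize_io(config_text):
    """Return (name, pad_bus, mode, width) tuples in file order."""
    summary = []
    for name, entry in json.loads(config_text).items():
        mode, width = entry["mode"], entry["width"]
        if mode not in KNOWN_MODES:
            raise ValueError(f"{name}: unexpected mode {mode!r}")
        # Assumed sanity rule: no more per-bit mappings than the bus width.
        if len(entry.get("bits", {})) > width:
            raise ValueError(f"{name}: more bit mappings than width {width}")
        summary.append((name, entry["pad_bus"], mode, width))
    return summary


for name, bus, mode, width in summarize_io(SAMPLE):
    print(f"{name}: {width}-bit {mode} on {bus}")
```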
Reordering the bitstream
Bitstream configuration order is important!
If a LUT is set up as an inverter, the wires can easily pass through a state during routing in which the inverter output is connected directly to its input. The simulator then hangs forever waiting for the value to stabilize to a 1 or 0 (this has actually happened).
It's also bad if e.g. a linebuffer's write-enable signal WEN wiggles off and back on again after the linebuffer has been set up, which can happen as wires get disconnected and reconnected during the routing portion of the bitstream setup. This messes up the internal state of the linebuffer, and you wind up with e.g. a 7-deep linebuffer instead of the 10-deep one that you originally configured (this has also happened).
So now the rule for configuration is a) do the switchbox and connection box wiring FIRST and then b) do the tile setup for LUTs, memories (e.g. WEN), ALU ops etc. For this reason there is a csh script reorder.csh that takes any bitstream config file config.bs and transforms it into the proper order (ish).
Also see here.
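The "routing first, tile setup second" rule amounts to a stable partition of the config lines. The Python sketch below shows only that idea; it is not a port of reorder.csh. How a real config word is recognized as switchbox/connection-box wiring depends on the address encoding, which is not described here, so the classifier is passed in as a parameter and the demo uses invented "sb:"/"cb:"/"lut:"/"mem:" tags that are not the real bitstream format.

```python
def reorder(config_lines, is_routing):
    """Stable partition: routing lines first, then everything else.

    Relative order inside each group is preserved, so the result differs
    from the input only in that tile setup never precedes routing.
    """
    routing = [line for line in config_lines if is_routing(line)]
    setup = [line for line in config_lines if not is_routing(line)]
    return routing + setup


def toy_is_routing(line):
    # Invented tags purely for this demo; real bitstreams are hex words.
    return line.startswith(("sb:", "cb:"))


demo = ["lut:invert", "sb:wire_a", "mem:wen_on", "cb:wire_b"]
print(reorder(demo, toy_is_routing))
```

Keeping the partition stable means that any ordering the bitstream author already relied on within the routing group, or within the setup group, survives the transformation.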
Verilator hacks
The tapeout version of the chip contains proprietary modules for e.g. SRAM and JTAG drivers, plus it has tri-state buffers for the programmable IO pads. Unfortunately, we have not been able to make Verilator work with tri-state buffers, and proprietary modules are verboten where there is no NDA protection (e.g. github repositories and/or our development server "kiwi").
Therefore! When "run_tbg" detects that it is running in a non-proprietary environment, it makes certain changes to the design:
- each tri-state IO pad is replaced by separate 'input' and 'output' pads;
- proprietary modules (SRAM, JTAG driver) are replaced by handwritten stubs with equivalent functionality.
run.csh vs. run_tbg.csh
run.csh is the deprecated older version of run_tbg.csh. It uses a handwritten test bench that is compiled per CGRA design, and which can then be used with any bitstream, as opposed to an automatically generated testbench customized per bitstream as with TBG.
Usage and defaults (--help)
./run_tbg.csh --help
Usage:
./run_tbg.csh <textbench.cpp> -q [-gen | -nogen] [-nobuild]
-usemem -allreg
-config <config_filename.bs>
-io_config <io_config_filename.json>
-input <input_filename.png>
-output <output_filename.raw>
-out1 <1bitout_filename>,
-delay <ncy_delay_in>,<ncy_delay_out>
[-input-size <8>]
[-output-size <8>]
[-trace <trace_filename.vcd>]
-nclocks <max_ncycles e.g. '100K' or '5M' or '3576602'>
-build # no longer supported, use -rebuild_from_scratch instead
-nobuild # no genesis, no verilator build
-nogen # no genesis
-gen # genesis
Defaults:
./run_tbg.csh top_tb.cpp \
-gen \
-config ../../bitstream/examples/pw_padring_shortmem.bsa \
-io_config io/2in2out.json \
-input io/conv_bw_in.png \
-output /tmp/output.raw \
-out1 /tmp/onebit.raw \
-delay 0,0 \
-input-size 8 \
-output-size 8 \
-nclocks 1M
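The -nclocks switch accepts values such as '100K', '5M' or '3576602'. As an illustration of how such a value could be turned into a cycle count, here is a small Python sketch. It assumes that K and M are decimal multipliers (1,000 and 1,000,000), which the help text does not state, and the function name parse_nclocks is hypothetical; the script's own parsing may differ.

```python
def parse_nclocks(text):
    """Convert '100K', '5M' or a plain integer string to a cycle count.

    Assumes decimal suffixes: K = 1,000 and M = 1,000,000.
    """
    suffixes = {"K": 1_000, "M": 1_000_000}
    text = text.strip()
    if text and text[-1].upper() in suffixes:
        return int(text[:-1]) * suffixes[text[-1].upper()]
    return int(text)


for example in ("100K", "5M", "3576602"):
    print(example, "->", parse_nclocks(example))
```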
Sample run_tbg output
run_tbg.csh: I think we are in branch 'genspec'
run_tbg.csh: Looks like memtile_height is 1
Running with the following switches:
./run_tbg.csh -v \
-gen \
-config ../../bitstream/examples/pointwise.bsa \
-io_config io/2in2out.json \
-input io/conv_bw_in.png \
-output /tmp/run.csh.fYN/output.raw \
-out1 /tmp/run.csh.fYN/onebit.raw \
-delay 0,0 \
-nclocks 1M
bin/reorder.csh /tmp/pointwise.bs > /tmp/pointwise_reordered.bs
run_tbg.csh: Building CGRA because it's the default...
run_tbg.csh: ../../hardware/generator_z/top/build_cgra.sh
./build_cgra.sh WARNING I think we are running from kiwi;
setting USE_VERILATOR_HACKS
NOTICE Building shortmem design
--------------------------------------------------------------------
Here is what I built (it's supposed to look like an array of tiles).
// mem_tile_height (_GENESIS2_DECLARATION_PRIORITY_) = 1
//CGRA 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11
//CGRA 00 .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
//CGRA 01 .. p p p m p p p m p p p m p p p m ..
//CGRA 02 .. p p p m p p p m p p p m p p p m ..
//CGRA 03 .. p p p m p p p m p p p m p p p m ..
//CGRA 04 .. p p p m p p p m p p p m p p p m ..
//CGRA 05 .. p p p m p p p m p p p m p p p m ..
//CGRA 06 .. p p p m p p p m p p p m p p p m ..
//CGRA 07 .. p p p m p p p m p p p m p p p m ..
//CGRA 08 .. p p p m p p p m p p p m p p p m ..
//CGRA 09 .. p p p m p p p m p p p m p p p m ..
//CGRA 0A .. p p p m p p p m p p p m p p p m ..
//CGRA 0B .. p p p m p p p m p p p m p p p m ..
//CGRA 0C .. p p p m p p p m p p p m p p p m ..
//CGRA 0D .. p p p m p p p m p p p m p p p m ..
//CGRA 0E .. p p p m p p p m p p p m p p p m ..
//CGRA 0F .. p p p m p p p m p p p m p p p m ..
//CGRA 10 .. p p p m p p p m p p p m p p p m ..
//CGRA 11 .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
--------------------------------------------------------------------
run_tbg.csh: Building the verilator simulator executable...
build_simulator_tbg.csh -v \
pointwise_reordered.bs 2in2out.json \
conv_bw_in.png output.raw 8 8
build_simulator_tbg.csh: Generating the harness
python3 $dir/generate_harness.py \
--pnr-io-collateral io/2in2out.json \
--bitstream /tmp/pointwise_reordered.bs \
--max-clock-cycles 5000000 \
--output-file-name build/harness.cpp \
--input-chunk-size 8 --output-chunk-size 8
Building simulator source files...
verilate.py \
--harness harness.cpp \
--verilog-directory ../../hardware/generator_z/top/genesis_verif/ \
--output-directory build \
--top-module-name top \
Found an existing verilator binary, skipping
make -C build -j -f Vtop.mk Vtop
First prepare input and output files...
BITSTREAM '/tmp/pointwise_reordered.bs':
00080101 00200003
00020101 00000005
...
python3 process_input.py io/2in2out.json /tmp/pw_in.raw 0,0
Done resetting
Beginning configuration
Done configuring
Running test
Cycle: 1000
Cycle: 2000
Cycle: 3000
Cycle: 4000
Reached end of file io16in_in_arg_1_0_0.raw
Done testing
python3 $TBG/process_output.py io/2in2out.json /tmp/output.raw pw 0,0
INPUT od -t u1 /tmp/pw_in.raw
0000000 95 95 98 89 98 103 95 97 93 92 90 89 84 83 81 82
0000020 94 91 87 81 96 88 91 86 83 80 91 91 81 83 87 84
...
OUTPUT od -t u1 /tmp/1output.raw
0000000 190 190 196 178 196 206 190 194 186 184 180 178 168 166 162 164
0000020 188 182 174 162 192 176 182 172 166 160 182 182 162 166 174 168
...
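In the od listings above, every OUTPUT byte is exactly twice the corresponding INPUT byte. A run like this can be sanity-checked with a few lines of Python. The sketch below uses the byte values copied from the listings; the helper name first_mismatch is made up, and "output = 2 * input" is simply what this sample shows, not a general statement about other bitstreams.

```python
# Byte values copied from the INPUT and OUTPUT od listings above.
INPUT_BYTES = [
    95, 95, 98, 89, 98, 103, 95, 97, 93, 92, 90, 89, 84, 83, 81, 82,
    94, 91, 87, 81, 96, 88, 91, 86, 83, 80, 91, 91, 81, 83, 87, 84,
]
OUTPUT_BYTES = [
    190, 190, 196, 178, 196, 206, 190, 194, 186, 184, 180, 178, 168, 166, 162, 164,
    188, 182, 174, 162, 192, 176, 182, 172, 166, 160, 182, 182, 162, 166, 174, 168,
]


def first_mismatch(inputs, outputs, expected):
    """Return the index of the first byte where expected(in) != out, or None."""
    for index, (value_in, value_out) in enumerate(zip(inputs, outputs)):
        if expected(value_in) != value_out:
            return index
    return None


# Prints None: all 32 sample bytes satisfy output == 2 * input.
print(first_mismatch(INPUT_BYTES, OUTPUT_BYTES, lambda x: 2 * x))
```

To check real files instead of the copied values, the two lists can be replaced by `open(path, "rb").read()` on the raw input and output images.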