# LAB5 : Matrix Multiplication module [Deployment on the _PYNQ_ board + ASIC Floorplanning]
Deadline: 2nd December 2023 23:59
## Objective
In the previous lab, you designed and simulated a systolic matrix multiplication kernel. For simplicity, we read out the final result in parallel with appropriate valids.
In this lab, you will:
1. modify the readout mechanism to support a systolic (serial) operation, and
2. deploy the circuit on a real FPGA board.
We will provide the required hardware infrastructure to generate input matrices on the ARM CPU of the FPGA board, send them to your systolic core, and read the result back to the ARM CPU for validation.
We will use the _PYNQ_ FPGA board for this lab, which is a hybrid FPGA+ARM SoC (System-on-Chip) that allows you to run an embedded Linux stack on the ARM CPU and use the FPGA as an accelerator.
The _PYNQ_ board also ships with user-friendly Python APIs, which we will use for programming and interacting with the FPGA.
Should FPGA board deployments present a technical challenge during this term,
we will use post-place-and-route timing simulations for grading
instead. Note that the instructor will decide whether this
alternative is used.
Specific tasks are below:
* Modify your systolic array to read out the `D` result matrix serially per row. This will require a change to the interface of `pe.v`, some extra internal registers for holding state, and more wires in `systolic.sv` to shift out the data. Test your design for functional correctness using a test setup similar to that of the previous lab.
* Synthesize, implement, and download your design bitstream to the _PYNQ_ board. Set up the _PYNQ_ board for use with your lab, and run the Python-based test framework to confirm correct operation.
* Alternatively, run a timing simulation after placement and routing to ensure
correct timing behavior that closely resembles on-chip behavior. This will be
used as a fallback should FPGA deployments present a technical challenge
during this term due to health regulations.
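
The host-side validation in the deployment task can be pictured with the short sketch below. It is not the provided test framework: the function names (`random_matrix`, `matmul_reference`, `check_result`) and the value range are hypothetical, and the transfer to and from the FPGA is deliberately left out. It only shows the idea of generating inputs on the CPU, computing a software reference, and comparing it with whatever the systolic core returns.

```python
import random


def random_matrix(rows, cols, lo=0, hi=15, rng=random):
    """Generate a rows x cols matrix of small integers (value range is an assumption)."""
    return [[rng.randint(lo, hi) for _ in range(cols)] for _ in range(rows)]


def matmul_reference(a, b):
    """Plain software matrix multiplication used as the golden reference."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [
        [sum(a_row[k] * b[k][c] for k in range(inner)) for c in range(cols)]
        for a_row in a
    ]


def check_result(a, b, d_from_core):
    """Return True if the matrix read back from the core matches the reference."""
    return d_from_core == matmul_reference(a, b)


if __name__ == "__main__":
    a = random_matrix(4, 4)
    b = random_matrix(4, 4)
    # Stand-in for the value read back from the FPGA (hypothetical).
    d = matmul_reference(a, b)
    print("match:", check_result(a, b, d))
```

In an actual run, `d` would be replaced by the matrix read back from the board.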
## Design Description:
The components for the complete systolic design are shown below. As you can see, the design is mostly similar to that in Lab 3. The important change is the new set of RAM blocks for collecting the output results `D`.

### Shifting out results:
The red link shown in the figure above will shift out the results to the RAM blocks `D` to the right of the figure.
There are three design requirements for the implementation of the readout circuit:
1. The results in the rightmost column, `PE[*][N1-1]`, are pushed out to `D` first. In the figure above, `PE[*][3]` will shift out first.
2. The PEs cannot be paused and must continue to process the next matrix multiplication while the results of the previous multiplication are shifting out.
3. You must use a *constant* amount of pipelining in each PE, so that the same solution works for any `N1`. Constant is defined here as a small number of registers, one or two.
To do this, you will need to modify `pe.v` to include new input and output ports. These will be the inputs `in_data` (`D_W_ACC` bits) and `in_valid` (1 bit) from the left, and the outputs `out_data` (`D_W_ACC` bits) and `out_valid` (1 bit) for shifting data out.
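
To get a feel for the three requirements, here is a small behavioural model of one PE row, written in Python rather than Verilog. It is only one possible scheme and not necessarily the intended solution. It assumes that `PE[j]` finishes its result one cycle after `PE[j-1]` (the usual systolic skew) and that each PE forwards incoming data through one holding register plus its output register. The function name `simulate_row` and the `period` parameter are hypothetical, and the exact cycle timing may differ from the reference table.

```python
def simulate_row(batches, period):
    """Model the readout chain of one PE row, cycle by cycle.

    batches[b][j] is the b-th result computed by PE j.
    Assumption: PE j finishes batch b at cycle b * period + j.
    Returns the valid values seen at the rightmost PE's out_data, in order.
    """
    n = len(batches[0])
    hold = [(0, False)] * n  # one holding register per PE: (data, valid)
    out = [(0, False)] * n   # out_data / out_valid register per PE
    seen = []
    total_cycles = len(batches) * period + 2 * n
    for t in range(total_cycles):
        next_hold, next_out = [], []
        for j in range(n):
            # in_data / in_valid come from the left neighbour's output register.
            in_data, in_valid = out[j - 1] if j > 0 else (0, False)
            b, phase = divmod(t - j, period)
            own_done = phase == 0 and 0 <= b < len(batches)
            if own_done:
                if hold[j][1]:
                    raise RuntimeError("collision: forwarded data would be lost")
                next_out.append((batches[b][j], True))  # own result goes first
            else:
                next_out.append(hold[j])                # otherwise forward
            next_hold.append((in_data, in_valid))
        hold, out = next_hold, next_out
        data, valid = out[-1]
        if valid:
            seen.append(data)
    return seen


if __name__ == "__main__":
    # Two back-to-back multiplications on a 4-wide row; the PEs never pause.
    print(simulate_row([[10, 11, 12, 13], [20, 21, 22, 23]], period=4))
```

In this model the rightmost PE's value leaves first and the leftmost PE's value leaves last, each PE uses two registers regardless of `N1`, and a second batch can follow without stalling as long as `period` is at least the row width.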
The table below illustrates the expected behaviour for the shifting out of results:
- The column headers are the different inputs and outputs of a single `pe` module, which is reset at time t=0. The positive edges of the clock happen between subsequent time values.
- Over the following cycles, the `pe` calculates two matrix values: `1 * 2 + 3 * 4 + ... + 13 * 14 + 15 * 16 = 744` and `17 * 18 + 19 * 20 + ... + 29 * 30 + 31 * 32 = 4968`. As in Lab 3, the `init` signal is used to drop the value currently accumulated in the `pe`, so that a new accumulation may be started for the next value.
- The `pe` also receives results from other `pe`s (`100`, `101`, ...), indicated by `in_valid = 1`.
- Both the values calculated by the `pe` and the values received from its neighbors to the left in the same row must be output through `out_data`. Note the order in which the values appear in `out_data`. The `out_data = Dc` means that we `D`on't `c`are about the value at that cycle.
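
The two accumulated values quoted above can be reproduced with a few lines of Python. This is just an arithmetic check of the example operands; the helper name `accumulate` is hypothetical and the sketch does not model the `init` timing.

```python
def accumulate(pairs):
    """Multiply-accumulate over (a, b) operand pairs, as a single PE would."""
    acc = 0
    for a, b in pairs:
        acc += a * b
    return acc


# First value: 1*2 + 3*4 + ... + 15*16
first = accumulate((k, k + 1) for k in range(1, 16, 2))
# Second value: 17*18 + 19*20 + ... + 31*32
second = accumulate((k, k + 1) for k in range(17, 32, 2))

print(first, second)  # 744 4968
```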