
Project
CSEE 4868: SoC Platforms
November 12, 2020
IMPORTANT:
• do not modify the directory structure and the file names in the GIT repository.
• make sure the code runs on the socp0X.cs.columbia.edu servers.
• before building and running your code, source the tools: source /opt/cad/scripts/tools_env.sh.
• DO NOT commit binaries, output files or log files;
• DO NOT commit modifications to any Makefile.
You should work on this project individually. A repository is assigned and can be cloned by executing:
git clone [email protected]:socp/prj-UNI.git
1 Introduction
The main goal of the project is to give you a taste of design space exploration (DSE) for a hardware accelerator designed with high-level synthesis (HLS). This project combines the contrast_adj accelerator you have been working on for the homework assignments this semester with accelerators for a Convolutional Neural Network (CNN). Together, these accelerators comprise an entire system, shown in Figure 1, that can perform inference (classification) on an image with sub-optimal contrast.
The project will be divided into two parts with the following dates. Part 1 repositories are available now. Part 2
repositories will be made available at the start date for Part 2.
1. November 12 - December 1
2. December 2 - December 22
In the first part, everyone in the class will participate in a DSE competition on the contrast_adj accelerator. In the
second part, each student will be assigned to a layer of the CNN. You will be competing with the subset of students
assigned to the same layer. However, the second part of the project also adds a layer of collaboration. On several spec-
ified days, you will select one submitted design, including one of your own, for each component to form a complete
system, with the goal of obtaining a Pareto-optimal implementation.
At the end of the project, your grade will depend on the distance between your component/system implementations
and those in the Pareto curve derived by considering the implementations of all the students in the class. Additionally,
you will have many opportunities for collecting extra-credit points throughout the course of the project.
Figure 1: DWARF-7 system for the project with the layers. The design-space exploration will be performed both at
component and system level. The performance metrics for the exploration are shown in the two charts.
2 Logistics
In each part, we provide you with a fully-functional, baseline micro-architecture. We purposely provide a sub-optimal
baseline that has a lot of room for improvement to encourage you to explore a wide variety of optimizations through
DSE.
You are required to obtain at least 3 Pareto-optimal micro-architectures that are named FAST, MEDIUM, and SMALL,
respectively. These have been specified as targets for the HLS in the file project.tcl, where you can add synthesis
attributes to each hls_config. Even though you should design 3 different micro-architectures, you are asked to commit
a single SystemC module where C preprocessor directives are used to switch between the different implementations
depending on the target:
#if defined(SMALL)
// Source code specific to the SMALL target...
#elif defined(MEDIUM)
// Source code specific to the MEDIUM target...
#elif defined(FAST)
// Source code specific to the FAST target...
#endif
Throughout the project you should keep improving your own Pareto frontier by either advancing your Pareto optimal
designs or by adding new ones. Eventually, you should aim at having your designs on the Pareto curve of the class.
Daily, starting from Friday November 13th at 11.59 pm, we are going to automatically test your submitted designs.
We will consider the last git commit of the day. The performance of the valid designs will be published daily on the
project website:
http://pelican.cs.columbia.edu:3838/socp2020
The website allows you to observe the anonymized results of the other students. Currently, the website is configured to
only show results for Part 1. More panels will become available at the start of Part 2 to perform the system composition
and view the corresponding results.
In your repository you will find your ID in hls/syn/team.tcl. On the website we will use your ID to label your
designs. Since students will compete for some parts of the project, you are discouraged from sharing your ID with
others.
3 Part 1: Contrast Adjustment Accelerator
The baseline design for Part 1 is very similar to the solution code for Homework 4, so you should already be somewhat familiar with the design. Please note that we provide an implementation that works on larger images (32x32) than the homework.
There is lots of room for improvement in this accelerator. We will test the accelerator’s performance on a batch of
images. Thus, it may be advantageous to implement ping-pong buffering, such that the load, compute, and store
processes can work in a pipeline. There is a further opportunity to pipeline the compute process: while the histogram equalization kernel is working on the first image, the histogram computation can work on a second image.
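A minimal sketch of the ping-pong idea in plain C++ (not the actual SystemC module; all names here are illustrative): two PLM banks alternate between the load and compute roles, so that in hardware the load of image i+1 can overlap with the compute on image i.

```cpp
#include <array>
#include <vector>

// Hypothetical sketch of ping-pong buffering. In the real design, load and
// compute would be separate SC_THREADs running concurrently; here the bank
// toggling is shown sequentially.
constexpr int IMG_WORDS = 32 * 32;

struct PingPong {
    std::array<std::array<int, IMG_WORDS>, 2> bank; // two PLM banks

    void load(int b, int value) { bank[b].fill(value); } // stand-in for a DMA load
    int  compute(int b) const   { return bank[b][0]; }   // stand-in for histogram eq.

    // Process a batch: the bank index toggles every iteration, so the load
    // of the next image targets the bank that compute is not using.
    std::vector<int> run(const std::vector<int>& images) {
        std::vector<int> out;
        if (images.empty()) return out;
        load(0, images[0]);                                   // prologue: fill bank 0
        for (std::size_t i = 0; i < images.size(); ++i) {
            int cur = static_cast<int>(i & 1);                // bank being computed
            int nxt = 1 - cur;                                // bank being loaded
            if (i + 1 < images.size()) load(nxt, images[i + 1]); // overlapped in hw
            out.push_back(compute(cur));
        }
        return out;
    }
};
```

In the actual SystemC design the same toggling index would select which PLM bank each process owns in a given iteration, with inter-process handshakes replacing the sequential calls shown here.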
3.1 Performance metrics
There are two metrics: area and effective latency. They are evaluated on a single instance of your accelerator. Each of
your valid micro-architectures will appear as a point in a Component DSE chart on the website.
Correctness. There are two requirements for an implementation to be valid:
• Functional correctness. The results of the behavioral SystemC simulation must match exactly the results
of the programmer’s view.
• HLS-generated RTL correctness. The results of the behavioral SystemC simulation must match the results
of the RTL-SystemC co-simulation of the HLS-generated Verilog.
Area. The area of your accelerator is calculated in terms of FPGA-resource utilization on the target Xilinx zcu106
FPGA board. Specifically, the area is calculated based on the average of the percentage utilization of the main
FPGA resources: BRAM, LUTs, and DSPs (Mults).
%Area = (%BRAM + %LUTs + %DSPs) / 3.
Be aware that increasing the number of ports and access patterns of the private local memory (PLM) may allow
for some important performance gains, but will simultaneously make the number of BRAMs grow significantly.
Note that you cannot exceed the resources available on the FPGA.
Effective Latency. The effective latency L_eff (ns) corresponds to the amount of time needed to completely process a given input image. The effective latency is reported at the end of the execution of the RTL simulation.
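As a quick sanity check, the area metric can be computed from the synthesis report with a few lines of code. The following C++ sketch (our own helper names, not part of the project infrastructure) computes %Area and checks that the design fits on the FPGA:

```cpp
// Sketch (not project code): compute the %Area metric from the utilization
// percentages reported for the main FPGA resources, and check feasibility.
double area_pct(double bram_pct, double lut_pct, double dsp_pct) {
    // %Area = (%BRAM + %LUTs + %DSPs) / 3
    return (bram_pct + lut_pct + dsp_pct) / 3.0;
}

// No single resource may exceed what is available on the FPGA.
bool fits(double bram_pct, double lut_pct, double dsp_pct) {
    return bram_pct <= 100.0 && lut_pct <= 100.0 && dsp_pct <= 100.0;
}
```

For example, a design using 30% of the BRAMs, 15% of the LUTs, and 9% of the DSPs scores %Area = 18, even though its largest single-resource footprint is 30%.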
3.2 Testing and Make Targets
We will test your accelerator on a batch of 32x32 images from the CIFAR-10 dataset. The input file will contain the pixel values from 10 images concatenated together, as in the files test1.txt and test2.txt. Note that you need to adjust the contrast of each image separately and cannot treat the batch as one giant image. We will also test your accelerator on a different ordering of the images than the one we provide.
The following are the make targets we provide for testing, where <target> can be either FAST, MEDIUM, or SMALL. You should run the targets from the hls/syn/ directory.
• $ make Makefile.prj
Generation of HLS dependencies, including the memory models listed in memlist.txt. You do not need to run this target; it is run as a dependency of other targets.
• $ make sim_BEHAV
Use this target to run a behavioral simulation to test and debug the functionality of your SystemC accelerator. The output values are stored in testOut.txt, which has to match the output of the programmer's view.
• $ make hls
Perform HLS for one of the micro-architectures (i.e. one hls config).
• $ make sim_BASIC_V
Perform HLS for one of the micro-architectures. Then run the simulation of your SystemC accelerator.
3.3 Evaluation
3.3.1 Required Work (300 Points)
By the Part 1 deadline of December 1st at 11.59pm, you are required to submit the design of three micro-architectures of the contrast adjustment accelerator. Your designs will be evaluated based on the results of the whole class as follows:
• 100 points for each micro-architecture (SMALL, MEDIUM, FAST) for which you submitted a Pareto optimal
design. Students with no Pareto optimal design will get a score proportional to the distance of their design that
is closest to the Pareto curve.
3.3.2 Optional Work for Extra Credit (Up to 105 Points)
Every day from November 12th to December 1st, a script will evaluate all the designs that have been submitted on or before 11.59pm. The new valid designs will be automatically plotted on the charts on the website.
This offers the opportunity for optional work that is highly encouraged and will be evaluated as follows:
• Daily self-progress: 75 points. 5 points per day if you improve your own Pareto frontier at the component level (Nov. 17th to Dec. 1st included, that is 15 days) and submit your design by 11.59 pm. Your own Pareto frontier is considered improved if at least one of your new designs is better than a design on your existing Pareto frontier by at least 5% either in terms of area or effective latency.
• Pareto optimal on Nov. 24th and December 1st: 15 points per date. On each of these two dates the points are assigned if at least one of your designs is Pareto optimal with respect to those of the other students. For each date, these points are assigned based on all the designs submitted on or before 11.59pm.
NOTE: Each design should pass the validation in order to be considered for credits. Designs that do not pass the
validation will not receive credits and will not be plotted on the website.
4 Part 2: Convolutional Neural Network Accelerator
4.1 Background
Recall the DWARF-7 convolutional neural network (CNN) from the Homework 0 assignment. DWARF-7 is a simple
in-house designed CNN with CIFAR-10 as the training dataset. See Figure 1. In Part 2 of the project, we are going to
focus on the convolutional layers. Similar to part 1 of the project, we provide a baseline implementation of a hardware
accelerator. Your task is to perform a design space exploration (DSE) on the provided accelerator.
We suggest carefully studying the convolution_compute() function of the Mojo application to see how the convolution is implemented. Adding prints in the programmer's view can help you find out the values of the parameters passed to each layer.
Furthermore, understanding the general idea of a convolutional layer is important. In addition to the lecture slides, the
following are some recommended references on CNNs:
• Material from Stanford’s Spring 2017 class [1] CS231n: Convolutional Neural Networks for Visual Recognition.
– Notes (highly recommended): http://cs231n.github.io/convolutional-networks/
– Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture5.pdf
– Lecture video: https://www.youtube.com/watch?v=bNb2fEVKeEo&feature=youtu.be
• Beginner’s overview: part 1 [2] and part 2 [3].
4.2 Baseline Micro-Architecture
As shown in Figure 1, the DWARF-7 network is divided into six layers. The system can process data in a streaming
fashion because it can work as a pipeline at the granularity of a layer. For example, while Layer 2 is working on the inference of a dog image, Layer 1 could be working on the image of a cat and Layer 3 on that of an automobile.
Each student is randomly assigned to design one of the four convolutional layers. We provide a baseline accelerator
implementation that works for all the layers. The file team.tcl contains the information on which layer is assigned
to you.
The provided accelerator merges together accumulation, activation, padding (also called resize) and pooling. Notice
that only some convolutional layers in DWARF-7 require all of the above.
Again, the provided baseline implementation is suboptimal, and we encourage you to try a variety of DSE techniques.
4.3 System Composition
The final component of this project is to combine your component design with those of other students to build the
full DWARF-7 system. The project website allows you to observe the anonymized results of the other students. On
certain dates (see Section 4.5) you will be able to select designs from other students for each component to produce
your system design. Similarly, the other students will be able to reuse your design. In this way the project promotes
collaborative engineering and design for reusability.
For each of your micro-architectures you have to select a contrast_adj design along with one design for each CNN
layer other than yours. These will be combined with your CNN layer design for that particular microarchitecture to
produce the full system. This means that you are going to select 6 implementations to be combined with your SMALL
micro-architecture, 6 to be combined with your MEDIUM micro-architecture, and 6 to be combined with your FAST
implementation. You should carry out the selection of the implementations on the website and then download the updated header file precision_test_fpdata.hpp, which is automatically generated. You must copy this file into your repository in the folder hls/tb/. Remember to push this file with git when you make your submission. The selection feature for the system composition on the website will become available later on, closer to the deadline for Part 1 (see Section 4.5). NOTE: you should not edit the downloaded file. The data types specified there are used for a
system-level simulation that measures the accuracy of the system implementation. You may run the system accuracy
test yourself locally on your socp0X server (see the make target in Section 3.2). The system accuracy simulation is
available in a folder called accuracy in your repository.
While initially your choice of components is limited, there will be more opportunities to compose a better system as every student
makes progress. For each micro-architecture that you choose for each of the other six layers, you must specify how
many instances you want to use for that layer; this choice will greatly impact area and throughput. After you push
your system choices into the git repository, the performance of your system will appear on the website the next day.
Ultimately, these are the main goals:
• Component level: Try to improve the design of the accelerator for your convolution layer daily. Compete with
the students that have been assigned your same layer by trying to optimize your design such that it lies on the
Pareto curve with respect to the component-level metrics. This will increase the likelihood that your component
design is reused by other students.
• System level: Combine your accelerator design with those of the accelerators for the other layers made available
by the other students to build the full system. This instance of collaborative design will allow you to quickly
obtain an implementation of the whole system. Then, compete with all the students by trying to refine your
choice of components to obtain an implementation of the system that lies on the Pareto curve with respect to the
system-level metrics.
4.4 Performance metrics
4.4.1 Component Level
At the component level there are two metrics: area and effective latency. They are evaluated on a single instance of your accelerator, which is used to execute sequentially all the computation of your assigned layer. Each of your
valid micro-architectures will appear as a point in a Component DSE chart on the website, as shown in Figure 1. A
micro-architecture is valid if it meets the accuracy constraint explained below.
Accuracy and correctness. There are three requirements for an implementation to be valid:
• Functional correctness. The results of the behavioral SystemC simulation with float data types must
match exactly the results of the programmer’s view.
• HLS-generated RTL correctness. The results of the behavioral SystemC simulation with fixed-point data
types must match the results of the RTL-SystemC co-simulation of the HLS-generated Verilog.
• Accuracy. The component accuracy must be at least 50% for the design to be valid. The accuracy test exe-
cutes inference on 10 images and at least 5 have to be classified correctly. The test executes the behavioral
simulation with your chosen fixed-point precision for the layer and the original floating-point precision for
the other layers. See the Makefile target for this test in Section 3.2.
Area. The area of your accelerator is calculated in terms of FPGA-resource utilization on the target Xilinx zcu106
FPGA board. Specifically, the area is calculated based on the average of the percentage utilization of the main
FPGA resources: BRAM, LUTs, and DSPs (Mults).
A_layer = (%BRAM + %LUTs + %DSPs) / 3.
Be aware that increasing the number of ports and access patterns of the accelerator's private local memory (PLM) will make the number of BRAMs grow significantly. Note that you cannot exceed the resources available on the FPGA.
Effective Latency. The effective latency L_eff (ns) of a layer corresponds to the amount of time needed to completely process a given input image across the layer. The effective latency of a layer is reported at the end of the execution of the RTL simulation.
Throughput. The throughput is only used as a metric at the system level, but it needs to be defined at the component level as well. The throughput depends on the number of instances of an accelerator that are used for a given layer; this number may vary between one and a maximum of three.
• With one instance per layer, the throughput is the inverse of the effective latency of the layer:
T_layeri = 1 / L_layeri
• If you select multiple accelerator instances for a specific layer, the throughput will be:
T_layeri = 1 / (L_layeri / #instances) = #instances / L_layeri
4.4.2 System Level
At the system level there are three metrics: area, throughput, and accuracy. Each of your valid system implementations is represented by a point on a System DSE chart as shown in Figure 1, where accuracy, the third metric, appears as a label next to each point.
Accuracy The accuracy is obtained by counting the percentage of input images classified correctly by the system composed of the chosen accelerators. There is no way to know in advance how the composition of modules will affect the overall accuracy, but we provide a Makefile target for this test in Section 3.2.
Area The area of the resulting system is computed as
A_system = A_layer1 + A_layer2 + A_layer3 + A_layer4 + A_layer5 + A_layer6.
The layer area depends on how many instances of an accelerator the designer chooses to instantiate:
A_layer = A_accelerator · N_instances
Throughput Since the system is pipelined with the granularity of a layer, the effective throughput of the resulting
system is computed as
T_system = min{T_layer1, T_layer2, T_layer3, T_layer4, T_layer5, T_layer6}.
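Under these definitions, choosing the number of instances per layer trades area for throughput: extra instances raise a layer's throughput linearly but also multiply its area. A small C++ sketch of this system-level bookkeeping (the struct and function names are illustrative, not part of the project code):

```cpp
#include <algorithm>
#include <vector>

struct Layer {
    double area;       // %Area of one accelerator instance for this layer
    double latency_ns; // effective latency L_eff of one instance
    int    instances;  // 1..3 instances selected for this layer
};

// A_system = sum over layers of (A_accelerator * N_instances)
double system_area(const std::vector<Layer>& layers) {
    double a = 0.0;
    for (const Layer& l : layers) a += l.area * l.instances;
    return a;
}

// T_system = min over layers of (#instances / L_layer):
// the slowest layer limits the whole pipeline.
double system_throughput(const std::vector<Layer>& layers) {
    double t = 1e300;
    for (const Layer& l : layers)
        t = std::min(t, l.instances / l.latency_ns);
    return t;
}
```

For instance, tripling the instances of a layer that is three times slower than another layer equalizes their throughputs, at three times that layer's area.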
4.5 Evaluation
4.5.1 Required Work (300 Points)
By the project deadline of December 22nd at 11.59pm, you are required to submit the design of three micro-architectures
of the accelerator for the assigned layer and three system-level implementations obtained through design reuse of the
other layers, as explained in Section 2. Your designs will be evaluated based on the results of the whole class as
follows:
• 50 points for each micro-architecture (SMALL, MEDIUM, FAST) for which you submitted a Pareto optimal
design at component level. Students with no Pareto optimal design will get a score proportional to the distance
of their design that is closest to the Pareto curve.
• 50 points for each micro-architecture (SMALL, MEDIUM, FAST) for which you submitted a Pareto optimal
design at system level. Students with no Pareto optimal design will get a score proportional to the distance of
their design that is closest to the Pareto curve.
4.5.2 Optional Work for Extra Credit (Up to 190 Points)
Every day from December 2nd to December 22nd, a script will automatically evaluate all the designs that have been submitted on or before 11.59pm. The new valid designs will be automatically plotted on the charts on the website.
This offers the opportunity for optional work that is highly encouraged and will be evaluated as follows:
• Daily self-progress: 100 points. 5 points per day if you improve your own Pareto frontier at the component level (Dec. 3rd to Dec. 22nd included, that is 20 days) and submit your design by 11.59 pm. Your own Pareto frontier is considered improved if at least one of your new designs for your assigned layer is better than an older design by at least 5% either in terms of area or effective latency.
• Pareto optimal at component level on Dec. 4th, Dec. 14th, and Dec. 21st: 15 points per date. On each of these three dates the points are assigned if at least one of your designs is Pareto optimal at the component level with respect to those of the other students working on the same layer. For each date, these points are assigned based on all the designs submitted on or before 11.59pm.
• Pareto optimal at system level on Dec. 5th, Dec. 12th, and Dec. 22nd: 15 points per date. On each of these three dates the points are assigned if at least one of your system compositions is Pareto optimal at the system level with respect to those of all the other students. For each date, these points are assigned based on all the designs submitted on or before 11.59pm.
NOTE: Each design should pass the validation in order to be considered for credits. Designs that do not pass the
validation will not receive credits and will not be plotted on the website.
5 Suggestions for Design Space Exploration
Here are some suggestions of micro-architectural choices or synthesis options for the design space exploration.
• Refactor the source code: There may be opportunities for optimization that you can explore by modifying the
source code with respect to the provided baseline. The functions and the overall structure of the accelerator can
be re-arranged or re-implemented in a variety of ways. To explore these different options, it may be helpful to
allow these implementations to live side by side within the same shared source file. A recommended approach
for doing this is to use C preprocessor directives to switch between the different source code implementations.
• Modify HLS attributes:
– Target clock period: Stratus HLS uses the clock period to characterize datapath parts and to schedule
parts into clock cycles. The number of operations scheduled in a clock cycle depends on the length of
the clock period and the time it takes for the operation to complete. If the clock period is longer, more
operations are completed within a single clock period. Conversely, if the clock period is shorter (higher
clock frequency), Stratus HLS automatically schedules the operations over more clock cycles. Please note that for some values of the clock period Stratus HLS may not be able to instantiate big arithmetic operators.
You can specify a different clock period for each micro-architecture in the file project.tcl.
• Apply HLS knobs:
– Loop unroll: The loop unrolling transformation duplicates the body of the loop multiple times to expose
additional parallelism that may be available across loop iterations. You can completely unroll, partially
unroll or not unroll at all. Remember that higher parallelism means better performance but more area.
– Loop pipeline: Loop pipelining enables one iteration of a loop to begin before the previous iteration has
completed, resulting in a higher throughput design while still enabling resource sharing to minimize the
required area. This makes it possible to incrementally trade off improved throughput against potentially
increased area.
– HLS_CONSTRAIN_LATENCY: The directive is used to specify the minimum and maximum acceptable latency for a block of code. The block can be a loop body, an if/else or switch branch, or straight-line code enclosed by braces. See the manual for more details and examples.
– HLS_SCHEDULE_REGION: The directive provides a way to have Stratus HLS segregate part of a design's functionality for separate processing with minimal modification to the source code structure. The operations in the region are synthesized as if they were in a separate SC_THREAD, producing an independent finite state machine in RTL. This FSM can implement the operations with a fixed latency, with a variable latency, or in a pipelined manner. See the manual of Stratus HLS for more details and examples.
– HLS_CONSTRAIN_REGION: The directive can be used to control the latency and timing of the resource created for an HLS_SCHEDULE_REGION. This optional directive should be placed adjacent to an HLS_SCHEDULE_REGION directive. See the manual for more details and examples.
– Stratus HLS offers many more "knobs" to interact with. The User Manual, the Reference Guide, and the class slides provide further documentation. We suggest trying them and monitoring the web forum for additional hints.
• Customize the private local memory (PLM): You have complete freedom on the organization and the size of
the PLM, as long as it fits on the target FPGA.
• Customize the DMA channels: It is possible to change both the bitwidth of the DMA and the number of DMA channels. Notice that the AXI bus would have to reflect these changes. The biggest effect of this kind of optimization is on the accelerator's bandwidth to memory. To apply these modifications you should act on the memory model in the SystemC testbench. Notice that the minimum data width on the AXI bus is 32 bits. You are highly encouraged to at least increase the DMA bitwidth, which is expected to provide great speedups.
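As a mental model for the loop-unrolling knob, here is what an unroll-by-4 of a reduction loop amounts to, written out by hand in plain C++. In Stratus HLS you would request this transformation with an unrolling directive rather than by editing the code; this sketch only illustrates the parallelism/area trade-off.

```cpp
// Baseline: one addition per loop iteration.
int sum_baseline(const int* buf, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) acc += buf[i];
    return acc;
}

// Unrolled by 4 (n assumed divisible by 4): four additions per iteration
// expose parallelism across iterations, at the cost of more adder hardware.
int sum_unroll4(const int* buf, int n) {
    int acc = 0;
    for (int i = 0; i < n; i += 4)
        acc += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
    return acc;
}
```

Both functions compute the same result; only the shape of the hardware that HLS would infer differs.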
6 Private Local Memory (PLM) Generation
The target make Makefile.prj generates the Stratus HLS scripts and their dependencies, including the definition and the Verilog models of the private local memory (PLM). The PLM is generated based on the specification given in the file memlist.txt. Each line of this file describes a memory to be generated with the following syntax:

<name> <words> <bit-width> <access-patterns>

<name>: Memory name;
<words>: Number of logic words in memory;
<bit-width>: Word bit-width;
<access-patterns>: List of parallel accesses to memory. These may require one or more ports.

The <access-patterns> list contains all the possible access patterns that the PLM can support. An access pattern is described by the number of read and write operations that can be executed simultaneously. Each element of <access-patterns> has the following syntax: <writes>w:<reads>r.
Let's consider an accelerator that uses a PLM called plm0 with a size of 1024 words of 32 bits each; let's also assume that the accelerator accesses the PLM in only 3 possible scenarios: (a) read 8 contiguous words simultaneously; (b) read only 1 word; and (c) write only 1 word. A memory with the above specifications should be generated with the following description: plm0 1024 32 1w:0r 0w:8r 0w:1r.
The previous example corresponds to the case when the 8 words that are read are stored in contiguous locations in the
PLM. Instead, if multiple words are read in locations that are not contiguous, then the corresponding access pattern
must be specified as 0w:8ru, where a u is appended after r (or after w in the case of writing).
Lastly, it is possible that reads and writes happen simultaneously on the same PLM. If for example 1 word is read and
1 other word is written at the same time, the access pattern would be: 1w:1r.
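To make the token syntax concrete, here is a hypothetical C++ parser for a single access-pattern token such as 1w:0r or 0w:8ru. The struct and field names are our own, and the parser only handles a trailing u as in the examples above:

```cpp
#include <string>

// Illustrative model of one memlist.txt access-pattern token.
struct AccessPattern {
    int  writes;    // number before 'w'
    int  reads;     // number between ':' and 'r'
    bool unaligned; // trailing 'u': accesses to non-contiguous locations
};

// Parse a token of the form "<writes>w:<reads>r" with an optional trailing 'u'.
AccessPattern parse_pattern(const std::string& tok) {
    AccessPattern p{0, 0, false};
    std::size_t w = tok.find('w');
    std::size_t c = tok.find(':');
    std::size_t r = tok.find('r');
    p.writes    = std::stoi(tok.substr(0, w));
    p.reads     = std::stoi(tok.substr(c + 1, r - c - 1));
    p.unaligned = (tok.back() == 'u');
    return p;
}
```

For example, parse_pattern("0w:8ru") yields 0 writes and 8 non-contiguous reads, matching the description above.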
7 Resources
Since Stratus HLS is a powerful industrial CAD tool, you may find it helpful to refer to its documentation in addition
to the in-class tutorials we gave on its use. You may find the documentation in /opt/cad/stratus/doc/ on the
socp0* servers. The two PDFs you are likely to find most helpful are:
• /opt/cad/stratus/doc/Stratus_HLS_User_Guide/Stratus_HLS_User_Guide.pdf
• /opt/cad/stratus/doc/Stratus_HLS_Reference_Guide/Stratus_HLS_Reference_Guide.pdf
You may download them to your local machine using rsync, scp, WinSCP, or similar file-transfer methods for easier viewing.
References
[1] https://www.cs.toronto.edu/~frossard/post/vgg16/
[2] https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
[3] https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/