辅导案例-CS 330-Assignment 1

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Introduction to Analysis of Algorithms October 14, 2020
Boston University CS 330
Professor Adam Smith, Dora Erdos Fall 2020
Programming Assignment 1
Due October 24, 2020 at 11:59 PM
1 General Description
Models of how a virus spreads through a community play an important role in deciding on public
health measures. Graph algorithms are often used to simulate the spread. In this assignment, you
will practice on aspect of that type of simulation.
Suppose we have a community modeled by a graph G, where vertices represent people and edges
represent social connections. We will use a very simple model of virus transmission: there is a set of
initially infected people (“sources”), and people get the virus with some probability through social
contact with a person that is infected. Given the initial sources, and the social connections among
the individuals, you have to estimate who will get infected in the community.
In this assignment, you will implement an algorithm that emulates the spread of the virus. With
the help of this algorithm, you will experiment on various types of input to infer characteristics of
the virus spread. You will submit working code through Gradescope, as well as summarize your
findings in a report. Supported languages are Python and Java. (Unfortunately, we won’t be able
to support C++.)
Collaboration policy. For this assignment only very limited collaboration is allowed. You can
discuss high level ideas with your classmates. However, you cannot share code or experimental
results with each other. The findings in your report have to be your own. We will use plagiarism
detection tools on the BFS portion of your assignment.
2 Deliverables
You need to submit the following.
• A working implementation of BFS as described in section 5.1. This code will be run on
various test files in the autograder.
• A 2-page pdf document summarizing your experimental findings as described in section 4.
The page limit is strict! Make sure that your writing is concise and easy to follow. While
sometimes it makes sense to include certain counts (e.g. number of nodes infected) or numer-
ical observations (e.g. results change significantly when p = 0.5), usually including raw data
is not helpful. You can instead for example add figures (see examples in section 4) or verbal
explanations.
1
1 - T A
-
programming_assignment1 1 of 10
3 Technical Description
Model setup. We model the community with help of a graph G(V,E). Nodes correspond to
people, and an edge between two nodes represent a social connection (e.g. friendship) between
them. The input to this model is G, the set S of source nodes (may be multiple nodes) that are
initially infected by the disease (S ⇢ V ) , and the infection probability p.
We measure the disease propagation in days. Each day some people get infected. Initially on
day 1 only the sources are sick. Once a person gets sick, the next day they infect each of their
contacts with probability p. (After that first day they quarantine and don’t infect any additional
people on subsequent days.) Infections are independent from each other.
Incorporating the random spread. Given the random process for node infection it is hard
to give an exact formula on how likely each node is going to be infected. However, we can get a
reasonable estimate by measuring it empirically. We play out the infection process multiple times
on G (using the same sources) and keep track of the infection of each individual node. We then
take the average outcome for each node over the multiple process instances.
We call an edge e active if one node infected its neighbor through this edge. Observe, that the
nodes in V that eventually get infected are the ones that are connected to a source through active
edges. The day that a node v is infected is the distance of v to a source.
Hence, we can generate an instance Gi of the infection propagation by removing all edges from
G that are inactive. Formally, we generate a graph Gi(V,Ei), such that the nodes in G and Gi are
the same. We obtain the set Ei by adding each edge in E independently with probability p. The
edges in Ei are the active edges. (Thus, Ei is a subset of E.)
We can use BFS on this random subgraph Gi to find out whether—in this particular random
instance—a node was infected and, if so, on what day.
Algorithm 1: GenRndInstance(G(V,E), p)
1 Ei = ; ;
2 for e in E do
3 active = rnd([0, 1]) /* generate random number */
/* edge is active with probability p */
4 if active  p then
5 Ei.add(e);
6 return Ei
To find the average outcome of the process for each node, we repeat the following 100 times:
first generate a graph instance Gi and then run BFS on Gi.
For each node v we can compute how likely it is to get infected by computing the fraction of
instances that its distance from the source is finite. For the instances where v did get infected we
can also compute the average time to infection by taking its average distance from the source.
Multiple sources. In this infection model it is possible that there are multiple sources s1, s2, . . . sk,
i.e. several people are initially infected. We can deal with this by adding an additional node s. Add
an edge (s, si) connecting it to each source, and add these edges to Ei with probability 1. (This
ensures that all the sources get marked as infected on the first day.)
2
programming_assignment1 2 of 10
4 Experiments
As the second part of your assignment you will apply your algorithm to di↵erent types of graphs
(the graphs are provided by us). We ask you to observe and discuss the virus propagation in the
context of the structure of each graph. All of the graphs are synthetic generated so that they
exhibit the desired properties for sure.
Run all of your experiments for each four of the infection probabilities p1 = 0.1, p2 = 0.3,
p3 = 0.5 and p7 = 0.7 and report your findings on all.
4.1 Input.
We give you three graphs and source nodes for each. The graphs are artificially generated, each
one based on a di↵erent principle.
Grid graph. (We refer to this graph as Ggr) In this graph nodes and edges are as if they are
laid out on a square grid. It has 1024 nodes. Each node has degree 4 and is part of exactly 4 cycles
with 4 edges. (e.g. edges are laid out along the edges of the squares in the grid.) While this graph
is quite artificial, some real life networks are quite close to emulating this structure. Roughly, it
models a setting where people can only infect the people who live very nearby. (The grid structure
shows up in lots of real-world settings. One well known dataset with similar properties is the US
Electrical Power Grid 1. This data set does contain a few high degree nodes (which Ggr doesn’t)
that correspond to large relay stations. Road networks, such as the one for Pennsylvania 2, exhibit
similar structure.)
sources. This data set comes with 4 source nodes that form a length-4 cycle. (i.e. they are the
vertices of one of the squares.)
Grid graph with shortcuts. (denoted by Ggs) This graph is very similar to the vanilla grid
graph. In fact, it is the same graph – consisting of 1024 nodes that are laid out along the edges of
a square grid. However, in addition to the grid edges, each node v also connects to one additional
node u chosen uniformly at random. (note, that it is possible that u is the additional neighbor to
multiple nodes).
This graph is an example of a so called ”small world” graph. Due to the shortcut edges the di-
ameter (= maximum lengths of the shortest paths between pairs of nodes) of this graph is very low.
sources. The source nodes used are the same as for the vanilla grid.
Scale free network. (denoted by Gsf ) These graphs follow a so-called Power-law degree distri-
bution. In such graphs, the fraction of nodes of degree d is about d . Here is a parameter that
1http://konect.cc/networks/opsahl-powergrid/
2http://snap.stanford.edu/data/roadNet-PA.html
3
programming_assignment1 3 of 10
is characteristic of a graph. Typical values for are between 2 and 3. What this means is that
there are a small number of nodes in this graph that have very high degrees, while the majority of
the nodes have low degrees. The term scale-free refers to the fact that the degree distribution of
the graph is independent of the size of n, it does not ”scale” with the size. Check out this excellent
set of slides by the SNAP group at Stanford 3
We used a built-in graph generator for power-law graphs 4 from the well-known Python graph
package NetworkX. The generator is an enhanced version of the Barabasi-Albert preferential at-
tachment model 5.
sources. We give you two sets of source nodes for this data set. The first consists of two high degree
nodes (One has degree 87, the other 135). The second consists of four nodes, each of degree 4.
Some background information: Power-law graphs are of interest since many real-life networks
have been found empirically to follow such degree distributions (or other distributions with similar
shapes). Some examples are the friendship connections in social networks6 and the hyper-link
connections on the World Wide Web7. There are ongoing debates as to how well real life networks
are modeled by these distributions8.
4.2 Data Analysis.
You will analyze the infection spread on the various types of graphs. We give you multiple tasks.
We provide you with very precise instructions for some of them and then there are a few more open
ended questions.
Workflow. You will first compute the outcome of the virus propagation on the various inputs.
Then you will analyze those results as specified below. For the open ended questions you may
decide to run additional computations.
Input. You will run your initial experiments on the following data set + source combinations.
• grid 1.txt Contains the graph Ggr with one source (specified in the second line of the
graph) and 1024 nodes. (Denote this graph with one source by G1gr.)
• grid 4.txt Contains the graph Ggr with 4 sources (specified in the second line of the graph)
and 1024 nodes. (Denote by G4gr)
• grid shortcut 1.txt Contains the graph Ggs with one source (specified in the second line
of the graph) and 1024 nodes. (Denote by G1gs)
3http://snap.stanford.edu/class/cs224w-2015/slides/04-powerlaws.pdf.
4networkx.generators.random graphs.powerlaw cluster graph
5It takes as input the number of nodes n an integer m and a probability p. We used parameters n = 1000, m = 4,
p = 0.3. It is initiated with a complete graph of m nodes. Then nodes are added one at a time. Each new node
connects to m others at random with probabilities proportional to the other nodes’ degrees. Finally if node u is
recently added and connected to v and w, then an extra edge is added between v and w with probability p.
6For example, Mislove et. al.https://mislove.org/publications/SocialNetworks-IMC.pdf find that the net-
works they investigate are in fact close to power-law, though they don’t show a perfect fit.
7http://snap.stanford.edu/class/cs224w-readings/faloutsos99powerlaw.pdf
8https://www.quantamagazine.org/scant-evidence-of-power-laws-found-in-real-world-networks-20180215/
4
programming_assignment1 4 of 10
• grid shortcut 4.txt Contains the graph Ggs with 4 sources (specified in the second line of
the graph) and 1024 nodes. (Denote by G4gs)
• scalefree high.txt Contains Gsf with 2 high degree sources (specified in the second line
of the graph) and 1000 nodes. (Denote by Ghsf )
• scalefree low.txt Contains Gsf with 2 low degree sources (specified in the second line of
the graph) and 1000 nodes. (Denote by Glsf )
Running instructions. For each of the above data sets you run your model on all four values of p:
p = 0.1, p = 0.3, p = 0.5, p = 0.7. For each combination of data – source – probability you run
your model 100 times and take the average over those outcomes. Ideally, the many runs will be
incorporated in your code and you only need to specify the input each time.
Output. For each run of your experiment you will retrieve two lists of length n. Your analysis will
use these lists as input. The lists are (1.) prob infect contains at index i the probability that
node i is infected. (2.) avg day contains at index i the average day (distance from a source) that
node i is infected – among the days that it’s infected at all.
Analysis of grid graphs. Describe your findings of your experiments. Often it is useful to create
a figure with your data and include that in your report too. (Note that the example figures in this
section are for results on a di↵erent input than are given to you.)
1. Run the experiments on both G1gr and G
4
gr. For each four value of p report the number of
nodes infected in the two graphs. Is the number of infected nodes in G4gr significantly more
than G1gr, for what values of p? Plot the infection probability distribution for both graphs.
Figure 1 depicts such a plot. The X-axis corresponds to nodes, the Y-axis to the probability
that each node is infected. The nodes are ordered in descending order of infection probability.
The results for all four p are depicted in the same plot. Write your observations whether you
think the probability distributions for G1gr and G
4
gr and the ps follow the same pattern and
why. If it helps with your comparison, you may take the ratio of probabilities. E.g. for
p = 0.5 take the probability distribution of G1gr and divide by the value for G
4
gr. The closer
the results are to 1, the more similar the two results are.
2. Compare the average day that nodes get infected in G1gr and G
4
gr. Report your observations
for all values of p. Round the average infection days for each node to an integer and then
represent the data in a histogram similar to Figure 2. In our figure, to make the data
easy visible we created two separate histograms, the first containing results for p = 0.1 and
p = 0.3. The second for p = 0.5 and p = 0.7. Feel free to show your results in as few or as
many histograms as you think is best.
3. Perform the same analysis as above on the probability distribution on inputs G1gs and G
4
gs.
4. Perform the same analysis as above on the infection day on inputs G1gs and G
4
gs.
5. Compare the outcome of your experiments on the grid without and with the shortcuts. For
the graphs G4gr and G
4
gs compare the number of infected nodes and the infection probability
distributions for all four p values. Further, compare the data on their average infection days.
5
programming_assignment1 5 of 10
00.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 100 200 300 400 500 600 700 800 900 1000
pr
ob
ab
ilit
yo
fin
fe
cti
on
nodes
p=0.1
p=0.3
p=0.5
p=0.7
probability distribution for infection with p=0.1, 0.3, 0.5, 0.7
Figure 1: Probability of nodes in the grid being infected in descending order. Note that for p = 0.1
and p = 0.3 many nodes have 0 probability.
1
6
11
16
21
0 5 10 15 20 25
nu
m
be
ro
fn
od
es
day infected
p=0.1
p=0.3
average day of infection with p=0.1, 0.3
1
6
11
16
21
26
0 10 20 30 40 50 60 70 80 90 100
nu
m
be
ro
fn
od
es
day infected
p=0.5
p=0.7
average day of infection with p=0.5, 0.7
Figure 2: Average day of infection on the grid with one source for p = 0.1 and p = 0.3 (left) and
p = 0.5, p = 0.7 (right).
6
programming_assignment1 6 of 10
In your explanation you may refer to figures included in the previous points. Or, if you think
a new figure would help with your clear explanation , feel free to include it.
Comment whether you observed the same trends when comparing G1gr and G
1
gs. If they are
very similar, then there is no need to share your figures for the latter comparison. If there a
di↵erence, then show the data you think best aides your argument.
6. (open ended) Formulate your opinion, how do the presence of the extra ”shortcut” edges in
Ggs influence the outcome when comparing Ggr and Ggs? (For both 1 and 4 sources.)
7. (open ended) For both Ggr and Ggs there seems to be a di↵erence in terms of infected
population between the infection probability being p < 0.5 or p 0.5. What is your intuition
for this and why? (Argue based on your understanding of the edge-layout of the graphs and
their active edges.)
Analysis of scale free graphs.
1. Analyze the outcome on the scale free networks Ghsf and G
l
sf . Again, run your experiments for
all four values of p. Compare the infection probability distributions. This should be similar
to the analysis you did to compare G1gr and G
4
gr.
2. Compare the average days of infection similar to your comparison earlier.
3. (open ended) Formulate your intuition, which nodes are most likely to get infected in both
experiments Ghsf and G
l
sf? Do you see a correlation between the distance of the infected
node to the sources? Is there a correlation between the degree of a node and its infection
probability? For this question we don’t expect you to run a full analysis. Instead, look up
some of the high probability nodes manually and base your explanation on those.
5 Implementation Details
Input files. The input files are given as .txt files. For a graph with N nodes, the file consists of
N + 2 lines. Nodes are indexed 0 . . . N 1.
• line 1: number of nodes N
• line 2: comma separated integers corresponding to the source ids.
• lines 3 . . . N + 2: line i consists of the comma separated list of neighbor ids of node i 3.
5.1 Part 1: BFS implementation
We ask you to write your own BFS function with multiple source nodes. This is the basic BFS,
without any of the random factors from the infection simulation. Your function will return the
distance/level of each node to the nearest source. Your function will be tested for correctness on
various input graphs through Gradescope.
7
programming_assignment1 7 of 10
starter code. We provide you with starter code in both Python and Java:
• breadth first search.py
• breadth first search.java
These files contain the code to read the content of the .txt files and load it in to predefined
variables. It also contains the code to write the output of your BFS algorithm to a txt file into a
format the autograder will accept. Please do not alter the code already in the files, and do
not change the name of the files, or the autograder will not accept them.
The readGraph(inputfile) function returns the following three variables (with these exact
names):
1. N: An integer representing the number of vertices in the graph.
2. s: List of integers representing the source nodes.
3. adj list: A list of lists in which the 0th element is a list of vertices adjacent to vertex 0, the 1st
element is a list of vertices adjacent to vertex 1, the 2nd element is a list of vertices adjacent
to vertex 2, etc.
Write your own BFS. You will implement the body of the function BFS(N,s, adj list).
Within the function the variable level is initialized, which is an n-sized list (populated with ‘x’s
to serve as placeholders). Your code should perform a BFS search, and it should store the level of
node i in the ith entry of the list level. The function will return the variable level which is then
used to write the output. You do not need to keep track of any information, other then level,
about the execution of BFS , e.g. parents. Please do not change the name of the variable
level, as the starter code will pass its contents to the autograder.
To better illustrate how this works, let’s look at an example. Consider the following graph,
which is based on Figure 3.2 in the book:
0
1 2
3 4
5
6
7
Say we would like to perform BFS starting from node 0. In this case, the variables in the code
will be set as follows:
1. N: 8
2. s: 0
8
programming_assignment1 8 of 10
3. adj list: [[1,2], [0,2,3,4], [0,1,4,6,7], [1,4], [1,2,3,5], [4], [2,7], [2,6]
So your algorithm might set the variable level to the following value, corresponding to the BFS
tree below:
level = [0, 1, 1, 2, 2, 3, 2, 2]
0
1 2
3 4
5
6 7
We will provide you with a file you can use to test your code called input. This will set the
variables N and adj list to represent the same graph as in this example, but the start node s will
be set to 6.
When you have finished writing your algorithm, you may submit your code to the autograder
on Gradescope called Programming Assignment 1, Part 1. After you submit, the autograder
will test your algorithm on several di↵erent graphs and indicate how many points you scored on
each test. You are allowed to edit and resubmit your code as many times as you like before the due
date. You are not allowed to use non-standard Python or Java libraries in your code.
Please do not submit code with “print” statements, as it will cause the autograder to
malfunction. Also, note that the autograder will be checking for plagiarism, so please
do not copy from each other or online resources.
5.2 Part 2: Outbreak simulation
You will write the code to run your experiments based on the workflow described in section 4.2.
We provide you with the following (optional) starter code to use:
• outbreak.py
This code will accept a graph file name and an output file name in the command line, like this:
>> python outbreak.py grid.txt myfile.txt
In this example, the code will then load the data for the grid graph (grid.txt) and write the
output to a file called myfile.txt. If you do not name your file at the command prompt, it will
be named outbreak output.txt.
First, you must copy and paste your BFS function solution into that you wrote for part 1 into
outbreak.py. Then you may write your solution code for part 2 in the function model outbreak (but
feel free to write and use more functions if you would like). The model outbreak function returns
two lists of size N :
• prob infect: the probability that each node ever gets infected
9
programming_assignment1 9 of 10
• avg day: the average day of infection for each node (which can be the string ‘inf’, for
infinity, if it is never infected)
Once you have populated these lists with the solution, the code will print your output file with
the writeOutput function. Each line will represent a node, and the first N lines will have the
probability of infection. Then there will be a single blank line, followed by N lines containing
the average day of infection.
You may use the resulting output file to perform your analysis in section 4.2.
10
programming_assignment1 10 of 10

欢迎咨询51作业君