Final Project
[Full mark: 100; 70% of module grade]
BEE2041: Data Science in Economics
In this project, you will demonstrate your understanding and mastery of programming using data
science tools.

The grade of this project will be calculated out of 100, and it will contribute 70% towards your
overall grade in the module. The following aspects need to be shown:
● Manipulation of different data structures.
● Preparing and preprocessing data.
● Producing high quality visualisations.
● Improving and extending analysis.
● Utilising the above skills to perform high-quality projects.

Your submission will be a compressed file (.zip) containing:
1. A directory/folder called p1_git that contains the files and folders asked for in Problem 1.
2. A copy of your Python script named p2_contagion.ipynb (done in Jupyter Notebook).
Your script must be sufficient to reproduce your answers to Problem 2.
3. A copy of your Python script named p3_warmup.ipynb (done in Jupyter Notebook). Your
script must be sufficient to reproduce your answers to Problem 3.
4. A copy of your Python script named p4_mini_project.ipynb (done in Jupyter Notebook)
that includes code and a textual description, and reads as a blog post.
5. PDF copies of 2 - 4: p2_contagion.pdf, p3_warmup.pdf, and p4_mini_project.pdf 1

The deadline is Monday 26th April at 3:00 PM (BST).

Collaboration: You are encouraged to think about this project with other students or ask each
other for help. If you do, you should do the following: 1) write your own code (no code copying
from others), 2) Report the names of all people that you worked with in your submission, 3) if
you received help from someone, write that explicitly, 4) plagiarism of code or writeup will not
be tolerated. The University takes poor academic practice and academic misconduct very
seriously and expects all students to behave in a manner which upholds the principles of
academic honesty. Please make sure you familiarise yourself with the general guidelines and
rules from this link2 and this link3.


1 See here how to create a PDF from a .ipynb file: https://vle.exeter.ac.uk/mod/forum/discuss.php?d=194496
2 http://as.exeter.ac.uk/academic-policy-standards/tqa-manual/aph/managingacademicmisconduct/
3 https://vle.exeter.ac.uk/pluginfile.php/1794/course/section/27399/A%20Guide%20to%20Citing%2C%20Referencing%20and%20Avoiding%20Plagiarism%20V.2.0%202014.pdf
Problem 1 [10 marks]: Git/GitHub
In this problem, you will demonstrate your ability to use simple commands in Git and GitHub. Do
the following:
A. Create an empty directory p1_dir
B. Initialise a local .git repository inside p1_dir
C. Create a new file file1.txt with one row content: “Hello World!”
D. Commit a .git change with a message “Add a hello world file”
E. Create a second file file2.txt with one row content: “This is a second file.”
F. Commit a .git change with a message “Add a second file”
G. Add a second row to file1.txt: “This is a second line in file1”
H. Commit a .git change with a message “Add a second row to file1”
I. Create a GitHub repository and name it bee2041_p1
J. Push your work to bee2041_p1 repository
K. Copy the link to your GitHub repository bee2041_p1 and paste it in the first Jupyter
Notebook cell of Problem 2
L. Create a .zip file that contains directory p1_dir and its content. Add it to your submission
package.
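Steps A-H can be sketched with the following shell commands (a sketch only: your default branch name, user details, and the remote URL in steps I-J will differ):

```shell
# Problem 1, steps A-H as shell commands
mkdir p1_dir && cd p1_dir                           # A
git init                                            # B
git config user.name "Your Name"                    # skip these two lines
git config user.email "you@example.com"             # if set globally
echo "Hello World!" > file1.txt                     # C
git add file1.txt
git commit -m "Add a hello world file"              # D
echo "This is a second file." > file2.txt           # E
git add file2.txt
git commit -m "Add a second file"                   # F
echo "This is a second line in file1" >> file1.txt  # G
git add file1.txt
git commit -m "Add a second row to file1"           # H
# I-J: after creating bee2041_p1 on GitHub (replace <username>):
# git remote add origin https://github.com/<username>/bee2041_p1.git
# git push -u origin main
```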


Problem 2 [20 marks]: Contagion in networks
In this problem, you will show your understanding of epidemic models and contagion in
networks, a topic we covered in Week 11. While you are not asked to write a simulation from
scratch, you will be adding some code and modifying existing code.
For this problem, we will be creating network models (Erdos-Renyi and Barabasi-Albert), and we
will be using an SIR (Susceptible–Infected–Recovered) epidemic model. In this model, at every
time step, each node can be in one of the following three states:
a. Susceptible: cannot infect others, is not infected, and can be infected
b. Infected: can infect others, is infected, and cannot be double-infected (whatever
that means)
c. Recovered: cannot infect others, is not infected, and cannot be infected (again).
In the initial state of the network, most nodes start as Susceptible and a few (with probability
“probInfectionInit” e.g., 10%) start as Infected. Every iteration (time step), susceptible nodes may
become infected if one of their neighbours is infected. Basically, an Infected node can infect each
of its Susceptible neighbours with a probability “probInfection”. Moreover, every iteration, each
Infected node may recover (thus moves to “Recovered” state, with a probability
“probRecovery”). In this setup, once a node becomes Infected, it cannot return to being
Susceptible again, and once a node becomes Recovered, it cannot return to being Susceptible or
Infected again. Please follow the instructions below.
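The update rule described above can be sketched as follows. This is my own minimal illustration, not the module's P4_1_Contagion.ipynb code (which is structured differently, with global state):

```python
# Minimal SIR step sketch. States: 0 = Susceptible, 1 = Infected, 2 = Recovered.
import random
import networkx as nx

def sir_step(G, state, probInfection=0.2, probRecovery=0.2):
    """One synchronous time step: read from the current state, write all
    infections and recoveries into the next state together."""
    next_state = dict(state)
    for node in G.nodes:
        if state[node] == 1:  # Infected
            for nb in G.neighbors(node):
                # each Susceptible neighbour is infected with probInfection
                if state[nb] == 0 and random.random() < probInfection:
                    next_state[nb] = 1
            # the Infected node recovers with probRecovery
            if random.random() < probRecovery:
                next_state[node] = 2
    return next_state

# Usage: roughly 10% of nodes start Infected (probInfectionInit = 0.1)
G = nx.erdos_renyi_graph(100, 0.1)
state = {n: (1 if random.random() < 0.1 else 0) for n in G.nodes}
state = sir_step(G, state)
```

Note the synchronous update: infections are computed from the current state, so a node infected this step cannot infect its own neighbours until the next step.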
A. Create a copy of the file P4_1_Contagion.ipynb from Week 11 in your directory.4 You will
be making changes on this new copy.

4 Find it here: https://vle.exeter.ac.uk/mod/folder/view.php?id=1904985
B. The setup described above is different from the one in P4_1_Contagion.ipynb. Therefore,
we need to make some changes. Change the following lines to:
max_iterations = 10
probInfectionResidual = 0.0
randomInitialInfection=True
This means we are changing the maximum number of iterations to 10, we are not
allowing agents to be infected randomly other than through their infected neighbours,
and we are specifying that nodes will be chosen randomly to start in the Infected state
at the start of the simulation. These changes render some pieces of code unnecessary.
For example, you may comment out the block of code inside the init() function starting with:
else:
# targetted infection: highest-degree nodes start as infected
You may also comment out the block of code inside the step() function starting with:
# residual infection
but make sure to keep this line uncommented: systemState = nextSystemState
Commenting these blocks out is optional, as long as you have made the changes to the three
lines above. Go ahead and re-run the simulation to make sure it is still working fine.

C. The current simulation creates a 2-d grid network. We want to be able to use other types of
networks, and we want to pass these networks as parameters. We also want to pass the
number of nodes as a parameter to the init() function, and we want to pass another
parameter (call it netParam). This is a network-specific parameter: for Erdos-Renyi, it is
the probability of connection p; for Barabasi-Albert, it is the number of edges a new node
has, m. The init() function line should look something like:
def init(network, nb_agents, netParam, probInfectionInit):
You need to make the necessary changes in the code inside the init() function. Test your
simulation again (passing an ER network with p=0.1). For the sake of this testing alone,
you can add default values for your parameters in init(), like:
def init(network=nx.erdos_renyi_graph, nb_agents=100, netParam=0.1, probInfectionInit=0.1):
(note that in Python a parameter without a default cannot follow one with a default, so
give all four parameters default values).
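A hedged sketch of what the modified init() might look like; the actual notebook keeps its state in global variables, so treat this as an illustration of the signature and logic, not a drop-in replacement:

```python
# Sketch of the new init() for part C (illustration only).
import random
import networkx as nx

def init(network=nx.erdos_renyi_graph, nb_agents=100, netParam=0.1,
         probInfectionInit=0.1):
    # Both generators share the same call shape:
    #   nx.erdos_renyi_graph(n, p)  and  nx.barabasi_albert_graph(n, m)
    g = network(nb_agents, netParam)
    # random initial infection (randomInitialInfection = True):
    # 1 = Infected, 0 = Susceptible
    state = {n: (1 if random.random() < probInfectionInit else 0)
             for n in g.nodes}
    return g, state
```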

D. We now want to minimise the output of this simulation. We don't care about drawing the
network or even plotting the number of Susceptible, Infected, and Recovered nodes at each
time step. Instead, we only care about two things: the total number of Susceptible nodes at
the end of the simulation, and the number of Infected nodes at each time step. We only
need the latter in order to stop the simulation early when the network has no
Infected nodes (note the break command inside the run_simulation() function). To do this,
you can change the collect_statistics() function so that it only returns two values: statS
and nbI. You may comment out the draw() function (we don't need it). Now make the
function run_simulation() return the last element of statS (the number of Susceptible agents
at the end of the simulation). That's the only output we care about from this simulation. Try
your simulation again to make sure it is working as expected.
E. It is now time to comment out all printouts in the code. We don't want anything printed
to the screen (other than the last element of statS). This includes all printouts about the
start of the simulation, which agent got infected, etc. You can also comment out these lines:
# if __name__ == "__main__":
#     run_simulation()
An important thing to note about this simulation is that the outcome is not the same
every time. Obviously, we use probabilities of infection and recovery, but mainly it is
because the group of agents chosen to be infected at the start is chosen randomly
(remember we set randomInitialInfection=True in B). In order to avoid varying
results, we will repeat each simulation multiple times, record the output of each
repetition (i.e., the last element of statS), and calculate the mean. To make this
change, add a loop inside the run_simulation() function. Use a parameter called rep to
specify the number of repetitions.
F. Almost there! We now need to pass all these parameters inside the run_simulation() with
default values. The first line of run_simulation() should now look like:
def run_simulation(nba = 100, maxIter = 10, probI = 0.2, probR = 0.2, probI_init = 0.1,
network = nx.erdos_renyi_graph, netParam=0.1, rep=10):

where nba is the number of agents, maxIter the maximum number of iterations, probI the
probability of infection, probR the probability of recovery, probI_init the probability of
initially infected agents, network the network model, netParam the network parameter, and
rep the number of repetitions.
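The repetition logic from parts E-F can be sketched like this, where one_run is a hypothetical stand-in for a single full simulation that returns the last element of statS:

```python
# Sketch of the repetition loop for parts E-F ('one_run' is hypothetical).
import numpy as np

def run_simulation(one_run, rep=10):
    finals = [one_run() for _ in range(rep)]  # repeat the whole simulation
    return np.mean(finals)                    # average over repetitions
```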
G. Calculate the Percentage of Infected Agents as:

Percentage of Infected Agents = (number of agents − number of Susceptible agents at the end) / number of agents ∗ 100
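In Python, with statS[-1] (here final_susceptible) being the number of Susceptible agents left at the end of the simulation, this is:

```python
def pct_infected(nb_agents, final_susceptible):
    # every node that is not Susceptible at the end was infected at some
    # point (it is now either Infected or Recovered)
    return (nb_agents - final_susceptible) / nb_agents * 100
```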

H. Now you’re ready to produce the following three plots!
a. Use 500 nodes for both networks.
For BA, use m=2. For x-axis, use: x = np.logspace(-3,0,12)
Leave all other values as set by default.


b. Use 500 nodes for both networks. For repetition use rep=20.
For BA, use m=2. For x-axis, use: x = np.logspace(-3,0,12)
Leave all other values as set by default.


c. Use probability of infection probI=0.01 for both networks.
For BA, use m=2. For x-axis, use: x = np.arange(100,1100,100)
Leave all other values as set by default.




Problem 3 [30 marks]: Mini Project – Warmup and Exploration
In this part, you will get a chance to warmup and explore three different data sets before you
choose one of them for the mini project in Problem 4. In each of the following subproblems (3.1,
3.2, and 3.3) you will have two parts: A) reproduce a given plot; B) create a new compelling plot of
your choice. The goal of these two parts is to give you a good chance to explore the data and see
whether you can come up with interesting ideas for it. This will help you choose a data set for your
mini project in Problem 4. You can re-use the plot you make in part B in the mini project (if you
end up choosing that data set). Remember that you need to invest more time in the mini project.
You will receive marks for any progress you make (even if you don't end up reproducing the plot
in part A).

Problem 3.1 [14 marks]: COVID-19 – Our World In Data
In this problem, you will use a data set named owid-covid-data.csv (see folder data), from the “Our
World in Data” website.5 The data covers COVID-19 cases in all countries on every day during
the last year and this year, up to 23 March 2021. The website provides many visualisations related
to COVID in different countries and over days.6 The data you will find in data folder is downloaded
from this link which has more information about the data.7
A. Use the owid-covid-data.csv file to reproduce the plot shown below, as accurately as
possible. The upper panel contains both solid lines and dashed lines. The dashed lines are
the smoothed version (column ‘new_cases_smoothed_per_million’). The remaining
parts are self-explanatory. You will get marks for reproducing the plot as accurately as
possible, taking into consideration the steps undertaken to reach the final figure.
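A common first step is to pivot the long OWID table so that each country becomes a column. The 'location' and 'date' column names come from the OWID codebook; which countries you plot is your choice and must be matched to the target figure:

```python
# Hedged sketch of reshaping the OWID data for the line plot in part A.
import pandas as pd

def cases_wide(df, countries, col="new_cases_per_million"):
    # pass col="new_cases_smoothed_per_million" for the dashed (smoothed)
    # lines in the upper panel
    sub = df[df["location"].isin(countries)]
    return sub.pivot(index="date", columns="location", values=col)
```

The solid and dashed lines can then be drawn from two calls to cases_wide() with the two column names.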

B. Use the data set to perform a compelling extra analysis. You will get marks if you create
a compelling and interesting visualisation and/or analysis. One plot (or one piece of
analysis) is enough. Please provide 1-2 sentences to explain your interesting analysis.
Write it in a separate cell inside Jupyter (using Markdown). You are allowed to re-use this
analysis in the mini project if you end up choosing this data set.


5 https://ourworldindata.org
6 see this for example about vaccination progress in different countries: https://ourworldindata.org/covid-vaccinations
7 See this for general info: https://github.com/owid/covid-19-data/tree/master/public/data and this for codebook:
https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv
Problem 3.2 [8 marks]: Game of Thrones – A Song of Ice and Data
In this problem, you will use a data set about Game of Thrones characters (see GoT_character.csv
in data folder). Read more about this data here.8
A. Use the GoT_character.csv file to produce a relationship network of GoT characters. The
final network in Gephi may look something like below (which is not great looking
admittedly). You can use other tools to visualise the network, but you need to create two
files:
1) a node list file: each node is a GoT character. It has three columns: Id (you create it),
Label (the name of the character), and isNoble (whether it is noble or not).
2) an edge list file: each directed edge is one of four types of relationships: mother (x → y:
x is the mother of y), father (x → y: x is the father of y), heir (x → y: x is the heir of y), and
spouse (x → y and y → x: x is the spouse of y and vice versa). It has 6 columns: Id (you create
it), Source, Target, Type (=‘Directed’ for all), Label (mother, father, heir, or spouse), and
Weight (=1 for all). Spouse relationships are represented as two rows in this data frame
(Source: x, Target: y and Source: y, Target: x). Make sure that all nodes specified in the
columns “mother”, “father”, “heir”, and “spouse” exist in the node list file (you can
set their isNoble value to match the value of their relative). Colour nodes using the isNoble
variable. Show node labels and edge labels.
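Building the two files with pandas might look like the sketch below. The column names ('name', 'mother', 'father', 'heir', 'spouse', 'isNoble') are assumptions to check against GoT_character.csv, and as a simplification this sketch skips relatives missing from the node list, whereas the brief asks you to add them (copying their relative's isNoble value):

```python
# Hedged sketch of building the Gephi node and edge lists with pandas.
import pandas as pd

def build_gephi_lists(df):
    # node list: Id, Label, isNoble
    nodes = pd.DataFrame({"Label": df["name"], "isNoble": df["isNoble"]})
    nodes.insert(0, "Id", range(len(nodes)))
    name_to_id = dict(zip(nodes["Label"], nodes["Id"]))

    # edge list: directed x -> y edges, e.g. 'mother' means x is y's mother
    rows = []
    for rel in ["mother", "father", "heir", "spouse"]:
        for _, r in df.dropna(subset=[rel]).iterrows():
            if r[rel] not in name_to_id:
                continue  # simplification: see lead-in above
            src, tgt = name_to_id[r[rel]], name_to_id[r["name"]]
            rows.append((src, tgt, "Directed", rel, 1))
            if rel == "spouse":  # symmetric: also add the reverse edge
                rows.append((tgt, src, "Directed", rel, 1))
    edges = pd.DataFrame(rows, columns=["Source", "Target", "Type",
                                        "Label", "Weight"])
    edges.insert(0, "Id", range(len(edges)))
    return nodes, edges

# Usage (paths are assumptions):
# df = pd.read_csv("data/GoT_character.csv")
# nodes, edges = build_gephi_lists(df)
# nodes.to_csv("node_list.csv", index=False)
# edges.to_csv("edge_list.csv", index=False)
```

If the CSV lists a spouse on both partners' rows, this produces duplicate spouse edges, which you may want to de-duplicate.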

B. Use the data set to perform a compelling extra analysis (not necessarily related to
networks). You will get marks if you create a compelling and interesting visualisation
and/or analysis. One plot (or one piece of analysis) is enough. Please provide 1-2

8 https://data.world/data-society/game-of-thrones (start from “Finally,..”)
sentences to explain your interesting analysis. Write it in a separate cell inside Jupyter
(using Markdown). You are allowed to re-use this analysis in the mini project if you end
up choosing this data set.


Problem 3.3 [8 marks]: Famous People – Pantheon
In this problem, you will use a data set about “famous people” (see person_2020_update.csv file
in data folder). The data contains famous individuals with information about time and location
of their birth and death as well as other information like occupation and how “memorable” they
are. The data is downloaded from the Pantheon project.9

A. Use the person_2020_update.csv file to reproduce the plot shown below, as
accurately as possible. You are free to use any geographic map library. I used the
Python library Basemap. There is a section about it in the Data Science book we
studied from.10 You can also read more about it here.11 Before you plot, make sure
you only consider individuals who were born strictly after 1920, and have non-null
values for the longitude and latitude coordinates of their birth and death locations.
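The filtering step might look like the sketch below; the column names ('birthyear', 'bplace_lon', 'bplace_lat', 'dplace_lon', 'dplace_lat') are assumptions to verify against the Pantheon codebook:

```python
# Hedged sketch of the pre-plot filtering for part A.
import pandas as pd

def filter_people(df):
    # keep people born strictly after 1920 with non-null birth and death
    # coordinates (column names are assumptions -- check the codebook)
    coords = ["bplace_lon", "bplace_lat", "dplace_lon", "dplace_lat"]
    return df[df["birthyear"] > 1920].dropna(subset=coords)
```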

Note: Basemap is not usually installed by default with Anaconda. The recommended way to
install it is using the following in bash:
$ conda create --name basemap-env python=3.6 basemap
$ conda activate basemap-env
This will create an environment called basemap-env. If all goes well, you can do the following
in Python:
import os
os.environ['PROJ_LIB'] = '/Users/edmondawad/opt/anaconda3/envs/basemap-env/share/proj'
from mpl_toolkits.basemap import Basemap
where you have to replace the path above with your own.




9 You can explore data here (using different filters): https://pantheon.world/explore/rankings?show=people&years=-3501,2020
10 https://jakevdp.github.io/PythonDataScienceHandbook/04.13-geographic-data-with-basemap.html
11 https://matplotlib.org/basemap/users/geography.html


B. Use the data set to perform a compelling extra analysis (not necessarily related to
geographic maps). You will get marks if you create a compelling and interesting
visualisation and/or analysis. One plot (or one piece of analysis) is enough. Please
provide 1-2 sentences to explain your interesting analysis. Write it in a separate cell
inside Jupyter (using Markdown). You are allowed to re-use this analysis in the mini
project if you end up choosing this data set.


Problem 4 [40 marks]: Mini Project – A Data-driven Blog Post
For this mini project, you will need to write a data-driven blog post.

Structure: There are tons of tutorials and examples online. Here’s a good guide (you can ignore
some parts that are not relevant for this project).
https://playbook.datopian.com/dojo/writing-a-data-oriented-blog-post/#_11-simple-steps-to-
create-data-driven-blog-posts
Here are some good examples of data-driven blog posts. Let’s start with some fancy ones by The
Pudding:
https://pudding.cool/2017/08/the-office/
https://pudding.cool/2017/08/screen-direction/
Check their website, they have some interactive posts as well: https://pudding.cool
You are not expected to write a post to this level of quality (certainly not an interactive one), but
you may consider these as the upper limit. Nevertheless, you may find their tutorial useful:
https://pudding.cool/process/how-to-make-dope-shit-part-1
To be fair, you may aim to write a post that’s more similar to these ones:
https://buffer.com/resources/social-media-language/
https://datahub.io/blog/automated-kpis-collection-and-visualization-of-the-funnels

Size: Your post should be between 500 and 2000 words, and it should contain 4-7 pieces of output
(e.g., plots, tables, summary statistics).

Code/Platform: You will need to write your post in Jupyter Notebook (using Python), where you
include Python code, output (plots, tables, statistics, analysis), and text (change cell type to
“Markdown” for text). Markdown cells use the Markdown language (very simple; look it up). See
below how a cell looks when it is being edited vs. when it is run.






Data: You will have to choose one topic to write your post about. You will choose one of the three
data sets you played with in Problem 3 (hence the exploration task). While you don’t have to,
you can add new data sets, as long as they’re still about the same topic (e.g., if you decide to use
the famous people data set, you may also add a data set that contains the number of Twitter
followers for some of these famous people). Alternatively, you may choose to do analysis using
simulations that extend the work in Problem 2 (contagion in networks). If you would like to use
data from outside these, that’s also acceptable, but you will need to consult with me first. The
most important thing is that your post should have one topic/theme. For example, don’t include
plots from Game of Thrones and COVID-19 together (unless you can come up with a compelling
connection).

Submission: In addition to submitting your post as Jupyter and PDF files (as specified above),
you should also consider submitting a link to your files on GitHub, and a link to your online post
(You can upload your post online without “publishing” it; so it won’t be archived by search
engines). One way to do this is using Hackmd (https://hackmd.io). You don’t need to create an
account; you can login using your GitHub account. In order to upload your Jupyter notebook file,
you need to download it as a Markdown file. This will create a .zip file which contains one .md
file and a set of your plots. Here's an example of how it looks on HackMD (this is a tutorial
that I gave for BEE1038 last year and this year):
https://hackmd.io/@ZAIixDaKTDe2glG9TRGiwA/rkEwwmyrd
Here's another example by someone else: https://hackmd.io/@linnil1/Sy0p1s9ZX

Audience: You should assume that the audience of your post has some knowledge and interest
about the topic (e.g., for Game of Thrones, you may assume they have watched some episodes),
but they have no idea about the data set. Your audience is well educated and has basic
knowledge comparable to your university's 2nd-year students who have never coded.

Assessment: Your blog post will receive a holistic mark, based on the following rubric:
Primary:
• Insightfulness: is the analysis insightful and compelling? Are there some interesting,
thought-provoking, and/or surprising findings?
• Soundness: Is the analysis sound? Are all comparisons fair (not comparing apples to
oranges)? Was proper filtering done? Are plots used appropriately based on data types
(e.g., using bar plots for categorical data points, scatter plots for independent data points,
lines for time series)?
• Presentation: Is the narrative coherent? Does it all read as one story? Is it easy to
understand the post? Is it easy to follow the ideas and the main story? Are there
appropriate (and readable) labels, legends, and captions provided for plots?

Secondary:
• Visual appeal: are the plots visually appealing (beautiful, appropriate colours, non-trivial
yet simple plots)?
• Pre-processing: Has there been an appropriate level of pre-processing of data? Is the code
easy to follow? Is the code efficient? Did the student put some effort in cleaning or re-
shaping data?
• Practice: Did the student use Git/GitHub or any other version control while preparing this
post? Did the student upload their post to Hackmd or a similar online platform?
