Final Project [Full mark: 100; 70% of module grade]
BEE2041: Data Science in Economics

In this project, you will demonstrate your understanding and mastery of programming using data science tools. The project is marked out of 100 and contributes 70% towards your overall grade in the module. You will need to demonstrate the following:
● Manipulation of different data structures.
● Preparing and preprocessing data.
● Producing high-quality visualisations.
● Improving and extending analysis.
● Utilising the above skills to produce a high-quality project.

Your submission will be a compressed file (.zip) containing:
1. A directory/folder called p1_git that contains the files and folders asked for in Problem 1.
2. A copy of your Python script named p2_contagion.ipynb (done in Jupyter Notebook). Your script must be sufficient to reproduce your answers to Problem 2.
3. A copy of your Python script named p3_warmup.ipynb (done in Jupyter Notebook). Your script must be sufficient to reproduce your answers to Problem 3.
4. A copy of your Python script named p4_mini_project.ipynb (done in Jupyter Notebook) that includes code and textual description, and reads as a blog post.
5. PDF copies of 2-4: p2_contagion.pdf, p3_warmup.pdf, and p4_mini_project.pdf.1

The deadline is Monday 26th April at 3:00 PM (BST).

Collaboration: You are encouraged to think about this project with other students or ask each other for help. If you do, you should do the following: 1) write your own code (no copying code from others); 2) report the names of all people you worked with in your submission; 3) if you received help from someone, state that explicitly; 4) plagiarism of code or write-up will not be tolerated. The University takes poor academic practice and academic misconduct very seriously and expects all students to behave in a manner which upholds the principles of academic honesty.
Please make sure you familiarise yourself with the general guidelines and rules from this link2 and this link3.

1 See here how to create a PDF from a .ipynb file: https://vle.exeter.ac.uk/mod/forum/discuss.php?d=194496
2 http://as.exeter.ac.uk/academic-policy-standards/tqa-manual/aph/managingacademicmisconduct/
3 https://vle.exeter.ac.uk/pluginfile.php/1794/course/section/27399/A%20Guide%20to%20Citing%2C%20Referencing%20and%20Avoiding%20Plagiarism%20V.2.0%202014.pdf

Problem 1 [10 marks]: Git/GitHub

In this problem, you will demonstrate your ability to use simple commands in Git and GitHub. Do the following:
A. Create an empty directory p1_dir.
B. Initialise a local .git repository inside p1_dir.
C. Create a new file file1.txt with one row of content: "Hello World!"
D. Commit the change with the message "Add a hello world file".
E. Create a second file file2.txt with one row of content: "This is a second file."
F. Commit the change with the message "Add a second file".
G. Add a second row to file1.txt: "This is a second line in file1".
H. Commit the change with the message "Add a second row to file1".
I. Create a GitHub repository and name it bee2041_p1.
J. Push your work to the bee2041_p1 repository.
K. Copy the link to your GitHub repository bee2041_p1 and paste it into the first Jupyter Notebook cell of Problem 2.
L. Create a .zip file that contains the directory p1_dir and its content. Add it to your submission package.

Problem 2 [20 marks]: Contagion in networks

In this problem, you will show your understanding of epidemic models and contagion in networks, a topic we covered in Week 11. While you are not asked to write a simulation from scratch, you will be adding some code and modifying existing code. For this problem, we will be creating network models (Erdos-Renyi and Barabasi-Albert), and we will be using an SIR (Susceptible–Infected–Recovered) epidemic model. In this model, at every time step, each node is in one of the following three states:
a. Susceptible: cannot infect others, is not infected, and can be infected.
b. Infected: can infect others, is infected, and cannot be double-infected (whatever that means).
c. Recovered: cannot infect others, is not infected, and cannot be infected (again).

In the initial state of the network, most nodes start as Susceptible and a few (with probability "probInfectionInit", e.g., 10%) start as Infected. At every iteration (time step), Susceptible nodes may become infected if one of their neighbours is Infected: an Infected node infects each of its Susceptible neighbours with probability "probInfection". Moreover, at every iteration, each Infected node may recover (and thus move to the Recovered state) with probability "probRecovery". In this setup, once a node becomes Infected, it cannot return to being Susceptible, and once a node becomes Recovered, it cannot return to being Susceptible or Infected.

Please follow the instructions below.

A. Create a copy of the file P4_1_Contagion.ipynb from Week 11 in your directory.4 You will be making changes to this new copy.

4 Find it here: https://vle.exeter.ac.uk/mod/folder/view.php?id=1904985

B. The setup described above is different from the one in P4_1_Contagion.ipynb, so we need to make some changes. Change the following lines to:

max_iterations = 10
probInfectionResidual = 0.0
randomInitialInfection = True

This means we are changing the maximum number of iterations to 10, we are not allowing agents to be infected randomly other than through their infected neighbours, and we are specifying that the nodes that start in the Infected state are chosen randomly at the start of the simulation. The effect of these changes is to render some pieces of code unnecessary.
For example, you might as well comment out the block of code inside the init() function starting with:

else: # targetted infection: highest-degree nodes start as infected

You may also comment out the block of code inside the step() function starting with:

# residual infection

But make sure to keep this line uncommented:

systemState = nextSystemState

You don't actually need to comment these blocks out, as long as you make the changes to the three lines above. Go ahead and re-run the simulation to make sure it is still working fine.

C. The current simulation creates a 2-D grid network. We want to be able to use other types of networks, and we want to pass these networks as parameters. We also want to pass the number of nodes as a parameter to the init() function, along with another parameter (call it netParam). This is a network-specific parameter: for Erdos-Renyi it is the probability of connection p; for Barabasi-Albert it is the number of edges a new node has, m. The first line of the init() function should look something like:

def init(network, nb_agents, netParam, probInfectionInit):

You need to make the necessary changes in the code inside the init() function. Test your simulation again (passing an ER network with p=0.1). For the sake of this testing alone, you can add default values for your parameters in init(). Note that Python does not allow a parameter without a default to follow one with a default, so give every parameter a default, like:

def init(network=nx.erdos_renyi_graph, nb_agents=100, netParam=0.1, probInfectionInit=0.1):

D. We want to start minimising the output of this simulation. We don't care about drawing the network, or even about plotting the number of Susceptible, Infected, and Recovered nodes at each time step. Instead, we only care about two things: the total number of Susceptible nodes at the end of the simulation, and the number of Infected nodes at each time step. We only need the latter in order to make an early stop of the simulation when the network has no Infected nodes (note the break command inside the run_simulation() function).
To do this, you can change the collect_statistics() function so that it only returns two values: statS and nbI. You may comment out the draw() function (we don't need it). Now make the run_simulation() function return the last element of statS (the number of Susceptible agents at the end of the simulation). That's the only output we care about from this simulation. Try your simulation again to make sure it is working as expected.

E. It is now time to comment out all printouts in the code. We don't want anything printed to the screen (other than the last element of statS). This includes all code about the start of the simulation, which agent got infected, etc. You can also comment out these lines:

# if __name__ == "__main__":
#     run_simulation()

An important thing to note about this simulation is that the outcome is not the same each time. Partly, this is because we use probabilities of infection and recovery; but mainly, it is because the group of agents infected at the start is chosen randomly (remember we set randomInitialInfection = True in B). To avoid varying results, we will repeat each simulation multiple times, compute the output of each repetition (i.e., the last element of statS), and take the mean. To make this change, add a loop inside the run_simulation() function. Use a parameter called rep to specify the number of times we want to repeat the simulation.

F. Almost there! We now need to pass all these parameters into run_simulation() with default values.
The first line of run_simulation() should now look like:

def run_simulation(nba=100, maxIter=10, probI=0.2, probR=0.2, probI_init=0.1, network=nx.erdos_renyi_graph, netParam=0.1, rep=10):

where nba is the number of agents, maxIter the maximum number of iterations, probI the probability of infection, probR the probability of recovery, probI_init the probability of being initially infected, network the network model, netParam the network parameter, and rep the number of repetitions.

G. Calculate the Percentage of Infected Agents as:

Percentage of Infected Agents = (nba − final number of Susceptible agents) / nba × 100

(Since agents can only leave the Susceptible state by becoming Infected, this is the percentage of agents that were ever infected.)

H. Now you're ready to produce the following three plots!
a. Use 500 nodes for both networks. For BA, use m=2. For the x-axis, use x = np.logspace(-3,0,12). Leave all other values at their defaults.
b. Use 500 nodes for both networks. For the repetitions, use rep=20. For BA, use m=2. For the x-axis, use x = np.logspace(-3,0,12). Leave all other values at their defaults.
c. Use probability of infection probI=0.01 for both networks. For BA, use m=2. For the x-axis, use x = np.arange(100,1100,100). Leave all other values at their defaults.

Problem 3 [30 marks]: Mini Project – Warmup and Exploration

In this part, you will get a chance to warm up and explore three different data sets before you choose one of them for the mini project in Problem 4. Each of the following subproblems (3.1, 3.2, and 3.3) has two parts: A) reproduce a plot; B) create a new, compelling plot of your choice. The goal of these two parts is to give you a good chance to explore the data and see whether you can come up with interesting ideas for it. This will help you make a choice for your mini project in Problem 4. You can re-use the plot you make in part B in the mini project (if you end up choosing that data set). Remember that you need to invest more time in the mini project. You will receive marks for any progress you make (even if you don't end up reproducing the plot in part A).
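For reference, the simulation described in Problem 2 (steps B–F) can be condensed into one self-contained function. This is a non-authoritative sketch: it assumes networkx for the network models, and it collapses the notebook's init(), step(), and collect_statistics() structure into a single loop rather than reproducing the actual file.

```python
import random
import networkx as nx

def run_simulation(nba=100, maxIter=10, probI=0.2, probR=0.2,
                   probI_init=0.1, network=nx.erdos_renyi_graph,
                   netParam=0.1, rep=10):
    """Mean final number of Susceptible agents over `rep` repetitions."""
    finals = []
    for _ in range(rep):
        g = network(nba, netParam)
        # randomInitialInfection: each node starts Infected with prob probI_init
        state = {n: ('I' if random.random() < probI_init else 'S')
                 for n in g.nodes}
        for _ in range(maxIter):
            if sum(1 for s in state.values() if s == 'I') == 0:
                break  # early stop: no Infected nodes left
            next_state = dict(state)  # synchronous update (nextSystemState)
            for n, s in state.items():
                if s != 'I':
                    continue
                # infect each Susceptible neighbour with probability probI
                for m in g.neighbors(n):
                    if state[m] == 'S' and random.random() < probI:
                        next_state[m] = 'I'
                # recover with probability probR
                if random.random() < probR:
                    next_state[n] = 'R'
            state = next_state
        finals.append(sum(1 for s in state.values() if s == 'S'))
    return sum(finals) / rep
```

For plot H.a, for example, one would call run_simulation(nba=500, probI=p) for each p in np.logspace(-3, 0, 12) and convert the returned mean into a Percentage of Infected Agents via step G.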
Problem 3.1 [14 marks]: COVID-19 – Our World In Data

In this problem, you will use a data set named owid-covid-data.csv (see the data folder) from the "Our World in Data" website.5 The data covers COVID-19 cases in all countries on every day during last year and this year, up to 23 March 2021. The website provides many visualisations related to COVID in different countries and over time.6 The file you will find in the data folder was downloaded from this link, which has more information about the data.7

A. Use the owid-covid-data.csv file to reproduce the plot shown below, as accurately as possible. The upper panel contains both solid lines and dashed lines. The dashed lines are the smoothed version (column 'new_cases_smoothed_per_million'). The remaining parts are self-explanatory. You will get marks for reproducing the plot as accurately as possible, taking into consideration the steps undertaken to reach the final figure.

B. Use the data set to perform a compelling extra analysis. You will get marks if you create a compelling and interesting visualisation and/or analysis. One plot (or one piece of analysis) is enough. Please provide 1-2 sentences to explain your analysis. Write it in a separate cell inside Jupyter (using Markdown). You are allowed to re-use this analysis in the mini project if you end up choosing this data set.

5 https://ourworldindata.org
6 See this, for example, about vaccination progress in different countries: https://ourworldindata.org/covid-vaccinations
7 See this for general info: https://github.com/owid/covid-19-data/tree/master/public/data and this for the codebook: https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv

Problem 3.2 [8 marks]: Game of Thrones – A Song of Ice and Data

In this problem, you will use a data set about Game of Thrones characters (see GoT_character.csv in the data folder). Read more about this data here.8

A.
Use the GoT_character.csv file to produce a relationship network of GoT characters. The final network in Gephi may look something like the one below (which, admittedly, is not great looking). You can use other tools to visualise the network, but you need to create two files:
1) A node list file: each node is a GoT character. It has three columns: Id (you create it), Label (the name of the character), and isNoble (whether the character is noble or not).
2) An edge list file: each directed edge is one of four types of relationship: mother (x y: x is the mother of y), father (x y: x is the father of y), heir (x y: x is the heir of y), and spouse (x y and y x: x is a spouse of y and vice versa). It has 6 columns: Id (you create it), Source, Target, Type (='Directed' for all), Label (mother, father, heir, or spouse), and Weight (=1 for all). Spouse relationships are represented as two rows in this data frame (Source: x, Target: y and Source: y, Target: x).

Make sure that all nodes specified in the columns "mother", "father", "heir", and "spouse" exist in the node list file (you can set their isNoble value to match that of their relative). Colour nodes using the isNoble variable. Show node labels and edge labels.

B. Use the data set to perform a compelling extra analysis (not necessarily related to networks). You will get marks if you create a compelling and interesting visualisation and/or analysis. One plot (or one piece of analysis) is enough. Please provide 1-2 sentences to explain your analysis. Write it in a separate cell inside Jupyter (using Markdown). You are allowed to re-use this analysis in the mini project if you end up choosing this data set.

8 https://data.world/data-society/game-of-thrones (start from "Finally,..")

Problem 3.3 [8 marks]: Famous People – Pantheon

In this problem, you will use a data set about "famous people" (see the person_2020_update.csv file in the data folder).
The data contains famous individuals, with information about the time and location of their birth and death, as well as other information such as occupation and how "memorable" they are. The data is downloaded from the Pantheon project.9

A. Use the person_2020_update.csv file to reproduce the plot shown below, as accurately as possible. You are free to use any geographic map library. I used the Python library Basemap. There is a section about it in the Data Science book we studied from.10 You can also read more about it here.11 Before you plot, make sure you only consider individuals who were born strictly after 1920 and have non-null values for the longitude and latitude coordinates of their birth and death locations.

Note: Basemap is not usually installed by default with Anaconda. The recommended way to install it is using the following in bash:

$ conda create --name basemap-env python=3.6 basemap
$ conda activate basemap-env

This will create an environment called basemap-env. If all goes well, you can do the following in Python:

import os
os.environ['PROJ_LIB'] = '/Users/edmondawad/opt/anaconda3/envs/basemap-env/share/proj'
from mpl_toolkits.basemap import Basemap

where you have to replace the path above with yours.

9 You can explore the data here (using different filters): https://pantheon.world/explore/rankings?show=people&years=-3501,2020
10 https://jakevdp.github.io/PythonDataScienceHandbook/04.13-geographic-data-with-basemap.html
11 https://matplotlib.org/basemap/users/geography.html

B. Use the data set to perform a compelling extra analysis (not necessarily related to geographic maps). You will get marks if you create a compelling and interesting visualisation and/or analysis. One plot (or one piece of analysis) is enough. Please provide 1-2 sentences to explain your analysis. Write it in a separate cell inside Jupyter (using Markdown). You are allowed to re-use this analysis in the mini project if you end up choosing this data set.
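The filtering step in part A of Problem 3.3 might look like the sketch below in pandas. The column names (birthyear, bplace_lat, etc.) are assumptions about the Pantheon file and should be checked against the actual CSV header.

```python
import pandas as pd

# Assumed Pantheon column names -- verify against the CSV header.
COORD_COLS = ["bplace_lat", "bplace_lon", "dplace_lat", "dplace_lon"]

def filter_people(df, year=1920, coord_cols=COORD_COLS):
    """Keep people born strictly after `year` with non-null coordinates."""
    mask = (df["birthyear"] > year) & df[coord_cols].notna().all(axis=1)
    return df[mask]

# Usage (path as in the data folder):
# people = filter_people(pd.read_csv("data/person_2020_update.csv"))
```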
Problem 4 [40 marks]: Mini Project – A Data-driven Blog Post

For this mini project, you will need to write a data-driven blog post.

Structure: There are tons of tutorials and examples online. Here's a good guide (you can ignore the parts that are not relevant for this project):
https://playbook.datopian.com/dojo/writing-a-data-oriented-blog-post/#_11-simple-steps-to-create-data-driven-blog-posts

Here are some good examples of data-driven blog posts. Let's start with some fancy ones by The Pudding:
https://pudding.cool/2017/08/the-office/
https://pudding.cool/2017/08/screen-direction/

Check their website; they have some interactive posts as well: https://pudding.cool

You are not expected to write a post at this level of quality (certainly not an interactive one), but you may consider these the upper limit. Nevertheless, you may find their tutorial useful:
https://pudding.cool/process/how-to-make-dope-shit-part-1

To be fair, you may aim to write a post more similar to these:
https://buffer.com/resources/social-media-language/
https://datahub.io/blog/automated-kpis-collection-and-visualization-of-the-funnels

Size: Your post should have between 500 and 2000 words, and it should contain 4-7 pieces of output (e.g., plots, tables, summary statistics).

Code/Platform: You will need to write your post in Jupyter Notebook (using Python), where you include Python code, output (plots, tables, statistics, analysis), and text (change the cell type to "Markdown" for text). Markdown cells use the Markdown language (very simple; look it up). See below how a cell looks when it is being edited vs. when it is run.

Data: You will have to choose one topic to write your post about. You will choose one of the three data sets you played with in Problem 3 (hence the exploration task).
While you don't have to, you can add new data sets, as long as they're still about the same topic (e.g., if you decide to use the famous people data set, you may also add a data set that contains the number of Twitter followers for some of these famous people). Alternatively, you may choose to do an analysis using simulations that extend the work in Problem 2 (contagion in networks). If you would like to use data from outside these, that's also acceptable, but you will need to consult with me first. The most important thing is that your post should have one topic/theme. For example, don't include plots from Game of Thrones and COVID-19 together (unless you can come up with a compelling connection).

Submission: In addition to submitting your post as Jupyter and PDF files (as specified above), you should also consider submitting a link to your files on GitHub, and a link to your online post (you can upload your post online without "publishing" it, so it won't be indexed by search engines). One way to do this is using HackMD (https://hackmd.io). You don't need to create an account; you can log in using your GitHub account. In order to upload your Jupyter Notebook file, you need to download it as a Markdown file. This will create a .zip file which contains one .md file and a set of your plots. Here's an example of how it will look on HackMD (this is a tutorial that I gave for BEE1038 last year and this year):
https://hackmd.io/@ZAIixDaKTDe2glG9TRGiwA/rkEwwmyrd
Here's another example, by someone else: https://hackmd.io/@linnil1/Sy0p1s9ZX

Audience: You should assume that the audience of your post has some knowledge of and interest in the topic (e.g., for Game of Thrones, you may assume they have watched some episodes), but that they have no idea about the data set. Your audience is well educated, with basic knowledge comparable to your university's 2nd-year students who have never coded.
Assessment: Your blog post will receive a holistic mark, based on the following rubrics:

Primary:
• Insightfulness: Is the analysis insightful and compelling? Are there some interesting, thought-provoking, and/or surprising findings?
• Soundness: Is the analysis sound? Are all comparisons fair (not comparing apples to oranges)? Was proper filtering done? Are plots used appropriately based on data types (e.g., bar plots for categorical data, scatter plots for independent data points, lines for time series)?
• Presentation: Is the narrative coherent? Does it all read as one story? Is the post easy to understand? Is it easy to follow the ideas and the main story? Are appropriate (and readable) labels, legends, and captions provided for plots?

Secondary:
• Visual appeal: Are the plots visually appealing (beautiful, appropriate colours, non-trivial yet simple plots)?
• Pre-processing: Has there been an appropriate level of pre-processing of the data? Is the code easy to follow? Is the code efficient? Did the student put some effort into cleaning or re-shaping the data?
• Practice: Did the student use Git/GitHub or any other version control while preparing this post? Did the student upload their post to HackMD or a similar online platform?
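As a pointer for the Practice criterion above, the Git workflow from Problem 1 (steps A–H) can be sketched in the shell as follows. This is a sketch: the git config lines are an addition for machines with no Git identity configured, and the remote URL in steps I–J is a placeholder to replace with your own bee2041_p1 repository.

```shell
mkdir p1_dir
cd p1_dir
git init
git config user.name "Your Name"         # only needed if not set globally
git config user.email "you@example.com"
echo 'Hello World!' > file1.txt
git add file1.txt
git commit -m "Add a hello world file"
echo 'This is a second file.' > file2.txt
git add file2.txt
git commit -m "Add a second file"
echo 'This is a second line in file1' >> file1.txt
git add file1.txt
git commit -m "Add a second row to file1"
# Steps I-J (placeholder URL -- replace with your own repository):
# git remote add origin https://github.com/<username>/bee2041_p1.git
# git push -u origin main
```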