程序代写案例-DATA2001/DATA2901

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

School of Computer Science
Uwe Roehm
DATA2001/DATA2901: Data Science, Big Data, and Data Diversity 1.Sem./2021
Practical Assignment: Bushfire Risk Analysis
Group Assignment (20%) 14.05.2021
Introduction
In this practical assignment of DATA2001/DATA2901 you are asked to gather and integrate several
datasets to perform a data analysis of the bushfire risk of different neighbourhoods in Sydney.
You find links to online documentation, data, and hints on tools and schema needed for this
assignment in the ’Assignments’ section in Canvas.
Note: Please keep apprised of updates on the FAQ / Clarification page on Ed: https://
edstem.org/courses/5592/discussion/462995. Any clarification we issue until 23:59 AEST
21st May 2021 should be considered part of the assignment specification - these will mostly
be updates regarding available resources and information such as Milestone 1 requirements
Disclaimer: This assignment is mainly about data integration. Note that the age and varying
quality of the provided data do not allow to reliably assess the actual bushfire risk.
Data Set Description and Preparation
Your task in this assignment is to calculate a bushfire risk score with regard to bushfire protection
for different neighbourhoods in Sydney. The neighbourhood ’score’ is expressed as a measure
of several factors which we assume to affect the risk of bush fires within an area — vegetation,
population density, number of dwellings etc.
In order to calculate this score, you will need to integrate different data sources. As a starting
point, we provide you with some census-based datasets which give you input on at least three
factors: population density, dwelling and business locations. We also provide some spatial data
with the vegetation and risk categories provided by the NSW Rural Fire Service. We leave it up-
to you to integrate further data and to refine the suggested risk score. Some ideas would be the
availability of specific emergency services, or the prevalence of waterways etc.
Based on your computed risk scores, perform then a correlation analysis against the ABS pro-
vided median income and median rent costs of each neighbourhood.
Your submission should consist of your Jupyter notebook that you used for integrating the data
sets and for performing and visualising your analysis.
Milestone 1: Load and integrate the provided datasets into the university provided PostgreSQL
database by the tutorials in Week 11.
Provided datasets: We provide in Canvas several CSV files with Statistical Area 2 (SA2) data
from the Australian Bureau of Statistics (ABS), as well as some bush fire prone land vegetation
spatial data from the NSW Rural Fire Service (keep checking Canvas/Ed for any later additions or
updates at https://edstem.org/courses/5592/discussion/462995):
1
StatisticalAreas.csv: area id, area name, parent area id
Neighbourhoods.csv: area id, area name, land area, population, dwellings, businesses, median income,
avg monthly rent, bounding box
BusinessStats.csv: area id, number of businesses, accommodation and food, retail trade,
agriculture forestry and fishing, health care and social assistance,
public administration and safety, transport postal and warehousing
RFSNSW BFPL.shp: gid, category, shape leng, shape area, geom
Linked datasets: We have also linked the SA2 boundary data in Canvas. This gives you a direct
download link to a zip archive data provided by the ABS. This archive extracts to a shapefile and
it’s associated metadata. For reference, we provide the columns of the shapefile once it has been
loaded into the database
SA2 2016 AUST.shp: g id, sa2 main16, sa2 5dig16, sa2 name16, sa3 code, sa3 name, sa4 code, sa4 name,
gcc code, gcc name, ste code, ste name, areasqkm16, geom, geom2
Milestone 1 data loading requirements For the purposes of Milestone, we require that at least
one member per team load the provided and linked data into the university provided postgres
server (hosted at soitpw11d59.shared.sydney.edu.au).
We require that the tables be put in the public (default) schema and the lowercase version of
their filenames:
Filename Tablename
StatisticalAreas.csv statisticalareas
Neighbourhoods.csv neighbourhoods
BusinessStats.csv businessstats
RFSNSW BFPL.shp rfsnsw bfpl
SA2 2016 AUST.shp sa2 2016 aust
2
Task 1: Data Integration and Database Generation
Build a database using PostgreSQL that integrates data from the following sources:
1. Sydney neighbourhood dataset (based on provided CSV files with SA2-data from ABS).
2. Spatial data in the SA2 ESRI Shape data file from the ABS at https://www.abs.gov.au/
AUSSTATS/[email protected]/DetailsPage/1270.0.55.001July%202016)
3. Census data for the given neighbourhoods including population count, dwelling and busi-
nesses counts.
4. Bush Fire Prone Land in NSW; Originally from the Rural Fire service, but modified for this
task - you will need to do some transforming of this data
5. You are encouraged to extend and refine both scoring function and source data. For
full points when integrating at least one additional data set.
Milestone 1: Load and integrate the provided and linked datasets into PostgreSQL by the tutorials in Week 11.
Task 2: Fire Risk Analysis
1. Compute the fire risk score for all given neighbourhoods according to the following formula
and definitions (adjust as needed if you integrated any additional datasets):
fire risk = S(z(population density)+z(dwelling & business density)+z(bfpl density)−z(assistive service density))
With S being the logistic function (sigmoid function), and z the z-score (”standard score”) of a
measure - the number of standard deviations from the mean (assuming a normal distribution):
z(measure, x) =
x− avgmeasure
stddevmeasure
Measure Definition Risk Data Source
population density population divided by neighbourhood’s land area + Neighbourhoods.csv
dwelling density number of dwellings divided by neighbourhood land area + Neighbourhoods.csv
business density number of businesses divided by neighbourhood land area + BusinessStats.csv
bfpl density area and category of BFPL divided by neighbourhood land area + RFSNSW BFPL.shp
assistive service density number of assistive services divided by neighbourhood land area - BusinessStats.csv
2. Store the computed measures and scores of each neighbourhood in your database. Create
at least one index which is helpful for data integration or the fire risk score computation.
3. Determine whether there is a correlation between your fire risk score and the median income
and rent of a neighbourhood.
Task 3: Documentation of your Bushfire Risk Analysis
Write a document (Jupyter notebook or Word document or PDF file, no more than 5 pages plus
optional Appendix) in which you document your data integration steps and the main outcomes
of your fire risk data analysis, including the correlation study with the bush fire statistics. Your
document should contain the following:
1. Dataset Description
What are your data sources and how did you obtain and pre-process the data?
2. Database Description
Into which database schema did you integrate your data (preferable shown with a diagram)?
Which index(es) did you create, and why?
3
3. Fire Risk Score Analysis
Show which formula you applied to compute the Fire Risk score per neighbourhood, and
give an overview of fire risk results. This can be done either in text by highlighting some
representative results, or with a graphical representation onto a map (preferred).
4. Correlation Analysis
How well does your score correlate to the affluence of the neighbourhoods? Compare both
the median household incomes and the rental prices of each region.
4
Task 4: DATA2901 Task for Advanced Class Only
1. For teams in the advance class, integration of at least one additional data set is compulsory.
2. One of the additional data sources must come from a web source such as be Web Scraping
or using a Web-API, rather than just a downloadable additional CSV data set.
3. Include in the fire risk analysis some data that was inferred using a machine learning or
natural language processing step. NOTE: This may change depending on the difficulty
of the task, please keep apprised with the clarification thread for details - https://
edstem.org/courses/5592/discussion/462995
General Coding Requirements
1. Solve this assignment with a Python Jupyter notebook in Python and SQL (Adv: also Unix).
2. Use the provided Jupyter and PostgreSQL servers from the tutorials.
3. If you use any extra libraries which are not installed in the labs, disclose in your documenta-
tion which library and what version.
Deliverables and Submission Details
There are four deliverables:
1. source code of the data integration and analysis tasks,
2. a brief report/documentation (up to 5 pages, as of content description above), and a
3. short demo in the labs of Weeks 12 and 13 with the whole team present.
4. Please also provide access to your database with the schema and the processed data.
All deliverables are due in Week 12, no later than 8pm, Friday 28 May 2021. Late submission
penalty: -20% of the awarded marks per day late. The marking rubric is in Canvas.
Please submit the source code and a soft copy of your documentation as a zip or tar file electron-
ically in Canvas, one per each group. Name your zip archive after your Class and group number
X with the following name pattern: data2001 assignment2021s1 Class groupX.zip
Students must retain electronic copies of their submitted assignment files and databases, as the
unit coordinator may request to inspect these files before marking of an assignment is completed. If
these assignment files are not made available to the unit coordinator when requested, the marking
of this assignment may not proceed.
All the best!
Group member participation
This is a group assignment. The mark awarded for your assignment is conditional on you being
able to explain any of your answers to your tutor or the lecturers if asked.
If members of your group do not contribute sufficiently you should alert your tutor as soon as
possible. The tutor has the discretion to scale the group’s mark for each member as follows, based
on the outcome of the group’s demo in Week 12 or 13:
Level of contribution Proportion of final grade received
No participation or no demo. 0%
Passive member, but full understanding of the submitted work. 50%
Minor contributor to the group’s submission. 75%
Major contributor to the group’s submission. 100%
5

欢迎咨询51作业君