DATA1002/1902 (Sem2, 2019) Project Stage 1 Due: 11:59pm on Friday October 11, 2019 (week 9) Value: 5% of the unit This assignment is done in groups of up to 4 students (we expect 3 or 4 students in most groups, but it may happen that sometimes a group is smaller, eg if there are not enough students in a lab). We recommend that all students in a group be attending the same lab session, so you can work together more easily. The group must all be enrolled in the same unit (either all from DATA1002, or all from DATA1902).
Group formation procedure: In the week 6 lab, you should form a group. In choosing who you want to work with, we suggest that you aim to be able to agree on the domain of the data you will work with (eg finance, biology, meteorology, sociology, literature, etc.). If possible, also make sure that you have some common free times so you can get together to work on the project. Another goal is diversity in skills (eg someone good at coding, someone good at writing). Finally, it is very important to be clear with one another about how much time and effort you will devote to the work (you don’t have to all give the same effort, as long as everyone knows that each can be relied on to do what they commit to). If necessary, the tutor may rearrange group membership. The group members should report all their unikeys to the tutor, and unit staff will then join them as members of an official group on Canvas. If, during the course of the assignment work, there is a dispute among group members that you can’t resolve, or that will impact your group’s capacity to complete the task well, you need to inform the unit coordinator, [email protected]. Make sure that your email names the group, and is explicit about the difficulty; also make sure this email is copied to all the members of the group. We need to know about problems in time to help fix them, so set early deadlines for group members, and deal with non-performance promptly (don’t wait till a few days before the work is due to complain that someone is not delivering on their tasks). If necessary, the coordinator will split a group, and leave anyone who didn’t participate effectively in a group by themselves (they will need to achieve all the outcomes on their own).
The project work for this stage: You need to obtain a data set. This may
be any data that interests you. We prefer that you use publicly available
data (so we can check your work if we need to) but it is OK for you to
work on privately-owned data as long as you have permission to use it,
and permission to reveal it to the markers. As you will see in the marking
scheme, if you aim for higher marks, then you should make sure that the
data is sufficiently large that automated processing shows genuine
benefits, and that it is produced by combining data from at least two
different sources.
You are then to ensure high-quality data that can be usefully analysed; we
expect you to write Python code that does whatever transforming and
cleaning is appropriate. The details of this aspect vary a lot, depending
on the data you obtained. For example, you might have several CSV files,
or alternatively you may have a JSON file; the work needed may be
removing instances that have corrupted or missing values, or correcting
obvious spelling mistakes, etc. In any case, you are required to get the
data fairly clean; for some data sets, you will need to clean the data, while
for others that were carefully curated before you got them, you should at least
check that the data is clean.
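As an illustration only (the column names, file contents and spelling fixes here are invented, not part of the assignment), a minimal cleaning pass over a CSV file might drop rows with missing values and normalise known misspellings:

```python
import csv
import io

# A tiny invented CSV: one misspelled city, one missing temperature.
raw = io.StringIO("city,temp\nSydney,22\nSydny,25\nMelbourne,\n")

FIXES = {"Sydny": "Sydney"}  # hand-collected spelling corrections

cleaned = []
for row in csv.DictReader(raw):
    if not row["temp"]:  # drop instances with a missing value
        continue
    # replace a known misspelling with the correct name
    row["city"] = FIXES.get(row["city"], row["city"])
    cleaned.append(row)

print(cleaned)  # two rows remain, both with city "Sydney"
```

Your own cleaning code will look quite different, depending on the quality problems your particular data actually has.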
Finally, we ask you to show some very simple analysis that reports on
some aggregate summaries. This is not intended to be a detailed
exploration of the data (that will come in Stage Two), but simply a
demonstration that the data is now in a form where you can work with it,
and that you have the required skills in Python coding.
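A sketch of the kind of aggregate summary meant here (the records and attribute names are invented for illustration):

```python
# A few cleaned records, as a list of dictionaries.
records = [
    {"city": "Sydney", "temp": 22},
    {"city": "Sydney", "temp": 25},
    {"city": "Melbourne", "temp": 18},
]

# Simple aggregates over one numeric attribute, plus a distinct count.
temps = [r["temp"] for r in records]
print("max temperature:", max(temps))
print("average temperature:", sum(temps) / len(temps))
print("distinct cities:", len({r["city"] for r in records}))
```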
During the project, you need to manage the work among the group
members. We advise that you do NOT allocate a separate job to each
person. That is, don’t get one member to find the data, another to clean it,
another to analyse it. This would mean that work is badly spread through
the time period for each person, and also it makes the outcome very
vulnerable if one member is slow or doesn’t do a good job, because each
job depends on the previous ones. Instead, we recommend that every
person do each activity, and that you compare regularly and take
whichever is better (or even, find a way to combine the good features of
each). So, each member should hunt for a dataset, and then everyone
looks at all the datasets found, and either chooses the dataset that has the most
potential, or even combines several datasets together. Similarly, each
member should try to clean the data, and then see who found what issues,
and produce a dataset that has all the aspects clean at once. Note that this
project stage is not a huge amount of work; it can all easily be done by
one person.
What to submit, and how: There are four deliverables in this Stage of
the Project. All four should be submitted by one person, on behalf of the
whole group. The marks will be associated with the report.
• Submit a written report on your work, in pdf. This should be submitted
through Turnitin, via the link in the Canvas site. The report should be
targeted at a tutor or lecturer whose goal is to see what you did, so they
can allocate a mark. The report should have a three-section structure that
corresponds to the marking scheme: a section that describes the data
source(s), the format/contents of the data, the rights associated with the
data; a section that describes the initial transformation and cleaning that
you did (include here the parts of Python code that you used, or a
description that is detailed enough to be followed); and a section that
describes and explains some simple analysis that you have done (again,
show the code and also the output of the analysis). 
There is no required
minimum or maximum length for the report; write whatever is needed to
show the reader that you have earned the marks, and don’t say more than
that!
• Submit a copy of the raw data as you obtained it. This should be
submitted through the Canvas system, as a single file (if you got multiple
files from your sources, you need to compress them into a single file for
submission).

• Submit a copy of the cleaned and transformed data set. This should be
submitted through the Canvas system, as a single file. 

• Submit a copy of the processing Python code you wrote for cleaning
and analysis. This should be submitted through the Canvas system, as a
single file. 


Marking: Here is the mark scheme for this assignment. Note that all
members of the group receive the same score.

The marking of each of the components will depend on the volume and
diversity of data you have. For volume, we will consider the number of
“values”: for the most common case, rectangular data eg CSV, the
contents of a field for an item would be a value. So if you have 100 rows
of data, each with 5 attributes, that would be 500 values. For JSON data,
the keys don’t count, and the values count based on their atomic (string,
number, etc.) components: so if one attribute’s value somewhere is a list
of 5 numbers, that counts as 5 values; if it is a dictionary with 7 keys,
each associated to a string, that counts as 7 values. For diversity, what
matters is whether there are truly independent sources of the data. If you
get several data sets, but they are all from the Australian census, that only
counts as one source; similarly, if you get datasets that are all from the
World Bank, they are considered only one source. But if you get some
data from Australian census, and some from US Census, that counts as
two sources. In each component of the marking, the score you can get is
capped depending on the volume and diversity.
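The JSON counting rule above (keys don't count, atomic components do) can be sketched as a short recursive function; the example record at the end is invented:

```python
import json

def count_values(obj):
    """Count atomic values (strings, numbers, booleans, null) in parsed JSON.
    Dictionary keys are not counted; only their associated values are."""
    if isinstance(obj, dict):
        return sum(count_values(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(count_values(v) for v in obj)
    return 1  # an atomic component

record = json.loads(
    '{"name": "Alice", "scores": [5, 7, 9],'
    ' "address": {"city": "Sydney", "postcode": "2006"}}'
)
print(count_values(record))  # 1 (name) + 3 (scores) + 2 (address) = 6
```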
• To gain a Pass mark in any component, your data must have at
least 100 values (we say this is a “simple” data set). To gain a
Distinction level mark in any component, you must have at least 500
values, and they must come from at least two independent sources (a
“medium” dataset). To gain full marks in any component, you must
have used at least 3 sources where there are at least 1000 values from
each of these sources (a “complex” dataset). To be considered for full
marks, there must be a real challenge in relating the data values in the
three sets. It is not enough to simply take datasets that use the same
definitions of attributes etc., nor is it OK just to use unrelated data, where
there is no connection made across the information.
• There is 1 mark for the work on obtaining a dataset (as described in
Section 1 of the report, and as evidenced in the submitted raw data set). A
pass (adequate) score indicates that you have at least a simple dataset
with genuine data, that you have clearly shown where you obtained the
data, that you have described the contents of the dataset (explaining
clearly both the format, and the meaning of the various aspects). A
distinction level score (good work) is awarded if, in addition to the above,
your dataset is at least medium scale, your description shows clearly that
you have appropriate rights to use the data in the ways that you do use it,
and your explanation shows sensible reflection on the strengths and
limitations of the data that you obtained. Full marks (excellent work)
indicates that you have achieved all the distinction-level requirements and
in addition, that your data set is complex.
• There are 2 marks for the work on transforming and cleaning the data
set to support later processing in the tool of your choice (as described in
Section 2 of the report, and as evidenced in the changes between the raw
data set and the cleaned data set). A pass score indicates that you have
produced a version of a simple dataset that is able to be used for analysis
(it may still have data quality problems, but not so much as to prevent
analysis). At least one aspect of data quality must have been checked and
(if there is some problem) it has been handled, but at this level, the
checking and handling could be done by manual inspection rather than by
code. A distinction score indicates that you have passed and also that the
data is at least medium scale, and that you have carefully examined the
source data set for data quality and format difficulties, and that you have
code that automatically checks for, and deals with, several of these issues.
Full marks are awarded if, in addition, you have been able to effectively
and automatically integrate the data from a complex dataset (have code
that transforms related data from the different sources into common
formats and conventions, so the connections can be used in your analysis).
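To give a feel for what "common formats and conventions" means, here is a minimal sketch with invented sources: one names states in full, the other abbreviates them, and a hand-built mapping normalises them so the records can be joined.

```python
import csv
import io

# Two hypothetical sources with different naming conventions.
source_a = io.StringIO("state,population\nNew South Wales,8166000\nVictoria,6681000\n")
source_b = io.StringIO("state,rainfall_mm\nNSW,554\nVIC,649\n")

# Map source B's abbreviations onto source A's convention.
ABBREV = {"NSW": "New South Wales", "VIC": "Victoria"}

population = {row["state"]: int(row["population"]) for row in csv.DictReader(source_a)}
rainfall = {ABBREV[row["state"]]: int(row["rainfall_mm"]) for row in csv.DictReader(source_b)}

# Join the two sources on the normalised state name.
combined = {s: (population[s], rainfall[s]) for s in population if s in rainfall}
print(combined)
```

Real integration work is harder than this (more conventions to reconcile, partial matches, conflicting values), which is why it is reserved for full marks.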
• There are 2 marks for the simple analysis work (as described in Section
3 of the report, and evidenced in the submitted code). A pass score is
awarded if you have written Python code which runs on the dataset, and
correctly reports on at least one suitable summary statistic (such as the
highest value, or the number of different values) for one attribute of the
dataset. A distinction score is given if your code gives at least one useful
aggregate for each of the attributes in the data (which must be at least
medium scale), and furthermore, the summaries include several different
kinds (eg one may be the max, another the average, yet another the
number of distinct values). Full marks would be awarded if, in addition
to the above, your code gives statistics that break down the data from a
complex dataset in sensible ways (eg, if the data contains a state attribute,
it reports the summaries for data from each state separately, as well as the
overall summaries).
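The state-by-state breakdown mentioned above could be sketched like this (records invented for illustration):

```python
from collections import defaultdict

# Invented records, each with a state attribute and a numeric value.
records = [
    {"state": "NSW", "value": 10},
    {"state": "NSW", "value": 30},
    {"state": "VIC", "value": 20},
]

# Overall summary.
print("overall average:", sum(r["value"] for r in records) / len(records))

# The same summary, broken down per state.
by_state = defaultdict(list)
for r in records:
    by_state[r["state"]].append(r["value"])
for state, values in sorted(by_state.items()):
    print(state, "average:", sum(values) / len(values))
```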

Late work: As announced in CUSP, late work (without approved special
consideration or arrangements) suffers a penalty of 5% of the available
marks (that is, 0.25 marks), for each calendar day after the due date. No
late work will be accepted more than 10 calendar days after the due date.
If this stage is missed or badly done, the group can be given a clean data
set, for a domain chosen by the instructor, to use in the rest of the project.