DATA1002 Exam preparation.
Questions from the 2019 Final exam
You should produce a PDF with your answers, and upload/submit the PDF to the
Canvas site for this practice. Question 3(b)(ii) and Question 6(a) ask for a figure; you can
produce a diagram illustrating your ideas using drawing features in your document
preparation tool; or, if you prefer, you can sketch by hand, take a scan of the drawing
with a phone camera, and include the scanned image in your document.
Note that in the actual exam, you will have a limited time of 3 hours to do all the steps,
from downloading the questions to submitting a PDF with your answers.
Please make sure that in answering a question, you write the question number, along
with your answer.

Q1. [30 points]
Each part is worth 2 points.
1(a) Explain the meaning of “provenance of a dataset”
1(b) Write a formula that can be placed in cell A5 of an Excel spreadsheet, so that the
value of A5 is equal to the value in A3 when both of the following hold: B3 has a positive
value, and C3 is greater than D3; the value of A5 should be 0 in all other cases.
1(c) How many bytes are used to store a single character in ASCII encoding?
1(d) Suppose a data scientist needs to communicate results to business managers;
describe a goal that this target audience is interested in, to which the results should be
connected.
1(e) Explain what is meant when we say a logical structure for a dataset is
“denormalised”.
1(f) Sometimes, data scientists remove from their dataset any item where some value is
missing. Describe one way in which doing so can lead to poor results.
1(g) When is access control described as fine-grained?
1(h) Give a formula (involving adding or subtracting various powers of 2) for the numeric
value whose representation in 2-byte two’s complement signed binary is 0xF24A
1(i) Define the “lie factor” (a term used by Tufte for charts)
1(j) Give an algorithm that can be used for regression, that is not linear regression.
1(k) What type of attribute is predicted in a classification task?

1(l) What is an algorithm used for a clustering task?
1(m) Explain the meaning of collaborative filtering
1(n) When is a merge performed in a version control system?
1(o) What is meant when we say that a classifier has the “equalized odds” fairness
property?
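
As a study aid for 1(h), the following is a minimal Python sketch for checking a hand-written sum of powers of 2. It assumes the usual two's complement rule that the top bit of a 16-bit pattern carries weight -2^15 and every other bit keeps its positive weight. The function names twos_complement_value and power_terms are illustrative only and are not part of the exam.

```python
def twos_complement_value(bits, width=16):
    """Interpret an unsigned bit pattern as a two's complement signed integer."""
    sign_bit = 1 << (width - 1)
    # Low bits keep their positive weight; the top bit counts as -2**(width-1).
    return (bits & (sign_bit - 1)) - (bits & sign_bit)


def power_terms(bits, width=16):
    """List the signed powers of 2 whose sum is the two's complement value."""
    terms = []
    for k in range(width - 1, -1, -1):
        if bits & (1 << k):
            weight = 1 << k
            terms.append(-weight if k == width - 1 else weight)
    return terms


pattern = 0xF24A  # the 2-byte pattern from question 1(h)
terms = power_terms(pattern)
print(terms)
print(sum(terms), twos_complement_value(pattern))
```

The two printed totals should agree; comparing the list of terms against your own formula shows which bit, if any, was misread.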

Q2 [15 points]
You have been given a text file products.csv containing lines of comma-separated
data about some household products. The first few lines look like this (note that the first
line is a header, and also note that the fields do not themselves contain any commas):
prodID,makerName,energyScore,category
37089,Artem,3,dishwasher
47115,Goldrod,2,fridge
51092,4Star,1,dryer
53490,FASTAr,3,fridge
2(a) [5 points] We would like to find the prodID and makerName, for each product in the
file whose energyScore is less than 4. Write a Python program that accesses
products.csv and prints out the desired information. You do not need to deal with
misformatted files or other errors. You are allowed to use a library like Pandas, but this is
not required.
2(b) [10 points] We would like to find, for each combination of a maker whose name
starts with “fa”, and an energyScore, how many products there are (that have the given
energyScore and are produced by the particular maker) and which have category
“dishwasher”. You should ignore case in maker names; so “FASTAr” and “Fastar” and
“fastar” are all considered the same maker. Write a Python program that accesses
products.csv and prints out the desired information. You do not need to deal with
misformatted files or other errors. You are allowed to use a library like Pandas, but this is
not required.
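
One possible approach to 2(a) and 2(b), using only the standard library, is sketched below. It is an illustration under stated assumptions, not a model answer. The function names are hypothetical, the sample text stands in for products.csv so the sketch runs without the file, and the last three sample rows are invented purely so that 2(b) has something to count.

```python
import csv
import io
from collections import Counter

# The first five lines are the sample shown in the question. The last three
# rows are hypothetical, added only so that part 2(b) produces some output.
SAMPLE = """prodID,makerName,energyScore,category
37089,Artem,3,dishwasher
47115,Goldrod,2,fridge
51092,4Star,1,dryer
53490,FASTAr,3,fridge
60001,Fastar,3,dishwasher
60002,FASTAr,3,dishwasher
60003,fastar,2,dishwasher
"""


def low_energy_products(lines, limit=4):
    """2(a): (prodID, makerName) pairs for rows with energyScore below limit."""
    return [(row["prodID"], row["makerName"])
            for row in csv.DictReader(lines)
            if int(row["energyScore"]) < limit]


def dishwasher_counts(lines, prefix="fa"):
    """2(b): count dishwashers per (lower-cased maker, energyScore) pair."""
    counts = Counter()
    for row in csv.DictReader(lines):
        maker = row["makerName"].lower()  # ignore case in maker names
        if maker.startswith(prefix) and row["category"] == "dishwasher":
            counts[(maker, int(row["energyScore"]))] += 1
    return counts


# For the real file, pass open("products.csv", newline="") instead of the
# in-memory sample used here.
for prod_id, maker in low_energy_products(io.StringIO(SAMPLE)):
    print(prod_id, maker)

for (maker, score), n in sorted(dishwasher_counts(io.StringIO(SAMPLE)).items()):
    print(maker, score, n)
```

Lower-casing the maker name before both the prefix test and the grouping is what makes “FASTAr”, “Fastar” and “fastar” count as a single maker. A Pandas version would follow the same steps with a boolean filter followed by a groupby.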

Q3 [15 points]
Here is a data set with information about universities in the state of Victoria.
University   Category   Fulltime Employment   Educational experience
Deakin       Regional   77.9                  82.2
LaTrobe      City       66.8                  76.1
Monash       Go8        77.3                  77
RMIT         City       73.8                  77
Melbourne    Go8        77.5                  73.9
Victoria     City       66.7                  74.6

Fred Foolish has produced the following chart, displaying the data from this table as a
line chart with two data series. The goal of the visualisation is to provide insight into the
relationship between fulltime employment and educational experience, in particular,
how this relationship might vary between university categories. Fred has chosen to
encode Fulltime Employment (a quantitative or numeric attribute) as the position on
the y-axis of the solid line, and Educational experience (another quantitative or numeric
attribute) as the position on the y-axis of the dashed line. The nominal/categorical
University is encoded on the x-axis, and the university Category (another
nominal/categorical attribute) is shown as an extra piece of text above the University
name on the x-axis.

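[Figure: line chart. The y-axis runs from 0 to 90 in steps of 10. The x-axis lists Deakin,
LaTrobe, Monash, RMIT, Melbourne and Victoria, with the Category (Regional, City, Go8,
City, Go8, City) shown above each name. Legend: Fulltime Employment, Educational
experience.]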

3(a) [4 points] Identify some aspects of Fred’s visualisation that do not make it easy for a
reader to gain insight into the relationship between Fulltime Employment and
Educational experience, in particular, how this relationship might vary between
Categories.
3(b) [11 points] Propose a different visualisation of some of the data from the table
above, that will provide better insight into the relationship between Fulltime
Employment and Educational experience, in particular, how this relationship might vary
between Categories. You should structure your answer in three subparts in the spaces
provided below: (i) describe which attributes will be shown and how each of these
attributes will be encoded, (ii) sketch how the visualisation will look (you do not need to
place the marks in exact positions for the given data), and (iii) explain why your proposal
conveys the important information more clearly than using Fred’s encoding.
3(b)(i) Your encoding (3 points):
3(b)(ii) Sketch of your proposed visualisation (4 points):
3(b)(iii) Explain advantages of your proposed visualisation (4 points):

Q4 [10 points]
4(a) [5 points] A data scientist needs to follow data management policies that are set by
their organization and/or their client. Give one example of a data management policy
that might apply to a data science project, and explain why this policy can be important.
Also, describe a mechanism that can be used to help the data scientist in following this
policy.
4(b) [5 points] One important piece of metadata about a dataset is the description of
the data format in which the data is stored. Give one example of some information that
could be kept in two different formats and describe these formats. Also, explain one way
in which this metadata about the data format can be recorded.

Q5 [10 points]
In Stage 3 of the group project, you produced Python code to analyse a dataset, and to
produce a predictive model for some aspect of the dataset. Write a description aimed at
an employer you would like to work for, about the code you wrote, and how this
demonstrates skills that will be useful in other situations the employer might want you
to work on.

Q6 [20 points]
6(a) [8 points] Trace the execution of the following Python code (including diagrams
with the state of the notional machine after each line of code is executed), and also
write the output that will be printed when this is run.
def myminus(x, y):
    total = x - y
    print("In myminus, x =", x)
    print("In myminus, y =", y)
    print("In myminus, total =", total)
    return total

x = 5
y = 6
value = 17
total = 10
y = myminus(value + 1, total)
print("x =", x)
print("y =", y)
print("value =", value)
print("total =", total)

6(b) [6 points] Explain in English the purpose of the following Pandas code, and explain
how each of the operations is performed, to achieve this purpose. Also show what is
printed when this code is executed.
import pandas as pd
dict = {"Category":\
{"Deakin":"Regional","LaTrobe":"City","Monash":"Go8",\
"RMIT":"City", "Melbourne":"Go8","Victoria":"City"},\
"Fulltime Employment":\
{"Deakin":77.9,"LaTrobe":66.8,"Monash":77.3,"RMIT":73.8,\
"Melbourne":77.5,"Victoria":66.7},\
"Educational experience": \
{"Deakin":82.2,"LaTrobe":76.1,"Monash":77.0,"RMIT":77.0,\
"Melbourne":73.9,"Victoria":74.6}}
df = pd.DataFrame(dict)
df1 = df[df["Fulltime Employment"] > 70.0]
df2 = df1["Educational experience"]
df3 = df2.max()
print(df3)

6(c) [6 points] Write an explanation, for a potential student considering studying
DATA1002, of the concept of Excel’s pivot table, and why it is worth learning how to
create pivot tables.
