Page 1 of 8

DATA1002 Exam preparation. Questions from the 2019 Final exam

You should produce a PDF with your answers, and upload/submit the PDF to the Canvas site for this practice. Question 3(b)(ii) and Question 6(a) ask for a figure; you can produce a diagram illustrating your ideas using the drawing features in your document preparation tool; or, if you prefer, you can sketch by hand, scan the drawing with a phone camera, and include the scanned image in your document. Note that in the actual exam, you will have a limited time of 3 hours to do all the steps, from downloading the questions to submitting a PDF with your answers.

Please make sure that in answering a question, you write the question number along with your answer.

Q1. [30 points] Each part is worth 2 points.

1(a) Explain the meaning of “provenance of a dataset”.

1(b) Write a formula that can be placed in cell A5 of an Excel spreadsheet, so that the value of A5 is equal to the value in A3 when both of the following hold: B3 has a positive value, and C3 is greater than D3; the value of A5 should be 0 in other cases.

1(c) How many bytes are used to store a single character in ASCII encoding?

1(d) Suppose a data scientist needs to communicate results to business managers; describe a goal that the targets are interested in, to which the results should be connected.

1(e) Explain what is meant when we say a logical structure for a dataset is “denormalised”.

1(f) Sometimes, data scientists remove from their dataset any item where some value is missing. Describe one way in which doing so can lead to poor results.

1(g) When is access control described as fine-grained?

1(h) Give a formula (involving adding or subtracting various powers of 2) for the numeric value whose representation in 2-byte two’s complement signed binary is 0xF24A.

1(i) Define the “lie factor” (a term used by Tufte for charts).

1(j) Give an algorithm that can be used for regression, that is not linear regression.
1(k) What type of attribute is predicted in a classification task?

1(l) What is an algorithm used for a clustering task?

1(m) Explain the meaning of collaborative filtering.

1(n) When is a merge performed in a version control system?

1(o) What is meant when we say that a classifier has the “equalized odds” fairness property?

Q2 [15 points]

You have been given a text file products.csv containing lines of comma-separated data about some household products. The first few lines look like this (note that the first line is a header, and also note that the fields do not themselves contain any commas):

    prodID,makerName,energyScore,category
    37089,Artem,3,dishwasher
    47115,Goldrod,2,fridge
    51092,4Star,1,dryer
    53490,FASTAr,3,fridge

2(a) [5 points] We would like to find the prodID and makerName, for each product in the file whose energyScore is less than 4. Write a Python program that accesses products.csv and prints out the desired information. You do not need to deal with misformatted files or other errors. You are allowed to use a library like Pandas, but this is not required.

2(b) [10 points] We would like to find, for each combination of a maker whose name starts with “fa”, and an energyScore, how many products there are (that have the given energyScore and are produced by the particular maker) and which have category “dishwasher”. You should ignore case in maker names; so “FASTAr” and “Fastar” and “fastar” are all considered the same maker. Write a Python program that accesses products.csv and prints out the desired information. You do not need to deal with misformatted files or other errors. You are allowed to use a library like Pandas, but this is not required.

Q3 [15 points]

Here is a data set with information about universities in the state of Victoria.

    University   Category   Fulltime Employment   Educational experience
    Deakin       Regional   77.9                  82.2
    LaTrobe      City       66.8                  76.1
    Monash       Go8        77.3                  77
    RMIT         City       73.8                  77
    Melbourne    Go8        77.5                  73.9
    Victoria     City       66.7                  74.6

Fred Foolish has produced the following chart, displaying the data from this table as a line chart with two data series. The goal of the visualisation is to provide insight into the relationship between Fulltime Employment and Educational experience, in particular, how this relationship might vary between university categories. Fred has chosen to encode Fulltime Employment (which is a quantitative or numeric attribute) as the position on the y-axis of the solid line, and Educational experience (another quantitative or numeric attribute) as the position on the y-axis of the dashed line; the nominal/categorical University is encoded on the x-axis, and the university Category (another nominal/categorical attribute) is shown as an extra piece of text above the University name on the x-axis.

[Figure: Fred's line chart. The y-axis runs from 0 to 90. The x-axis lists the universities Deakin, LaTrobe, Monash, RMIT, Melbourne, Victoria, with the Category (Regional, City, Go8, City, Go8, City) printed above each name. Legend: Fulltime Employment, Educational experience.]

3(a) [4 points] Identify some aspects of Fred’s visualisation that do not make it easy for a reader to gain insight into the relationship between Fulltime Employment and Educational experience, in particular, how this relationship might vary between Categories.

3(b) [11 points] Propose a different visualisation of some of the data from the table above, that will provide better insight into the relationship between Fulltime Employment and Educational experience, in particular, how this relationship might vary between Categories.
You should structure your answer in three subparts in the spaces provided below: (i) describe which attributes will be shown and how each of these attributes will be encoded, (ii) sketch how the visualisation will look (you do not need to place the marks in exact positions for the given data), and (iii) explain why your proposal conveys the important information more clearly than using Fred’s encoding.

3(b)(i) Your encoding (3 points):

3(b)(ii) Sketch of your proposed visualisation (4 points):

3(b)(iii) Explain advantages of your proposed visualisation (4 points):

Q4 [10 points]

4(a) [5 points] A data scientist needs to follow data management policies that are set by their organization and/or their client. Give one example of a data management policy that might apply to a data science project, and explain why this policy can be important. Also, describe a mechanism that can be used to help the data scientist in following this policy.

4(b) [5 points] One important piece of metadata about a dataset is the description of the data format in which the data is stored. Give one example of some information that could be kept in two different formats and describe these formats. Also, explain one way in which this metadata about the data format can be recorded.

Q5 [10 points]

In Stage 3 of the group project, you produced Python code to analyse a dataset, and to produce a predictive model for some aspect of the dataset. Write a description aimed at an employer you would like to work for, about the code you wrote, and how this demonstrates skills that will be useful in other situations the employer might want you to work on.

Q6 [20 points]

6(a) [8 points] Trace the execution of the following Python code (including diagrams with the state of the notional machine after each line of code is executed), and also write the output that will be printed when this is run.

    def myminus(x, y):
        total = x - y
        print("In myminus, x =", x)
        print("In myminus, y =", y)
        print("In myminus, total =", total)
        return total

    x=5
    y=6
    value = 17
    total = 10
    y = myminus(value + 1, total)
    print("x =", x)
    print("y =", y)
    print("value =", value)
    print("total =", total)

6(b) [6 points] Explain in English the purpose of the following Pandas code, and explain how each of the operations is performed, to achieve this purpose. Also show what is printed when this code is executed.

    import pandas as pd
    dict = {"Category":\
            {"Deakin":"Regional","LaTrobe":"City","Monash":"Go8",\
             "RMIT":"City", "Melbourne":"Go8","Victoria":"City"},\
            "Fulltime Employment":\
            {"Deakin":77.9,"LaTrobe":66.8,"Monash":77.3,"RMIT":73.8,\
             "Melbourne":77.5,"Victoria":66.7},\
            "Educational experience": \
            {"Deakin":82.2,"LaTrobe":76.1,"Monash":77.0,"RMIT":77.0,\
             "Melbourne":73.9,"Victoria":74.6}}
    df = pd.DataFrame(dict)
    df1 = df[df["Fulltime Employment"] > 70.0]
    df2 = df1["Educational experience"]
    df3 = df2.max()
    print(df3)

6(c) [6 points] Write an explanation, for a potential student considering studying DATA1002, of the concept of Excel’s pivot table, and why it is worth learning how to create pivot tables.
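
As a study aid for 6(c), the core idea behind a pivot table (group the rows by one attribute, then summarise the other attributes within each group) can be sketched in plain Python using the Q3 university table. This is only an illustration, not a model answer and not how Excel implements the feature; the names ROWS, pivot_mean and summary are hypothetical, and the choice of the mean as the summary function is an assumption made for the example.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the Q3 table:
# (University, Category, Fulltime Employment, Educational experience)
ROWS = [
    ("Deakin", "Regional", 77.9, 82.2),
    ("LaTrobe", "City", 66.8, 76.1),
    ("Monash", "Go8", 77.3, 77.0),
    ("RMIT", "City", 73.8, 77.0),
    ("Melbourne", "Go8", 77.5, 73.9),
    ("Victoria", "City", 66.7, 74.6),
]


def pivot_mean(rows):
    """Group rows by Category and average both numeric columns.

    Hypothetical helper: it mimics a pivot table that has Category as
    the row field and the mean as the summary function.
    """
    groups = defaultdict(list)
    for _university, category, employment, experience in rows:
        groups[category].append((employment, experience))
    return {
        category: (
            mean(emp for emp, _ in values),
            mean(exp for _, exp in values),
        )
        for category, values in groups.items()
    }


summary = pivot_mean(ROWS)

# Print one summary row per Category, as a pivot table would display it.
print(f"{'Category':<10}{'Employment':>12}{'Experience':>12}")
for category in sorted(summary):
    employment, experience = summary[category]
    print(f"{category:<10}{employment:>12.2f}{experience:>12.2f}")
```

In a spreadsheet, the same summary would come from dragging Category into the row area of a pivot table and the two numeric columns into the values area with Average selected, with no code required.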