
Section A

ST2195 Programming for Data Science Sample exam paper 2021-2022

1. Consider the following objects in Python: i=(2,3), j=[2,3] and k={2,3}. State whether each of the following operations is possible (in Python). Justify your answers in one sentence. There is at least one correct statement, and negative marks apply for wrong choices.

(a) i[1]=4

(b) print(j+1)

(c) k[0]=1

(d) j[1]=4

• Marks: 6

• Answer:

(a) Not possible. A tuple cannot be changed.

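(b) Not possible. The + operator on a list means concatenation and requires another list, so adding the integer 1 raises a TypeError.

(c) Not possible. A set is unordered and does not support indexing.

(d) Possible. A list is mutable, so its elements can be reassigned.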
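The four answers can be checked directly in an interpreter. The snippet below is a minimal sketch; the helper names attempt and assign are hypothetical and exist only to catch the errors and report them.

```python
i = (2, 3)   # tuple: immutable
j = [2, 3]   # list: mutable
k = {2, 3}   # set: mutable, but unordered and without indexing


def assign(container, index, value):
    """Item assignment wrapped in a function so it can be passed around."""
    container[index] = value


def attempt(label, operation):
    """Run the operation and report whether Python allows it."""
    try:
        operation()
        return f"{label}: possible"
    except TypeError as err:
        return f"{label}: not possible ({err})"


results = [
    attempt("(a) i[1]=4", lambda: assign(i, 1, 4)),
    attempt("(b) print(j+1)", lambda: print(j + 1)),
    attempt("(c) k[0]=1", lambda: assign(k, 0, 1)),
    attempt("(d) j[1]=4", lambda: assign(j, 1, 4)),
]
for line in results:
    print(line)
```

Only (d) succeeds, and afterwards j holds [2, 4]; the other three raise a TypeError whose message names the unsupported operation.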

2. In which of the circumstances below do ridgeline plots provide the most appropriate choice? Provide justification for your answer in no more than two sentences.

(a) When we want to study the empirical density of a variable.

(b) When we want to compare frequencies of one variable across different categories of another variable.

(c) When we want to monitor changes in the distribution of a variable across different categories of another variable.

(d) When we want to explore the association between two continuous variables.

• Marks: 6

• Answer:

(c) Appropriate. It provides the empirical density of the continuous variable across the categories of the other variable.

3. Which of the statements below is correct? Provide justification for your answer.

(a) When training a machine learning pipeline the main aim is to achieve a high training error.

(b) When training a machine learning pipeline the main aim is to achieve a moderate training error.

(c) When training a machine learning pipeline the main aim is to achieve a low training error.

(d) When training a machine learning pipeline achieving a low training error may not be the primary aim.

• Marks: 6

• Answer:

(d) The primary aim is to achieve a low test error, ideally (but not necessarily) with a low training error as well.
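The gap between training and test error can be seen in a small simulation. This is a sketch under assumptions that are not part of the paper: the data are simulated as y = 2x plus Gaussian noise, and the model is a k-nearest-neighbour average, chosen because k = 1 memorises the training set.

```python
import random

random.seed(0)  # arbitrary seed, used only for reproducibility


def make_data(n):
    """Simulate y = 2x + noise; the model and noise level are assumptions."""
    data = []
    for _ in range(n):
        x = random.uniform(0, 10)
        data.append((x, 2 * x + random.gauss(0, 3)))
    return data


def predict_knn(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN predictor on the given data."""
    return sum((y - predict_knn(train, x, k)) ** 2 for x, y in data) / len(data)


train, test = make_data(50), make_data(50)
for k in (1, 10):
    print(f"k={k}: training MSE={mse(train, train, k):.2f}, "
          f"test MSE={mse(train, test, k):.2f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is exactly zero while the test error is not. A larger k gives up that perfect training fit and will typically, though not always, predict unseen data better.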

4. Which of the following statements are correct? There is at least one correct statement, and negative marks apply for wrong choices.

(a) An IDE is an alternative operating system to Microsoft Windows.

(b) An IDE typically provides a source-code editor.

(c) Some source-code editors provide auto-completion of code and syntax highlighting.

(d) There are only 4 source-code editors for R and 3 for Python.

(e) A source-code editor for Python cannot be used for R.

(f) An IDE is necessary for writing code.

• Marks: 6

• Answer:

(b), (c)

5. Which of the following statements are correct? There is at least one correct statement, and negative marks apply for wrong choices.

(a) Jupyter notebooks cannot handle Python code.

(b) R Markdown is an authoring framework that combines Markdown with R

(c) R Markdown files cannot be opened without installing R first.

(d) Jupyter Notebooks are open-source web-browser based applications.

(e) Jupyter notebooks were named after the first names of its creators, Julia and Peter.

(f) R Markdown files can be converted in a variety of formats including HTML, PDF, and Microsoft Word documents.

• Marks: 6

• Answer: (b), (d), (f)

6. State which language (R or Python) each of the following code chunks comes from:

C1. vec = c(1, 4, 7)

C2. paste("Hello", "world")

C3. import numpy

C4. library("mlr")

C5. phrase = "Hello world"; print(phrase.lower())

C6. vec = (1, 4, 7)

C7. list("a", 5, 1:3)

C8. ["a", 5, (1, 2, 3)]

C9.


      if (mark >= 50)

          print("pass")

C10.

      if mark >= 50:

          print("pass")

C11. plot(1:5, 2:6)

C12. df = pandas.read_csv(fdate + '.csv')

C13. df = read.csv(paste0(fdate, ".csv"))

C14. head(fd)

C15. df.head()

C16. plt.subplots()

C17. ggplot(df, aes(x = x)) + geom_histogram()

C18. write.table(df, file = "df.csv")

C19. apply(df, 2, sum)

C20. df[-c(1, 3, 4), ]

• Marks: 10

• Answer:

R: C1, C2, C4, C7, C9, C11, C13, C14, C17, C18, C19, C20

Python: C3, C5, C6, C8, C10, C12, C15, C16

Section B

1. For each of the following statements about R, state if they are always correct or not. Provide justification for your answer of no more than two sentences.

(a) A list is also a data frame.

(b) A data frame is also a list.

(c) A data frame can contain data objects of different types such as vectors and matrices.

(d) Data frames can contain lists.

(e) We can select the elements of both lists and data frames using their names.

• Marks: 10

• Answer:

(a) Not necessarily. A list can contain vectors, matrices, data frames and even other lists. This is not the case for data frames.

(b) Correct. It can be viewed as a list of vectors of potentially different types.

(c) Incorrect. It can only contain vectors of different types.

(d) Incorrect. In fact, lists can contain data frames.

(e) Correct. This can be done using the $ sign.

2. For each of the following statements about R, state if they are always correct or not. Provide justification for your answer of no more than two sentences.

(a) The rows of a table in a relational database represent records.

(b) An attribute in a relational database is a tuple of rows.

(c) SQLite uses a separate server process to operate.

(d) The SQL query

  SELECT employee_id, salary, department

  FROM employee

  WHERE employee_id >= 102 AND salary >= 100

  ORDER BY salary

returns all available records and attributes from the table employee that have employee_id greater or equal to 102 and salary greater or equal to 100, ordered in increasing salary.

(e) The following R code chunk

  inner_join(employee, company, by = "sector") %>%

    filter(department == "HR")

finds all records in tables employee and company that have matching values of sector, and returns only those records where department is “HR”.

• Marks: 10

• Answer:

(a) Correct.

(b) Incorrect. An attribute in a relational database is a column of a table.

(c) Incorrect. SQLite, unlike most other RDBMSs, does not require a separate server process to operate.

(d) Incorrect. The statement returns only the attributes employee_id, salary and department, not all attributes, for the records that satisfy the stated conditions.

(e) Correct.
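Answers (c) and (d) can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch: the name column and the four records are invented for illustration, and only the query itself is taken from the question.

```python
import sqlite3

# An in-memory SQLite database: it runs inside this Python process,
# with no separate server to start or connect to.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee "
    "(employee_id INTEGER, name TEXT, salary REAL, department TEXT)"
)

# Hypothetical records, invented for illustration.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (101, "Ann", 150.0, "HR"),
        (102, "Bob", 90.0, "IT"),
        (103, "Cat", 300.0, "HR"),
        (104, "Dan", 120.0, "IT"),
    ],
)

# The query from part (d), unchanged.
cursor = conn.execute(
    """
    SELECT employee_id, salary, department
    FROM employee
    WHERE employee_id >= 102 AND salary >= 100
    ORDER BY salary
    """
)
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

print(columns)
print(rows)
```

The result has three columns rather than four, because name is not in the SELECT list, and the two matching records come back in increasing order of salary.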

3. Explain in no more than 2 sentences, why the following statements are wrong.

(a) Git is a repository hosting service for GitHub.

(b) A Git repository cannot be accessed without an internet connection.

(c) The command git add is used to add another user in the repository.

(d) Structured data are stored in a local hard drive, while unstructured data are in the cloud.

(e) CSV files are special instances of XML files.

(f) y**x returns y modulo x.

(g) A dictionary in Python is a collection that is ordered, unchangeable and indexed.

(h) The command matrix(1:12, nrow = 3) in R will create a matrix with 3 rows and 1 column with elements 1, 2, 3.

(i) A data frame in R can only hold factors and numeric variables.

(j) Mutable objects in Python are objects whose value changes depending on the operations performed on them.

(k) ggplot2 is an R system for data wrangling.

• Marks: 10

• Answer:

(a) GitHub is a repository hosting service for Git.

(b) A Git repository can be set up and accessed on a local computer.

(c) git add can be used to add file contents to the index.

(d) Structured data is data that is organized according to a predetermined set of rules; unstructured data sets are data sets for which it is difficult to have a predetermined set of rules for organizing them.

(e) A CSV file (comma delimited file) is not a special instance of an XML file. CSV can be used to represent structured data in two-dimensional arrays; XML is a markup language that defines a set of rules for encoding information in objects.

(f) y**x returns y raised to the power of x.

(g) A dictionary is a collection which is changeable and indexed by its keys. It was unordered in older versions of Python, but since Python 3.7 it preserves insertion order.

(h) matrix(1:12, nrow = 3) creates a matrix with 3 rows and 4 columns.

(i) A data frame in R can hold multiple variable types.

(j) Mutable objects are objects whose values can be changed (e.g. lists, sets and dictionaries).

(k) ggplot2 is an R package for graphics.
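The Python-related answers (f), (g) and (j) can be verified with a few lines. This is a small sketch; all variable names and values are made up for the illustration.

```python
# (f) Power versus modulo.
y, x = 7, 2
power = y ** x       # exponentiation: 7 squared
remainder = y % x    # modulo: remainder of 7 divided by 2

# (g) A dictionary is changeable and indexed by key.
marks = {"maths": 62}
marks["stats"] = 71    # add a new key
marks["maths"] = 65    # change an existing value
keys_in_order = list(marks)  # insertion order (guaranteed from Python 3.7)

# (j) A mutable object can be changed in place, so the change is
# visible through every name bound to that object.
original = [1, 2, 3]
alias = original
alias.append(4)

# An immutable object, such as a tuple, rejects in-place changes.
t = (1, 2, 3)
try:
    t[0] = 0
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(power, remainder)
print(marks, keys_in_order)
print(original, tuple_changed)
```

Here power is 49 and remainder is 1, the dictionary accepts both the new key and the updated value, and the appended element shows up in original because alias refers to the same list.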

4. Match the commands C1-C4 with the output in O1-O4.

C1. git status

C2. print(type(2.3))

C3. paste("type", "'float'")

C4. git checkout master

O1. "type 'float'"

O2. <type 'float'>

O3. Switched to branch 'master'

O4. On branch master

• Marks: 10

• Answer:

C1 - O4, C2 - O2, C3 - O1, C4 - O3

5. Consider a data set consisting of the following variables on several customers of a bank:

• balance: credit card outstanding balance

• cleared: whether this balance was cleared in time

• student: whether the person is a university student

• income: the income of the person

(a) Describe what graphs you would produce to demonstrate how the variables balance and income affect the likelihood of the person clearing the credit card outstanding balance in time.

(b) Suppose that when you look at frequencies, students tend to clear their outstanding credit card balances in time less often than the rest of the population. But if you focus on people with high outstanding credit card balances, students are more likely to pay in time than the rest of the population. Describe why this could be the case and what graphics you would use to depict that.

• Marks: 10

• Answer:

(a) A scatter plot of balance against income can be used, with the points labelled according to cleared. Side-by-side boxplots or violin plots of balance and of income, drawn separately for each category of cleared, are also useful.

(b) This could imply that the reason students tend to miss payment is because they tend to have higher balances, and not because they are students per se. When compared with other members of the population with high balances they are actually more reliable. To depict that, we can use a grouped barplot of the cleared proportions separately for cases with balance being below and above its median.
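The reversal described in (b) is easiest to see with numbers. The counts below are entirely hypothetical, chosen so that students mostly carry high balances; they are not data from the paper.

```python
# Hypothetical counts, invented for illustration:
# (group, balance level) -> (number of customers, number who cleared in time)
counts = {
    ("student", "low"): (20, 19),
    ("student", "high"): (80, 48),
    ("other", "low"): (80, 72),
    ("other", "high"): (20, 10),
}


def rate(group, level=None):
    """Proportion who cleared in time, overall or within one balance level."""
    cells = [v for (g, l), v in counts.items()
             if g == group and level in (None, l)]
    total = sum(n for n, _ in cells)
    cleared = sum(c for _, c in cells)
    return cleared / total


for group in ("student", "other"):
    print(f"{group}: overall {rate(group):.2f}, "
          f"low balance {rate(group, 'low'):.2f}, "
          f"high balance {rate(group, 'high'):.2f}")
```

In these made-up counts students clear in time less often overall (0.67 against 0.82), yet more often within each balance level, because 80 of the 100 students fall in the high-balance group where clearing is harder for everyone. The per-level proportions are the heights a grouped barplot would show.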

6. Suppose we are interested in predicting a continuous variable y based on several features X, and we have a dataset with several missing values on the X features. We are comparing two machine learning models, namely ridge regression and random forests. Provide brief answers to the following:

(a) Provide the type of the machine learning models.

(b) Discuss the process of training the machine learning models indicating how the missing values will be handled.

(c) Discuss what criteria you will use to choose between the trained machine learning models.

• Marks: 10

• Answer:

(a) supervised learning, regression.

(b) The data can be split into a training and a test set. The test set will be left aside and not used at all during the training process. The missing values can be handled by one of the relevant methods, e.g. by filling them with the sample average of that variable over the cases without missing values. It is important that this is done through a machine learning pipeline so that the sample average is calculated from points in the training dataset only. In order to tune the parameters of the machine learning pipelines with ridge regression and random forests, cross-validation can be used.

(c) The trained and tuned machine learning pipelines will be used to obtain predictions on the test set (that was left aside in the process of part (b)). The best model will be the one with the better predictive performance according to the mean squared error criterion.
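The key point of (b), that the imputation value must come from the training set only, can be sketched in plain Python. The assumptions are loud ones: the ten data points are invented, there is a single feature, and an ordinary least-squares line stands in for ridge regression and random forests; cross-validation for tuning is left out to keep the sketch short.

```python
import random

random.seed(1)  # arbitrary seed so the split is reproducible

# Hypothetical data set: (feature x, target y); None marks a missing x.
data = [(1.0, 2.1), (None, 3.9), (3.0, 6.2), (4.0, 8.1), (None, 9.8),
        (6.0, 12.2), (7.0, 13.9), (8.0, 16.1), (None, 18.0), (10.0, 20.3)]

# Split first, so that the test set plays no part in any fitting step.
shuffled = data[:]
random.shuffle(shuffled)
train, test = shuffled[:7], shuffled[7:]

# Imputation step: the fill value is learned from the training set only.
observed = [x for x, _ in train if x is not None]
train_mean = sum(observed) / len(observed)


def impute(rows, fill):
    """Replace missing x values with the supplied fill value."""
    return [(fill if x is None else x, y) for x, y in rows]


train_filled = impute(train, train_mean)
test_filled = impute(test, train_mean)  # reuse the training mean here

# Model step: a least-squares line fitted to the imputed training data.
mean_x = sum(x for x, _ in train_filled) / len(train_filled)
mean_y = sum(y for _, y in train_filled) / len(train_filled)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train_filled)
         / sum((x - mean_x) ** 2 for x, _ in train_filled))
intercept = mean_y - slope * mean_x

# Evaluation step: mean squared error on the held-out test set.
test_mse = sum((y - (intercept + slope * x)) ** 2
               for x, y in test_filled) / len(test_filled)

print(f"training mean used for imputation: {train_mean:.2f}")
print(f"test MSE: {test_mse:.2f}")
```

Repeating the model and evaluation steps for each candidate model and comparing the resulting test MSE values gives the comparison described in (c).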
