Examination:

Programming for Data Science (ID2214)

Course code: ID2214

Course name: Programming for data science

Literature and tools:

The following rules apply to both parts of the examination:

Literature, other documents, including lecture slides, notes, etc. are not allowed.

Computers, tablets, phones, etc. are not allowed for searching for answers or

communicating with anyone except for the examiner. A text editor/word pro-

cessor (on a computer, tablet or phone) may be used for writing answers to the

questions, and on part II, any tool, such as an editor, Jupyter notebook or IDE

may be used for developing the programs.

Date and time: April 14, 2020, 08:00-12:00

Examiner: Henrik Bostro¨m

Requirements to pass: 5 points on part I and 10 points on part II.

On part I, keep the text short and to the point.

On part II, only the Python standard library, the NumPy and pandas libraries

may be assumed, in addition to functions explicitly stated in the tasks.

The answers (including blank ones) should be numbered and ordered.

Unreadable answers will be ignored.

Good luck!

Part I (Theory, 10 points)

1a. Methodology, 2 points

Assume that we want to develop a model and estimate its performance on

independent data. We have therefore have decided to randomly split an available

dataset into a training and test set, using the former to train the model and

the latter to estimate its performance. However, since the learning algorithm

that we would like to use cannot directly deal with missing values, we have

decided to employ some imputation technique prior to applying the algorithm.

The question is now whether there may be any potential risk in applying the

imputation technique using the whole dataset, before randomly splitting the

data into training and testing. Explain your answer.

1

1b. Data preparation, 2 points

Assume that we want to discretize numerical features prior to applying the

decision-tree learning algorithm. Will it have any effect on the resulting tree if

we employ equal-width or equal-sized binning? Explain your reasoning.

1c. Performance metrics, 2 points

Assume that we have a binary classification model that, given a test instance,

outputs an estimate of the probability for the positive class. By choosing some

other threshold than 0.5 for whether to assign a positive or negative label to the

test instances, the predictive performance may be affected. Should we increase

or decrease the threshold to increase a) precision and b) recall, of the positive

class? Explain your answer.

1d. Combining models, 2 points

The predictive performance of an ensemble of classifiers, for which the predic-

tions are formed by averaging predictions of the individual members, is depen-

dent on the diversity of the members. Describe how an ensemble of na¨ıve Bayes

classifiers would be trained, if similar techniques that are used to form random

forests would be employed.

1e. Association rules, 2 points

Assume that we have generated a set of association rules with a specified support

and confidence, from a dataset with a set of binary features and binary class

labels, encoded as itemsets. Assume that we have selected a subset of the rules,

for which the heads (consequents) contain only a class label. If we want to use

this subset of rules to classify a novel test instance, i.e., to assign one of the two

class labels, what are the potential problems we may encounter? Explain your

reasoning.

Part II (Programming, 20 points)

2a. Data preparation, 10 points

Your task is to define the following Python function that aggregates multiple

rows with the same identifier into a single row:

aggregate(df)

which given a pandas dataframe df, where the columns correspond to numerical

features, except for a column named CLASS, which contains (categorical) class

values, and a column named ID, which contains identifiers for the instances. In

case one or more instances share the same identifier, these should in the new

dataframe be represented by a single row, where the identifier should appear in

the ID column, and the value for each numerical feature of the instance should

be the mean of the values for instances sharing the identifier, and the value for

the class label of the instance should be the mode of the values for instances

sharing the identifier (in case the mode is not unique, one of the mode values

2

may be chosen arbitrarily).

For example, given a dataframe df:

ID V1 V2 CLASS

0 1 1 1 A

1 1 2 0 A

2 2 3 1 B

3 2 4 0 C

4 3 5 1 C

then aggregate(df) should return the following dataframe:

ID V1 V2 CLASS

0 1 1.5 0.5 A

1 2 3.5 0.5 B

2 3 5.0 1.0 C

Hint: You may obtain unique values from a pandas data series values by

values.unique(), the mean by values.mean() and the first mode value by

values.mode()[0].

2b. Combining models, 10 points

Your task is to define the following Python function:

stacking(level0_df,level1_df,base_learners,learner)

which given two pandas dataframes level0_df and level1_df, with the same

set of columns, which correspond to features, except for a column named CLASS,

which contains the class values, and where the rows correspond to instances, a

list of learning algorithms base_learners (see below), and a learning algorithm

learner (see below), will return a list of base models and a stacking model,

where each base model is trained on level0_df using the corresponding base

learning algorithm, and the stacking model is trained using the class labels of

level1_df and the predictions of the base models on level1_df as features.

Each learning algorithm alg, i.e., as specified by learner and the elements

of base_learners above, can be used to generate a model from a dataframe by:

model = alg.fit(df)

Each resulting model can be used to make predictions (produce a list of class

labels) for a dataframe (in which the CLASS column is ignored) by:

predictions = model.predict(df)

For example, given two dataframes df0 and df1, with the same columns, and

the learning algorithms alg1,alg2,alg3,alg4, then:

stacking(df0,df1,[alg1,alg2,alg3],alg4)

should return a list of three (base) models trained using alg1,alg2,alg3

on df0, and a (stacking) model trained using alg4 on a dataframe with four

columns; one column for each base model with the predictions for df1, and one

column containing the class labels of df1.

3

欢迎咨询51作业君