程序代写案例-200A

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

Data 100 & 200A Principles and Techniques of Data Science
Spring 2019 Final
INSTRUCTIONS
• You have 2 hours and 50 minutes to complete the exam.
• The exam is closed book, closed notes, closed computer, closed calculator, except for the provided midterm 1
reference sheet and up to three 8.5" × 11" sheets of notes of your own creation.
• Mark your answers on the exam itself. We will not grade answers written on scratch paper.
Last name
First name
Student ID number
CalCentral email ([email protected])
Exam room
Name of the person to your left
Name of the person to your right
All the work on this exam is my own.
(please sign)
Terminology and Notation Reference:
exp(x) ex
log(x) loge x
Linear regression model E[Y |X] = XTβ
Logistic (or sigmoid) function σ(t) = 11+exp(−t)
Logistic regression model P (Y = 1|X) = σ(XTβ)
Squared error loss function L(y, θ) = (y − θ)2
Absolute error loss function L(y, θ) = |y − θ|
Cross-entropy loss function L(y, θ) = −y log θ − (1− y) log(1− θ)
Bias Bias[θˆ, θ] = E[θˆ]− θ
Variance Var[θˆ] = E[(θˆ − E[θˆ])2]
Mean squared error MSE[θˆ, θ] = E[(θˆ − θ)2]
21. (6 points) Visualization
You have data on 500 rental housing units, including rent price, number of bedrooms, and square footage.
(a) (2 pt) Fill the box for all suitable visualizations for comparing the distribution of rent price for 2, 3, and
4 bedroom units.2 Scatterplot 2 Violin plots 2 Barplot 2 Boxplots 2 None of these
(b) (2 pt) Fill the box for all suitable visualizations for examining how rent price and square footage relate.2 Density plots 2 Scatterplot 2 Histograms 2 Boxplots 2 None of these
The scatterplot generated by sns.jointplot below to the left visualizes a dataset containing 600 (x, y) pairs.
You partition the pairs by their x value into three groups:
• The 200 with the lowest x values are in group 0.
• The 200 with intermediate x values are in group 1.
• The 200 with the highest x values are in group 2.
(c) (1 pt) Bubble the letter of the boxplot that shows the distribution of x values in each of the three groups.
© A © B © C © D
(d) (1 pt) Bubble the letter of the boxplot that shows the distribution of y values in each of the three groups.
© A © B © C © D
2. (4 points) Classification and Regression Trees (CART) and Random Forests
(a) (1 pt) When predicting a qualitative outcome, the predicted class at each terminal node is:
© the most common class in the node © the average class for the node © neither
(b) (1 pt) When predicting a quantitative outcome, the predicted value at each terminal node is:
© the most common outcome in the node © the average outcome for the node © neither
(c) (2 pt) The main idea behind ensemble prediction methods such as Random Forests is that aggregating
predictions from predictors applied to multiple bootstrap samples of the learning set:
© reduces bias © reduces variability © reduces both bias and variability © none of these
Name: 3
3. (8 points) Feature Engineering
These instructions are not identical to the similar question on midterm 2, so please read carefully.
For each dataset depicted below in a scatterplot, fill in the squares next to all of the letters for the functions f
that would make it possible to choose scalars β0, β1, and β2 such that yi = β0 + xiβ1 + f(xi)β2 for all (xi, yi)
pairs in the dataset. The input to each f is a scalar x shown on the horizontal axis, and the corresponding y
value is shown on the vertical axis.
(A) f(x) = −x
(B) f(x) = x
(C) f(x) = x2
(D) f(x) = |x|
(E) None of the above
(i) (2 pt)2 A 2 B 2 C 2 D 2 E (ii) (2 pt)2 A 2 B 2 C 2 D 2 E
−1 0 1 2 3 4 5
−6
−4
−2
0
2
4
6
x
y
−1 0 1 2 3 4 5
−6
−4
−2
0
2
x
y
(iii) (2 pt)2 A 2 B 2 C 2 D 2 E (iv) (2 pt)2 A 2 B 2 C 2 D 2 E
−2 0 2 4 6 8 10 12
2
2.5
3
3.5
4
4.5
x
y
−3 −2 −1 0 1 2 3
2
3
4
5
6
7
8
x
y
44. (11 points) Regression
Suppose you are interested in predicting a one-dimensional quantitative random
variable Y (outcome) in terms of a two-dimensional quantitative random variable
X (covariates). You have a learning set (x1, y1), . . . , (x3, y3) of three observations,
given in the table to the right.
i xi yi
1
[
0
3
]
-3
2
[
3
0
]
0
3
[
6
0
]
3
(a) (2 pt) You estimate the regression function E[Y |X] using ridge regression with no intercept term and a
shrinkage parameter of λ. Select the correct expression for the regression function, where β = [β1 β2]T .
© E[Y |X] = XTβ
© E[Y |X] = XTβ + ξ
© E[Y |X] = XTβ + λ · (β21 + β22)
© E[Y |X] = XTβ + ξ + λ · (β21 + β22)
© None of these
(b) (2 pt) Why are the letters X and Y capitalized in the models in (a) for the regression function?
© They represent random variables.
© They represent matrix-valued elements of the learning set.
© They represent unknown parameters that need to be estimated from the learning set.
(c) (2 pt) There are two derivations below: left and right. Select the one that expresses the regularized
empirical risk (mean squared error) for ridge regression for this dataset in terms of β1 and β2 when λ=1.
© 1
3
[
(−3+3β2)2 + (3β1)2 + (3+6β1)2
]
+(β21+β
2
2) ©
1
3
[
(−3−3β2)2 + (−3β1)2 + (3−6β1)2
]
+(β21+β
2
2)
= 16β21 + 12β1 + 4β
2
2 − 6β2 + 6 = 16β21 − 12β1 + 4β22 + 6β2 + 6
(d) (3 pt) Find the β1 and β2 that minimize the regularized empirical risk objective from part (c) above.
Show your work for full credit.
β1 = β2 =
(e) (2 pt) Choose the value of λ for which the values of β1 and β2 that would minimize the regularized
empirical risk for ridge regression would also minimize the mean squared error on the learning set.
© -1 © 0 © 1 © None of these © Impossible to tell
Name: 5
5. (12 points) Logistic Regression
(a) (4 pt) Select always, sometimes, or never to describe when each statement below is true about a logistic
regression model P (Y = 1|X) = σ(XTβ), where Y is binary and X is vector-valued. X and β are finite.
© Always © Sometimes © Never: P (Y = 1|X) > XTβ
© Always © Sometimes © Never: P (Y = 1|X) = P (Y = 0| −X)
© Always © Sometimes © Never: P (Y = 1|X) < 1
© Always © Sometimes © Never: σ(XTβ) ≤ σ(XT (2 · β))
(b) (8 pt) Complete the code below to assign best_λ to the regularization parameterλ among choices that
gives the smallest 2-fold cross-validation risk for L1-regularized logistic regression. The learning set x has
n observations and m features. Correct labels are named y. Assume minimize can minimize any function.
n, m = x.shape
assert len(y) == n
from scipy.optimize import minimize as scipy_minimize
def minimize(f, k):
"Return the k-length array that minimizes f."
return scipy_minimize(f, np.zeros(k))['x']
def sigma(t):
return 1/(1 + np.exp(-t))
def log_loss(y, z):
return -y * np.log(z) - (1 - y) * np.log(1 - z)
def risk(λ):
"Return 2-fold cross-validation risk for regularization hyperparameter λ."
def objective(b):
losses = log_loss(y_train, sigma(x_train @ b))
return __________________________________________________________________________
half = n // 2
x_1, x_2 = x[:half, :], x[half:, :]
y_1, y_2 = y[:half], y[half:]
x_train, y_train = ________________________________, ________________________________
b = minimize(objective, m)
risk_1 = ____________________________________________________________________________
x_train, y_train = ________________________________, ________________________________
b = minimize(objective, m)
risk_2 = ____________________________________________________________________________
return (risk_1 + risk_2)/2
choices = [2 ** c for c in range(-10, 10)]
best_λ = ________________________________________________________________________________
66. (13 points) Taxi Trips
The streets of Anytown, USA, are a perfect grid with every block exactly one tenth of a mile long. Taxi drivers
only pick up and drop off at intersections. You have acquired a log of all 200,000 taxi trips in Anytown during
2018, which includes measurements for the following 3 variables: number of blocks, total distance, and fare.
Taxi fare in Anytown is $2/mile + $3/trip. Assume no measurement error and no missing values.
(a) (1 pt) What is the rank of the matrix with rows corresponding to the 200,000 taxi trips and columns to
the 3 variables?
© 1 © 2 © 3 © 4 © None of these © Impossible to tell
(b) (1 pt) What is the rank of a four-column matrix containing columns for the three variables as well as a
column of all ones?
© 1 © 2 © 3 © 4 © None of these © Impossible to tell
(c) (4 pt) If you apply PCA to this four-column matrix, which of the following statements will be true?
U , Σ, and V refer to the singular value decomposition of the centered four-column matrix X: X = UΣV T .
© True © False: The sum of the singular values will be the total variance of the dataset.
© True © False: All singular values will be greater than zero.
© True © False: The dot product of the first and second columns of V will be equal to 1.
© True © False: A scatterplot of the first two columns of UΣ will form a straight line.
A simple random sample of 500 taxi trips in Anytown is taken on May 1, 2018. The data matrix for this
sample has 4 columns: number of blocks, total distance, fare, and the duration of the trip.
(d) (5 pt) Fill the boxes for all of the following that are possible to compute reliably via bootstrapping, using
this 4-column sample from May 1 as well as the original 3-column population dataset for all of 2018.2 A confidence interval for the mean trip duration in 2018.2 A confidence interval for the mean trip duration in 2019.2 An estimate of the variance of the average duration computed for each possible sample of 500 trips
taken on May 1, 2018.2 An estimate of the variance of the average duration computed for each possible sample of 100 trips
taken on May 1, 2018.2 A confidence interval for the difference between the average trip distance on May 1, 2018, and the
average trip distance for all of 2018.2 None of the above
(e) (2 pt) Using this random sample, you decide to fit an ordinary least squares linear regression model
with no intercept using the normal equations, where the outcome is trip duration and the design matrix
contains three covariates: number of blocks, total distance, and fare. Which problem below do you expect
to encounter first?
© The model will overfit because the sample is too small.
© Processing the dataset will require more than one computer.
© Gradient descent will reach a local minimum that is not the global minimum.
© It will be impossible to fit the parameters of the model.
© None of these
Name: 7
7. (22 points) Tables
Fill in both the Python code and the SQL query to produce each result below, assuming that the following
tables are stored both as Pandas DataFrames and Sqlite tables. Only the first few rows are shown for
each table. No tables have missing values. The scores table contains a row for each submitted homework
in a course and contains columns for the id of the student who submitted, the hw number, and the score.
Assume that a student can submit a homework at most once, every enrolled student has submitted at least
one homework, and every homework has been submitted by at least one student. The hws table contains a
row for each homework and contains columns for the name, which is always the letters “hw” followed by the
homework number, as well as the max possible score for the homework. Each homework appears exactly once
in the hws table. The ugrads table contains the student ID (sid) for each undergraduate enrolled at the
university. There are both undergrads and graduate students in the course.
scores hws ugrads
id hw score
90210 1 4
94720 3 3
90210 3 2
name max
hw1 5
hw2 5
hw3 3
sid
94709
94114
94720
(a) (6 pt) Select the number of undergrads enrolled in the course.
Python: _____________________(_____________________________.isin(_______________________________))
SQL: SELECT COUNT(*) FROM ugrads WHERE _______________________________________________________;
(b) (6 pt) List the IDs for students who are missing at least one homework.
Python: counts = _______________________________________________________________________________
list(__________________[_________________________________________________________].index)
SQL: SELECT id FROM scores ______________________________ COUNT(*)<(SELECT COUNT(*) FROM hws);
(c) (6 pt) Build a table with one row per homework containing the total number of points scored by under-
graduates on that homework.
Python: (scores.__________________________________________________________________________________
.groupby(_______________________________)________________________________________________)
SQL: SELECT hw, ______________ FROM scores _______________________________________ GROUP BY hw;
(d) (4 pt) Select the average fraction of possible points that was scored on each submitted homework.
Python: names = scores['hw'].apply(lambda n: f'_________________________________________________')
np.mean(scores['score'] / np.array(hws.set_index('name')________________________________))
SQL: SELECT __________________________ FROM scores JOIN hws ON _______________________________;
88. (8 points) Bias and Variance
(a) (2 pt) Select the best description of the bias of an estimator θˆ for a parameter θ, where θˆ is computed
based on a simple random sample from the population of interest.
© The difference between θ and the value of θˆ for a particular sample.
© The average difference between θ and θˆ computed for each element of the sample.
© The average difference between θ and θˆ over all possible simple random samples from the population.
© The difference between the expected value of θ and the value of θˆ for a particular sample.
You estimate the maximum possible outcome (number of dots) of rolling of a fair 6-sided die using the outcome
of a single roll of the die (pretending you don’t already know the number of dots on the faces of the die).
(b) (2 pt) What is the bias of this estimator?
© -3.5 © -2.5 © -1.5 © 0 © 1.5 © 2.5 © 3.5 © Other: ________
(c) (1 pt) What is the variance of this estimator?
© 0 © 52 © 72 © 3512 © 4712 © 556 © 616 © Other: ________
(d) (1 pt) What is the mean squared error of this estimator?
© 0 © 52 © 72 © 3512 © 4712 © 556 © 616 © Other: ________
(e) (2 pt) How would the magnitude (absolute value) of the estimator’s bias change if your estimator was
the maximum outcome that appeared in 100 rolls of the die instead of just one roll?
© It would increase © It would decrease © It would stay the same
9. (4 points) Sampling Distributions
The temperature difference between the outside and inside of a room measured each minute has this distribution:
You sample 10 differences at random with replacement. For each distribution below, select the histogram that
best visualizes the distribution.
© A © B © C © D © E © F: The sampling distribution of the sample 90th percentile.
© A © B © C © D © E © F: The sampling distribution of the sample median.
© A © B © C © D © E © F: The sampling distribution of the sample standard deviation.
© A © B © C © D © E © F: The bootstrap distribution of the sample median.
Name: 9
10. (8 points) Distributed File Systems
A large file is partitioned into 8 equal-sized blocks, and each block is replicated 2 times. For each block, a
simple random sample of 2 machines is chosen (without replacement) out of 4 total machines, and one replica
of the block is stored on each machine chosen in this way.
(a) (2 pt) What is the expected number of blocks stored on each machine?
© 0 © 1 © 2 © 4 © 8 © 16 © 32 © Other: ________
(b) (2 pt) If one machine fails, what is the chance that the entire file can still be read?
© 0 © 18 © 14 © 12 © 34 © 78 © 1 © Other: ________
(c) (2 pt) If two machines fail, what is the chance that the entire file can still be read?
© 0 © 12 ©
(
1
2
)8 © 16 © 56 © ( 56)8 © 1−( 56)8 © 1 © Other: ________
(d) (2 pt) If three machines fail, what is the chance that the entire file can still be read?
© 0 © 12 ©
(
1
2
)8 © 16 © 56 © ( 56)8 © 1−( 56)8 © 1 © Other: ________
11. (4 points) Distributed Computing
Assume you have the long-running remote function f defined below and you want to call it 10 times.
@ray.remote
def f():
run_time = 10 * np.random.random() # Chosen uniformly at random from 0 to 10 seconds
time.sleep(run_time) # Do nothing for run_time seconds
return run_time
(a) (1 pt) Assuming there is no overhead, what is the expected amount of time required to call f 10 times
if all calls are executed serially?
© 1 second © 5 seconds © 10 seconds © 50 seconds © 100 seconds © None of these
(b) (1 pt) Assuming there is no overhead, what is the maximum amount of time required to call f 10 times
if all calls to f are executed in parallel?
© 1 second © 5 seconds © 10 seconds © 50 seconds © 100 seconds © None of these
(c) (2 pt) Select all of the expressions below that correctly compute a list of the return values from calling
f 10 times and execute all calls in parallel.
2 [f() for _ in range(10)]
2 [f.remote() for _ in range(10)]
2 ray.get([f.remote() for _ in range(10)])
2 [ray.get(f.remote()) for _ in range(10)]
2 None of these
10
12. (0 points) Optional: Draw a Picture About Berkeley Data Science

欢迎咨询51作业君