DSC 40A - Homework 1
Due: Tuesday, January 12, 2021 at 11:59pm PDT

Write your solutions to the following problems by either typing them up or handwriting them on another piece of paper. Homeworks are due to Gradescope by 11:59pm PDT on Tuesday. Homework will be evaluated not only on the correctness of your answers, but on your ability to present your ideas clearly and logically. You should always explain and justify your conclusions, using sound reasoning. Your goal should be to convince the reader of your assertions. If a question does not require explanation, it will be explicitly stated.

Homework should be written up and turned in by each student individually. You may talk to other students in the class about the problems and discuss solution strategies, but you should not share any written communication and you should not check answers with classmates. You can tell someone how to do a homework problem, but you cannot show them how to do it. This policy also means that you should not post or answer homework-related questions on Piazza, which is a written medium. This includes private posts to instructors. Instead, when you need help with a homework question, talk to a classmate or an instructor in their office hours.

For each problem you submit, you should cite your sources by including a list of names of other students with whom you discussed the problem. Instructors do not need to be cited.

This homework will be graded out of 55 points. The point value of each problem or sub-problem is indicated by the number of avocados shown.

Problem 1. Syllabus

Read the syllabus file on Canvas carefully and state in your solution that you have done so.

Problem 2. Baseball Statistics

There are thirty baseball teams in Major League Baseball. In a regular season, each team plays 162 games, though in 2020, due to the COVID-19 pandemic, each team played only 60 games. Many different statistics are kept for each team, one of the most basic being the total number of runs scored by the team throughout the season. Suppose you have access to a data set consisting of the total number of runs scored by each of the teams in the 2019 regular season, as well as in the shortened 2020 regular season. Report your answers below rounded to 2 digits after the decimal point.

a) From your data set, you calculate that the average number of runs scored by each team in 2020 was 268.367. Later, you learn that your data set had an error. The San Diego Padres actually scored 325 runs, but your data set accidentally recorded this as 25 runs. From this information alone (without access to the full data set), can you correct the mistake and find the correct average number of runs per team in 2020?

b) In general, are you able to correct an average if one of the data values changes? What about the median? The mode?

c) You want to get a sense for how each team performed in the 2020 regular season versus the 2019 regular season. How might you use your data set to quantify each team's improvement or decline from 2019 to 2020?

Problem 3. Averages

Consider a data set of numbers and its average. Which of the following statements must be true? Remember to justify all answers.

a) Some of the numbers in the data set must be greater than the average.

b) At least half of the numbers in the data set must be greater than the average.

c) Exactly half of the numbers in the data set must be greater than the average.

d) Not all of the numbers in the data set can be greater than the average.
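If you would like to experiment with the summary statistics that appear in Problems 2 and 3, the short sketch below computes the mean, median, and mode of a small made-up list using Python's built-in statistics module. The numbers are purely illustrative and are unrelated to the baseball data; this sketch is not a solution to either problem.

    import statistics

    # A small made-up data set, used only to illustrate the three summary statistics.
    data = [3, 5, 5, 8, 10]

    print(statistics.mean(data))    # 6.2
    print(statistics.median(data))  # 5
    print(statistics.mode(data))    # 5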
Problem 4. Fahrenheit or Celsius?

The other day, you went online to look at how other cities around the world compare to San Diego in terms of climate. You looked up Osaka, Japan and found that its temperature is recorded in Celsius. The table below shows the monthly temperatures in these two places.

    San Diego, US (°F):  66  66  67  69  69  72  76  77  77  74  70  66
    Osaka, Japan (°C):    9  10  14  20  25  28  32  33  29  23  18  12

You want to compare the median monthly temperatures in San Diego and Osaka. One way is to convert all the degrees in Fahrenheit to Celsius before computing the median temperature for San Diego. This way, you compare the median temperatures in both places using the same units. Your friend Skip is lazier and thinks you can skip some of that work: "Why don't we find the median temperature in San Diego in Fahrenheit first, and then only convert the median value to Celsius? That way we only need to do the conversion once or twice."

a) Is Skip correct that you'll get the same median temperature either way? Show your work.

b) More generally, let g(t) be the function which takes in a temperature in degrees Fahrenheit and outputs the temperature in degrees Celsius. That is, g(t) = \frac{5}{9}(t - 32). Then this problem can be formulated mathematically as the validity of the following equation:

    \text{Median}(g(t_1), \ldots, g(t_n)) = g(\text{Median}(t_1, \ldots, t_n)).

Prove this equation or disprove it by counterexample.

c) Finally, you find that the median temperatures for San Diego and Osaka are quite close to each other. Can you say that the climates of the two cities are similar? Where do you think the median temperature is more representative of the overall climate, San Diego or Osaka? What is your reasoning?
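To make the conversion function in part b) concrete, here is a minimal Python sketch of g, checked against two familiar reference temperatures. The function name g follows the problem statement; the sketch only illustrates the formula and is not part of the required solution.

    def g(t):
        # Convert a temperature from degrees Fahrenheit to degrees Celsius.
        return 5 / 9 * (t - 32)

    print(g(32))   # 0.0, the freezing point of water
    print(g(212))  # 100.0, the boiling point of water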
Problem 5. Minimize risk or maximize likelihood?

In our lecture, we argued that one way to make a good prediction h is to minimize the mean absolute error

    R(h) = \frac{1}{n} \sum_{i=1}^{n} |h - x_i|.

We saw that the median of x_1, \ldots, x_n is the prediction with the smallest mean absolute error. Your friend Max thinks that instead of minimizing the mean absolute error, it is better to maximize the following quantity:

    M(h) = \prod_{i=1}^{n} e^{-|h - x_i|}.

The above formula is written using product notation, which is similar to summation notation, except that terms are multiplied rather than added. For example,

    \prod_{i=1}^{n} a_i = a_1 \cdot a_2 \cdots a_n.

Max's reasoning is that for some models, e^{-|h - x_i|} is used to compute how likely the prediction h is given the observation x_i, and hence it is called a "likelihood." We should therefore attempt to maximize the chance of getting the prediction h, given the set of observations. In this problem, we'll see if Max has a good idea.

a) For an arbitrary fixed value of x_i, sketch a graph of the basic shape of the likelihood function L(h) = e^{-|h - x_i|}. Explain, based on the graph, why larger values of L(h) correspond to better predictions h.

b) Informally, a minimizer of a function f is an input x_min where f achieves its minimum value. More formally, x_min is a minimizer of f if f(x_min) ≤ f(x) for all values of x. In the same way, x_max is a maximizer of f if f(x_max) ≥ f(x) for all values of x. Suppose that f is some unknown function which takes in a real number and outputs a real number. Suppose that c is an unknown positive constant, and define the function g(x) = e^{-c \cdot f(x)}. Prove that if x_min is a minimizer of f, then it is also a maximizer of g.

c) At what value h* is M(h) maximized? Is this a reasonable prediction? Discuss the pros and cons of using Max's prediction strategy, and describe scenarios where it gives a good prediction and where it gives a bad prediction, in your opinion.
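If you would like to evaluate M(h) numerically while thinking about Problem 5, product notation translates directly into numpy's prod function. The data set and candidate prediction below are made up purely for illustration; this is not part of the required solution.

    import numpy as np

    x = np.array([2.0, 3.0, 7.0])  # a small made-up data set
    h = 4.0                        # a candidate prediction

    # M(h) is the product over i of exp(-|h - x_i|).
    # Here that is exp(-(2 + 1 + 3)) = exp(-6), roughly 0.0025.
    M = np.prod(np.exp(-np.abs(h - x)))
    print(M)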
Problem 6. Does more data help?

A question that is perhaps timely: companies are hoarding as much data as they can get, but are predictions necessarily better with more data? Let's start with a simple experiment.

Imagine that our data contain randomness (as do all phenomena on Earth) and are generated independently and identically from a uniform distribution on an interval of the real line: that is, each data point has an equal chance of taking any value within a fixed interval. Suppose that the length of that interval is 2, but we do not know the center of the interval. Based on the data we observe, we want to estimate, or predict, the center of the interval, which we call θ, and we want to see whether more data will help us in determining θ.

We can generate synthetic data to perform this experiment. We'll start by fixing a value of θ, which will determine the interval from which we generate our data. In Python, let's first import the numpy package and generate a true θ parameter by picking a number randomly from 0 to 10:

    import numpy as np
    theta = np.random.uniform(0.0, 10.0)

Note in the above code that the np.random.uniform() function takes in two arguments, the lower and upper limits of an interval, and outputs a random number in that interval, chosen from a uniform distribution.

Now we have fixed the true model that generates data. The value of θ, or the center of the interval from which we will generate data, is determined, though we don't know what it is. We've decided that our interval will have a width of 2, which means the data will be uniformly distributed across the interval [θ − 1, θ + 1].

Next, we can generate synthetic data from our model using the np.random.uniform() function. For example, if we wish to generate 10 data points, we can set the number of data points n = 10 and write:

    n = 10
    x = np.random.uniform(theta - 1.0, theta + 1.0, n)

Here the additional third argument to np.random.uniform() specifies how many identically distributed outputs we wish the function to generate. The default (as in our previous lines of code) is 1. If we would like to take a look at the generated data, we can print it:

    print(x)

You will see that our data x are stored as an array and are randomly distributed over a certain range.

With the data generated, we can now start to infer what the value of θ is. Run the code given above and look at the data in the array x. Can you guess the value of θ based on the data in x? How good of a guess have we made? Since we have access to the true value of θ, we can use loss functions to tell us how close our predictions are to the truth. For example, a small absolute error |h* − θ| or a small squared error (h* − θ)^2 indicates that the estimator h* is close to the truth θ. If we use the squared loss, the best estimate of θ is the mean value

    h = \frac{1}{n} \sum_{i=1}^{n} x_i.

Luckily, the numpy package has a built-in function for that:

    h = np.mean(x)

We can then compare our estimate against the truth theta using the squared loss and check how good our estimate is:

    err = (h - theta)**2

Recall that the z**2 notation means taking z to the second power. Smaller values of err correspond to better predictions.

a) Now try this computation for n = 1, . . . , 100 using a for loop, and observe how the error changes. Use the pyplot package to plot the results. For example, if err were to take the values [1, 2, 4, 8, 16, 32, 64, 128] when n takes the values [1, 2, 3, 4, 5, 6, 7, 8], then we could make the plot using the following code:

    import matplotlib.pyplot as plt
    plt.plot([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 4, 8, 16, 32, 64, 128])
    plt.show()

What does the curve look like? If your plot is too noisy, make a few repeated runs of the algorithm, store the results, and average the err values you obtain across the runs. How do you think this error scales with the amount of data? Hint: try plotting 1.0/err.

b) Next, repeat this experiment for the absolute loss (for which the best estimate is the median value). What scaling do you observe?

c) For this (growing) data set, are our predictions better if we have more data? Explain your intuition for why this is the case.

Turn in your plots and your answers to the questions above, but do not turn in your code for this problem.