DSC 40A - Homework 1
Due: Tuesday, January 12, 2021 at 11:59pm PDT

Write your solutions to the following problems by either typing them up or handwriting them on another piece of paper. Homeworks are due to Gradescope by 11:59pm PDT on Tuesday. Homework will be evaluated not only on the correctness of your answers, but on your ability to present your ideas clearly and logically. You should always explain and justify your conclusions, using sound reasoning. Your goal should be to convince the reader of your assertions. If a question does not require explanation, it will be explicitly stated.

Homework should be written up and turned in by each student individually. You may talk to other students in the class about the problems and discuss solution strategies, but you should not share any written communication and you should not check answers with classmates. You can tell someone how to do a homework problem, but you cannot show them how to do it. This policy also means that you should not post or answer homework-related questions on Piazza, which is a written medium. This includes private posts to instructors. Instead, when you need help with a homework question, talk to a classmate or an instructor in their office hours.

For each problem you submit, you should cite your sources by including a list of names of other students with whom you discussed the problem. Instructors do not need to be cited.

This homework will be graded out of 55 points. The point value of each problem or sub-problem is indicated by the number of avocados shown.

Problem 1. Syllabus

Read the syllabus file on Canvas carefully and state in your solution that you have done so.

Problem 2. Baseball Statistics

There are thirty baseball teams in Major League Baseball. In a regular season, each team plays 162 games, though in 2020, due to the COVID-19 pandemic, each team played only 60 games. Many different statistics are kept for each team, one of the most basic being the total number of runs scored by the team throughout the season. Suppose you have access to a data set consisting of the total number of runs scored by each of the teams in the 2019 regular season, as well as in the shortened 2020 regular season. Report your answers below rounded to 2 digits after the decimal point.

a) From your data set, you calculate that the average number of runs scored by each team in 2020 was 268.367. Later, you learn that your data set had an error. The San Diego Padres actually scored 325 runs, but your data set accidentally recorded this as 25 runs. From this information alone (without access to the full data set), can you correct the mistake and find the correct average number of runs per team in 2020?

b) In general, are you able to correct an average if one of the data values changes? What about the median? The mode?

c) You want to get a sense for how each team performed in the 2020 regular season versus the 2019 regular season. How might you use your data set to quantify each team's improvement or decline from 2019 to 2020?

Problem 3. Averages

Consider a data set of numbers and its average. Which of the following statements must be true? Remember to justify all answers.

a) Some of the numbers in the data set must be greater than the average.

b) At least half of the numbers in the data set must be greater than the average.

c) Exactly half of the numbers in the data set must be greater than the average.

d) Not all of the numbers in the data set can be greater than the average.
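If you would like to experiment with the summary statistics that appear in Problems 2 and 3, the short sketch below computes the mean, median, and mode of a small made-up list using Python's built-in statistics module. The numbers are purely illustrative and are unrelated to the baseball data; this sketch is not a solution to either problem.

    import statistics

    # A small made-up data set, used only to illustrate the three summary statistics.
    data = [3, 5, 5, 8, 10]

    print(statistics.mean(data))    # 6.2
    print(statistics.median(data))  # 5
    print(statistics.mode(data))    # 5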
Problem 4. Fahrenheit or Celsius?

The other day, you went online to look at how other cities around the world compare to San Diego in terms of climate. You looked up Osaka, Japan and found that its temperature is recorded in Celsius. The table below shows the monthly temperatures in these two places.

    San Diego, US (°F):  66  66  67  69  69  72  76  77  77  74  70  66
    Osaka, Japan (°C):    9  10  14  20  25  28  32  33  29  23  18  12

You want to compare the median monthly temperatures in San Diego and Osaka. One way is to convert all the degrees in Fahrenheit to Celsius before computing the median temperature for San Diego. This way, you compare the median temperatures in both places using the same units. Your friend Skip is lazier and thinks you can skip some of that work: "Why don't we find the median temperature in San Diego in Fahrenheit first, and then only convert the median value to Celsius? That way we only need to do the conversion once or twice."

a) Is Skip correct that you'll get the same median temperature either way? Show your work.

b) More generally, let g(t) be the function which takes in a temperature in degrees Fahrenheit and outputs the temperature in degrees Celsius. That is, g(t) = \frac{5}{9}(t - 32). Then this problem can be formulated mathematically as the validity of the following equation:

    \text{Median}(g(t_1), \ldots, g(t_n)) = g(\text{Median}(t_1, \ldots, t_n)).

Prove this equation or disprove it by counterexample.

c) Finally, you find that the median temperatures for San Diego and Osaka are quite close to each other. Can you say that the climates of the two cities are similar? Where do you think the median temperature is more representative of the overall climate, San Diego or Osaka? What is your reasoning?
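To make the conversion function in part b) concrete, here is a minimal Python sketch of g, checked against two familiar reference temperatures. The function name g follows the problem statement; the sketch only illustrates the formula and is not part of the required solution.

    def g(t):
        # Convert a temperature from degrees Fahrenheit to degrees Celsius.
        return 5 / 9 * (t - 32)

    print(g(32))   # 0.0, the freezing point of water
    print(g(212))  # 100.0, the boiling point of water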
Problem 5. Minimize risk or maximize likelihood?

In our lecture, we argued that one way to make a good prediction h is to minimize the mean absolute error

    R(h) = \frac{1}{n} \sum_{i=1}^{n} |h - x_i|.

We saw that the median of x_1, \ldots, x_n is the prediction with the smallest mean absolute error. Your friend Max thinks that instead of minimizing the mean absolute error, it is better to maximize the following quantity:

    M(h) = \prod_{i=1}^{n} e^{-|h - x_i|}.

The above formula is written using product notation, which is similar to summation notation, except that terms are multiplied rather than added. For example,

    \prod_{i=1}^{n} a_i = a_1 \cdot a_2 \cdots a_n.

Max's reasoning is that for some models, e^{-|h - x_i|} is used to compute how likely the prediction h is given the observation x_i, and hence it is called a "likelihood." We should therefore attempt to maximize the chance of getting the prediction h, given the set of observations. In this problem, we'll see if Max has a good idea.

a) For an arbitrary fixed value of x_i, sketch a graph of the basic shape of the likelihood function L(h) = e^{-|h - x_i|}. Explain, based on the graph, why larger values of L(h) correspond to better predictions h.

b) Informally, a minimizer of a function f is an input x_min where f achieves its minimum value. More formally, x_min is a minimizer of f if f(x_min) ≤ f(x) for all values of x. In the same way, x_max is a maximizer of f if f(x_max) ≥ f(x) for all values of x. Suppose that f is some unknown function which takes in a real number and outputs a real number. Suppose that c is an unknown positive constant, and define the function g(x) = e^{-c \cdot f(x)}. Prove that if x_min is a minimizer of f, then it is also a maximizer of g.

c) At what value h* is M(h) maximized? Is this a reasonable prediction? Discuss the pros and cons of using Max's prediction strategy, and describe scenarios where it gives a good prediction and where it gives a bad prediction, in your opinion.
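If you would like to evaluate M(h) numerically while thinking about Problem 5, product notation translates directly into numpy's prod function. The data set and candidate prediction below are made up purely for illustration; this is not part of the required solution.

    import numpy as np

    x = np.array([2.0, 3.0, 7.0])  # a small made-up data set
    h = 4.0                        # a candidate prediction

    # M(h) is the product over i of exp(-|h - x_i|).
    # Here that is exp(-(2 + 1 + 3)) = exp(-6), roughly 0.0025.
    M = np.prod(np.exp(-np.abs(h - x)))
    print(M)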
Problem 6. Does more data help?

A question that is perhaps timely: companies are hoarding as much data as they can get, but are predictions necessarily better with more data? Let's start with a simple experiment.

Imagine that our data contain randomness (as do all phenomena on Earth) and are generated independently and identically from a uniform distribution on an interval of the real line: that is, each data point has an equal chance of taking any value within a fixed interval. Suppose that the length of that interval is 2, but we do not know the center of the interval. Based on the data we observe, we want to estimate, or predict, the center of the interval, which we call θ, and we want to see whether more data will help us in determining θ.

We can generate synthetic data to perform this experiment. We'll start by fixing a value of θ, which will determine the interval from which we generate our data. In Python, let's first import the numpy package and generate a true θ parameter by picking a number randomly from 0 to 10:

    import numpy as np
    theta = np.random.uniform(0.0, 10.0)

Note in the above code that the np.random.uniform() function takes in two arguments, the lower and upper limits of an interval, and outputs a random number in that interval, chosen from a uniform distribution.

Now we have fixed the true model that generates data. The value of θ, or the center of the interval from which we will generate data, is determined, though we don't know what it is. We've decided that our interval will have a width of 2, which means the data will be uniformly distributed across the interval [θ − 1, θ + 1].

Next, we can generate synthetic data from our model using the np.random.uniform() function. For example, if we wish to generate 10 data points, we can set the number of data points n = 10 and write:

    n = 10
    x = np.random.uniform(theta - 1.0, theta + 1.0, n)

Here the additional third argument to np.random.uniform() specifies how many identically distributed outputs we wish the function to generate. The default (as in our previous lines of code) is 1. If we would like to take a look at the generated data, we can print it:

    print(x)

You will see that our data x are stored as an array and are randomly distributed over a certain range.

With the data generated, we can now start to infer what the value of θ is. Run the code given above and look at the data in the array x. Can you guess the value of θ based on the data in x? How good of a guess have we made? Since we have access to the true value of θ, we can use loss functions to tell us how close our predictions are to the truth. For example, a small absolute error |h* − θ| or a small squared error (h* − θ)^2 indicates that the estimator h* is close to the truth θ. If we use the squared loss, the best estimate of θ is the mean value

    h = \frac{1}{n} \sum_{i=1}^{n} x_i.

Luckily, the numpy package has a built-in function for that:

    h = np.mean(x)

We can then compare our estimate against the truth theta using the squared loss and check how good our estimate is:

    err = (h - theta)**2

Recall that the z**2 notation means taking z to the second power. Smaller values of err correspond to better predictions.

a) Now try this computation for n = 1, . . . , 100 using a for loop, and observe how the error changes. Use the pyplot package to plot the results. For example, if err were to take the values [1, 2, 4, 8, 16, 32, 64, 128] when n takes the values [1, 2, 3, 4, 5, 6, 7, 8], then we could make the plot using the following code:

    import matplotlib.pyplot as plt
    plt.plot([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 4, 8, 16, 32, 64, 128])
    plt.show()

What does the curve look like? If your plot is too noisy, make a few repeated runs of the algorithm, store the results, and average the err values you obtain across the runs. How do you think this error scales with the amount of data? Hint: try plotting 1.0/err.

b) Next, repeat this experiment for the absolute loss (for which the best estimate is the median value). What scaling do you observe?

c) For this (growing) data set, are our predictions better if we have more data? Explain your intuition for why this is the case.

Turn in your plots and your answers to the questions above, but do not turn in your code for this problem.