FIT3154 Assignment 1 Due Date: 12:00PM, Monday, 12/9/2022 Introduction There are a total of three questions worth 17 + 15 + 14 = 46 marks in this assignment. This assignment is worth a total of 20% of your final mark, subject to hurdles and any other matters (e.g., late penalties, special consideration, etc.) as specified in the FIT3154 Unit Guide or elsewhere in the FIT3154 Moodle site (including Faculty of I.T. and Monash University policies). Students are reminded of the Academic Integrity Awareness Training Tutorial Activity and, in par- ticular, of Monash University’s policies on academic integrity. In submitting this assignment, you acknowledge your awareness of Monash University’s policies on academic integrity and that work is done and submitted in accordance with these policies. Submission: No files are to be submitted via e-mail. Correct files are to be submitted to Moodle. Scans of handwritten answers are acceptable but theymust be clean and legible. You must ensure your submission contains answers to the questions in the order they appear in the assignment. Submission must occur before 12:00 PM Monday, 12th of September, and late submissions will incur penalties as per Faculty of I.T. policies. 1 Question 1 (17 marks) Background: The Gamma Distribution In this question we will be looking at the gamma distribution. This is a distribution frequently used in the analysis of non-negative real numbers. This is introduced in Lecture 6, but here we will be using a different (but equivalent) form parameterised instead in terms of the mean µ and shape φ with probability density p(y |µ, φ) = 1Γ(φ) ( φ µ )φ yφ−1 exp ( −φy µ ) . where Γ(·) is known as the “gamma function”. Figure 1 shows several example gamma distributions. The following facts about the gamma distribution will prove very helpful for answering the questions. If Y ∼ Ga(µ, φ) is a random variable following a gamma distribution with mean µ and shape φ then E [Y ] = µ, and V [Y ] = µ 2 φ so that µ controls the mean, and φ inversely scales the variance. Given a sample y = (y1, . . . , yn) the maximum likelihood estimate, and Fisher information, for µ are well known to be µˆ = y¯ and Jn(µ) = nφ µ2 (1) respectively, where y¯ = (1/n) ∑n i=1 yi is the sample mean. The estimates for φ are complicated, so initially we will proceed assuming φ is known; we will later relax this assumption. Bayesian Analysis of the Gamma Distribution with Known φ Let us now consider a Bayesian analysis of the gamma distribution with known φ using the hierarchy: yi |µ ∼ Ga(µ, φ), i = 1, . . . , n µ | a, b ∼ IG(a, b) where IG(a, b) denotes an inverse-gamma distribution with shape a and scale b1. A useful fact is that if Y ∼ IG(a, b) and a > 1 then E [Y ] = b a− 1 . One reason people use the inverse-gamma distribution as a prior for the mean of the gamma distribution is that it is flexible; Figure 2 demonstrates several possible different inverse-gamma distributions. It is also convenient, as the posterior distribution is then also an inverse gamma distribution of the form µ |y, φ, a, b ∼ IG (nφ+ a, nφy¯ + b) . The posterior mean of µ is then given by E [µ |y] ≡ µˆa,b = φny¯ + b φn+ (a− 1) . (2) which can be used as an estimate (best guess) of the unknown population µ. You should install the R package invgamma. This gives you access to the functions dinvgamma(), pinvgamma() and qinvgamma() which will be crucial in answering some questions. Important: for some reason, this package refers to the scale parameter of an inverse gamma distribution as a rate parameter. 1https://en.wikipedia.org/wiki/Inverse-gamma_distribution 2 Consider using (2) as an estimate of the unknown population µ from a sample of data y = (y1, . . . , yn) coming from a gamma distribution with mean µ and shape φ. The bias and variance of this estimator is biasµ(µˆa,b) = (1− a)µ+ b nφ+ a− 1 , Vµ [µˆa,b] = nµ2φ (nφ+ a− 1)2 (3) Please answer the following questions. Provide appropriate working/mathematical justification. 1. If φ = 1, how can we interpret the hyperparameters a and b? [2 marks] 2. Describe the dependency of the variance of the estimator on the population values of µ and φ. [1 mark] 3. Is the estimator (2) asymptotically efficient as n→∞? [2 marks] It is common to use so called “normalized” squared-error for many problems in which the mean and variance are linked. For the gamma distribution the normalised squared-error is L(µ, µˆ) = (µˆ− µ) 2 (µ2/φ) . and the normalized squared-error risk is therefore R(µ, µˆ) = ( φ µ2 ) E [ (µˆ− µ)2] . Please answer the following questions regarding this loss function. 4. Why do you think the normalized squared-error loss might be preferable to the usual (unnor- malized) squared-error loss? [1 mark] 5. Making use of (3), what is the normalized squared-error risk for the estimator when b = 0? Comment on the risk. [2 marks] 6. Prove that when a = 1, the choice b = 0 dominates the choice b > 0. [1 mark] 7. Plot the normalized squared-error risk (expected normalized squared-error) for n = 5, φ = 2 and µ ∈ (0.01, 10) for (a = 1, b = 0) (i.e., maximum likelihood), (a = 1/2, b = 1) and (a = 2, b = 1/2). You may plot all curves on the one plot. For readability, use a logarithmic scale for the y-axis and set the limits for the y-axis from (0.1, 2). Describe the key features of the risk curves. Over the range of µ values considered in our plot do any of the estimators appear to be inadmissable (that is, not admissable)? [3 marks] 8. Given an a > 1 and a b > 0, for which population value of µ does the estimator (2) attain the smallest normalized squared-error risk? [2 marks] Sometimes the alternative log-mean parameterisation v = logµ is used for the gamma distribution. This is in some (statistical) ways a more natural parameterisation than the mean µ. Under this parameterisation the probability density is p(y | v, φ) = 1Γ(φ)φ φyφ−1 exp (−φe−vy − φv) where ex ≡ exp(x). Please answer the following questions. 7. Derive the Fisher information Jn(v) for the log-scale parameter v. (hint: remember the observa- tions are independently and identically distributed) [2 marks] 8. Provide an argument as to why v = logµ might be more “natural” by comparing the Fisher informations for v, Jn(v), to the Fisher information for µ, Jn(µ) given by (1). [1 mark] 3 Question 2 (15 marks) In this question we will use the gamma distribution to analyse some topical and relevant data: case numbers of people in Victoria, Australia infected with the novel coronavirus (Covid-19). The data we will use was obtained from the the Victorian government public health website. In particular, we will analyse the daily case numbers for the month of July, 2022 from covid.daily.cases.07.2022.csv. You will also need to download the file gamma.fit.R. This contains functions that will assist you in fitting a gamma distribution to the data; in particular it contains code for maximum likelihood estimation and Bayesian sampling. Let us first analyse the data using maximum likelihood and the asymptotic distribution of maximum likelihood estimates. Please answer the following questions. Provide working/justifications/code as required to support your answers. 1. Fit a gamma distribution to the data using maximum likelihood via the function gamma.ml(y). What is the estimated mean number of daily cases µ for the month of July 2022? Provide an approximate 95% confidence interval for this estimate using the asymptotic distribution of the maximum likelihood estimate. [3 marks] 2. The average daily case numbers in the previous month was roughly 7,200. Using the asymptotic distribution of the ML estimate of µ, test the hypothesis that the average number of daily cases in July 2022 is the same as in June 2022. Clearly specify the hypothesis you are testing and discuss the resulting p-value. [3 marks] Let us perform a Bayesian analysis. We will use the function gamma.sampling(y,a,b,n.samples) to generate samples from the posterior distribution for the following hierarchy: yj |µ, φ ∼ Ga(µ, φ) µ | a, b ∼ IG(a, b) φ ∼ C+(0, 1) where y is the data, a and b are the hyperparameters for the inverse-gamma prior distribution for µ, and n.samples is the number of samples to generate. The hierarchy uses a default weakly informative half-Cauchy prior for φ, so we do not need to specify any hyperparameters for this prior. Using the daily case numbers from the previous four and a half months, we find that a reasonable prior guess at the average daily case numbers is around 8,500, with a 95% prior probability of the average daily case numbers ranging anywhere from 5,000 to 13,000 cases. 3. Select appropriate values of a and b for our inverse gamma prior distribution to (approximately) encode the specified information (hint: use the pinvgamma() function). [2 marks] 4. Generate 100, 000 samples from the posterior distribution using your specified prior hyperparam- eters. Plot the prior and posterior distributions of µ (hint: use the dinvgamma() function and histogram, as appropriate). [2 marks] 5. Report on the posterior mean and 95% credible intervals for µ. How do these compare to our maximum likelihood estimates? (hint: use the quantile() function) [2 marks] 6. Do you think the Bayesian analysis supports the hypothesis that the average daily case numbers for July, 2022 are the same as June, 2022? [1 mark] 7. It is important for authorities to have an idea on the likely range of daily case numbers. Provide an estimate of the 97.5-th percentile of daily case numbers using your Bayesian samples. Provide a 95% credible interval for this estimate (hint: use the qgamma.mu() function provided). [2 marks] 4 Question 3 (14 marks) This question will require you to analyse a regression dataset using Bayesian lasso regression. In particular, you will be looking at predicting the fuel efficiency of a car (in kilometers per litre) based on characteristics of the car and its engine. This is clearly an important and useful problem. The dataset fuel.2022.csv contains n = 1, 000 observations on p = 9 predictors obtained from actual fuel efficiency tables for car models available for sale during the years 2017 through to 2020. The data dictionary for this dataset is given in Table 1. To analyse this data in a Bayesian fashion you will use the bayesreg package in R. 1. Fit a regression model to the fuel efficiency data using the least-squares method in R (i.e., using lm()). From this output, what variables do you believe are potentially associated with increased fuel efficiency, and why? [2 marks] 2. Analyse the fuel efficiency data using the Bayesian lasso; use at least 10, 000 samples from the posterior. Compare posterior means, standard deviations and t-statistics to those obtained using least-squares. How do they compare, and if you believe they are different, why do you think they are different? [2 marks] 3. Write down your best estimate of the regression equation relating fuel efficiency to the explana- tory variables. [2 marks] 4. If we wanted to advise a car manufacturer regarding the choice of an engine to have likely good fuel efficiency, what does this model suggest we should look for? [2 marks] 5. From your Bayesian analysis, which variables seem unimportant/unlikely to be associated with fuel efficiency? Justify your answer [2 marks] 6. Imagine that a friend proposes purchasing a particular model of car with the properties given in Table 2. Use your Bayesian model to: (a) Predict the mean fuel efficiency for this car. [1 mark] (b) Provide a histogram of the posterior distribution of the mean fuel efficiency for this car, and a 95% credible interval on the plausible range of mean fuel efficiency for this car. [1 mark] (hint: use the rv$beta matrix in the return value from bayesreg() to access the samples of the coefficients) 7. An engineer suggests they believe there is likely an interaction effect between engine displacement and the number of cylinders. Using the Bayesian lasso, and statistics produced by the Bayesian lasso code, assess whether you think this is the case, and what effect it has on the model? [2 marks] 5 Variable name Description Values Model.Year Year of sale 2017− 2020 Eng.Displacement Engine Displacement (litres, l) 0.9− 8.4 No.Cylinders Number of Cylinders 3− 16 Aspiration Engine Aspiration (Oxygen intake) N: Naturally∗ OT: Other SC: Supercharged TC: Turbocharged TS: Turbo+supercharged No.Gears Number of Gears 1− 10 Lockup.Torque.Converter Lockup torque converter present? N∗ and Y Drive.Sys Drive System 4∗: 4-wheel drive A:All-wheel F:Front-wheel P:Part-time 4-wheel R:Rear-wheel Max.Ethanol Maximum % of Ethanol allowed 10− 85 Fuel.Type Type of Fuel G∗: Regular Unleaded GM: Mid-grade Unleaded Recommended GP: Premium Unleaded Recommended GPR: Premium Unleaded Required Comb.FE Fuel Efficiency (km/l) 4.974− 26.224 Table 1: Fuel efficiency data dictionary. The ∗ denotes the reference category for each categorical variable. Variable Model.Year Eng.Displacement No.Cylinders Aspiration No.Gears Lockup.Torque.Converter Drive.Sys Max.Ethanol Fuel.Type Value 2018 3.5 6 N 8 N R 15 GPR Table 2: Engine Properties. some text 6 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.2 0.4 0.6 0.8 1 1.2 1.4 Figure 1: Example probability density functions for the gamma distribution with mean µ and shape φ. 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.2 0.4 0.6 0.8 1 1.2 Figure 2: Example probability density functions for the inverse gamma distribution with shape a and scale b. 7
欢迎咨询51作业君