STAT 577: Final Exam
Due on May 18, 2022
Do not discuss this work with anyone, except with your instructor.
1. Predictive Model for Electricity Usage

The Electricity.csv datafile contains the daily electricity usage (Usage) of about 63 different households in the Knoxville area over the span of about two years. Of particular interest is developing a predictive model for Usage. With such a model, the utility company can better plan for total and peak demands. It is known that the most important factor driving electricity usage is temperature (these households have electric heat and air-conditioning). The file contains the following information:

• ID - household ID
• Usage - daily electricity usage of a household
• PayPlan - Yes or No depending on whether the household has signed up for a payment plan option
• DaylightHours - the number of hours of daylight for that day
• coolinghours - technical term that describes how hard an AC unit has to work (if it is 78 degrees at 3pm, then that hour's contribution to coolinghours is 78-65=13; this is summed over all hours of the day, ignoring hours that are below 65)
• heatinghours - technical term that describes how hard an electric heating unit has to work (if it is 32 degrees at 3pm, then that hour's contribution to heatinghours is 65-32=33; this is summed over all hours of the day, ignoring hours that are above 65)
• HoursAbove65 - total number of hours for that day where the temperature exceeded 65
• HoursBelow65 - total number of hours for that day where the temperature was below 65
• low - low temperature of the day
• high - high temperature of the day
• median - median value of the 24 hourly temperatures of the day
• mean - mean value of the 24 hourly temperatures of the day
• q1 - 25th percentile of the 24 hourly temperatures of the day
• q3 - 75th percentile of the 24 hourly temperatures of the day
• day - day of week
• YearMonth - year and month of that day

ID, day, and YearMonth are not to be used as predictors. You can do something like ELECTRICITY$ID <- NULL to null out these variables.

The "smoothed scatterplot" below shows the relationship between Usage and average daily temperature and adds a curve describing the trend. It appears that usage decreases as the daily mean temperature increases toward 65 degrees Fahrenheit, and tends to increase as the mean temperature rises beyond 65 degrees.

[Figure: smoothed scatterplot of ELECTRICITY$Usage (0 to 150) versus ELECTRICITY$mean (20 to 80), with a trend curve]

Make (but don't include) a histogram of Usage. It's a little skewed, but let's model the values as-is. That is, do not transform it (the response variable). Split the data into 5000 training rows with the remainder being the test set (use set.seed(577) on the same line as the required sample command; see the sketch after part (c) below).

(a) Fit a vanilla linear regression model using the training data. Report the estimated generalization RMSE and the RMSE on the holdout sample. Report the variable importance plot, and comment on which predictors appear most important for predicting electricity Usage.

(b) Fit a regularized multiple linear regression model using the training data. Audition alpha values along the sequence 0, 0.1, 0.2, ..., 0.9, 1. Report the estimated generalization RMSE and the RMSE on the holdout sample. Report the variable importance plot, and comment on which predictors appear most important for predicting electricity Usage.

(c) Fit a KNN model using your training data. Audition k values of 1, 15, 40, 80, and 120. Report the estimated generalization RMSE and the RMSE on the holdout sample. Report the variable importance plot, and comment on which predictors appear most important for predicting electricity Usage.
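If you are using the caret workflow from class, the required split and the alpha audition in part (b) can look something like the sketch below. This is only one possible approach; the object names TRAIN, HOLDOUT, fitControl, and GLMNET, the 5-fold cross-validation choice, and the lambda grid are illustrative, not requirements.

    # Split: set.seed(577) sits on the same line as sample(), as required
    set.seed(577); train.rows <- sample(1:nrow(ELECTRICITY), 5000)
    TRAIN <- ELECTRICITY[train.rows, ]      # 5000 training rows
    HOLDOUT <- ELECTRICITY[-train.rows, ]   # remainder is the test set

    library(caret)
    fitControl <- trainControl(method = "cv", number = 5)   # illustrative CV setup
    GLMNET <- train(Usage ~ ., data = TRAIN, method = "glmnet",
                    trControl = fitControl,
                    tuneGrid = expand.grid(alpha = seq(0, 1, by = 0.1),
                                           lambda = 10^seq(-4, 1, length.out = 25)))
    GLMNET$results                                          # estimated generalization RMSE
    postResample(predict(GLMNET, HOLDOUT), HOLDOUT$Usage)   # RMSE on the holdout sample
    plot(varImp(GLMNET))                                    # variable importance plot

The same split (and the same trainControl object) can be reused for the other models in this problem, changing only the method and tuneGrid arguments.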
(d) Train a regression tree model to predict Usage. Report the estimated generalization RMSE and the RMSE on the holdout sample. Report the variable importance plot, and comment on which predictors appear most important for predicting electricity Usage.

(e) Using the training data, train a random forest model to predict electricity Usage. Audition values of mtry of 1, 3, 8, and 12. Report the estimated generalization RMSE and the RMSE on the holdout sample. Report the variable importance plot, and comment on which predictors appear most important for predicting electricity Usage.

(f) Fit a gradient boosted tree model using the training data. Be sure to train this model appropriately. Report the estimated generalization RMSE and the RMSE on the holdout sample. Report the variable importance plot, and comment on which predictors appear most important for predicting electricity Usage. Additionally, using the plot command with a gbm object, report a plot that shows how the model is capturing the relationship between Usage and average temperature (the "mean" variable), i.e., do plot(myGBM,"main"), where myGBM is a gbm model object. Describe this plot.

(g) Fit a neural network model with one hidden layer using the training set. Audition the number of nodes in {1, ..., 6}. Report the estimated generalization RMSE and the RMSE on the holdout sample. Report the variable importance plot, and comment on which predictors appear most important for predicting electricity Usage.

(h) Is one model a compelling choice versus the others? Why or why not? Show evidence to support your claim.

2. Predictive Model for Making a Purchase

The PURCHASE.csv data contains a small part of a customer database from a bank. Of interest is the variable Purchase, which tells us if a customer did or did not make a purchase at a major chain retailer in the following 30 days. Predictor variables include Visits (number of visits to the store in the last 90 days), Spent (how much the customer has spent in the last 90 days), PercentClose (the percentage of purchases this customer makes in general that are within 5 miles of their home address), and Closeset and CloseStores (how close the nearest store in the chain is to the customer, and how many stores of that chain are within 5 miles of home, respectively).

[Figure: distributions of Visits and Spent]

Let's try to predict Purchase (Buy/No). Using set.seed(577), randomly split the data into 50% training and 50% test sets (see the sketch after part (c) below).

(a) What will the naive model classify everyone in the data as? What is the estimated generalization accuracy of this model? What is its accuracy on the holdout sample?

(b) Using the training data, train a logistic regression model to predict Purchase. Report the estimated generalization metrics, as well as the accuracy and AUC on the holdout sample. Also, report the variable importance plot, and comment on which predictors appear most important for predicting Purchase.

(c) Using the training data, train a regularized logistic regression model to predict Purchase. Audition alpha values along the sequence 0, 0.1, 0.2, ..., 0.9, 1. Report the estimated generalization metrics, as well as the accuracy and AUC on the holdout sample. Also, report the variable importance plot, and comment on which predictors appear most important for predicting Purchase.
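As with problem 1, the split and the logistic regression in part (b) can look something like the sketch below if you are using caret. The names TRAIN, HOLDOUT, ctrl, and LOGIT and the 5-fold CV choice are illustrative, and the use of the pROC package for the holdout AUC is just one common option, not a required one.

    PURCHASE$Purchase <- factor(PURCHASE$Purchase)   # ensure the response is a factor
    set.seed(577); train.rows <- sample(1:nrow(PURCHASE), floor(0.5 * nrow(PURCHASE)))
    TRAIN <- PURCHASE[train.rows, ]
    HOLDOUT <- PURCHASE[-train.rows, ]

    library(caret)
    ctrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE, summaryFunction = twoClassSummary)
    LOGIT <- train(Purchase ~ ., data = TRAIN, method = "glm",
                   trControl = ctrl, metric = "ROC")
    LOGIT$results                                               # estimated generalization ROC (AUC)
    confusionMatrix(predict(LOGIT, HOLDOUT), HOLDOUT$Purchase)  # holdout accuracy
    library(pROC)
    roc(HOLDOUT$Purchase, predict(LOGIT, HOLDOUT, type = "prob")[, "Buy"])  # holdout AUC
    plot(varImp(LOGIT))                                         # variable importance plot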
(d) Using the training data, train a classification tree model to predict Purchase. Report the estimated generalization metrics, as well as the accuracy and AUC on the holdout sample. Also, report the variable importance plot, and comment on which predictors appear most important for predicting Purchase. Also, report a plot of the optimal tree model you obtain.

(e) Using the training data, train a random forest model to predict Purchase. Audition values of mtry of 1, 3, and 5 (the last being pure bagging). Report the estimated generalization metrics, as well as the accuracy and AUC on the holdout sample. Also, report the variable importance plot, and comment on which predictors appear most important for predicting Purchase.

(f) Using the training data, train a gradient boosted tree model to predict Purchase. Report the estimated generalization metrics, as well as the accuracy and AUC on the holdout sample. Also, report the variable importance plot, and comment on which predictors appear most important for predicting Purchase. In addition, provide plots of how the (optimal) model is representing the relationship between the probability of purchasing and the first and second most important predictors. (To get the plots, remember to convert the column we are predicting to numbers.)

[Figure: two example plots of the probability of Purchase (No/Buy) versus a predictor, each with a 0 to 1 vertical scale]

(g) Using the training data, train a support vector machine with a radial basis kernel to predict Purchase. Report the estimated generalization metrics, as well as the accuracy and AUC on the holdout sample. Also, report the variable importance plot, and comment on which predictors appear most important for predicting Purchase.

(h) Using the training data, train a neural network model with one hidden layer to predict Purchase. Audition the number of nodes in {1, ..., 6}. Report the estimated generalization metrics, as well as the accuracy and AUC on the holdout sample. Also, report the variable importance plot, and comment on which predictors appear most important for predicting Purchase.

(i) Is one model a compelling choice versus the others? Why or why not? Show evidence to support your claim.

3. Clustering Kroger Customers

One key concern of analytics practitioners is to ensure that the right products are being sold to the right customers at the right price and at the right time. The KROGER.csv datafile contains the spending habits of 2000 customers at Kroger over the span of two years (this is why Kroger has a loyalty card: so it can track purchases over time!). Specifically, it gives the total amount of money each customer has spent on 13 broad categories of items (alcohol, baby, meat, etc.). Kroger would like to do some segmentation and cluster analysis to discover whether there are "customer types". For example:

• House-spouses who take care of both the cooking and the grocery shopping?
• College students who buy cheap food and drinks that require minimal preparation?
• Parents of newborns?
• Casual shoppers buying products here and there?
• Health-conscious shoppers?
• Extreme couponers?

The segments above may indeed exist, and if so, Kroger could fine-tune marketing and advertising campaigns to meet the needs of each group. This is a much more effective strategy than using a single campaign designed for everyone. However, we need to let the data suggest what clusters exist instead of inventing nice-sounding groups of our own.

Prior to running any clustering analysis, it is often advised that we do careful data preprocessing. This includes transformation to symmetry, scaling, and converting any categorical variables to indicator variables. This data does not have any categorical variables. Therefore, the two preprocessing steps that need to be done are transformation to symmetry and scaling. Use the scale_dataframe function (find this function in the lec 22 Clustering Illustraions.R file) and carry out the following transformation and scaling:

KROGER.SCALED = scale_dataframe( log10(KROGER+0.01) )

Next you will conduct the clustering analysis using your KROGER.SCALED dataframe.

(a) Let's try k-means clustering. Produce the "elbow plot" exploring values of k from 1 to 20, taking iter.max=30 and nstart=25. Since there is no natural choice for the number of clusters here, let's choose k=3. Run kmeans on KROGER.SCALED with centers=3, iter.max=30, and nstart=25. Report the cluster centers and cluster frequencies of the three clusters. Describe each of the clusters by giving cluster-defining characteristics. (A sketch of these steps follows below.)
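A minimal sketch of part (a), assuming KROGER.SCALED has been built as above; the object names WSS and KM3 are illustrative choices:

    # Elbow plot: total within-cluster sum of squares for k = 1, ..., 20
    WSS <- rep(NA, 20)
    for (k in 1:20) {
      WSS[k] <- kmeans(KROGER.SCALED, centers = k, iter.max = 30, nstart = 25)$tot.withinss
    }
    plot(1:20, WSS, type = "b", xlab = "k (number of clusters)",
         ylab = "Total within-cluster SS")   # look for the "elbow"

    # Final fit with k = 3
    KM3 <- kmeans(KROGER.SCALED, centers = 3, iter.max = 30, nstart = 25)
    round(KM3$centers, 2)   # cluster centers (in scaled units)
    KM3$size                # cluster frequencies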
Kroger is interested in using the cluster identities to aid in identifying segments where customized offers could be designed (e.g., people who cook, people with pets, people with babies, etc.). For targeted advertising, it probably makes more sense to cluster on the fraction of the total money spent by the customer on each of the categories (instead of the raw amount). If we find a segment that spends a much larger fraction of their shopping budget on baby items, we can target them with baby-specific promotions, etc.

Copy KROGER (whose contents shouldn't have been modified since the data was read in) into a data frame called FRACTION. Then, write a for loop that defines the values in row i of FRACTION to be the fractional amounts of the values in the ith row of KROGER. For example, if x is a vector of the 13 dollar amounts, then x/sum(x) would be a vector giving these 13 fractional amounts. Verify that the sum of each row of FRACTION is 1 (you can do this by running summary(apply(FRACTION,1,sum)), which translated into English means "summarize the row totals of each row of the FRACTION dataframe"), then NULL out the OTHER column from FRACTION (one of the 13 columns is now redundant since the values in a row add to 1, so we might as well get rid of the least interesting one). Then create new scaled data by doing

FRACTION.SCALED = scale_dataframe( log10(FRACTION+0.01) )

(b) Run a hierarchical clustering, this time using hclust with arguments dist(FRACTION.SCALED) and method="ward.D2". Provide a plot of the dendrogram. Create FRACTION.SCALED.WITH.ID as a copy of FRACTION.SCALED, and add columns k3, k4, and k5 to it which contain the cluster identities (found from cutree) when 3, 4, or 5 clusters are found. Using aggregate, find the average value of each column in FRACTION.SCALED.WITH.ID broken down by k3, again broken down by k4, and again broken down by k5 (e.g., round(aggregate(.~k3,data=FRACTION.SCALED.WITH.ID,FUN=mean),2), etc.). Characterize each of the 5 clusters with a short, meaningful description (e.g., fast-food junkies who spend most of their money on snacks and prepared food). (A code sketch of these steps appears below.)
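For reference, a sketch of the FRACTION construction and the part (b) steps described above, assuming KROGER is unmodified and scale_dataframe has been sourced from the lecture file; the object name HC is an illustrative choice:

    # Row-by-row conversion of dollar amounts to fractions of each row total
    FRACTION <- KROGER
    for (i in 1:nrow(KROGER)) {
      FRACTION[i, ] <- KROGER[i, ] / sum(KROGER[i, ])
    }
    summary(apply(FRACTION, 1, sum))   # every row total should equal 1
    FRACTION$OTHER <- NULL             # drop the redundant column
    FRACTION.SCALED <- scale_dataframe( log10(FRACTION + 0.01) )

    # Hierarchical clustering and cluster identities for 3, 4, and 5 clusters
    HC <- hclust(dist(FRACTION.SCALED), method = "ward.D2")
    plot(HC)                           # dendrogram
    FRACTION.SCALED.WITH.ID <- FRACTION.SCALED
    FRACTION.SCALED.WITH.ID$k3 <- cutree(HC, k = 3)
    FRACTION.SCALED.WITH.ID$k4 <- cutree(HC, k = 4)
    FRACTION.SCALED.WITH.ID$k5 <- cutree(HC, k = 5)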