
STAT40750 Statistical Machine Learning


University College Dublin An Coláiste Ollscoile, Baile Átha Cliath

SPRING TERM ASSESSMENT 2020/2021
END OF TERM ASSIGNMENT

STAT40750 Statistical Machine Learning (Online)

Dr. Michael Fop

Deadline for submission: 16th May 2021 at 23:00

Instructions for students - Please read carefully

• This assignment contributes towards 60% of the final grade. Total marks: 100.
• Full marks will be awarded for complete and correct answers to all questions.
• For full marks, you must show clearly all steps and computations in your answers. You must show complete workings; correct answers alone will not achieve full marks.
• All written sheets must be scanned and uploaded to Brightspace as a single document in PDF format.
• Multiple submissions before the deadline are allowed and only the latest one will be considered for marking.
• You may refer to your notes or online references when answering these questions, or use software for numerical calculations. However, you must not communicate with anyone else, and the work in the assessment must be your own original production and must be carried out independently.
• Candidates are required to read, complete and upload the School of Mathematics and Statistics Honour Code form, which is available from Brightspace.
• Submission after the deadline will incur a penalty as per UCD rules (see “Module details” document).
• Plagiarism is strictly prohibited (see “Module details” document and “Information materials” tab).

UCD 2020/2021 1 of 8

1. A coffee shop wants to evaluate the purchasing behavior of its customers, investigating which food items are most commonly bought and consumed together. Over a certain period of time, the team working in the shop recorded data on 7440 transactions involving these 5 main food items of interest: Bread, Cake, Coffee, Sandwich, Soup. Summary tables of the frequencies of occurrence of these food items are reported below.

Item        Count
Coffee      4528
Bread       3097
Cake         983
Sandwich     680
Soup         326

Pair                 Count
Coffee, Bread          852
Coffee, Cake           518
Coffee, Sandwich       362
Coffee, Soup           150
Bread, Cake            221
Bread, Sandwich        161
Bread, Soup             62
Cake, Sandwich          65
Cake, Soup              42
Sandwich, Soup          52

Triplet                    Count
Bread, Cake, Coffee           95
Bread, Cake, Sandwich         16
Bread, Cake, Soup             10
Bread, Coffee, Sandwich       68
Bread, Coffee, Soup           21
Bread, Sandwich, Soup         10
Cake, Coffee, Sandwich        44
Cake, Coffee, Soup            23
Cake, Sandwich, Soup          11
Coffee, Sandwich, Soup        34

(a) Given the information in the tables above, what is the smallest maximum probability that the items {Cake, Coffee, Sandwich, Soup} will occur in the same transaction? Discuss. [2]

(b) Calculate the support and confidence measures of the following rules:
    1. Cake ⇒ Coffee
    2. Soup ⇒ Sandwich
    3. (Cake, Sandwich) ⇒ Coffee
    4. (Coffee, Soup) ⇒ Sandwich
    [3]

(c) For the same rules above in (b), calculate the lift measure and provide an interpretation. [5]

(d) The Apriori algorithm is employed to screen the rules. Using a support threshold of 0.001 and a confidence threshold of 0.2, will all the rules above in (b) be included in the final set of selected rules? Discuss. [2]

(e) Compute the standardized lift for the final set of rules left after the application of the Apriori algorithm in (d). Interpret the values found and compare them to the lift values calculated in (c). What can you conclude? [8]

[Total 20]

2. An Italian bank collected data on the credit worthiness of its customers with the purpose of building a model for predicting the credit worthiness of future customers. The dataset consists of 1532 customers and for each customer the following variables were recorded.

Variable    Description
Class       The credit worthiness of the customer – “Good” or “Bad”
Age         The customer’s age in years
Duration    Loan duration in months
Housing     The customer’s housing status – “Shared”, “Rented”, “Owned”

A logistic regression model was fitted to the data, yielding the following output.

    Call:
    glm(formula = Class ~ Duration + Age + Housing, family = "binomial",
        data = ItalianCredit)

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  0.407613   0.458842   0.888   0.3744
    Duration    -0.101058   0.007976 -12.670  < 2e-16 ***
    Age          0.037592   0.006487   5.795 6.84e-09 ***
    HousingRent  0.018245   0.156605   0.117   0.9073
    HousingOwn   0.399397   0.156681   2.549   0.0108 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(a) Explain the motivation behind using the logistic function in the formulation of the model. What is the logistic regression equation corresponding to this fitted model? [3]

(b) How does a person’s housing status affect the odds of having a good credit score? Comment. [5]

(c) The following three customers apply for a bank loan.

    Variable    Customer 1   Customer 2   Customer 3
    Duration    20           10           10
    Age         31           45           28
    Housing     “Rented”     “Owned”      “Shared”

    (i) What is the probability that each of these customers has a good credit rating?
    (ii) Using the probabilities calculated in (i) and assuming a classification threshold of 0.5, would you classify these customers as having a good or bad credit rating?
    [6]

(d) A range of threshold values τ is considered to assess the classification performance of the model on the data. For each threshold value, classification accuracy, classification error, sensitivity and specificity measures are reported in the table below.

    τ     Accuracy   Error   Sens.   Spec.
    0.2   0.518      0.482   0.941   0.238
    0.3   0.600      0.400   0.827   0.450
    0.4   0.668      0.332   0.686   0.656
    0.5   0.684      0.316   0.507   0.801
    0.6   0.674      0.326   0.304   0.920
    0.7   0.623      0.377   0.093   0.975
    0.8   0.606      0.394   0.013   0.999

    What is the optimal classification threshold?
    What is the accuracy value corresponding to this threshold? [6]

[Total 20]

3. In your own words, provide a concise and pertinent discussion for each of the following points. Answers should not exceed 20 lines. Long answers and going off on a tangent will not be awarded full marks.

(a) What are the reasons behind using the standardized lift measure in association rule learning? [5]

(b) Describe the advantages of the kernel trick and how it is employed to implement support vector machines. [5]

[Total 10]

4. Provide your answer and a concise explanation for each of the following questions.

(a) A k-means algorithm with K = 2 is employed to cluster the following 5 × 2 data matrix:

    X = [ 1   3
          2   4
          2  −1
          2   3
          1  −2 ]

    The algorithm is initialized from the set of centroids μ1 = (1, 1) and μ2 = (3, 2). After one iteration of the algorithm, is the k-means objective function value equal to 25? Remember to justify your answer. [5]

(b) A marketing researcher uses the k-means algorithm to cluster data concerning a sample of customers of a large apparel retailer. The researcher does not have any prior information about the number of clusters K. The clustering allocation for different values of K is compared to an external classification of the subjects into different types of purchasing behaviors. The table below reports the adjusted Rand index (ARI) values for each value of K considered:

    K   ARI
    2   0.43
    3   0.51
    4   0.63
    5   0.59
    6   0.39

    Do you think that K = 4 corresponds to the optimal number of clusters for these data? Justify your answer. [3]

(c) A financial institution implements a classification tree to classify loan applicants according to credit worthiness, “Good” or “Bad”, with the main intent of detecting applicants with a “Good” credit rating outlook. An excerpt of the output of a classification tree implemented using the rpart function on a sample of data is displayed below.

    n= 275

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

     1) root 275 95 Good (0.3454545 0.6545455)
       2) CheckingAccountStatus.none< 0.5 161 77 Good (0.4782609 0.5217391)
         4) Amount>=8760.5 17 0 Bad (1.0000000 0.0000000) *
         5) Amount< 8760.5 144 60 Good (0.4166667 0.5833333)
          10) CheckingAccountStatus.gt.200< 0.5 120 57 Good (0.4750000 0.5250000)
            20) Duration>=22.5 48 18 Bad (0.6250000 0.3750000)
              40) NumberPeopleMaintenance< 1.5 37 10 Bad (0.7297297 0.2702703) *
              41) NumberPeopleMaintenance>=1.5 11 3 Good (0.2727273 0.7272727) *
            21) Duration< 22.5 72 27 Good (0.3750000 0.6250000)
              42) Amount< 1282 25 11 Bad (0.5600000 0.4400000)
                84) Property.RealEstate< 0.5 17 5 Bad (0.7058824 0.2941176) *
                85) Property.RealEstate>=0.5 8 2 Good (0.2500000 0.7500000) *
              43) Amount>=1282 47 13 Good (0.2765957 0.7234043) *
          11) CheckingAccountStatus.gt.200>=0.5 24 3 Good (0.1250000 0.8750000) *
       3) CheckingAccountStatus.none>=0.5 114 18 Good (0.1578947 0.8421053) *

    Is the sensitivity measure of the classification tree on these data equal to 0.8088? If not, what is the sensitivity measure? Justify your answer. [5]

(d) Given a collection of competing classifiers and some data, does the use of a validation set to select the best one in the collection guarantee that the selected one will always provide the best predictive performance on future unseen observations? Remember to justify your answer. [4]

(e) Two hundred labeled samples are used to train two binary classifiers M1 and M2. For classifier M1, the dataset is divided into training and validation sets of 100 samples each and the classifier is trained on the training set. The performance of M1 on this validation set provides an accuracy of 80%. For classifier M2, the dataset is divided into a training set of 150 samples and a validation set of 50 samples, and the classifier is trained on the training set.
    The performance of M2 on the corresponding validation set provides an accuracy of 90%. Is classifier M2 to be preferred to classifier M1? Justify your answer. [3]

[Total 20]

5. Data analysis task

Data description

In this data analysis task, you will analyze data concerning a corpus of movie reviews extracted from the review aggregator website Rotten Tomatoes. For each review, a number of numerical features have been constructed, which are related to frequencies of certain characters, frequencies of certain words, sentiment scores, and coordinates in a latent embedding representation of the text. Each review has an associated class label, denoting whether the review is negative or positive. The file data_rotten_tomatoes_review.csv includes the data: column phrase contains the raw review text, column class indicates the class label (negative or positive), while the remaining columns contain all the numerical features.

Task

Using the available review-specific numerical features, the task is to build a classifier to classify the sentiment of a movie review, negative or positive.

(a) Use at least 3 of the supervised classification methods described in this course to predict the sentiment class label of a review on the basis of the numerical features.

(b) Employ an appropriate framework to compare and tune the different methods considered, evaluating and discussing their relative merits.

(c) Evaluate the predictive classification performance of the best model you find. Provide a discussion about the ability of the selected model to correctly detect negative and positive reviews.

Guidelines:

• Write a short report and submit it along with the main assignment submission, i.e. the assignment solutions submission for all 5 exercises must be a single PDF file.
• The report for this data analysis task should be no longer than 4 pages (approximately), code excluded.
• Include the R code used for analysis in the submission.
  You can include the code as a separate R script file. Alternatively, the report can be produced using R Markdown, with the code included in the main text or as an appendix. The code must be working and the analysis must be reproducible in all parts.

[Total 30]

-o0o-