April 2021 Data Mining II Final Homework Set

Directions: Complete FOUR exercises.

1. Consider the cad1 data set in the package gRbase. These observations are from individuals in the Danish Heart Clinic.

   (a) Learn a Bayesian Network using structural learning and prior knowledge obtained through the definitions of the variables in the help files. You do not have to use all of the variables. Make sure to detail your network construction process.

   (b) Construct the above network in R, and infer the Conditional Probability Tables using the cad1 data. (Hint: extractCPT or cptable from the gRain package may be used.) Identify any d-separations in the graph.

   (c) Suppose it is known that a new observation is female with Hypercholesterolemia (high cholesterol). Absorb this evidence into the graph, and revise the probabilities. How does the probability of heart failure and coronary artery disease (CAD) change after this information is taken into account?

   (d) Simulate a new data set with 100 observations either conditional upon this new information in part (c) using the original parameterization. Present this new data in a table. Estimate the probability of Smoker and CAD given the other variables in your model. (Hint: you may try simulate.grain from the gRain package; you may use predict.grain as well.)

2. The sinking of the Titanic is a famous event in history. The Titanic data was collected by the British Board of Trade to investigate the sinking. Many well-known facts, from the proportions of first-class passengers to the women-and-children-first policy, and the fact that that policy was not entirely successful in saving the women and children in the third class, are reflected in the survival rates for various classes of passenger. You have been petitioned to investigate this data. Analyze this data with tool(s) that we learned in class. Summarize your findings for the British Board of Trade. In your report, please touch on the following questions.

   Is there evidence that women and children were evacuated first?
   What characteristics/demographics are more likely in surviving passengers?
   What characteristics/demographics are more likely in passengers that perished?
   How do your results support the popular movie Titanic (1997)? For example, what is the probability that Rose (1st class, adult, female) would survive and Jack (3rd class, adult, male) would not survive?

3. Specify the structure of a Bayesian Network that contains four nodes {W,X,Y,Z} and satisfies the following set of independencies.

   WX W Z|X ZW|Y W Y X Y W X|Z X Z|W,Y

4. Data released from the US Department of Commerce, Bureau of the Census is available in R.

   > data(state)
   > ?state

   Build a Gaussian Graphical Model using the Graphical Lasso for the 8 predictors (Population, Income, Illiteracy, Life Exp, Murder, HS Grad, Frost, Area) using a range of penalties. What do you find for different penalties, and how does it complement (and/or contradict) a model fit with SOM?

5. Write a function to implement single linkage, average linkage, and complete linkage agglomerative clustering. Write your function to be as general as possible, and comment your code. The function needs to be compatible with any dissimilarity. Demo your code with the iris data for each linkage and plot the results.

6. From Probabilistic Graphical Models textbook (Koller).
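
As a possible starting point for exercise 5, the sketch below shows one naive way to organize agglomerative clustering in base R. It is not a reference solution. The function name agglo_cluster, its arguments, and the choice to return a list laid out like an hclust object (so that plot() can draw it) are all assumptions made for illustration. It recomputes every between-cluster distance from the original dissimilarity matrix at each step, which is simple but slow for large data.

```r
# Hypothetical helper (not from the assignment): naive agglomerative clustering.
# d       : a "dist" object or a symmetric dissimilarity matrix (any dissimilarity)
# linkage : "single", "average", or "complete"
agglo_cluster <- function(d, linkage = c("single", "average", "complete")) {
  linkage <- match.arg(linkage)
  D <- as.matrix(d)
  n <- nrow(D)
  stopifnot(n >= 2, ncol(D) == n)

  # Linkage = summary of all pairwise dissimilarities between two clusters.
  link_fun <- switch(linkage, single = min, average = mean, complete = max)

  clusters  <- as.list(seq_len(n))  # every observation starts as its own cluster
  ids       <- -seq_len(n)          # hclust convention: singletons are negative
  merge_mat <- matrix(0L, n - 1L, 2L)
  height    <- numeric(n - 1L)

  for (step in seq_len(n - 1L)) {
    k <- length(clusters)
    best <- c(1L, 2L)
    best_d <- Inf
    # Search all cluster pairs for the smallest linkage dissimilarity.
    for (i in seq_len(k - 1L)) {
      for (j in (i + 1L):k) {
        dij <- link_fun(D[clusters[[i]], clusters[[j]]])
        if (dij < best_d) {
          best_d <- dij
          best <- c(i, j)
        }
      }
    }
    i <- best[1]
    j <- best[2]
    merge_mat[step, ] <- c(ids[i], ids[j])  # record which clusters were joined
    height[step] <- best_d                  # and at what dissimilarity
    clusters[[i]] <- c(clusters[[i]], clusters[[j]])
    clusters[[j]] <- NULL                   # drop the absorbed cluster
    ids[i] <- step                          # merged cluster is named by its step
    ids <- ids[-j]
  }

  # Leaf order for plotting: expand the final merge recursively.
  expand <- function(id) {
    if (id < 0) -id else c(expand(merge_mat[id, 1]), expand(merge_mat[id, 2]))
  }

  out <- list(merge = merge_mat, height = height, order = expand(n - 1L),
              labels = rownames(D), method = linkage)
  class(out) <- "hclust"  # assumption: mimic hclust so plot() can be reused
  out
}

# Demo on iris with Euclidean distance; any other dissimilarity could be used.
d_iris <- dist(iris[, 1:4])
linkages <- c("single", "average", "complete")
fits <- lapply(linkages, function(m) agglo_cluster(d_iris, m))

op <- par(mfrow = c(1, 3))
for (idx in seq_along(fits)) {
  plot(fits[[idx]], labels = FALSE, main = paste(linkages[idx], "linkage"))
}
par(op)
```

Because only the dissimilarity matrix is used, swapping in another measure, for example dist(iris[, 1:4], method = "manhattan"), requires no change to the function. On data without tied distances, the merge heights can be checked against stats::hclust with the matching method.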