ECE2191 Probability Models in Engineering
Assignment 1
Second semester 2020
Dr. Faezeh Marzbanrad

1 About the assignment

In this assignment, you will use your knowledge of probability concepts to extract information from a given data set. The aim is to diagnose a heart condition in a group of patients. This assignment accounts for 15 percent of your total mark, so please pay attention to the following notes:

• Some tasks of the assignment must be completed using Matlab; the rest can be completed either manually or using Matlab.
• You will need to submit your code and a PDF file of your report, which contains your answers to the different tasks of the assignment.
• Any form of plagiarism must be avoided. Your code will be checked by the MOSS software developed by Stanford University to find any possible similarities.
• Note that the assignment is to be completed individually.

2 Problem description

Atrial fibrillation (AF) is an abnormal heart rhythm (arrhythmia) characterized by the rapid and irregular beating of the atrial chambers of the heart. It often begins as short periods of abnormal beating, which become longer or continuous over time. In order to diagnose AF, we analyze the ECG signal.

Figure 1: ECG waveform for a normal cardiac cycle.

Figure 2: ECG of atrial fibrillation (top) and normal sinus rhythm (bottom). The purple arrow indicates a P wave, which is lost in atrial fibrillation.

An ECG is a time-varying physiological signal which reflects the ionic current flow causing the cardiac (heart) fibers to contract and relax subsequently. It is obtained by recording the potential difference between two electrodes placed on the skin. The ECG represents the successive atrial depolarization and repolarization, as well as the ventricular depolarization and repolarization, occurring at every normal cycle of the heartbeat. These events manifest as the peaks and troughs of the ECG waveform, namely P, Q, R, S, and T, shown in Figure 1.
Note that the R-R interval refers to the time between successive R-peaks. Information about the R-R intervals and the presence or absence of P-waves helps us diagnose AF more effectively: when a person has AF, the typical characteristics of the ECG are the absence of P-waves and irregular R-R intervals, due to irregular conduction of impulses to the ventricles. At very fast heart rates, atrial fibrillation may look more regular, which may make it more difficult to separate from other conditions. Fig. 2 shows the ECGs of an AF rhythm and a normal rhythm; in the case of AF, the R-R intervals are irregular and the P-waves are almost gone.

3 Data

We have collected relevant information on 1613 patients in the file "Assignment_Data.mat". More specifically, we have extracted the standard deviation (STD) of the R-R intervals for each ECG recording. In addition, for each patient, we have checked whether the P-wave is present or not. For simplicity, from now on, we refer to the STD of the R-R intervals as SDRR.

A screenshot of how the data looks is shown in Fig. 3. The first column shows the ID of the patient, the second column indicates whether a P-wave has been detected in the ECG signal, the third column shows the SDRR, and the fourth column shows whether the patient has been clinically diagnosed with AF. In the second column, a 1 indicates the presence of a P-wave and a 0 indicates its absence. In the fourth column, a 1 indicates a positive diagnosis of AF and a 0 indicates a negative diagnosis (normal). As an example, for the first patient, the SDRR is 0.0726 and a P-wave has been detected; the patient has not been diagnosed with AF.

Keep in mind that the values of SDRR and P-wave can be affected by noise. For example, the P-wave might have been missed in a normal case just because of noise, not because of an abnormal ECG.
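The four-column layout described above can be mirrored in a small toy table. This is only a sketch: the table name Toy, the column names Pwave and AF, and every value other than the first row (which follows the example in the text) are assumptions made for illustration. The real column names should be read from the loaded table.

```matlab
% Toy table with the same four-column layout as described in the text.
% Only the first row follows the example in the text; all other values are made up.
ID    = (1:4)';
Pwave = [1; 0; 1; 0];                    % assumed column name: 1 = P-wave present, 0 = absent
SDRR  = [0.0726; 0.2100; 0.0500; 0.0600];
AF    = [0; 1; 0; 0];                    % assumed column name: 1 = AF, 0 = normal
Toy   = table(ID, Pwave, SDRR, AF);

% Row 4 mimics the noise case: no P-wave detected although the subject is normal.
noisyNormal = Toy(Toy.Pwave == 0 & Toy.AF == 0, :);
disp(noisyNormal)
```

In this toy example, only the subject with ID 4 is selected, which shows why a missing P-wave alone cannot be taken as proof of AF.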
Once you import this table into your Matlab workspace (using "load('Assignment_Data.mat')"), you can access the element in the i-th row of the SDRR column with the command "Data.SDRR(i)", and you can use the other column names in the same way to access their elements. In addition, to access a subset of the table, you can use ":". For example, to access the data from row i to row j of the ID column, you can use the command "Data.ID(i : j)". You can also view specific contents of the table by addressing the row and column numbers. For example, the command "Data(2 : 5, 1 : 3)" will give you all the data from row 2 to row 5 and from column 1 to column 3. The result of executing "Data(2 : 5, 1 : 3)" is shown in Fig. 4. However, to work with the values in the table and perform operations on them, you need to use the Data.[column name] format (e.g. "Data.SDRR(i)").

Figure 3: The provided table of data

Figure 4: The result of executing "Data(2 : 5, 1 : 3)"

4 Preliminary tasks [compulsory but not graded]

In the following sections, we are going to work with a subset of the data to train a probability model. To this end, we will need some of the data for training the model and the rest for testing. By running the following script, you randomly select the data of 1200 patients and put them in a new table as the training table, and put the rest in another table as the test table. These two tables are then saved as "Train.mat" and "Test.mat". You should submit these two files with your code. Note that the train and test sets, and hence the results, will be unique to your student ID. From now on you will work with these two sets instead of the original Data.

    id=input('What is your student ID? ');
    rng(id);                      % seed the random number generator with your ID
    K=1200;                       % number of training subjects
    N=length(Data.ID);            % total number of subjects
    i_n=randperm(N);              % random permutation of 1..N
    i_tr=Data.ID(i_n(1:K));       % IDs selected for training
    i_ts=Data.ID(i_n(K+1:end));   % remaining IDs, used for testing
    Train=Data(i_tr,:);           % IDs are used as row indices (assumes ID equals row number)
    Test=Data(i_ts,:);
    save('Train','Train');
    save('Test','Test');

Note: When you run the code, it asks for your student ID; you should type in your own student ID. Run this code "only once" when you start the assignment; the train and test sets are then saved. If you exit MATLAB or clear your workspace, you can load the saved Test and Train data.

5 Probabilities

Note: This part should be completed using Matlab.

If we choose a subject randomly from the Train set,

Q5.1: What is the probability of the subject being normal?
Q5.2: What is the probability of the subject having AF?
Q5.3: What is the probability of the presence of a P-wave for a subject in the Train set?
Q5.4: Find the mean and variance of SDRR for subjects in the Train set (Matlab built-in functions can be used).
Q5.5: Find the range of SDRR values (in the 3rd column of the Train set). Divide the range into 10 equal-sized intervals (bins). Next, find the probability of an SDRR value lying in each bin. The probability found for the i-th interval approximates $\int_{x_i}^{x_i+d} f_X(x)\,dx$, where $f_X(x)$ is the PDF of SDRR and $[x_i, x_i+d]$ is the i-th bin. Plot a bar graph showing the probability of the SDRR value lying in each bin. [Hint: use the histogram and bar functions.]
Q5.6: What is the probability of the SDRR value being in the fourth bin?

Marking scheme: 2 marks = 1 mark for correctness of your approach (you should explain the concepts behind the code in your report) + 0.5 mark for correctness of the code + 0.5 mark for demonstration

6 Conditional probabilities

Note: This part should be completed using Matlab.

If we randomly select a subject from the Train set, find the following:

Q6.1: The probability of the P-wave being present if we know that the subject is not diagnosed with AF.
Q6.2: The probability of the P-wave being present if we know that the subject is diagnosed with AF.
Q6.3: For each of the bins found in Q5.5, find an approximation of $\int_{x_i}^{x_i+d} f_{X|N}(x)\,dx$, where $f_{X|N}$ is the conditional PDF of SDRR given that the subject is Normal (not AF). Plot a bar graph showing the conditional probability of SDRR lying in each bin.
Q6.4: For each of the bins found in Q5.5, find an approximation of $\int_{x_i}^{x_i+d} f_{X|A}(x)\,dx$, where $f_{X|A}$ is the conditional PDF of SDRR given that the subject is diagnosed with AF. Plot a bar graph showing the conditional probability of SDRR lying in each bin.
Q6.5: If we know that a patient has AF, what is the probability that their SDRR is in the seventh bin?
Q6.6: The mean of SDRR for the patients diagnosed with AF.
Q6.7: The mean of SDRR for the patients with a normal condition.

Marking scheme: 2.5 marks = 1 mark for correctness of your approach (you should explain the concepts behind the code in your report) + 1 mark for correctness of the code + 0.5 mark for demonstration

7 Classification based on P-wave

Note: Parts 7.1 to 7.5 can be completed either using Matlab or manually (on paper). You can choose based on your preference.

From the probabilities found for the Train data, find the following (Q7.1-Q7.4):

Q7.1: The probability of having a Normal condition if no P-wave has been detected.
Q7.2: The probability of having AF if no P-wave has been detected.
Q7.3: The probability of having a Normal condition if a P-wave has been detected.
Q7.4: The probability of having AF if a P-wave has been detected.
Q7.5: Based on your answers to Q7.1 to Q7.4, which diagnosis is more likely for the four subjects with IDs 12, 100, 125 and 132 in the "original Data" set? Compare your prediction with the actual diagnosis. How do you explain your finding?
Q7.6: For all the data in the "Test set", use the conditional probabilities you found in Q7.1 to Q7.4, based on the presence of the P-wave, to predict whether the patient has AF or a normal condition.
Compare your predictions with the actual results for the subjects in the Test set, and find the accuracy, sensitivity and specificity of your prediction.
Hint: Accuracy means the percentage of correct predictions. Moreover, in medical diagnosis, test sensitivity is the ability of a test to correctly identify those with the disease (true positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (true negative rate). Watch: https://www.youtube.com/watch?v=FnJ3L-63Cf8

Marking scheme: 3 marks = 2 marks for correctness of your approach (if you choose the Matlab option, you should explain the concepts behind the code in your report) + 0.5 mark for correctness of the code (Q7.6) + 0.5 mark for demonstration

8 Classification based on SDRR

Note: Parts 8.1 to 8.3 can be completed either using Matlab or manually (on paper). You can choose based on your preference.

From the probabilities found for the Train data, find the following (Q8.1-Q8.2):

Q8.1: For each of the bins found in Q5.5, find the probability of having a Normal condition when SDRR lies in that bin. For example, if the range of the i-th bin is $[x_i, x_i+d]$, then find the probability of having a Normal condition when SDRR is in $[x_i, x_i+d]$. Repeat this process for all 10 bins.
Q8.2: For each of the bins found in Q5.5, find the probability of having AF when SDRR is in the range of the corresponding bin (similar process to Q8.1).
Q8.3: Based on your answers to Q8.1 and Q8.2, which diagnosis is more likely for the four subjects with IDs 4, 64, 86 and 191 in the original Data set? How do you justify this result?
Q8.4: For all the data in the "Test set", based on the bin that the patient's SDRR belongs to, predict whether the patient has AF or a normal condition. Next, compare your predictions with the actual results, and find the accuracy, sensitivity and specificity of your prediction (see the hint for Q7.6).
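The three metrics named in the hint for Q7.6 can be computed from two 0/1 vectors, as in the following minimal sketch. The vectors actual and predicted hold made-up toy labels, not values from the assignment data, and all variable names are illustrative assumptions. AF (label 1) is treated as the positive class.

```matlab
% Toy labels (made up for illustration): 1 = AF, 0 = normal
actual    = [1 1 1 1 0 0 0 0 0 0];
predicted = [1 1 1 0 0 0 0 0 1 1];

TP = sum(predicted == 1 & actual == 1);   % AF correctly predicted as AF
TN = sum(predicted == 0 & actual == 0);   % normal correctly predicted as normal
FP = sum(predicted == 1 & actual == 0);   % normal wrongly predicted as AF
FN = sum(predicted == 0 & actual == 1);   % AF wrongly predicted as normal

accuracy    = (TP + TN) / numel(actual);  % fraction of correct predictions
sensitivity = TP / (TP + FN);             % true positive rate
specificity = TN / (TN + FP);             % true negative rate
```

With these toy vectors, TP = 3, FN = 1, TN = 4 and FP = 2, so the accuracy is 0.7, the sensitivity is 0.75 and the specificity is 4/6 (about 0.667).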

Marking scheme: 3 marks = 2 marks for correctness of your approach (if you choose the Matlab option, you should explain the concepts behind the code in your report) + 0.5 mark for correctness of the code (Q8.4) + 0.5 mark for demonstration

9 Decision based on P-wave and SDRR

Note: Parts 9.1 to 9.5 of the assignment can be completed either using Matlab or manually (on paper). You can choose based on your preference.

Assume that P-wave and SDRR are "independent", and find the following values:

Q9.1: For each of the 10 bins found in Q5.5, find the probability of having a Normal condition given that the P-wave is present and SDRR is in the corresponding bin. For example, if the range of the i-th bin is $[x_i, x_i+d]$, then find the probability of having a Normal condition when SDRR is in $[x_i, x_i+d]$ and the P-wave is present. Repeat this process for all i.
Q9.2: For each of the 10 bins found in Q5.5, find the probability of having a Normal condition when the P-wave is absent and SDRR is in the corresponding bin (similar process to Q9.1).
Q9.3: For each of the 10 bins found in Q5.5, find the probability of having AF when the P-wave is present and SDRR is in the corresponding bin.
Q9.4: For each of the 10 bins found in Q5.5, find the probability of having AF when the P-wave is absent and SDRR is in the corresponding bin.
Q9.5: Which diagnosis is more likely for the subjects with IDs 4, 21 and 26 in the original Data set? How do you justify this result?
Q9.6: For all the data in the Test set, based on the bin that the patient's SDRR belongs to and the presence or absence of the P-wave, predict whether the patient has AF or a normal condition. Next, compare your predictions with the actual results, and find the accuracy, sensitivity and specificity of your prediction.
Q9.7: In Sections 7, 8, and 9, we created three models for predicting whether a patient has AF or not. For each of these models, you found the accuracy, sensitivity, and specificity in the Test set.
Compare these metrics among the three models and briefly explain the result of the comparison.

Marking scheme: 3 marks = 2 marks for correctness of your approach (if you choose the Matlab option, you should explain the concepts behind the code in your report) + 0.5 mark for correctness of the code (Q9.6) + 0.5 mark for demonstration

10 Additional questions

Q10.1: In this assignment, the probabilities of a patient having AF or a Normal condition were almost equal. However, in general, the incidence rate of AF is 0.02. Do your proposed models suit real-world scenarios? Do you need to modify them? How?
Q10.2: If we don't assume independence between the P-wave and SDRR random variables, how do you incorporate both P-wave and SDRR in automatic diagnosis? Just explain; no Matlab implementation is required.
Q10.3: Find a Gaussian PDF model (just the formula on paper) for the PDF of SDRR if we know that the patient has AF.

Marking scheme: 1.5 marks = 0.5 mark for each question (only based on your report)

11 Submission

You are required to submit your code and the "Train.mat" and "Test.mat" files (all in a single .zip archive), together with a brief report (at most six A4 pages in .pdf format), via the Moodle submission links by Monday 5 October at 9am. Your report should include your answers to all questions in parts 5 to 10. For Matlab-based questions, you should explain the concepts behind the code as well.