BINF90001 – Semester 1, 2022 – Assignment 1

Due date: 17:00, Thursday 14 April 2022

Instructions

• This assignment contains 5 questions worth a total of 50 points. It will contribute 20% to your assessment for this subject.
• Submit your assignment as a PDF file via the LMS. Please refer to the submission instructions on the LMS for more information.
• Late assignments will only be accepted under exceptional circumstances. Usually a medical certificate will be required. A late penalty may be imposed.
• Your submission should clearly show your name and student ID number.
• Provide tables, graphs, R code and concise text explanations to support your answers. Graphs must be clear and well-labelled, including informative axis labels and titles. All tables, graphs and code must be accompanied by explanation and interpretation. Satisfactory presentation forms part of the marking criteria: points will be deducted for excessive and/or poorly organised work.
• Present all material related to each answer (such as R code and plots) together: do not provide any appendix or supplementary information at the end. Use of R Markdown is highly recommended.
• Do not discuss solutions to these problems on the online discussion forum. However, you can use the forum to seek clarifications of the questions.

Data

In this assignment you will analyse some SNP genotype and phenotype data. The data files are available from the Assignments page on the LMS. There are data from three studies:

Study 1. 500 individuals each genotyped at 200 SNPs (genotypes1.csv) and their body mass index (BMI) recorded in kg/m² (phenotype1.csv).

Study 2. 1,500 individuals (not including any of the 500 above) each genotyped at the same 200 SNPs (genotypes2.csv) and their overweight status recorded as either true or false (phenotype2.csv). For this study, a person is defined as being overweight if their BMI is greater than 25 kg/m².

Study 3. 100 individuals each genotyped at the same 200 SNPs (genotypes3.csv) and no phenotypes recorded.

Questions

1. (11 points)
   (a) Read all of the data into R as data frames (one data frame per file). Check that the study sizes reported above are correct. List the first 5 phenotypes in both Study 1 and Study 2.
   (b) For each SNP in Study 1, fit a simple linear regression model for BMI against the SNP genotype (i.e. 200 separate models each with a single predictor) and record the p-value from testing the null hypothesis of no association between the SNP and BMI. Consider only the additive model (1 parameter) for each SNP, rather than the general model (2 parameters).
   (c) Draw a Manhattan plot to visualise all of the p-values from these tests on a log10 scale. Briefly describe what you conclude from this plot.
   (d) Which SNP has the smallest p-value? What is the p-value?

2. (18 points)
   (a) You decide to combine studies 1 and 2 together. This requires making the phenotypes equivalent. Convert the phenotype from Study 1 to be the same as for Study 2, and then combine the two studies by creating a single data frame for the phenotype and one for the genotypes.
   (b) For the combined data, test each SNP for association with overweight status and record the 200 p-values. Again, consider only an additive genetic model.
   (c) Draw a Manhattan plot to visualise the p-values from these tests on a log10 scale. Briefly describe what you conclude from this plot.
   (d) Which SNP has the smallest p-value? What is the p-value?
   (e) i. Report the number of SNPs that are significant using the Bonferroni method to control the family-wise error rate at 5% across the 200 tests.
       ii. Report the number of SNPs that are significant using the Benjamini & Hochberg method to control the false discovery rate (FDR) at 5% across the 200 tests.
       iii. Using the Storey method with λ = 0.1, what is the expected number of null SNPs that are significant at level α = 0.001? How many SNPs are observed to be significant at this level? What is the resulting FDR estimate?
   (f) Describe how you would report the number of significantly associated SNPs from the association analysis in the combined study. How would you decide which SNPs to report as significant and how would you summarise the possibility of error?

3. (7 points)
   (a) Identify the SNPs with the 8 smallest p-values from the previous question, and report their p-values.
   (b) Some of the significant SNPs may be in high linkage disequilibrium (LD). To investigate this, calculate the 8 × 8 correlation matrix between these 8 SNPs. What do you conclude about the results from the association analysis?
   (c) Defining ‘high LD’ to be a squared correlation coefficient > 0.5, identify the set of SNPs among the top 8 with the lowest p-values subject to no two SNPs in the set being in high LD with each other. (This process is called clumping and the resulting SNPs are called tag SNPs.)

4. (8 points)
   (a) Fit a logistic regression model that includes all of the tag SNPs as predictors (additive effects only, like the previous models). Report the parameter estimates and standard errors.
   (b) For each SNP, report the effect size as an odds ratio (OR) and explain how to interpret it.
   (c) Report 95% confidence intervals for each OR.

5. (6 points)
   (a) Using your fitted model from question 4, compute the risk (i.e. probability of being overweight) for all individuals in Study 3.
   (b) Plot the risks with the individuals sorted in order of increasing risk.
   (c) What is the risk for indiv2001?
   (d) Lifestyle factors also have an impact on the risk of being overweight, not just genetic factors. Positive lifestyle (such as a good diet and regular exercise) can reduce the risk. For simplicity, suppose that lifestyle and genetic factors act independently. For indiv2001, how strong would the lifestyle factors collectively need to be (expressed as an odds ratio) in order to counteract the genetic risk (i.e. to make the overall risk be equivalent to the lowest predicted risk based on the SNP genotypes)?
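The per-SNP additive tests and Manhattan plot described in Question 1(b) and (c) can be sketched as follows. This is only a minimal illustration on simulated stand-in data, not the assignment files: the objects `geno`, `bmi` and `pvals`, the SNP names and the simulated effect are all assumptions made for the example, and genotypes are assumed to be coded as allele counts 0, 1 or 2.

```r
# Simulated stand-in data (NOT the assignment files): 100 individuals, 20 SNPs
set.seed(1)
n <- 100
m <- 20
geno <- as.data.frame(matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n))
names(geno) <- paste0("snp", seq_len(m))

# Hypothetical phenotype with an additive effect from snp5 only
bmi <- 25 + 1.5 * geno$snp5 + rnorm(n, sd = 2)

# One simple linear regression per SNP; entering the 0/1/2 allele count as a
# single numeric predictor gives the additive (1 parameter) model
pvals <- sapply(geno, function(g) {
  fit <- lm(bmi ~ g)
  summary(fit)$coefficients["g", "Pr(>|t|)"]
})

# Manhattan-style plot: -log10(p) against SNP index
plot(seq_along(pvals), -log10(pvals),
     xlab = "SNP index", ylab = "-log10(p-value)",
     main = "Simulated per-SNP association tests")

# Name of the SNP with the smallest p-value
names(which.min(pvals))
```

For a binary phenotype such as overweight status, one common analogue (again an assumption, not a prescribed method) replaces `lm` with `glm(..., family = binomial)` and reads the p-value from the `"Pr(>|z|)"` column.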
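The three multiple-testing procedures named in Question 2(e) can be illustrated on simulated p-values. The sketch below assumes one common form of the Storey estimate, in which the proportion of null tests is estimated from the p-values above λ; the form taught in lectures may differ in detail. The vector `p` and every result computed from it are simulated, not assignment results.

```r
set.seed(2)
# Simulated p-values (NOT assignment results): 190 nulls plus 10 small values
p <- c(runif(190), rbeta(10, 1, 2000))
m <- length(p)

# Bonferroni: controls the family-wise error rate at 5%
n_bonf <- sum(p.adjust(p, method = "bonferroni") <= 0.05)

# Benjamini & Hochberg: controls the false discovery rate at 5%
n_bh <- sum(p.adjust(p, method = "BH") <= 0.05)

# Storey-type estimate (assumed form): p-values above lambda are taken to be
# mostly null, so their frequency estimates the null proportion pi0
lambda <- 0.1
alpha <- 0.001
pi0_hat <- min(1, mean(p > lambda) / (1 - lambda))

expected_null_sig <- pi0_hat * m * alpha   # expected null SNPs with p <= alpha
observed_sig <- sum(p <= alpha)            # SNPs observed with p <= alpha
fdr_hat <- expected_null_sig / max(observed_sig, 1)

c(bonferroni = n_bonf, BH = n_bh, pi0 = pi0_hat, FDR = fdr_hat)
```

Because the Benjamini & Hochberg adjusted p-values are never larger than the Bonferroni adjusted ones, the BH count is always at least as large as the Bonferroni count.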
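The clumping rule in Question 3(c) can be read as a greedy procedure: visit SNPs from smallest to largest p-value and keep a SNP only if its squared correlation with every SNP already kept is at most 0.5. The sketch below applies that reading to three simulated SNPs. The names `snpA`, `snpB`, `snpC` and the p-values are hypothetical, and the greedy order is one reasonable interpretation rather than the only possible one.

```r
set.seed(3)
n <- 300

# Simulated genotypes: snpB is a near copy of snpA (high LD), snpC is independent
snpA <- rbinom(n, 2, 0.4)
snpB <- snpA
flip <- sample(n, 15)
snpB[flip] <- rbinom(15, 2, 0.4)
snpC <- rbinom(n, 2, 0.4)
geno <- data.frame(snpA, snpB, snpC)

# Hypothetical association p-values for the three SNPs
pvals <- c(snpA = 1e-8, snpB = 1e-6, snpC = 1e-4)

# Squared correlation matrix between SNP genotypes
r2 <- cor(geno)^2

# Greedy clumping: walk from the smallest p-value upwards and keep a SNP only
# if it is not in high LD (r2 > 0.5) with any SNP already kept
tags <- character(0)
for (s in names(sort(pvals))) {
  if (all(r2[s, tags] <= 0.5)) tags <- c(tags, s)
}
tags
```

In this simulation `snpB` is dropped because of its LD with `snpA`, which has the smaller p-value, leaving `snpA` and `snpC` as the tag SNPs.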
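The quantities requested in Questions 4(b), 4(c) and 5(d) can be written out as follows. This is a sketch under stated assumptions rather than a prescribed solution: the symbols $\hat\beta_j$, $\mathrm{SE}_j$, $p_G$, $p_{\min}$ and $\mathrm{OR}_L$ are introduced here for illustration, the interval is the usual Wald interval, and "act independently" is read as meaning that lifestyle and genetic effects multiply on the odds scale.

```latex
% Odds ratio per additional allele at SNP j, from the logistic regression coefficient
\mathrm{OR}_j = \exp(\hat\beta_j)

% Approximate 95% Wald confidence interval for OR_j
\left[\, \exp(\hat\beta_j - 1.96\,\mathrm{SE}_j),\ \exp(\hat\beta_j + 1.96\,\mathrm{SE}_j) \,\right]

% Risk and odds are related by
\mathrm{odds} = \frac{p}{1 - p}

% Assumed multiplicative combination of genetic odds and lifestyle odds ratio
\mathrm{odds}_{\text{overall}} = \mathrm{OR}_L \times \frac{p_G}{1 - p_G}

% Lifestyle odds ratio that brings the overall risk down to the lowest predicted risk
\mathrm{OR}_L = \frac{p_{\min} / (1 - p_{\min})}{p_G / (1 - p_G)}
```

Here $p_G$ denotes the predicted genetic risk for the individual of interest and $p_{\min}$ the lowest predicted risk among the Study 3 individuals; a value of $\mathrm{OR}_L$ below 1 corresponds to a protective lifestyle effect.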