B363 Final Project Suggested Topics (Final Report Due: 12/16, Wed midnight)

1. We have learned several algorithms to predict the replication start site in bacterial genome sequences. We have shown that these methods can be applied successfully to the E. coli genome, identifying the replication start region and the DnaA box signal (k-mers). Genomes from thousands of E. coli strains are now available at https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/167/. You may implement the algorithms learned in class to analyze and compare the replication start sites of a subset of these genomes.

2. In class we have learned clustering algorithms that group genes with similar expression patterns across a biological process such as the diauxic shift. The dataset obtained by DeRisi and colleagues is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28. You may implement one clustering algorithm to analyze the dataset and identify the subset of genes whose expression levels increase after the diauxic shift. You may further implement a motif finding algorithm to identify the carbon source response element (CSRE) motif in the upstream regions of many of these genes. You can find the yeast genome sequence and the gene annotations here: https://www.ncbi.nlm.nih.gov/genome/15?genome_assembly_id=22535

3. Evolutionary studies of SARS-CoV-2 genomes. Coronavirus disease 2019 (COVID-19), the infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first identified in Wuhan, China, and has since spread to many countries, including the United States. Since the outbreak, thousands of SARS-CoV-2 genomes have been sequenced in countries around the world. For example, the collection at NCBI (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049) contains more than 30,000 genome sequences.
You may use one of the algorithms learned in class to construct a phylogenetic tree for a selected subset of genome sequences, to study how the virus has spread across the world and how new strains emerge during the pandemic. You may refer to a similar study by the NextStrain team: https://nextstrain.org/ncov/global.

Requirements and Evaluation:

The above are the suggested topics for the final project. You may choose one of them or another project of your own interest. You can work by yourself or in a team of two. You should not simply present the data with figures and visualization tools; you need to use algorithms learned in class to reach your conclusion. Please contact me by email at [email protected] if you are not sure whether your idea is appropriate for the final project.

I will host discussions during the classes of 12/1 and 12/3 (3:15-4:30 pm). Each team is required to give a 5-10 minute presentation about the project during the classes of 12/8 and 12/10. You do not need to have final results by then, but you should present the idea and the methods you plan to pursue. We will make the presentation arrangements in the class of 12/3, so please make an effort to attend the Zoom meeting on 12/3.

Each team should submit a final report and the related implementations of bioinformatics algorithms on Canvas by the due date, Wednesday 12/16. For a team project, the final report must describe the contribution of each member. We will evaluate submissions based on the following questions:

1) Is the formulated problem reasonable? (15%)
2) Have comprehensive data been collected? (15%)
3) Is the bioinformatics algorithm devised properly and described clearly? (15%)
4) Has the bioinformatics algorithm been implemented correctly and efficiently? (20%)
5) Is the conclusion meaningful and supported by the data? (20%)
6) Are the results presented clearly and intuitively? (15%)
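
As a possible starting point for topic 1, the Python sketch below shows one common way to locate a candidate replication start region: find the positions where the cumulative G-minus-C skew reaches its minimum, then count the most frequent k-mers in a window around them. It assumes the skew-based method is among those covered in class. The function names (`gc_skew`, `min_skew_positions`, `most_frequent_kmers`) are hypothetical, and the k-mer count is exact-match only (no mismatches or reverse complements), so treat it as a minimal illustration rather than a complete solution.

```python
from collections import Counter


def gc_skew(genome: str) -> list[int]:
    """Cumulative (#G - #C) skew; skew[i] covers the prefix genome[:i]."""
    skew = [0]
    for base in genome.upper():
        if base == "G":
            skew.append(skew[-1] + 1)
        elif base == "C":
            skew.append(skew[-1] - 1)
        else:
            skew.append(skew[-1])
    return skew


def min_skew_positions(genome: str) -> list[int]:
    """All prefix lengths at which the skew attains its minimum."""
    skew = gc_skew(genome)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


def most_frequent_kmers(text: str, k: int) -> list[str]:
    """Exact-match k-mers with the highest count in text (sorted)."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        return []
    top = max(counts.values())
    return sorted(kmer for kmer, count in counts.items() if count == top)


if __name__ == "__main__":
    # Toy sequence for illustration only, not real genome data.
    toy = "TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"
    print(min_skew_positions(toy))
    print(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
```

For a real genome, you would read the sequence from a FASTA file, take a window (for example a few hundred bases) around a minimum-skew position, and search that window for candidate DnaA box k-mers.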