COMP 462 / 561 - Homework #3 Due on November 16 2021, 23:59 on MyCourses IMPORTANT: Question 1 of this assignment requires a substantial amount of programming, and portions (c), (d), and (e) depend on you having a working program. Do not leave this to the last minute! Get started with the programming early, and get our help if you need. Question 1 (80 points). In this question, you will implement the Viterbi algorithm and apply it to the gene finding HMM seen in class to predict genes in a bacterial genome. Vibrio cholerae is the bacteria that causes cholera. It was first sequenced in 2000 ( see https://www.nature.com/articles/35020000 ). Download the genome sequence here: ftp://ftp.ensemblgenomes.org/pub/bacteria/release- 45/fasta/bacteria_81_collection/vibrio_cholerae/dna/Vibrio_cholerae.GFC_11.dna .toplevel.fa.gz Notes: • When I click the link above in Acrobat Reader, it prompts me for a password, which I don’t have. Instead, just copy-paste the link into your browser, and it won’t ask you for that password. Weird… • Although the genome of Vibrio cholera Vibrio cholera consists of two chromosomes, the sequence file contains more than two sequences, because it is incompletely assembled. Download the gene annotation here: ftp://ftp.ensemblgenomes.org/pub/bacteria/release- 45/gff3/bacteria_81_collection/vibrio_cholerae/Vibrio_cholerae.GFC_11.37.gff3. gz The format of both files should be fairly self-explanatory, but you can find more information about the gff3 file format here: http://gmod.org/wiki/GFF3. The file contains annotations of different types of functional elements. We will only consider those called “CDS” (for CoDing Sequences); ignore the other lines (labeled “gene” or “exon” or “mRNA”). Important notes: • One important detail we’ve omitted in our in-class discussion is that genes can actually be located either on the forward or the reverse DNA strand. In class, we have only considered the genes located on the forward strand. In this assignment, we will also only focus on the genes on the forward strand. It is not much more complicated to handle genes on the negative strand, but it’s just annoying, so we will pretend that genes on the negative strand do not exist, i.e. that portions of the genome occupied by genes on the reverse strand are actually intergenic regions. • Another level of complexity that we will ignore is the (fairly rare) possibility that two genes can overlap each other. This obviously breaks our HMM gene finder. Because such events are very rare, we will assume that this never occurs. The gff3 file given to you actually contains a few of those pairs of overlapping genes. You can deal with them in whichever way you find simplest; this is not going to affect your results significantly. • In bacteria, there are actually multiple possible start codons (not only ATG). a) (10 points) Using the genome sequence and gene annotation for Vibrio cholerae, infer the (i) average length of intergenic regions, (ii) average length of genic region, (iii) the nucleotide frequency table for intergenic regions; (iv) the codon frequency table for genic regions. b) (30 points) Using the programming language of your choice, implement the Viterbi algorithm. Your program does not need to work with an arbitrary HMM. It but only need to work for the specific bacterial gene-finding HMM seen in class (4 states: Intergenic, Start, Middle, Stop). The program’s argument should be I. A Fasta file containing one or more bacterial genome sequences. II. A configuration file (formatted as you want), which contains the information found in (a), containing the emission and transition probabilities for your HMM. (Assume that the initial state probability is 1 for the intergenic state, and zero for all other states). The output should be a GFF3 file with one line per predicted gene (only lines corresponding to portions annotated as genes need to be included), e.g. something like: DN38.contig00011 ena CDS 85269 86705 . + 0 . DN38.contig00011 ena CDS 87006 87230 . + 0 . DN38.contig00011 ena CDS 88079 89323 . + 0 . Submit your code, along with a simple explanation about how to run it. c) (10 points) Vibrio vulnificus is a species of bacteria closely related to Vibrio cholerae. In 2005, it caused an outbreak following the hurricane Katrina (https://www.ncbi.nlm.nih.gov/pubmed/16195696 ). Its genome was sequenced and can be downloaded here: ftp://ftp.ensemblgenomes.org/pub/bacteria/release- 45/fasta/bacteria_64_collection/vibrio_vulnificus/dna/Vibrio_vulnificus.ASM743 10v1.dna.toplevel.fa.gz Run your program on this genome, using the parameters obtained for Vibrio cholerae. Submit the GFF3 file with your gene predictions (forward strand only). d) (15 points) The genes of Vibrio vulnificus were annotated based on a variety of types of evidence. You can download the GFF3 file here: ftp://ftp.ensemblgenomes.org/pub/bacteria/release- 45/gff3/bacteria_64_collection/vibrio_vulnificus/Vibrio_vulnificus.ASM74310v1 .37.gff3.gz Evaluate the accuracy of your predictions against the gene annotation available here (always limiting our attention to genes on the positive strand). Specifically, report: • The fraction of annotated genes on the positive strand that: o Perfectly match both ends of one of your predicted genes o Match the start but not the end of a predicted gene o Match the end but not the start of a predicted gene o Do not match neither the start not the end of a predicted gene • The fraction of your predicted genes that: o Perfectly match both ends of one of an annotated genes o Match the start but not the end of an annotated gene o Match the end but not the start of an annotated gene o Do not match neither the start not the end of an annotated gene Note: Gene prediction is not an easy task! You should expect that only approximately half of the annotated genes are perfectly predicted. e) (15 points) What properties of annotated genes are associated to an elevated risk of being partially or completely missed by your predictor? What are the properties of genes predicted by your predictor that do not match an annotated gene? Write a paragraph for each question, supported by data and/or figures. Question 2. (20 points) If we think of an HMM as a machine to generate a random sequence of observations, the number of consecutive steps the path will remain at a given state follows a geometric distribution (https://en.wikipedia.org/wiki/Geometric_distribution ). This means, for example, that, in our gene-finding HMM, the length distribution of exons, introns, and intergenic regions will be assumed to be geometric. However, in reality, these regions have length distributions that can be far from geometric. Consider the very simple two- state gene finding HMM shown below. Assume that we have a desired gene length distribution, provided in the form of a discrete probability distribution for lengths ranging from 1 to 1000: Pr[length = k] = pk. Describe (in at most half a page) how you could modify the HMM below to produce a second HMM where the distribution over the duration of stay in the gene state (or states) is exactly the target length distribution. Note that the Gene state will probably need to be subdivided in several Gene sub-states. Describe not only the states of your HMM but also the transition probabilities. Good luck! Non-gene Gene
欢迎咨询51作业君