ITNPBD7 Spring 2020 - Resit/Deferred Assessment

DNA Sequence Analysis
Your task for this assignment is to use Hadoop and the MapReduce approach to find the average number of
letters between pairs of DNA tags across a sample genome. You are provided with two files - one containing
the sample genome and the second containing the set of tag pairs that you should search for. Examples of
some of the data held in these two files are given below:
Sample Genome
CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG
TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG
AGATTTCCCTGAGAAAGTCATATTTAAGCTGCCATTTGAAGACCAAGGAATCATGACTAGAGACAAGAAGAGAGAACATAGAGTGATTATGGAGAATCTT
AGTATCAGTCCAGTCCTCAGTGACGGGACCCTAACTGACCTGCCCTTCTTTGGCTTAGATTGCTTAAATGGTTCTGGATGTGATGATGGTGCACCTTGCC
TATATTAGAGTAGAGTCTAAAGATTAGAATGATCCACAGGTTAATATGGGCCATTATAAAGAGATTAGTGATATTAACAATNTAGTATCAACATGGAGAT
TCTATTATTTCATTGGGGTTGCAAAATTGTGATTTTCTAATCATTTCACTTTTCCTATATTTATTGCCTGGAACTTTGTAAAGAAGAAATTGATCTTATT
Sample Start/End Tag Pairs
CAG,AGA
CCA,TGT
TGG,TCA
TGG,TCC
TGG,TCT
CCA,TGA
CCA,TGC
CCA,TGG
GTG,TGA
GAA,CAT
As an example of what you must do, consider the first two lines of the above data, each of which is sent
individually to a mapper:
CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG
TGGGAATGTAAATTAGTCCAGCCACTCTGGAGAACCGTATGGAGGTTCCTCCAAAAATTACAAATAGAACTACCATATGATCCAGCAATCCCATGCTATG

Your program should identify that the start and end tag pairs above are located at the following zero-based
positions in the first line (each number is the index of the first letter of the tag):
CAG...AGA: 0..6
CAG...AGA: 21..26
CCA...TGT: 14..33
CCA...TGT: 55..87
TGG...TCC: 36..54
TGG...TCT: 36..52
CCA...TGA: 14..61
CCA...TGG: 14..36
GTG...TGA: 42..61
GTG...TGA: 86..96
GAA...CAT: 3..50

with the number of letters between these tags (not including the tags themselves) being:
CAG...AGA 3 & 2
CCA...TGT 16 & 29
TGG...TCC 15
TGG...TCT 13
CCA...TGA 44
CCA...TGG 19
GTG...TGA 16 & 7
GAA...CAT 44
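
The search described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under assumptions inferred from the sample positions: the end tag is searched for immediately after the start tag, and the next search for a start tag resumes after the matched end tag. The class and method names (TagGapFinder, gaps) are hypothetical and are not taken from DNASeqCount.java.

```java
import java.util.ArrayList;
import java.util.List;

class TagGapFinder {
    // Returns the number of letters between each start...end pair found in one line.
    // Assumption: pairs do not overlap, so the search resumes after the matched end tag.
    static List<Integer> gaps(String line, String startTag, String endTag) {
        List<Integer> result = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = line.indexOf(startTag, from);
            if (start < 0) {
                break; // no further start tag in this line
            }
            int afterStart = start + startTag.length();
            int end = line.indexOf(endTag, afterStart);
            if (end < 0) {
                break; // no end tag follows, so later start tags cannot match either
            }
            result.add(end - afterStart); // letters strictly between the two tags
            from = end + endTag.length();
        }
        return result;
    }

    public static void main(String[] args) {
        String line1 = "CAGGAAAGACAATTCCAAAATCAGTTAGAGTCCTGTTGGCGCGTGTAATACATCTCCACTTTGAAAATGAAGACAGGGGGTTACGAGTGTTATTAATGAG";
        System.out.println(gaps(line1, "CAG", "AGA")); // prints [3, 2]
        System.out.println(gaps(line1, "GTG", "TGA")); // prints [16, 7]
    }
}
```

Stopping as soon as no end tag is found is safe because any later start tag would search a shorter suffix of the same line.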

For the second line, the tag pairs are located at:
CAG...AGA: 18..30
TGG...TCC: 0..16
TGG...TCC: 27..46
TGG...TCT: 0..25
CCA...TGA: 17..77
CCA...TGC: 17..93
CCA...TGG: 17..27
GAA...CAT: 3..73

with the number of letters between tags of:
CAG...AGA 9
TGG...TCC 13 & 16
TGG...TCT 22
CCA...TGA 57
CCA...TGC 73
CCA...TGG 7
GAA...CAT 67

For the two data lines shown at the start of this example, the average gap between tags would therefore be:
CAG...AGA 4.6666665
TGG...TCC 14.666667
TGG...TCT 17.5
CCA...TGA 50.5
CCA...TGT 22.5
CCA...TGC 73.0
CCA...TGG 13.0
GTG...TGA 11.5
GAA...CAT 55.5
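
Note that each average above pools every gap found for a tag pair across both lines; it is not the mean of the per-line means. For CAG...AGA the gaps are 3, 2 and 9, giving 14 / 3. The sketch below shows that arithmetic in Java. The use of float is an assumption, made because the printed values above look like single-precision output; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class PooledAverage {
    // Mean of all gaps recorded for one tag pair, pooled over every line.
    // An empty list yields NaN (0 divided by 0 in floating point).
    static float average(List<Integer> gaps) {
        int sum = 0;
        for (int gap : gaps) {
            sum += gap;
        }
        return (float) sum / gaps.size();
    }

    public static void main(String[] args) {
        // CAG...AGA: gaps 3 and 2 from the first line, 9 from the second.
        System.out.println(average(Arrays.asList(3, 2, 9)));
        // TGG...TCC: gap 15 from the first line, 13 and 16 from the second.
        System.out.println(average(Arrays.asList(15, 13, 16)));
    }
}
```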

Your task is to write the Map/Reduce code in Java needed to process the above data in such a way that it
produces the final output of the averages shown above but for the entire genome data rather than just the
two sample lines shown. You will submit a written report, detailing your design and the results you found. You
must also submit a Java file containing your code.

Step 1, HDFS – 20 Marks
Before you write any code, you will need to copy the data onto your own space in HDFS. In your report, give
details of how HDFS stores data such as this (assume the file is much bigger than it really is for the purpose of
your description). This section should be around half a page long, plus a diagram. Describe what HDFS is for,
the architecture it uses, and the roles of different nodes in the cluster. Document the hdfs commands you
used to create a directory for the data and place it there. Make sure everything you put here, including the
diagram, is your own work. Do not copy anything from other sources.
Step 2, Design – 20 Marks
Now consider the Map/Reduce design you will implement. Compare and contrast producing a design with and
without a Combiner and describe the role that the Combiner plays in improving the efficiency of your
solution. You should also describe what keys and values the mapper will emit, the combiner will emit and
what the final reducer will emit. You should consider how much data will be moved across the network in
each of your two designs and how many different reducers will be used in each case.
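
One point worth considering in the Combiner design is that an average cannot be combined directly: averaging per-line averages gives the wrong answer (for CAG...AGA, the mean of 2.5 and 9 is 5.75, not 4.67). A common approach, sketched below in plain Java, is to carry a partial (sum, count) pair per tag pair and divide only at the very end. This is one possible shape rather than the required design; the class GapStats and its "sum,count" text encoding are hypothetical and the sketch uses no Hadoop classes.

```java
class GapStats {
    final long sum;   // total letters between tags seen so far
    final long count; // number of tag pairs seen so far

    GapStats(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // What a mapper might emit for a single gap.
    static GapStats of(int gap) {
        return new GapStats(gap, 1);
    }

    // Merging is associative and commutative, so it can be applied
    // in a combiner and again in the reducer without changing the result.
    GapStats merge(GapStats other) {
        return new GapStats(sum + other.sum, count + other.count);
    }

    double average() {
        return (double) sum / count;
    }

    // Hypothetical text form of the value, e.g. "5,2".
    String encode() {
        return sum + "," + count;
    }

    static GapStats decode(String text) {
        String[] parts = text.split(",");
        return new GapStats(Long.parseLong(parts[0]), Long.parseLong(parts[1]));
    }

    public static void main(String[] args) {
        GapStats line1 = GapStats.of(3).merge(GapStats.of(2)); // combined near the first mapper
        GapStats line2 = GapStats.of(9);                        // combined near the second mapper
        GapStats total = decode(line1.encode()).merge(decode(line2.encode()));
        System.out.println(total.encode());  // prints 14,3
        System.out.println(total.average()); // roughly 4.67
    }
}
```

With this shape, each combiner forwards at most one value per tag pair instead of one value per gap, which is where the reduction in network traffic would come from.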
Step 3, Implement – 60 Marks
Once you have completed your designs, you should implement the design that uses a Combiner and show
how it improves the performance of the overall solution. It is advisable to use the DNASeqCount.java file
provided on the assignment page in Canvas as a starting point. A file called TestSeqCount has been provided
that will use the code from DNASeqCount.java and run it on the mochadoop Hadoop simulator. You are
advised to develop your solution with this first before finally running it on Hadoop. TestSeqCount uses the
sample data and tag pairs shown above so you can use it to check that you are getting the final answers
shown above.
The Hadoop run will use the full set of data and a larger set of tag pairs to produce a more detailed result so
do not expect the two alternatives to produce the same output (although you can test your Hadoop job with
the smaller data files if you wish).
If you have problems remotely accessing Hadoop, you can run your code with Mochadoop only, using the
dna-40.txt sample, which contains the first 40 lines of DNA sequences, and submit the results for this. However,
your submission may be tested on a much larger data set on Hadoop, so you should be sure that it works.
Whether or not you use Hadoop, you should still provide the commands that would be needed to run your
solution on the real Hadoop system.
Submission Details
Please write up your work in a report and submit it via Canvas, clearly noting your 7-digit student ID number
on the front of your report but do not provide your name. Additionally, please submit your DNASeqCount.java
file via Canvas and ensure that your code is very well commented and that you have put your 7-digit ID
number at the top of your Java code in the commented area. Make sure your report also contains the results
you got when you ran your code. The deadline for submission is Monday 22nd of June at 4pm.

Plagiarism
Work which is submitted for assessment must be your own work. All students should note that the University
has a formal policy on academic misconduct which can be found here.
Plagiarism means presenting the work of others as though it were your own. The University takes a very
serious view of plagiarism, and the penalties can be severe (ranging from a reduced grade in the assessment,
through a fail for the module, to expulsion from the University for more serious or repeated offences).
Specific guidance in relation to Computing Science assignments may be found in the Computing Science
Student Handbook. We check submissions carefully for evidence of plagiarism, and pursue those cases we
find.
Late submission
If you cannot meet the assignment hand-in deadline and have good cause, please see the module coordinator
to explain your situation and ask for an extension. Coursework will be accepted up to seven days after the
hand-in deadline (or expiry of any agreed extension) but the mark will be lowered by three marks per day or
part thereof. After seven days the work will be deemed a non-submission and will receive an X.



