Module Code: COMP5111M01
Module Title: Big Data Systems © UNIVERSITY OF LEEDS
School of Computing Semester 2 2018/2019
Calculator instructions:
- You are not allowed to use any calculator in this examination.
Dictionary instructions:
- A basic English dictionary is available to use: raise your hand and ask an invigilator, if you
need it.
Examination Information
- There are 4 pages to this examination.
- There are 2 hours to complete the examination.
- Answer all 3 questions.
- The number in brackets [ ] indicates the marks available for each question or part
question.
- You are reminded of the need for clear presentation in your answers.
- The total number of marks for this examination paper is 60.
- You are allowed to use annotated materials.
Question 1
(a) Facebook is an example of a massively connected social media platform, generating huge
volumes of data. Give an example scenario where Facebook may batch process some data,
and an example scenario where Facebook may need to process data in real-time.
[2 marks]
(b) Several big data platforms are available, each with different characteristics, and choosing
the right one requires in-depth knowledge of their capabilities. To decide which platform to
use, you investigate your application’s needs. Give two fundamental issues that you will
consider before making your decision.
[2 marks]
(c) Data volume is predicted to grow at an enormous rate, with some studies predicting a
10-fold growth in world data by 2025. Give two reasons - with real-world examples - why
this trend is occurring.
[2 marks]
(d) State the similarities and differences between traditional computing clusters and the
computing clouds launched in recent years, considering the technical and economic aspects as
listed below:
• Hardware, software, and technical support.
• Resource allocation and provisioning methods.
• Infrastructure management and protection.
• Support of utility computing services.
[8 marks]
(e) You are designing an application that requires both data acquisition and pre-processing of
raw data for event filtering. Moreover, you have the freedom to specify the underlying
hardware to use to perform the pre-processing. Which hardware architecture would you
choose for such an application? Justify your answer.
[3 marks]
(f) How do specialist hardware deployment and the use of a technology like Apache Storm
compare to the more traditional MapReduce solution?
[3 marks]
[Question 1 Total: 20 marks]
Question 2
(a) Self-driving vehicles are a technology that is rapidly moving towards mass-market
production. Give examples of how a self-driving vehicle relates to the 5 Vs of Big Data
(Volume, Velocity, Variety, Veracity, Value).
[5 marks]
(b) The Hadoop Distributed File System (HDFS) is a popular storage mechanism for large
quantities of data. Explain how HDFS ensures the fault-tolerance of data stored on its
data nodes.
[2 marks]
(c) Containers are used in Hadoop V2. They are viewed as the Virtual Machine killer. Compare
containers and Virtual Machines using three criteria of your choice.
[3 marks]
(d) The original Hadoop MapReduce is used to process large data sets across a large number of
servers. However, it often performs poorly when too many servers are involved, e.g. when
running 40K concurrent tasks over 4K servers. Clearly explain why such poor performance
occurs. Outline a possible mitigation strategy.
[5 marks]
(e) Apache Storm is an example of a Continuous Operator Model (COM) system, used to
process streaming data. Explain how Apache Storm guarantees that all data emitted by its
spouts will be processed.
[3 marks]
(f) Discuss two disadvantages of using Apache Storm to process streamed data.
[2 marks]
[Question 2 Total: 20 marks]
Question 3
(a) Apache Spark is one of the most popular Big Data Systems in today’s industry. Discuss
two advantages that Spark offers over the more traditional Apache Hadoop framework, and
explain why these advantages are significant. Explain why Hadoop is still useful, and give
an example of how Hadoop could still be used.
[5 marks]
(b) Data deduplication is a specialized data compression technique for eliminating duplicate
copies of repeating data. Explain the concepts of both source-based and target-based
deduplication. Discuss an advantage and a disadvantage of each approach in the context
of Cloud Computing.
[5 marks]
(c) NoSQL is a broad class of database management systems which do not use a relational
database management model. Discuss two advantages and two disadvantages of using
NoSQL in the context of a big data system. Give an example scenario where use of a
NoSQL database would be appropriate.
[5 marks]
(d) Neo4j is an example of a NoSQL Graph database. Use an example to explain what type of
application a Graph database is suitable for. Discuss two advantages and two disadvantages
of graph databases.
[5 marks]
[Question 3 Total: 20 marks]
[Grand Total: 60 marks]