Tutoring Case: COMP5111M01
Module Code: COMP5111M01
Module Title: Big Data Systems
© UNIVERSITY OF LEEDS
School of Computing
Semester 2 2018/2019

Calculator instructions:
- You are not allowed to use any calculator in this examination.

Dictionary instructions:
- A basic English dictionary is available to use: raise your hand and ask an invigilator if you need it.

Examination Information
- There are 4 pages to this examination.
- There are 2 hours to complete the examination.
- Answer all 3 questions.
- The number in brackets [ ] indicates the marks available for each question or part question.
- You are reminded of the need for clear presentation in your answers.
- The total number of marks for this examination paper is 60.
- You are allowed to use annotated materials.

Question 1

(a) Facebook is an example of a massively connected social media platform, generating huge volumes of data. Give an example scenario where Facebook may batch process some data, and an example scenario where Facebook may need to process data in real-time.
[2 marks]

(b) There are several big data platforms available with different characteristics, and choosing the right platform requires in-depth knowledge of the capabilities of these platforms. You need to decide which platform to choose and therefore you investigate what your application's needs are. Give two fundamental issues that you will consider before making the right decision.
[2 marks]

(c) Data volume is predicted to grow at an enormous rate, with some studies predicting a 10-fold growth in world data by 2025. Give two reasons - with real-world examples - why this trend is occurring.
[2 marks]

(d) State the similarities and differences between traditional computing clusters and the computing clouds launched in recent years, considering the technical and economic aspects as listed below:
• Hardware, software, and technical support.
• Resource allocation and provisioning methods.
• Infrastructure management and protection.
• Support of utility computing services.
[8 marks]

(e) You are designing an application that requires both data acquisition and pre-processing of raw data for event filtering. Moreover, you have the freedom to specify the underlying hardware used to perform the pre-processing. Which hardware architecture would you choose for such an application? Justify your answer.
[3 marks]

(f) How does specialist hardware deployment and the use of a technology like Apache Storm compare to the more traditional MapReduce solution?
[3 marks]

[Question 1 Total: 20 marks]

Question 2

(a) Self-driving vehicles are a technology that is rapidly moving towards mass-market production. Give examples of how a self-driving vehicle relates to the 5 Vs of Big Data (Volume, Velocity, Variety, Veracity, Value).
[5 marks]

(b) The Hadoop Distributed File System (HDFS) is a popular storage mechanism for large quantities of data. Explain how HDFS ensures the fault-tolerance of data stored on its data nodes.
[2 marks]

(c) Containers are used in Hadoop V2. They are viewed as the Virtual Machine killer. Compare containers and Virtual Machines using three criteria of your choice.
[3 marks]

(d) The original Hadoop's MapReduce is used to process large sets of data on a large number of collective servers. However, it often performs poorly when too many servers are involved, e.g. running 40K concurrent tasks over 4K servers. Clearly explain why such poor performance occurs. Outline a possible mitigation strategy.
[5 marks]

(e) Apache Storm is an example of a Continuous Operator Model (COM) system, used to process streaming data. Explain how Apache Storm guarantees that all data emitted by its spouts will be processed.
[3 marks]

(f) Discuss two disadvantages of using Apache Storm to process streamed data.
[2 marks]

[Question 2 Total: 20 marks]

Question 3

(a) Apache Spark is one of the most popular Big Data Systems in today's industry. Discuss two advantages that Spark offers over the more traditional Apache Hadoop framework, and explain why these advantages are significant. Explain why Hadoop is still useful, and give an example of how Hadoop could still be used.
[5 marks]

(b) Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Explain the concepts of both source-based and target-based deduplication. Discuss an advantage and a disadvantage to each approach in the context of Cloud Computing.
[5 marks]

(c) NoSQL is a broad class of database management systems which do not use a relational database management model. Discuss two advantages and two disadvantages of using NoSQL in the context of a big data system. Give an example scenario where use of a NoSQL database would be appropriate.
[5 marks]

(d) Neo4j is an example of a NoSQL Graph database. Use an example to explain what type of application a Graph database is suitable for. Discuss two advantages and two disadvantages of graph databases.
[5 marks]

[Question 3 Total: 20 marks]

[Grand Total: 60 marks]