ISIT312/912 Big Data Management
Spring 2021
Assignment 2


All files left on Moodle in the state "Draft (not submitted)" will not be evaluated. Please refer to
the submission dropbox on Moodle for the submission due date and time.


This assessment contributes 20% of the total evaluation in the subject and consists of one task
with multiple questions. The deliverable is specified in the task.


It is a requirement that all Laboratory and Assignment tasks in this subject must be solved
individually, without any cooperation with other students. If you have any doubts or questions,
please consult your lecturer or tutor during lab classes or office hours. Plagiarism will result
in a FAIL grade being recorded for that assessment task.


The environment of implementation is the Ubuntu virtual machine which is imported from
the BigDataVM-2021v2_2.ova file. The conceptual and logical models can be built with the
UMLet tool with custom palettes. Both the VM and UMLet software are available on Moodle.










Task: Data Warehouse Design and Implementation in Apache Hive (20 marks)
The objective of this task is to design and implement a sample data warehouse in Apache Hive, which
is described in the following narrative.
A university plans to create a data warehouse to store information about the submissions of student
assignments and later on to analyse the contents of a data warehouse. It is expected that the planned
data warehouse will contain historical information collected over a long period of time.
This data warehouse will contain information about assignment submissions (abbreviated as
“submissions” hereafter), assignments, subjects, students, and degrees.
The following relationships exist between the above domain entities: Each submission belongs to one
assignment and is submitted by one or more students (for individual or group submissions). Each
student is enrolled in one degree. Each assignment belongs to one subject.
A submission is described by a mark, a submission date, and a file path (which refers to a location on
HDFS).
An assignment is described by a weight (percentage), a due date and a specification file path.
A subject is described by a subject code and subject name.
A student is described by a student number, first name, last name, and email address. The student
number and the email address each separately identify a student.
The time dimension contains four levels: day, week, session (Autumn or Spring) and year.
This data warehouse should support OLAP queries, including the common aggregations about
submissions per subject, per student, per degree, per day, per week, per session, or per year.
You can make reasonable assumptions on the keys of domain entities.
Complete the following questions:
Question 1. Develop a conceptual model for the above data warehouse. The dimensions and hierarchies
must be correctly presented. (4 marks)
Question 2. Specify the OLAP operations for the following specific queries by using relational-
algebraic notations (in the slides of Lectures 4 and 5): (2 marks)
(i) “Find the average slack period (i.e., number of days between submission date and due date) for
submissions per subject and per session”
(ii) “Find the average mark for each assignment for the subject ‘ISIT312-912’ in 2017” (Hint. Use
the “DICE” operation at page 46 of Lecture 4 slides.)
Question 3. Transform your conceptual model for Question 1 into a logical model with a star schema.
Note that all level tables in a star schema are flattened, i.e., denormalized. (3 marks)
Question 4. Create an (internal or external) Hive table (schema) for each table in your logical model
for Question 3. (4 marks)
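As an illustration only, not a model solution, a Hive table declaration for a fact table might take the following shape. All table and column names here are hypothetical; your own star schema from Question 3 determines the actual tables and columns:

```sql
-- Hypothetical fact table for submissions; names, types, and the
-- delimited-text storage format are illustrative assumptions only.
CREATE TABLE submission_fact (
  assignment_key  INT,
  student_key     INT,
  time_key        INT,
  mark            DECIMAL(5,2),
  submission_date DATE,
  file_path       STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

An external table (`CREATE EXTERNAL TABLE ... LOCATION '...'`) differs only in that dropping it leaves the underlying HDFS files in place.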
Question 5. Populate the Hive tables for Question 4 with some sample data. More specifically, create
a file containing a few (e.g., three) sample records, which are determined by yourself, for each table in
the local system, and then load those files into Hive. Once done, use HQL to show all data. (5 marks)
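The loading step can be done with Hive's `LOAD DATA` statement. The sketch below assumes a hypothetical table `submission_fact` and a hypothetical local file path; substitute your own tables and files:

```sql
-- Load a local CSV of sample records into a Hive table, then
-- display all rows. Both the path and the table name are
-- hypothetical placeholders.
LOAD DATA LOCAL INPATH '/home/bigdata/submission_fact.csv'
INTO TABLE submission_fact;

SELECT * FROM submission_fact;
```

Omitting the `LOCAL` keyword makes Hive read the path from HDFS instead of the local file system.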
Question 6. Implement the OLAP operations for Question 2 as HQL statements on the Hive tables for
Question 5. (2 marks)
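For orientation, an HQL aggregation over a star schema typically joins the fact table to its dimension tables and groups by the requested levels. The sketch below shows the general shape for query (i), using Hive's built-in `datediff()` (number of days between two dates); every table and column name is a hypothetical assumption, not the required answer:

```sql
-- Average slack period per subject and per session: the general
-- shape only. Table and column names are hypothetical.
SELECT sub.subject_code,
       tim.session_name,
       AVG(datediff(asg.due_date, fct.submission_date)) AS avg_slack_days
FROM submission_fact fct
JOIN assignment_dim asg ON fct.assignment_key = asg.assignment_key
JOIN subject_dim    sub ON asg.subject_key    = sub.subject_key
JOIN time_dim       tim ON fct.time_key       = tim.time_key
GROUP BY sub.subject_code, tim.session_name;
```

Query (ii), the DICE operation, corresponds to adding a `WHERE` clause that fixes the subject and year before grouping by assignment.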
Deliverable:
One PDF document to be submitted, which contains your solutions to the above questions, including
commands that you use in the Terminal and/or Zeppelin. Screenshots can be used to support your
solutions. For Question 5, your solution must show the sample data in Hive. For Question 6, the
correctness of the HQL statements is assessed, but the returned results of those statements are not
assessed, as they are determined by your sample data.

