7CCSMBDT – Big Data Technologies Coursework 1


Coursework assigned: 7 February 2020.
Coursework submission deadline: 4:00pm, 21 February 2020.
Late submission deadline (capped at 50%): 4:00pm, 22 February 2020.

Overview: The coursework aims to make you familiar with the following concepts: (i)
Big Data characteristics and analytics, (ii) Big Data collection, and (iii) programming
using the MapReduce framework.
This coursework is formally assessed and is worth 10% of your final mark.
You will receive feedback, as part of the marking of the coursework, 4 weeks after
the coursework submission deadline.

Submission: Include BOTH files below:
(i) A file, Coursework1.PDF, containing your answers. For tasks that require writing
code, write your code as part of the answer. For tasks that require showing output of a
program, show the output or part of the output if the file is large.
(ii) A file, Coursework1_code.ZIP, containing, for each program, the code of the
program (.py file) and a file containing the entire output of applying the program to the
required dataset. Name the code and output to indicate the task it corresponds to (e.g.,
task3.py for the code and task3.out for the output of Task 3).
Evaluation: The maximum number of marks (out of 100) for each task is given in
square brackets [] next to each question.

Plagiarism: “Plagiarism is passing off someone else’s work as your own, or submitting
a piece of your own work that you have already submitted as part of a different
programme, module or at a different institution. The penalties for plagiarising by the
College can be severe. Uploading work to KEATS is regarded by the Department as a
statement by the student concerned, confirming that the work has not been plagiarised.”

Late submission: "If you are submitting your coursework after the deadline, you must
submit a Mitigating Circumstances Form (MCF) to your Programme Administrator, with
evidence to justify why you have not submitted on time. If you do not do this or your
reasons are not acceptable, your coursework may be given a mark of zero." Please
speak to your personal tutor about the MCF. Lecturers have no control over submission
deadlines and cannot provide extensions.


Task 1. Big Data characteristics
(a) Why can data from the transportation domain be classified as Big Data? Justify your
answer by referring to the 5Vs (characteristics) of Big Data. [10]
(b) Describe the challenges entailed by each characteristic of Task 1(a). [15]
Note: Refer to lecture 1 for a discussion of the characteristics and an example of a
Big Data domain (the game industry).

Task 2. Big Data collection using Apache Sqoop.
(a) Discuss what happens when the following command is executed:
sqoop export --connect jdbc:mysql://localhost/hadoop --username U
--password P --table mytable --export-dir /user/hive/warehouse/mytable -m 1
--input-fields-terminated-by '\001'
Your answer should explain step by step how the database table, client, and
MapReduce cluster interact during the execution of the command. [15]
(b) What are the benefits of using Apache Sqoop to import data from a database table,
managed by a Relational Data Base Management System (RDBMS), compared to a
manual solution, such as custom code that reads the data from the table, writes them
into local files, and then using commands or custom code to copy the files into HDFS?
[10]
Note: Refer to lecture 2 for details on Sqoop.

Task 3. MapReduce combiners.
Write a program task3_c1.py using mrjob, which applies a function f of your choice to a
small input file of your choice without using a combiner. Also, write a program
task3_c2.py using mrjob, which applies f to the same input file but uses a combiner. f
must be a function that is not suitable for use with a combiner.
Please comment your code appropriately to explain what each step does.
Provide the output of both programs and explain why the output of the second program
is incorrect. [25]

Note: You can use redirection (e.g., python3 myprogram.py > myoutput.txt) to get the
output. You can execute the program in local mode (i.e., without -r hadoop).


Task 4. Join in MapReduce.
Download the datasets id_age_occ.csv and id_educ_marital.csv from KEATS.

Write a Python program based on the MapReduce framework, using mrjob, which
performs a join between these two datasets.
Please comment your code appropriately to explain what each step does. Provide the
output of your program on the datasets in a file program_task4.out. Your report should
also contain a small part of program_task4.out.
[25]

Notes:
• You are asked to join the two files. Solutions that generalize to multiple files are
not needed.

• You can use two input files in the mrjob program. The following example shows
two input files and then applies wordcount.py, which counts how many times
each word appears, to both files.



[cloudera@quickstart Desktop]$ cat file1.txt
one two three
[cloudera@quickstart Desktop]$ cat file2.txt
one four five

[cloudera@quickstart Desktop]$ python3 wordcount.py file1.txt file2.txt
"five" 1
"four" 1
"one" 2
"three" 1
"two" 1


• The join attribute is the id (it is included in the files and is not something you need
to calculate). You can see its role in the join in the example output below. I
expect your program to produce the example output when given the example input.



• IMPORTANT: The order of the attributes must be maintained. That is, every
record in the joined table has the id first, then the attributes age and occupation
from id_age_occ.csv, and finally the attributes education and marital status from
id_educ_marital.csv.
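
The wordcount.py program used in the two-file example in the notes above is not reproduced in this sheet. As a rough illustration of the logic such a program follows, the sketch below simulates the map, shuffle and reduce phases using only the Python standard library. It does not use mrjob. The function names mapper, reducer and run_wordcount, and the in-memory files dictionary standing in for file1.txt and file2.txt, are assumptions made for this illustration only. The point it shows is that lines from all input files feed the same map phase, so a word occurring in both files ends up under a single key.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters, digits, underscores or apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit a (word, 1) pair for every word on one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Sum all the counts that were emitted for the same word."""
    return word, sum(counts)


def run_wordcount(files):
    """Simulate map, shuffle and reduce over several inputs.

    files is a hypothetical dict mapping a file name to its text content.
    """
    grouped = defaultdict(list)
    # Map phase: lines from every file go through the same mapper.
    for text in files.values():
        for line in text.splitlines():
            for word, one in mapper(line):
                # Shuffle: collect the values that share a key.
                grouped[word].append(one)
    # Reduce phase: one reducer call per key, keys in sorted order.
    return [reducer(word, grouped[word]) for word in sorted(grouped)]


# In-memory stand-ins for file1.txt and file2.txt from the example.
files = {
    "file1.txt": "one two three\n",
    "file2.txt": "one four five\n",
}

for word, count in run_wordcount(files):
    print(f'"{word}"\t{count}')
```

With these two inputs the sketch reports a count of 2 for "one" and 1 for every other word, in line with the example session. A real mrjob program would express the mapper and reducer as methods of a job class and leave the grouping to the framework.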


Example input:
(i) sample of id_age_occ.csv
1, 39, State-gov
2, 50, Self-emp-not-inc
3, 38, Private
4, 53, Private
(ii) sample of id_educ_marital.csv
1, Bachelors, Never-married
2, Bachelors, Married-civ-spouse
3, HS-grad, Divorced
4, 11th, Married-civ-spouse


Example output:
"1" [["39", " State-gov"], ["Bachelors", "Never-married"]]
"2" [["50", " Self-emp-not-inc"], ["Bachelors", "Married-civ-spouse"]]
"3" [["38", " Private"], ["HS-grad", "Divorced"]]
"4" [["53", " Private"], ["11th", "Married-civ-spouse"]]




[END of Coursework 1]