COM6012 2025 Assignment
Deadline: 13:00 Thursday 08 May 2025
Please carefully read the assignment brief before starting to complete the assignment.
Release Status:
Q1 - 10 marks
Q2 - 9 marks (updated on 27.04.2025 to clarify the naming and number of medication features)
Q3 - 10 marks
Q4 - 10 marks
An FAQ (last updated 03.04.2025) will be extended whenever questions raise points that need
clarification or useful tips.
Assignment Brief
How and what to submit
A. Create a folder YOUR_USERNAME-COM6012 containing the following:
1) AS_report.pdf: A report in PDF containing answers (including all figures and
tables) to ALL questions at the root of the zipped folder (like readme.txt in the lab
solutions). If an answer to a question is not found in this PDF file, you will lose the
respective mark. The report should be concise. You may include appendices/references
for additional information, but marking will focus on the main body of the report.
2) Code, script, and output files (see the sample solutions to the lab exercises as
examples): All files used to generate the answers for individual questions in the report
above, except the data, should be included. These files should be named properly
starting with the question number (separate files for each question): for example,
your Python code as Q1_code.py and Q2_code.py, your HPC script as Q1_script.sh
and Q2_script.sh, and your output files on HPC as Q1_output.txt and Q2_output.txt
(and Q1_figC1.jpg, etc.). The results must be generated from the HPC, not your local
machine. Figures must be created by Python code. A penalty of 25% will be applied for
each of these files that is missing. Double-check that these files are included by
downloading the zipped file on another machine and opening it to verify.
B. When you have finished ALL the questions, zip your folder YOUR_USERNAME-COM6012 to
include the above (one single report plus code, script, and output files for all questions, properly
named) and upload this YOUR_USERNAME-COM6012.zip file to Blackboard before the
deadline.
C. NO DATA UPLOAD: Please do not upload the data files used. Instead, use the relative file
path in your code, assuming data files are downloaded (and unzipped if needed) under the
folder ‘Data’, as in the lab.
D. Code and output: 1) Use PySpark 3.5.4 and Python 3.12 as covered in the lecture and lab
sessions to complete the tasks; 2) Submit your PySpark job to HPC with sbatch to obtain the
output.
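The packaging rules in A to C can be checked mechanically before uploading. The following is a minimal sketch, not an official checker. It assumes that every question listed in the Release Status (Q1 to Q4) has a code, script, and output file named after the Q1/Q2 examples, and that all of them sit at the root of the submission folder. The helper names missing_files and zip_submission are hypothetical, and figure files such as Q1_figC1.jpg are not checked.

```python
from pathlib import Path
import zipfile

# Assumption: one code, script, and output file per question, named after the
# examples in item A.2 (Q1_code.py, Q1_script.sh, Q1_output.txt).
QUESTIONS = ("Q1", "Q2", "Q3", "Q4")
SUFFIXES = ("_code.py", "_script.sh", "_output.txt")


def missing_files(folder: Path) -> list[str]:
    """Return the expected file names that are absent from the folder root."""
    expected = ["AS_report.pdf"]
    expected += [q + s for q in QUESTIONS for s in SUFFIXES]
    return [name for name in expected if not (folder / name).is_file()]


def zip_submission(folder: Path) -> Path:
    """Write <folder name>.zip next to the folder, leaving out any Data subfolder."""
    archive = folder.parent / (folder.name + ".zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            rel = path.relative_to(folder)
            # Skip data files (item C: no data upload) and directories.
            if path.is_file() and rel.parts[0] != "Data":
                # Keep the top-level folder name inside the archive.
                zf.write(path, (Path(folder.name) / rel).as_posix())
    return archive
```

A possible use is to call missing_files(Path("YOUR_USERNAME-COM6012")) first and only call zip_submission once the returned list is empty. Opening the resulting zip on another machine, as item A.2 asks, remains the final check.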
Assessment Criteria (Scope: Sessions 1 to 8; Total: 39 marks)
1. Being able to use PySpark to analyse big data to answer data analytic questions.
2. Being able to perform tasks covered in Sessions 1 to 8 on large-scale data.
3. Being able to make useful observations and explain obtained results clearly.
Late submissions: We follow the Department's guidelines about late submissions, i.e., “If you
submit work to be marked after the deadline you will incur a deduction of 5% of the mark each
working day the work is late after the deadline, up to a maximum of 5 working days”. However,
NO late submission will be marked after the maximum of 5 working days, because we will release
a solution by then. Please see this link.
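As a rough illustration of the arithmetic in the late-submission rule, the sketch below applies a 5% deduction per working day late. It rests on two assumptions that the Department's guidelines should be checked against: the deduction is taken from the original mark each day (not compounded), and an unmarked submission is represented as a mark of 0. The function name mark_after_late_penalty is hypothetical.

```python
MAX_LATE_WORKING_DAYS = 5   # after this, the submission is not marked
DEDUCTION_PER_DAY = 0.05    # 5% of the mark per working day late


def mark_after_late_penalty(raw_mark: float, working_days_late: int) -> float:
    """Return the mark after the late deduction (assumed non-compounding)."""
    if working_days_late < 0:
        raise ValueError("working_days_late must be non-negative")
    if working_days_late > MAX_LATE_WORKING_DAYS:
        # Assumption: 'not marked' is treated as a mark of 0.
        return 0.0
    return raw_mark * (1 - DEDUCTION_PER_DAY * working_days_late)
```

Under these assumptions, a raw mark of 30 out of 39 submitted two working days late would lose 10% of 30.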
Use of unfair means: "Any form of unfair means is treated as a serious academic offence and
action may be taken under the Discipline Regulations." (from the MSc Handbook). Please carefully
read this link on what constitutes Unfair Means if you are not sure.
Note: This assignment is for internal students (COM6012). External students (COM6012s) will be
assessed by exam only.