The Australian National University Research School of Computer Science, CECS
COMP3430 – Data Wrangling - 2019
Record linkage project Due 11:55pm Sunday 20 October 2019
Worth 20% of the final grade for COMP3430
Draft – Last update September 11, 2019
Overview and Objectives
For this project you will take another look at the record linkage program that you developed in the lab sessions.
Specifically, we provide you with two new data sets and ask you to work with the programs we have developed in the labs,
and report on your findings. As with the previous assessments, the emphasis is on your understanding, descriptions, and
justification as much as the raw (numerical) record linkage evaluation results that you are able to achieve.
Important
• Submit one zip archive file, named uNNNNNNN_record_linkage_project.zip, where uNNNNNNN is your ANU
ID. For example, if your ANU ID is u1234567 you should submit the file u1234567_record_linkage_project.zip. Only use
underscores and not spaces, and only lower-case letters in your file name (as this will greatly help our marking efforts
– thanks). You receive a -1 mark penalty if you do not follow this naming convention.
• Make sure that your student ID is included on the first page of your submitted report. You receive a -1 mark
penalty if you do not include your student ID.
• Do NOT include your name anywhere in your submission. All marking will be done anonymously. You receive a -1 mark
penalty if you do include your name.
• The zip file must contain:
1. Your report, a .pdf document named uNNNNNNN_record_linkage_project_report.pdf
2. Your output file for the best linkage results you were able to obtain (see task 3 below), a .csv file named
uNNNNNNN_record_linkage_project_result.csv
• The allowed total maximum length of your report is four (4) A4 pages (single pages, not 4 double pages!) and around
1,500 words. We expect you to use at least 12 point font size with a standard font (such as Times New Roman
or Liberation Serif) for all text in your submitted report. We encourage you to use larger font size or bold font for titles,
section headers, etc. Include the total word-count of your report on the first page of your report.
The 4 page maximum length includes all figures, tables, references and appendices.
• Your submitted report does not need to have a cover page.
• Word documents or any other formats besides PDF are not accepted and will not be marked.
• Hand-written submissions are not accepted and will not be marked.
• Make sure you submit the final version of your project before the submission deadline.
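The file naming convention above can be checked before uploading. The following is a minimal sketch only, assuming an ANU ID is the lower-case letter u followed by seven digits (as in the example u1234567); the function name is_valid_zip_name is hypothetical and not part of any course material.

```python
import re

# Assumption: an ANU ID is a lower-case "u" followed by exactly seven digits.
ZIP_NAME = re.compile(r"u[0-9]{7}_record_linkage_project\.zip")


def is_valid_zip_name(file_name):
    """Return True if file_name follows the required naming convention."""
    return ZIP_NAME.fullmatch(file_name) is not None


# Underscores and lower-case letters only: accepted.
ok_name = is_valid_zip_name("u1234567_record_linkage_project.zip")
# Spaces instead of underscores: rejected.
spaces_name = is_valid_zip_name("u1234567 record linkage project.zip")
# Upper-case letters: rejected.
upper_name = is_valid_zip_name("U1234567_Record_Linkage_Project.zip")

print(ok_name, spaces_name, upper_name)
```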
Submission
Submission will be done using Wattle. Click on the link COMP3430 record linkage project submission (to be made
available in week 11) to upload your ZIP file.
You may submit as many draft versions of your project as you wish. However, you must make sure you submit a final
version before the submission deadline. We will mark the final version present at the due date. Note that
Wattle does not allow us to access earlier submitted versions of your project, therefore check carefully what
you submit as the final version!
We cannot accept submissions via email.
Penalties
The following will attract penalties:
-1 mark if you do not follow the file naming convention discussed above.
-1 mark if you do not include your student ID on the first page of your submitted report.
-1 mark if you do include your name in your submitted report.
-1 mark for every page over the maximum 4 page limit (so a 6 page report will attract a -2 penalty).
-1 mark if you use a font size smaller than 12 points, or a difficult to read font type.
Deadlines, Extensions and Late Submissions
The record linkage project is due 11:55pm, Sunday 20 October 2019.
Students will only be granted an extension on the submission deadline in extenuating circumstances, as defined
by ANU policy (http://www.anu.edu.au/students/program-administration/assessments-exams/deferred-examinations).
If you think you have grounds for an extension, you must notify the course convener as soon as possible and
provide written evidence in support of your case (such as a medical certificate). The course convener will then decide
whether to grant an extension and inform you as soon as practical.
In accordance with the CECS and ANU late submission policy, no late submissions will be accepted, except where an
extension has been approved by the course convener.
Plagiarism
No group work is permitted for this project.
We do encourage you to discuss your work, but we expect you to do the project work by yourself. If you are
unsure about what constitutes plagiarism, make sure you carefully read the ANU Academic Honesty Policy
(http://academichonesty.anu.edu.au/).
If you do include ideas or material from other sources, you must clearly attribute them by providing a reference to
the material or source in your submitted project report. We do not require a specific referencing format, as long as you are
consistent and your references allow us to find the source, should we need to while we are marking your report.
Marking
This project will be marked out of 20, and it will count for 20% of your final course mark.
Note that not all project tasks are equally difficult. For some of the tasks there is no single right or wrong answer. Marks
will be awarded based on your reasoning and the justification of your decisions and explanations, as well as clarity and
correctness of writing.
We will endeavour to release your marks and feedback within two teaching weeks after the submission deadline. If you feel
we have made an error in marking, you have two weeks following the release of marks to raise any issues with the course
convener, after which time your mark will be considered final. If you request that we re-mark your project, we will
re-mark the entire project and your mark may go up or down as a result.
Project Structure
This project consists of four (4) tasks as described below which can be worth different numbers of marks. Make sure you
answer all aspects of each task.
If you have any questions on the project please post them on Wattle – however do not post any partial solutions,
program codes, equations, calculations, URLs, etc. or any hints on how to solve any of the project tasks.
Project Tasks
For this project, we provide you with two new data sets, dataset_A.csv and dataset_B.csv, as well as a truth
data set true_matches.csv, available for download from Wattle in week 8.
The tasks for this project are similar to what you had to do in lab 7 in week 9. You are required to run your record linkage
program (including any modifications you have made to this program) on the two data sets provided, and write a report
which addresses the following questions:
1. Blocking (6 marks):
• How does blocking affect your results? Specifically, describe your choice of blocking method and choice of blocking
keys. Discuss which attributes and/or attribute combination(s) in the given data sets were useful as blocking keys
and which were not, and why.
• If there is a trade-off between performance (reduction ratio, pairs completeness and pairs quality) and the quality of
the final record linkage results, where do you think the optimal balance is, and why?
• Do you think this trade-off would change on different data sets with different levels and characteristics of data quality?
If so, how and why?
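The three blocking measures named above can be computed from the set of candidate record pairs produced by blocking and the set of true matches. The following is a minimal sketch using the standard definitions, not the code used in the labs; the function name blocking_measures and the toy record identifiers are assumptions made for illustration.

```python
def blocking_measures(num_rec_a, num_rec_b, candidate_pairs, true_matches):
    """Return (reduction ratio, pairs completeness, pairs quality).

    candidate_pairs and true_matches are sets of (id_a, id_b) tuples.
    """
    total_pairs = num_rec_a * num_rec_b  # Full comparison space without blocking
    true_in_candidates = len(candidate_pairs & true_matches)

    # Reduction ratio: fraction of the comparison space removed by blocking.
    rr = 1.0 - len(candidate_pairs) / total_pairs
    # Pairs completeness: fraction of true matches that survive blocking.
    pc = true_in_candidates / len(true_matches) if true_matches else 0.0
    # Pairs quality: fraction of candidate pairs that are true matches.
    pq = true_in_candidates / len(candidate_pairs) if candidate_pairs else 0.0
    return rr, pc, pq


# Toy example (hypothetical identifiers): 4 records in each data set.
cands = {("a1", "b1"), ("a2", "b2"), ("a3", "b4"), ("a4", "b3")}
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

rr, pc, pq = blocking_measures(4, 4, cands, truth)
print("RR=%.3f PC=%.3f PQ=%.3f" % (rr, pc, pq))
```

In this toy example blocking removes 12 of the 16 possible pairs but loses one of the three true matches, which is the kind of trade-off the questions above ask you to discuss.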
2. Comparison and Classification (6 marks):
• How do different comparison techniques affect linkage results? Discuss and justify how you selected appropriate
comparison functions for different attributes, and why these selected functions are suitable while others were not.
• How do different classification techniques using different parameter settings affect linkage quality? Discuss and justify
how you selected an appropriate classification function to obtain high linkage quality.
• Using suitable linkage quality measures, as discussed in the lectures in week 8, describe how the final record linkage
quality changes with the choice of parameters and techniques.
• Is the record linkage quality particularly sensitive to certain parameters or choice of comparison or classification
techniques? If so, why is this the case?
• Provide the numerical linkage evaluation results for other (not optimal, see below) parameter settings that you have
used (you only have to provide the output file for your best obtained linkage results – see next task). Ideally you
include tables or plots to show linkage quality results for different parameter settings.
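One way to produce such a table is to evaluate the classified matches against the true matches for several parameter settings. The sketch below assumes that precision, recall and F-measure are among the quality measures of interest (the lectures may cover others), and that classification is a simple similarity threshold; the function name, the toy similarity values and the thresholds are all hypothetical.

```python
def linkage_quality(classified_matches, true_matches):
    """Return (precision, recall, F-measure) for sets of (id_a, id_b) pairs."""
    tp = len(classified_matches & true_matches)  # Correctly classified matches
    fp = len(classified_matches - true_matches)  # Wrongly classified as matches
    fn = len(true_matches - classified_matches)  # True matches that were missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy similarity values per compared record pair (made-up numbers).
sims = {("a1", "b1"): 0.95, ("a2", "b2"): 0.70,
        ("a3", "b4"): 0.65, ("a4", "b4"): 0.40}
truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

# Threshold classification: a pair is a match if its similarity >= threshold.
results = {}
for threshold in (0.5, 0.8):
    matches = {pair for pair, sim in sims.items() if sim >= threshold}
    results[threshold] = linkage_quality(matches, truth)
    print("t=%.1f P=%.3f R=%.3f F=%.3f" % ((threshold,) + results[threshold]))
```

In this toy data raising the threshold from 0.5 to 0.8 increases precision and lowers recall, so one row per setting shows how sensitive the linkage quality is to the parameter.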
3. Optimal Settings (4 marks):
• What is the best linkage quality result you are able to achieve, both in the blocking and the classification steps?
Why do you think this combination of parameters and techniques works well?
• Are the results good for all evaluation measures discussed in the lectures in week 8, or only for some? If the results
are good only for some measures, why do you think the results are not good for other measures?
In addition to answering this task in your report, you must also submit the output file which contains
the linked and classified matching record pairs (as a CSV file) for the best linkage result you were able to
obtain.
Use the Python program saveLinkResult.py which we use in lab 7 to write linkage output into a file. Your
submitted output file must exactly follow this CSV file format! We will use a program to check linkage quality
using this file to validate what you write in your report. If our program does not work with your submitted file because
it does not follow the required file structure, you will lose marks.
4. Data Quality (4 marks):
• How dirty are these new data sets compared to all the data sets you have worked with in labs 3 to 7? Describe your
impression after having conducted the linkage project.
• How can you determine this? Describe the methodology you used to assess the data quality of the data sets we
provided for this project (such as any calculations you used, or how you determined the data quality using data
exploration and profiling).
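As one possible starting point for such profiling, the fraction of missing values and the number of distinct values can be computed per attribute. This is a minimal sketch only: the attribute names, the toy records and the list of strings treated as missing are assumptions, and the real data sets may use different columns and missing-value markers.

```python
import csv
import io


def profile(rows, missing_markers=("", "n/a", "unknown")):
    """Return {attribute: (fraction missing, number of distinct values)}."""
    rows = list(rows)
    if not rows:
        return {}
    report = {}
    for attr in rows[0]:
        # Normalise values; a missing trailing field is read as None.
        values = [(row.get(attr) or "").strip().lower() for row in rows]
        present = [val for val in values if val not in missing_markers]
        missing_fraction = 1.0 - len(present) / len(rows)
        report[attr] = (missing_fraction, len(set(present)))
    return report


# Toy data with hypothetical attribute names; in practice, open the CSV file.
toy_csv = """rec_id,first_name,postcode
a1,john,2600
a2,,2601
a3,jon,n/a
a4,john,2600
"""

report = profile(csv.DictReader(io.StringIO(toy_csv)))
for attr, (missing_fraction, num_distinct) in report.items():
    print("%-10s missing=%.2f distinct=%d" % (attr, missing_fraction, num_distinct))
```

Comparing such per-attribute figures between the project data sets and the lab data sets gives a simple, reportable basis for statements about relative data quality.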
Visualisations: You should use appropriate data visualisations such as tables, plots, etc. in your answers to the above
tasks. Marks will be awarded for good visualisations and appropriate tables.
Assume you are presenting your record linkage project to an audience without a strong technical background, so make sure
you adequately explain any visualisations you use (i.e. describe what tables and figures show and interpret the content of
the obtained results).