Cardiff School of Computer Science and Informatics Coursework Assessment Pro-forma
This assignment is worth 60% of the total marks available for this module. The penalty for late or non-submission is an award of zero marks.
Your submission must include the official Coursework Submission Cover sheet, which can be found here:
https://docs.cs.cf.ac.uk/downloads/coursework/Coversheet.pdf
Submission Instructions
Your coursework program -- a Python script -- should be submitted by logging into the coursework testbed at https://egeria.cs.cf.ac.uk/ before 9.30am on the submission date. Include your student number (but not your name) as a comment at the top of your script, followed by any notes (also as comments) regarding your submission. For instance, state here if your program does not generate the required output or generates it incorrectly. (This applies only to Q1 and Q2.)
Description | Requirement | Type | Name |
Cover sheet | Compulsory | One PDF (.pdf) file | [student number].pdf |
Q1 and Q2 | Compulsory | One Python (.py) file | Q1_Q2_[student number].py |
Q3 and Q4 | Compulsory | Two Python (.py) files | Q3_[student number].py Q4_[student number].py |
Readme.txt | Optional | One text (.txt) file | Readme_[student_number].txt |
Any deviation from the submission instructions above (including the number and types of files submitted) may result in a mark of zero for the assessment or question part.
For Questions 1 and 2, follow the instructions below:
Download the following files from Learning Central:
cmt115coursework_template.py
anagram.txt
credit_cards.txt
cmt115coursework_template.py contains two functions to be implemented, one for each of the two tasks.
The Python script also contains code for testing; the two txt files are data sets to be used for testing your functions.
To test your implementation using the given test files: make sure all the test files are in the same directory as the Python file before you run the script.
To test your implementation using your own test files: if your test files are called, for instance, anagram.txt and credit_cards.txt, run the Python script (assuming it is still called cmt115coursework_template.py) from the command line (Command Prompt on Windows or Terminal on Mac) like this:
python cmt115coursework_template.py anagram.txt credit_cards.txt
For Questions 3 and 4, follow the instructions below:
Download the following files from Learning Central:
good_csv.csv
bad_csv.csv and bad_csv_fixed.csv
bad_csv2.csv and bad_csv2_fixed.csv
welsh_exports.csv
Q3.py
Q4.py
Q3.py contains template code for question 3. Q4.py contains template code for question 4.
Complete the code in each file to solve the corresponding question. Make sure to rename the files according to the table above.
Assignment
Answer all of the following questions:
Question 1 – Anagram (Total 10 Marks)
You need to create a function that accepts two variables (A, B). A and B are called anagrams if they contain exactly the same characters with the same frequencies. For example, the anagrams of DOG are DOG, ODG, GOD, DGO, OGD and GDO. If A and B are anagrams, the function prints “Anagrams”; otherwise, it prints “Not anagrams”.
Constraints: The comparison should NOT be case sensitive.
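The definition above (same characters, same frequencies, case ignored) can be sketched as a comparison of character counts. This is only one possible approach, not the required one; the function names anagram_status and print_anagram_status are hypothetical and do not come from the template, and whitespace is not given any special treatment because the question does not mention it.

```python
from collections import Counter


def anagram_status(a, b):
    """Return "Anagrams" if a and b have the same characters with the
    same frequencies (ignoring case), otherwise "Not anagrams"."""
    # Counter maps each character to how often it occurs; two strings are
    # anagrams exactly when these mappings are equal.
    same = Counter(str(a).lower()) == Counter(str(b).lower())
    return "Anagrams" if same else "Not anagrams"


def print_anagram_status(a, b):
    """Print the result, as the question asks (hypothetical wrapper)."""
    print(anagram_status(a, b))


print_anagram_status("DOG", "god")   # prints: Anagrams
print_anagram_status("DOG", "dogs")  # prints: Not anagrams
```

Comparing sorted lowercase strings would work equally well; counting is used here because it mirrors the wording "same frequencies".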
Question 2 – Credit Card (Total 10 Marks)
You work in a bank, and part of your job is to develop tools that make your colleagues’ work easier. Your co-worker John works in the credit card department and has received N credit card numbers that need to be validated against certain criteria. Your task is to develop a function that runs through these credit card numbers and prints, all at once, the list of credit card numbers with their validation status. (Hint: you can use regular expressions.)
Constraints: The input may be a flat list or a nested list.
Input:
[378282246310005, 30569309025904, 6011111111111117, 5123-2332-3232-3213] OR
[378282246310005, [30569309025904, 6011111111111117], 5123-2332-3232-3213]
Example of the output:
378282246310005 Invalid
30569309025904 Invalid
6011111111111117 Invalid
5123-2332-3232-3213 valid
The criteria are:
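It must start with a 4, 5 or 6.
It must contain exactly 16 digits.
It must consist of digits only (0-9).
It may have its digits in groups of 4, separated by a single hyphen “-”.
It must not contain any other characters or separators.
It must NOT have 4 or more consecutive repeated digits (this is why 6011111111111117 is invalid in the example above).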
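The criteria above can be sketched with two regular expressions, as the hint suggests. This is an illustrative sketch under several assumptions, not a reference solution: the names flatten, is_valid_card and validate_cards are hypothetical, "repeated digits" is read as four or more consecutive identical digits (the reading consistent with the example output), hyphens are removed before checking for repeats, a number must be either fully hyphenated or not hyphenated at all, and the status strings copy the capitalisation of the example output.

```python
import re

# Either 16 digits in a row, or four groups of 4 digits joined by single
# hyphens; in both cases the first digit must be 4, 5 or 6.
CARD_PATTERN = re.compile(r"[456]\d{15}|[456]\d{3}(?:-\d{4}){3}")
# A digit followed by three more copies of itself (4 consecutive repeats).
REPEAT_PATTERN = re.compile(r"(\d)\1{3}")


def flatten(items):
    """Yield the elements of a (possibly nested) list in order."""
    for item in items:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield item


def is_valid_card(card):
    """Return True if the card number meets all the criteria."""
    text = str(card)  # accept both int and str entries
    if not CARD_PATTERN.fullmatch(text):
        return False
    # Assumption: repeats are checked on the digits alone, so a run that
    # spans a hyphen (e.g. "33-33") also counts as repeated.
    return REPEAT_PATTERN.search(text.replace("-", "")) is None


def validate_cards(cards):
    """Print every card with its status in one go and return the lines."""
    lines = [
        f"{card} {'valid' if is_valid_card(card) else 'Invalid'}"
        for card in flatten(cards)
    ]
    print("\n".join(lines))
    return lines


validate_cards(
    [378282246310005, [30569309025904, 6011111111111117], "5123-2332-3232-3213"]
)
```

Note that the hyphenated number is passed as a string here: written without quotes in Python source, 5123-2332-3232-3213 would be evaluated as a subtraction.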
Question 3 – Processing messy input data (Total 20 marks)
You work as a data analyst for a company. Recently, you started receiving csv files that are corrupted and cannot be read in Pandas using pd.read_csv. The reason is that the number of columns is not consistent in the csv file. To illustrate this, compare the three example files available on Learning Central:
good_csv.csv: this file is not corrupted. There is missing data, but every line contains 5 columns so it can be perfectly read.
bad_csv.csv: this file is corrupted. Line 1 contains 4 columns, lines 2 and 3 contain 5 columns, and line 4 contains 3 columns. We can infer that the correct number of columns is 5.
bad_csv2.csv: this file is corrupted. Lines 1, 2 and 4 contain 5 columns, but line 3 contains 10 columns. We can infer that the correct number of columns is 5.
bad_csv_fixed.csv and bad_csv2_fixed.csv: these are the fixed versions of the above files. You can compare your solution against these files.
Write a function fix_messy_data starting from the template in the file Q3.py. It takes a corrupted input file and writes out a fixed version of the file; if the input file is not corrupted, it should be written out unchanged. Given a filename, the function equalizes the number of columns in each line of the file:
If there are too few columns in a line, append empty data.
If there are too many columns in a line, delete the extra columns.
The number of columns in a line is equal to the number of commas ( , ) plus 1.
You can estimate the correct number of columns by finding out which number of columns appears most often in the corrupted file. If there is a tie, any choice is permitted.
Your program then needs to save the fixed file under the name of the input csv file with ‘_fixed’ appended before the extension (e.g. bad_csv.csv becomes bad_csv_fixed.csv).
You can apply your implementation (e.g. on the bad_csv.csv file) by typing “python Q3.py bad_csv.csv” on the command line.
Note: All csv files are assumed to come without a header. Your program needs to work for files with any number of lines and any number of columns. Use Python’s file input/output functions for the file processing. Pandas functions and modules that have not been covered in the lectures or labs are not allowed.
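The steps described above (count columns per line, take the most frequent count, pad or truncate each line, save under a ‘_fixed’ name) can be sketched with plain file input/output. This is a minimal sketch under stated assumptions rather than the expected solution: the signature and return value of fix_messy_data are guesses because the Q3.py template is not reproduced here, columns are found by splitting on every comma (so quoted commas are not handled), and ties are broken by whichever count Counter reports first.

```python
import os
from collections import Counter


def fix_messy_data(filename):
    """Equalize the number of columns per line and write '<name>_fixed<ext>'.

    Returns the name of the written file (an assumption, for convenience).
    """
    with open(filename, "r") as infile:
        text = infile.read()
    rows = [line.split(",") for line in text.splitlines()]

    fixed_lines = []
    if rows:
        # The most frequent column count is taken as the correct one.
        n_columns = Counter(len(row) for row in rows).most_common(1)[0][0]
        for row in rows:
            row = row[:n_columns]                    # drop extra columns
            row += [""] * (n_columns - len(row))     # pad with empty data
            fixed_lines.append(",".join(row))

    root, ext = os.path.splitext(filename)
    out_name = f"{root}_fixed{ext}"
    with open(out_name, "w") as outfile:
        outfile.write("\n".join(fixed_lines))
        # Keep the final newline only if the input had one, so that an
        # uncorrupted file is written out unchanged.
        if text.endswith("\n"):
            outfile.write("\n")
    return out_name
```

Calling fix_messy_data("bad_csv.csv") would then write bad_csv_fixed.csv next to the input file; the command-line handling is left to the template.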
Question 4 – Welsh exports (Total 20 marks)
The Welsh government hires you to analyse the development of Welsh exports across a number of product families. You are provided with the file welsh_exports.csv that lists Welsh exports in £million for different product families. It covers the years 2013-2018. The following screenshot shows the first 12 rows of the data. For every year, the respective first row gives the total sum for this year (rows 0, 5, and 10 in the screenshot). The following four rows break down the data into quarters 1, 2, 3 and 4.
Using Pandas and matplotlib, carry out the following analyses, starting from the template file Q4.py. Adhere to the order of the questions:
Read in the csv file. Remove the rows corresponding to year totals and print descriptive statistics across all quarters. Identify the columns that have missing data. [2 marks]
In the Quarter column, use the .apply method to remove the year information (e.g. ‘Quarter 1, 2014’ becomes ‘Quarter 1’). Once done, check for seasonal effects on exports by printing the median exports per quarter (across all years). [4 marks]
The Welsh government is interested in whether the upward trend in the export of ‘Miscellaneous Manufactured Goods’ is statistically significant. Plot the data, then perform linear regression and obtain a p-value on the slope to answer this question. Plot another figure that contains the exports for all product families as separate lines. [6 marks]
For the two quarters of the year 2018, produce a grouped bar plot that shows the exports per quarter for every product family. [4 marks]
There is a way to recover the missing data. Can you find out how? Implement a function that recovers the missing data. [4 marks]
Note: It should be possible to execute the analysis on a different computer. That is, make sure that all the necessary import commands and other preparatory steps, if any, are carried out at the beginning of the script. Make sure that no absolute paths are used by keeping all files in the same folder as the script.
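As an illustration of the .apply step and the per-quarter median mentioned above, the following sketch works on a tiny made-up data frame rather than welsh_exports.csv. Everything about the data is an assumption: the column "Product A", the numbers, and the labels of the year-total rows ("2013", "2014") are invented for the example; only the column name Quarter and the label format ‘Quarter 1, 2014’ come from the description.

```python
import pandas as pd

# Made-up data: one year-total row followed by quarter rows, per year.
df = pd.DataFrame({
    "Quarter": ["2013", "Quarter 1, 2013", "Quarter 2, 2013",
                "2014", "Quarter 1, 2014", "Quarter 2, 2014"],
    "Product A": [30.0, 10.0, 20.0, 50.0, 14.0, 36.0],
})

# Assumption: quarter rows are the ones whose label starts with "Quarter";
# everything else is treated as a year total and removed.
quarters = df[df["Quarter"].str.startswith("Quarter")].copy()

# Strip the year: 'Quarter 1, 2014' becomes 'Quarter 1'.
quarters["Quarter"] = quarters["Quarter"].apply(
    lambda label: label.split(",")[0].strip()
)

# Median exports per quarter across all years.
medians = quarters.groupby("Quarter").median(numeric_only=True)
print(medians)
```

The same pattern should carry over to the real file once the actual labels of the year-total rows have been checked.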
Learning Outcomes Assessed
Using the Python programming language to complete programming tasks
Familiarity with basic programming concepts and data structures
Reading and writing files
Preprocessing and data wrangling with Pandas dataframes
Visualisation of data
Descriptive statistics and statistical analyses
Criteria for assessment
Credit will be awarded against the following criteria; the coursework will allow students to demonstrate their knowledge and practical skills and to apply the principles taught in lectures.
For Questions 1 and 2, the functions you have implemented will be tested against different data sets. The score each function receives is judged by its functionality. A correctly working function receives full marks.
For Question 3, the code will be tested against different data sets. Additionally, for Questions 3 and 4, the efficiency and quality of the code will be part of the mark. The tables below explain the criteria.
Question 1 and 2
The following criteria are applied.
Criteria | Distinction (70-100%) | Merit (60-69%) | Pass (50-59%) | Fail (0-50%) |
Question 1 | Excellent working condition with no errors | Mostly correct. Minor errors in output | Major problem. Errors in output | Mostly wrong or hardly implemented |
Question 2 | Excellent working condition with no errors | Mostly correct. Minor errors in output | Major problem. Errors in output | Mostly wrong or hardly implemented |
Question 3
The following criteria are applied.
Question | Criteria | Distinction (70-100%) | Merit (60-69%) | Pass (50-59%) | Fail (0-50%) |
Q3 | Functionality (70%) | Fully working application that demonstrates an excellent understanding of the assignment problem using a relevant Python approach. | All required functionality is met and the application works, possibly with some minor errors. | Some of the functionality is developed, with incorrect output and major errors. | Faulty application with wrong implementation and wrong output. |
Q3 | Efficiency (15%) | Excellent performance, passing all test cases. | Good performance, but missed some test cases. | Passed some test cases, with incorrect output. | Did not pass any test case. |
Q3 | Quality (15%) | Excellent documentation with usage of docstrings and comments. | Good documentation with a few comments missing. | Fair documentation. | No comments or documentation at all. |
Question 4
This question focuses on data analysis rather than algorithm development. Hence, in addition to the correctness of the analysis, the style and elegance of the analysis and the figures receive a larger weight than for the previous questions.
Question | Criteria | Distinction (70-100%) | Merit (60-69%) | Pass (50-59%) | Fail (0-50%) |
Q4 | Functionality (60%) | Fully correct analyses, algorithms, and figures that demonstrate an excellent understanding of the assignment problem using a relevant Pandas approach. | Correct analyses, algorithms, and figures, possibly with some minor mistakes. | Only some of the analyses, algorithms, and figures are developed, and major errors occur. | Wrong implementation and incorrect or missing analyses, algorithms, and figures. |
Q4 | Quality (40%) | Figures are elegant and show an excellent understanding of visualisation principles including tick marks, labels, colouring, and titles. Excellent documentation with usage of docstrings and comments. | Figures show a good understanding of visualisation principles. Good documentation with a few comments missing. | Figures show a basic understanding of visualisation principles. Fair documentation. | Missing figures. No comments or documentation at all. |
Feedback and suggestions for future learning
Feedback on your coursework will address the above criteria. Feedback and marks will be returned within 4 weeks of your submission date via Learning Central. If you require further details, you are welcome to schedule a one-to-one meeting.