Coursework 2: Recommendations
CO558 Python Programming (Autumn 2020/21)
Deadline: Friday 13th Nov 2020 (7pm GMT) on CATe
Authorship credits: Luca Grillotti, Josiah Wang, Oana Cocarascu

1 Problem description

In this coursework you will be analysing two types of recommendations:

• Part 1: The first type of recommendation relates to route planning. You will be working with London Tube stations, computing the shortest path between two stations as well as how much a user has to pay for their journeys.

• Part 2: The second part relates to sentiment analysis. This can be seen as a sub-task of decision making based on online reviews. As online reviews have become the chosen method of quality control for users in various domains, they now influence customers' decisions both positively and negatively.

To aid you in developing your program, you are supplied with skeleton code.

Important:

• Do not modify any constructors.
• Do not import any packages for Part 1 other than those from the Python Standard Library.
• The files tubemap.py, journey.py, and user.py contain example code showing how to call and run the functions in each file. For example, to run user.py, you can execute the following command from the root (i.e. cloned) directory: python3 -m tubemap.user
• Similarly, you can run the example code for Part 2, e.g. python3 -m review.dataset
• Do not modify the run() function in part2.py.

How to get the files and how to hand in

You are provided with a git repository, and you will use git to push your code changes to it. To get the code, type the following command, replacing the placeholder with your Imperial username:

git clone https://gitlab.doc.ic.ac.uk/lab2021_autumn/python_cw2_<username>.git

To submit on CATe, go to https://teaching.doc.ic.ac.uk/labts and click on the Submit to CATe button associated with the commit containing the code you wish to submit.

Part 1: London Tube Map

You are given a file data/london.json that contains data representing the London Tube map. Your task is to implement several methods related to loading and processing the data, determining the fastest route between two given stations, and computing the total price a user has to pay for their journeys.

Any journey that starts or ends between 06:30-09:30 or 17:00-20:00 is classed as peak; all other journeys are off-peak. If all of a customer's journeys on a given day are off-peak, their daily cap is £7.00; if any of their journeys are peak, their cap is £8.80. Long journeys cover more than 10 stations, and short journeys cover 10 stations or fewer. The prices are as follows:

• Peak long: £3.80
• Peak short: £2.90
• Off-peak long: £2.60
• Off-peak short: £1.90

Implement the following methods of class TubeMap in the file tubemap.py:

• import_tube_map_from_json() imports the JSON data from london.json and updates the attributes graph_tube_map and set_zones_per_station. Note that some stations belong to multiple zones (e.g. a station with "zone": "2.5" should be recorded in both zone 2 and zone 3). (10 marks)

• get_set_lines_for_station() returns a set containing the names of the lines on which a station is found. (3 marks)

• get_set_all_stations_on_line() returns the set of all the stations on a given line. (3 marks)

• get_fastest_path_between() implements Dijkstra's algorithm to find the fastest path between two given stations (a link to the pseudocode of the algorithm is given in the method's docstring). You can use float('inf') to specify infinity.
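For reference, a minimal sketch of Dijkstra's algorithm in this style (the dict-of-dicts graph below is purely illustrative and is not the actual format of graph_tube_map; follow the pseudocode linked in the docstring for the coursework version):

```python
import heapq

def fastest_path(graph, start, end):
    """Dijkstra's algorithm on graph[station] = {neighbour: minutes, ...}.

    Returns the list of stations on the fastest path, or None if unreachable.
    """
    dist = {station: float('inf') for station in graph}  # tentative distances
    dist[start] = 0
    previous = {}                 # best known predecessor of each station
    queue = [(0, start)]          # min-heap of (distance, station) pairs
    while queue:
        d, station = heapq.heappop(queue)
        if station == end:
            break
        if d > dist[station]:     # stale heap entry, skip it
            continue
        for neighbour, minutes in graph[station].items():
            candidate = d + minutes
            if candidate < dist[neighbour]:
                dist[neighbour] = candidate
                previous[neighbour] = station
                heapq.heappush(queue, (candidate, neighbour))
    if dist[end] == float('inf'):
        return None
    path = [end]                  # walk predecessors back to the start
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1]
```

For example, with graph = {"A": {"B": 1, "C": 4}, "B": {"A": 1, "C": 1}, "C": {"A": 4, "B": 1}}, fastest_path(graph, "A", "C") returns ["A", "B", "C"].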
(20 marks)

Implement the following methods of class Journey in the file journey.py:

• is_list_successive_stations_valid() returns True if all the stations given can be found in the provided tube map and all pairs of successive stations are indeed connected. (2 marks)
• is_long_journey() returns True if the journey covers more than 10 stations. (2 marks)
• is_on_peak() returns True if the journey starts or ends during peak times. (2 marks)
• compute_price() computes the price of the journey depending on whether it is peak or off-peak and whether it is a long or a short journey. (2 marks)

Implement the following methods of class User in the file user.py:

• register_journey() updates the attribute journeys_per_date given a journey's details. (3 marks)
• compute_price_per_day() computes the price of all journeys on a given day. (3 marks)

Part 2: Movie Reviews

The aims of Part 2 of the coursework are:

1. to give you practical experience in applying advanced object-oriented programming concepts: abstraction, polymorphism, inheritance, and encapsulation;
2. to allow you to practise tackling a machine learning task using numpy and sklearn.

You will find two files in data/:

• rt-polarity.pos contains examples of positive movie reviews, one per line
• rt-polarity.neg contains examples of negative movie reviews, one per line

You will need numpy and sklearn for this part of the coursework.

Dataset

The Dataset class can be found in review/dataset.py. It contains the public, read-only attributes X and y, which store the feature vectors and the labels of the samples respectively. It also contains corresponding private attributes that can only be used inside the class.

Please implement the following methods for Dataset:

• __add__() overloads the '+' operator to concatenate two Dataset instances. (4 marks)
• append() concatenates the contents of another Dataset instance to the current instance.
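As an illustration, __add__() and append() might behave as in this simplified sketch (the numpy-backed private storage and attribute names here are assumptions for illustration, not the skeleton's actual internals):

```python
import numpy as np

class Dataset:
    """Simplified stand-in for the coursework's Dataset class."""

    def __init__(self, X, y):
        self._X = np.asarray(X)   # private storage for feature vectors
        self._y = np.asarray(y)   # private storage for labels

    @property
    def X(self):                  # public, read-only view
        return self._X

    @property
    def y(self):
        return self._y

    def __add__(self, other):
        # '+' returns a NEW Dataset holding the concatenated samples
        return Dataset(np.concatenate([self._X, other.X]),
                       np.concatenate([self._y, other.y]))

    def append(self, other):
        # mutator: modifies this instance in place and returns None
        self._X = np.concatenate([self._X, other.X])
        self._y = np.concatenate([self._y, other.y])
```

For example, (a + b) leaves a and b untouched, whereas a.append(b) grows a in place.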
This is a mutator method: it does not return any value. (4 marks)

• split_train_test() splits the Dataset instance into two sets, one for training and one for testing. You can specify the keyword argument test_size to control the proportion of data to use as test examples. It should return a tuple (train, test) containing the Dataset instances for training and testing respectively. You may use sklearn to implement this method. (4 marks)

DatasetLoader

The DatasetLoader class can be found in review/dataset.py. Please implement the following method:

• load() returns a new Dataset instance from a specified file (one sample per line), and assigns a common label to all samples in the file (defaults to 0). (4 marks)

MovieReviewDatasetLoader

The MovieReviewDatasetLoader class can be found in review/dataset.py. It is a subclass of DatasetLoader. Please implement the following method:

• load() overrides the method in DatasetLoader. It takes two arguments: the name of the file containing positive examples and the name of the file containing negative examples. It returns a new Dataset instance with examples from both files, with the label 1 assigned to positive examples and the label 0 assigned to negative examples. Hint: Try reusing the load() method of the superclass DatasetLoader by invoking super().load(). (4 marks)

Classifier

The Classifier class can be found in review/classification.py. It can be initialised with a Dataset instance (the training samples) and any model that implements the fit() and predict() methods (e.g. an sklearn estimator). These have all been implemented for you.

BagOfWordsClassifier

The BagOfWordsClassifier is a subclass of Classifier, and can be found in review/classification.py. Like Classifier, it is instantiated with a Dataset instance and a model, plus a keyword argument max_vocab_size which controls the maximum number of words in the vocabulary.
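Returning to the loaders for a moment, the super().load() reuse suggested for MovieReviewDatasetLoader might look roughly like this simplified sketch (it returns plain lists instead of Dataset instances, purely for illustration):

```python
class DatasetLoader:
    """Simplified loader: one sample per line, one common label."""

    def load(self, filename, label=0):
        with open(filename, encoding="utf-8") as f:
            samples = [line.strip() for line in f if line.strip()]
        # the real coursework method returns a Dataset instance instead
        return samples, [label] * len(samples)


class MovieReviewDatasetLoader(DatasetLoader):
    def load(self, pos_filename, neg_filename):
        # reuse the superclass loader once per file, with different labels
        pos_X, pos_y = super().load(pos_filename, label=1)
        neg_X, neg_y = super().load(neg_filename, label=0)
        # the real coursework method would concatenate two Datasets here
        return pos_X + neg_X, pos_y + neg_y
```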
A bag of words classifier transforms the raw input into a histogram of word counts according to a fixed vocabulary, and uses this as the feature vector for each sample. Please complete the following method:

• set_up_pipeline() sets up a sklearn.pipeline.Pipeline as the model of the BagOfWordsClassifier. The pipeline should transform the raw texts into a matrix of token counts, and then use the base model argument as the classifier. Also remember to limit the vocabulary size to the max_vocab_size attribute. (5 marks)

Hint: Read the official documentation for sklearn.feature_extraction.text.CountVectorizer.

TfidfClassifier

The TfidfClassifier is a special case of BagOfWordsClassifier, and can be found in review/classification.py. It is a bag of words classifier, but additionally weights the word counts (Term Frequency, TF) by the Inverse Document Frequency (IDF) to reflect the importance of a word within a collection of texts. For example, the word 'the' is downweighted because it appears regularly across texts and is thus not important. Please complete the following method:

• set_up_pipeline() sets up a sklearn.pipeline.Pipeline as the model of the TfidfClassifier. The pipeline should transform the raw texts into a matrix of token counts weighted by the IDF, and then use the base model argument as the classifier. Again, remember to limit the vocabulary size to max_vocab_size. (5 marks)

Hint: Read the official documentation for sklearn.feature_extraction.text.TfidfVectorizer.

Scorer

The Scorer class can be found in review/evaluation.py. It is used for evaluating the predictions of classifiers. It is instantiated with two np.ndarrays: one containing the ground-truth labels of the test set, and the other containing the labels predicted by the classifier. Please complete the following methods:

• compute_accuracy() returns the accuracy score. You can compute this yourself (hint: it can be done in a single line with numpy!) or use sklearn.
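A sketch of the numpy one-liner for accuracy, together with a matching hand-rolled confusion matrix (sklearn.metrics.accuracy_score and sklearn.metrics.confusion_matrix are the library alternatives):

```python
import numpy as np

def compute_accuracy(y_true, y_pred):
    # fraction of predictions that match the ground truth
    return np.mean(y_true == y_pred)

def compute_confusion(y_true, y_pred, num_classes=2):
    # confusion[i, j] counts samples with true label i predicted as label j
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1
    return confusion
```

For example, with y_true = [0, 0, 1, 1] and y_pred = [0, 1, 1, 1], the accuracy is 0.75 and the confusion matrix is [[1, 1], [0, 2]].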
(2 marks)

• compute_confusion() returns the confusion matrix. Again, you can compute this yourself or use sklearn. (2 marks)

• generate_report() returns the confusion matrix and accuracy as a string, formatted as shown below. It should contain the total number of instances in each row and each column of the confusion matrix, and the total number of instances overall. (5 marks)

          0     1
    0   814   232  1046
    1   283   765  1048
       1097   997  2094

Accuracy: 0.7541

• generate_report_to_disk() saves the report from above into a text file. (2 marks)

part2.py

There is a file named part2.py in your root folder. The function run() shows an example of how your movie review classification library can be used. BagOfWordsClassifier and TfidfClassifier are designed to be flexible enough to be used with any model, be it a decision tree, an SVM or Naive Bayes, as long as it implements the fit() and predict() methods. This is the power of polymorphism in action.

We will attempt to run BagOfWordsClassifier and TfidfClassifier on the movie review dataset, each with two different base models, resulting in a total of four classifiers. We will:

• use sklearn.neighbors.KNeighborsClassifier as one base model; and
• use sklearn.model_selection.GridSearchCV to optimise the KNeighborsClassifier as the second base model.

Please complete the following functions in part2.py:

• set_up_knn() returns an instance of sklearn.neighbors.KNeighborsClassifier. Please use K = 5. Hint: You should be able to do this in one line. (2 marks)

• set_up_knn_gridsearchcv() returns an instance of sklearn.model_selection.GridSearchCV. Set up this grid search to optimise the hyperparameters of a KNeighborsClassifier via 10-fold cross validation. You should try to optimise K (the number of nearest neighbours) and the weights parameter ('uniform' or 'distance'). Choose a reasonable range for K so that your code does not take forever to run.
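A sketch of the two set-up functions (the K range in the grid is one reasonable choice, not a required one):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def set_up_knn():
    # a K-nearest-neighbours classifier with K = 5
    return KNeighborsClassifier(n_neighbors=5)

def set_up_knn_gridsearchcv():
    # search over K and the weighting scheme with 10-fold cross validation;
    # the K range below is illustrative -- keep it small so the search is fast
    param_grid = {
        "n_neighbors": [1, 3, 5, 7, 9, 11],
        "weights": ["uniform", "distance"],
    }
    return GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
```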
(5 marks)

Please check the official documentation for sklearn.neighbors.KNeighborsClassifier for details about the parameters.

If you run part2.py successfully, your program should output four text files in the root directory. Please also push these to your Gitlab repository. (2 marks)

How you will be marked

You will be assigned a mark according to:

• whether your program works or not;
• whether you have used meaningful names for variables;
• whether you have used a clear, appropriate and logical design.

For Part 2, you will not be marked on how good your classifiers are, but we do expect your classifiers to perform better than random (accuracy above 50%).