Coursework 2: Recommendations
CO558 Python Programming (Autumn 2020/21)
Deadline: Friday 13th Nov 2020 (7pm GMT) on CATe
Authorship credits: Luca Grillotti, Josiah Wang, Oana Cocarascu

1 Problem description

In this coursework you will be analysing two types of recommendations:

• Part 1: The first type of recommendation relates to route planning. You will be working with London Tube stations, computing the shortest path between two stations as well as how much a user has to pay for their journeys.

• Part 2: The second part relates to sentiment analysis. This can be seen as a sub-task of decision making based on online reviews. As online reviews have become the chosen method of quality control for users in various domains, they now influence customers' decisions both positively and negatively.

To aid you in developing your program, you are supplied with skeleton code.

Important:

• Do not modify any constructors.
• Do not import any packages for Part 1 other than those from the Python Standard Library.
• The files tubemap.py, journey.py, and user.py contain example code showing how to call and run the functions in each file. For example, to run user.py, you can execute the following command from the root (i.e. cloned) directory: python3 -m tubemap.user
• Similarly, you can run the example code for Part 2, e.g. python3 -m review.dataset
• Do not modify the run() function in part2.py.

How to get the files and how to hand in

You are provided with a git repository, and you will use git to push your code changes to it. To get the code, type the following command, replacing the placeholder with your Imperial username:

git clone https://gitlab.doc.ic.ac.uk/lab2021_autumn/python_cw2_<username>.git

To submit on CATe, go to https://teaching.doc.ic.ac.uk/labts and click on the Submit to CATe button associated with the commit containing the code you wish to submit.

Part 1: London Tube Map

You are given a file data/london.json that contains data representing the London Tube map. Your task is to implement several methods related to loading and processing the data, determining the fastest route between two given stations, and computing the total price a user has to pay for their journeys.

Any journey that starts or ends between 06:30-09:30 or 17:00-20:00 is classed as peak; all other journeys are off-peak. If all of a customer's journeys on a given day are off-peak, their daily cap is £7.00; if any of their journeys are peak, their cap is £8.80. Long journeys cover more than 10 stations, and short journeys cover 10 stations or fewer. The prices are as follows:

• Peak long: £3.80
• Peak short: £2.90
• Off-peak long: £2.60
• Off-peak short: £1.90

Implement the following methods of class TubeMap in the file tubemap.py:

• import_tube_map_from_json() imports the JSON data from london.json and updates the attributes graph_tube_map and set_zones_per_station. Note that some stations belong to multiple zones (e.g. a station with "zone": "2.5" should be recorded in both zone 2 and zone 3). (10 marks)

• get_set_lines_for_station() returns a set containing the names of the lines on which a station is found. (3 marks)

• get_set_all_stations_on_line() returns the set of all the stations on a given line. (3 marks)

• get_fastest_path_between() implements Dijkstra's algorithm to find the fastest path between two given stations (a link to the pseudocode of the algorithm is given in the method's docstring). You can use float('inf') to specify infinity.
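For reference, a minimal sketch of Dijkstra's algorithm in this style (the dict-of-dicts graph below is purely illustrative and is not the actual format of graph_tube_map; follow the pseudocode linked in the docstring for the coursework version):

```python
import heapq

def fastest_path(graph, start, end):
    """Dijkstra's algorithm on graph[station] = {neighbour: minutes, ...}.

    Returns the list of stations on the fastest path, or None if unreachable.
    """
    dist = {station: float('inf') for station in graph}  # tentative distances
    dist[start] = 0
    previous = {}                 # best known predecessor of each station
    queue = [(0, start)]          # min-heap of (distance, station) pairs
    while queue:
        d, station = heapq.heappop(queue)
        if station == end:
            break
        if d > dist[station]:     # stale heap entry, skip it
            continue
        for neighbour, minutes in graph[station].items():
            candidate = d + minutes
            if candidate < dist[neighbour]:
                dist[neighbour] = candidate
                previous[neighbour] = station
                heapq.heappush(queue, (candidate, neighbour))
    if dist[end] == float('inf'):
        return None
    path = [end]                  # walk predecessors back to the start
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1]
```

For example, with graph = {"A": {"B": 1, "C": 4}, "B": {"A": 1, "C": 1}, "C": {"A": 4, "B": 1}}, fastest_path(graph, "A", "C") returns ["A", "B", "C"].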
(20 marks)

Implement the following methods of class Journey in the file journey.py:

• is_list_successive_stations_valid() returns True if all the stations given can be found in the provided tube map and all pairs of successive stations are indeed connected. (2 marks)
• is_long_journey() returns True if the journey covers more than 10 stations. (2 marks)
• is_on_peak() returns True if the journey starts or ends during peak times. (2 marks)
• compute_price() computes the price of the journey depending on whether it is peak or off-peak and whether it is a long or a short journey. (2 marks)

Implement the following methods of class User in the file user.py:

• register_journey() updates the attribute journeys_per_date given a journey's details. (3 marks)
• compute_price_per_day() computes the price of all journeys on a given day. (3 marks)

Part 2: Movie Reviews

The aims of Part 2 of the coursework are:

1. to give you practical experience in applying advanced object-oriented programming concepts: abstraction, polymorphism, inheritance, and encapsulation;
2. to allow you to practise tackling a machine learning task using numpy and sklearn.

You will find two files in data/:

• rt-polarity.pos contains examples of positive movie reviews, one per line
• rt-polarity.neg contains examples of negative movie reviews, one per line

You will need numpy and sklearn for this part of the coursework.

Dataset

The Dataset class can be found in review/dataset.py. It contains the public, read-only attributes X and y, which store the feature vectors and the labels of the samples respectively. It also contains corresponding private attributes that can only be used inside the class.

Please implement the following methods for Dataset:

• __add__() overloads the '+' operator to concatenate two Dataset instances. (4 marks)
• append() concatenates the contents of another Dataset instance to the current instance.
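As an illustration, __add__() and append() might behave as in this simplified sketch (the numpy-backed private storage and attribute names here are assumptions for illustration, not the skeleton's actual internals):

```python
import numpy as np

class Dataset:
    """Simplified stand-in for the coursework's Dataset class."""

    def __init__(self, X, y):
        self._X = np.asarray(X)   # private storage for feature vectors
        self._y = np.asarray(y)   # private storage for labels

    @property
    def X(self):                  # public, read-only view
        return self._X

    @property
    def y(self):
        return self._y

    def __add__(self, other):
        # '+' returns a NEW Dataset holding the concatenated samples
        return Dataset(np.concatenate([self._X, other.X]),
                       np.concatenate([self._y, other.y]))

    def append(self, other):
        # mutator: modifies this instance in place and returns None
        self._X = np.concatenate([self._X, other.X])
        self._y = np.concatenate([self._y, other.y])
```

For example, (a + b) leaves a and b untouched, whereas a.append(b) grows a in place.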
This is a mutator method: it does not return any value. (4 marks)

• split_train_test() splits the Dataset instance into two sets, one for training and one for testing. You can specify the keyword argument test_size to control the proportion of data to use as test examples. It should return a tuple (train, test) containing the Dataset instances for training and testing respectively. You may use sklearn to implement this method. (4 marks)

DatasetLoader

The DatasetLoader class can be found in review/dataset.py. Please implement the following method:

• load() returns a new Dataset instance from a specified file (one sample per line), and assigns a common label to all samples in the file (defaults to 0). (4 marks)

MovieReviewDatasetLoader

The MovieReviewDatasetLoader class can be found in review/dataset.py. It is a subclass of DatasetLoader. Please implement the following method:

• load() overrides the method in DatasetLoader. It takes two arguments: the name of the file containing positive examples and the name of the file containing negative examples. It returns a new Dataset instance with examples from both files, with the label 1 assigned to positive examples and the label 0 assigned to negative examples. Hint: Try reusing the load() method of the superclass DatasetLoader by invoking super().load(). (4 marks)

Classifier

The Classifier class can be found in review/classification.py. It can be initialised with a Dataset instance (the training samples) and any model that implements the fit() and predict() methods (e.g. an sklearn estimator). These have all been implemented for you.

BagOfWordsClassifier

The BagOfWordsClassifier is a subclass of Classifier, and can be found in review/classification.py. Like Classifier, it is instantiated with a Dataset instance and a model, plus a keyword argument max_vocab_size which controls the maximum number of words in the vocabulary.
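Returning to the loaders for a moment, the super().load() reuse suggested for MovieReviewDatasetLoader might look roughly like this simplified sketch (it returns plain lists instead of Dataset instances, purely for illustration):

```python
class DatasetLoader:
    """Simplified loader: one sample per line, one common label."""

    def load(self, filename, label=0):
        with open(filename, encoding="utf-8") as f:
            samples = [line.strip() for line in f if line.strip()]
        # the real coursework method returns a Dataset instance instead
        return samples, [label] * len(samples)


class MovieReviewDatasetLoader(DatasetLoader):
    def load(self, pos_filename, neg_filename):
        # reuse the superclass loader once per file, with different labels
        pos_X, pos_y = super().load(pos_filename, label=1)
        neg_X, neg_y = super().load(neg_filename, label=0)
        # the real coursework method would concatenate two Datasets here
        return pos_X + neg_X, pos_y + neg_y
```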
A bag of words classifier transforms the raw input into a histogram of word counts according to a fixed vocabulary, and uses this as the feature vector for each sample. Please complete the following method:

• set_up_pipeline() sets up a sklearn.pipeline.Pipeline as the model of the BagOfWordsClassifier. The pipeline should transform the raw texts into a matrix of token counts, and then use the base model argument as the classifier. Also remember to limit the vocabulary size to the max_vocab_size attribute. (5 marks)

Hint: Read the official documentation for sklearn.feature_extraction.text.CountVectorizer.

TfidfClassifier

The TfidfClassifier is a special case of BagOfWordsClassifier, and can be found in review/classification.py. It is a bag of words classifier, but additionally weights the word counts (Term Frequency, TF) by the Inverse Document Frequency (IDF) to reflect the importance of a word within a collection of texts. For example, the word 'the' is downweighted because it appears regularly across texts and is thus not important. Please complete the following method:

• set_up_pipeline() sets up a sklearn.pipeline.Pipeline as the model of the TfidfClassifier. The pipeline should transform the raw texts into a matrix of token counts weighted by the IDF, and then use the base model argument as the classifier. Again, remember to limit the vocabulary size to max_vocab_size. (5 marks)

Hint: Read the official documentation for sklearn.feature_extraction.text.TfidfVectorizer.

Scorer

The Scorer class can be found in review/evaluation.py. It is used for evaluating the predictions of classifiers. It is instantiated with two np.ndarrays: one containing the ground-truth labels of the test set, and the other containing the labels predicted by the classifier. Please complete the following methods:

• compute_accuracy() returns the accuracy score. You can compute this yourself (hint: it can be done in a single line with numpy!) or use sklearn.
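A sketch of the numpy one-liner for accuracy, together with a matching hand-rolled confusion matrix (sklearn.metrics.accuracy_score and sklearn.metrics.confusion_matrix are the library alternatives):

```python
import numpy as np

def compute_accuracy(y_true, y_pred):
    # fraction of predictions that match the ground truth
    return np.mean(y_true == y_pred)

def compute_confusion(y_true, y_pred, num_classes=2):
    # confusion[i, j] counts samples with true label i predicted as label j
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1
    return confusion
```

For example, with y_true = [0, 0, 1, 1] and y_pred = [0, 1, 1, 1], the accuracy is 0.75 and the confusion matrix is [[1, 1], [0, 2]].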
(2 marks)

• compute_confusion() returns the confusion matrix. Again, you can compute this yourself or use sklearn. (2 marks)

• generate_report() returns the confusion matrix and accuracy as a string, formatted as shown below. It should contain the total number of instances in each row and each column of the confusion matrix, and the total number of instances overall. (5 marks)

          0     1
    0   814   232  1046
    1   283   765  1048
       1097   997  2094

Accuracy: 0.7541

• generate_report_to_disk() saves the report from above into a text file. (2 marks)

part2.py

There is a file named part2.py in your root folder. The function run() shows an example of how your movie review classification library can be used. BagOfWordsClassifier and TfidfClassifier are designed to be flexible enough to be used with any model, be it a decision tree, an SVM or Naive Bayes, as long as it implements the fit() and predict() methods. This is the power of polymorphism in action.

We will attempt to run BagOfWordsClassifier and TfidfClassifier on the movie review dataset, each with two different base models, resulting in a total of four classifiers. We will:

• use sklearn.neighbors.KNeighborsClassifier as one base model; and
• use sklearn.model_selection.GridSearchCV to optimise the KNeighborsClassifier as the second base model.

Please complete the following functions in part2.py:

• set_up_knn() returns an instance of sklearn.neighbors.KNeighborsClassifier. Please use K = 5. Hint: You should be able to do this in one line. (2 marks)

• set_up_knn_gridsearchcv() returns an instance of sklearn.model_selection.GridSearchCV. Set up this grid search to optimise the hyperparameters of a KNeighborsClassifier via 10-fold cross validation. You should try to optimise K (the number of nearest neighbours) and the weights parameter ('uniform' or 'distance'). Choose a reasonable range for K so that your code does not take forever to run.
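A sketch of the two set-up functions (the K range in the grid is one reasonable choice, not a required one):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

def set_up_knn():
    # a K-nearest-neighbours classifier with K = 5
    return KNeighborsClassifier(n_neighbors=5)

def set_up_knn_gridsearchcv():
    # search over K and the weighting scheme with 10-fold cross validation;
    # the K range below is illustrative -- keep it small so the search is fast
    param_grid = {
        "n_neighbors": [1, 3, 5, 7, 9, 11],
        "weights": ["uniform", "distance"],
    }
    return GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
```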
(5 marks)

Please check the official documentation for sklearn.neighbors.KNeighborsClassifier for details about the parameters.

If you run part2.py successfully, your program should output four text files in the root directory. Please also push these to your Gitlab repository. (2 marks)

How you will be marked

You will be assigned a mark according to:

• whether your program works or not;
• whether you have used meaningful names for variables;
• whether you have used a clear, appropriate and logical design.

For Part 2, you will not be marked on how good your classifiers are, but we do expect your classifiers to perform better than random (accuracy above 50%).