辅导案例-COM4513-assignment1

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 1/19
[COM4513-6513] Assignment 1: Text Classification with
Logistic Regression
Instructor: Nikos Aletras
The goal of this assignment is to develop and test two text classification systems:
Task 1: sentiment analysis, in particular to predict the sentiment of movie review, i.e. positive or negative
(binary classification).
Task 2: topic classification, to predict whether a news article is about International issues, Sports or
Business (multiclass classification).
For that purpose, you will implement:
Text processing methods for extracting Bag-Of-Word features, using (1) unigrams, bigrams and trigrams to
obtain vector representations of documents. Two vector weighting schemes should be tested: (1) raw
frequencies (3 marks; 1 for each ngram type); (2) tf.idf (1 marks).
Binary Logistic Regression classifiers that will be able to accurately classify movie reviews trained with (1)
BOW-count (raw frequencies); and (2) BOW-tfidf (tf.idf weighted) for Task 1.
Multiclass Logistic Regression classifiers that will be able to accurately classify news articles trained with
(1) BOW-count (raw frequencies); and (2) BOW-tfidf (tf.idf weighted) for Task 2.
The Stochastic Gradient Descent (SGD) algorithm to estimate the parameters of your Logistic Regression
models. Your SGD algorithm should:
Minimise the Binary Cross-entropy loss function for Task 1 (3 marks)
Minimise the Categorical Cross-entropy loss function for Task 2 (3 marks)
Use L2 regularisation (both tasks) (1 mark)
Perform multiple passes (epochs) over the training data (1 mark)
Randomise the order of training data after each pass (1 mark)
Stop training if the difference between the current and previous validation loss is smaller than a
threshold (1 mark)
After each epoch print the training and development loss (1 mark)
Discuss how did you choose hyperparameters (e.g. learning rate and regularisation strength)? (2 marks;
0.5 for each model in each task).
After training the LR models, plot the learning process (i.e. training and validation loss in each epoch) using
a line plot (1 mark; 0.5 for both BOW-count and BOW-tfidf LR models in each task) and discuss if your
model overfits/underfits/is about right.
Model interpretability by showing the most important features for each class (i.e. most positive/negative
weights). Give the top 10 for each class and comment on whether they make sense (if they don't you might
have a bug!). If we were to apply the classifier we've learned into a different domain such laptop reviews or
restaurant reviews, do you think these features would generalise well? Can you propose what features the
classifier could pick up as important in the new domain? (2 marks; 0.5 for BOW-count and BOW-tfidf LR
models respectively in each task)
Data - Task 1
The data you will use for Task 1 are taken from here: http://www.cs.cornell.edu/people/pabo/movie-review-data/
(http://www.cs.cornell.edu/people/pabo/movie-review-data/) and you can find it in the ./data_sentiment folder
in CSV format:
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 2/19
data_sentiment/train.csv : contains 1,400 reviews, 700 positive (label: 1) and 700 negative (label: 0) to
be used for training.
data_sentiment/dev.csv : contains 200 reviews, 100 positive and 100 negative to be used for
hyperparameter selection and monitoring the training process.
data_sentiment/test.csv : contains 400 reviews, 200 positive and 200 negative to be used for testing.
Data - Task 2
The data you will use for Task 2 is a subset of the AG News Corpus
(http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) and you can find it in the ./data_topic folder
in CSV format:
data_topic/train.csv : contains 2,400 news articles, 800 for each class to be used for training.
data_topic/dev.csv : contains 150 news articles, 50 for each class to be used for hyperparameter
selection and monitoring the training process.
data_topic/test.csv : contains 900 news articles, 300 for each class to be used for testing.
Submission Instructions
You should submit a Jupyter Notebook file (assignment1.ipynb) and an exported PDF version (you can do it
from Jupyter: File->Download as->PDF via Latex ).
You are advised to follow the code structure given in this notebook by completing all given funtions. You can
also write any auxilliary/helper functions (and arguments for the functions) that you might need but note that you
can provide a full solution without any such functions. Similarly, you can just use only the packages imported
below but you are free to use any functionality from the Python Standard Library
(https://docs.python.org/2/library/index.html), NumPy, SciPy and Pandas. You are not allowed to use any third-
party library such as Scikit-learn (apart from metric functions already provided), NLTK, Spacy, Keras etc..
Please make sure to comment your code. You should also mention if you've used Windows (not recommended)
to write and test your code. There is no single correct answer on what your accuracy should be, but correct
implementations usually achieve F1-scores around 80% or higher. The quality of the analysis of the results is as
important as the accuracy itself.
This assignment will be marked out of 20. It is worth 20% of your final grade in the module.
The deadline for this assignment is 23:59 on Fri, 20 Mar 2020 and it needs to be submitted via MOLE.
Standard departmental penalties for lateness will be applied. We use a range of strategies to detect unfair
means (https://www.sheffield.ac.uk/ssid/unfair-means/index), including Turnitin which helps detect plagiarism,
so make sure you do not plagiarise.
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 3/19
In [4]:
Load Raw texts and labels into arrays
First, you need to load the training, development and test sets from their corresponding CSV files (tip: you can
use Pandas dataframes).
In [2]:
If you use Pandas you can see a sample of the data.
In [3]:
The next step is to put the raw texts into Python lists and their corresponding labels into NumPy arrays:
In [4]:
Bag-of-Words Representation
To train and test Logisitc Regression models, you first need to obtain vector representations for all documents
given a vocabulary of features (unigrams, bigrams, trigrams).
Text Pre-Processing Pipeline
Out[3]:
text label
0 note : some may consider portions of the follo... 1
1 note : some may consider portions of the follo... 1
2 every once in a while you see a film that is s... 1
3 when i was growing up in 1970s , boys in my sc... 1
4 the muppet movie is the first , and the best m... 1
import pandas as pd
import numpy as np
from collections import Counter
import re
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import random
# fixing random seed for reproducibility
random.seed(123)
np.random.seed(123)
# fill in your code...
data_tr.head()
# fill in your code...
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 4/19
Text Pre Processing Pipeline
To obtain a vocabulary of features, you should:
tokenise all texts into a list of unigrams (tip: using a regular expression)
remove stop words (using the one provided or one of your preference)
compute bigrams, trigrams given the remaining unigrams
remove ngrams appearing in less than K documents
use the remaining to create a vocabulary of unigrams, bigrams and trigrams (you can keep top N if you
encounter memory issues).
In [5]:
N-gram extraction from a document
You first need to implement the extract_ngrams function. It takes as input:
x_raw : a string corresponding to the raw text of a document
ngram_range : a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes
extracting unigrams and bigrams.
token_pattern : a string to be used within a regular expression to extract all tokens. Note that data is
already tokenised so you could opt for a simple white space tokenisation.
stop_words : a list of stop words
vocab : a given vocabulary. It should be used to extract specific features.
and returns:
a list of all extracted features.
See the examples below to see how this function should work.
In [6]:
stop_words = ['a','in','on','at','and','or',
'to', 'the', 'of', 'an', 'by',
'as', 'is', 'was', 'were', 'been', 'be',
'are','for', 'this', 'that', 'these', 'those', 'you', 'i',
'it', 'he', 'she', 'we', 'they' 'will', 'have', 'has',
'do', 'did', 'can', 'could', 'who', 'which', 'what',
'his', 'her', 'they', 'them', 'from', 'with', 'its']
def extract_ngrams(x_raw, ngram_range=(1,3), token_pattern=r'\b[A-Za-z][A-Za-z]+\b', stop_words=[],

# fill in your code...

return x
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 5/19
In [7]:
In [8]:
Note that it is OK to represent n-grams using lists instead of tuples: e.g. ['great', ['great', 'movie']]
Create a vocabulary of n-grams
Then the get_vocab function will be used to (1) create a vocabulary of ngrams; (2) count the document
frequencies of ngrams; (3) their raw frequency. It takes as input:
X_raw : a list of strings each corresponding to the raw text of a document
ngram_range : a tuple of two integers denoting the type of ngrams you want to extract, e.g. (1,2) denotes
extracting unigrams and bigrams.
token_pattern : a string to be used within a regular expression to extract all tokens. Note that data is
already tokenised so you could opt for a simple white space tokenisation.
stop_words : a list of stop words
vocab : a given vocabulary. It should be used to extract specific features.
min_df : keep ngrams with a minimum document frequency.
keep_topN : keep top-N more frequent ngrams.
and returns:
vocab : a set of the n-grams that will be used as features.
df : a Counter (or dict) that contains ngrams as keys and their corresponding document frequency as
values.
ngram_counts : counts of each ngram in vocab
Hint: it should make use of the extract_ngrams function.
Out[7]:
['great',
'movie',
'watch',
('great', 'movie'),
('movie', 'watch'),
('great', 'movie', 'watch')]
Out[8]:
['great', ('great', 'movie')]
extract_ngrams("this is a great movie to watch",
ngram_range=(1,3),
stop_words=stop_words)
extract_ngrams("this is a great movie to watch",
ngram_range=(1,2),
stop_words=stop_words,
vocab=set(['great', ('great','movie')]))
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 6/19
In [9]:
Now you should use get_vocab to create your vocabulary and get document and raw frequencies of n-grams:
In [10]:
Then, you need to create vocabulary id -> word and id -> word dictionaries for reference:
In [11]:
Now you should be able to extract n-grams for each text in the training, development and test sets:
In [12]:
Vectorise documents
5000
['manages', 'questions', 'covered', 'body', 'ron', 'flair', 'drunken', 'approach',
'etc', 'allowing', 'lebowski', 'strong', 'model', 'category', 'family', 'couldn', 'a
rgento', 'why', 'shown', ('doesn', 'work'), 'ocean', ('lot', 'more'), 'lou', 'attorn
ey', 'kick', 'thinking', 'worth', 'larger', ('waste', 'time'), ('back', 'forth'), 'r
oles', 'adventures', ('million', 'dollars'), 'critics', 'according', ('ghost', 'do
g'), 'outside', 'protect', ('last', 'time'), ('but', 'so'), 'creative', 'sell', 'pil
e', 'needless', 'immediately', 'screens', 'cards', 'blonde', 'meets', 'place', 'need
s', 'needed', 'teacher', 'conceived', 'competition', 'powerful', 'expected', ('firs
t', 'movie'), ('but', 'least'), 'gave', 'pleasures', 'spectacular', 'safe', 'wishe
s', 'stuff', ('there', 'something'), 'robert', 'kid', 'latest', ('bad', 'guy'), 'com
et', 'street', 'intelligent', 'allow', ('tim', 'roth'), ('production', 'design'), 'l
iving', 'abyss', 'clean', ('makes', 'him'), 'aware', 'footage', 'vicious', 'sharon',
'genuinely', 'south', 'draw', 'wall', ('will', 'smith'), 'romeo', ('scenes', 'but'),
'sometimes', 'friend', 'millionaire', 'families', 'technique', 'spirit', ('not', 'go
ing'), 'horrifying', 'national']
[('but', 1334), ('one', 1247), ('film', 1231), ('not', 1170), ('all', 1117), ('movi
e', 1095), ('out', 1080), ('so', 1047), ('there', 1046), ('like', 1043)]
def get_vocab(X_raw, ngram_range=(1,3), token_pattern=r'\b[A-Za-z][A-Za-z]+\b', min_df=0, keep_topN

# fill in your code...

return vocab, df, ngram_counts
vocab, df, ngram_counts = get_vocab(X_tr_raw, ngram_range=(1,3), keep_topN=5000, stop_words=stop_wo
print(len(vocab))
print()
print(list(vocab)[:100])
print()
print(df.most_common()[:10])
# fill in your code...
# fill in your code...
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 7/19
Next, write a function vectoriser to obtain Bag-of-ngram representations for a list of documents. The function
should take as input:
X_ngram : a list of texts (documents), where each text is represented as list of n-grams in the vocab
vocab : a set of n-grams to be used for representing the documents
and return:
X_vec : an array with dimensionality Nx|vocab| where N is the number of documents and |vocab| is the size
of the vocabulary. Each element of the array should represent the frequency of a given n-gram in a
document.
In [13]:
Finally, use vectorise to obtain document vectors for each document in the train, development and test set.
You should extract both count and tf.idf vectors respectively:
Count vectors
In [14]:
In [15]:
In [16]:
TF.IDF vectors
Out[15]:
(1400, 5000)
Out[16]:
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 1., 0.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
[0., 0., 0., 1., 0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 0., 1.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
0., 0.]])
def vectorise(X_ngram, vocab):

# fill in your code...

return X_vec
# fill in your code...
X_tr_count.shape
X_tr_count[:2,:50]
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 8/19
First compute idfs an array containing inverted document frequencies (Note: its elements should correspond
to your vocab )
In [17]:
Then transform your count vectors to tf.idf vectors:
In [18]:
In [19]:
In [ ]:
Binary Logistic Regression
After obtaining vector representations of the data, now you are ready to implement Binary Logistic Regression
for classifying sentiment.
First, you need to implement the sigmoid function. It takes as input:
z : a real number or an array of real numbers
and returns:
sig : the sigmoid of z
In [20]:
Out[19]:
array([0. , 0. , 0. , 2.24028121, 0. ,
0. , 0. , 5.67501654, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
2.47354289, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 2.56209629,
0. , 0. , 0. , 0. , 0. ])
# fill in your code...
# fill in your code...
X_tr_tfidf[1,:50]
def sigmoid(z):

# fill in your code...

return z
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 9/19
In [21]:
Then, implement the predict_proba function to obtain prediction probabilities. It takes as input:
X : an array of inputs, i.e. documents represented by bag-of-ngram vectors ( )
weights : a 1-D array of the model's weights
and returns:
preds_proba : the prediction probabilities of X given the weights
× ||
(1, ||)
In [22]:
Then, implement the predict_class function to obtain the most probable class for each vector in an array of
input vectors. It takes as input:
X : an array of documents represented by bag-of-ngram vectors ( )
weights : a 1-D array of the model's weights
and returns:
preds_class : the predicted class for each x in X given the weights
× ||
(1, ||)
In [23]:
To learn the weights from data, we need to minimise the binary cross-entropy loss. Implement binary_loss
that takes as input:
X : input vectors
Y : labels
weights : model weights
alpha : regularisation strength
and return:
l : the loss score
0.5
[0.00669285 0.76852478]
print(sigmoid(0))
print(sigmoid(np.array([-5., 1.2])))
def predict_proba(X, weights):

# fill in your code...

return preds_proba
def predict_class(X, weights):

# fill in your code...

return preds_class
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 10/19
In [24]:
Now, you can implement Stochastic Gradient Descent to learn the weights of your sentiment classifier. The
SGD function takes as input:
X_tr : array of training data (vectors)
Y_tr : labels of X_tr
X_dev : array of development (i.e. validation) data (vectors)
Y_dev : labels of X_dev
lr : learning rate
alpha : regularisation strength
epochs : number of full passes over the training data
tolerance : stop training if the difference between the current and previous validation loss is smaller than a
threshold
print_progress : flag for printing the training progress (train/validation loss)
and returns:
weights : the weights learned
training_loss_history : an array with the average losses of the whole training set after each epoch
validation_loss_history : an array with the average losses of the whole development set after each
epoch
In [25]:
Train and Evaluate Logistic Regression with Count vectors
First train the model using SGD:
def binary_loss(X, Y, weights, alpha=0.00001):

# fill in your code...
return l

def SGD(X_tr, Y_tr, X_dev=[], Y_dev=[], loss="binary", lr=0.1, alpha=0.00001, epochs=5, tolerance=0

cur_loss_tr = 1.
cur_loss_dev = 1.
training_loss_history = []
validation_loss_history = []

# fill in your code...

return weights, training_loss_history, validation_loss_history
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 11/19
In [26]:
Now plot the training and validation history per epoch. Does your model underfit, overfit or is it about right?
Explain why.
In [27]:
Explain here...
Compute accuracy, precision, recall and F1-scores:
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:50: DeprecationWarni
ng: elementwise == comparison failed; this will raise an error in the future.
w_count, loss_tr_count, dev_loss_count = SGD(X_tr_count, Y_tr,
X_dev=X_dev_count,
Y_dev=Y_dev,
lr=0.0001,
alpha=0.001,
epochs=100)
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 12/19
In [10]:
Finally, print the top-10 words for the negative and positive class respectively.
In [8]:
In [9]:
If we were to apply the classifier we've learned into a different domain such laptop reviews or restaurant
reviews, do you think these features would generalise well? Can you propose what features the classifier could
pick up as important in the new domain?
Provide your answer here...
Train and Evaluate Logistic Regression with TF.IDF vectors
Follow the same steps as above (i.e. evaluating count n-gram representations).
# fill in your code...
print('Accuracy:', accuracy_score(Y_te,preds_te_count))
print('Precision:', precision_score(Y_te,preds_te_count))
print('Recall:', recall_score(Y_te,preds_te_count))
print('F1-Score:', f1_score(Y_te,preds_te_count))
# fill in your code...
# fill in your code...
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 13/19
In [31]:
Now plot the training and validation history per epoch. Does your model underfit, overfit or is it about right?
Explain why.
In [32]:
Compute accuracy, precision, recall and F1-scores:
Epoch: 0| Training loss: 0.496538116279205| Validation loss: 0.5895899274588817
Epoch: 1| Training loss: 0.4084394710560143| Validation loss: 0.544649086837174
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:50: DeprecationWarni
ng: elementwise == comparison failed; this will raise an error in the future.
w_tfidf, trl, devl = SGD(X_tr_tfidf, Y_tr,
X_dev=X_dev_tfidf,
Y_dev=Y_dev,
lr=0.0001,
alpha=0.00001,
epochs=50)
# fill in your code...
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 14/19
In [11]:
Print top-10 most positive and negative words:
In [13]:
In [14]:
Discuss how did you choose model hyperparameters (e.g. learning rate and
regularisation strength)? What is the relation between training epochs and
learning rate? How the regularisation strength affects performance?
Enter your answer here...
Full Results
Add here your results:
LR Precision Recall F1-Score
BOW-count
BOW-tfidf
Multi-class Logistic Regression
Now you need to train a Multiclass Logistic Regression (MLR) Classifier by extending the Binary model you
developed above. You will use the MLR model to perform topic classification on the AG news dataset consisting
of three classes:
Class 1: World
Class 2: Sports
Class 3: Business
You need to follow the same process as in Task 1 for data processing and feature extraction by reusing the
functions you wrote.
# fill in your code...
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te))
print('Recall:', recall_score(Y_te,preds_te))
print('F1-Score:', f1_score(Y_te,preds_te))
# fill in your code...
# fill in your code...
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 15/19
In [36]:
In [37]:
In [38]:
In [39]:
Out[37]:
label text
0 1 Reuters - Venezuelans turned out early\and in ...
1 1 Reuters - South Korean police used water canno...
2 1 Reuters - Thousands of Palestinian\prisoners i...
3 1 AFP - Sporadic gunfire and shelling took place...
4 1 AP - Dozens of Rwandan soldiers flew into Suda...
5000
['questions', ('exhibition', 'game'), ('computer', 'maker'), ('chavez', 'won'), ('in
voices', 'halliburton'), ('den', 'hoogenband', 'netherlands'), 'body', ('anterior',
'cruciate'), ('billion', 'cash', 'stock'), ('nortel', 'networks', 'corp'), ('offerin
g', 'nearly'), ('prime', 'minister'), ('july', 'first', 'time'), 'strong', ('hang',
'over'), 'model', 'ease', 'assets', 'category', 'family', 'disappeared', ('olympic',
'men'), 'couldn', 'why', 'shift', ('world', 'largest', 'food'), 'oust', ('settler',
'homes'), ('jerusalem', 'reuters'), ('wisconsin', 'reuters'), 'luis', 'bowl', ('mond
ay', 'said', 'quarterly'), ('baltimore', 'orioles'), ('more', 'israeli'), ('army',
'had', 'decided'), 'economic', ('after', 'web', 'no'), 'lynn', ('defensive', 'end'),
'according', 'facilities', ('company', 'wednesday'), ('lt', 'gt', 'wednesday'), 'tut
si', ('dillard', 'inc'), 'outside', 'protect', 'uk', 'weather', 'sell', 'pile', 'imm
ediately', 'senate', ('venus', 'williams'), ('four', 'months'), 'pope', ('assistan
t', 'manager'), 'nyse', 'place', 'needs', 'needed', 'competition', 'powerful', 'outl
ooks', 'expected', 'gave', ('holding', 'corp'), ('minister', 'alexander'), 'spectacu
lar', ('world', 'biggest'), 'barrel', ('store', 'sales'), 'venezuela', 'groups', 'ro
bert', 'calf', 'brokerage', ('fears', 'about', 'country'), ('retailer', 'behind'),
'kid', 'latest', 'street', 'allow', 'surge', 'living', ('aug', 'reuters'), 'aware',
'thens', 'stunned', 'sending', 'sharon', 'south', 'draw', 'wall', 'declare', ('inc',
'said'), 'lows', ('cincinnati', 'reds'), 'families']
[('reuters', 631), ('said', 432), ('tuesday', 413), ('wednesday', 344), ('new', 32
5), ('after', 295), ('ap', 275), ('athens', 245), ('monday', 221), ('first', 210)]
# fill in your code...
data_tr.head()
# fill in your code...
vocab, df, ngram_counts = get_vocab(X_tr_raw, ngram_range=(1,3), keep_topN=5000, stop_words=stop_wo
print(len(vocab))
print()
print(list(vocab)[:100])
print()
print(df.most_common()[:10])
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 16/19
In [40]:
Now you need to change SGD to support multiclass datasets. First you need to develop a softmax function. It
takes as input:
z : array of real numbers
and returns:
smax : the softmax of z
In [42]:
Then modify predict_proba and predict_class functions for the multiclass case:
In [43]:
In [44]:
Toy example and expected functionality of the functions above:
In [45]:
In [46]:
Out[46]:
array([[0.33181223, 0.66818777],
[0.66818777, 0.33181223],
[0.89090318, 0.10909682]])
# fill in your code...
def softmax(z):

# fill in your code...

return smax
def predict_proba(X, weights):

# fill in your code...

return preds_proba
def predict_class(X, weights):

# fill in your code...

return preds_class
X = np.array([[0.1,0.2],[0.2,0.1],[0.1,-0.2]])
w = np.array([[2,-5],[-5,2]])
predict_proba(X, w)
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 17/19
In [47]:
Now you need to compute the categorical cross entropy loss (extending the binary loss to support multiple
classes).
In [1]:
Finally you need to modify SGD to support the categorical cross entropy loss:
In [49]:
Now you are ready to train and evaluate you MLR following the same steps as in Task 1 for both Count and tfidf
features:
Out[47]:
array([2, 1, 1])
predict_class(X, w)
def categorical_loss(X, Y, weights, num_classes=5, alpha=0.00001):

# fill in your code...

return l

def SGD(X_tr, Y_tr, X_dev=[], Y_dev=[], num_classes=5, lr=0.01, alpha=0.00001, epochs=5, tolerance=

# fill in your code...
return weights, training_loss_history, validation_loss_history
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 18/19
In [50]:
Plot training and validation process and explain if your model overfit, underfit or is about right:
In [2]:
Compute accuracy, precision, recall and F1-scores:
In [6]:
Print the top-10 words for each class respectively.
In [7]:
Discuss how did you choose model hyperparameters (e.g. learning rate and
regularisation strength)? What is the relation between training epochs and
learning rate? How the regularisation strength affects performance?
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:55: DeprecationWarni
ng: elementwise == comparison failed; this will raise an error in the future.
w_count, loss_tr_count, dev_loss_count = SGD(X_tr_count, Y_tr,
X_dev=X_dev_count,
Y_dev=Y_dev,
num_classes=3,
lr=0.0001,
alpha=0.001,
epochs=200)
# fill in your code...
# fill in your code...
print('Accuracy:', accuracy_score(Y_te,preds_te))
print('Precision:', precision_score(Y_te,preds_te,average='macro'))
print('Recall:', recall_score(Y_te,preds_te,average='macro'))
print('F1-Score:', f1_score(Y_te,preds_te,average='macro'))
# fill in your code...
2020/3/16 assignment1 - Jupyter Notebook
localhost:8888/notebooks/Documents/NLP/assignment1.ipynb# 19/19
g g g p
Explain here...
Now evaluate BOW-tfidf...
Full Results
Add here your results:
LR Precision Recall F1-Score
BOW-count
BOW-tfidf
In [ ]: