Sample Take-home Exam Paper Semester 1, 2022
COMP5046 Natural Language Processing

This sample exam paper contains one exam question for each weekly topic. It is intended to share the structure and style of the final exam, so you can see which types of questions may appear as 4-mark short questions and 18-mark essay questions.

Please do not write your answer by hand. (Drawings only may be done either by hand or on a computer.) Your answer, including drawings and illustrations, MUST be written by you. For illustrations, you MUST NOT copy from other resources.

The final exam will be an open-book, unsupervised exam.

Week 1 & 2. Word Representation

Q. Explain the difference between FastText and Word2Vec with examples. (4 marks)

Solution

Word2Vec treats each word in the corpus as an atomic entity and generates one vector per word. The word is the smallest unit it trains on. Because Word2Vec learns vectors only for complete words found in the training corpus, it cannot produce a vector for an unseen, Out-of-Vocabulary (OOV) word.

FastText, an extension of the Word2Vec model, treats each word as composed of character n-grams, so the vector for a word is the sum of the vectors of its character n-grams. For example, with n = 3, the vector for the word aquarium is the sum of the vectors of the n-grams <aq, aqu, qua, uar, ari, riu, ium, um>. Note that < and > mark the start and end of the word. When the embedder encounters the unseen word Aquarius, it has no whole-word vector for it, but it can build one from the n-grams that Aquarius shares with aquarium (qua, uar, ari, riu), so Aquarius is embedded near aquarium.

Hence, FastText learns vectors for the n-grams found within each word, as well as for each complete word. The n-gram feature is the most significant improvement in FastText; it is designed to solve the OOV issue.

Week 3 and 4. Word Classification with Machine Learning

Q. In class, we learned that the family of recurrent neural networks has many important advantages and can be used in a variety of NLP tasks. For each of the following tasks and inputs, state how you would run an RNN to do that task. (4 marks)

1. How many outputs there are, i.e. the number of times the softmax is called from your RNN. If the number of outputs is not fixed, state it as arbitrary.
2. What each output is a probability distribution over.
3. Which inputs are fed at each time step to produce each output.

Task A: Named-Entity Recognition: For each word in a sentence, classify that word as either a person, organization, location, or none. (Inputs: A sentence containing n words.)

Task B: Sentiment Analysis: Classify the sentiment of a sentence ranging from negative to positive (integer values from 0 to 4). (Inputs: A sentence containing n words.)

Solution

Task A: Named-Entity Recognition
1. Number of outputs: n outputs.
2. Each output is a probability distribution over the 4 NER categories (person, organization, location, none).
3. Each word in the sentence is fed into the RNN, and one output is produced at every time step, corresponding to the predicted tag/category for that word.

Task B: Sentiment Analysis
1. Number of outputs: 1 output. (n outputs is also acceptable if the answer takes the average of all outputs.)
2. The output is a probability distribution over the 5 sentiment values (0 to 4).
3. Each word in the sentence is fed into the RNN, and one output is produced from the hidden states (by taking only the final state, or the max or mean across all states), corresponding to the sentiment value of the sentence.
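The two RNN set-ups in the Week 3 and 4 solution can be illustrated with a small sketch. This is a minimal, untrained plain (Elman) RNN in pure Python, written only to show how many softmax calls each task makes. All names (`run_rnn`, `w_xh`, `w_hh`, `w_ner`, `w_sent`), the dimensions, and the random weights and inputs are assumptions for illustration and do not come from the course material.

```python
import math
import random


def softmax(scores):
    """Turn a list of scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def run_rnn(inputs, w_xh, w_hh):
    """Feed one word vector per time step; return every hidden state."""
    hidden = [0.0] * len(w_hh)
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_xh, x), matvec(w_hh, hidden))]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states


rng = random.Random(0)
embed_dim, hidden_dim, n_words = 3, 4, 6

# Hypothetical sentence of n words, each already mapped to a word vector.
sentence = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(n_words)]
w_xh = rand_matrix(hidden_dim, embed_dim, rng)
w_hh = rand_matrix(hidden_dim, hidden_dim, rng)
w_ner = rand_matrix(4, hidden_dim, rng)   # person, organization, location, none
w_sent = rand_matrix(5, hidden_dim, rng)  # sentiment values 0 to 4

states = run_rnn(sentence, w_xh, w_hh)

# Task A: softmax is called at every time step, giving n distributions over 4 tags.
ner_outputs = [softmax(matvec(w_ner, h)) for h in states]

# Task B: softmax is called once, here on the final hidden state.
sentiment_final = softmax(matvec(w_sent, states[-1]))

# Task B alternative: softmax is called once on the mean of all hidden states.
mean_state = [sum(column) / len(states) for column in zip(*states)]
sentiment_mean = softmax(matvec(w_sent, mean_state))

print(len(ner_outputs), len(ner_outputs[0]))        # 6 4
print(len(sentiment_final), len(sentiment_mean))    # 5 5
```

The same recurrent pass serves both tasks; only the point at which the softmax is applied differs. In practice a library implementation with trained weights would replace the hand-written loop.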
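Returning to the Week 1 & 2 solution, the character n-gram decomposition behind FastText can also be sketched in a few lines. This is a simplified illustration, not the fastText library itself: it fixes n = 3 to match the aquarium example, whereas the real library uses a range of n-gram lengths, hashes n-grams into buckets, and also keeps a whole-word vector for in-vocabulary words. The helpers `char_ngrams` and `word_vector`, the toy `table`, and its random vectors are hypothetical stand-ins for learned embeddings.

```python
import random


def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with < and >."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_vector(word, table, dim=4):
    """Sum the vectors of the word's n-grams; unknown n-grams are skipped."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        if gram in table:
            vec = [v + g for v, g in zip(vec, table[gram])]
    return vec


# Toy n-gram table, as if trained on a corpus containing only "aquarium".
# The random values are placeholders for learned n-gram vectors (assumption).
rng = random.Random(0)
table = {g: [rng.uniform(-1, 1) for _ in range(4)] for g in char_ngrams("aquarium")}

# "Aquarius" was never seen as a whole word, but it shares n-grams with "aquarium".
shared = [g for g in char_ngrams("Aquarius") if g in table]
oov_vector = word_vector("Aquarius", table)

print(char_ngrams("aquarium"))
# ['<aq', 'aqu', 'qua', 'uar', 'ari', 'riu', 'ium', 'um>']
print(shared)
# ['qua', 'uar', 'ari', 'riu']
```

Because `oov_vector` is built from n-gram vectors that also contribute to the vector for aquarium, the unseen word receives a non-empty representation related to the seen one, which is the behaviour the solution describes. A word-level lookup in the style of Word2Vec would have nothing to return for it.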