FIT9136 Assignment 2
Semester 2 2020
Deep Mendha, Teaching Associate, Faculty of IT
Email: [email protected]
Assignment structure by Shirin Ghaffarian Maghool
Date: 24 Aug 2020
© 2020, Monash University

Table of Contents
1. Key Information
   1.1. Learning outcomes
   1.2. Do and Do NOT
   1.3. Marking Criteria
   1.4. Submission details
2. Getting help
   2.1. English language skills
   2.2. Study skills
   2.3. Things are tough right now
   2.4. Things in the unit don't make sense
   2.5. I don't know what I need
3. Key tasks (100 marks)
   3.1. Digital Shakespeare: Pre-processing (30 marks)
   3.2. Zipf's Law: Common Statistics (40 marks)
   3.3. Bi-gram Model (15 marks)
   3.4. Generate Statement (15 marks)

1. Key Information

Format: Individual
Due date: 19th Oct 2020, 5:00 pm (AEST)
Weight: 25% of unit mark

1.1. Learning outcomes

1. design, construct, test and document Python programs;
2. demonstrate understanding of advanced topics of Python, such as classes, objects and visualisation;
3. evaluate different algorithms and analyse their complexity;
4. translate problems into algorithms with appropriate implementations by investigating different strategies for algorithm development.

1.2. Do and Do NOT

Do:
- Maintain academic integrity [1]
- Get support early, from this unit and from other services in the university
- Apply for special consideration for extensions [2]

Do NOT:
- Leave your assignment in draft mode
- Submit late (a 10% daily penalty applies) [3]

Submissions are not accepted more than 5 days after the due date unless you have special consideration.

1.3. Marking Criteria

Your work will be marked on:
- Functionality: correctly working program (60%)
- Code Architecture: algorithms, data types, control structures and use of libraries (10%)
- Code Style: variable names, readability, clear logic (10%)
- Documentation: program comments, clarity and connection to code (20%)

[1] https://www.monash.edu/rlo/research-writing-assignments/referencing-and-academic-integrity/academic-integrity
[2] https://www.monash.edu/exams/changes/special-consideration
[3] e.g. if the original mark was 70/100, submitting 2 days late results in 56/100 (14 marks off). This includes weekends.

1.4. Submission details

Submit to "Assignment 2 Submission" on Moodle: A2_studentID.zip [4], containing a separate A2_QN_StudentID.py file [5] for each of the four (4) tasks covered in Section 3, Key tasks (100 marks). Please make sure you do not add the dataset folder to the .zip file.

[4] studentID is your student ID. E.g. if your ID were 12345678, you would submit "A2_12345678.zip".
[5] StudentID is your student ID, and QN is the question number. E.g. if your ID were 12345678 and you were submitting Q1, you would submit "A2_Q1_12345678.py".

2. Getting help

2.1. English language skills

If you don't feel confident with your English, talk to English Connect: https://www.monash.edu/english-connect

2.2. Study skills

If you feel like you just don't have enough time to do everything you need to, maybe you just need a new approach. Talk to a learning skills adviser: https://www.monash.edu/library/skills/contacts

2.3. Things are tough right now

Everyone needs to talk to someone at some point in their life; no judgement here. Talk to a counsellor: https://www.monash.edu/health/counselling/appointments (friendly, approachable, confidential, free)

2.4. Things in the unit don't make sense

Even if you're not quite sure what to ask about, you won't be alone; it's always better to ask.
Ask on Ed: https://lms.monash.edu/course/view.php?id=78169&section=4
Attend a consultation: https://lms.monash.edu/course/view.php?id=78169#section-3
Email your tutor: https://lms.monash.edu/course/view.php?id=78169&section=1

2.5. I don't know what I need

Everyone at Monash University is here to help you. If things are tough now, they won't magically get better by themselves. Even if you don't know exactly what you need, come and talk with us and we'll figure it out. We can either help you ourselves or at least point you in the right direction.

3. Key tasks (100 marks)

Language modelling is one of the most interesting and important tasks in natural language processing (NLP). A language model lies at the core of machine translation systems such as Google Translate, summarisation systems, and text completion systems such as auto-complete, to name a few. In this assignment you will write a simple language model known as the bi-gram model. The tasks that follow build the model step by step. Each task is designed to assess a particular aspect of programming. Read each section carefully and attempt to solve the tasks using the techniques we have explored in class.

Libraries that can be used: math, matplotlib, os, pandas, numpy.

The dataset is available on the assessment page, under Assessment 2.

3.1. Digital Shakespeare: Pre-processing (30 marks)

Our language model will be trained on Shakespeare's books. Your first task is to read in all the given books and remove all special sentences from them. A special sentence lies between "<" and ">".

Inputs
- Files: all the files in the dataset folder

Output
- Filename: cleaned.txt (example in sample behaviour)

Details
- Remove all special sentences (a special sentence lies between "<" and ">").
- Remove all characters that are not alphanumeric, except spaces.

Sample behaviour
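The cleaning step could be sketched as below. This is a minimal illustration, not the required solution: the output filename cleaned.txt and the single dataset folder come from the task, while the function names, the UTF-8 encoding choice and the single-character scan are assumptions.

```python
import os

def clean_text(text):
    """Drop <...> spans, then keep only alphanumeric characters and spaces."""
    out = []
    inside = False                      # are we between "<" and ">"?
    for ch in text:
        if ch == "<":
            inside = True
        elif ch == ">":
            inside = False
        elif not inside and (ch.isalnum() or ch == " "):
            out.append(ch)
    return "".join(out)

def clean_folder(folder, out_path="cleaned.txt"):
    """Clean every file in `folder` and write the combined result."""
    parts = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            parts.append(clean_text(f.read()))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(" ".join(parts))

print(clean_text("<ACT I SCENE 1> To be, or not to be!"))  # -> " To be or not to be"
```

Note that, taken literally, "except spaces" keeps only the space character, so newlines are dropped too; decide (and document) how your own solution treats them.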
3.2. Zipf's Law: Common Statistics (40 marks)

Zipf's law of word distribution states that the frequency of every word in a large corpus is inversely proportional to its rank in the frequency table. Let f_i be the i-th largest frequency in the list; that is, f_1 is the frequency of the most common word, f_2 is the frequency of the second most common word, and so on. Zipf's law states that f_i is approximately equal to a/i for some constant a.

Inputs
- File: cleaned.txt (the file that was created in the previous task)

Outputs
There are 3 outputs:
1. File: vocab.txt
2. Graph of frequency against the first 100 words in the sorted vocab.
3. Graph of the number of words that occur n times against the word occurrence n.

Details
There are 3 tasks:
1. Find all the unique words in the cleaned file. Save all the cleaned words along with their frequencies, sorted in descending order of frequency, in the file vocab.txt. To get unique words, convert all the text to lower case first and then find the unique words.
2. Plot a graph of frequencies against the first 100 words from the sorted vocab file.
3. Count the number of words that occur once, twice, three times, and so on up to 250 times. Plot the number of words that occur n times against the word occurrence n.

3.3. Bi-gram Model (15 marks)

A bi-gram language model is a probabilistic model in which a sentence probability is decomposed into a product of conditionals as follows:

    p(x_1, x_2, x_3, ..., x_n) = ∏_k p(x_k | x_{k-1})            (equation 1)

These probabilities are approximated from the corpus using the following equation:

    p(x_k | x_{k-1}) = c(x_k, x_{k-1}) / c(x_{k-1})              (equation 2)

where c stands for the count of the words occurring together in the corpus.

For the first 1000 words in the vocab.txt file, fill in the following table. If you look at the numbers in your table, you will see a lot of zeros. This is called the data sparsity problem, and it causes a major issue by pushing probabilities to zero.
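Both the vocab.txt frequencies of Task 3.2 and the raw pair counts of equation 2 reduce to counting over the token stream. A minimal sketch, assuming whitespace-tokenised, already-cleaned text (the function names and the plain-dict representation are illustrative, not prescribed):

```python
def word_frequencies(text):
    """Unigram counts for vocab.txt: lower-case, then count each word."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    # descending order of frequency, as vocab.txt requires
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

def pair_counts(words):
    """c(x_{k-1}, x_k) from equation 2: counts of adjacent word pairs."""
    pairs = {}
    for prev, cur in zip(words, words[1:]):
        pairs[(prev, cur)] = pairs.get((prev, cur), 0) + 1
    return pairs

vocab = word_frequencies("To be or not to be")
print(vocab)                       # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
print(pair_counts("to be or not to be".split()))
```

For the plots, the allowed matplotlib library suffices: plotting the first 100 frequencies against rank 1..100 should show the roughly a/i decay that Zipf's law predicts.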
A simple fix is to smooth the probabilities by adding one to each count. Our new probability equation then becomes:

    p(x_k | x_{k-1}) = (c(x_k, x_{k-1}) + 1) / (c(x_{k-1}) + V)      (equation 3)

where V is the number of words in the vocabulary.

Convert the counts into probabilities using equation 3 and save them in a file called model. You can save it with a file extension of your choice. The data structure used to store such a table is also up to you. All your choices must be made within the scope of what you have studied in this unit. Advanced packages such as pickle, nltk and spacy, and advanced data structures such as heaps, should not be used.

3.4. Generate Statement (15 marks)

Given your model and a prompt, generate sentences of various lengths.

Inputs and outputs

Input sentence        -> Suggested word
"This is a"           -> "man"
"What is the purpose" -> "because"
"Move From here and"  -> "so"

Details
Choose the word with the highest probability at each location based upon the model. This is called greedy decoding, or greedy inference. For example, for prompt number 1, the probability can be decomposed as p(this) p(is|this) p(a|is) p(..|a), where you choose the word with the highest probability at p(..|a).

Sample behaviour

Note: to test your model further, a test dataset is available on the Assessment page on Moodle.
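Equation 3 and greedy decoding can be sketched together on a toy corpus. This is a hedged illustration under assumed names; the real model must be built from the cleaned Shakespeare corpus and restricted to the first 1000 vocab words as the task specifies.

```python
def bigram_counts(words):
    """c(x_{k-1}, x_k) pair counts and c(x_{k-1}) context counts."""
    pairs, singles = {}, {}
    for prev, cur in zip(words, words[1:]):
        pairs[(prev, cur)] = pairs.get((prev, cur), 0) + 1
        singles[prev] = singles.get(prev, 0) + 1
    return pairs, singles

def smoothed_prob(prev, cur, pairs, singles, v):
    """Equation 3: add-one smoothing, v = vocabulary size."""
    return (pairs.get((prev, cur), 0) + 1) / (singles.get(prev, 0) + v)

def greedy_next(prev, vocab, pairs, singles):
    """Greedy decoding: the vocab word with the highest probability after `prev`."""
    return max(vocab, key=lambda w: smoothed_prob(prev, w, pairs, singles, len(vocab)))

tokens = "the cat sat on the mat the cat ran".split()
pairs, singles = bigram_counts(tokens)
vocab = sorted(set(tokens))
print(greedy_next("the", vocab, pairs, singles))  # -> cat
```

Generating a longer sentence is then a loop: append greedy_next of the last word, repeat until the desired length. Thanks to smoothing, unseen pairs get a small non-zero probability of 1 / (c(prev) + V) instead of zero.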