FIT9136 Assignment 2
Semester 2 2020
Deep Mendha, Teaching Associate, Faculty of IT
Email: [email protected]
Assignment structure by Shirin Ghaffarian Maghool
Date: 24 Aug 2020
© 2020, Monash University

Table of Contents
1. Key Information
   1.1. Learning outcomes
   1.2. Do and Do NOT
   1.3. Marking Criteria
   1.4. Submission details
2. Getting help
   2.1. English language skills
   2.2. Study skills
   2.3. Things are tough right now
   2.4. Things in the unit don't make sense
   2.5. I don't know what I need
3. Key tasks (100 marks)
   3.1. Digital Shakespeare: Pre-processing (30 marks)
   3.2. Zipf's Law: Common Statistics (40 marks)
   3.3. Bi-gram Model (15 marks)
   3.4. Generate Statement (15 marks)

1. Key Information

Format: Individual
Due date: 19th Oct 2020, 5:00 pm (AEST)
Weight: 25% of unit mark

1.1. Learning outcomes

1. design, construct, test and document Python programs;
2. demonstrate understanding of advanced topics of Python, such as classes, objects and visualisation;
3. evaluate different algorithms and analyse their complexity;
4. translate problems into algorithms with appropriate implementations by investigating different strategies for algorithm development.

1.2. Do and Do NOT

Do:
- Maintain academic integrity [1]
- Get support early, from this unit and from other services in the university
- Apply for special consideration for extensions [2]

Do NOT:
- Leave your assignment in draft mode
- Submit late (a 10% daily penalty applies) [3]

Submissions are not accepted more than 5 days after the due date unless you have special consideration.

1.3. Marking Criteria

Your work will be marked on:
- Functionality: correctly working program (60%)
- Code Architecture: algorithms, data types, control structures and use of libraries (10%)
- Code Style: variable names, readability, clear logic (10%)
- Documentation: program comments, clarity and connection to code (20%)

[1] https://www.monash.edu/rlo/research-writing-assignments/referencing-and-academic-integrity/academic-integrity
[2] https://www.monash.edu/exams/changes/special-consideration
[3] e.g. if the original mark was 70/100, submitting 2 days late results in 56/100 (14 marks off). This includes weekends.

1.4. Submission details

Submit to "Assignment 2 Submission" on Moodle: A2_studentID.zip [4], containing a separate A2_QN_StudentID.py file [5] for each of the four (4) tasks covered in Section 3, Key tasks (100 marks). Please make sure you do not add the dataset folder to the .zip file.

[4] studentID is your student ID. E.g. if your ID were 12345678, you would submit "A2_12345678.zip".
[5] StudentID is your student ID, and QN is the question number. E.g. if your ID were 12345678 and you were submitting Q1, you would submit "A2_Q1_12345678.py".

2. Getting help

2.1. English language skills

If you don't feel confident with your English, talk to English Connect: https://www.monash.edu/english-connect

2.2. Study skills

If you feel like you just don't have enough time to do everything you need to, maybe you just need a new approach. Talk to a learning skills adviser: https://www.monash.edu/library/skills/contacts

2.3. Things are tough right now

Everyone needs to talk to someone at some point in their life; no judgement here. Talk to a counsellor: https://www.monash.edu/health/counselling/appointments (friendly, approachable, confidential, free)

2.4. Things in the unit don't make sense

Even if you're not quite sure what to ask about, you won't be alone; it's always better to ask.
Ask on Ed: https://lms.monash.edu/course/view.php?id=78169&section=4
Attend a consultation: https://lms.monash.edu/course/view.php?id=78169#section-3
Email your tutor: https://lms.monash.edu/course/view.php?id=78169&section=1

2.5. I don't know what I need

Everyone at Monash University is here to help you. If things are tough now, they won't magically get better by themselves. Even if you don't know exactly what you need, come and talk with us and we'll figure it out. We can either help you ourselves or at least point you in the right direction.

3. Key tasks (100 marks)

Language modelling is one of the most interesting and important tasks in natural language processing (NLP). A language model lies at the core of machine translation systems such as Google Translate, summarisation systems, and text completion systems such as auto-complete, to name a few. In this assignment you will write a simple language model known as the bi-gram model. The tasks that follow build the model step by step. Each task is designed to assess a particular aspect of programming. Read each section carefully and attempt to solve the tasks using the techniques we have explored in class.

Libraries that can be used: math, matplotlib, os, pandas, numpy.

The dataset is available on the assessment page, under Assessment 2.

3.1. Digital Shakespeare: Pre-processing (30 marks)

Our language model will be trained on Shakespeare's books. Your first task is to read in all the given books and remove all special sentences from them. A special sentence lies between "<" and ">".

Inputs
- Files: all the files in the dataset folder

Output
- Filename: cleaned.txt (example in sample behaviour)

Details
- Remove all special sentences (a special sentence lies between "<" and ">").
- Remove all characters that are not alphanumeric, except spaces.

Sample behaviour
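The cleaning step could be sketched as below. This is a minimal illustration, not the required solution: the output filename cleaned.txt and the single dataset folder come from the task, while the function names, the UTF-8 encoding choice and the single-character scan are assumptions.

```python
import os

def clean_text(text):
    """Drop <...> spans, then keep only alphanumeric characters and spaces."""
    out = []
    inside = False                      # are we between "<" and ">"?
    for ch in text:
        if ch == "<":
            inside = True
        elif ch == ">":
            inside = False
        elif not inside and (ch.isalnum() or ch == " "):
            out.append(ch)
    return "".join(out)

def clean_folder(folder, out_path="cleaned.txt"):
    """Clean every file in `folder` and write the combined result."""
    parts = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            parts.append(clean_text(f.read()))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(" ".join(parts))

print(clean_text("<ACT I SCENE 1> To be, or not to be!"))  # -> " To be or not to be"
```

Note that, taken literally, "except spaces" keeps only the space character, so newlines are dropped too; decide (and document) how your own solution treats them.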
3.2. Zipf's Law: Common Statistics (40 marks)

Zipf's law of word distribution states that the frequency of every word in a large corpus is inversely proportional to its rank in the frequency table. Let f_i be the i-th largest frequency in the list; that is, f_1 is the frequency of the most common word, f_2 is the frequency of the second most common word, and so on. Zipf's law states that f_i is approximately equal to a/i for some constant a.

Inputs
- File: cleaned.txt (the file that was created in the previous task)

Outputs
There are 3 outputs:
1. File: vocab.txt
2. Graph of frequency against the first 100 words in the sorted vocab.
3. Graph of the number of words that occur n times against the word occurrence n.

Details
There are 3 tasks:
1. Find all the unique words in the cleaned file. Save all the cleaned words along with their frequencies, sorted in descending order of frequency, in the file vocab.txt. To get unique words, convert all the text to lower case first and then find the unique words.
2. Plot a graph of frequencies against the first 100 words from the sorted vocab file.
3. Count the number of words that occur once, twice, three times, and so on up to 250 times. Plot the number of words that occur n times against the word occurrence n.

3.3. Bi-gram Model (15 marks)

A bi-gram language model is a probabilistic model in which a sentence probability is decomposed into a product of conditionals as follows:

    p(x_1, x_2, x_3, ..., x_n) = ∏_k p(x_k | x_{k-1})            (equation 1)

These probabilities are approximated from the corpus using the following equation:

    p(x_k | x_{k-1}) = c(x_k, x_{k-1}) / c(x_{k-1})              (equation 2)

where c stands for the count of the words occurring together in the corpus.

For the first 1000 words in the vocab.txt file, fill in the following table. If you look at the numbers in your table, you will see a lot of zeros. This is called the data sparsity problem, and it causes a major issue by pushing probabilities to zero.
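Both the vocab.txt frequencies of Task 3.2 and the raw pair counts of equation 2 reduce to counting over the token stream. A minimal sketch, assuming whitespace-tokenised, already-cleaned text (the function names and the plain-dict representation are illustrative, not prescribed):

```python
def word_frequencies(text):
    """Unigram counts for vocab.txt: lower-case, then count each word."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    # descending order of frequency, as vocab.txt requires
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

def pair_counts(words):
    """c(x_{k-1}, x_k) from equation 2: counts of adjacent word pairs."""
    pairs = {}
    for prev, cur in zip(words, words[1:]):
        pairs[(prev, cur)] = pairs.get((prev, cur), 0) + 1
    return pairs

vocab = word_frequencies("To be or not to be")
print(vocab)                       # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
print(pair_counts("to be or not to be".split()))
```

For the plots, the allowed matplotlib library suffices: plotting the first 100 frequencies against rank 1..100 should show the roughly a/i decay that Zipf's law predicts.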
A simple fix is to smooth the probabilities by adding one to each count. Our new probability equation then becomes:

    p(x_k | x_{k-1}) = (c(x_k, x_{k-1}) + 1) / (c(x_{k-1}) + V)      (equation 3)

where V is the number of words in the vocabulary.

Convert the counts into probabilities using equation 3 and save them in a file called model. You can save it with a file extension of your choice. The data structure used to store such a table is also up to you. All your choices must be made within the scope of what you have studied in this unit. Advanced packages such as pickle, nltk and spacy, and advanced data structures such as heaps, should not be used.

3.4. Generate Statement (15 marks)

Given your model and a prompt, generate sentences of various lengths.

Inputs and outputs

Input sentence        -> Suggested word
"This is a"           -> "man"
"What is the purpose" -> "because"
"Move From here and"  -> "so"

Details
Choose the word with the highest probability at each location based upon the model. This is called greedy decoding, or greedy inference. For example, for prompt number 1, the probability can be decomposed as p(this) p(is|this) p(a|is) p(..|a), where you choose the word with the highest probability at p(..|a).

Sample behaviour

Note: to test your model further, a test dataset is available on the Assessment page on Moodle.
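Equation 3 and greedy decoding can be sketched together on a toy corpus. This is a hedged illustration under assumed names; the real model must be built from the cleaned Shakespeare corpus and restricted to the first 1000 vocab words as the task specifies.

```python
def bigram_counts(words):
    """c(x_{k-1}, x_k) pair counts and c(x_{k-1}) context counts."""
    pairs, singles = {}, {}
    for prev, cur in zip(words, words[1:]):
        pairs[(prev, cur)] = pairs.get((prev, cur), 0) + 1
        singles[prev] = singles.get(prev, 0) + 1
    return pairs, singles

def smoothed_prob(prev, cur, pairs, singles, v):
    """Equation 3: add-one smoothing, v = vocabulary size."""
    return (pairs.get((prev, cur), 0) + 1) / (singles.get(prev, 0) + v)

def greedy_next(prev, vocab, pairs, singles):
    """Greedy decoding: the vocab word with the highest probability after `prev`."""
    return max(vocab, key=lambda w: smoothed_prob(prev, w, pairs, singles, len(vocab)))

tokens = "the cat sat on the mat the cat ran".split()
pairs, singles = bigram_counts(tokens)
vocab = sorted(set(tokens))
print(greedy_next("the", vocab, pairs, singles))  # -> cat
```

Generating a longer sentence is then a loop: append greedy_next of the last word, repeat until the desired length. Thanks to smoothing, unseen pairs get a small non-zero probability of 1 / (c(prev) + V) instead of zero.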