辅导案例-FIT9133

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

FIT9133 Assignment #2
Finding the Right Book
Semester 2 2019
Gavin Kroeger
Teaching Associate, Faculty of IT
Email: [email protected]
© 2019, Monash University
Assignment Structure by Jojo Wong
September 29, 2019
© 2019, Faculty of IT, Monash University
2Revision Status
$Id: A2.tex, Version 1.0 2019/08/31 2:11 pm Gavin Kroeger $
$Id: A2.tex, Version 1.1 2019/09/07 6:27 pm Gavin Kroeger $
$Id: A2.tex, Version 1.2 2019/09/09 9:06 pm Gavin Kroeger $
$Id: A2.tex, Version 1.3 2019/09/17 5:47 pm Gavin Kroeger $
$Id: A2.tex, Version 1.4 2019/09/24 6:25 pm Gavin Kroeger $
$Id: A2.tex, Version 1.5 2019/09/29 10:33 pm Gavin Kroeger $
© 2019, Faculty of IT, Monash University
CONTENTS 3
Contents
1 Introduction 4
2 Finding the Right Book 5
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 The Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Task 1: Setting up the preprocessor . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.1 More details on Clean() . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Task 2: Word Analyser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Task 3: Calculating Inverse Document Frequency (IDF) . . . . . . . . . . . . 8
2.6 Task 4: Presenting Your Results . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Important Notes 10
3.1 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2 Marking Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4 Submission 11
4.1 Deliverables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.2 Academic Integrity: Plagiarism and Collusion . . . . . . . . . . . . . . . . . . 11
© 2019, Faculty of IT, Monash University
1 Introduction 4
1 Introduction
This assignment is due on 18/10/2019. It is worth 20% of the total unit marks. A
penalty of 10% per day will apply for late submission. Refer to the FIT9133 Unit Guide for the
policy on extensions or special considerations.
Note that this is an individual assignment and must be your own work. Please pay
attention to Section 4.2 of this document on the university policies for the Academic Integrity,
Plagiarism and Collusion.
All the program files and any supporting documents should be compressed into one single file
for submission. (The submission details are given in Section 4.)
Note: In this assignment you are not permitted to import the following libraries:
re
sklearn (otherwise known as scikit-learn)
Further, you may not use any external libraries other than those provided by base Ana-
conda. Submissions using any external library that isn't included with base Anaconda will
incur penalties. The following link lists what is included with Windows 64 bit: https:
//docs.anaconda.com/anaconda/packages/py3.7_win-64/. If the installed column is
ticked, then it is in base Anaconda.
© 2019, Faculty of IT, Monash University
2 Finding the Right Book 5
2 Finding the Right Book
2.1 Overview
This assignment is about creating class objects that are able to read books and then perform
analysis on them. The analysis performed in this assignment may not be state of the art but
will provide insight towards basic data science. The first task is about cleaning a dataset, the
second and third tasks are about performing calculations on a dataset, and the fourth is about
presenting your results.
2.2 The Dataset
You will be working with books selected from the Project Guttenberg
1
. These books are
encoded in UTF-8 and are written in English. The text files for this assignment can be found
in Moodle under the assignment tab. The selected texts are also considered classics. When
you are finished with semester you are encouraged to read them.
2.3 Task 1: Setting up the preprocessor
In the first task, you are required to define a class that will perform the basic preprocessing on
each input text. This class should have one instance variable which is a string that holds the
text of the book. This string will change depending on whether clean() or read_text()
has been called.
The implementation of this preprocessor class should include the following four methods:
__init__(self):
This is the constructor that is required for creating instances of this class. You should
define a string called book_content to hold the a book's text.
__str__(self):
Re-define this method to present content of book_content.
clean(self):
This is method removes undesirable characters from text present in book_content and
stores it back in book_content. If no text has been read into this variable this function
should return the int 1, otherwise it returns None. Please read section 2.3.1 for more
details.
1https://www.gutenberg.org/wiki/Main_Page
© 2019, Faculty of IT, Monash University
2.3 Task 1: Setting up the preprocessor 6
read_text(self,text_name):
This method is defined to take as an argument a string that is the name of a file in
the current directory. The function reads the content of the file into the string instance
variable of this class. You may assume that the text is in UTF-8 encoding.
You should name this class as Preprocessor and the Python file as preprocessor_StudentID.py.
2.3.1 More details on Clean()
The analysis that will be performed on the text requires that there is no punctuation. Anything
that is not a letter or a number or white-space must be removed from the text. You can
assume that the text will be made of letters from the English Alphabet.
Hyphenated words such as off-campus should become off campus. Underscores (_) should
also be treated this way.
Contraction words such as wasn't should become wasnt.
Do not replace numbers within the text.
The requirements above are not strictly correct for lexicographical analysis, but will make your
life a lot easier for the sake of this assignment. Characters should be made lowercase if they
can be.
© 2019, Faculty of IT, Monash University
2.4 Task 2: Word Analyser 7
2.4 Task 2: Word Analyser
In this task, you are required to define a class for analysing the number of occurrences for
each word from a given cleaned text. This class should also have one instance variable called
word_counts which is Python Dictionary.
The implementation of this word analyser class should include the following four methods:
__init__(self):
This is the constructor that is required for creating instances of this class. You should
define an instance variable word_counts as a Dictionary.
__str__(self):
Re-define this method to present the number of occurrences for each word in a readable
format. You should return a formatted string in this method.
analyse_words(self, book_text):
This is the method that performs a count on a given book text at the word level. This
method should accept the cleaned book text as the argument, and attempt to count the
occurrences for each of the words. The word count should be updated in word_counts.
Do not count occurrences of new line characters.
get_word_frequency(self):
This method is defined to return the frequency of the words found in word_counts.
This method should return a Python Dictionary.
Note that frequency refers to
count(word)
count(allwords)
You should name this class as WordAnalyser and the Python file as word_StudentID.py.
© 2019, Faculty of IT, Monash University
2.5 Task 3: Calculating Inverse Document Frequency (IDF) 8
2.5 Task 3: Calculating Inverse Document Frequency (IDF)
Task 3 is about implementing IDF calculations. IDF is a statistic that attempts to categorize
how important a term is within a corpus of documents. The higher the IDF value the more
important the term is. Traditionally terms are made of multiple words but for the sake of this
assignment a single word will be used for the term. You may not use any third party libraries
to calculate IDF. You must implement the calculation yourself.
IDF = 1 + log D1 +N
Where D is the number of documents in the corpus, and N is the count of the number of
documents containing the term.
This class will contain one instance variable called data that is a Pandas Dataframe. data
will contain loaded term frequencies as rows. Each row represents the frequencies of a single
cleaned text. Each column corresponds to a word. The implementation of this IDF calculator
class should include the following methods:
__init__(self):
This is the constructor that is required for creating instances of this class. Create an
instance variable called data that is a Dataframe.
load_frequency(self, book_frequency, book_title):
Loads the frequency of a cleaned text into data with a title that corresponds to the
text the frequency was generated from. book_frequency is the same type of dictionary
generated in Task 2.
get_IDF(self,term):
Obtains the IDF for the term provided and the documents loaded into data.
You should name this class as IDFAnalyser and the Python file as idf_StudentID.py.
© 2019, Faculty of IT, Monash University
2.6 Task 4: Presenting Your Results 9
2.6 Task 4: Presenting Your Results
The last task is to use Term Frequency-Inverse Document Frequency (TF-IDF) to determine
the most suitable document for a given term. The signature for your function must be
choice(term,documents) where term is a string indicating what term is being used to
search and documents is an idf object outlined in Task 3 (section 2.5). The document with
the highest TF-IDF score for the query term is the best matching one.
TF − IDF = tf(term, document)× idf(term, documents)
The IDF for your calculation will be the same for all calculations for a given term, however
the Term Frequency (TF) will be different for each document. Your function should return
the name of the document that returns the highest TF-IDF amongst all documents present in
the given idf object. In other words, the most suitable book from the loaded books.
You may write as many functions as you like to complete this task.
Your script containing the function choice is to be saved in a Python script called task4_StudentID.py
© 2019, Faculty of IT, Monash University
3 Important Notes 10
3 Important Notes
3.1 Documentation
Commenting your code is essential as part of the assessment criteria (refer to Section 3.2).
You should also include comments at the beginning of your program file, which specify your
name, your Student ID, the start date, and the last modified date of the program, as well as
with a high-level description of the program. In-line comments within the program are also
part of the required documentation.
3.2 Marking Criteria
The assessment of this assignment will be based on the following marking criteria:
50% for working program functionality;
20% for code architecture algorithms, data types, control structures, and use of
libraries;
10% for coding style clear logic, clarity in variable names, and readability;
20% for documentation program comments.
© 2019, Faculty of IT, Monash University
4 Submission 11
4 Submission
There will be NO hard copy submission required for this assignment. You are required to submit
your assignment as a .zip file name with your Student ID. For example, if your Student ID is
12345678, you would submit a zipped file named A2_12345678.zip. Note that marks
will be deducted if this requirement is not strictly complied with.
Your submission must be done via the assignment submission link on the FIT9133 S2 2019
Moodle site by the deadline specified in Section 1. Submissions will not be accepted if left in
Draft mode. Be sure to submit your files before the deadline to not incur late penalties.
4.1 Deliverables
Your submission should contain the following documents:
Four Python scripts named preprocessor_StudentID.py, word_StudentID.py,
idf_StudentID.py, and task4_StudentID.py.
NOTE: Your programs must at least run on the computers in the University's
computer labs. Any submission that does not run accordingly will receive no
marks.
Electronic copies of ALL your files that are needed to run your programs.
Please ensure that you replaced StudentID with your student ID. For example, if your ID is
1234 then you would submit for the first task a file called preprocessor_1234.py.
Your programs must be written as python scripts. Jupyter Notebooks will not be accepted as
part of submission for this assignment.
Marks will deducted for any of these requirements that are not strictly complied with.
4.2 Academic Integrity: Plagiarism and Collusion
Plagiarism Plagiarism means to take and use another person's ideas and or manner of express-
ing them and to pass them off as your own by failing to give appropriate acknowledgement.
This includes materials sourced from the Internet, staff, other students, and from published
and unpublished works.
© 2019, Faculty of IT, Monash University
4.2 Academic Integrity: Plagiarism and Collusion 12
Collusion Collusion means unauthorised collaboration on assessable work (written, oral, or
practical) with other people. This occurs when you present group work as your own or as
the work of another person. Collusion may be with another Monash student or with people
or students external to the University. This applies to work assessed by Monash or another
university.
It is your responsibility to make yourself familiar with the University's policies
and procedures in the event of suspected breaches of academic integrity. (Note:
Students will be asked to attend an interview should such a situation is detected.)
The University's policies are available at: http://www.monash.edu/students/academic/
policies/academic-integrity
© 2019, Faculty of IT, Monash University