FIT5196-S1-2021
Assessment 1
This is an individual assessment and is worth 35% of your total mark for
FIT5196.
Due date: Please check Assessment 1: Parsing Data And Text Preprocessing (Weight: 35%)
Text documents, such as crawled web data, usually consist of topically coherent segments.
Within each segment, word usage is expected to follow a more consistent lexical distribution
than it does across the data-set as a whole. A linear partition of texts into topic
segments can be used for text analysis tasks such as passage retrieval in IR (information
retrieval), document summarization, recommender systems, and learning-to-rank methods.
Task 1: Parsing Text Files (55%)
This assessment touches the very first step of analyzing textual data, i.e., extracting data from
semi-structured text files. Students are provided with a data-set that contains information
about NEWS articles. Each text file contains attributes of a NEWS article, e.g., “uuid”,
“url”, “site”, “title”, and “text”. The student dataset can be found here. Your task is to
extract the data and transform it into XML and CSV format with the following
elements:
1. uuid: the unique id of the NEWS article
2. author: the name of the author of the NEWS article
3. text: the body of the NEWS article
4. published: the date and time at which the NEWS article was created
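As a rough illustration of the extraction step, the sketch below pulls the four elements out of one record using only the re package. The input layout shown in SAMPLE (quoted key/value pairs) is purely hypothetical, as are the names SAMPLE, FIELDS, and extract_fields; the patterns must be adapted after inspecting the files you were actually given.

```python
import re

# Hypothetical record layout, invented for illustration only.
# The real dataset may look different, so inspect your own files first.
SAMPLE = (
    '"uuid": "a1b2c3", "url": "http://example.com/x", '
    '"author": "Jane Doe", "published": "2021-03-01T10:15:00", '
    '"text": "Markets rose & fell today."'
)

# The four elements required in the output files.
FIELDS = ("uuid", "author", "text", "published")


def extract_fields(raw):
    """Return a dict with one entry per required field ("" if missing)."""
    record = {}
    for field in FIELDS:
        # Non-greedy capture of everything between the quotes after the key.
        # DOTALL lets a value span several lines.
        pattern = r'"' + re.escape(field) + r'":\s*"(.*?)"'
        match = re.search(pattern, raw, flags=re.DOTALL)
        record[field] = match.group(1) if match else ""
    return record


record = extract_fields(SAMPLE)
```

This simple pattern stops at the first closing quote, so values that themselves contain quotation marks would need a more careful expression.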
The XML and CSV files must have the same structure as the files in the sample folder. Please
note that, because we are dealing with large datasets, manual checking of outputs is impossible,
and output files will be processed and marked automatically. Any deviation from the XML
structure (i.e., sample.xml), e.g., wrong key names (caused by different spelling, different
upper/lower case, etc.), wrong hierarchy, or unhandled XML special characters, will result in
zero for the output mark, as the marking script would fail to load your file. (Hint: you can
use the “xmltodict” package to check that your XML is loadable.) Besides the XML structure,
the following constraints must also be satisfied:
1. Non-English NEWS articles must be filtered out of the dataset; the final XML and CSV
should contain only NEWS articles in the English language. For the sake of consistency,
you must use the langid package to classify the language of a NEWS article.
2. The re, os, and langid packages are the only Python packages that you are allowed
to use for task 1 of this assessment (e.g., “pandas” is not allowed!). Any other
package that must be imported before use is not allowed.
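A minimal sketch of the language filter from constraint 1 follows. It assumes each article is a dict with a "text" key (a hypothetical structure) and relies on langid.classify, which returns a (language code, score) pair. The classify parameter is an addition for illustration, so that the function can be tried with a stand-in classifier.

```python
def keep_english(articles, classify=None):
    """Return only the articles whose text is classified as English."""
    if classify is None:
        # Imported lazily so the function can also be exercised with a stub.
        import langid
        classify = langid.classify
    english = []
    for article in articles:
        # classify returns a (language_code, score) tuple.
        language, _score = classify(article["text"])
        if language == "en":
            english.append(article)
    return english
```

Called as keep_english(articles) it uses langid; passing a different callable is only meant for experimentation.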
Note: Sample input & output might not match. Consider them for format only.
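Because only re, os, and langid may be imported, XML special characters have to be handled by hand. The sketch below is one possible approach using plain string replacement. The function name escape_xml is hypothetical, and whether quotes must be escaped in your output depends on sample.xml, so treat the last two replacements as an assumption.

```python
def escape_xml(value):
    """Replace the five XML predefined special characters with entities."""
    # The ampersand must be replaced first; otherwise the entities produced
    # by the later replacements would themselves be escaped again.
    value = value.replace("&", "&amp;")
    value = value.replace("<", "&lt;")
    value = value.replace(">", "&gt;")
    # Assumption: quotes are escaped too; check sample.xml before relying on it.
    value = value.replace('"', "&quot;")
    value = value.replace("'", "&apos;")
    return value


escaped = escape_xml('Tom & Jerry <"live">')
```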
The output and the documentation will be marked separately in this task, and each carries its own
mark.
Output (50%)
See sample.xml and sample.csv for detailed information about the output structure. The
following must be performed to complete the assessment.
● Designing efficient regular expressions in order to extract the data from your dataset.
● Storing and submitting the extracted data into an XML file, .xml
following the format of sample.xml
● Explaining your code and your methodology in task1_.ipynb
● A pdf file, “task1_.pdf ”. You can first clean all the output in the
jupyter notebook task1_.ipynb and then export it as a pdf file. This
pdf will be passed to Turnitin for plagiarism check.
Methodology (25%)
The report should demonstrate the methodology (including all steps) to achieve the correct
results.
Documentation (25%)
The solution to get the output must be explained in a well-formatted report (with appropriate
sections and subsections). Please remember that the report must explain both the obtained results
and the approach to produce those results. You need to explain both the designed regular
expression and the approach that you have taken in order to design such an expression.
Task 2: Text Pre-Processing (45%)
This assessment touches on the next step of analyzing textual data, i.e., converting the extracted
data into a proper format. In this assessment, you are required to write Python code to preprocess
a set of NEWS articles and convert them into numerical representations (which are suitable for
input into recommender-systems/ information-retrieval algorithms).
The data-set that we provide contains NEWS articles from different categories. Please find
your .tsv file in the folder “task_2” from this link. The .tsv file contains 1000+ NEWS articles.
Your task is to extract and transform the information in the .tsv file by performing the following
tasks:
1. Generate the corpus vocabulary with the same structure as sample_vocab.txt. Please
note that the vocabulary must be sorted alphabetically.
2. For the complete data, calculate the top-100 most frequent unigrams and the top-100 most
frequent bigrams according to the structure of sample_100uni.txt and sample_100bi.txt. If you
have fewer than 100 bigrams, just include the top-n bigrams.
3. Generate the sparse representation (i.e., doc-term matrix) of the .tsv file according to the
structure of the sample_countVec.txt
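The frequency counts in item 2 can be sketched with collections.Counter. The example assumes that "frequent" means raw occurrence counts over already tokenized articles; the function name top_ngrams and the toy documents are hypothetical, and the required file layout still has to follow the sample files.

```python
from collections import Counter


def top_ngrams(token_lists, n, k=100):
    """Return the k most frequent n-grams as (ngram_tuple, count) pairs."""
    counts = Counter()
    for tokens in token_lists:
        # Zipping n shifted views of the list yields consecutive n-grams,
        # and n-grams never cross article boundaries.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Toy data for illustration only.
docs = [["stock", "market", "rose"], ["stock", "market", "fell"]]
top_uni = top_ngrams(docs, n=1, k=2)
top_bi = top_ngrams(docs, n=2, k=1)
```

If there are fewer than k distinct n-grams, most_common simply returns all of them, which matches the "top-n" fallback described in item 2.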
Note: Sample files and sample dataset might not match. Consider them for format only.
Please note that the following steps must be performed (not necessarily in the same order) to
complete the assessment.
1. Keep only the NEWS articles that are in the English language.
2. The word tokenization must use the following regular expression, "\w+(?:[-']\w+)?"
3. The context-independent and context-dependent (with the threshold set to 95%) stop
words must be removed from the vocab. The provided context-independent stop words
list (i.e., stopwords_en.txt) must be used.
4. Tokens should be stemmed using the Porter stemmer. (Be aware that stemming
lowercases tokens by default.)
5. Rare tokens (with the threshold set to less than 5%) must be removed from the vocab.
6. The sparse matrix must be created using CountVectorizer.
7. Tokens with a length of less than 3 should be removed from the vocab.
8. The first 200 meaningful bigrams (i.e., collocations), ranked using the PMI measure,
must be included in the vocab.
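Steps 2, 3, 5, and 7 can be sketched with the standard library alone. The example assumes that both thresholds refer to document frequency (the share of articles containing a token) and that a token exactly on a threshold is kept; both are interpretations, not statements from the brief. The function names and toy data are hypothetical, and the provided stop-word list and stemming are left out here.

```python
import re
from collections import Counter

# The tokenization pattern required by step 2.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)?")


def tokenize(text):
    """Split text into tokens with the required regular expression."""
    return TOKEN_PATTERN.findall(text)


def filter_by_document_frequency(token_lists, low=0.05, high=0.95, min_length=3):
    """Return the sorted vocab after frequency and length filtering.

    Assumption: a token is dropped if it occurs in more than `high` of the
    documents (context-dependent stop word) or in fewer than `low` of them
    (rare token), or if it is shorter than `min_length` characters.
    """
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        # set() counts each token at most once per document.
        doc_freq.update(set(tokens))
    return sorted(
        token
        for token, df in doc_freq.items()
        if low <= df / n_docs <= high and len(token) >= min_length
    )


tokens = tokenize("The state-of-the-art model isn't bad")

# Toy data with a 50% lower threshold so the effect is visible.
docs = [["market", "stock", "up"], ["market", "bond"],
        ["market", "stock"], ["market", "gold"]]
vocab = filter_by_document_frequency(docs, low=0.5, high=0.95)
```

Note that the pattern allows at most one internal hyphen or apostrophe per token, so "state-of-the-art" is split into "state-of" and "the-art".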
Please note that you are allowed to use any Python packages as you see fit to complete the
task 2 of this assessment. The output and the documentation will be marked separately in this
task, and each carries its own mark.
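To make the PMI measure in step 8 concrete, here is a standard-library sketch that scores each bigram as log2 of count(x, y) * N / (count(x) * count(y)), with N taken as the total number of tokens. This is an illustrative formulation rather than the exact computation of any particular library; in practice a collocation finder such as the one in NLTK could be used instead. The function name and toy data are hypothetical.

```python
import math
from collections import Counter


def top_pmi_bigrams(token_lists, k=200):
    """Return the k bigrams with the highest PMI score."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        # Adjacent pairs within one article only.
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        # PMI = log2( P(x, y) / (P(x) * P(y)) ), estimated from raw counts.
        scores[(first, second)] = math.log2(
            count * total / (unigrams[first] * unigrams[second])
        )
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda pair: (-scores[pair], pair))[:k]


# Toy data for illustration only.
docs = [["new", "york", "is", "big"],
        ["new", "york", "is", "old"],
        ["it", "is", "big"]]
best = top_pmi_bigrams(docs, k=1)
```

PMI tends to favour very rare pairs, so a minimum-frequency filter before ranking is a common design choice; whether one is expected here is not stated in the brief.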
Output (50%)
The output of this task must contain the following files:
1. task2_.ipynb which contains your report explaining the code and the
methodology.
2. A pdf file, “task2_.pdf ”. You can first clean all the output in the
jupyter notebook task2_.ipynb and then export it as a pdf file. This
pdf will be passed to Turnitin for plagiarism check.
3. _vocab.txt: It contains the unigram and bigram tokens in the
format of sample_vocab.txt. Words in the vocabulary must be sorted in
alphabetical order.
4. _countVec.txt: Each line in the txt file contains the sparse
representations of one day of the tweet data in the format of sample_countVec.txt
5. _100uni.txt and _100bi.txt : Each line in the txt
file contains the top 100 most frequent uni/bigrams of one day of the tweet data in the
format of sample_100uni.txt and sample_100bi.txt
Similar to task 1, in task 2, any deviation from the sample output structures may result in
receiving zero for the output. So please be careful.
Methodology (25%)
The report should demonstrate the methodology (including all steps) to achieve the correct
results.
Documentation (25%)
The solution to get the output must be explained in a well-formatted report (with appropriate
sections and subsections). Please remember that the report must explain both the obtained results
and the approach to produce those results.
Note: all submissions will be put through plagiarism detection software, which
automatically checks their similarity to other submissions. Any plagiarism
found will trigger the Faculty’s relevant procedures and may result in severe penalties, up
to and including exclusion from the university.
