COSC2820 Advanced Programming for Data Science
COSC 2820/2815
Assignment 2: NLP Web-based Data Application
Milestone I: Natural Language Processing

Assessment Type: Individual assignment. Submit online via Canvas → Assignments → Assignment 2 → Milestone I:
Natural Language Processing. Marks awarded for meeting requirements as closely as possible.
Clarifications/updates may be made via announcements/relevant discussion forums.
Due Date: Week 10, Sunday 3rd Oct 2021, 11:59pm
Marks: 25
1. Overview
Many job hunting websites exist today, including seek.com.au and au.indeed.com. Each of these sites
manages a job search system in which job hunters can search for relevant jobs by keyword, salary, and
category. In previous years, the category of an advertised job was often entered manually by the
advertiser (e.g., the employer), and mistakes were made in category assignment. As a result, jobs placed in
the wrong class did not get enough exposure to relevant candidate groups.
With advances in text analysis, automated job classification has become feasible; and sensible suggestions for
job categories can then be made to potential advertisers. This can help reduce human data entry error,
increase the job exposure to relevant candidates, and also improve the user experience of the job hunting site.
In order to do so, we need an automated job ads classification system that helps to predict the categories of
newly entered job advertisements.
This assessment includes two milestones. The first milestone (NLP) concerns the pipeline from basic text
preprocessing to building text classification models for predicting the category of a given job advertisement.
Then, the second milestone will adopt one of the models that we built in the first milestone, and develop a job
hunting website that allows users to browse existing job advertisements, as well as for employers to create
new job advertisements.
This assessment description is about Milestone 1: Natural Language Processing.
2. Learning Outcomes
This assessment relates to the following learning outcomes of the course:
● CLO 4: Pre-process natural language text data to generate effective feature representations;
● CLO 5: Document and maintain an editable transcript of the data pre-processing pipeline for
professional reporting.
3. Assessment details
In this milestone, you are required to pre-process a collection of job advertisement documents, build machine
learning models for document classification (i.e., classifying the category of a given job advertisement), and
perform evaluation and analysis on the built models.
The Data
In this assignment, you are given a large collection of job advertisement documents (~ 50k jobs). The data
folder is available for download from canvas.
Inside the data folder you will see 8 different subfolders, namely: Accounting_Finance, Engineering,
Healthcare_Nursing, Hospitality_Catering, IT, PR_Advertising_Marketing, Sales and Teaching. Each folder name
is a job category.
The job advertisement text documents of a particular category are located in the corresponding subfolder.
Each job advertisement document is a txt file, named as "Job_.txt". It contains the title, the webindex,
(some will also have information on the company name, some might not), and the full description of the job
advertisement.
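A minimal sketch of how the folder layout described above could be walked, assuming the data folder contains one subfolder per category and that each file name carries a 5-digit id. The function name `list_job_files` and the id-matching pattern are illustrative assumptions, not part of the assignment; the internal layout of each txt file is not specified here, so the sketch does not attempt to parse it.

```python
import re
from pathlib import Path


def list_job_files(data_dir):
    """Return (job_id, category, path) for each .txt file one level below data_dir.

    Assumptions (hypothetical, adjust to the real data):
    - each category is a direct subfolder of data_dir;
    - the job id is the first run of 5 digits in the file name.
    """
    records = []
    for path in sorted(Path(data_dir).glob("*/*.txt")):
        match = re.search(r"\d{5}", path.stem)
        if match is None:
            continue  # skip files that do not look like job advertisements
        # The parent folder name is the job category.
        records.append((match.group(), path.parent.name, path))
    return records
```

Files placed directly in the data folder rather than in a category subfolder (for example a stop words list) are not matched by the `*/*.txt` pattern and are therefore ignored.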
Task 1: Basic Text Pre-processing [5 marks]
In this task, you are required to perform basic text pre-processing on the given dataset, including, but not
limited to tokenization, removing most/less frequent words and stop words, extracting bigrams and
collocations. In this task, we focus on pre-processing the description only. You are required to perform the
following:
1. Extract information from each job advertisement. Perform the following pre-processing steps to the
description of each job advertisement;
2. Tokenize each job advertisement description. The word tokenization must use the following regular
expression, r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?";
3. All the words must be converted into the lower case;
4. Remove words with length less than 2.
5. Remove stopwords using the provided stop words list (i.e., stopwords_en.txt). It is located inside the
same downloaded folder.
6. Remove words that appear only once in the document collection, based on term frequency.
7. Remove the top 50 most frequent words based on document frequency.
8. Extract the top 10 Bigrams based on term frequency, save them as a txt file (refer to the required
output).
9. Save all job advertisement text and information in a txt file (refer to the required output);
10. Build a vocabulary of the cleaned job advertisement descriptions, save it in a txt file (refer to the
required output);
Note:
● All the words removed (in steps 4, 5, 6 and 7) must also be excluded from the generated
vocabulary.
● The output of this task will be checked against the expected output. You should strictly follow the order
of the steps above and the following format requirement.
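The steps above can be sketched as follows. This is a minimal illustration on in-memory strings, not the expected solution: the function name `preprocess`, its parameters, and the tie-breaking behaviour (ties are resolved by first appearance) are assumptions, and bigrams are taken here as adjacent token pairs after all removals, which is one possible reading of step 8. Only the regular expression is taken verbatim from the task description.

```python
import re
from collections import Counter

# Tokenization pattern required by step 2.
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")


def preprocess(descriptions, stopwords, top_df=50, n_bigrams=10):
    """Apply steps 2-8 and 10 to a list of description strings."""
    # Steps 2-3: tokenize, then convert to lower case.
    docs = [[t.lower() for t in TOKEN_PATTERN.findall(text)] for text in descriptions]
    # Step 4: remove words with length less than 2.
    docs = [[t for t in doc if len(t) >= 2] for doc in docs]
    # Step 5: remove stop words.
    docs = [[t for t in doc if t not in stopwords] for doc in docs]
    # Step 6: remove words whose term frequency over the whole collection is 1.
    tf = Counter(t for doc in docs for t in doc)
    docs = [[t for t in doc if tf[t] > 1] for doc in docs]
    # Step 7: remove the top_df words by document frequency
    # (dict.fromkeys de-duplicates each document while keeping a stable order).
    df = Counter(t for doc in docs for t in dict.fromkeys(doc))
    too_common = {word for word, _ in df.most_common(top_df)}
    docs = [[t for t in doc if t not in too_common] for doc in docs]
    # Step 8: most frequent bigrams (adjacent pairs) by term frequency.
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    # Step 10: vocabulary sorted alphabetically, index starting from 0.
    vocab = {w: i for i, w in enumerate(sorted({t for doc in docs for t in doc}))}
    return docs, bigrams.most_common(n_bigrams), vocab
```

Because each step filters the output of the previous one, changing the order changes the result; for instance, removing the most frequent words before the single-occurrence words would compute document frequencies over a different token set.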
Required Output for Task 1:
The output of this task must contain the following files:
● vocab.txt This file contains the unigram vocabulary, one word per line, in the following format:
word_string:word_integer_index. Words in the vocabulary must be sorted in alphabetical order, and the
index value starts from 0. This file is the key to interpreting the sparse encoding. For instance, in the
following example, the word aaron is the 20th word in the vocabulary (so its integer_index is 19). Note
that the index values and words in the following image are artificial and used to demonstrate the
required format only; they do not reflect the values of the actual expected output.
Fig.1 example format for vocab.txt
● bigram.txt This file contains the bigrams found in the whole document collection as well as
their term frequency, separated by a comma (each line contains one bigram). The bigrams are ordered
by term frequency (from high to low). Following is an example of the file format. (Note
that the following image is artificial and used to demonstrate the required format only; it doesn't
reflect the values of the actual expected output.)
Note: Do NOT use these bigrams in the vocabulary and sparse encoding. The vocab.txt only contains
unigrams.
Fig.2 example format for bigram.txt
● job_ads.txt This file contains the job advertisement information and the pre-processed description
text for all the job advertisement documents. Each job advertisement occupies 5 lines in the file:
○ The first line stores the id of the job advertisement document, written in the format of “ID: <5
digit id>”, for instance, “ID: 44128”. The job advertisement id matches the 5 digit part of the file
name of the document.
○ The second line stores the category (name of its parent folder) of the job advertisement,
written in the format of “Category: ”, for instance, “Category: Teaching”.
○ The third line stores the webIndex of the job advertisement, written in the format of
“Webindex: <8 digit web index>”, for instance, “Webindex: 36757414”.
○ The fourth line stores the un-processed title of the job advertisement, in the format of “Title: ”.
○ The fifth line stores the pre-processed description of the job advertisement, in the format of
“Description: <description of the advertisement>”. To do so, you need to rejoin the
tokens of each pre-processed description text into one string, with a space as the delimiter.
Following is an example of the file format. (Note that the following image is artificial and used to
demonstrate the required format only; it doesn't reflect the values of the actual expected output.)
Fig. 3 example format for job_ads.txt
● All Python code related to Task 1 should be written in the jupyter notebook task1.ipynb.
Task 2: Generating Feature Representations for Job Advertisement Descriptions [10 marks]
In this task, you are required to generate different types of feature representations for the collection of
job advertisements. Note that in this task, we will only consider the description of the job
advertisement. The feature representations that you need to generate include the following:
Bag-of-words model:
○ Generate the Count vector representation for each job advertisement description, and save
them into a file (refer to the required output). Note, the generated Count vector
representation must be based on the generated vocabulary in Task 1 (as saved in vocab.txt).
Models based on word embeddings:
○ You are required to generate feature representations of job advertisement descriptions based on
the following language models, respectively:
■ FastText language model trained based on the provided job advertisement
descriptions, with a 200 embedding dimension.
■ Choose 2 out of 3 pre-trained language models: Word2Vec, GoogleNews300, Glove,
with a 200 embedding dimension.
For each of the above-mentioned language models, you are required to build the weighted
(i.e., TF-IDF weighted) and unweighted vector representation for each job advertisement
description.
To summarise, there are 7 different types of feature representation of documents that you need to
build in this task, including count vector, two FastText embeddings (one TF-IDF weighted, and one
unweighted version), four pre-trained embeddings (2 different pre-trained language models, each has
one TF-IDF weighted version, and one unweighted version).
Required Output for Task 2:
● count_vectors.txt This file stores the sparse count vector representation of job advertisement
descriptions in the following format. Each line of this file corresponds to one advertisement. It starts
with a ‘#’ key followed by the webindex of the job advertisement, and a comma ‘,’. The rest of the line
is the sparse representation of the corresponding description in the form of
word_integer_index:word_freq separated by commas. Following is an example of the file format (note
that the following image is artificial and used to demonstrate the required format only; it doesn't
reflect the values of the actual expected output):
Fig. 4 example format for count_vectors.txt
Note: word_freq here refers to the frequency of the unigram in the corresponding Description only, excluding
the title.
Task 3: Job Advertisement Classification [10 marks]
In this task, you are required to build machine learning models for classifying the category of a job
advertisement text. A simple model that you can consider is the logistic regression model from sklearn as
demonstrated in the activities. However, feel free to select other models (even if they have not been covered in
this course).
You are required to conduct two sets of experiments on the provided dataset to investigate the
following two questions, respectively.
Q1: Language model comparisons
Which language model we built previously (based on job advertisement descriptions) performs the best with
the chosen machine learning model? To answer this question, you are required to build machine learning
models based on the feature representations of the documents you generated in Task 2, and to
evaluate the performance of the various models.
Q2: Does more information provide higher accuracy?
In Task 2, we have built a number of feature representations of documents based on job advertisement
descriptions. However, we have not explored other features of a job advertisement, e.g., the title of the job
position. Will adding extra information help to boost the accuracy of the model? To answer this question,
you are required to conduct experiments to build and compare the performance of classification models that
consider:
● only title of the job advertisement
● only description of the job advertisement (which you’ve already done in Task 3)
● both title and description of the job advertisement. For this, you have the flexibility to simply
concatenate title and description of a job advertisement when generating document feature
representation; or to generate separate feature representations for title and description, respectively,
and use both features in the classification models.
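The two ways of combining title and description mentioned in the last option can be sketched as follows, using plain count vectors for illustration. The function names and the choice of dense Python lists are assumptions made for this sketch; the same idea applies to any feature representation, including embeddings.

```python
from collections import Counter


def count_vector(tokens, vocab):
    """Dense count vector of `tokens`, ordered by the word list `vocab`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]  # Counter returns 0 for unseen words


def features_from_joined_text(title_tokens, desc_tokens, vocab):
    """Option 1: treat title + description as one document, then featurize."""
    return count_vector(title_tokens + desc_tokens, vocab)


def features_from_joined_vectors(title_tokens, desc_tokens, title_vocab, desc_vocab):
    """Option 2: featurize each field separately, then concatenate the vectors."""
    return count_vector(title_tokens, title_vocab) + count_vector(desc_tokens, desc_vocab)
```

With option 1 a word contributes to the same feature whether it occurs in the title or in the description; with option 2 the classifier sees title and description occurrences as separate features, at the cost of a longer vector.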
Note that: For both questions above,
● You are required to use appropriate techniques (e.g., projecting samples of the constructed document
embeddings in 2 dimensional space) to understand the nature of the task before building machine
learning models for classification
● When evaluating the performance of the models, you are required to conduct a 5-fold cross validation
to obtain robust comparisons.
All Python code related to Task 2 and 3 should be written in the jupyter notebook task2_3.ipynb.
6. Marking Guidelines
Marking Criteria
● Mechanical pass: Your outputs will be compared against the expected output. Therefore, marking will
be based on the similarity between what we expect (as discussed in the instructions) and what we
receive from you. It is extremely important to carefully follow the instructions to produce the expected
output. Otherwise, you may easily lose many points for simple mistakes (e.g. typos in the format of the
files, not loading essential libraries, different file names/path, etc).
● Expert pass: Your jupyter notebook will be checked by an expert to validate the logic and flow, proper
use of libraries and functions, and clarity of codes, comments, structure and presentation.
● You need to ensure all the codes and files that are required to run your code are included in the
submission. The expert will NOT fix your code’s problem even if it is a simple typo in a file name or an
imported library.
Mark Allocations
● Task 1 Basic Text Pre-processing [5%]
○ Implementation [4%]
○ Notebook presentation [1%], proportional to actual mark obtained in implementation
● Task 2 [10%]
○ Implementation [7%]
○ Notebook presentation [3%], proportional to actual mark obtained in implementation
● Task 3 [10%]
○ Implementation [7%]
○ Notebook presentation [3%], proportional to actual mark obtained in implementation
For Task 1, and Task 2 and 3, you are required to maintain an auditable and editable transcript, and
communicate any justification of methods/approach chosen, results, analysis and findings through jupyter
notebook. The presentation of the jupyter notebook accounts for certain percentages of the allocated mark
for each task, proportional to the actual mark obtained, as specified above. Students can refer to the
activities in modules as examples for the level of detail that they should include in their jupyter notebook.
4. Submission
The final submission of this milestone will consist of:
● The required output from Task 1, including vocab.txt, bigram.txt and job_ads.txt
● The required output from Task 2, count_vectors.txt
● The jupyter notebook of Task 1, and Task 2&3, respectively
● The .py format of the jupyter notebook of Task 1, and Task 2&3, respectively. Note that:
○ the content of the .py file must match your jupyter notebook
○ the .py file will be used for plagiarism detection on both comment/description content, as
well as the actual code.
○ to help promote academic integrity, please make sure you submit the .py files. Submissions
without the .py files or with unmatched .py files will NOT be marked.
○ note that the .py files can be easily downloaded from jupyter notebook interface (File ->
Download as -> Python (.py))
● Put all the above mentioned files in a folder named with your student id, zip the folder with the same
name (i.e., s1234567.zip) and upload for submission
Assessment declaration:
When you submit work electronically, you agree to the assessment declaration:
https://www.rmit.edu.au/students/student-essentials/assessment-and-exams/assessment/assessment-declaration
Late Submission Penalty
Late submissions will incur a 10% penalty on the total marks of the corresponding assessment task per
day or part of day late. Submissions that are late by 5 days or more are not accepted and will be awarded
zero, unless special consideration has been granted. Granted Special Considerations with a new due date set
more than 2 weeks after the original due date will automatically result in an equivalent assessment in the form
of a practical test with interview, assessing the same knowledge and skills of the assignment (location and
time to be arranged by the instructor). Please ensure your submission is correct (all files are there, compiles
etc.); re-submissions after the due date and time will be considered as late submissions.
5. Academic integrity and plagiarism (standard warning)
Academic integrity is about honest presentation of your academic work. It means acknowledging the work of
others while developing your own insights, knowledge and ideas. You should take extreme care that you have:
● acknowledged words, data, diagrams, models, frameworks and/or ideas of others you have quoted (i.e.
directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the
appropriate referencing methods,
● provided a reference list of the publication details so your reader can locate the source if necessary.
This includes material taken from Internet sites.
If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have
passed off the work and ideas of another person without appropriate referencing, as if they were your own.
RMIT University treats plagiarism as a very serious offence constituting misconduct. Plagiarism covers a variety
of inappropriate behaviours, including:
● Failure to properly document a source
● Copyright material from the internet or databases
● Collusion between students
For further information on our policies and procedures, please refer to
https://www.rmit.edu.au/students/student-essentials/rights-and-responsibilities/academic-integrity