COSC 2820/2815 Advanced Programming for Data Science
Assignment 2: NLP Web-based Data Application
Milestone I: Natural Language Processing

Assessment Type: Individual assignment. Submit online via Canvas → Assignments → Assignment 2 → Milestone I: Natural Language Processing. Marks are awarded for meeting the requirements as closely as possible. Clarifications/updates may be made via announcements and the relevant discussion forums.
Due Date: Week 10, Sunday 3rd Oct 2021, 11:59pm
Marks: 25

1. Overview
Nowadays there are many job hunting websites, including seek.com.au and au.indeed.com. These sites all run a job search system, where job hunters can search for relevant jobs based on keywords, salary and categories. In previous years, the category of an advertised job was often manually entered by the advertiser (e.g., the employer), and mistakes were made in category assignment. As a result, jobs placed in the wrong class did not get enough exposure to the relevant candidate groups. With advances in text analysis, automated job classification has become feasible, and sensible category suggestions can then be made to potential advertisers. This can help reduce human data-entry error, increase job exposure to relevant candidates, and improve the user experience of the job hunting site. To do so, we need an automated job-ads classification system that predicts the categories of newly entered job advertisements.

This assessment consists of two milestones. The first milestone (NLP) covers the pipeline from basic text pre-processing to building text classification models for predicting the category of a given job advertisement. The second milestone will then adopt one of the models built in the first milestone and develop a job hunting website that allows users to browse existing job advertisements, and employers to create new ones.
This assessment description is about Milestone I: Natural Language Processing.

2. Learning Outcomes
This assessment relates to the following learning outcomes of the course:
● CLO 4: Pre-process natural language text data to generate effective feature representations;
● CLO 5: Document and maintain an editable transcript of the data pre-processing pipeline for professional reporting.

3. Assessment details
In this milestone, you are required to pre-process a collection of job advertisement documents, build machine learning models for document classification (i.e., classifying the category of a given job advertisement), and perform evaluation and analysis on the built models.

The Data
In this assignment, you are given a large collection of job advertisement documents (~50k jobs). The data folder is available for download from Canvas. Inside the data folder you will see 8 subfolders, namely Accounting_Finance, Engineering, Healthcare_Nursing, Hospitality_Catering, IT, PR_Advertising_Marketing, Sales and Teaching; each folder name is a job category. The job advertisement text documents of a particular category are located in the corresponding subfolder. Each job advertisement document is a txt file, named "Job_<id>.txt". It contains the title, the webindex (some will also have information on the company name, some might not), and the full description of the job advertisement.

Task 1: Basic Text Pre-processing [5 marks]
In this task, you are required to perform basic text pre-processing on the given dataset, including, but not limited to, tokenization, removing most/least frequent words and stop words, and extracting bigrams and collocations. In this task, we focus on pre-processing the description only. You are required to perform the following:
1. Extract information from each job advertisement, and perform the following pre-processing steps on the description of each job advertisement;
2. Tokenize each job advertisement description. The word tokenization must use the following regular expression: r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?";
3. Convert all words to lower case;
4. Remove words with a length of less than 2;
5. Remove stop words using the provided stop-word list (i.e., stopwords_en.txt). It is located inside the same downloaded folder;
6. Remove words that appear only once in the document collection, based on term frequency;
7. Remove the top 50 most frequent words based on document frequency;
8. Extract the top 10 bigrams based on term frequency, and save them as a txt file (refer to the required output);
9. Save all job advertisement text and information in a txt file (refer to the required output);
10. Build a vocabulary of the cleaned job advertisement descriptions, and save it in a txt file (refer to the required output).
Note:
● All words removed in steps 4, 5, 6 and 7 must also be excluded from the generated vocabulary.
● The output of this task will be checked against the expected output. You should strictly follow the order of the steps above and the format requirements below.
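As an illustration, the steps above can be sketched in Python as follows. The descriptions and stop-word set are toy stand-ins: in the actual task, the descriptions come from the data folder, the stop words from stopwords_en.txt, and the document-frequency cut removes the top 50 words rather than the top 1 used here.

```python
import re
from collections import Counter
from itertools import chain

PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Toy stand-ins for the real data folder and stopwords_en.txt.
descriptions = [
    "We are seeking a registered nurse for a busy ward. The nurse will work night shifts.",
    "Seeking an experienced chef. The chef will manage a large kitchen team.",
]
stopwords = {"a", "an", "the", "for", "will", "are", "we"}

def preprocess(text):
    tokens = re.findall(PATTERN, text)                # step 2: regex tokenization
    tokens = [t.lower() for t in tokens]              # step 3: lower case
    tokens = [t for t in tokens if len(t) >= 2]       # step 4: drop words of length < 2
    return [t for t in tokens if t not in stopwords]  # step 5: remove stop words

tokenized = [preprocess(d) for d in descriptions]

# Step 6: remove words appearing only once in the collection (term frequency).
tf = Counter(chain.from_iterable(tokenized))
# Step 7: remove the most frequent words by document frequency
# (top 50 in the task; top 1 here because the toy collection is tiny).
df = Counter(chain.from_iterable(set(toks) for toks in tokenized))
top_n = {w for w, _ in df.most_common(1)}
tokenized = [[t for t in toks if tf[t] > 1 and t not in top_n] for toks in tokenized]

# Step 8: bigrams ranked by term frequency (take the top 10 on the real data).
bigrams = Counter(chain.from_iterable(zip(toks, toks[1:]) for toks in tokenized))

# Step 10: alphabetically sorted vocabulary, integer index starting from 0.
vocab = sorted(set(chain.from_iterable(tokenized)))
vocab_lines = [f"{w}:{i}" for i, w in enumerate(vocab)]
print(vocab_lines)  # → ['chef:0', 'nurse:1']
```

Note how aggressively steps 6 and 7 prune this toy collection; on the real ~50k-document collection the surviving vocabulary remains large. Each vocab_lines entry, written one per line, follows the word_string:word_integer_index format required for vocab.txt.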
Required Output for Task 1:
The output of this task must contain the following files:
● vocab.txt
This file contains the unigram vocabulary, one word per line, in the following format: word_string:word_integer_index. Very importantly, the words in the vocabulary must be sorted in alphabetical order, and the index value starts from 0. This file is the key to interpreting the sparse encoding. For instance, in the following example, the word aaron is the 20th word in the vocabulary (the corresponding integer index is 19). Note that the index values and words in the image are artificial and are used to demonstrate the required format only; they do not reflect the values of the actual expected output.
Fig. 1 example format for vocab.txt
● bigram.txt
This file contains the bigrams found in the whole document collection as well as their term frequencies, separated by a comma (each line contains one bigram). The bigrams are ordered by term frequency (from high to low). Following is an example of the file format (the image is artificial and demonstrates the required format only; it does not reflect the values of the actual expected output).
Note: do NOT use these bigrams in the vocabulary and sparse encoding. vocab.txt contains unigrams only.
Fig. 2 example format for bigram.txt
● job_ads.txt
This file contains the job advertisement information and the pre-processed description text for all the job advertisement documents. Each job advertisement occupies 5 lines in the file:
○ The first line stores the id of the job advertisement document, written in the format “ID: <5 digit id>”, for instance, “ID: 44128”. The job advertisement id matches the 5 digit part of the file name of the document.
○ The second line stores the category (the name of its parent folder) of the job advertisement, written in the format “Category: <category name>”, for instance, “Category: Teaching”.
○ The third line stores the webindex of the job advertisement, written in the format “Webindex: <8 digit web index>”, for instance, “Webindex: 36757414”.
○ The fourth line stores the un-processed title of the job advertisement, in the format “Title: <title string>”.
○ The fifth line stores the pre-processed description of the job advertisement, in the format “Description: <description string>”. To do so, you need to rejoin the tokens of each pre-processed description into one string, with a space as the delimiter.
Following is an example of the file format (the image is artificial and demonstrates the required format only; it does not reflect the values of the actual expected output).
Fig. 3 example format for job_ads.txt
● All Python code related to Task 1 should be written in the jupyter notebook task1.ipynb.

Task 2: Generating Feature Representations for Job Advertisement Descriptions [10 marks]
In this task, you are required to generate different types of feature representations for the collection of job advertisements. Note that in this task, we only consider the description of the job advertisement. The feature representations that you need to generate include the following:
Bag-of-words model:
○ Generate the count vector representation for each job advertisement description, and save them into a file (refer to the required output). Note that the generated count vector representation must be based on the vocabulary generated in Task 1 (as saved in vocab.txt).
Models based on word embeddings:
○ You are required to generate feature representations of the job advertisement descriptions based on the following language models, respectively:
■ A FastText language model trained on the provided job advertisement descriptions, with an embedding dimension of 200;
■ 2 out of the 3 pre-trained language models Word2Vec, GoogleNews300 and GloVe, with an embedding dimension of 200.
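A common way to turn word embeddings into a document-level representation (an assumption here, not mandated by the spec) is to take the mean of the document's word vectors. With gensim, the in-house model could be trained with something like FastText(tokenized_descriptions, vector_size=200) — check your gensim version's API. In the sketch below, a toy 4-dimensional embedding dictionary stands in for the trained 200-dimension model:

```python
import numpy as np

DIM = 4  # 200 in the actual task

# Toy embedding table standing in for a trained FastText / pre-trained model;
# in practice these vectors would come from the model's word-vector lookup.
embeddings = {
    "nurse":   np.array([1.0, 0.0, 0.0, 0.0]),
    "ward":    np.array([0.0, 1.0, 0.0, 0.0]),
    "chef":    np.array([0.0, 0.0, 1.0, 0.0]),
    "kitchen": np.array([0.0, 0.0, 0.0, 1.0]),
}

def doc_vector(tokens):
    """Unweighted document vector: mean of the known word vectors."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

print(doc_vector(["nurse", "ward", "nurse"]))
```

For the TF-IDF weighted variant described next, replace the plain mean with an average weighted by each word's TF-IDF score.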
For each of the above-mentioned language models, you are required to build both a weighted (i.e., TF-IDF weighted) and an unweighted vector representation for each job advertisement description. To summarise, there are 7 different types of document feature representation that you need to build in this task: the count vector, two FastText embeddings (one TF-IDF weighted and one unweighted), and four pre-trained embeddings (2 different pre-trained language models, each with one TF-IDF weighted and one unweighted version).

Required Output for Task 2:
● count_vectors.txt
This file stores the sparse count vector representation of the job advertisement descriptions in the following format. Each line of this file corresponds to one advertisement. It starts with a ‘#’ key followed by the webindex of the job advertisement and a comma ‘,’. The rest of the line is the sparse representation of the corresponding description in the form word_integer_index:word_freq, separated by commas. Following is an example of the file format (the image is artificial and demonstrates the required format only; it does not reflect the values of the actual expected output).
Fig. 4 example format for count_vectors.txt
Note: word_freq here refers to the frequency of the unigram in the corresponding description only, excluding the title.

Task 3: Job Advertisement Classification [10 marks]
In this task, you are required to build machine learning models for classifying the category of a job advertisement text. A simple model that you can consider is the logistic regression model from sklearn, as demonstrated in the activities. However, feel free to select other models (even ones that have not been covered in this course). You are required to conduct two sets of experiments on the provided dataset to investigate the following two questions, respectively.
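As a sketch of the logistic regression baseline mentioned above (assuming sklearn is available), the model can be scored with 5-fold cross-validation, as required for the evaluations in this task. The X and y below are synthetic placeholders for a Task 2 feature matrix and the job-category labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic placeholders: X would be one of the Task 2 representations
# (count vectors or document embeddings), y the job categories.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 0] > 0).astype(int)  # two toy classes; the real task has 8 categories

model = LogisticRegression(max_iter=1000)    # handles multi-class natively
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(scores.mean())
```

Repeating this for each of the 7 feature representations yields directly comparable mean accuracies for the questions below.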
Q1: Language model comparisons
Which of the language models built previously (based on the job advertisement descriptions) performs best with the chosen machine learning model? To answer this question, you are required to build machine learning models based on the feature representations of the documents generated in Task 2, and to evaluate and compare the performance of the various models.

Q2: Does more information provide higher accuracy?
In Task 2, we built a number of feature representations of documents based on the job advertisement descriptions. However, we have not explored other features of a job advertisement, e.g., the title of the job position. Will adding extra information help to boost the accuracy of the model? To answer this question, you are required to conduct experiments to build and compare the performance of classification models that consider:
● only the title of the job advertisement;
● only the description of the job advertisement (which you have already done for Q1);
● both the title and the description of the job advertisement.
For this, you have the flexibility to simply concatenate the title and description of a job advertisement when generating the document feature representation, or to generate separate feature representations for the title and description, respectively, and use both features in the classification models.

Note that, for both questions above:
● You are required to use appropriate techniques (e.g., projecting samples of the constructed document embeddings into a 2-dimensional space) to understand the nature of the task before building machine learning models for classification;
● When evaluating the performance of the models, you are required to conduct a 5-fold cross validation to obtain robust comparisons.
All Python code related to Tasks 2 and 3 should be written in the jupyter notebook task2_3.ipynb.

6. Marking Guidelines
Marking Criteria
● Mechanical pass: Your outputs will be compared against the expected output.
Therefore, marking will be based on the similarity between what we expect (as described in the instructions) and what we receive from you. It is extremely important to follow the instructions carefully to produce the expected output; otherwise, you may easily lose many points for simple mistakes (e.g., typos in the file formats, not loading essential libraries, different file names/paths, etc.).
● Expert pass: Your jupyter notebook will be checked by an expert to validate the logic and flow, the proper use of libraries and functions, and the clarity of the code, comments, structure and presentation.
● You need to ensure that all the code and files required to run your code are included in the submission. The expert will NOT fix problems in your code, even if it is a simple typo in a file name or an imported library.

Mark Allocations
● Task 1 Basic Text Pre-processing [5%]
  o Implementation [4%]
  o Notebook presentation [1%], proportional to the actual mark obtained in the implementation
● Task 2 [10%]
  o Implementation [7%]
  o Notebook presentation [3%], proportional to the actual mark obtained in the implementation
● Task 3 [10%]
  o Implementation [7%]
  o Notebook presentation [3%], proportional to the actual mark obtained in the implementation
For Tasks 1, 2 and 3, you are required to maintain an auditable and editable transcript, and to communicate any justification of the methods/approaches chosen, results, analysis and findings through the jupyter notebook. The presentation of the jupyter notebook accounts for a percentage of the allocated mark for each task, proportional to the actual mark obtained, as specified above. Students can refer to the activities in the modules as examples of the level of detail they should include in their jupyter notebooks.

4. Submission
The final submission of this milestone will consist of:
● The required output from Task 1, including vocab.txt, bigram.txt and job_ads.txt;
● The required output from Task 2, count_vectors.txt;
● The jupyter notebooks of Task 1 and Tasks 2 & 3, respectively;
● The .py format of the jupyter notebooks of Task 1 and Tasks 2 & 3, respectively. Note that:
  ○ the content of the .py files must match your jupyter notebooks;
  ○ the .py files will be used for plagiarism detection on both the comment/description content and the actual code;
  ○ to help promote academic integrity, please make sure you submit the .py files. Submissions without the .py files, or with unmatched .py files, will NOT be marked;
  ○ the .py files can be easily downloaded from the jupyter notebook interface (File → Download as → Python (.py));
● Put all the above-mentioned files in a folder named with your student id, zip the folder with the same name (i.e., s1234567.zip), and upload it for submission.

Assessment declaration: When you submit work electronically, you agree to the assessment declaration:
https://www.rmit.edu.au/students/student-essentials/assessment-and-exams/assessment/assessment-declaration

Late Submission Penalty
Late submissions will incur a 10% penalty on the total marks of the corresponding assessment task per day or part of a day late. Submissions that are late by 5 days or more are not accepted and will be awarded zero, unless special consideration has been granted. Granted special considerations with a new due date set more than 2 weeks after the original due date will automatically result in an equivalent assessment in the form of a practical test with an interview, assessing the same knowledge and skills as the assignment (location and time to be arranged by the instructor). Please ensure your submission is correct (all files are there, the code runs, etc.); re-submissions after the due date and time will be considered late submissions.

5. Academic integrity and plagiarism (standard warning)
Academic integrity is about the honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge and ideas. You should take extreme care that you have:
● acknowledged words, data, diagrams, models, frameworks and/or ideas of others that you have quoted (i.e., directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the appropriate referencing methods;
● provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from Internet sites.
If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have passed off the work and ideas of another person without appropriate referencing, as if they were your own. RMIT University treats plagiarism as a very serious offence constituting misconduct. Plagiarism covers a variety of inappropriate behaviours, including:
● failure to properly document a source;
● copyright material from the internet or databases;
● collusion between students.
For further information on our policies and procedures, please refer to https://www.rmit.edu.au/students/student-essentials/rights-and-responsibilities/academic-integrity