Assignment 3
CS 769 - Spring 2021

The objective of this assignment is to build an inverted index for the Cranfield collection. This index will be used in Assignment 4 to build a vector space retrieval engine. You can implement this project with at most 2 MapReduce jobs. Even though it can be done with only one job, you can choose your own approach as long as the final results are accurate.

The Cranfield collection is in the following directory:

    /CS769/assignment2/cranfield.txt

In order to complete the assignment, you will need to run your MapReduce job(s) and generate the index file.

1 Definitions.

You need to compute the following values for each term in the documents, as defined below:

1.1 Term Frequency

The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.

1.2 Document Frequency

df_t is the number of documents that contain term t.

1.3 Inverse Document Frequency

The idf weight of term t is defined as follows:

    idf_t = log10(N / df_t)    [1.1]

where N is the number of documents in the collection.

2 Schema.

The program has to read the Cranfield collection and generate the following schema for each word in the documents to build the inverted index file:

Listing 1: Inverted Index for word "white".

{"white": {
    "stat": {
        "document_frequency": 2,
        "inverse_document_frequency": 6.5511},
    "documents": {
        "886": {
            "term_frequency": 1,
            "term_freq_normalized_weight": 0.10232},
        "890": {
            "term_frequency": 1,
            "term_freq_normalized_weight": 0.09249}
    }
}}

In the example above, the word "white" is the key in the index file. The value for the key consists of two nested dictionaries, namely stat and documents. The stat dictionary contains the df and idf values for the term. The documents dictionary contains all the documents in which the term has occurred, as well as the frequency of the term in those documents. Therefore, by looking at the document frequency value, we know that the term "white" has appeared in 2 documents.
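Each record in the index file is plain JSON, so one term's entry can be loaded and sanity-checked outside Hadoop. A minimal sketch, using the values copied from Listing 1 (this check is illustrative only and is not part of the required jobs):

```python
import json

# The record for "white", exactly as in Listing 1.
record = json.loads("""
{"white": {
    "stat": {"document_frequency": 2,
             "inverse_document_frequency": 6.5511},
    "documents": {
        "886": {"term_frequency": 1, "term_freq_normalized_weight": 0.10232},
        "890": {"term_frequency": 1, "term_freq_normalized_weight": 0.09249}}}}
""")

stat = record["white"]["stat"]
postings = record["white"]["documents"]

# document_frequency must equal the number of posting entries.
assert stat["document_frequency"] == len(postings)
```

A check like this is a cheap way to verify that your reducer's output matches the required schema before submitting.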
And the documents value will have the list of those two document ids, 886 and 890, and the number of times the word has appeared in those documents.

3 Processing Data.

You can process the Cranfield collection dataset as is, or change its structure in any way that makes it easier for your MapReduce job to process. However, there are a number of requirements that you need to follow. Every term should be case-folded and stemmed with the Porter stemmer. Moreover, stopwords should be removed from the documents; that is, they should not be indexed. You can download the list of stopwords and the Porter stemmer from NLTK:

Listing 2: NLTK Library.

from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

print(stemmer.stem("magnificent"))  #=> magnific
print("himself" in stop_words)      #=> True

And finally, the terms in your inverted index file should be listed in lexicographical order, that is, sorted alphabetically.

4 Submission.

The assignment will be graded based on the MapReduce job(s) that you implement and on successfully generating the final inverted index file in your HDFS directory.

Appendices

A HDFS.

For more HDFS options, visit the following site:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

B Yarn.

For more Yarn options, visit the following site:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html
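Putting the pieces together, the end-to-end computation can be sketched in plain Python before being split across your mapper(s) and reducer(s). This is a hedged sketch, not a MapReduce implementation: for brevity, the stopword set below is a tiny stand-in for NLTK's list and stemming is omitted, whereas the actual assignment must use stopwords.words('english') and PorterStemmer as shown in Listing 2. The term_freq_normalized_weight field is also left out, since the handout does not fix its formula.

```python
import math
import re
from collections import defaultdict

# Tiny stand-in for NLTK's English stopword list (illustrative only).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "himself"}

def tokenize(text):
    """Case-fold, split on non-letters, and drop stopwords.
    (The real pipeline must also apply PorterStemmer.)"""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def build_index(docs):
    """docs: dict mapping doc_id -> text.
    Returns one Listing-1-style record per term, with tf, df, and idf."""
    n = len(docs)                       # N: number of documents
    tf = defaultdict(dict)              # term -> {doc_id: count}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            tf[term][doc_id] = tf[term].get(doc_id, 0) + 1

    index = {}
    for term in sorted(tf):             # lexicographical order, as required
        df = len(tf[term])
        index[term] = {
            "stat": {
                "document_frequency": df,
                # idf_t = log10(N / df_t), formula [1.1]
                "inverse_document_frequency": round(math.log10(n / df), 4),
            },
            "documents": {
                doc_id: {"term_frequency": count}
                for doc_id, count in tf[term].items()
            },
        }
    return index
```

In an actual MapReduce job, the tokenize step runs in the mapper, which emits (term, doc_id) pairs; the reducer then sees all postings for one term at a time and can compute df, idf, and the final record directly.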