Assignment 4
CS 769 - Spring 2021

In this project, we will use the index that you built for the Cranfield collection to find the ranked list of documents for each query. The query list is in:

/CS769/assignment4/cran_query.txt

To complete the assignment you will need to run your MapReduce job(s) and generate the highest-ranked document for each query.

1 Cosine Measure

You are required to implement the cosine measure by computing normalized tf*idf as discussed in class. Given the inverted index from the previous project, you have all the information needed to compute the above score for every query and every document that contains the words in the query. You need to follow the MapReduce paradigm to distribute the computation. There are a number of ways to achieve this; below is one of these approaches:

1.1 Mapper

The mappers need to read the query file as well as the inverted index. You can provide the index file using the following option in your yarn command while submitting the job:

-cacheFile '/user/
/out3/part-00000#file'

where /user//out3/part-00000 is the path of the file and "file" is going to be the name of the file when you access it in your code:

Listing 1: Read Cached File.

f = open('file')  # 'file' is the name given after '#' in the -cacheFile option

Once a mapper has read a query and the inverted index, it can follow the steps below to generate tuples that will be read by the reducers:

1. Use the Porter stemmer and the list of stop words to stem the query terms and remove the stop words.
2. Find the term frequency of each term in the query to calculate tf.
3. Find the idf for each query term from your index.
4. Add the document list containing each query term to a master list.
5. Generate the following structure for each query:

Listing 2: Mapper output for each query_id and document_id.

{query_id: {
    document_id: {
        term_1: {term_weighted_tf: x_1, term_normalized_weighted_tf: y_1},
        term_2: {term_weighted_tf: x_2, term_normalized_weighted_tf: y_2}
    }
}}

Remember, you need to print the above structure for each query_id and each of its document_ids. In other words, if a query_id has 5 documents that contain its words, the mapper should output the above structure 5 times, each with the same query_id but a different document_id and its associated terms.

1.2 Reducer

The reducer's job is to aggregate the outputs of the mappers by a key that you specify. In this approach the reducer has to calculate the similarity score between a query and a document. Looking at the output of the mapper, we notice that query_id is the key, and therefore all the tuples with the same key will be given to a specific reducer. This means that the reducer can calculate the similarity score between a query and all the document ids provided, and finally rank them. To do that, the reducer goes through each document_id for a given query_id and calculates the score using the cosine measure. These scores can be put into a data structure of your choice and sorted by score.
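The scoring and ranking step just described can be sketched as follows. This is a minimal illustration, not a full streaming reducer: it assumes the Listing 2 records for one query_id have already been parsed into a hypothetical `records` dictionary mapping each document_id to per-term (query-side weight, document-side normalized weight) pairs, and that with both vectors length-normalized the cosine measure reduces to a dot product.

```python
# Sketch only: score and rank the candidate documents for one query_id.
# `records`: document_id -> {term: (query_weight, doc_normalized_weight)},
# a hypothetical parsed form of the Listing 2 mapper output.

def cosine_score(term_weights):
    # Dot product over the shared query terms; with both vectors already
    # length-normalized, this dot product is the cosine similarity.
    return sum(q_w * d_w for (q_w, d_w) in term_weights.values())

def rank_documents(records):
    scores = {doc_id: cosine_score(tw) for doc_id, tw in records.items()}
    # Sort (document_id, score) pairs by descending score.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

`rank_documents(records)[0]` would then be the (document_id, score) pair to emit for this query_id.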
Finally, the reducer has to output the highest-scoring document for each query_id:

Listing 3: Reducer output for each query and its most similar document.

{
  query_id_1: {document_id: score},
  query_id_2: {document_id: score},
  query_id_3: {document_id: score},
  ....
}

2 Submission.

The assignment will be graded based on the MapReduce job that you have to implement, and on successfully generating the final score file in your HDFS directory.

Appendices

A HDFS.

For more HDFS options visit the following site:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

B Yarn.

For more Yarn options visit the following site:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html
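For reference, the term_weighted_tf and term_normalized_weighted_tf values named in Listing 2 (Section 1.1) can be computed as in the sketch below. It assumes the common (1 + log10 tf) * idf weighting with cosine (Euclidean length) normalization; this is an assumption, so use whichever exact definitions were given in class.

```python
import math

def weighted_tfidf(tf_counts, idf):
    # tf_counts: term -> raw frequency in the query or document.
    # idf: term -> inverse document frequency, taken from the inverted index.
    # Assumed weighting: (1 + log10(tf)) * idf for each term.
    weights = {t: (1 + math.log10(tf)) * idf[t] for t, tf in tf_counts.items()}
    # Normalize by the Euclidean vector length so that dot products
    # between two normalized vectors become cosine similarities.
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}
```

For a single-term vector the normalized weight is always 1.0, regardless of tf and idf, since the vector is divided by its own length.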