Assignment 4
CS 769 - Spring 2021
In this project, we will use the index that you built for the Cranfield collection to find the ranked
list of documents for each query. The query list is in:
/CS769/assignment4/cran_query.txt
In order to complete the assignment, you will need to run your MapReduce job(s) and generate
the highest-ranked document for each query.
1 Cosine Measure
You are required to implement the cosine measure by computing normalized tf*idf weights, as
discussed in class.
Given the inverted index from the previous project, you have all the information needed to compute
the above score for every query and every document that contains the words in the query. You
need to follow the MapReduce paradigm to distribute the computation. There are a number of
ways to achieve this; below is one such approach:
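As a point of reference, the normalized tf*idf cosine score could be computed roughly as follows. This is a minimal single-machine sketch, not the required implementation: the function name, the `1 + log10(tf)` query weighting, and the assumption that document weights arrive already normalized are all illustrative choices.

```python
import math

def cosine_score(query_terms, doc_term_weights, idf):
    """Cosine similarity between a query and one document.

    query_terms: list of (already stemmed) query tokens
    doc_term_weights: {term: normalized weighted tf in the document}
    idf: {term: inverse document frequency from the index}
    """
    # Raw term frequency of each term in the query
    q_tf = {}
    for t in query_terms:
        q_tf[t] = q_tf.get(t, 0) + 1
    # Weighted tf*idf for the query (1 + log10(tf) is one common weighting)
    q_weights = {t: (1 + math.log10(tf)) * idf.get(t, 0.0)
                 for t, tf in q_tf.items()}
    # Normalize the query vector so the result is a true cosine
    norm = math.sqrt(sum(w * w for w in q_weights.values())) or 1.0
    q_weights = {t: w / norm for t, w in q_weights.items()}
    # Dot product over shared terms (document side assumed pre-normalized)
    return sum(w * doc_term_weights.get(t, 0.0) for t, w in q_weights.items())
```

If your index stores raw document term frequencies instead of pre-normalized weights, you would apply the same weighting and normalization on the document side before taking the dot product.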
1.1 Mapper
The mappers need to read the query file as well as the inverted index. You can provide the
index file using the following option in your yarn command when submitting the job:
-cacheFile '/user//out3/part-00000#file'
where /user//out3/part-00000 is the path of the file and file is going
to be the name of the file when you access it in your code:
Listing 1: Read Cached File.
# 'file' is the local symlink name given after '#' in -cacheFile
f = open('file')
Once a mapper has read a query and the inverted index, it can follow the steps below to
generate tuples that will be read by the reducers:
1. Use the Porter stemmer and the list of the stop words to stem and remove the stop words.
2. Find the term frequency of each term in the query to calculate tf.
3. Find the idf for each query term from your index.
4. Add the document list containing each query term to a master list.
5. Generate the following structure for each query:
Listing 2: Mapper output for each query_id and document_id.
{query_id: {
document_id: {
term_1: {
term_weighted_tf: x_1,
term_normalized_weighted_tf: y_1},
term_2: {
term_weighted_tf: x_2,
term_normalized_weighted_tf: y_2}
}
}
Remember, you need to print the above structure for each query_id and its document_ids.
In other words, if a query_id has 5 documents that contain its words, the mapper should output
the above structure 5 times, each with the same query_id but a different document_id and its
associated terms.
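The mapper steps above might be sketched as follows. This is a simplified, single-process illustration: the stop-word list is a placeholder, the `stem` stand-in merely lowercases (the assignment expects the Porter stemmer), and the index layout `{term: {'idf': ..., 'postings': {doc_id: weight}}}` is an assumption you should adapt to your own index format.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a", "and", "to", "in"}  # placeholder list

def stem(word):
    # Stand-in for the Porter stemmer required by the assignment
    return word.lower()

def map_query(query_id, query_text, index):
    """Produce one record per (query_id, document_id) pair.

    index: {term: {'idf': float, 'postings': {doc_id: normalized_weighted_tf}}}
    (an assumed layout; adapt to your own index format).
    """
    # Steps 1-2: stem, drop stop words, count term frequencies in the query
    terms = [stem(w) for w in query_text.split() if w.lower() not in STOP_WORDS]
    tf = Counter(terms)
    # Step 4: master list of all documents containing any query term
    docs = set()
    for t in tf:
        docs.update(index.get(t, {}).get('postings', {}))
    # Step 5: one output record per document in the master list
    records = []
    for doc_id in sorted(docs):
        entry = {}
        for t in tf:
            postings = index.get(t, {}).get('postings', {})
            if doc_id in postings:
                entry[t] = {
                    'term_weighted_tf': tf[t],
                    'term_normalized_weighted_tf': postings[doc_id],
                }
        records.append({query_id: {doc_id: entry}})
    return records
```

In a streaming mapper you would emit each record with something like `print(json.dumps(record))` rather than returning a list; step 3 (looking up idf for each query term) would be folded in wherever your scoring needs it.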
1.2 Reducer
The reducer's job is to aggregate the outputs of the mappers by a key that you specify. In this
approach the reducer has to calculate the similarity score between a query and a document.
Looking at the output of the mapper, we notice that query_id is the key, and therefore all the
tuples with the same key will be given to a specific reducer. This means that the reducer can
calculate the similarity score between a query and all the document ids provided and finally
rank them.
To do that, the reducer goes through each document_id for a given query_id and
calculates the score using the cosine measure. These scores can be put into a data structure
of your choice and sorted by score.
Finally, the reducer has to output, for each query_id, the document_id with the highest score:
Listing 3: Reducer output for query and its most similar document.
{
query_id_1: {document_id: score},
query_id_2: {document_id: score},
query_id_3: {document_id: score},
....
}
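The reducer logic can be sketched as below, assuming the mapper emits one JSON record per line in the Listing 2 structure. The scoring shown here (a dot product of the two weight fields) is only a placeholder; substitute your actual cosine computation.

```python
import json

def reduce_query(lines):
    """Keep the single highest-scoring document per query_id.

    lines: mapper output lines, each a JSON record of the form
    {query_id: {doc_id: {term: {'term_weighted_tf': x,
                                'term_normalized_weighted_tf': y}}}}.
    """
    best = {}  # query_id -> (doc_id, score)
    for line in lines:
        rec = json.loads(line)
        for query_id, docs in rec.items():
            for doc_id, terms in docs.items():
                # Placeholder score: dot product of the two weight fields;
                # replace with your cosine measure
                score = sum(w['term_weighted_tf'] *
                            w['term_normalized_weighted_tf']
                            for w in terms.values())
                if query_id not in best or score > best[query_id][1]:
                    best[query_id] = (doc_id, score)
    # Shape the result like Listing 3
    return {q: {d: s} for q, (d, s) in best.items()}
```

Keeping only the running maximum per query_id avoids holding all scores in memory; if you need a full ranked list instead, collect the (doc_id, score) pairs and sort them by score before emitting.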
2 Submission.
The assignment will be graded based on the MapReduce job that you implement and on
successfully generating the final score file in your HDFS directory.
Appendices
A HDFS.
For more HDFS options visit the following site:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
B Yarn.
For more Yarn options visit the following site:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html