Assignment 3
CS 769 - Spring 2021
The objective of this assignment is to build an inverted index for the Cranfield collection. This
index will be used in assignment 4 to build a vector space retrieval engine.
You may implement this project with at most 2 MapReduce jobs. It can be done with a single
job, but as long as the final results are accurate, you can choose your own approach.
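As one possible shape for a single-job solution (a local sketch only, not a required design; the function names and toy corpus below are illustrative), the mapper can emit (term, doc_id) pairs and the reducer can aggregate the postings for each term:

```python
from collections import Counter, defaultdict

def mapper(doc_id, text):
    """Emit one (term, doc_id) pair per term occurrence (illustrative;
    the real job would also stem and drop stopwords, see Section 3)."""
    for term in text.lower().split():
        yield term, doc_id

def reducer(term, doc_ids):
    """Aggregate postings for one term: df and per-document tf."""
    tf = Counter(doc_ids)
    return term, {"document_frequency": len(tf), "documents": dict(tf)}

# Simulate the shuffle phase locally on a toy two-document corpus.
docs = {"886": "white wing white", "890": "white tail"}
shuffled = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in mapper(doc_id, text):
        shuffled[term].append(d)
index = dict(reducer(t, ds) for t, ds in sorted(shuffled.items()))
```

The `sorted` call mirrors the lexicographical-order requirement in Section 3; in a real Hadoop job the framework's shuffle and sort replaces the local simulation above.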
The Cranfield collection is in the following directory:
/CS769/assignment2/cranfield.txt
In order to complete the assignment you will need to run your MapReduce job(s) and generate
the index file.
1 Definitions.
You need to compute the following values for each term in the documents, as defined below:
1.1 Term Frequency
The term frequency tf_{t,d} of term t in document d is defined as the number of times that t
occurs in d.
1.2 Document Frequency
The document frequency df_t is the number of documents that contain term t.
1.3 Inverse Document Frequency
The idf weight of term t is defined as follows:
idf_t = log10(N / df_t) [1.1]
where N is the number of documents in the collection.
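As a small sketch of equation [1.1] (the collection size 1400 below is only an illustrative value, not something the assignment specifies):

```python
import math

def idf(N, df):
    """Inverse document frequency per equation [1.1]: log10(N / df)."""
    return math.log10(N / df)

# e.g. a term that appears in 2 documents of an N = 1400 collection:
print(round(idf(1400, 2), 4))  # → 2.8451
```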
2 Schema.
The program has to read the Cranfield collection and generate the following schema for each
word in the documents to build the inverted index file:
Listing 1: Inverted Index for word White.
{"white": {
"stat": {
"document_frequency": 2,
"inverse_document_frequency": 6.5511},
"documents": {
"886": {
"term_frequency": 1,
"term_freq_normalized_weight": 0.10232},
"890": {
"term_frequency": 1,
"term_freq_normalized_weight": 0.09249}
}
}
}
In the example above, we have the word white as the key in the index file. The value for
the key is two nested dictionaries, namely stat and documents. The stat dictionary will
contain the df and idf values for the term. The documents dictionary lists all the documents
in which the term occurs, as well as the frequency of the term in those documents.
Therefore, by looking at the document frequency value we know that the term white appears
in 2 documents, and the documents dictionary holds those two document ids, 886 and 890,
along with the number of times the word appears in each.
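Putting the pieces together, one way to assemble a term's entry in this schema might look like the sketch below. The `term_entry` and `normalize` names are hypothetical, and the weighting formula behind `term_freq_normalized_weight` is deliberately left as a caller-supplied function, since this document does not define it (an identity placeholder is passed in the usage line):

```python
import json
import math

def term_entry(term, postings, N, normalize):
    """Build the nested schema of Listing 1 for one term.

    postings: {doc_id: term_frequency}. normalize(tf, doc_id) computes
    term_freq_normalized_weight; the exact formula is not fixed here.
    """
    df = len(postings)
    return {term: {
        "stat": {
            "document_frequency": df,
            "inverse_document_frequency": round(math.log10(N / df), 4),
        },
        "documents": {
            doc: {"term_frequency": tf,
                  "term_freq_normalized_weight": normalize(tf, doc)}
            for doc, tf in sorted(postings.items())
        },
    }}

# Identity normalization as a stand-in; substitute the real weighting.
entry = term_entry("white", {"886": 1, "890": 1}, N=1400,
                   normalize=lambda tf, doc: tf)
print(json.dumps(entry, indent=2))
```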
3 Processing Data.
You can process the Cranfield collection dataset as is, or change its structure in any way
that makes it easier for your MapReduce job to process. However, there are a number of
requirements that you need to follow.
Every term should be casefolded and stemmed with the Porter stemmer. Moreover, stopwords
should be removed from the documents; that is, they should not be indexed. You can obtain
the list of stopwords and the Porter stemmer from NLTK:
Listing 2: NLTK Library.
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
print(stemmer.stem("magnificent")) #=> magnific
print("himself" in stop_words) #=> True
And finally, the terms in your inverted index file should be listed in lexicographical order, that
is, sorted alphabetically.
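A minimal sketch of these requirements, with stand-ins for the NLTK pieces so it is self-contained (in the actual job, use the full stopword list and PorterStemmer from Listing 2; the tiny stopword set and identity `stem` below are placeholders, not real behavior):

```python
import re

# Stand-in stopword set for illustration only; the real list comes
# from nltk.corpus.stopwords as shown in Listing 2.
STOP_WORDS = {"the", "a", "of", "in", "and", "himself"}

def stem(token):
    # Placeholder: substitute PorterStemmer().stem from Listing 2.
    return token

def preprocess(text):
    """Casefold, tokenize, drop stopwords, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

# Distinct index terms, in the required lexicographical order:
terms = sorted(set(preprocess("The White wing of the aircraft")))
print(terms)  # → ['aircraft', 'white', 'wing']
```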
4 Submission.
The assignment will be graded on the MapReduce job(s) that you implement and on successfully
generating the final inverted index file in your HDFS directory.
Appendices
A HDFS.
For more HDFS options visit the following site:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
B Yarn.
For more Yarn options visit the following site:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnCommands.html