程序代写案例-CS2431

欢迎使用51辅导,51作业君孵化低价透明的学长辅导平台,服务保持优质,平均费用压低50%以上! 51fudao.top
MapReduce Emulator
CS2431 Lab5
MapReduce overview
• One of Google’s key systems
• MapReduce: Simplified Data
Processing on Large Clusters. Jeffrey
Dean and Sanjay Ghemawat. OSDI
2004
– SIGOPS Hall of Fame Award 2015
– More than 10K citations so far.
• Both authors won ACM SIGOPS
Mark Weiser Award in 2012.
Lab5 overview
• Emulate the execution of “wordcount”, a
classic MapReduce application
• Use the bounded_buffer built in Lab 4
– You are encouraged to use your own
implementation from Lab 4
– We provide one correct implementation
• Create producer and consumer threads
Wordcount
• Problem: given a text file, count the number
of occurrence of each word
• Example: For “AA BBB AAA AA BB BBB”, output
“AA:2, BBB:2, AAA:1, BB:1”
• How would you solve the problem?
– A sequential version of solution is provided.
Wordcount
• It is not hard if the text file is small
– You should be able to solve it.
• Not so simple if you need to handle a big
document
– Google needs to do this for all webpages.
– It cannot be handled by a single machine.
MapReduce wordcount
• A number of mapper and reducer processes
running on different machines
• Mapper: parses a file into multiple words, and
sends words to corresponding reducers.
• Reducer: collects and counts words
• Key: same words are mapped to the same
reducer
MapReduce wordcount
Mapper
A-G
H-T
U-Z
Reducer
AA BB
AA XX
XX YY
YY BB
MapReduce wordcount
Mapper
A-G
H-T
U-Z
Reducer
AA BB
AA XX
XX YY
YY BB
AA:1 BB:1
AA:1
XX:1
XX:1 YY:1
BB:1
YY:1
MapReduce wordcount
Mapper
A-G
H-T
U-Z
Reducer
AA BB
AA XX
XX YY
YY BB
AA:1 BB:1
AA:1
XX:1
XX:1 YY:1
BB:1
YY:1
AA:2
BB:2
XX:2
YY:2
MapReduce wordcount
Mapper
A-G
H-T
U-Z
Reducer
AA BB
AA XX
XX YY
YY BB
AA:1 BB:1
AA:1
XX:1
XX:1 YY:1
BB:1
YY:1
AA:2
BB:2
XX:2
YY:2
The real MapReduce is more complicated than that: it has a
shuffling phase to sort intermediate results, but it is not
necessary for wordcount, so let’s ignore it.
MapReduce wordcount
• Each mapper and reducer is a single process.
They run on different machines.
• They communicate by using TCP/IP.
• Additional problems:
– What to do if a machine crashes or becomes slow.
– What to do if multiple users submit jobs.
– ……
Lab5: A simplified MapReduce
• Each mapper and reducer is a thread. They
run in the same process.
• They communicate by using bounded_buffer.
– Mapper is a producer; Reducer is a consumer.
• Ignore additional problems
Lab5: A simplified MapReduce
Mapper/
Producer
Reducer/
Consumer
AA BB
AA XX
XX YY
YY BB
bounder_buffer
Lab5: Your job
• Create mapper threads.
– Each mapper thread will need to process a string.
• Create reducer threads.
– Each reducer threads is responsible for a
bounded_buffer.
• When a mapper thread parses a word, put it
in the corresponding bounded_buffer
• When a reducer thread gets a word from its
bounded_buffer, update its count.
Lab5: Your job
• You need to implement the following function,
defined in word_count.h:
• void wordcount(int m, int r, char** docs)
– m: number of mapper threads
– r: number of reducer threads
– docs: m strings, one for each mapper
• A string is composed of only “a”-“z” and space.
• The main.c is provided. You do not need to
modify that.
Lab5: Your job
• Tests:
– One mapper, one reducer
– Three mappers, one reducer
– One mapper, two reducers
– Two mappers and two reducers
– Two mappers and three reducers
– Three mappers and two reducers
– ……
Additional requirements and hints
• You are not allowed to use any global
variables in Lab5, except bounded buffers.
– Reason: MapReduce actually runs on different
machines, so they cannot share variables.
• There should be no memory leak when your
program exits.
– Try valgrind.
If you want some challenges …
• I provide a sequential version of word_count.c
and some tips after this slide
• Most challenging: do not look at the
sequential version and any tips
• Median challenging: look at the sequential
version but no tips
• Least challenging: look at both
• No difference for your score. It’s up to you.
Spoiler after this slide.
Tips about Mapper
• Main function needs to pass a string to each
mapper
• Mapper needs to parse the string into words.
– Note: strtok is not thread-safe. Try to find why and
find the thread-safe version.
• Mapper gives a word to a reducer by pushing the
word into the reducer’s bounded_buffer
– Question: which reducer to give a word to?
Map a word to a Reducer
• Requirement: same words must be mapped to
the same reducer
• Simple solution: let each reducer handle a range
of words.
– E.g. Reducer1 for words starting with “a”, Reducer2
for “b”, …
• Any problem with this approach?
– Load imbalance: number of words starting with “x” is
much smaller than that of “s”. As a result, some
reducers are doing much more work than others.
Map a word to a Reducer
• Common solution in practice:
– Compute a hash for a word. This essentially converts a
word into an integer.
– Map the word to Reducer hash%(number of reducers)
• How to compute hash?
– This is out of the scope of this lab.
– hashmap.c provides a hash function crc32. You can
use it directly.
Tips about Reducer
• Each reducer is responsible for a bounded buffer.
• Each reducer has a local hashmap to count words
• Like a consumer, a Reducer keeps popping words
from the bounded buffer and updating the
hashmap
• When all words are processed, a Reducer outputs
counts in the hashmap.
– Question: how to know all words are processed?
A legacy question from Lab 4
• How can a consumer/reducer know that all
producers/mappers have finished their jobs?
– We use sleep in Lab4, which is obviously not a
good solution.
• Solution: a producer/mapper can send a
special message to all bounded_buffers,
indicating this is the last message from the
producer/mapper
A legacy question from Lab 4
• What can be a “special message”?
– In wordcount, anything that is not a valid word
can work as a special message.
• When can a reducer output its hashmap?
– After it receives the special messages from all
mappers.
Details of Reducer
• While(true)
• Pop a word from the queue
• If it is a normal word, update hashmap
• If it is the special message, increment
“finishedMappers” by 1
• If finishedMappers==m, output hashmap and
break

欢迎咨询51作业君
51作业君

Email:51zuoyejun

@gmail.com

添加客服微信: abby12468