MapReduce Emulator CS2431 Lab5 MapReduce overview • One of Google’s key systems • MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI 2004 – SIGOPS Hall of Fame Award 2015 – More than 10K citations so far. • Both authors won ACM SIGOPS Mark Weiser Award in 2012. Lab5 overview • Emulate the execution of “wordcount”, a classic MapReduce application • Use the bounded_buffer built in Lab 4 – You are encouraged to use your own implementation from Lab 4 – We provide one correct implementation • Create producer and consumer threads Wordcount • Problem: given a text file, count the number of occurrence of each word • Example: For “AA BBB AAA AA BB BBB”, output “AA:2, BBB:2, AAA:1, BB:1” • How would you solve the problem? – A sequential version of solution is provided. Wordcount • It is not hard if the text file is small – You should be able to solve it. • Not so simple if you need to handle a big document – Google needs to do this for all webpages. – It cannot be handled by a single machine. MapReduce wordcount • A number of mapper and reducer processes running on different machines • Mapper: parses a file into multiple words, and sends words to corresponding reducers. • Reducer: collects and counts words • Key: same words are mapped to the same reducer MapReduce wordcount Mapper A-G H-T U-Z Reducer AA BB AA XX XX YY YY BB MapReduce wordcount Mapper A-G H-T U-Z Reducer AA BB AA XX XX YY YY BB AA:1 BB:1 AA:1 XX:1 XX:1 YY:1 BB:1 YY:1 MapReduce wordcount Mapper A-G H-T U-Z Reducer AA BB AA XX XX YY YY BB AA:1 BB:1 AA:1 XX:1 XX:1 YY:1 BB:1 YY:1 AA:2 BB:2 XX:2 YY:2 MapReduce wordcount Mapper A-G H-T U-Z Reducer AA BB AA XX XX YY YY BB AA:1 BB:1 AA:1 XX:1 XX:1 YY:1 BB:1 YY:1 AA:2 BB:2 XX:2 YY:2 The real MapReduce is more complicated than that: it has a shuffling phase to sort intermediate results, but it is not necessary for wordcount, so let’s ignore it. MapReduce wordcount • Each mapper and reducer is a single process. They run on different machines. • They communicate by using TCP/IP. • Additional problems: – What to do if a machine crashes or becomes slow. – What to do if multiple users submit jobs. – …… Lab5: A simplified MapReduce • Each mapper and reducer is a thread. They run in the same process. • They communicate by using bounded_buffer. – Mapper is a producer; Reducer is a consumer. • Ignore additional problems Lab5: A simplified MapReduce Mapper/ Producer Reducer/ Consumer AA BB AA XX XX YY YY BB bounder_buffer Lab5: Your job • Create mapper threads. – Each mapper thread will need to process a string. • Create reducer threads. – Each reducer threads is responsible for a bounded_buffer. • When a mapper thread parses a word, put it in the corresponding bounded_buffer • When a reducer thread gets a word from its bounded_buffer, update its count. Lab5: Your job • You need to implement the following function, defined in word_count.h: • void wordcount(int m, int r, char** docs) – m: number of mapper threads – r: number of reducer threads – docs: m strings, one for each mapper • A string is composed of only “a”-“z” and space. • The main.c is provided. You do not need to modify that. Lab5: Your job • Tests: – One mapper, one reducer – Three mappers, one reducer – One mapper, two reducers – Two mappers and two reducers – Two mappers and three reducers – Three mappers and two reducers – …… Additional requirements and hints • You are not allowed to use any global variables in Lab5, except bounded buffers. – Reason: MapReduce actually runs on different machines, so they cannot share variables. • There should be no memory leak when your program exits. – Try valgrind. If you want some challenges … • I provide a sequential version of word_count.c and some tips after this slide • Most challenging: do not look at the sequential version and any tips • Median challenging: look at the sequential version but no tips • Least challenging: look at both • No difference for your score. It’s up to you. Spoiler after this slide. Tips about Mapper • Main function needs to pass a string to each mapper • Mapper needs to parse the string into words. – Note: strtok is not thread-safe. Try to find why and find the thread-safe version. • Mapper gives a word to a reducer by pushing the word into the reducer’s bounded_buffer – Question: which reducer to give a word to? Map a word to a Reducer • Requirement: same words must be mapped to the same reducer • Simple solution: let each reducer handle a range of words. – E.g. Reducer1 for words starting with “a”, Reducer2 for “b”, … • Any problem with this approach? – Load imbalance: number of words starting with “x” is much smaller than that of “s”. As a result, some reducers are doing much more work than others. Map a word to a Reducer • Common solution in practice: – Compute a hash for a word. This essentially converts a word into an integer. – Map the word to Reducer hash%(number of reducers) • How to compute hash? – This is out of the scope of this lab. – hashmap.c provides a hash function crc32. You can use it directly. Tips about Reducer • Each reducer is responsible for a bounded buffer. • Each reducer has a local hashmap to count words • Like a consumer, a Reducer keeps popping words from the bounded buffer and updating the hashmap • When all words are processed, a Reducer outputs counts in the hashmap. – Question: how to know all words are processed? A legacy question from Lab 4 • How can a consumer/reducer know that all producers/mappers have finished their jobs? – We use sleep in Lab4, which is obviously not a good solution. • Solution: a producer/mapper can send a special message to all bounded_buffers, indicating this is the last message from the producer/mapper A legacy question from Lab 4 • What can be a “special message”? – In wordcount, anything that is not a valid word can work as a special message. • When can a reducer output its hashmap? – After it receives the special messages from all mappers. Details of Reducer • While(true) • Pop a word from the queue • If it is a normal word, update hashmap • If it is the special message, increment “finishedMappers” by 1 • If finishedMappers==m, output hashmap and break
欢迎咨询51作业君