BBAZ16602 - MIDTERM (Full Mark: 110, Bonus: 10) 48 + 3 HOURS (Nov 1, 2020, 6:00pm -- Nov 3, 2020, 8:59pm) Q1 (15 marks) time: 10 seconds Howard is a very famous detective in Macau. Every time there is a crime, Howard can quickly catch many suspects, tho the problem is ... really too many suspects ... So, you, as a junior detective, you need to help Howard to narrow down the number of suspects. Assume Howard caught suspects. for each suspect, Howard allows him/her to say only 1 sentence, either "i am innocent" or "a, b or c is guilty". In the second sentence, the suspect can finger out any amount of "guilty" people. Also, all the suspects are numbered from to . God knows that you have a tough job, it always send you a message in your dream to tell you the number of suspects that are telling truth to make your life easier. Hopefully, together will all these information, you can narrow down the suspect list. input: the first lines are the sentences from number suspect to number suspect respectively. The line is the number , it is the number of suspects are telling the truth. output: shows the list of possible suspects n 1 n n 1 n n+ 1 k ∈ {1, 2,⋯ , n− 1} file name: midterm_Q1.py input: midterm_Q1_input.txt output: midterm_Q1_output.txt sample input: 1 i am innocent 2 3 is guilty 3 i am innocent 4 1 or 2 is guilty 3 sample output: (if unable to narrow down, just print the empty list) [2] In [ ]: Q2 (25 marks) We store data in bits in our computers, data compression is a way to use fewer bits than the original representation, so the size of the data can be reduced. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression, for example, the file format ".zip'' is one of the example of lossless compression. Lossy compression reduces bits by removing unnecessary or less important information, for example,".jpg'' is one of the lossy compression for images. In this question, you will be focused in lossless compression. For example, say we have a mapping for encoding as follows: Using the mapping above, we can therefore encode a message, for example: Huffman code is a type of encoding method that is commonly used for lossless data compression in our computers. It has variable length of its codes, that is, the size of each code is not necessary the same. To construct Huffman code, we first need to count the occurrence or frequency of each character. Using the same example from above, we have BADHEADFEED⇒ 001000011111100000011101100100011 Then, we can simplify the table by removing those occurrence0 Now we construct a tree using the following rule: - All the characters are leaf nodes - Form a sub-tree by taking the least probable sub-trees One may illustrate the process by the following, The Huffman code for each character is determined by its leaf position of the tree. That is, counting from the root, left corresponds to and right corresponds to . So, is the right most leaf, and hence the corresponding code is . 0 1 A 111 Therefore, using the Huffman coding, we have So, compare to the fixed length code, Huffman code of uses only bits, which is bits less than the fixed length code. BADHEADFEED⇒ 000111011101011101001101001 BADHEADFEED 27 6 ABOUT THE PROGRAM BODY, JUST USE THE FOLLOWING, REMEMBER TO MODIFY THE FILENAMES YOURSELF: In [ ]: #execute line by line fout = open('output filename here', 'w') with open('filename here', 'r') as fp: for line in fp: exec(line) fout.close() Q2.(a) [6 marks] Write a function to return a dictionary that stores the characters and the corresponding occurrence given an input string, e.g. . NOTE: DO write a function to do so, as you will need to combine all the functions later. time: 3 seconds ToDict(some_input_string) {'A':5, 'B':10, ...} filename: midterm_Q2_a.py input file: midterm_Q2_a_input.py output file: midterm_Q2_a_output.py sample input: tx = 'This is the brand new world.' print(ToDict(tx), file = fout) sample output: {' ': 5, '.': 1, 'T': 1, 'd': 1, 'e': 3, 'h': 3, 'i': 2, 'l': 2, 'n': 1, 'o': 2, 'r': 1, 's': 2, 't': 1, 'w': 3} In [ ]: Q2.(b) [8 marks] Use the tree data structure, and the node data structure from lecture notes to stores the corresponding characters. Then, write a function that outputs a Huffman tree given an input dictionary. (here, we output the preorder of the tree) time: 3 seconds Tree() Node() HuffmanTree(some_input_dictionary) filename: midterm_Q2_b.py input file: midterm_Q2_b_input.py output file: midterm_Q2_b_output.py sample input: tx = 'BADHEADFEED' dic = ToDict(tx) tr = HuffmanTree(dic) print(tr.preorder(tr.root, []), file = fout) sample output: ['BFDEHA', 'BFD', 'BF', 'B', 'F', 'D', 'EHA', 'E', 'HA', 'H', 'A'] In [ ]: Q2.(c) [6 marks] Write a function in the tree data structure that returns a dictionary, which stores the characters and the corresponding Huffman code given an input tree. time: 3 seconds ToHuffman() Tree() filename: midterm_Q2_c.py input file: midterm_Q2_c_input.py output file: midterm_Q2_c_output.py sample input: tx = 'BADHEADFEED' dic = ToDict(tx) tr = HuffmanTree(dic) print(tr.ToHuffman(), file = fout) sample output: {'A': '111', 'B': '000', 'D': '01', 'E': '10', 'F': '001', 'H': '110'} In [ ]: Q2.(d) [2 marks] Write a Huffman encoding function to return a binary sequence that corresponds to your input, e.g. . time: 10 seconds encHuffman(some_input_string, Huffman_dict) 11001... filename: midterm_Q2_d.py input file: midterm_Q2_d_input.py output file: midterm_Q2_d_output.py sample input: tx = 'BADHEADFEED' dic = ToDict(tx) tr = HuffmanTree(dic) Huff_dic = tr.ToHuffman() print(encHuffman('BADHEADFEED', Huff_dic), file = fout) sample output: 000111011101011101001101001 In [ ]: Q2.(e) [3 marks] Write a function to decode a binary sequence with input a binary sequence and a Huffman dictionary. time: 20 seconds decHuffman(some_binary_seq, Huffman_dict) filename: midterm_Q2_e.py input file: midterm_Q2_e_input.py output file: midterm_Q2_e_output.py sample input: tx = 'BADHEADFEED' dic = ToDict(tx) tr = HuffmanTree(dic) Huff_dic = tr.ToHuffman() enc = encHuffman('BADHEADFEED', Huff_dic) print(decHuffman(enc, Huff_dic), file = fout) sample output: BADHEADFEED In [ ]: Q3 (20 marks) Q3.(a) [8 marks] Haloween party is just finished. You and your friends are still very excited and hope to have another party on Nov 2, 2020. Now, you come up with an idea -- social dance party, and you are the coordinator of this party. Assuming there are men and women. In this party, each man need to pair with a woman, and you know, each of you have partner preference in your mind. So, you, as a coordinator, you need to pair them so that the 'pair rate' is maximized. men is numbered from to , women is also numbered from to . For each man, it has preference of choosing women, similar for each woman. For example, if man and woman is paired, the pair rate equals if woman is in man 's preference list or vice versa, and otherwise. The overall pair rate is the sum of each pair's pair rate. n n n 1 n n 1 n b g 1 g b 0 filename: midterm_Q3_a.py input file: midterm_Q3_a_input.py output file: midterm_Q3_a_output.py sample input: (the first line is the number of men, it is also the number of women, the coming lines corresponds to the preference list of men from man to , the last lines corresponds to the women's preference list, if a line is , that means the corresponding man or woman has no preference.) 3 1 2 0 3 0 3 0 sample output: 2 time: 45 seconds n 1 n n 0 In [ ]: Q3.(b) [12 marks] Now, assuming each man and woman will set their preference ranking from to , if a man and a woman is paired, their score will the the multiplication of the rankings. It is obvious that, the sum of the pairing score need to be minimized, what is the minimized value? sample input: (the first line is the number of men, it is also the number of women, the coming lines corresponds to the priority preference rating of men from man to , the last lines corresponds to the women's priority preference) 3 1 2 3 3 1 2 1 3 2 2 1 3 1 2 3 1 2 3 sample output (man 1 - woman 3, man 2 - woman 2, man 3 - woman 1): 8 time: 60 seconds 1 n b g n 1 n n In [ ]: Q4 (50 marks) We have a set of movies, with the properties like type(s), director(s), actor(s), etc. We also have a set of users and the users rating for some movies. Now, can you use graph to recommend me a movie that I might like? Q4.(a) [2 marks] Which is the most voted movie? Please show the first most voted movies. filename: midterm_Q4_a.py input file: midterm_Q4_a_input.py output file: midterm_Q4_a_output.py sample input (the first 10 most voted movies): 10 ti 60 d n In [ ]: Q4.(b) [2 marks] Which movie has the highest rate? Please show the first highest rate movies. In fact, we can use the average ratings of the movie as the score but using this won't be fair enough since a movie with 4.3 average rating and only 3 votes cannot be considered better than the movie with 4.1 as as average rating but 300 votes. So, we use a more fair rating scheme for each movie: where is the number of votes for the movie; is the number of votes that 90% of the movies have; is the average rating of the movie; And is the mean ratings across the entire ratings filename: midterm_Q4_b.py input file: midterm_Q4_b_input.py output file: midterm_Q4_b_output.py sample input (the first 10 highest rate movies): 10 ti 60 d n WR (weighted rating) = R+ C v v+m m v+m v m R C In [ ]: Q4.(c) [4 marks] For those who give a movie score greater than or equal to , they are the potential fans of the movie. So, given a movie name, can you show the what other movies that these potential fans are also potential fans? List the first 10 of them. filename: midterm_Q4_c.py input file: midterm_Q4_c_input.py output file: midterm_Q4_c_output.py sample input (the movie name): Independence Day (a.k.a. ID4) (1996) time: 120 seconds 4 In [ ]: Q4.(d) [4 marks] Based on (c), you build a naive recommender system. Suppose a reviewer give a bunch of scores to different movies, for those movies he/she gives at least , you should be able to find the others who also give these movies at least . Based on these reviewers, calculate the scores for the movies that they have rated, and sort them by scores. List the first 10 of them. filename: midterm_Q4_d.py input file: midterm_Q4_d_input.py output file: midterm_Q4_d_output.py sample input (userId): 12345 time: 120 seconds 4 4 In [ ]: Q4.(e) [8 marks] Improving your recommender system. A person rates 5 to a movie, if another person also rates 5 to the same movie, it might means that these two people share a common interest. Now, we define a 'distance' metric between a person and a movie, mathematically, where the distance is integer value. More specifically, If a person rates 1.5 to a movie, then the distance between them is 8. Now, based on this distance metric, we can calculate the average distance between two people among the movies they both rated. Therefore, given the average distance between person A and person B, and the number of common movies that they both rated. You may recommend person A the movies that person B have rated with at least rate 4. So, if person A is fixed, you will be able to find all possible people and sorted by the number of movies rated in common with descendant order, and distance with ascendant order. List the first 20 of them. filename: midterm_Q4_e.py input file: midterm_Q4_e_input.py output file: midterm_Q4_e_output.py sample input (userId): 12345 time: 180 seconds distance = 11 − 2 × rate, In [ ]: Q4.(f) [8 marks] Improving your recommender system. Following (d), you will be able to identify the movies that you haven't watched from their rated movie list (probably those you haven't rate). Now, we define a new metric call 'distance per person of a movie' by You see that the smaller value of distance per person of a movie meaning that the more people having watched that movie, which means such movies should recommend to you. List the first 20 movies. filename: midterm_Q4_f.py input file: midterm_Q4_f_input.py output file: midterm_Q4_f_output.py sample input (userId): 12345 time: 240 seconds distance per person of a movie = , average distance from you to those people number of people having rated this movie In [ ]: Q4.(g) [10 marks, essay + code] Based on (e), one should be able to calculate the minimum distance from a person thru another person (rated at least a movie in common) to a movie that haven't rated, simply calculate the average distance of different movies and sort them in ascendant order for the recommender system, and list the first 20 movies. essay filename: midterm_Q4_g.pdf Compare the difference between (f) and (g), and analyze which method is better, in terms of? filename: midterm_Q4_g.py input file: midterm_Q4_g_input.py output file: midterm_Q4_g_output.py sample input (userId): 12345 time: 240 seconds In [ ]: Q4.(h) [12 marks, OPEN QUESTION, essay + code] In fact, we can also rate a person's flavor using the movie types. Think of a metric that can merge the movie types, ratings, and similarity. Then, write a recommender system and list the first 20 movies. essay filename: midterm_Q4_h.pdf You should provide the reason of such metric, and show some stats that supports your thought. filename: midterm_Q4_h.py input file: midterm_Q4_h_input.py output file: midterm_Q4_h_output.py sample input (userId): 12345 time: 300 seconds In [ ]: BONUS print the following: BBAZ16602 is fun! B i f | B s u | A n . Z 1 6 6 0 2 output file: midterm_bonus_output.txt In [ ]:
欢迎咨询51作业君