FIT3155 S1/2021: Assignment 3 (Due midnight 11:59pm on Friday 28 May 2021) [Weight: 10 = 3.5 + 2.5 + 4 marks.] Your assignment will be marked on the performance/efficiency of your program. You must write all the code yourself, and should not use any external library routines, except those that are permitted as stated within the tasks. Follow these procedures while submitting this assignment: The assignment should be submitted online via moodle strictly as follows: • All your scripts MUST contain your name and student ID. • Use gzip or Winzip to bundle your work into an archive which uses your student ID as the file name. (STRICTLY AVOID UPLOADING .rar ARCHIVES!) – Your archive should extract to a directory which is your student ID. – This directory should contain a subdirectory for each of the three questions, named as q1/, q2/ and q3/. – Your corresponding scripts and work should be tucked within those subdirectories. • Submit your zipped file electronically via Moodle. Academic integrity, plagiarism and collusion Monash University is committed to upholding high standards of honesty and academic integrity. As a Monash student your responsibilities include developing the knowl- edge and skills to avoid plagiarism and collusion. Read carefully the material available at https://www.monash.edu/students/academic/policies/academic-integrity to understand your responsibilities. As per FIT policy, all submissions will be scanned via MOSS. 1 Assignment Questions 1. (3.5 marks) We say that a number p1 is a twin prime if p1 is prime and there exists another prime number p2, such that |p1 − p2| = 2. That is, a twin prime is a prime number that is either 2 less or 2 more than another prime number. For instance, the first 3 twin prime pairs are (3, 5), (5, 7), and (11, 13). Note that (2, 3) is the only pair of primes with a prime gap of 1 but is not considered a twin prime pair, while 5 is the only prime that belongs to two twin prime pairs. Since every prime number greater than 3 is of the form 6x ± 1 where x is a natural number, all twin primes - excluding (3, 5) - are of the form (6x − 1, 6x + 1). In this task you must write a program to generate a random pairs of twin prime numbers, such that each prime in the pair is m-bits long, where m ≥ 3 is the argument to your program. Your program should have the following high level structure: 1 # Pick a random number with m bits 2 Uniformly pick a random number n in [2^{m-1},2^m-1] 3 Test primality of n using Miller -Rabins randomised algorithm. 4 If n is prime: 5 If n is a twin prime: 6 output the decimal value of n and its twin and terminate 7 else: 8 repeat from line 2 9 else: 10 repeat from line 2 To undertake this task, you will have to implement the Miller-Rabin algorithm introduced in Week 8 and to generate uniform random numbers, you are allowed to import python’s random module and make use of its functions. However, to obtain full marks you will need to implement your own repeated squaring function. If the interval [2m−1, 2m − 1] contains one, or more twin primes, your program should continue selecting numbers at random until it finds a twin prime pair. However, you may want to consider if it is possible to narrow the search space on line 2. Similarly, you should think carefully about how many witnesses to trial in each Miller-Rabin test in line 3. Before attempting this task you may want to read this (non-examinable) theoretical consideration: Slide 27 of your week 8 lecture mentions the prime number distribution function, pi(n), that gives the number of prime numbers ≤ n. Asymptotically we have, pi(n) ∼ n ln(n) , and using this approximation for pi(n), you can estimate the probability of picking a prime number n between [2m−1, 2m − 1]. Similarly, we can define pi2(n) to be the number of twin prime pairs such that each element in the pair ≤ n and it is conjectured that asymptotically, pi2(n) ∼ 2c2 n [ln(n)]2 , where c2 = 0.661618... is know as the twin prime constant. These results can be used to perform some ‘back of the envelope’ calculations to arrive at a good estimate of the 2 expected number of iterations (the amount of times you pick a random number on line 2) before finding a twin prime that terminates your program. Strictly follow the specification below to address this question: Program name: twin prime.py Argument to your program: m Command line usage of your script: twin prime.py Output file name: output twin prime.txt Output format: The output is a text file containing both elements of the twin prime pair in decimal. The smallest of the primes in the pair should be written to the first line of the file while the larger of the two should be written to the second line. Example: When, m = 5, the program will output either: 17 19 or, 29 31 Since the choices are random, the output is not deterministic. 3 NOTE: Questions 2 and 3 focus on the data compression algorithms we covered in week 9 and require you to work with bits. However, working with raw binary data in Python is rather tedious and so you may assume all binary data is represented as a string of ‘0’s and ‘1’s, i.e. we will represent 0 and 1 using the characters with ASCII codes 48 and 49 respectively. Due to this the compressed data in these questions will actually require more bits to represent than the uncompressed data, and in practice you should never do this. However, the implementation of the algorithms themselves is independent of our representation and so we will ignore this issue here. 2. (2.5 marks) It is common practice to compress data before sending it across a communi- cations channel. To ensure that the messages can be encoded and decoded correctly, the sender and receiver simply need to agree upon a language, or a protocol, that dictates the compression method. However, even if the receiver knows how the data was com- pressed, they may need more than just the compressed data itself in order to retrieve the original message. For instance, if a stream of characters was encoded using a Huffman code, the recipient of the stream would need knowledge of the original alphabet, as well as the Huffman codes the sender assigned to each character. Without this information they would only be able to guess at the original message! To enable the use of various encoding schemes it is common to first send some supplemen- tary data - often referred to as the header - that contains all the information required to decode the following data stream. The header itself has a standardised format (dictated by the transmission protocol) that allows all parties to interpret it without any additional information. Your task is to write a program to construct a header that contains all the information necessary to decode a string whose characters were encoded using Huffman codes. To construct this header, you will need to implement algorithms to construct Huffman and Elias ω codes for character and integer streams respectively. The structure of the header is detailed in the specification below. Strictly follow the specification below to address this question: Program name: header.py Argument to your program: An input text file containing a string str[0 . . . n− 1]. It is safe to assume that: • str consists of ASCII characters characters in the range [32, 127] and potentially newline characters (ASCII code 10). Command line usage of your script: header.py Output file name: output header.txt Output format: The output is a text file containing a header for the input string str. The header is a bitstring (see the note above which details how to represent the bits) made up of the following information: • The number of unique ASCII characters in str encoded using the corresponding Elias ω integer code. • For each unique character in the text: 4 – Encode the unique character using the fixed-length 7-bit ASCII code. (All input characters will have ASCII values < 128). – Then encode the length of the Huffman code assigned to that unique character using an Elias ω code. – To the above, append the variable-length Huffman codeword assigned to that unique character. Example: Let the input string be, aacaacabcaba In this case we have 3 unique characters: a, which occurs 7 times, b, which occurs twice, and c, which occurs 3 times. A feasible set of Huffman codewords (any valid set of Huffman codes will be considered correct) for {a, b, c} are {1, 00, 01} respectively. The header will contain: • the number of unique characters, 3 in this example, encoded using Elias ω code as 011 • ASCII code of each unique character followed by the Elias ω code of the length of its assigned Huffman codeword, followed by the statement of the Huffman codeword: – Statement of a with Huffman codeword ‘1’ of length 1: 1100001, followed by 1, followed by 1 – Statement of b with Huffman codeword ‘00’ of length 2: 1100010, followed by 010, followed by 00 – Statement of c with Huffman codeword of ‘01’ of length 2 : 1100011, followed by 010, followed by 01 Thus, concatenating all of the above codes, the header part is encoded as: 011110000111110001001000110001101001 5 3. (4 marks) Inspired by question 2, imagine that you are the recipient of a stream of characters that has been compressed/encoded using the Lempel-Ziv-Storer-Szymanski (LZSS) variation of the LZ77 algorithm. Your task is to write a LZSS decoder - no encoding is required - to recover the original string. For the purposes of this question you may assume that your decoder is given the output of an LZSS encoder that has been used to encode a string of ASCII characters. • The output of the encoder is a stream of bits (represented as ‘0’ and ‘1’) that loss- lessly encodes the input text file over two parts: (i) the header part, and (ii) the data part. The information encoded in each of these two parts is given below: The information encoded in the header part is identical to the header in question 2. It contains: – The number of unique ASCII characters in the input text encoded using an Elias ω integer code. – For each unique character in the text the header contains: ∗ The unique character using the fixed-length 7-bit ASCII code. (All input characters will have ASCII values < 128). ∗ Then the length of the Huffman code assigned to that unique character, encoded using an Elias ω code. ∗ Finally, append the variable-length Huffman codeword assigned to that unique character. Information encoded in the data part: – The total number of Format-0/1 fields - encoded using Elias ω encoding - required to encode the input text. – This is followed by the corresponding sequence of Format-0/1 fields. Each tuple in this sequence is encoded as follows: For Format-0: 〈0-bit, offset, length〉, where offset and length are each en- coded using the variable-length Elias ω code. For Format-1: 〈1-bit, character〉, where character is encoded using its as- signed variable-length Huffman code defined in the header. Since it would be difficult to develop test cases for this question without writing an encoder you will be provided with a set of test cases for your decoder on Moodle. A subset of these test cases will be used to verify your solution at marking time. Strictly follow the specification below to address this question: Program name: decoder lzss.py Arguments to your program: An input text file containing a string of ‘0’s and ‘1’s (without any line breaks) that represents the output of the encoder described above. Command line usage of your script: decoder lzss.py Output file name: output decoder lzss.txt 6 • Output format: The output is the decoded ASCII text. Example: Assume that we have a dictionary of length W = 6, a buffer of length L = 4, and that the string input to the encoder (you don’t have to write any code for the encoding, it is given as input to your decoder) was: aacaacabcaba From question 2 we know that a feasible set of Huffman codewords for {a, b, c} are {1, 00, 01} respectively. Applying the LZSS approach would yield the following Format-0/1 fields: 〈1, a〉, 〈1, a〉, 〈1, c〉, 〈0, 3, 4〉, 〈1, b〉, 〈0, 3, 3〉, and 〈1, a〉. Recall from question 2 that the header part will contain: • the number of unique characters, 3 in this example, encoded using Elias ω code as 011 • ASCII code of each unique character followed by the Elias ω code of the length of its assigned Huffman codeword, followed by the statement of the Huffman codeword: – Statement of a with Huffman codeword ‘1’ of length 1: 1100001, followed by 1, followed by 1 – Statement of b with Huffman codeword ‘00’ of length 2: 1100010, followed by 010, followed by 00 – Statement of c with Huffman codeword of ‘01’ of length 2 : 1100011, followed by 010, followed by 01 Thus, concatenating all of the above codes, the header part is encoded as: 011110000111110001001000110001101001 The data part will contain: • The encoding of the total number of Format-0/1 fields. In this example, it is 7, encoded using Elias ω code as 000111. • The encoded information of successive Format-0/1 fields: – 〈1, a〉 encoded as 11, – 〈1, a〉 encoded as 11, – 〈1, c〉 encoded as 101, – 〈0, 3, 4〉 encoded as 0011000100, – 〈1, b〉 encoded as 100, – 〈0, 3, 3〉 encoded as 0011011, and finally – 〈1, a〉 encoded as 11. Thus concatenating all codes in the data part, we get the encoding: 00011111111010011000100100001101111 Finally, concatenating the header and data parts gives the lossless encoding of the input string: 01111000011111000100100011000110100100011111111010011000100100001101111 7 If a file containing: 01111000011111000100100011000110100100011111111010011000100100001101111 was given as input to your decoder the output file should contain: aacaacabcaba -----------=o0o=----------- THE END & BEST OF LUCK FOR YOUR EXAM -----------=o0o=----------- 8 欢迎咨询51作业君