辅导案例-PA 8

欢迎使用51辅导，51作业君孵化低价透明的学长辅导平台，服务保持优质，平均费用压低50%以上！ 51fudao.top

PA 8: Huffman Encoding and Compression
Due Date: November 26th (Tuesday, 11:59 PM)
Checkpoint Submission: November 24th (Sunday, 11:59 PM)
PA Quiz: November 23rd (Saturday, 11:59 PM)
100 points
Overview
In this assignment you will implement a classic lossless compression algorithm Huffman
Encoding. It’s an optimal lossless compression strategy discovered by David A.
Huffman and his students. After finishing this assignment, you can compress and
decompress any file using the algorithm that you implemented.

This assignment is an individual assignment. You may ask Professors/TAs/Tutors
for some guidance and help, but you can’t copy code. You may discuss the assignment
conceptually with your classmates, including bugs that you ran into and how you fixed
them. However, do not look at or copy code. This constitutes an Academic Integrity
Violation.

START EARLY!

Style Requirements
You will be graded on your code’s style. Style Guide

IntelliJ has a great plugin to check your style automatically. Use it!
There are style considerations beyond indentation and such.
● Choose names for test methods and for variables that are descriptive and useful
when you see them in test output.
● Consider writing helper methods if you repeatedly use the same set of
statements.
● Use Javadoc comments for descriptions of your methods and classes and inline
comments if the logic of your program is not captured by your variable names
and the natural flow of code.

Part 0 - Get Starter Code
Our starter code will be distributed from
https://github.com/ucsd-ets/dsc30fa19-startercode.

Please follow the same procedures as before to get your starter code and set up intelliJ.
Part 1 - PA Quiz
Fill out this quiz about the PA to make sure you have understood the assignment
thoroughly. You have 5 attempts to complete it.

PA Quiz on Canvas

Part 2 - Huffman Coding Tree

In this part, you will need to implement methods in HCTree.java and the compareTo
method in the inner class HCNode to help you build and use a Huffman Coding Tree.

We have provided most of the inner class HCNode for you. Please read through the
Javadoc comments in starter code to understand the purpose of each instance variable
in HCNode. After that implement the following methods:

public void buildTree(int[] freq)

This method takes in an int array of size 256, which is the size of extended ASCII.
(Since each byte is 8 bits, then the number of all the possible bytes is simply 2^8 =
256). The given array was used to keep track of the frequency of each byte from
the input file. For example, if ‘a’ appears in the input file 7 times, then the element at
index 97 (which is the ASCII value of ‘a’) of freq would be 7. Note that to get the
symbol ‘a’ of type byte back from its ASCII value 97, you can simply cast the int to
byte by doing : byte a = (byte)97

1. In this method, you first need to create all leaf nodes of your HCTree with
frequency values corresponding to each symbol in the freq array. Note that
HCTree contains an instance variable called leaves, which is an array to
reference each of the leaf nodes in your HCTree (It will be useful later in your
encode method). You need to think about where to store each leaf node in the
leaf array so that when given a symbol, you can immediately find the reference
to the leaf node containing that particular symbol.

2. After creating these HCNodes, then you need to add them to a Java built-in
priority queue to help you build the tree later. (The documentation of priority
queue can be found here). Remember that this priority queue should give
higher priority to the HCNode with less frequency, and if the frequency are
the same, the HCNode with symbol of smaller ASCII value should have
higher priority. Now you need to go back to the inner class HCNode and
implement the compareTo method to satisfy the requirement above, as Java’s
built-in priority queue will use your compareTo method defined in HCNode to
determine the priority of different node.

Hint:

First, recall that the default behavior of compareTo method is: return positive
number if “this” is considered “larger” than the given object.

Then read the document about priority queue and find the information about
the default behavior of priority queue: is it a min heap or max heap by default?
Then think about the following questions:

a. What instance variables define the priority of HCNode?
b. Do we want a min heap or max heap behavior from priority queue?
c. If you want the default behavior of priority queue, then should you keep
compareTo method to have its default behavior?
d. If you want the opposite of the default behavior of priority queue, then
should you keep compareTo method to have its default behavior?

3. Your next step is to build the HCTree using HCNode currently stored in the
priority queue. The algorithm has been discussed in class:

Iteratively, remove two nodes from the priority queue. Then use the first node to
be the ‘0’ child and the second node to be the ‘1’ child of a newly created
parent node. The frequency of this parent node is the sum of the frequencies of
its two children nodes. Make sure that the parent node takes a symbol from
one of its children (choose any one you want, but be consistent). Then, add
this parent node back to the priority queue, so that it later becomes a “branch”
of your final tree.

You will need to think about when the HCTree is completely built: we are
essentially building the tree from the bottom up (from leaves to branches and
finally to the root).

The main reason why Huffman coding can compress data comes from this simple
method. From a high level perspective, since we are using a priority queue here to
help us build the tree bottom up, the HCNode with the most frequent symbol will hold
the shortest path from the root, while the HCNode with the least frequent symbol will
hold the longest path.

As a result, the most frequent symbol will correspond to the shortest sequence of
encoded bits. This is where the compression actually happens: We used much less
encoding bits to represent the most frequent symbol. Although the least frequent
symbol may have the longest number of encoded bits, but because this symbol
doesn’t appear so frequently, the overall encoding will be shorter in the end.
public void encode(byte symbol, BitOutputStream out) throws
IOException

For a given symbol, use the HCTree built before to find its encoding bits and write
those bits to the given BitOutputStream.

In this method, to avoid inefficiently searching the whole tree to find the encoding bits
of the given symbol, you must take advantage of the leaf array that we created before:
it's much faster to start from the leaf node containing the given symbol, and then
traverse back to the root. You will then simply collect all the bits on that path and
reverse those bits at the end to get the encoding.

IMPORTANT: The following code will convert a byte c to its corresponding ascii value:
int ascii = c & 0xff;

Once you get the encoding bits, you should use out.writeBit(int i) that takes
in either 0 or 1 as a bit to write the encoding bits one by one to BitOutputStream.

public byte decode(BitInputStream in) throws IOException

Decodes the bits from BitInputStream and returns a byte that represents the
symbol that is encoded by a sequence of bits from BitInputStream.

Testing HCTree:

Huffman Coding Tree can be hard to debug since most of the I/O is done by using bits,
but bits are hard to be visualized directly. To save you time from debugging later, we
enforce you to test the buildTree() to make sure the structure of your HCTree is
correct before you move on to the next part.

Now create a JUnit test file HCTreeTester.java to verify the correctness of your
buildTree() method. (You are not required to test methods other than
buildTree())

First, you should write an in-order traversal method to recursively traverse your HCTree,
and in-orderly print all the nodes in your HCTree out. (Note that toString() method for
HCNode was defined for you, so you can directly use it to print the node out)

public void inorder(HCNode root)

Then in the JUnit test method, create a freq array of size 256, and modify the array so
that it contains the correct information about the provided check1.txt file:

a: 17 b: 8 c: 7 d: 14 e: 9 f: 1 \n (newline char): 1

(You might refer to the ASCII table http://www.asciitable.com/ for the ascii value of each
symbol above.)

Now from your own knowledge of Huffman coding and the description from the writeup
above, draw the resulting Huffman Coding Tree in a file HCT.pdf. Make sure you
clearly indicate the symbol and freq of each HCNode, as well as what each edge
stands for in your HCTree drawing. You will need to submit this as part of the
assignment.

Then you should pass this array to buildTree() and use your inorder method to print
out your HCTree and see if it matches your drawing. (You don’t need to use
assertEqual() here, and you may just print the HCTree out)

Part 3 - MyCompressor

Once you are done with building the HCTree, now we implement our compressor. For
this basic implementation, we will build the HCTree based on the character counts of a
large corpus. Use the corpus file, corpus.txt, to build your HCTree.

public HCTree buildHuffmanTree(String corpusLocation) throws
IOException

This method will create the HCTree using the frequencies of the characters in the
corpus. In this method, you will populate the frequency array freq by counting the
occurrence of each character in the file corpusLocation. The size of the array
should be 256, which is the size of extended ASCII. (Since each byte is 8 bits, then
the number of all the possible bytes is simply 2^8 = 256). While creating the frequency
make sure to keep track of the indices of each character. For example, the element
at index 97 (which is the ASCII value of ‘a’) should hold the frequency of ‘a’ in the file.
Note: See provided tools below.

In short, his method takes in the corpus file (corpusLocation) and builds the
HCTree and stores it in the instance variable huffmanTree.
public void compress(String inputFile, String compressedFile)
throws IOException

This method takes in two file names. inputFile is the name of the file that you want
to compress and a compressedFile, the name of the new compressed file that your
compression algorithm should generate.
public void decompress(String compressedFile, String
outputFile) throws IOException

Similarly, this method takes in two file names. compressedFile is the name of the
compressed file that you want to decompress and outputFile, the new
decompressed file that your algorithm should generate after decompressing.

Part 4 - Compress and Decompress

Overview

Now, we build our compression program.

The main method of MyCompressor.java takes in args in the following format:
First argument: corpus file
Second argument: signal command. Either “compress” or “decompress”.
Remaining arguments are the names of the file to compress or decompress.

Case “compress”: Compressed file should be names compressed_[original].txt

Case “decompress”: Decompressed file should be named decompressed_[original].txt

Eg.
1. src/corpus.txt compress src/f1.txt src/f2.txt
2. src/corpus.txt decompress src/compressed_f1.txt
src/compressed_f2.txt

In the first case the output should be two files: compressed_f1.txt
compressed_f2.txt
In the second case, the output should be two files:
decompressed_compressed_f1.txt decompressed_compressed_f2.txt

Provided Tools

Compression:

First you will need to scan through the corpus file to compute the frequencies of all 256
bytes. The following line of code will read through the given text file byte by byte, and
put all those bytes into a byte array.

byte[] input = Files.readAllBytes( Paths.get(/*file path*/) );

IMPORTANT: The following code will convert a byte c to its corresponding ascii value:
int ascii = c & 0xff;

Notice that these output stream initializations are given to you:

FileOutputStream file = new FileOutputStream(/*file path*/);
DataOutputStream out = new DataOutputStream(file);
BitOutputStream bitOut = new BitOutputStream(out);

We won’t go into details about how these output streams work, but you should know
how to write to the given file using these output stream.

For example, let’s say your third command line argument is a text file named
“compressed.txt”. Then to write an int 25 to “compressed.txt”, you will simply use
DataOutputStream by calling out.writeInt(25). (You should read the document
DataOutputStream for more information). And similarly, to write a single bit 0 to
“compressed.txt”, you will simply use BitOutputStream by calling
bitOut.writeBit(0);

Decompression:

Notice that these input stream initializations are given to you:

FileInputStream inFile = new FileInputStream(/*file path*/);
DataInputStream in = new DataInputStream(inFile);
BitInputStream bitIn = new BitInputStream(in);

You should also know how to read from the given file using these input stream. For
example, let’s say your third command line argument is a text file named
“compressed.txt”. Then to read an int from “compressed.txt”, you could simply use
DataInputStream by calling int byteCount = in.readInt() (You should read
the document DataInputStream for more information). And similarly, to read a single bit
from “compressed.txt”, you need to use BitInputStream by calling int bit =
bitOut.readBit()

FileOutputStream outFile = new FileOutputStream(/*file path*/);
DataOutputStream out = new DataOutputStream(outFile);

Also, to write to the output file byte by byte, you can simply call out.writeByte(byte
b).

Part 5 - Results
Overview

You are given 10 files (f1.txt - f10.txt), compress each file using your compression
algorithm and report the result in the form of the following table in a file
‘CompressionReport.txt’ or ‘CompressionReport.pdf’

File Original Size [KB] (OS) Compressed Size [KB] (CS) CS/OS
File1
...

Also, explain in a few lines the variation in the CS/OS values.

Critical Thinking: Answer the following questions in 2-3 lines
Now, this method of compression using the Huffman Coding Tree might not be optimal.
1. What would happen if you used the file that you want to compress as the corpus?
Would you get better results? Why?
2. Assuming you only have a compressed version, can you directly decompress the file
that was compressed using HCTree built using itself as the corpus? [Yes/No]
3. If yes, explain how? If no, explain how would you modify your code to enable the
decompression?

Part 6 - Test Compress and Decompress

Once you have done implementing Compress and Decompress, you need to perform
testing on your program.

We have provided 3 basic check files for you to test your Huffman Coding (check1.txt
and so on). You should use command line arguments to compress a check file to a
compressed file, and decompress that compressed file back to see if it matches the
original file.

Next, you should test your program on the larger 10 input files (f1.txt - f10.txt). You
should use command line arguments to compress a check file to a compressed file, and
decompress that compressed file back to see if it matches the original file.

Some Edge cases to fix:

1. Empty file

Compressing an empty file should generate an empty file. After decompressing, your
program should generate an empty file as well.

2. File with only single kind of char

We’ve provided singleChar.txt to test this case (A file with only newline char at each
line).

Submission

Due - November 26th (Tuesday, 11:59 PM)
Files to submit:

1. HCTree.java
2. HCTreeTester.java
3. MyCompressor.java
4. HCT.pdf
5. CompressionReport.txt

Checkpoint - November 24th (Sunday, 11:59 PM)
1. HCTree.java
2. HCTreeTester.java
3. HCT.pdf

After you submit your files to GradeScope, wait for the output of submission
script. Make sure you see this after your submission:

THIS IS A SUCCESSFUL SUBMISSION. YOUR SCORE WILL BE UPDATED ONCE
GRADED.

Otherwise, please fix the error and resubmit your files.

New to PA 8:

For extra assurance on the code that you submit, we’ve included a basic script for you
to test the input files that came with your starter code. Under the output of a successful
submission, you should see the test output of input1.txt, input2.txt, and input3.txt.
Passing these submission tests is not a requirement for a successful submission; these
tests are there to help you ensure that you get identical files from compression followed
by uncompression.

The above is an example of a successful submission that passes the basic script.

The above is an example of a successful submission that does not pass the basic script.