Question 1: KWIC
This question is worth 100% of the assignment mark.
A KWIC or Key Word In Context is a type of index for a text document that lists the indexed words along with their location and the context they are found in; that is, the text on either side of each word. This will be called the context string. We will use the multimap data structure from the previous assignment to produce the index (keys) and the locations and context strings (values).
The goal is to build a complete UNIX-style command line application that can be run from a terminal. It will generate the index from a text file, using options that are given on the command line when the application is run. It can optionally read another text file containing a stop list: a list of "stop words" that will not be placed in the index. The main program can also accept values for the maximum index size (maximum number of keys in the multimap), the maximum frequency of an index entry, and the maximum entry length (the purpose of these two values are described below).
Organization
The code will be organized into several source (.c) and header (.h) files for separate compilation. You will be given a main program that processes command-line arguments when the program is run, and passes the information to your function to produce the index. There will also be another .c file that contains a main program that you will write to execute tests on the relevant portions of the solution.
Overall you must write at least the following source and header files:
• a source and a header file for the multimap;
• a source and a header file for the KWIC processing code;
• a source file containing the given main program (a4_main.c);
• a source file that runs the Assignment 3 tests on the multimap;
• a source file that runs new Assignment 4 tests on the multimap;
• a source file that runs tests on the functions for your KWIC; and
• a source and header file that is used for the code that all the test programs have in common. Write additional source and/or header files as required.
The multimap
The Multimap we wrote in the previous assignment is missing one critical feature that will be needed to produce the index. We have to be able to iterate over all of the keys in our multimap; that is, the words in the index. Add the following functions to the Multimap interface:
// Using these functions together will allow you iterate over
the keys in the
// multimap, in the order that they are stored (that is,
sorted).
// They can be used as follows:
// char key[MAX_KEY_LENGTH];
// if (mm_get_first_key(mm, key, MAX_KEY_LENGTH) > 0) {
// // // // }
do{
// process the key
} while (mm_get_next_key(mm, key, MAX_KEY_LENGTH) > 0);
// Consider what happens if mm_remove_key() is called (possibly
more than
// once) as part of processing a key.
// Copy the first key string in the multimap into the array char
*key.
// It will copy at most length-1 chars, always followed by \0.
// Returns -1 on error, 0 if there were no more keys, or 1 on
success.
// If 0 or -1 are returned, then the contents of key are unchanged.
int mm_get_first_key(Multimap *mm, char *key, int length);
// Same as above, except it copies the next key following a call to either
// mm_get_first_key or mm_get_next_key.
int mm_get_next_key(Multimap *mm, char *key, int length);
These functions should be completely implemented, with contracts and thorough testing.
The KWIC
The KWIC will be indexed by paragraphs, where a paragraph is defined as text on one or more lines from the text file, separated by one or more blank lines. Paragraphs will be identified by a paragraph number, counting from 1 for the first paragraph. A word in a paragraph is defined as any sequence of one or more word characters, separated by one or more non-word characters. A word character is alphabetic or numeric (man isalnum) or a dash - or an apostrophe '.
For example, in the following text:
It wasn't the "best", of times,
'twas the... worst-of-times.
the words are: it, wasn't, the, best, of, times, 'twas, and worst-of-times and it is a single paragraph since there are no blank lines in between the lines.
This is the KWIC building algorithm:
1 Read an entire paragraph from the input text file at a time into a string, even if it spans multiple lines in the text file.
1 For each word in the paragraph:
Check if the word is in the stop list. If it is, do not process it further, and go on to the next word.
Otherwise, subtract the length of the word from the maximum entry length and divide it by two. This is the amount of characters allowed on either side of the word for context. Build a context string by fitting as many words as possible, from before the word, and after the word, with a single space between them, without exceeding the amount characters allowed on each side. See the examples below.
Using the word as the key, and the paragraph number and the string you built as the value, add it to the index multimap.
Post-process the multimap:
Count the total number of values for all keys, and for each key, if its frequency
(number of values / total number of values) is greater than the maximum
frequency, remove it from the multimap. (Another step may be added here) No more steps.
Print the multimap, neatly, as a KWIC index, using the example below as a model. This should be done using your mm_get_first_key and mm_get_next_key functions, not by modifying mm_print (which is meant for debugging and not neat output).
For example, given the text above, and using the default parameters for creating the KWIC, the index may look something like this:
WORD PARA CONTEXT
==== ==== =======
'twas 1 best of times 'twas the
best 1 It wasn't the best of times 'twas
It 1 It wasn't the best of
of 1 It wasn't the best of times 'twas the
the 1 It wasn't the best of times the 1 of times 'twas the worst-of- times
times 1 the best of times 'twas the wasn't 1 It wasn't the best of worst-of-times 1 'twas the worst-of-times
Your output should have equivalent lines, but the contents of each line may not be identical! How your multimap copes with upper and lower case, and how you count the letters/words in the context, may affect your output. For example: you may try to balance words on either side of the context rather than characters, or you may wish to use more of the context if the word appears early in the paragraph. You do need to use the majority of the entry_length for the context, and don't exceed it. Organize your output into neat columns using (f)printf, and label them.
Testing and contracts
As mentioned above, you should test your multimap. All the tests from A3 should be executed as one main testing program. These tests will not be re-evaluated in this assignment, only the ability to run them.
Write a second separate testing program that will focus on the two new functions that were added to the multimap interface (mm_get_first_key and mm_get_next_key). It should cover all of the same types of cases (typical, edge, invalid, etc.) as the original tests; however, to make it easier for the marker to identify the new tests, avoid unnecessarily repeating test cases.
Write a third separate testing program that will test the KWIC processing code. In order to make it possible to test this code, you will need to divide the code into functions and expose the testable functions through a header file. Here are some examples of the functions you could write and test:
• finding and/or splitting words
• determining how many words before/after will fit into the context string
• building the context string (this may be part of the previous function)
• removing words with frequency greater than some value
And others as appropriate. Functions that read from or write to files or stdout are not considered testable.
Place the test harness in a header and source file. That is, there should be one header file that is included by all the tests, and a source file that contains the code that is used by all tests, such as the verify functions.
All functions that are testable, and publicly exposed in an interface, must have appropriate contracts. This includes the new multimap functions and the KWIC processing functions. Your contracts for the rest of the multimap will not be re-evaluated.
Notes:
• You will need to change the constants from A3's Multimap as follows. These will be defined in the multimap header file.
#define MAX_KEY_LENGTH 31
#define MAX_VALUE_LENGTH 81
•
◦ ◦ ◦ •
• •
•
In addition to the constants above, and the defaults given in a4_main.c, you can assume the following maximums:
The maximum paragraph length is 20000 characters.
The maximum length of a stop list is 1000 words.
There are at most 100 entries (occurrences) for each word in the index.
The stop list file contains one word per line. Treat each line as a word; there is no error checking.
You may use your multimap from A3, or the posted model solution. You can use the corresponding set of tests.
You should only minimally modify the given a4_main.c file, to include your header file and call a function to start your program. This function must be defined in a
different .c file.
Ensure the results appear in the correct order, particularly when the same word appears multiple times in the same paragraph. This may be done by changing your multimap (which will affect its tests) or by carefully processing the text when you are building the KWIC.
• Write a Makefile that compiles your source code, including the main program that builds the KWIC and the programs that run your tests. Give them all meaningful names. A sample makefile is available for you to use and modify for your own submission.
• Submit a README file that describes each of the programs produced by your Makefile, and what they do.