CS1210 Computer Science I: Fundamentals
Homework 2: Handling Text
Due Friday, October 25 at 11:59PM
Introduction
In this assignment, we’re going to develop some code that relies greatly on the string datatype as well as
all sorts of iteration. First, a few general points.
(1) This is a challenging project, and you have been given two weeks to work on it. If you wait to
begin, you will almost surely fail to complete it. The best strategy for success is to work on the
project a little bit every day.
(2) The work you hand in should be only your own; you are not to work with or discuss your
work with any other student. Sharing your code or referring to code produced by others is a
violation of the student honor code and will be dealt with accordingly.
(3) Help is always available from the TAs or the instructor during their posted office hours. You
may also post general questions on the Piazza discussion board (although you should never post
your Python code).
Background
In this assignment we will be processing text. With this handout, you will find a file containing the short
story entitled The Cat by Mary E. Wilkins Freeman, written in 1901, to test your code. At some point
during the course of this assignment, I will provide you additional texts for you to test your code on;
updated versions of this handout may also be distributed as needed. You should think of this project as
building tools to read in, manipulate, and analyze these texts.
The rest of these instructions outline the functions that you should implement, describing their
input/output behaviors. As usual, you should start by completing the hawkid() function so that we may
properly credit you for your work. Test hawkid() to ensure it in fact returns your own hawkid as the
only element in a single element tuple. As you work on each function, test your work on the document
provided to make sure your code functions as expected. Feel free to upload versions of your code as you
go; we only grade the last version uploaded, so this practice allows you to "lock in" working partial
solutions prior to the deadline. Finally, some general guidance.
(1) You will be graded on both the correctness and the quality of your code, including the quality of
your comments!
(2) As usual, respect the function signatures provided.
(3) The template file has been pared down; you should take responsibility for providing the
appropriate level of documentation.
(4) Be careful with iteration; choose the most appropriate form of iteration (comprehension,
while, or for) as the function mandates. Poorly selected iterative forms may be graded down, even
if they work!
Finally, you may feel free to add new functions that you feel are necessary to complete the functionality
described herein. Comment appropriately!
def getText(file):
This function should open the file named file, and return the contents of the file formatted as a single
string. During processing, you should (i) remove any blank lines, (ii) remove any lines consisting entirely
of CAPITALIZED WORDS (see footnote 1), and (iii) replace any explicit ’\n’ (newline) characters with spaces unless
directly preceded by a ’-’ (hyphen), in which case you should simply remove both the hyphen and the
newline, restoring the original word.
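As a rough illustration of the three line-handling rules, the sketch below works on a list of lines already read into memory rather than on a file. The helper name join_lines is hypothetical, and it assumes that "CAPITALIZED WORDS" means a line written entirely in capitals, as footnote 1 suggests; it is one possible reading of the rules, not the required implementation.

```python
def join_lines(lines):
    """Apply the three getText() line rules to a list of lines (a sketch)."""
    kept = []
    for line in lines:
        line = line.strip()
        # Rules (i) and (ii): skip blank lines and lines written entirely in capitals.
        if not line or line.isupper():
            continue
        kept.append(line)
    text = "\n".join(kept)
    # Rule (iii): a hyphen right before a newline marks a word split across lines,
    # so drop both; every other newline becomes a space.
    return text.replace("-\n", "").replace("\n", " ")


sample = ["THE CAT\n", "\n", "The snow was fall-\n", "ing, and the Cat\n", "waited.\n"]
print(join_lines(sample))  # The snow was falling, and the Cat waited.
```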
def flushMarks(text):
This function should take as input a string such as what might be returned by getText() and return a new
string with the following modifications to the input:
Remove possessives, i.e., "’s" at the end of a word;
Remove ’)’, ’(’, ’,’, ’:’, ’-’, ’_’, ’...’, ’"’ and "’"; and
Replace ’!’, ’?’ and ’;’ with ’.’
A condition of this function is that it should be easy to change or extend the substitutions made. In other
words, a function that steps through each of these substitutions in an open-coded fashion will not get full
credit; write your function so that the substitutions can be modified or extended without having to
significantly alter the code. Here’s a hint: if your code for this function is more than a few lines long,
you’re probably not doing it right.
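One way to keep the substitutions easy to change is to hold them in a table and loop over it. The sketch below assumes regular expressions are allowed and that the text uses plain ASCII quotes; the names RULES and flush_marks_sketch are hypothetical, and other table-driven designs would serve equally well.

```python
import re

# Each rule is (pattern, replacement); order matters, since possessives must be
# handled before bare apostrophes are removed.
RULES = [
    (r"'s\b", ""),          # possessive 's at the end of a word
    (r"\.\.\.", ""),        # ellipsis
    ("[)(,:_\"'-]", ""),    # marks that are simply removed
    (r"[!?;]", "."),        # marks that become sentence terminators
]


def flush_marks_sketch(text, rules=RULES):
    """Apply every (pattern, replacement) rule in turn."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


print(flush_marks_sketch("The cat's dinner (a rabbit) was late... Why?"))
# The cat dinner a rabbit was late Why.
```

Adding or changing a substitution then means editing one entry in the table rather than the function body.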
def extractWords(text, i=0, k=None):
This function should take as input a string such as might be returned by flushMarks() and return an
ordered list of words extracted from the input string. The words returned should all be lowercase, and
should contain only characters, no punctuation.
You can think of the i and k arguments having similar — but not identical — function as the arguments to
range(). If left unspecified, the default behavior is to return all the words in the input text. Otherwise,
return a list of words starting with the ith word through, but not including, the (i + k)th word of the input
(if k is None, the default, then return all of the words in the input starting with the ith word through the
end).
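The difference from range() is that k acts as a count of words rather than a stopping index. A minimal sketch of that slicing behavior follows; the name extract_words_sketch is hypothetical, and it assumes that "only characters" means letters and digits.

```python
def extract_words_sketch(text, i=0, k=None):
    """Return lowercase words i through i + k - 1 (all words from i if k is None)."""
    # Lowercase each whitespace-separated token and keep only its letters and digits.
    words = ["".join(c for c in tok.lower() if c.isalnum()) for tok in text.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    return words[i:] if k is None else words[i:i + k]


text = "The Cat sat. It waited."
print(extract_words_sketch(text))        # ['the', 'cat', 'sat', 'it', 'waited']
print(extract_words_sketch(text, 1, 3))  # ['cat', 'sat', 'it']
print(extract_words_sketch(text, 3))     # ['it', 'waited']
```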
def extractSentences(text, i=0, k=None):
This function returns a list of sentences, where each sentence is defined as a string terminated by a ’.’
although the defining ’.’ itself is removed in the course of processing. The significance of i and k is the
same as for extractWords(), that is, they restrict the text to a passage between the ith up to but not
including the (i + k)th words. Note that, depending on i and k, the first and/or last sentences returned may
actually be sentence fragments.
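To see how a word window can produce sentence fragments, consider the sketch below. It assumes that words are whitespace-separated tokens and that the window is applied before splitting on ’.’; the name extract_sentences_sketch is hypothetical.

```python
def extract_sentences_sketch(text, i=0, k=None):
    """Split the passage covering words i through i + k - 1 into sentences."""
    tokens = text.split()
    tokens = tokens[i:] if k is None else tokens[i:i + k]
    # Rejoin the selected words, then split on the terminating periods.
    pieces = " ".join(tokens).split(".")
    return [p.strip() for p in pieces if p.strip()]


text = "The Cat sat. It waited. Snow fell."
print(extract_sentences_sketch(text))        # ['The Cat sat', 'It waited', 'Snow fell']
print(extract_sentences_sketch(text, 0, 4))  # ['The Cat sat', 'It']
```

In the second call the window ends after the word "It", so the last element is a fragment rather than a complete sentence.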
Footnote 1: To understand why we remove lines consisting entirely of CAPITALIZED WORDS, inspect the wind.txt
sample file provided. Notice that the frontispiece (title, index and so on) consists of ALL CAPS, and each CHAPTER
TITLE also appears on a line in ALL CAPS. Removing these lines leaves just the text of the story.
def countSyllables(word):
This function, which is provided (i.e., you don’t need to write it), takes as input a string representing a
word (such as one of the words in the output from extractWords()), and returns an integer representing the
number of syllables in that word. One problem is that the definition of syllable is unclear. As it turns out,
syllables are amazingly difficult to define in English (this may well be the topic of a future assignment).
The code provided here defines a syllable as follows. First, we strip any trailing ’s’ or ’e’ from the word
(the final ’e’ in English is often, but not always, silent). Next, we scan the word from beginning to end,
counting each transition between a consonant and a vowel, where vowels are defined as the letters ’a’, ’e’,
’i’, ’o’ and ’u’. So, for example, if the word is "creeps," we strip the trailing ’s’ to get "creep" and count
one leading vowel (the ’e’ following the ’r’), or a single syllable. Thus:
>>> countSyllables('creeps')
1
>>> countSyllables('devotion')
3
>>> countSyllables('cry')
1
The last example hints at the special status of the letter ’y’, which is considered a vowel when it follows a
non-vowel, but considered a non-vowel when it follows a vowel. So, for example:
>>> countSyllables('coyote')
2
Here, the ’y’ is a non-vowel so the two ’o’s correspond to 2 transitions, or 2 syllables (don’t forget we
stripped the trailing ’e’). And while that’s not really right (’coyote’ has 3 syllables, because the final ’e’ is
not silent here), it does properly recognize that the ’y’ is acting as a consonant.
You will find this definition of syllable works pretty well for simple words, but fails for more complex
words; English is a complex language with many orthographic bloodlines, so it may be unreasonable to
expect a simple definition of syllable! Consider, for example:
>>> countSyllables('consumes')
3
>>> countSyllables('splashes')
2
Here, it is tempting to treat the trailing -es as something else to strip, but that would cause ’splashes’ to
have only a single syllable. Clearly, our solution fails under some conditions; but I would argue it is close
enough for our intended use.
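The rules described above can be sketched as follows. This is a reconstruction from the prose, not the provided countSyllables() itself; the name count_syllables_sketch is hypothetical. It assumes that only a single trailing ’s’ or ’e’ is stripped (which is what the ’consumes’ and ’splashes’ examples imply) and that a ’y’ at the start of a word counts as following a non-vowel.

```python
VOWELS = "aeiou"


def count_syllables_sketch(word):
    """Count consonant-to-vowel transitions after stripping one trailing 's' or 'e'."""
    word = word.lower()
    if word and word[-1] in "se":
        word = word[:-1]
    count = 0
    prev_is_vowel = False
    for ch in word:
        if ch == "y":
            # 'y' is a vowel only when it follows a non-vowel.
            is_vowel = not prev_is_vowel
        else:
            is_vowel = ch in VOWELS
        # A new syllable starts each time a vowel follows a non-vowel.
        if is_vowel and not prev_is_vowel:
            count += 1
        prev_is_vowel = is_vowel
    return count


for w in ["creeps", "devotion", "cry", "coyote", "consumes", "splashes"]:
    print(w, count_syllables_sketch(w))
```

Under these assumptions the sketch reproduces the six results shown above (1, 3, 1, 2, 3 and 2).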
Readability Formulae
Next, we turn our attention to computing a variety of readability indexes. Readability indexes have been
used since the early 1900s to determine if the language used in a book or manual is too hard for a
particular audience. At that time, of course, most of the population didn’t have a high school degree, so
employers and the military were concerned that their instructions or manuals might be too difficult to
read. Note that the versions of these formulae used in this assignment may deviate slightly from what you
may find on the web.
def lix(text, i=0, k=None):
The Läsbarhetsindex Swedish Readability Formula, or LIX, like all the indexes here, is based on a sample
of the text. By default, we’ll compute the LIX test over the whole input, but if you want to run it only over
a subset of the text, use i and k to restrict which section of the text (in words) is considered.
The LIX formula is:
lix = wrd / snt + (100 × lng) / wrd
Where wrd is the number of words in the sample, snt is the number of sentences in the sample, and lng is
the number of words in the sample that exceed 6 characters. You should have no problem computing the
LIX formula for a text consisting of only complete sentences; any sentence fragments at the beginning
and/or the end of the text sample should be counted as one additional sentence (particularly important if
we use i and k to restrict the range for the LIX).
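A small worked example of the LIX arithmetic may help. The sketch assumes the words and the sentence count have already been obtained; the name lix_sketch is hypothetical. Note that "exceed 6 characters" means a length of 7 or more, so six-letter words such as "waited" are not counted.

```python
def lix_sketch(words, snt):
    """LIX from a list of words and a sentence count."""
    wrd = len(words)
    lng = sum(1 for w in words if len(w) > 6)  # words longer than 6 characters
    return wrd / snt + (100 * lng) / wrd


words = ["the", "cat", "waited", "patiently", "for", "his", "master", "yesterday"]
# wrd = 8, snt = 2, lng = 2 ("patiently", "yesterday"): 8/2 + 200/8 = 4 + 25
print(lix_sketch(words, 2))  # 29.0
```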
def fog(text, i=0, k=None, csyl = countSyllables):
Gunning’s Fog Index, or FOG, is defined as:
fog = 0.4 × (asl + phw)
Where asl is the average sentence length (in words) in the sample, and phw is the percentage of words in
the sample that are 3 or more syllables long.
Thanks to the complicated history of the English language, counting syllables is extremely complicated.
For this assignment, I’m providing you with a syllable counting function countSyllables(). In future
assignments, I might ask you to create a new function that uses a different definition of what a syllable is;
for this reason, your version of fog() should take (as an optional argument) the particular syllable
counting function you wish to use.
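Passing the syllable counter as an argument could look like the sketch below. It assumes the words and sentence count are already available and uses a stand-in counter, toy_csyl, purely for the demonstration; both fog_sketch and toy_csyl are hypothetical names, and phw is taken as a percentage on a 0 to 100 scale.

```python
def fog_sketch(words, snt, csyl):
    """Gunning's Fog Index from words, a sentence count, and a syllable counter."""
    hard = sum(1 for w in words if csyl(w) >= 3)  # words of 3 or more syllables
    asl = len(words) / snt                        # average sentence length
    phw = 100 * hard / len(words)                 # percentage of hard words
    return 0.4 * (asl + phw)


# Stand-in syllable counter for the demo only: a lookup with a default of 1.
TOY_COUNTS = {"devotion": 3}


def toy_csyl(word):
    return TOY_COUNTS.get(word, 1)


words = ["the", "cat", "felt", "a", "strange", "devotion"]
# asl = 6/2 = 3, phw = 100 * 1/6, so fog = 0.4 * (3 + 16.67) or about 7.87
print(round(fog_sketch(words, 2, toy_csyl), 2))
```

Because the counter is a parameter, a different definition of syllable can be swapped in without touching the formula.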
def srs(text, i=0, k=None, csyl = countSyllables):
The Smog Readability Score, like Gunning’s Fog Index, also relies on the notion of "hard" words, but
combines it with a sentence count:
srs = 1.043 × √(30 × hrd / snt) + 3.1291
where hrd is the number of hard words in the sample and snt is the number of sentences in the sample. As
with LIX, some care is required when handling sentence fragments in the sample.
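The arithmetic of the formula can be checked with a short sketch that takes the two counts directly. The name srs_sketch is hypothetical, and obtaining hrd and snt from the text is left out.

```python
import math


def srs_sketch(hrd, snt):
    """Smog Readability Score from a hard-word count and a sentence count."""
    return 1.043 * math.sqrt(30 * hrd / snt) + 3.1291


# 12 hard words in 10 sentences: 30 * 12 / 10 = 36, and the square root of 36 is 6,
# so the score is 1.043 * 6 + 3.1291, or about 9.3871.
print(round(srs_sketch(12, 10), 4))
```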
Testing Your Code
I have provided a function, evalText(), that you can use to manage the process of evaluating a piece of text.
def evalText(file='wind.txt', i=0, k=None, csyl=countSyllables):
    # Read the file, clean up its punctuation, then report each readability index.
    text = flushMarks(getText(file))
    print("Evaluating {}:".format(file.upper()))
    print("  {:5.2f} Lix Readability Formula".format(lix(text, i, k)))
    print("  {:5.2f} Gunning's Fog Index".format(fog(text, i, k, csyl)))
    print("  {:5.2f} Smog Readability Score".format(srs(text, i, k, csyl)))
Feel free to comment out readability indexes you haven’t yet tried to use.