Assigned Date: Wednesday, Nov. 14, 2007
Due Date: Wednesday, Nov. 28, 2007
Due Time: 11:55pm
Last modified on November 19, 2007, at 12:21 PM (see updates)
Write a program called zipfBookSkimmer.py that reads in an e-book (filename provided by the user) and a level of abduction (a positive number). It calculates the histogram of words (i.e., for each word, it counts the number of times the word appears in the book). Then, it removes the n most frequent words from the book, where n is the abduction level, and writes the result to an output file.
The output filename should be the same as the input one, but with the substring ".skim-n" inserted prior to the last ".", where n is the abduction level.
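The naming rule above can be sketched as follows. This is only one possible reading of the specification, written for Python 3. The handling of a filename that contains no "." is an assumption, since the assignment does not cover that case.

```python
def buildFilename(filename, abductionLevel):
    """Return filename with '.skim-n' inserted before the last '.'."""
    # rpartition splits around the LAST "." in the name
    base, dot, extension = filename.rpartition(".")
    suffix = ".skim-" + str(abductionLevel)
    if not dot:
        # Assumption: with no "." in the name, append the suffix at the end
        return filename + suffix
    return base + suffix + dot + extension


print(buildFilename("moby.txt", 3))
```

For a name with several dots, such as "pg.2701.txt", only the last "." is used as the insertion point.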
Test your program with e-books from Project Gutenberg. (For now, use only the us-ascii encoding.)
Run the program on several books with levels 1, 2, 3, 4, etc. How far can you go and still get the gist of what each book is about?
Write a short report describing which books you used and what you observed. Save it in a file called report.txt. It should be a text file (not a Word document).
Your report should have your name, class, assignment, and date at the top.
Your program should be subdivided in a top-down design fashion. It should have the following functions:
readBook(filename) -- returns the book as a list of words with all punctuation removed.
getHistogram(words) -- returns a histogram of words (dictionary of words and their frequencies).
abduct(words, histogram, abductionLevel) -- returns a list of words. This is the original list of words, but with the abductionLevel most frequent words removed. For example, if abductionLevel is 1 and words is ['Perfection', 'is', 'reached', 'not', 'when', 'there', 'is', 'no', 'longer', 'anything', 'to', 'add,', 'but', 'when', 'there', 'is', 'no', 'longer', 'anything', 'to', 'take', 'away'], the returned list of words should be ['Perfection', 'reached', 'not', 'when', 'there', 'no', 'longer', 'anything', 'to', 'add,', 'but', 'when', 'there', 'no', 'longer', 'anything', 'to', 'take', 'away'], i.e., the word 'is' was removed.
outputBook(abductedWords, filename, abductionLevel) -- outputs the abducted book into a properly named file. It calls the following function:
buildFilename(filename, abductionLevel) -- returns the output filename constructed as per specifications above.
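As an illustration of how getHistogram() and abduct() fit together, here is a minimal sketch in Python 3. It is not a reference solution. The tie-breaking rule (words with equal counts are ranked by first appearance) is an assumption, because the assignment does not say how ties should be resolved, and the comparison is case-sensitive.

```python
def getHistogram(words):
    """Return a dictionary mapping each word to its number of occurrences."""
    histogram = {}
    for word in words:
        histogram[word] = histogram.get(word, 0) + 1
    return histogram


def abduct(words, histogram, abductionLevel):
    """Return words without the abductionLevel most frequent words."""
    # Rank distinct words from most to least frequent. sorted() is stable,
    # so tied words stay in order of first appearance (an assumption).
    ranked = sorted(histogram, key=histogram.get, reverse=True)
    frequentWords = set(ranked[:abductionLevel])
    return [word for word in words if word not in frequentWords]


quote = ("Perfection is reached not when there is no longer anything "
         "to add, but when there is no longer anything to take away")
words = quote.split()
print(abduct(words, getHistogram(words), 1))
```

With abductionLevel 1, the only word removed from the quote is 'is', which appears three times while every other word appears at most twice.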
Your functions should be thoroughly documented, as per the previous assignment. Your variable names should be meaningful.
To handle punctuation, see the example code in chapter 11.
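One common way to strip punctuation is sketched below. This is not the chapter 11 code. It is a separate Python 3 illustration that relies on string.punctuation from the standard library. The helper name removePunctuation is hypothetical and not part of the required design.

```python
import string


def removePunctuation(text):
    """Return text with every ASCII punctuation character deleted."""
    # str.maketrans with a third argument maps those characters to None
    return text.translate(str.maketrans("", "", string.punctuation))


def readBook(filename):
    """Return the book as a list of words with all punctuation removed."""
    # Assumption: the e-book uses the us-ascii encoding
    with open(filename, encoding="ascii") as bookFile:
        return removePunctuation(bookFile.read()).split()


print(removePunctuation("Hello, world! It's here.").split())
```

Deleting punctuation outright also merges contractions and hyphenated words ("It's" becomes "Its"), which may or may not be the behavior you want.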
For bonus points, try to preserve the original punctuation in the output file. To do so:
getHistogram() should treat, for example, "this" and "this!" as the same word.
abduct() should remove the words but not the punctuation.
abduct() should return ["!", ".", "anthropopathism"].
abduct() should call the following functions:
isEqual(frequentWord, word, punctuation) -- returns True if the two words are equal as defined above, and False otherwise. For example, "this" and "This." are equal, whereas "this" and "anthropopathism" are not.
removeWord(word, punctuation) -- returns either an empty string or a string with the remaining punctuation. For example, if word is "this", it returns ""; if word is "This.", it returns ".".
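The two bonus helpers might look like the sketch below (Python 3). It assumes that the punctuation parameter is a string of punctuation characters, such as string.punctuation, and that equality ignores letter case, as the "this" and "This." example suggests. Neither detail is stated explicitly in the assignment.

```python
import string


def isEqual(frequentWord, word, punctuation):
    """Return True if word matches frequentWord, ignoring case and punctuation."""
    # Assumption: punctuation is a string of characters to ignore
    lettersOnly = "".join(ch for ch in word if ch not in punctuation)
    return lettersOnly.lower() == frequentWord.lower()


def removeWord(word, punctuation):
    """Return only the punctuation characters of word, in their original order."""
    return "".join(ch for ch in word if ch in punctuation)


print(isEqual("this", "This.", string.punctuation))
print(repr(removeWord("This.", string.punctuation)))
```

Keeping the leftover punctuation as its own list entry is what lets the output file preserve the original punctuation after the word itself is gone.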
Submit zipfBookSkimmer.py and report.txt via WebCT.
The following policies are in effect for this assignment:
All identifiers should be meaningful.
Include your design (pseudocode) as comments in your program.
The following comments should appear in your program as the first lines in the file. Items in angle brackets are either to be removed or replaced with what is specified within the brackets: