Assigned Date: Wednesday, Nov. 14, 2007
Due Dates: Wednesday, Nov. 28, 2007
Due Time: 11:55pm
Last modified on November 19, 2007, at 12:21 PM (see updates)
See Introduction to Zipf's Law (also see Homework #2)
zipfBookSkimmer.py which reads in an e-book (filename provided by the user) and a level of abduction (a positive number). It calculates the histogram of words (i.e., for each word it counts the number of times it appears in the book). Then, it removes the n most frequent words from the book, where n is the abduction level.
The output filename should be the same as the input one, but with the substring ".skim-n" inserted prior to the last ".", where n is the abduction level.
Test your program with e-books from Project Gutenberg. (For now, use only the us-ascii encoding.)
Run the program on several book with levels 1, 2, 3, 4, etc. How far can you go and still get the gist of what this book is about?
Write a short report describing which books you used and what you observed. Save it in a file called
report.txt. It should be a text file (not a Word document).
Your report should have your name, class, assignment, and date at the top.
Your program should be subdivided in a top-down design fashion. It should have the following functions:
readBook(filename) -- returns the book as a list of words with all punctuation removed.
getHistogram(words) -- returns a histogram of words (dictionary of words and their frequencies).
abduct(words, histogram, abductionLevel) -- returns a list of words. This is the original list of words, but with
abductionLevel most frequent words removed.
- For example, if
abductionLevel is 1 and
words is ['Perfection', 'is', 'reached', 'not', 'when', 'there', 'is', 'no', 'longer', 'anything', 'to', 'add,', 'but', 'when', 'there', 'is', 'no', 'longer', 'anything', 'to', 'take', 'away'], the returned list of words should be ['Perfection', 'reached', 'not', 'when', 'there', 'no', 'longer', 'anything', 'to', 'add,', 'but', 'when', 'there', 'no', 'longer', 'anything', 'to', 'take', 'away'], i.e., the word 'is' was removed.
- If two or more words have the same frequency, as in ['to', 'be', 'or', 'not', 'to', 'be'], pick one of them to remove. So either output is valid (your choice):
- ['be', 'or', 'not', 'be']
- ['to', 'or', 'not', 'to']
outputBook(abductedWords, filename, abductionLevel) -- outputs the abducted book into a properly named file. It calls the following function:
buildFilename(filename, abductionLevel) -- returns the output filename constructed as per specifications above.
Your functions should be thoroughly documented, as per the previous assignment. Your variable names should be meaningful.
To handle punctuation see example code in chapter 11.
For bonus points, try to preserve the original punctuation in the output file. To do so:
- Keep punctuation in the list of words returned by
getHistogram() treat, for example, "this" and "this!" as the same word.
abduct() remove the words but not the punctuation.
- For example, if "this" is to be removed, and the list of words contains, for example ["this", "this!", "This.", "anthropopathism"],
abduct() should return ["!", ".", "anthropopathism"].
- In other words, completely remove a word (remove list item) if it has no punctuation; otherwise remove the word substring but leave the punctuation (do NOT remove the list item - just modify it). (Note that the word "anthropopathism" contains the string "this", but it is not removed.)
abduct() should call functions:
isEqual(frequentWord, word, punctuation) -- returns True
or False@@ if the two words are equal as defined above. For example, "this" and "This." are equal, whereas "this" and "anthropopathism" are not.
removeWord(word, punctuation) -- returns either an empty string or a string with the remaining punctuation. For example, if
word is "this", it returns ""; if
word is "This." it returns ".".
report.txt via WebCT.
The following policies are in effect for this assignment:
- Programming assignment grades will be based on design and style as well as correctness of result.
- You may discuss the problem and how to solve it with others, but you may not look at, copy, or use any code (or pseudocode) that was written by anyone other than yourself. If I have evidence that you have shared program code or used code found anywhere, your grade will be zero.
- If you do discuss the problem and how to solve it with others, you must document that in the program code.
- Not following these rules is in violation of the Student Honor Code and instances of such violations will be reported to the Honor Board.
All identifiers should be meaningful.
Include your design (pseudocode) as comments in your program.
The following comments should appear in your program as the first lines in the file. Items in angle brackets are either to be removed or replaced with what is specified within the brackets:
# Name: <your name goes here – first and last minimum>
# Problem: <Brief, one or two sentence description of the
# problem that this program solves, in your own
# Certification of Authenticity:
# <include one of the following>
# I certify that this lab is entirely my own work.
# I certify that this lab is my own work, but I
# discussed it with: <Name(s)>
- Kenneth J. Hsu, Andrew Hsu, "Self-Similarity of the '1/f Noise' Called Music", Proceedings of the National Academy of Sciences of the United States of America, Vol. 88, No. 8 (Apr. 15, 1991), pp. 3507-3509.
- Michael Frame and Benoit B. Mandelbrot, "A Panorama of Fractals and Their Uses", Mathematics Department, Yale University.