Bill Manaris : Fall 2007 / CSCI 220 Homework 5

Assigned Date: Wednesday, Nov. 14, 2007
Due Date: Wednesday, Nov. 28, 2007
Due Time: 11:55pm

Last modified on November 19, 2007, at 12:21 PM (see updates)

Learning Objectives


See Introduction to Zipf's Law (also see Homework #2)


Write a program that reads in an e-book (filename provided by the user) and a level of abduction (a positive number). It calculates the histogram of words (i.e., for each word, it counts the number of times the word appears in the book). Then it removes the n most frequent words from the book, where n is the abduction level.
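The histogram and removal steps described above can be sketched as follows. This is only an illustrative sketch, not the required design: the function names `word_histogram` and `remove_most_frequent` are hypothetical, words are assumed to be separated by whitespace, and counting is assumed to be case-insensitive (the assignment does not say either way).

```python
from collections import Counter


def word_histogram(words):
    """Count how many times each word appears (assumed case-insensitive)."""
    return Counter(word.lower() for word in words)


def remove_most_frequent(words, level):
    """Return the words with the `level` most frequent words removed."""
    histogram = word_histogram(words)
    # most_common(level) returns the `level` (word, count) pairs
    # with the highest counts.
    most_frequent = {word for word, _count in histogram.most_common(level)}
    return [word for word in words if word.lower() not in most_frequent]


sample = "the cat and the dog and the bird".split()
print(remove_most_frequent(sample, 2))  # prints ['cat', 'dog', 'bird']
```

In the sample, "the" (3 occurrences) and "and" (2 occurrences) are the two most frequent words, so level 2 removes both.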

The output filename should be the same as the input one, but with the substring ".skim-n" inserted prior to the last ".", where n is the abduction level.
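One possible way to build the output filename is sketched below. The function name `skim_filename` is hypothetical, and the behavior for a filename with no "." (appending the suffix at the end) is an assumption, since the assignment does not cover that case. The sketch also assumes a plain filename rather than a path containing dots in directory names.

```python
def skim_filename(input_name, level):
    """Insert '.skim-<level>' before the last '.' of input_name."""
    # rpartition splits at the LAST '.'; if there is none,
    # the separator comes back as an empty string.
    base, dot, extension = input_name.rpartition(".")
    if not dot:
        # Assumption: no '.' in the name, so append the suffix.
        return f"{input_name}.skim-{level}"
    return f"{base}.skim-{level}.{extension}"


print(skim_filename("book.txt", 3))  # prints book.skim-3.txt
```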

Test your program with e-books from Project Gutenberg. (For now, use only the us-ascii encoding.)

Run the program on several books with levels 1, 2, 3, 4, etc. How far can you go and still get the gist of what each book is about?

Write a short report describing which books you used and what you observed. Save it in a file called report.txt. It should be a text file (not a Word document).

Your report should have your name, class, assignment, and date at the top.

Top-Down Design

Your program should be subdivided into functions following a top-down design. It should have the following functions:

Your functions should be thoroughly documented, as per the previous assignment. Your variable names should be meaningful.


To handle punctuation, see the example code in Chapter 11.
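As a rough illustration of one common approach (this is not the textbook's code, and the function name `strip_punctuation` is hypothetical), each punctuation character can be replaced with a space before the text is split into words. The sketch assumes us-ascii text and treats every character in `string.punctuation` as a separator, so a contraction such as "It's" splits into two words.

```python
import string


def strip_punctuation(text):
    """Lowercase text and replace each punctuation character with a space."""
    # Build a translation table mapping every punctuation
    # character to a single space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)


words = strip_punctuation("Hello, world! It's here.").split()
print(words)  # prints ['hello', 'world', 'it', 's', 'here']
```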


For bonus points, try to preserve the original punctuation in the output file. To do so:


Submit your program and report.txt via WebCT.


The following policies are in effect for this assignment:


All identifiers should be meaningful.

Include your design (pseudocode) as comments in your program.

The following comments should appear in your program as the first lines in the file. Items in angle brackets are either to be removed or replaced with what is specified within the brackets:

# Name: <your name goes here first and last minimum>
# <ProgramName>.py
# Problem: <Brief, one or two sentence description of the
#           problem that this program solves, in your own
#           words.>
# Certification of Authenticity: 
#   <include one of the following>
#   I certify that this lab is entirely my own work.
#   I certify that this lab is my own work, but I
#   discussed it with: <Name(s)>

