
CSCI 221 Homework 4 Bonus

Last modified on February 19, 2007, at 09:30 AM

Bonus

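For bonus points, extend ADT TextAnalyzer so that it measures how closely the word frequencies follow a Zipf distribution. (Zipf's law states that a word's frequency is roughly inversely proportional to its rank in the frequency table.) Add the following operations: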

  • double getZipfSlope()
    which returns the slope of the Zipf distribution of the word frequencies.
  • double getZipfRSquared()
    which returns the R² value of the Zipf distribution of the word frequencies. (Note: The R² value indicates how closely the data points follow the trendline overall. A value of 1 means that the data points lie exactly on the trendline, whereas a value near 0 means that the trendline explains almost none of the variation in the data.)
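
To make the two quantities concrete: a Zipf fit is commonly computed as a least-squares line through the points (log rank, log frequency). The sketch below assumes that convention. The assignment does not say how ZipfStatistics.class computes its values, so the names ZipfFitSketch and fit are hypothetical, and the numbers may differ from those returned by the provided class.

```java
import java.util.Arrays;

// Illustrative sketch only: this is NOT the code inside ZipfStatistics.class.
class ZipfFitSketch {

    // Returns {slope, rSquared} of the least-squares line through
    // (log10(rank), log10(frequency)), where rank 1 is the most frequent word.
    // All frequencies must be positive, since log10(0) is not finite.
    static double[] fit(double[] wordFrequencies) {
        int n = wordFrequencies.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two frequencies");
        }
        double[] sorted = wordFrequencies.clone();
        Arrays.sort(sorted); // ascending, so the largest value is at index n - 1

        double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
        for (int rank = 1; rank <= n; rank++) {
            double x = Math.log10(rank);
            double y = Math.log10(sorted[n - rank]);
            sumX += x;
            sumY += y;
            sumXX += x * x;
            sumYY += y * y;
            sumXY += x * y;
        }

        // Centered sums of squares and cross products.
        double sxx = sumXX - sumX * sumX / n;
        double syy = sumYY - sumY * sumY / n;
        double sxy = sumXY - sumX * sumY / n;

        double slope = sxy / sxx;
        // For a straight-line fit, R² equals the squared correlation.
        // It is NaN when every frequency is identical (syy == 0).
        double rSquared = (sxy * sxy) / (sxx * syy);
        return new double[] { slope, rSquared };
    }
}
```

Under this convention, frequencies that are exactly proportional to 1/rank (for example 100, 50, 33.3, 25) give a slope of about -1 and an R² of about 1.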

These methods should use ZipfStatistics.class. Download it and save it in the same directory as your source file. It contains the following methods:

  • double slope(double wordFrequencies[])
    which returns the slope of the Zipf distribution of the provided word frequencies.
  • double rSquared(double wordFrequencies[])
    which returns the R² of the Zipf distribution of the provided word frequencies.
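
Both methods take a double[] of word frequencies, so TextAnalyzer has to turn its word counts into such an array first. The sketch below shows one way to do that, assuming the counts are held in a Map<String, Integer>; that representation, the class name FrequencyArraySketch, and the method name toFrequencies are all hypothetical. The assignment does not say whether ZipfStatistics expects the array in any particular order, so sorting it in descending order here is a precaution, not a documented requirement.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper: assumes word counts are stored as a Map<String, Integer>.
class FrequencyArraySketch {

    // Copies the counts into a double[] sorted from most to least frequent.
    static double[] toFrequencies(Map<String, Integer> wordCounts) {
        double[] frequencies = new double[wordCounts.size()];
        int i = 0;
        for (int count : wordCounts.values()) {
            frequencies[i++] = count;
        }
        Arrays.sort(frequencies); // ascending order
        // Reverse in place to get descending order.
        for (int lo = 0, hi = frequencies.length - 1; lo < hi; lo++, hi--) {
            double tmp = frequencies[lo];
            frequencies[lo] = frequencies[hi];
            frequencies[hi] = tmp;
        }
        return frequencies;
    }
}
```

The resulting array is what getZipfSlope() and getZipfRSquared() would pass on to the corresponding ZipfStatistics methods.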

Data

Download a few books from Project Gutenberg and see if their word distributions follow Zipf's law.

As a control, try loading TextAnalyzer with random words. (Hint: Use Math.random() to generate such words.)
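
Following the hint, here is a minimal sketch of a random-word generator built on Math.random(). The class name RandomWordsSketch, the method names, and the choice of lowercase letters a to z with uniformly random lengths are assumptions made for illustration; the assignment does not prescribe how the random words should be formed.

```java
// Hypothetical helper for building control text out of random "words".
class RandomWordsSketch {

    // Returns a word of 1..maxLength random lowercase letters.
    static String randomWord(int maxLength) {
        int length = 1 + (int) (Math.random() * maxLength);
        StringBuilder word = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            // Math.random() is in [0, 1), so the offset is in 0..25.
            word.append((char) ('a' + (int) (Math.random() * 26)));
        }
        return word.toString();
    }

    // Returns wordCount random words separated by single spaces.
    static String randomText(int wordCount, int maxLength) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < wordCount; i++) {
            if (i > 0) {
                text.append(' ');
            }
            text.append(randomWord(maxLength));
        }
        return text.toString();
    }
}
```

Keep in mind that the maximum word length affects the control: with long words nearly every word is unique, so all frequencies are 1, while short words repeat often enough to produce a range of frequencies.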

Submission

In your README.TXT file, include the names and URLs of the books you tried, their slopes and R² values, and the slope(s) and R² value(s) of the random text(s). Also state your conclusion.

In the submitted .jar file, also include the system driver(s) you developed to explore this question.

License

ZipfStatistics.class is made available under a Creative Commons License. It was developed by Chris Wagner, Charles McCormick, and Bill Manaris.