CSCI 220 Homework Assignment #5

Assigned Date: Monday, March 3, 2003 
Due Date: Friday, March 21, 2003 
Due Time: Noon 

Updated: Wednesday, March 19, 2003 03:19 PM

Program file name to be submitted: TextStatistics.java

Skills Developed: Arrays, Iteration, Selection structures, Objects.

Background: The availability of computers with text manipulation capabilities has resulted in some rather interesting approaches to analyzing the writings of great authors. Much attention has been focused on whether William Shakespeare ever lived. Some scholars believe there is substantial evidence indicating that Christopher Marlowe actually penned the masterpieces attributed to Shakespeare. Researchers have used computers to find similarities in the writings of these two authors, as well as other authors.

Purpose: This assignment asks you to write a program that reads zero or more lines of text and outputs some statistics: the relative frequency of each word length, and the Zipf slope and R^2 value computed from those frequencies.

Sample Run: (Updated Wednesday, March 19, 2003 03:19 PM) The phrase "To be, or not to be: that is the question!" should produce the following output:

Word-Length   Frequency
-----------------------
    1            0.0
    2            0.4
    3            0.4
    4            0.1
    5            0.0
    6            0.0
    7            0.0
    8            0.0
    9            0.1
    10           0.0
    11           0.0
    12           0.0

Zipf Slope: -1.145493858013254
Zipf R^2: 0.7402646669426666
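
The two Zipf values above appear consistent with a least-squares line fitted to log(frequency) against log(rank), where the nonzero word-length frequencies are ranked from largest to smallest. The sketch below only illustrates that idea under this assumption. It is not the code inside ZipfStatistics.class, and the names ZipfSketch and fit are hypothetical.

```java
import java.util.Arrays;

public class ZipfSketch {

    // Fits a straight line to (log rank, log frequency) by least squares.
    // Assumption: zero frequencies are skipped and the rest are ranked
    // from largest (rank 1) to smallest.
    // Returns a two-element array: {slope, rSquared}.
    // The result is NaN when fewer than two nonzero frequencies exist
    // or when all nonzero frequencies are equal.
    static double[] fit(double[] frequencies) {
        double[] sorted = (double[]) frequencies.clone();
        Arrays.sort(sorted);                       // ascending order

        int n = 0;
        double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;

        // Walk from the largest value down, stopping at the first zero.
        for (int i = sorted.length - 1; i >= 0 && sorted[i] > 0; i--) {
            n++;                                   // n is also the current rank
            double x = Math.log(n);
            double y = Math.log(sorted[i]);
            sumX += x;
            sumY += y;
            sumXX += x * x;
            sumYY += y * y;
            sumXY += x * y;
        }

        double sxx = sumXX - sumX * sumX / n;
        double syy = sumYY - sumY * sumY / n;
        double sxy = sumXY - sumX * sumY / n;

        double slope = sxy / sxx;
        double rSquared = (sxy * sxy) / (sxx * syy);
        return new double[] { slope, rSquared };
    }

    public static void main(String[] args) {
        // Frequencies for word lengths 1 through 12 from the sample run.
        double[] sample = { 0.0, 0.4, 0.4, 0.1, 0.0, 0.0,
                            0.0, 0.0, 0.1, 0.0, 0.0, 0.0 };
        double[] result = fit(sample);
        System.out.println("Slope: " + result[0]);
        System.out.println("R^2:   " + result[1]);
    }
}
```

For the sample phrase the nonzero frequencies are 0.4, 0.4, 0.1, and 0.1, and this fit gives a slope near -1.1455 and an R^2 near 0.7403.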


Specifications: Your code should store the frequency of word-lengths using an array. You need to handle only words containing up to 12 characters.
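
As a rough illustration of the array-based approach, the sketch below counts word lengths in a single string, treating every run of non-space characters as a word. It makes two assumptions that the assignment does not state: slot i of the array holds the value for length i (slot 0 is unused), and words longer than 12 characters are skipped entirely. The names WordLengthSketch and wordLengthFrequencies are hypothetical.

```java
public class WordLengthSketch {

    static final int MAX_LENGTH = 12;

    // Returns an array in which element i is the fraction of counted
    // words that have length i. Element 0 is unused.
    // Assumption: words longer than MAX_LENGTH are ignored.
    static double[] wordLengthFrequencies(String line) {
        int[] counts = new int[MAX_LENGTH + 1];
        int totalWords = 0;
        int currentLength = 0;

        // Go one position past the end so the last word is also closed.
        for (int i = 0; i <= line.length(); i++) {
            boolean atEnd = (i == line.length());
            if (!atEnd && line.charAt(i) != ' ') {
                currentLength++;                   // still inside a word
            } else if (currentLength > 0) {
                if (currentLength <= MAX_LENGTH) { // a word just ended
                    counts[currentLength]++;
                    totalWords++;
                }
                currentLength = 0;
            }
        }

        double[] frequencies = new double[MAX_LENGTH + 1];
        if (totalWords > 0) {
            for (int len = 1; len <= MAX_LENGTH; len++) {
                frequencies[len] = counts[len] / (double) totalWords;
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        double[] f = wordLengthFrequencies(
                "To be, or not to be: that is the question!");
        for (int len = 1; len <= MAX_LENGTH; len++) {
            System.out.println(len + "\t" + f[len]);
        }
    }
}
```

The sample phrase has ten words: four of length 2, four of length 3, one of length 4, and one of length 9 (punctuation counts toward the length), which gives the frequencies 0.4, 0.4, 0.1, and 0.1 shown in the sample run.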

Notes:

  1. To read a line of input, use Chapman's readString() method.  This method returns null when there is no input to read.
     
  2. To extract word-lengths, use one or more of the methods provided with Java's String class.  (In particular, see the String methods charAt and length, and an example of how to use String methods.) 
     
  3. If 'JKHLKjjj233' appears in the input, it should be treated as a word of length 11. 
     
  4. (Updated Wednesday, March 19, 2003 03:19 PM) Words are delimited by one or more space characters. In other words, space characters do not contribute to the length of a word.  For simplicity, count all non-space characters (including punctuation) as part of a word.  
     
  5. (Updated Wednesday, March 19, 2003 03:19 PM) To calculate the Zipfian distribution use ZipfStatistics.class.  Download it and save it in the same directory as your source file.  It contains the following methods:

    Slope(wordLengthFrequencies) -- this method returns the Zipf slope (a double value) of the word-length frequencies stored in double array wordLengthFrequencies[].

    RSquared(wordLengthFrequencies) -- this method returns the Zipf R^2 value (a double) of the word-length frequencies stored in double array wordLengthFrequencies[].
     

  6. (Updated Wednesday, March 19, 2003 03:19 PM) Once you are satisfied that your program works, run your program against the files 
    Shakespeare-King-Lear.txt
    Shakespeare-Macbeth.txt
    Shakespeare-Othello.txt
    Shakespeare-Romeo-and-Juliet.txt

    Then run your program against the files 
    Mark-Twain-A-Tramp-Abroad.txt
    Mark-Twain-The-Tragedy-of-Puddnhead-Wilson.txt
    Mark-Twain-Tom-Sawyer-Abroad.txt
    Mark-Twain-Tom-Sawyer-Detective.txt

    These files will be made available via the class web page a few days before the assignment is due. 
     
  7. To have your program accept input from a file, rather than default to the keyboard, you can use redirection. For example:

    java TextStatistics < datafile.txt

    will read from a file in the current directory called datafile.txt rather than waiting for the user to type input at the keyboard.
     
  8. Create a text file called TextStatistics.txt. In this file, discuss your conclusions regarding your program's potential in authorship attribution. In other words, is it possible to tell the difference between works written by two different authors simply based on your program's output?  Discuss how this program could be improved.
     
  9. Submit the files TextStatistics.java and TextStatistics.txt, by the due date and time, on diskette. The files should be in the top-level directory.
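
The read-until-null loop from note 1 and the redirection from note 7 can be tried out with the standard library alone. The sketch below uses java.io.BufferedReader as a stand-in for Chapman's readString(); its readLine() method also returns null when there is no more input. The class name ReadLinesSketch and the method countWords are hypothetical, and the loop only counts words rather than producing the required statistics.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;

public class ReadLinesSketch {

    // Reads lines until readLine() returns null and counts the words,
    // where a word is any run of non-space characters.
    static int countWords(BufferedReader in) {
        int words = 0;
        try {
            String line = in.readLine();
            while (line != null) {                 // null means no more input
                boolean inWord = false;
                for (int i = 0; i < line.length(); i++) {
                    if (line.charAt(i) != ' ') {
                        if (!inWord) {
                            words++;               // first character of a word
                            inWord = true;
                        }
                    } else {
                        inWord = false;
                    }
                }
                line = in.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // System.in is the keyboard by default, or a file when the
        // program is started with "< datafile.txt".
        BufferedReader in =
                new BufferedReader(new InputStreamReader(System.in));
        System.out.println("Words read: " + countWords(in));
    }
}
```

Running "java ReadLinesSketch < datafile.txt" feeds the file to the same loop that would otherwise wait for keyboard input. An empty file yields zero words, which covers the "zero or more lines" case.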