CSCI 220
HOMEWORK ASSIGNMENT #4
Assigned Date: Tuesday, March 16, 2004
Due Date: Wednesday, March 24, 2004
Due Time: Noon

 

Updates: March 17, 2004

Files to be submitted:  WordStatistics.java, WordStatistics.txt

Skills Developed: Arrays, parallel arrays, logical vs. physical array length, selection and iteration structures.

Documentation and submission:  See instructions in the first homework assignment.

Background:

The availability of computers with text manipulation capabilities has resulted in some rather interesting approaches to analyzing the writings of great authors. Much attention has been focused on whether William Shakespeare ever lived. Some scholars believe there is substantial evidence indicating that Christopher Marlowe actually penned the masterpieces attributed to Shakespeare. Researchers have used computers to find similarities in the writings of these two authors, as well as other authors.

Assignment: 

Your assignment is to write a program that reads in some text and prints out a table indicating the relative frequency of each unique word in the text.   The relative frequency of a word is calculated by dividing the number of occurrences of this word in the text by the total number of words in the text.
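The calculation described above can be sketched as follows. This is only an illustration, not a required design: the class name RelativeFrequencyDemo and the method relativeFrequency are hypothetical, and returning 0.0 for an empty text is an assumption. The point to notice is the cast to double, since dividing one int by another in Java discards the fractional part.

```java
// Hypothetical helper class; the names are illustrative, not part of the assignment.
class RelativeFrequencyDemo {

    // Relative frequency = occurrences of a word / total number of words in the text.
    // The cast to double avoids integer division, which would truncate the result to 0.
    static double relativeFrequency(int occurrences, int totalWords) {
        if (totalWords == 0) {
            return 0.0; // assumption: an empty text yields 0 rather than a division by zero
        }
        return (double) occurrences / totalWords;
    }

    public static void main(String[] args) {
        // A word that occurs 3 times in a text of 18 words.
        System.out.println(relativeFrequency(3, 18));
    }
}
```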

The first version of your program should output the words in the table in the same order in which they appear in the text.  For bonus points, output the words sorted in decreasing order of frequency.  Also for bonus points, output frequency in the format xx.xx% (using mathematical operations).
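One possible reading of "using mathematical operations" is to build the xx.xx% text with multiplication, rounding, integer division and remainder instead of a formatting class. The sketch below takes that reading; the class name PercentFormatDemo, the method toPercent, and the choice to pad values under 10% with a leading zero are all assumptions, not requirements.

```java
// Hypothetical helper class; one way to produce xx.xx% with arithmetic only.
class PercentFormatDemo {

    // Converts a fraction between 0 and 1 into text such as "16.67%".
    static String toPercent(double fraction) {
        // Scale to hundredths of a percent and round to the nearest whole number,
        // e.g. 0.166666... becomes 1667.
        long hundredths = Math.round(fraction * 10000.0);
        long whole = hundredths / 100; // digits before the decimal point
        long frac = hundredths % 100;  // two digits after the decimal point
        // Assumption: pad with a leading zero so both parts show at least two digits.
        String wholeText = (whole < 10 ? "0" : "") + whole;
        String fracText = (frac < 10 ? "0" : "") + frac;
        return wholeText + "." + fracText + "%";
    }

    public static void main(String[] args) {
        System.out.println(toPercent(3.0 / 18.0));
    }
}
```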

For example, the excerpt:

      "To be, or not to be: that is the question:
    Whether ’tis nobler in the mind to suffer"

contains the word "to" three times, the word "be" two times, the word "or" once, etc.
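The counts above can be reproduced with a pair of parallel arrays, as in the sketch below. It is a rough illustration under stated assumptions, not a model solution: the class WordTable and its methods are hypothetical names, the delimiter list is a guess at which punctuation to strip, and words are lowercased because the example counts "To" and "to" as the same word. The arrays are allocated at their full physical length, while uniqueCount tracks the logical length actually in use.

```java
import java.util.StringTokenizer;

// Hypothetical class: parallel arrays holding each unique word and its count.
class WordTable {
    private final String[] words; // unique words, in order of first appearance
    private final int[] counts;   // counts[i] is the number of occurrences of words[i]
    private int uniqueCount = 0;  // logical length: how many slots are in use
    private int totalWords = 0;   // every word read, repeats included

    WordTable(int capacity) {     // capacity is the physical length of both arrays
        words = new String[capacity];
        counts = new int[capacity];
    }

    // Splits one line into words and records each of them.
    void addLine(String line) {
        // Assumption: these characters separate words; the apostrophe is kept.
        StringTokenizer tokens = new StringTokenizer(line.toLowerCase(), " \t\n\r,.;:!?\"");
        while (tokens.hasMoreTokens()) {
            record(tokens.nextToken());
        }
    }

    private void record(String word) {
        totalWords++;
        for (int i = 0; i < uniqueCount; i++) { // search only the slots in use
            if (words[i].equals(word)) {
                counts[i]++;
                return;
            }
        }
        if (uniqueCount < words.length) {       // new word: append if there is room
            words[uniqueCount] = word;
            counts[uniqueCount] = 1;
            uniqueCount++;
        }
        // If the table is full, the word still counts toward totalWords but is not stored.
    }

    int countOf(String word) {
        for (int i = 0; i < uniqueCount; i++) {
            if (words[i].equals(word)) {
                return counts[i];
            }
        }
        return 0;
    }

    int uniqueCount() { return uniqueCount; }
    int totalWords() { return totalWords; }

    public static void main(String[] args) {
        WordTable table = new WordTable(20000);
        table.addLine("To be, or not to be: that is the question:");
        table.addLine("Whether 'tis nobler in the mind to suffer");
        for (int i = 0; i < table.uniqueCount; i++) {
            System.out.println(table.words[i] + " " + table.counts[i]);
        }
    }
}
```

The linear search in record keeps the sketch close to the array skills this assignment practices; it is simple rather than fast.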

Create a text file called WordStatistics.txt. In this file, discuss your conclusions regarding your program's potential in authorship attribution. In other words, is it possible to tell the difference between works written by two different authors simply based on your program's output? (one paragraph) Discuss whether or not your results are conclusive.  If not, explain how one could investigate further (without necessarily doing the extra work).

Also discuss how this program could be improved (another paragraph).

Notes:

1.     Assume a maximum of 20,000 unique words.

2.     You should use Chapman’s StdIn class and its readLine() method.

3.     You are encouraged to use String and StringTokenizer (see Chapter 10) in this assignment.

4.     To have your program accept input from a file, rather than default to the keyboard, you can use redirection. For example:
 
   java WordStatistics < someFile.txt
 
will read from a file in the current directory called someFile.txt rather than waiting for the user to type input at the keyboard.

5.     Test your program with different inputs to ensure that it works properly.

6.     Once you are satisfied that your program works, run your program against the files
· Shakespeare-King-Lear.txt
· Shakespeare-Macbeth.txt
· Shakespeare-Othello.txt
· Shakespeare-Romeo-and-Juliet.txt
 
Then run your program against the files
· Mark-Twain-A-Tramp-Abroad.txt
· Mark-Twain-The-Tragedy-of-Puddnhead-Wilson.txt
· Mark-Twain-Tom-Sawyer-Abroad.txt
· Mark-Twain-Tom-Sawyer-Detective.txt

 

 

Credits

Adapted from Deitel and Deitel (1994), “C – How to Program”, 2nd ed., p. 360.