CSCI 221
HOMEWORK ASSIGNMENT #3
Assigned Date: Monday, March 21, 2005 (sec 2 +1 day)

Due Dates:

TextAnalyzer.java and TextAnalyzer.html:  noon, Thursday, March 31, 2005 (sec 2 +1 day)
Everything else :  midnight, Monday, April 4, 2005 (sec 2 +1 day)

 


 

Updated: Friday, April 01, 2005
Tuesday, March 29, 2005

 

Purpose:

This assignment focuses on the logical, application, and implementation levels of ADTs.  Also, it will expose you to quantitative linguistics and Zipf’s law.

 

Background:

George Kingsley Zipf (1902-1950) was a linguistics professor at Harvard, who died just after publishing his seminal book, Human Behavior and the Principle of Least Effort [1].  In this book, Zipf collected results from various fields that demonstrated an intriguing relationship (or regularity) found in natural phenomena. 

Zipf’s main contributions were that (a) he was the first to hypothesize that there is a universal principle at play, and (b) he proposed a mathematical formula to describe it.  Although his attempts to derive a comprehensive theory were incomplete (and some say misguided), his mathematical formula was quite accurate.  According to Yale Alumni Magazine, Zipf’s work had considerable influence on a young graduate student named Benoit Mandelbrot, who went on to develop the field of Fractal Geometry [2, 4]. 

Zipf’s law models the scaling (fractal) properties of many phenomena in human ecology, including natural language and music [1, 2, 3].  Zipf’s law is one of many related laws that describe scaling properties of phenomena studied in the physical, biological, and behavioral sciences.  These include Pareto’s law, Lotka’s law, power laws, Benford’s law, Bradford’s law, Heaps’ law, etc. [4, 5]. 

Informally, Zipf’s law describes phenomena where certain types of events are quite frequent, whereas other types of events are rare.  For example, in English, short words (e.g., “a”, “the”) are very frequent, whereas long words (e.g., “anthropomorphologically”) are quite rare. 

Surprisingly, if we compare a word’s frequency of occurrence with its statistical rank, we notice an inverse relationship: successive word counts are roughly proportional to 1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, and so on [4].  This is captured by the formula:

                                    $P(f) \sim 1/f^{n}$

where P(f) denotes the probability of a word (or event) of rank f and n is close to 1. 
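
Taking logarithms of both sides makes the inverse relationship easier to see.  In the short derivation below, the constant c is an assumption that stands for the unspecified proportionality factor:

```latex
P(f) \sim \frac{1}{f^{n}}
\quad\Longrightarrow\quad
\log P(f) = -n \log f + c
```

On log-log axes, the rank-frequency points therefore fall near a straight line with slope –n; for n = 1 the slope is –1.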

In physics, Zipf’s law is a special case of a power law.  When n is 1 (Zipf’s ideal), the phenomenon is called 1/f or pink noise.  When n is 0 it is called white noise.  When n is 2 it is called brown noise.  Zipf (1/f, pink noise) distributions have been discovered in a wide range of human and naturally occurring phenomena, including city sizes, incomes, subroutine calls, earthquake magnitudes, thickness of sediment depositions, clouds, trees, extinctions of species, traffic jams, and visits to websites [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. 

The type of structural regularity captured by Zipf’s law can be visualized by plotting such distributions, e.g., visits to websites:

Fig. 1.  Number of unique website hits (y-axis) ordered by website’s statistical rank (x-axis)
on log-log scale [7].  (Also see [8].)

The plot is a nearly straight line with a slope of –1.

In general, the slope may range from 0 to –∞, with –1.0 denoting Zipf’s ideal.  A slope near 0 indicates a random probability of occurrence (e.g., having y-axis values generated by Math.random()).  A slope tending towards –∞ indicates a monotonous phenomenon (i.e., one event predominates).  It has been suggested that a slope near –1.0 corresponds to a balance that feels natural and even aesthetically pleasing to humans for certain phenomena, such as music, urban structures, and landscapes [3, 6, 9].
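
One way to estimate such a slope is a least-squares line fitted to the logarithm of the rank against the logarithm of the count.  The sketch below is only an illustration of that idea: the class ZipfFitSketch and its method fit are hypothetical names, and nothing here is claimed to match how ZipfStatistics.class computes its values.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; this is NOT the ZipfStatistics API. */
class ZipfFitSketch {

    /**
     * Fits a least-squares line to (log10 rank, log10 count) and returns
     * {slope, rSquared}.  Counts are ranked from largest to smallest.
     */
    static double[] fit(int[] counts) {
        if (counts.length < 2) {
            throw new IllegalArgumentException("need at least two counts");
        }
        int[] sorted = counts.clone();
        Arrays.sort(sorted);                            // ascending order
        if (sorted[0] <= 0) {
            throw new IllegalArgumentException("counts must be positive");
        }

        int n = sorted.length;
        double[] x = new double[n];
        double[] y = new double[n];
        double meanX = 0.0, meanY = 0.0;
        for (int rank = 1; rank <= n; rank++) {
            x[rank - 1] = Math.log10(rank);
            y[rank - 1] = Math.log10(sorted[n - rank]); // rank 1 = largest count
            meanX += x[rank - 1] / n;
            meanY += y[rank - 1] / n;
        }

        double sxx = 0.0, syy = 0.0, sxy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }

        double slope = sxy / sxx;
        // Assumption: report R^2 as 0 when all counts are (numerically) equal,
        // since the ratio is undefined for a perfectly flat line.
        double rSquared = (syy < 1e-12) ? 0.0 : (sxy * sxy) / (sxx * syy);
        return new double[] { slope, rSquared };
    }

    public static void main(String[] args) {
        // Counts proportional to 1, 1/2, 1/3, ... (an idealized Zipf distribution)
        double[] result = fit(new int[] { 600, 300, 200, 150, 120, 100 });
        System.out.println("slope = " + result[0] + ", R^2 = " + result[1]);
    }
}
```

Counts that are all equal give a slope near 0, while counts proportional to 1, 1/2, 1/3, … give a slope near –1.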

Mandelbrot generalized Zipf’s law to account for all types of scaling phenomena in nature, as follows:

                                                $P(f) \sim 1/bf^{n}$

where b is an arbitrary real constant.

Zipf was independently wealthy; it is believed that he published his last book with his own money.  Since electronic computers were unavailable at the time, he collected data by hiring human "computers" to count words in newspapers, books, and periodicals for numerous days at a time [11].

 

Assignment: 

Data collection for Zipf distributions is child’s play with Java and modern computers.  Let’s use our knowledge of ADTs and Java to automate it. 

Implement the TextAnalyzer ADT in Java.  This ADT encapsulates a list of items stored using array(s).  Each item consists of a word and a count.  The list contains unique words sorted lexicographically (case insensitive).  Adding a duplicate word simply increments the corresponding count.  Deleting a word decrements the corresponding count.  Deleting the last instance of a word completely removes that word (and its count) from the list.
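
The behavior described above can be illustrated with a minimal sketch that keeps two parallel arrays in sorted order.  All names here (TextAnalyzerSketch, add, delete, countOf, wordAt, size) are assumptions made for illustration; they are not the required TextAnalyzer API, and binary search is just one possible way to locate a word.

```java
import java.util.Arrays;

/** Illustrative sketch only; method names are assumptions, not the assigned API. */
class TextAnalyzerSketch {
    private String[] words = new String[8];   // unique words, sorted ignoring case
    private int[] counts = new int[8];        // counts[i] belongs to words[i]
    private int size = 0;

    /** Binary search: index of the word if present, else -(insertionPoint) - 1. */
    private int find(String word) {
        int lo = 0, hi = size - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = words[mid].compareToIgnoreCase(word);
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -lo - 1;
    }

    /** Adds a word; a duplicate only increments its count. */
    void add(String word) {
        int i = find(word);
        if (i >= 0) { counts[i]++; return; }
        int at = -(i + 1);
        if (size == words.length) {            // grow both arrays together
            words = Arrays.copyOf(words, size * 2);
            counts = Arrays.copyOf(counts, size * 2);
        }
        // Shift the tail right by one slot to keep the list sorted
        System.arraycopy(words, at, words, at + 1, size - at);
        System.arraycopy(counts, at, counts, at + 1, size - at);
        words[at] = word;
        counts[at] = 1;
        size++;
    }

    /** Decrements a word's count; removes the entry when the count reaches 0. */
    boolean delete(String word) {
        int i = find(word);
        if (i < 0) return false;
        if (--counts[i] == 0) {
            // Shift the tail left by one slot to close the gap
            System.arraycopy(words, i + 1, words, i, size - i - 1);
            System.arraycopy(counts, i + 1, counts, i, size - i - 1);
            size--;
            words[size] = null;
        }
        return true;
    }

    int countOf(String word) { int i = find(word); return i >= 0 ? counts[i] : 0; }
    String wordAt(int index) { return words[index]; }
    int size() { return size; }
}
```

In this sketch the first spelling seen for a word is the one that is stored; whether to normalize case on insertion is a design choice left open here.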

TextAnalyzer has the following API:

 

In addition to the above, ADT TextAnalyzer provides the following operations:

 

To calculate the Zipfian distribution use ZipfStatistics.class.  Download it and save it in the same directory as your source file.  It contains the following methods:

ZipfStatistics.class is made available under a Creative Commons License.  It was developed by Chris Wagner, Charles McCormick, and Bill Manaris.

 

Experiment:

To validate your work, perform the following experiment.

Esperanto is an “artificial” language developed by linguists to be simple, regular and thus easy to learn and use.  Unlike natural languages (such as English, German, French, Italian, Vietnamese, and Greek), which have naturally evolved over thousands of years, Esperanto has been around for only 120 years or so.

It has been suggested that, due to its artificial regularity, Esperanto may feel unnatural to humans (at a subconscious level).

 

Hypothesis: Esperanto does not exhibit the scaling properties normally found in natural languages.

 

  1. You should write a set of programs to test the above hypothesis.   These programs should use ADT TextAnalyzer.  They should also use Scanner to read in a text file (an e-book).

·        TextAnalyzer.java contains the TextAnalyzer ADT.

·        CharacterAnalysis.java should output the Zipf slope and R² value for characters found in a textbook.

·        WordAnalysis.java should output the Zipf slope and R² value for words found in a textbook.

·        WordLengthAnalysis.java should output the Zipf slope and R² value for word lengths (measured in number of characters) found in a textbook.

·        (bonus) WordDistanceAnalysis.java should output the Zipf slope and R² value for the distance between word repetitions (measured in number of intermediate words) found in a textbook.

·        (bonus) SentenceLengthAnalysis.java should output the Zipf slope and R² value for sentence lengths (measured in number of characters) found in a textbook.

·        (bonus) up to three more programs (metrics) of your own choosing.
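
As a rough illustration of the Scanner-based reading step, the sketch below tallies word lengths from whitespace-separated tokens.  The class name WordLengthSketch, the method lengthCounts, the maxLength cap, and the rule that strips non-letter characters are all assumptions made for this example; your own programs are expected to use ADT TextAnalyzer rather than a plain array.

```java
import java.util.Scanner;

/** Illustrative sketch only; names and the tokenizing rule are assumptions. */
class WordLengthSketch {

    /**
     * Counts how many words of each length the scanner yields.
     * Index i of the result holds the number of words of length i;
     * words longer than maxLength are counted at index maxLength.
     */
    static int[] lengthCounts(Scanner in, int maxLength) {
        int[] counts = new int[maxLength + 1];
        while (in.hasNext()) {
            // Assumption: a "word" is a token with all non-letters removed
            String word = in.next().replaceAll("[^\\p{L}]", "");
            if (word.isEmpty()) {
                continue;                      // token had no letters at all
            }
            counts[Math.min(word.length(), maxLength)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // A Scanner over a String keeps the example self-contained
        Scanner in = new Scanner("The quick brown fox, the lazy dog.");
        int[] counts = lengthCounts(in, 20);
        for (int len = 1; len < counts.length; len++) {
            if (counts[len] > 0) {
                System.out.println(len + " letters: " + counts[len]);
            }
        }
    }
}
```

To read an e-book instead, the Scanner can be constructed from a java.io.File, e.g. new Scanner(new File(fileName)), which throws FileNotFoundException if the file does not exist.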

 

  2. Write a report on your methodology (include which metrics you tried on which books and in which language combinations, etc.), your results (include a table with the slope and R² values generated, etc.), and your discussion/conclusion (e.g., does your data support the hypothesis or not?).  Include a references section with URLs of e-books used in your experiment.

·        EsperantoStudy.doc contains your experiment report (MS Word format).

 

Documentation:

See first assignment.  Also, you should submit javadoc-generated API documentation in HTML for your ADT.  Your code should be fully documented.

 

Submission:

Two options (use either one – the effect is the same):

  1. Open your BlueJ project.  Under the Project menu, click Create Jar File… .  In the dialog box that opens, select Include Source, and press Continue.  Email the generated .jar file to manaris@cs.cofc.edu, by the due date and time.
  2. (This option is available in BlueJ 2.0 and above)  Save submission.defs into your BlueJ project directory.  Open your BlueJ project.  Under the Tools menu, click Submit… .  In the dialog box that opens, select scheme CSCI 221/hmwk3 and press Submit.  (You may have to specify your email information.)

 

Notes:

  1. You should modularize and document your code thoroughly. Your methods should be fully documented, i.e., purpose, and pre/postconditions.  Each Java file should have a certificate of authenticity, as per first homework.
  2. Your report should include your name (etc.) and section headings to improve readability.

 

References:

1.      Zipf, G.K. (1949).  Human Behavior and the Principle of Least Effort, Addison-Wesley Press, New York.

2.      Mandelbrot, B. (1977). Fractal Geometry of Nature, W.H. Freeman and Company, New York.

3.      Voss, R.F., and Clarke, J. (1975).  “1/f Noise in Music and Speech”, Nature, vol. 258, pp. 317-318.

4.      Bogomolny, A. (on-line). “Benford's Law and Zipf's Law”, accessed March 21, 2005.

5.      Li, W. (on-line). “Zipf’s Law”, accessed March 22, 2005.

6.      Salingaros, N.A., and B.J. West. (1999).  “A Universal Rule for the Distribution of Sizes”, Environment and Planning B: Planning and Design,  vol. 26, pp. 909-923.

7.      Schroeder, M. (1991).  Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, New York: W. H. Freeman and Company.

8.      Adamic, L.A. (on-line). “Zipf, Power-laws, and Pareto - a Ranking Tutorial”, accessed March 22, 2005.

9.      Spehar, B., C.W.G. Clifford, B.R. Newell, and R.P. Taylor.  (2003).  “Universal Aesthetic of Fractals”, Computers & Graphics, vol. 27, pp. 813-820.

10.  Nielsen, J. (on-line). “Zipf Curves and Website Popularity”, accessed March 22, 2005.

11.  Wallace, R.S. (on-line). “Zipf’s Law”, accessed March 22, 2005.