Bill Manaris : Spring 2009 / CSCI 470 Homework 3

Assigned Date: Tuesday, Mar. 17, 2009
Due Date: Friday, Mar. 27
Due Time: 11:55pm

Last modified on April 01, 2009, at 05:03 PM (see updates)

Purpose

This assignment focuses on:

This is a pair-programming assignment (i.e., you may work with a partner). You may discuss the assignment only with your partner or the instructor.

Assignment

Write an intelligent agent that searches a website (e.g., the NY Times or the Washington Post) to harvest information of interest. The agent will be given an amount of time to search this website for webpages (target pages) that contain specific keywords. When the allotted time is over (or when the complete website has been searched, whichever comes first), the agent will output a histogram of the words found on the target pages (if any).

Input

The agent should prompt for the following (on separate lines):

Output

Upon completion of the allotted time (or having exhaustively searched the website), the agent outputs:

Details

The agent should not revisit a page (i.e., should skip already visited URLs). The agent should contain its search within the site (i.e., should skip external links). Keywords are conjunctive (never "OR", always "AND").
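
The three rules above could be sketched as follows. This is only an illustration, written for Python 3; the function names (is_internal, contains_all_keywords, should_visit) are hypothetical and not part of the assignment. It assumes that "within the site" means "same host name" and that a keyword matches as a case-insensitive substring, so "president" also matches "presidential".

```python
from urllib.parse import urldefrag, urlparse


def is_internal(url, site_root):
    """Return True when url is on the same host as site_root (assumed rule)."""
    return urlparse(url).netloc.lower() == urlparse(site_root).netloc.lower()


def contains_all_keywords(text, keywords):
    """Conjunctive match: every keyword must appear in the text."""
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def should_visit(url, site_root, visited):
    """Skip external links and pages already seen; record new ones."""
    url, _fragment = urldefrag(url)  # treat page.html#top as page.html
    if url in visited or not is_internal(url, site_root):
        return False
    visited.add(url)
    return True
```

Note that comparing host names exactly treats www.nytimes.com and nytimes.com as different sites; whether that is acceptable is a design decision for your agent.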

See sample Python code for processing webpages, MagnatuneDownloader.py.
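
As a rough companion to that sample (and not a reproduction of MagnatuneDownloader.py), here is one way to pull links out of a page using only the Python 3 standard library. The class name LinkExtractor and the function extract_links are hypothetical, and the sketch assumes the page's HTML has already been downloaded into a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs linked from the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

For example, a link written as "/world/" on http://www.nytimes.com/index.html resolves to http://www.nytimes.com/world/.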

For information on how to time your code, see function time() in Python module time.
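
A minimal sketch of a time-limited loop built on time.time() follows. The helper run_until_deadline and its step argument are hypothetical; step stands in for "process one page" and is assumed to return False once there is nothing left to search.

```python
import time


def run_until_deadline(step, seconds):
    """Call step() until it returns False or the time budget is spent.

    Returns the number of steps that completed successfully.
    """
    deadline = time.time() + seconds
    steps = 0
    while time.time() < deadline:
        if not step():  # nothing left to search
            break
        steps += 1
    return steps
```

The deadline is checked only between steps, so a single slow page download can still overrun the budget slightly.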

Experiment with at least three different uninformed search algorithms (e.g., depth-first, breadth-first, iterative deepening) to see which one performs best.
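
To illustrate how two of these strategies differ only in which end of the frontier they take from, here is a small sketch that runs on an in-memory link graph rather than a live website. The function crawl_order and the neighbors callable are hypothetical names; in a real agent, neighbors would download a page and return its internal links. Iterative deepening is not shown.

```python
from collections import deque


def crawl_order(start, neighbors, strategy="bfs", max_pages=None):
    """Return pages in the order they are expanded.

    neighbors is a (hypothetical) callable mapping a page to its links.
    """
    frontier = deque([start])
    visited = {start}
    order = []
    while frontier and (max_pages is None or len(order) < max_pages):
        # BFS takes the oldest entry (queue); DFS takes the newest (stack).
        page = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(page)
        for link in neighbors(page):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order


# A toy site: each page maps to the pages it links to.
site = {
    "home": ["news", "sports"],
    "news": ["obama", "bush"],
    "sports": ["scores"],
    "obama": [], "bush": [], "scores": [],
}
print(crawl_order("home", site.get, "bfs"))
print(crawl_order("home", site.get, "dfs"))
```

On this toy site, breadth-first expands both section pages before any article, while the stack-based depth-first variant follows the most recently discovered link first.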

When constructing the histogram ignore common (stop) words.
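
One possible way to build such a histogram is sketched below. The stop-word set is a tiny illustrative sample, not an official list, and the function name word_histogram is hypothetical. The sketch assumes the text has already been stripped of HTML markup.

```python
import re
from collections import Counter

# Illustrative only; a real stop-word list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def word_histogram(text, stop_words=STOP_WORDS):
    """Count each word in text, case-insensitively, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)
```

Histograms from several target pages can be combined by adding the Counter objects together.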

Documentation

All programs that you complete in your career as a student and as a professional developer should be fully documented. Obviously, you should comment any variable, obscure statement, block of code, method, and class you create. Your comments should express why something is being done, as opposed to how; the how is shown by the code. Also, include opening comments as specified in the previous assignment.

Submission

  1. Submit a README.txt file. Include your names, class, homework assignment, date. Also discuss the following:
    1. What should a search node contain? (E.g., URL of page, etc.)
    2. What is the goal of the search? (Think about this carefully.)
    3. Order the three algorithms you tried in decreasing order of preference (performance). Justify your answer with statistics from trial runs.
    4. Limitations of your program (e.g., things you didn't complete, or things you could do better if you had more time).
  2. Your source code, webcrawler.py, fully documented and tested.
  3. Four word clouds from http://www.wordle.net, nytimesObama.png, nytimesBush.png, washingtonPostObama.png, and washingtonPostBush.png. To generate them:
    • Search the NY Times and the Washington Post using (a) keywords "president", "Obama"; and (b) keywords "president", "Bush".
    • Convert each histogram to a list of repeated words (feel free to adapt wordCloud.py).
    • Upload the list of repeated words to http://www.wordle.net.
      • Use font "Duality", layout "Half and Half", and color "Blue meets Orange".
      • Save a screenshot of the generated word cloud as a PNG.
      • For example, the original page shows the word cloud for this assignment here (image not reproduced).
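
For the histogram-to-word-list step above, a minimal sketch is given here. It is not the contents of wordCloud.py; the function name repeated_words is hypothetical, and it assumes the histogram is a dictionary mapping each word to its count.

```python
def repeated_words(histogram):
    """Expand a word-count mapping into one space-separated string.

    Each word appears as many times as its count, in alphabetical order.
    """
    words = []
    for word, count in sorted(histogram.items()):
        words.extend([word] * count)
    return " ".join(words)
```

The resulting string can be pasted into a word cloud generator that sizes each word by how often it is repeated.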

Grading

Your grade will be based on how well you followed the above instructions, and the depth/quality of your work.
