
CSCI 470 Homework 3

Assigned Date: Thursday, Mar. 4, 2010
Due Date: Friday, Mar. 26
Due Time: 11:55pm


Purpose

This assignment focuses on:

  • uninformed search;
  • selecting an appropriate search strategy;
  • Internet intelligent agents (web crawlers); and
  • information visualization.

This is a pair-programming assignment (i.e., you may work with a partner). You may discuss the assignment only with your partner or the instructor.

Assignment

Write an intelligent agent that searches a website (e.g., the NY Times, the Washington Post, etc.) to harvest information of interest.

The agent will be given a fixed amount of time to search this website for webpages (target pages) containing specific keywords. When the allotted time is over (or when the complete website has been searched, whichever comes first), the agent will make available a histogram of the words found on the target pages (if any).

Agents

The attached code, agentsModel.py, is adapted from code provided with Russell and Norvig's "AI: A Modern Approach". It implements agents and environments (chapter 2). Read it and understand how it models various AI agents and environments. (This code uses utils.py. Save them both in the same directory.)

Specifications

Adapt the provided code to the given problem. Remove ALL unnecessary code.

Define an Environment subclass, called WebEnvironment, suited to the problem.

Define three Agent subclasses (webcrawlers) suited to the problem, each with a different search strategy. Each agent should remember (a minimal sketch follows this list):

  • which pages it has visited (to avoid loops)
  • which pages it knows about and plans to visit
  • how much time it has used (to decide when to stop - see function is_alive())
  • anything else it needs
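
For concreteness, here is a minimal sketch of that bookkeeping. The class and attribute names (other than is_alive()) are illustrative assumptions; a real agent should subclass Agent from agentsModel.py:

    class CrawlerAgent:
        """Illustrative skeleton; a real agent should subclass Agent
        from agentsModel.py."""

        def __init__(self, keywords, time_limit):
            self.keywords = keywords      # conjunctive search keywords
            self.time_limit = time_limit  # total seconds allotted
            self.time_used = 0.0          # seconds consumed so far
            self.visited = set()          # URLs already visited (to avoid loops)
            self.frontier = []            # URLs known about but not yet visited
            self.performance = 0          # +1 point per target page found

        def is_alive(self):
            """False once the time budget is exhausted."""
            return self.time_used < self.time_limit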

Input

The environment, when instantiated, should prompt the user for the following (on separate lines):

  • a target website (e.g., http://nytimes.com)
  • search keywords (e.g., Jamaica bobsled)
  • time limit in seconds (e.g., 120.0)

Then, the environment will instantiate three agents (see above) and have them perform the assigned task, each in its own way. Each agent is allotted the complete available time. Each agent keeps track of the time it has used; when that time elapses, the agent returns False via function is_alive() (when asked). A minimal sketch of this flow follows.
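
In the sketch below, everything except is_alive() is an assumed name (in particular fetch() and the program attribute), and input()/float() are used for the prompts:

    class WebEnvironment:
        """Illustrative skeleton; a real solution should subclass
        Environment from agentsModel.py."""

        def __init__(self):
            # prompt for the three inputs, on separate lines
            self.site = input("Target website: ")                      # e.g., http://nytimes.com
            self.keywords = input("Search keywords: ").split()         # e.g., Jamaica bobsled
            self.time_limit = float(input("Time limit in seconds: "))  # e.g., 120.0

        def run(self, agents):
            # each agent is allotted the complete available time
            for agent in agents:
                # the first percept is the target website's first page
                percept = self.fetch(self.site)
                # a full solution should also stop when the agent has no URLs left
                while agent.is_alive():
                    url = agent.program(percept)  # the agent's action: next URL to visit
                    percept = self.fetch(url)     # the environment builds the next percept

        def fetch(self, url):
            raise NotImplementedError  # open url and return its HTML (next section)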

Percepts and Actions

An agent's percept is constructed by the environment. It contains the HTML contents of a page (initially, the target website's first page).

An agent's action is a URL (meaning, "this is the webpage I want to visit next"). The environment opens this URL and constructs the agent's next percept.
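
One way the environment might open a URL and construct the HTML percept, sketched with the standard library (the helper name is hypothetical, and error handling is omitted):

    import urllib.request

    def fetch_html(url):
        """Return the HTML contents of url as a string (a hypothetical helper)."""
        with urllib.request.urlopen(url) as page:
            return page.read().decode("utf-8", errors="replace")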

Performance Measure

Each agent should update its performance measure based on how many target pages it has discovered (+1 point per page).

Output

Upon completion of the task, the environment should output for each agent:

  • number of overall pages visited (e.g., 5342 pages searched)
  • number of target pages found (e.g., 150 pages found containing "Jamaica" and "bobsled") - same as agent's performance measure
  • percent of target to overall pages (e.g., 2.81% target pages found)
  • the 50 most popular words (in reverse order, along with their cumulative frequency of occurrence across all target pages) - if any (it's possible that no target pages were found; see the sketch after this list)
  • URLs of all target pages found
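
Here is one hedged sketch of the word-histogram bookkeeping behind that output, assuming a collections.Counter histogram and a deliberately tiny stop-word set (see also the stop-word note under Details):

    import re
    from collections import Counter

    # an illustrative (deliberately tiny) stop-word set; extend as needed
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    histogram = Counter()  # one shared histogram across all target pages

    def update_histogram(histogram, html_text):
        """Fold one target page's words into a running Counter,
        ignoring common (stop) words."""
        words = re.findall(r"[a-z']+", html_text.lower())
        histogram.update(w for w in words if w not in STOP_WORDS)

    def top_words(histogram, n=50):
        """The n most popular words, least frequent first ("reverse order")."""
        return list(reversed(histogram.most_common(n)))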

Details

Agents should not revisit a page (i.e., should skip already visited URLs). Agents should confine their search to the site (i.e., should skip external links).
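
Both rules reduce to a simple membership test. A sketch using the standard urllib.parse module (the helper name and signature are assumptions):

    from urllib.parse import urljoin, urlparse

    def should_visit(link, base_url, visited):
        """Follow a link only if it resolves to an unvisited page on the same site."""
        url = urljoin(base_url, link)  # resolve relative links against the current page
        same_site = urlparse(url).netloc == urlparse(base_url).netloc
        return same_site and url not in visited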

Keywords are conjunctive (never "OR", always "AND"). They should be given to an agent when it is instantiated; i.e., the agent is "hardwired" to look only for those keywords.
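
The conjunctive test itself can be a one-liner. A sketch, assuming case-insensitive matching (the helper name is illustrative):

    def is_target_page(html_text, keywords):
        """True only if ALL keywords occur in the page (conjunctive match)."""
        text = html_text.lower()
        return all(kw.lower() in text for kw in keywords)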

See sample Python code for processing webpages, MagnatuneDownloader.py.

Another possibility is BeautifulSoup - a simple, yet powerful Python HTML/XML parser designed for quick turnaround projects.
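
For instance, link extraction with BeautifulSoup might look like the following, sketched against the current bs4 API (older releases spell the method findAll):

    from bs4 import BeautifulSoup

    def extract_links(html_text):
        """Return the href values of all anchor tags on the page."""
        soup = BeautifulSoup(html_text, "html.parser")
        return [a["href"] for a in soup.find_all("a", href=True)]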

For information on how to time your code, see function time() in Python module time.
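
The basic pattern is to difference two readings of time.time() and accumulate the result:

    import time

    time_used = 0.0
    start = time.time()
    # ... fetch and process one page ...
    time_used += time.time() - start  # is_alive() compares this total to the limit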

Experiment with at least three different uninformed search algorithms (e.g., depth-first, breadth-first, iterative deepening, etc.) to see which one performs better.
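
The three strategies can differ solely in how the frontier of known-but-unvisited URLs is managed. A sketch using collections.deque (the URLs and the iterative-deepening bookkeeping are illustrative):

    from collections import deque

    frontier = deque(["http://site/a", "http://site/b"])
    frontier.append("http://site/c")

    # depth-first: treat the frontier as a stack (LIFO)
    next_url = frontier.pop()      # takes the most recently discovered URL

    # breadth-first: treat the frontier as a queue (FIFO)
    next_url = frontier.popleft()  # takes the oldest discovered URL

    # iterative deepening: run depth-limited depth-first searches with
    # limits 1, 2, 3, ...; store (url, depth) pairs in the frontier and
    # skip entries beyond the current limit, restarting until time runs out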

When updating the histogram, ignore common (stop) words.

Avoid global variables. Avoid side-effects. (Of course, this excludes access to an object's own instance and class variables.)

Documentation

See previous assignment.

Submission

  1. Submit a README.doc file. Include your names, class, homework assignment, date. Also discuss the following:
    1. What should a search node contain? (E.g., URL of page, etc.)
    2. What is the goal of the search? (Think about this carefully.)
    3. Order the three algorithms you tried in decreasing order of preference (performance). Justify your answer with statistics from trial runs.
    4. Which agents (search strategies) you used for each of the two word clouds.
    5. Limitations of your program (e.g., things you didn't complete, or things you could do better if you had more time).
  2. Your source code, webcrawler.py, fully documented and tested.
  3. Two word clouds from http://www.wordle.net, nytimesHealthCare.png and washingtonPostHealthCare.png. To generate them:
    • Search the NY Times and the Washington Post using keywords "health" and "care".
    • Convert each histogram to a list of repeated words (feel free to adapt wordCloud.py; see the sketch after this list).
    • Upload the list of repeated words to http://www.wordle.net.
      • Use font "Lucida Sans", layout "Horizontal", and color "Ghostly".
      • Save a screenshot of the generated word cloud as a PNG.
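For the repeated-words conversion in step 3 above, one possibility is the following sketch (wordCloud.py presumably does something similar):

    def repeated_words(histogram):
        """Render {'health': 3, 'care': 2} as
        'health health health care care' - the plain-text input Wordle accepts."""
        return " ".join(word for word, count in histogram.items()
                        for _ in range(count))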

Grading

Your grade will be based on the elegance of your design, the quality of your documentation, and how well you followed the above instructions. There are many ways to achieve the above functionality; aim for simplicity and elegance. Design, design, design... before you implement.