Assigned Date: Thursday, Mar. 4, 2010
Due Date: Friday, Mar. 26
Due Time: 11:55pm
Last modified on March 25, 2010, at 01:47 PM (see updates)
This assignment focuses on:
- uninformed search;
- selecting an appropriate search strategy;
- Internet intelligent agents (web crawlers); and
- information visualization
This is a pair-programming assignment (i.e., you may work with a partner). You may discuss the assignment only with your partner or the instructor.
Write an intelligent agent that searches a website (e.g., the NY Times, the Washington Post, etc.) to harvest information of interest.
The agent will be given an amount of time to search this website for webpages (target pages) containing specific keywords. When the time allotted is over (or when the complete website is searched, whichever comes first), the agent will make available a histogram of words found on the target pages (if any).
The attached code, agentsModel.py, is adapted from code provided with Russell and Norvig, "AI - A Modern Approach". It implement Agents and Environments (chapter 2). Read it and understand how it models various AI agents and environments. (This code uses utils.py. Save them both in the same directory.)
Adapt the provided code to the given problem. Remove ALL unnecessary code.
Environment subclass, called
WebEnvironment, suited to the problem.
Agent subclasses (webcrawlers) suited to the problem, each with a different search strategy. Each agent should remember:
- which pages it has visited (to avoid loops)
- which pages it knows about and plans to visit
- how much time it has used (to decide when to stop - see function
- anything else it needs
The environment, when instantiated, should prompt the user for the following (on separate lines):
- a target website (e.g., http://nytimes.com)
- search keywords (e.g., Jamaica bobsled)
- time limit in seconds (e.g., 120.0)
Then, the environment will instantiate three agents (see above) and have them perform the assigned task each in their own way. Each agent is allotted the complete available time. Each agent keeps track of the time it has used. When the time elapses, the agent returns
False via function
is_alive() (when asked).
Percepts and Actions
An agent's percept is constructed by the environment. It contains the HTML contents of a page (initially, the target website's first page).
An agent's action is a URL (meaning, "this is the webpage I want to visit next"). The environment opens this URL and constructs the agent's next percept.
Each agent should update its performance measure based on how many target pages it has discovered (+1 point per page).
Upon completion of the task, the environment should output for each agent:
- number of overall pages visited (e.g., 5342 pages searched)
- number of target pages found (e.g., 150 pages found containing "Jamaica" and "bobsled") - same as agent's performance measure
- percent of target to overall pages (e.g., 2.81% target pages found)
- the 50 most popular words (in reverse order, along with their cumulative frequency of occurrence across all target pages) - if any (it's possible that no target pages were found)
- URLs of all target pages found
Agents should not revisit a page (i.e., should skip already visited URLs). Agents should contain their search within the site (i.e., should skip external links).
Keywords are conjunctive (never "OR", always "AND"). They should be given to an agent when instantiated. I.e., an agent is "hardwired" to look only for those keywords.
See sample Python code for processing webpages, MagnatuneDownloader.py.
Another possibility is BeautifulSoup - a simple, yet powerful Python HTML/XML parser designed for quick turnaround projects.
For information on how to time your code, see function
time() in Python module time.
Experiment with at least three different uninformed search algorithms (e.g., depth-first, breadth-first, iterative deepening, etc.) to see which one performs better.
When updating the histogram ignore common (stop) words.
Avoid global variables. Avoid side-effects. (Of course, this excludes access to an object's own instance and class variables.)
See previous assignment.
- Submit a README.doc file. Include your names, class, homework assignment, date. Also discuss the following:
- What should a search node contain? (E.g., URL of page, etc.)
- What is the goal of the search ? (Think about this carefully.)
- Order the three algorithms you tried in decreasing order of preference (performance). Justify your answer with statistics from try runs.
- Which agents (search strategies) you used for each of the two word clouds.
- Limitations of your program (e.g., things you didn't complete, or things you could do better if you had more time).
- Your source code, webcrawler.py, fully documented and tested.
- Two word clouds from http://www.wordle.net, nytimesHealthCare.png and washingtonPostHealthCare.png. To generate them:
- Search NY Times, and the Washington Post using keywords "health" and "care".
- Convert each histogram to a list of repeated words (feel free to adapt wordCloud.py).
- Upload the list of repeated words to http://www.wordle.net.
- Use font "Lucida Sans", layout "Horizontal", and color "Ghostly".
- Save a screenshot of the generated word cloud as a PNG.
- For example, here is the word cloud for this assignment:
Your grade will be based on the elegance of your design, the quality of your documentation, and how well you followed the above instructions. In particular, there are many ways to achieve the above functionality. Aim for simplicity and elegance. Design, design, design... before you implement.