Assigned Date: Tuesday, Mar. 17, 2009
Due Date: Friday, Mar. 27
Due Time: 11:55pm
Last modified on April 01, 2009, at 05:03 PM
This assignment focuses on:
- uninformed search;
- selecting an appropriate search strategy;
- Internet intelligent agents (web crawlers); and
- information visualization.
This is a pair-programming assignment (i.e., you may work with a partner). You may discuss the assignment only with your partner or the instructor.
Write an intelligent agent that searches a website (e.g., the NY Times or the Washington Post) to harvest information of interest. The agent is given an amount of time to search for webpages (target pages) within this website that contain specific keywords. When the allotted time is over (or when the complete website has been searched, whichever comes first), the agent outputs a histogram of words found on the target pages (if any).
The agent should prompt for the following (on separate lines):
- a target website (e.g., "http://nytimes.com")
- search keywords (e.g., "Jamaica bobsled")
- time limit (e.g., "120.0" in seconds)
Upon completion of the allotted time (or having exhaustively searched the website), the agent outputs:
- search statistics, including:
  - number of overall pages visited (e.g., '5342 pages searched')
  - number of target pages found (e.g., '150 pages found containing "Jamaica" and "bobsled"')
  - percent of target to overall pages (e.g., '2.81% target pages found')
  - overall search time (in seconds)
  - average time per page (in seconds)
- the 50 most popular words (in reverse order, along with their cumulative frequency of occurrence across all target pages), if any (it is possible that no target pages were found)
- URLs of all target pages found
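As a sketch of the statistics output, the helper below computes the counts, percentage, and timing figures in the sample formats above. The function name `report_statistics` and its parameters are our own assumptions, not part of any provided starter code:

```python
def report_statistics(visited, targets, elapsed, keywords):
    """Print the required search statistics.
    visited/targets are collections of URLs; elapsed is in seconds."""
    n, k = len(visited), len(targets)
    print("%d pages searched" % n)
    quoted = " and ".join('"%s"' % w for w in keywords)
    print("%d pages found containing %s" % (k, quoted))
    # Guard against division by zero when no pages were visited.
    print("%.2f%% target pages found" % (100.0 * k / n if n else 0.0))
    print("%.2f seconds overall search time" % elapsed)
    print("%.4f seconds per page (average)" % (elapsed / n if n else 0.0))
```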
The agent should not revisit a page (i.e., should skip already visited URLs). The agent should contain its search within the site (i.e., should skip external links). Keywords are conjunctive (never "OR", always "AND").
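The three constraints above can be sketched as two small predicates. The helper names (`resolve_internal`, `is_target`) are our own, and the URL handling assumes Python 3's `urllib.parse`:

```python
from urllib.parse import urljoin, urlparse

def resolve_internal(link, base_url, visited):
    """Return the absolute URL if the link stays on the target site and
    has not been visited; otherwise return None (i.e., skip it)."""
    url = urljoin(base_url, link).split("#")[0]   # resolve, drop fragment
    same_site = urlparse(url).netloc == urlparse(base_url).netloc
    return url if same_site and url not in visited else None

def is_target(text, keywords):
    """Conjunctive match: every keyword must appear (case-insensitive)."""
    low = text.lower()
    return all(k.lower() in low for k in keywords)
```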
See sample Python code for processing webpages, MagnatuneDownloader.py.
For information on how to time your code, see the function time() in the Python module time.
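A minimal sketch of a deadline-bounded search loop built on time.time(); the function name, frontier representation, and `expand` callback are our own assumptions:

```python
import time

def search_with_deadline(frontier, limit_seconds, expand):
    """Process frontier items until it is empty or the time limit passes.
    time.time() returns seconds since the epoch as a float, so the
    difference from `start` is the elapsed wall-clock time."""
    start = time.time()
    results = []
    while frontier and time.time() - start < limit_seconds:
        results.append(expand(frontier.pop(0)))
    return results, time.time() - start
```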
Experiment with at least three different uninformed search algorithms (e.g., depth-first, breadth-first, iterative deepening, etc.) to see which one performs better.
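Since these strategies differ mainly in their frontier discipline, one generic loop can illustrate two of them; this sketch runs on a dict-based link graph rather than real webpages, and iterative deepening would simply rerun the depth-limited DFS with an increasing max_depth:

```python
from collections import deque

def crawl_order(start, neighbors, strategy="bfs", max_depth=None):
    """Uninformed search over a link graph given as {node: [successors]}.
    'bfs' treats the frontier as a FIFO queue; 'dfs' treats it as a
    LIFO stack. max_depth gives the depth limit used by iterative
    deepening. Returns nodes in the order they are expanded."""
    order, visited = [], set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        if node in visited or (max_depth is not None and depth > max_depth):
            continue
        visited.add(node)
        order.append(node)
        for nxt in neighbors.get(node, []):
            frontier.append((nxt, depth + 1))
    return order
```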
When constructing the histogram ignore common (stop) words.
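One way to build the histogram, sketched with collections.Counter and a small illustrative stop-word list (a real run should use a much fuller list):

```python
import re
from collections import Counter

# Illustrative stop words only; substitute a complete list in practice.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def histogram(texts, top=50):
    """Cumulative word frequencies across all target pages, ignoring
    stop words; returns the `top` most common (word, count) pairs."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top)
```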
All programs that you complete in your career, as a student and as a professional developer, should be fully documented. Obviously, you should comment any variable, obscure statement, block of code, method, and class you create. Your comments should express why something is being done, as opposed to how; the how is shown by the code itself. Also, include opening comments as specified in the previous assignment.
- Submit a README.txt file. Include your names, class, homework assignment, and date. Also discuss the following:
  - What should a search node contain? (E.g., the URL of the page, etc.)
  - What is the goal of the search? (Think about this carefully.)
  - Order the three algorithms you tried in decreasing order of preference (performance). Justify your answer with statistics from trial runs.
  - Limitations of your program (e.g., things you didn't complete, or things you could do better if you had more time).
- Your source code, webcrawler.py, fully documented and tested.
- Four word clouds from http://www.wordle.net: nytimesObama.png, nytimesBush.png, washingtonPostObama.png, and washingtonPostBush.png. To generate them:
  - Search the NY Times and the Washington Post using (a) keywords "president", "Obama"; and (b) keywords "president", "Bush".
  - Convert each histogram to a list of repeated words (feel free to adapt wordCloud.py).
  - Upload the list of repeated words to http://www.wordle.net.
  - Use font "Duality", layout "Half and Half", and color "Blue meets Orange".
  - Save a screenshot of the generated word cloud as a PNG.
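Since wordCloud.py itself is not reproduced here, the conversion step can be sketched as follows; the function name `repeated_words` is our own, not the provided script's:

```python
def repeated_words(histogram):
    """Expand (word, count) pairs into flat text in which each word
    appears `count` times, the input format wordle.net accepts."""
    return " ".join(" ".join([word] * count)
                    for word, count in histogram)
```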
For example, a word cloud generated for this assignment appeared here (image omitted).
Your grade will be based on how well you followed the above instructions, and the depth/quality of your work.