CISC 322 Assignment 1

Architecture of a Search Engine: the First Steps

Due Monday, Oct 3, 2005 by 3:00 PM

Synopsis

In this assignment, you will architect and implement part of an internet search engine. This assignment will give you experience with creating UML architectural diagrams from the class perspective, and with translating these diagrams into code.

Background

Over the next assignments, you will design and implement a web-based search engine, similar in concept to Google. The system will allow users to type in a search term, and will return the URLs of a set of pages matching that term.

The system will be composed of two parts:

  • A spider engine, which searches the web for pages and indexes them. The spider engine creates a dictionary that can be used to look up which words appear on which pages. The spider engine can be run periodically, say once per day, to update the database.
  • A search engine, which allows users to type a search term into a web browser, and retrieves a list of pages matching the search term. The search engine uses the data found by the spider engine.

In this assignment, you will be creating a first design and implementation of the spider engine. In future assignments, you will refine your spider engine and design and implement the search engine.

Architecture

[Figure: searchEngineDeployment.png]

The above figure shows the deployment architecture of the search engine. The diagram provides a high-level overview of the system, without giving details on how it is actually implemented.

The Spider Engine works through a set of web pages, indexing them for use by the search engine. Data about the pages is stored in a Words Database. The spider engine is run periodically. It does not have to be run each time an end user makes a web search.

End users perform web searches via a web site. Users type in a search query and are given a set of pages that match the query. For example, the query "322 assignment" might return a list of all web pages that contain the words 322 and assignment. The search engine uses the data in the words database to compute the results of the search.

In this assignment, you will be working on the spider engine. The figure below shows a more detailed design view of the spider engine:

[Figure: SpiderComponent.png]

The spider engine works as follows (a sketch of this loop appears after the list):

  • The spider engine is given a URL specifying the starting point for indexing. This page is added to the page queue.
  • The page queue is the set of URLs that the spider engine knows about. The spider engine will visit and index each of these pages.
  • For each page in the page queue, the spider engine retrieves information about the page using the page info component. The page info component returns the list of words appearing on the web page, and the set of URLs that the web page references.
    • The words are added to the word database.
    • The URLs are added to the page queue. Note that if the same page is added to the page queue twice, it should only be visited once.
  • The word database implements a mapping from words to URLs. For example, if the word 322 has been found at URLs u1 and u2, looking up 322 should return the set {u1, u2}.
    • The word database is implemented using a Persistent Dictionary; see below.
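
As a rough illustration, here is a minimal sketch of this loop. The PageInfo interface, its wordsOn and linksOn operations, and the in-memory HashMap standing in for the word database are all made up for illustration; the actual interfaces of the Page Info component and the Persistent Dictionary-backed word database are yours to design in Part 1.

import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class SpiderSketch
{
    // Stand-in for the Page Info component: given a URL, report the
    // words appearing on the page and the URLs the page references.
    public interface PageInfo
    {
        Set<String> wordsOn( String url );
        Set<String> linksOn( String url );
    }

    // Word database: maps each word to the set of URLs it appears on.
    // (The real word database is backed by the Persistent Dictionary.)
    private Map<String, Set<String>> wordDatabase =
        new HashMap<String, Set<String>>();

    public void crawl( String startUrl, PageInfo pageInfo )
    {
        Queue<String> pageQueue = new LinkedList<String>();
        Set<String> visited = new HashSet<String>();
        pageQueue.add( startUrl );

        while( !pageQueue.isEmpty() )
        {
            String url = pageQueue.remove();
            if( !visited.add( url ) )
            {
                continue;  // already visited; index each page only once
            }

            // Record each word found on the page
            for( String word : pageInfo.wordsOn( url ) )
            {
                Set<String> urls = wordDatabase.get( word );
                if( urls == null )
                {
                    urls = new HashSet<String>();
                    wordDatabase.put( word, urls );
                }
                urls.add( url );
            }

            // Queue every URL the page references for later visiting
            pageQueue.addAll( pageInfo.linksOn( url ) );
        }
    }
}

Keeping a separate visited set is what guarantees that a URL added to the page queue twice is still fetched only once.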

Persistent Dictionary

The persistent dictionary is a component provided to you to help with this assignment. Details on the persistent dictionary are here.

Safe Operation of this Program

The spider engine can misbehave if it is not run with care. In particular, the spider may fail to terminate if:

  • You fail to recognize that you are visiting the same pages repeatedly;
  • Your search of web pages is not bounded: there are billions of web pages on the Internet; you should not attempt to index them all.

The consequences of non-termination could be:

  • Overflowing the database
  • Flooding the CASLab network.

In order to help avoid these problems, use the following techniques:

(1) Between each fetch of a new web page from the network, insert a one-second pause. This will limit the rate at which data is requested should your program go out of control.

(2) Ensure that your spider never goes outside the CASLab domain. In your code that downloads web pages from the Internet, include a test that ensures that the URL to be downloaded includes the string "caslab.queensu.ca".

(3) Place a counter in your Page Info component that restricts the maximum number of pages that can be fetched. Restricting this to 10 pages will be adequate for initial testing; raising the number to 100 pages will work for later tests.

Experimental Testbed

A web server will be provided for experimental use. Details will be made available on the WebCT forum.

Coding Style

Your Java code must conform to the CISC 322 Java and JSP Programming Style Guide. Review and follow this guide.

Hints

(1) Java's URL class provides the functionality you need to download a web page from a web server. The openStream method retrieves the contents of the web page specified by the URL. (It actually returns an InputStream object that can be used to read the contents of the web page.) In order to meet the safety rules described above, implement your own SafeURL class that implements the counter and pause described above. The outline of this class is something like:

import java.net.MalformedURLException;
import java.net.URL;
import java.io.InputStream;

public class SafeURL
{
    // Shared across all SafeURL instances: total pages fetched so far,
    // and the cap on fetches (10 for initial testing, 100 for later tests)
    private static int numberOfPagesFetched = 0;
    private static int maxPagesFetched = 10;

    private URL u;

    SafeURL( String spec ) throws MalformedURLException
    {
        u = new URL( spec );
    }

    InputStream openStream() throws java.io.IOException
    {
        if( numberOfPagesFetched >= maxPagesFetched )
        {
            System.err.println( "Maximum number of page fetches reached" );
            return null;
        }
        else if( u.getHost().indexOf( "caslab.queensu.ca" ) == -1 )
        {
            // Ignore URLs outside of the CASLab domain
            System.err.println( "Warning: ignoring URL " + u );
            return null;
        }
        else
        {
            // Pause for one second between fetches to limit network load
            try
            {
                Thread.sleep( 1000 );
            }
            catch( InterruptedException ex )
            {
                System.err.println( "Attempt to sleep failed" );
                return null;
            }

            numberOfPagesFetched += 1;

            return u.openStream();
        }
    }
}
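
For example, a caller might use SafeURL along these lines (a hypothetical fragment: the URL is made up, and wrapping the stream in a BufferedReader is just one way to read the page line by line):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class SafeURLExample
{
    public static void main( String[] args ) throws Exception
    {
        // Hypothetical CASLab URL, for illustration only
        SafeURL page = new SafeURL( "http://www.caslab.queensu.ca/index.html" );

        InputStream in = page.openStream();
        if( in != null )
        {
            BufferedReader reader =
                new BufferedReader( new InputStreamReader( in ) );
            String line;
            while( ( line = reader.readLine() ) != null )
            {
                // each line of HTML would be handed to the Page Info parser
            }
            reader.close();
        }
    }
}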

(2) You should include only the words from the web page in your Word Database. HTML pages contain markup that should not be included. HTML tags include, among many others, <B>, <I> and <IMG>. You can remove these tags by ignoring all content that appears between < and > symbols. HTML special symbols should also be ignored; these include, for example, &lt;, which stands for a < symbol. HTML special symbols begin with an ampersand (&) and end with a semi-colon (;).
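
As a rough sketch of this filtering, the following uses a regular expression; a character-by-character scan would work equally well, and the pattern here assumes well-formed markup:

import java.util.regex.Pattern;

public class HtmlStripper
{
    // Matches an HTML tag (everything from < to the next >) or an HTML
    // special symbol (an ampersand through the following semi-colon).
    // Assumes well-formed markup; stray < or & characters will confuse it.
    private static final Pattern MARKUP =
        Pattern.compile( "<[^>]*>|&[A-Za-z#0-9]+;" );

    // Replace all markup with spaces, leaving only the page's words
    public static String strip( String html )
    {
        return MARKUP.matcher( html ).replaceAll( " " );
    }
}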

(3) In HTML, links to another page have the following form:

<A HREF="somewhere">link text</A>

where somewhere is a valid URL. Any capitalization is allowed, and the double-quote marks around the URL are optional.
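
A sketch of link extraction using java.util.regex is below. The pattern accepts any capitalization and optional double quotes, as described above, but makes simplifying assumptions about well-formed HTML; the class and method names are made up for illustration:

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor
{
    // Matches <A HREF=...> in any capitalization, with or without
    // double quotes around the URL. Group 1 captures the URL itself.
    private static final Pattern LINK = Pattern.compile(
        "<a\\s+href\\s*=\\s*\"?([^\"\\s>]+)\"?[^>]*>",
        Pattern.CASE_INSENSITIVE );

    public static Set<String> extractLinks( String html )
    {
        Set<String> urls = new HashSet<String>();
        Matcher m = LINK.matcher( html );
        while( m.find() )
        {
            urls.add( m.group( 1 ) );
        }
        return urls;
    }
}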

To Do

Part 1: For each component in the architecture (with the exception of the Persistent Dictionary), create a precise, English-language specification of the component's interface. The specification should list, for each of the component's operations, the operation's name, input parameters, output parameters, and a description.

Part 2: Show the internal design of each component using the class perspective. Your design should consist of class diagrams (typeset with Poseidon) and appropriate text to describe the diagram.

Part 3: Implement the spider engine based on your component design from parts 1 and 2.

Part 4: Test the spider engine using the test data provided by us. Create a simple test program that allows you to print out the contents of the word database so that you can test the correct operation of your spider engine.
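
As a rough illustration of such a test printer (assuming the word database can be enumerated as a map from words to URL sets, as in the earlier SpiderSketch; the Persistent Dictionary's actual interface may differ):

import java.util.Map;
import java.util.Set;

public class WordDatabasePrinter
{
    // Print every word along with the set of URLs it was found on
    public static void print( Map<String, Set<String>> wordDatabase )
    {
        for( Map.Entry<String, Set<String>> entry : wordDatabase.entrySet() )
        {
            System.out.println( entry.getKey() + " -> " + entry.getValue() );
        }
    }
}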

Note that you need only implement the spider engine. It is not necessary to implement the search engine in this part of the assignment.

To Hand In

Submit a report including parts 1, 2, 3 and 4. The report should have a table of contents, and be divided into one section per part, with appropriate section headings. The front page of the report must be the assignment 1 cover page, available here. (Two marks will be deducted for not providing this cover sheet.) Be sure to read this cover sheet for a breakdown of how the assignment will be graded.

This assignment is to be submitted to the CISC 322 drop box on the second floor of Goodwin Hall by 3:00 PM on Monday, Oct 3, 2005. Printers tend to be busy shortly before assignment deadlines, so it is wise to avoid last-minute submission. No late assignments will be accepted.