CISC 322 Assignment 2
Architecture of a Search Engine: Adding Searching
Due Thursday, October 20, 2005 by 3:00 PM
Synopsis
In this assignment, you will refine the architecture of your spider engine and design the core of your search engine. This assignment will give you experience with tabular interface specification, formal (logic-based) interface specification, and the use of UML sequence diagrams to document component interaction. The assignment will also expose you to the typical problem of requirements change in an ongoing project.
Background
In the last assignment, you performed preliminary design and implementation of your spider engine. You created informal specifications of the components within the spider engine, and implemented these components.
In this assignment, you will revisit the interface specifications and specify them more precisely. You will also start work on designing your search engine.
Requirements Change
In real software projects, it is typical for the project's requirements to change as the project is underway. Our search engine project is no different, and so assignments two through four will each introduce a change in requirements. This assignment's change is to require the spider engine to recognize robots.txt files.
Robots Exclusion Standard
Spidering a web site can be an invasive procedure, perhaps recording information that the owner of the web site would rather not have show up in a search engine. Spidering can also place a significant load on a server, particularly if it has many pages to search. The Robots Exclusion Standard allows web sites to specify that all or parts of their site should not be visited by spider engines. Well-behaved spider engines consult this information and obey its directives.
The directives are placed in a file called robots.txt located in the web server's root. Here are some examples of robots.txt files:
- http://www.queensu.ca/robots.txt - Queen's University's robots.txt file
- http://www.slashdot.org/robots.txt - Slashdot's robots.txt file
- http://www.ebay.com/robots.txt - eBay's robots.txt file
The directives permitted in a robots.txt file are described here. For simplicity, we will process only lines that deny access to specified directory trees or individual pages. These lines are of the form:
Disallow:/directoryName/
Disallow:/pageName.html
The first form tells the spider not to look at any pages whose URL begins with /directoryName/. The second form disallows access to a specific page. See the tutorial and example robots.txt files referenced above for many examples of each form.
Recognizing robots.txt Files
The first time that your spider program visits a particular server, it should consult the robots.txt file for that domain. The file may consist of an arbitrary number of lines of the form
field:value
Lines of the form
Disallow:pathOrFile
specify paths or files on this server that your spider must not visit.
Lines of the form
# comment text
are comments and should be ignored.
The spider should ignore all other lines in the robots.txt file.
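For example, a hypothetical robots.txt file combining these kinds of lines might look like the following (the paths are made up); under the simplified rules above, the spider would record the two Disallow entries and ignore everything else:

# Keep spiders out of the administrative area
User-agent: *
Disallow:/admin/
Disallow:/secret.html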
Interface Specifications
We have studied two ways of specifying component interfaces:
- Tabular interface specifications are precise, English-language descriptions of component interfaces.
- Logic-based interface specifications are precise, formal descriptions of component interfaces.
In this assignment, you will practice both ways of specifying interfaces.
Searches
Up to now, we have worked solely on the spider engine. Over the next assignments, we will also design and implement the search engine itself, which allows people to perform queries over the data we have extracted from our set of spidered web pages.
In general, when a user requests a query q, the result should be the set of web pages that satisfy q. The simplest form of a query is:
word1
which results in all web pages containing the word word1. Queries can involve multiple words:
word1 word2
results in the set of all web pages that contain both word1 and word2. A query can contain any number of words. Queries can also contain negated words:
word1 -word2
results in all web pages that contain word1 and do not contain word2.
Query Grammar
The following grammar describes the syntax of queries:
query     ::= queryTerm | queryTerm query
queryTerm ::= word | - word
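For example, under this grammar the (made-up) query cat -dog fish consists of three query terms: the words cat and fish and the negated word dog. It matches exactly those web pages that contain both cat and fish and do not contain dog.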
The Query Engine
The Query Engine is a component that processes queries. The query engine has an operation getPages that takes as parameters:
- a set of positive words
- a set of negative words
- the maximum number of web pages to be returned
and returns a set of web pages. Each returned web page contains all of the positive words and none of the negative words. The number of returned pages does not exceed the specified maximum.
In future assignments, you will use the query engine to help process string queries of the form described above. Note that the query engine does not accept queries in the string form described in the previous section; instead, it receives them through the parameters described above.
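As a rough illustration only, the getPages operation might be declared in Java along the following lines. The interface name, the WebPage type and the parameter names are assumptions made for this sketch; your own design may differ.

import java.util.Set;

// All names in this sketch (QueryEngine, WebPage, the parameter names) are
// illustrative assumptions, not part of the assignment specification.
interface QueryEngine {
    // Returns at most maxPages web pages, each containing every word in
    // positiveWords and none of the words in negativeWords.
    Set<WebPage> getPages(Set<String> positiveWords,
                          Set<String> negativeWords,
                          int maxPages);
}

// Placeholder standing in for however your design represents a web page.
class WebPage { }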
To Do
Part 1: Following the format presented in class, provide a tabular specification of the interfaces of each of the Page Info, Page Queue and Word Database components. The specification should list, for each of the component's operations, the operation's name, input parameters, output parameters, precondition, postcondition, exceptions and description.
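Purely as an illustration of the kind of information each entry should carry (the layout must follow the format from class, and addPage is a hypothetical Page Queue operation, not a required one), a single entry might read:

Name: addPage
Inputs: url, the URL of a page to be visited
Outputs: none
Precondition: url has not already been queued or visited
Postcondition: url is in the queue of pages awaiting a visit
Exceptions: none
Description: Schedules a page for a later visit by the spider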
Modify your interfaces and code as necessary to account for the new Robots Exclusion Standard requirement. Document the changes you make. Discuss the impact that your change had on your components' interfaces - is the change localized, or spread throughout the architecture? What does this tell you about the design of your architecture?
While performing this specification, you may find other opportunities to improve the components' interfaces over what you submitted in assignment 1. Document any changes and explain the reason behind them. Modify your code as necessary to conform to your new interface specifications.
Part 2: Create a tabular specification for the Query Engine component described above. Your specification must include the getPages operation using the parameters and return value described above. Your specification may include other operations as required.
Part 3: Create a formal specification of the Page Queue component using the logic-based notation presented in class.
Part 4: Create a sequence diagram showing the following scenario:
The spider engine is invoked on the URL for the simple test case provided in assignment 1.
The sequence diagram should carry the scenario through the visit of page a.html. It is not necessary to show the activities required to process the other pages.
Part 5: Test the spider engine using the test data provided by us. Use your test program from assignment 1 to print out the contents of the word database so that you can test the correct operation of your spider engine. Include commentary with the tests to explain how the tests show correct operation of your system.
Note that you are not being asked to implement the query engine as part of assignment 2.
Hints
Some hints:
Parsing the Robots.txt file
A full parser for robots.txt files can be complex. It is sufficient in this assignment to use the following algorithm (a Java sketch of this loop appears after the list):
- For each line in the robots.txt file:
  - Remove all whitespace from the line (blanks, tabs, etc.)
  - If the line begins with a "#" character, discard the line
  - If the line contains a "#" character that is not at the beginning, remove the "#" character and all characters following it
  - If the line begins with "Disallow:", record the pathOrFile part
  - Otherwise, discard the line
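Here is one possible Java rendering of that loop, offered only as a starting point; the class and method names, and the choice to return the disallowed entries as a list of strings, are assumptions of this sketch rather than requirements:

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names and structure are illustrative only.
class RobotsTxtParser {
    // Returns the pathOrFile part of every Disallow line in the given
    // robots.txt lines, applying the simplified rules listed above.
    static List<String> parseDisallowed(List<String> lines) {
        List<String> disallowed = new ArrayList<String>();
        for (String line : lines) {
            // Remove all whitespace (blanks, tabs, etc.) from the line.
            line = line.replaceAll("\\s", "");
            // Discard comment lines that begin with "#".
            if (line.startsWith("#")) {
                continue;
            }
            // Strip a trailing comment: remove "#" and everything after it.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            // Record the pathOrFile part of Disallow lines; ignore all other lines.
            if (line.startsWith("Disallow:")) {
                disallowed.add(line.substring("Disallow:".length()));
            }
        }
        return disallowed;
    }
}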
Formal specification of Page Queue
If you are finding it very difficult to create this formal specification, it may indicate that your interface is too complex. Look for ways of simplifying your interface while retaining the necessary functionality. One of the useful consequences of formal specification is that it helps us simplify our designs.
You do not need to create interface specifications for data structures used by the implementation of your Page Queue component. For example, a queue of pages might be implemented via Java's Vector class, but can be adequately modeled as a mathematical set. Similarly, you may have used a hash table to implement the list of pages that have been visited in the past, while a mathematical set is again adequate to model it as an interface specification.
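As a very rough illustration of the intended level of abstraction (you should use the notation presented in class; addPage, queued and visited are hypothetical names), an operation that adds a new page to the queue might be specified over two sets of URLs, queued and visited, with primed names denoting the state after the operation:

addPage(u)
  pre:  u ∉ queued and u ∉ visited
  post: queued' = queued ∪ {u} and visited' = visited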
To Hand In
Submit a report including parts 1, 2, 3, 4 and 5. The report should have a table of contents, and be divided into one section per part, with appropriate section headings. The front page of the report must be the assignment 2 cover page, available here. (Two marks will be deducted for not providing this cover sheet.) Be sure to read this cover sheet for a breakdown of how the assignment will be graded.
This assignment is to be submitted to the CISC 322 drop box on the second floor of Goodwin Hall by 3:00 PM on Thursday, Oct 20, 2005. Printers tend to be busy shortly before assignment deadlines, so it is wise to avoid last-minute submission. No late assignments will be accepted.