Requirements Specification for HTML Formatter

Introduction and System Model

This document assumes that the reader is familiar with HTML. A detailed working knowledge is not necessary, but recognition of technical terms such as tags, links and elements is assumed. This assumption allows the document to be written at a high level, with a reference to a lower-level, more formal description of HTML to clarify any details. Furthermore, many decisions have purposely been left ambiguous. Most of these are design decisions and do not belong in a requirements specification. Any questions regarding the document or unresolved issues should be directed to the CS403 TA.

The formatter will be used in the context of a WWW toolkit. It will be used by other tools in the toolkit to format HTML documents to ASCII output. It will be used mostly to format documents for printing on ASCII printers, or viewing on ASCII terminals.

System Evolution

Since the HTML specification is changing, we foresee a need to support different types of HTML elements and element attributes. Another possible extension is to allow customized handling of some elements. It might also be necessary to add more sophisticated layout capabilities (such as relative positioning of elements or using the SIZE attribute of an IMG tag to leave the appropriate amount of whitespace). As the toolkit gets more sophisticated, it may become necessary to be able to format documents incrementally, that is, format the document one page (screen or hardcopy) at a time or until a particular point in the document. For example, it may be necessary to process the document up to the first link, and then continue processing, stopping at each link.

Functional Requirements

The program should conform to a subset of the Hypertext Markup Language - 2.0 specification. The subset is chosen so that the program will handle many typical HTML documents, without adding too much unnecessary complexity. The elements that must be handled are:

<HTML> ... </HTML>, <HEAD> ... </HEAD> and <BODY> ... </BODY>: All HTML documents will have these, but they don't affect how you process the document.
<TITLE> ... </TITLE>: Display left justified with a blank line below it.
<BASE>: Should be used to create absolute URLs from relative URLs encountered in the document.
<Hn> ... </Hn> (for n in 1..6): Display left justified with a blank line above and below and indented n spaces from the left.
Comments <! ... >: Comments should be removed from the output.
 ... : Display delimited text in all-caps (uppercase).
Lists (Unordered lists <UL> ... </UL>, Ordered lists <OL> ... </OL>, List items <LI>, Definition lists <DL> ... </DL>, <DT>, <DD>): Should be handled; you can decide how to display them. Nesting should be handled correctly.
Paragraph (<P>), Line break (<BR>) and block quote <BLOCKQUOTE> ... </BLOCKQUOTE>: These three elements should produce distinguishable output. One reasonable way to handle them would be to have <BR> go to the beginning of the next line, <P> skip a line and go to the beginning of the following line, and <BLOCKQUOTE> use bigger left and right margins. The <P> element should be handled both as an indication that a new paragraph is being started (when used alone) and to delimit a paragraph (when used with </P>).
Images <IMG>: If an ALT attribute is specified, it should be printed; otherwise, some indication that there is an image should be printed.
Anchors <A HREF="foo">link text</A>: Special formatting of anchors is optional. The words "link text" may appear as normal text in the document. The link itself must be processed and the program must support an option which, when selected, will cause a form feed character to be appended to the document, followed by a list of links in the document. See below for details about formatting links.
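
As an illustration of two of the simpler rules above (headings and images), here is a minimal Python sketch. The function names and the "[IMAGE]" placeholder are assumptions made for this example only; the requirement itself just asks for "some indication that there is an image".

```python
def render_heading(text, level):
    """Return the output lines for an <Hn> heading.

    The heading is left justified, indented `level` spaces, with a blank
    line above and below it.
    """
    if not 1 <= level <= 6:
        raise ValueError("heading level must be in 1..6")
    return ["", " " * level + text.strip(), ""]


def render_image(attrs):
    """Return the text printed for an <IMG> element.

    `attrs` is a dict of lowercase attribute names to values. The ALT text
    is printed if present; otherwise a placeholder (hypothetical) is used.
    """
    alt = attrs.get("alt")
    return alt if alt else "[IMAGE]"
```

For example, render_heading("Overview", 2) yields a blank line, the text "  Overview", and another blank line.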

Unless specified above, all output should be full-justified: the text should be flush at both the left and right margins, and spaces must be added between words in such a way that gaps are even. The placement of extra spaces should not repeat regularly from line to line, to avoid large swathes of whitespace running down the page. The last line of a paragraph or block of text should be left justified.
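
One possible way to meet the justification requirement is sketched below in Python. It is only a sketch under stated assumptions: the function names are invented for this example, and alternating the side that receives the extra spaces on successive lines is just one strategy for keeping the padding from lining up; the requirement does not prescribe it.

```python
def justify_line(words, width, from_left=True):
    """Pad the gaps between `words` so the line is exactly `width` wide."""
    if len(words) == 1:
        return words[0]  # a single word cannot be stretched
    gaps = len(words) - 1
    total_spaces = width - sum(len(w) for w in words)
    base, extra = divmod(total_spaces, gaps)
    # Every gap gets `base` spaces; `extra` of them get one more.
    sizes = [base + 1] * extra + [base] * (gaps - extra)
    if not from_left:
        sizes.reverse()  # put the wider gaps on the right instead
    pieces = [word + " " * size for word, size in zip(words, sizes)]
    pieces.append(words[-1])
    return "".join(pieces)


def justify_paragraph(text, width=80):
    """Return the lines of `text`, full-justified to `width` columns."""
    lines, current, length = [], [], 0
    for word in text.split():
        # `length` is the width of `current` joined by single spaces.
        if current and length + 1 + len(word) > width:
            lines.append(current)
            current, length = [], 0
        length += len(word) + (1 if current else 0)
        current.append(word)
    if current:
        lines.append(current)
    out = []
    for i, words in enumerate(lines[:-1]):
        # Alternate the side that receives the extra spaces.
        out.append(justify_line(words, width, from_left=(i % 2 == 0)))
    if lines:
        out.append(" ".join(lines[-1]))  # last line is left justified
    return out
```

A word longer than the line width is placed on a line of its own and left as is; how to treat such words is a design decision this sketch does not settle.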

In order to reasonably process documents which contain elements not specifically handled by the program, any HTML elements (things in <>'s) or attributes (additional things inside the <>'s besides the name of the element) that your program does not understand should be ignored. That is, the elements (or attributes) themselves should have no effect, but any text that they delimit should be output normally. For example, the input

<FOO>This is a test.</FOO> <P SPACE=5> This is only a test.

should be treated as if it were:

This is a test. <P> This is only a test.
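
The rule can be illustrated with a short Python sketch built on the standard library's html.parser module. The KNOWN table (which elements and attributes count as understood) and the names UnknownStripper and strip_unknown are assumptions for this example; a real formatter would render the known elements rather than echo them.

```python
from html.parser import HTMLParser

# Hypothetical table of understood elements and their understood attributes.
KNOWN = {"p": set(), "a": {"href", "name"}, "img": {"alt"}}


class UnknownStripper(HTMLParser):
    """Echo the input, dropping unknown elements and unknown attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN:
            return  # unknown element: the tag itself has no effect
        kept = []
        for name, value in attrs:
            if name in KNOWN[tag]:
                kept.append(f" {name.upper()}" if value is None
                            else f' {name.upper()}="{value}"')
        self.out.append(f"<{tag.upper()}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in KNOWN:
            self.out.append(f"</{tag.upper()}>")

    def handle_data(self, data):
        self.out.append(data)  # delimited text is always passed through


def strip_unknown(html):
    parser = UnknownStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Feeding the example input above to strip_unknown gives back the second form, with the FOO element and the SPACE attribute removed.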

You do not need to worry about HTML elements that allow for user input (textareas, forms, etc.). All you need to do is allow the program to accept an HTML file as input and produce as output the ASCII file that results from processing it.

The program should allow customization of the maximum line width of the output file. Default line width can be assumed to be 80 characters.

It must also support the optional appending of a list of links (names and URLs) encountered in the document. This list should be separated from the rest of the output by a form feed character (^L). Each entry should contain the text delimited by the link and the value of the HREF attribute, if one exists. Anchors that have no HREF attribute (for example, those with only a NAME attribute) do not generate a link listing. For example, if the links

 <A HREF="foo.html">foo link</A>,
 <A NAME="bar">The bar</A>
 and <A HREF="http://www.cs.washington.edu/education/">Courses</A>

appear in the input, and no base is specified, then the following should be appended to the output:


  ^L
 "foo link" = "foo.html"
 "Courses" = "http://www.cs.washington.edu/education/"

If the HEAD of the document contains a


  <BASE HREF="http://www.cs.washington.edu/education/courses/cse403/96sp/">

then the appended links should be formatted as follows:


  ^L
 "foo link" = "http://www.cs.washington.edu/education/courses/cse403/96sp/foo.html"
 "Courses" = "http://www.cs.washington.edu/education/"

The HTML specification states that the BASE tag may not appear in the BODY of a document and that anchor tags (<A>) may not appear in the HEAD. Thus it is safe to use the last occurrence of BASE in the HEAD to resolve all relative URLs in anchors in the BODY. (For those of you who want to get really picky, the HEAD may contain a tag called LINK which behaves just like an anchor, but is used for things like style sheets. I can't figure out what the spec says about interleaved BASE and LINK tags, but I'm not requiring you to handle LINK tags, so don't worry about it.)
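
The link listing described above could be produced along the following lines. This is a hedged Python sketch, not a required design: the names LinkCollector and link_appendix are invented here, relative URLs are resolved with the standard library's urllib.parse.urljoin, and the exact layout (a form feed on its own line, then one entry per line) is an assumption based on the examples.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, HREF) pairs and the last BASE HREF seen."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.base = None
        self.links = []      # (text, href) pairs in document order
        self._href = None    # HREF of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base = attrs["href"]        # the last BASE wins
        elif tag == "a" and attrs.get("href") is not None:
            self._href = attrs["href"]       # NAME-only anchors are skipped
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


def link_appendix(html):
    """Return the form feed and link list to append to the output."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    lines = ["\f"]  # form feed (^L) separates the list from the document
    for text, href in collector.links:
        url = urljoin(collector.base, href) if collector.base else href
        lines.append(f'"{text}" = "{url}"')
    return "\n".join(lines)
```

Because the URLs are only resolved once the whole document has been seen, this approach needs a single pass over the input and never rereads it.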

In addition to the requirements described above, handling for any additional element types or attributes is of value. For example, other types of lists (<MENU>, <DIR>), style types and links (denoting links in the text), horizontal rule (<HR>) or additional attributes to <BLOCKQUOTE> such as INDENT=5 or LEADING_CHAR=">" are all possible extensions.

Constraints

The program shall be maintainable. It will be possible for a group of two or three programmers familiar with the task, but not with this specific implementation, to add functionality such as new elements, or to modify the algorithm used for layout, within one week.

The input to the program will be streamed. Once characters have been read, they cannot be read again. (The input may be a Unix pipe rather than a file.)
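
One way to respect the streaming constraint is to read the input in chunks and hand each chunk to an incremental parser, never seeking backwards. The Python sketch below assumes the standard library's html.parser, whose feed() method accepts partial input; the TextOnly class, the process_stream function and the 4096-character chunk size are illustrative choices only.

```python
import io
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Toy consumer that just collects the document's text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def process_stream(stream, chunk_size=4096):
    """Read `stream` once, front to back, without ever rewinding it."""
    parser = TextOnly()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break                # end of input
        parser.feed(chunk)       # tags split across chunks are buffered
    parser.close()               # flush anything still pending
    return "".join(parser.chunks)
```

In a real filter the stream would be sys.stdin; io.StringIO is handy for trying the function out, for example process_stream(io.StringIO("<H1>Hi</H1> <P>there"), chunk_size=3).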

The input is not guaranteed to be a valid HTML document. If it is not, the errors must be handled. It is preferable that errors are handled by printing a message (to stderr rather than stdout) and either continuing to process the document as well as possible or ceasing the processing of the document. The error message should be as helpful as possible, describing the type of error and, if possible, the location of the error in the document.
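
As a sketch of what a helpful error message might look like, the Python example below checks the nesting of a few paired elements and reports problems with a line and column. Everything specific here is an assumption: the PAIRED set, the NestingChecker name, the message wording, and the choice to keep going after an error are only one of the options the constraint allows.

```python
import sys
from html.parser import HTMLParser

# Hypothetical set of elements whose end tags are checked.
PAIRED = {"ul", "ol", "dl", "blockquote", "a"}


class NestingChecker(HTMLParser):
    """Report badly nested or unclosed elements, then keep going."""

    def __init__(self, err=None):
        super().__init__()
        self.err = err if err is not None else sys.stderr
        self.stack = []
        self.error_count = 0

    def report(self, message):
        line, column = self.getpos()   # position in the input so far
        print(f"error: line {line}, column {column + 1}: {message}",
              file=self.err)
        self.error_count += 1

    def handle_starttag(self, tag, attrs):
        if tag in PAIRED:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in PAIRED:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            return
        expected = f"</{self.stack[-1].upper()}>" if self.stack else "no end tag"
        self.report(f"unexpected </{tag.upper()}> (expected {expected})")
        if tag in self.stack:
            # Recover by closing everything up to the matching start tag.
            while self.stack.pop() != tag:
                pass

    def close(self):
        super().close()
        for tag in reversed(self.stack):
            self.report(f"missing </{tag.upper()}> at end of input")
        self.stack.clear()
```

Messages go to stderr by default, so they do not get mixed into the formatted document on stdout.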

Priorities

The most important goal is the correct processing of valid HTML documents and the flagging of invalid documents. This includes the requirement that text delimited by elements which the program doesn't explicitly handle is passed through to the output (as described in the functional requirements section), the requirement that text is full-justified, and the support for optionally specifying the line width of the output. Next is the processing of the various elements and attributes. The ability to collect all links and append them to the output is the next most important feature. When these requirements have been met, extensions to the functionality may be added.

Interfaces to the environment

The program will be used as a Unix filter; that is, HTML documents will be piped to it as input and the output will be piped to another program (such as more or lpr). The program is free to allow the option of specifying filenames for input and output instead of always reading from stdin and writing to stdout.
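
A command-line interface matching this description might be sketched in Python as follows. The option names (-w/--width, -l/--links), the use of "-" to mean stdin or stdout, and the function names are all assumptions for illustration; the requirements only ask for a configurable line width (default 80), an optional link list, and filter behaviour.

```python
import argparse
import sys


def build_arg_parser():
    parser = argparse.ArgumentParser(
        description="Format an HTML document as ASCII text.")
    parser.add_argument("-w", "--width", type=int, default=80,
                        help="maximum output line width (default: 80)")
    parser.add_argument("-l", "--links", action="store_true",
                        help="append a form feed and the list of links")
    parser.add_argument("input", nargs="?", default="-",
                        help="input file, or - for stdin (the default)")
    parser.add_argument("output", nargs="?", default="-",
                        help="output file, or - for stdout (the default)")
    return parser


def open_streams(args):
    """Return (source, sink) text streams for the parsed arguments."""
    source = sys.stdin if args.input == "-" else open(args.input, "r")
    sink = sys.stdout if args.output == "-" else open(args.output, "w")
    return source, sink
```

With a hypothetical program name of htmlfmt, a typical use as a filter would be: htmlfmt -w 72 --links < page.html | lpr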

Adam Carlson