Requirements Specification for HTML Formatter
Introduction and System Model
This document assumes that the reader is familiar with HTML. A detailed
working knowledge is not necessary, but the recognition of technical terms
such as tags, links and elements is assumed. This assumption allows the
document to be written at a high level, with a reference to a low-level, more
formal description of HTML to clarify any details. Furthermore, many decisions
have been left purposely ambiguous. Most of these are design decisions and do
not belong in a requirements specification. Any questions regarding the
document or issues that are unresolved should be directed to the CS403 TA.
The formatter will be used in the context of a WWW toolkit. It will be used
by other tools in the toolkit to format HTML documents to ASCII output. It
will be used mostly to format documents for printing on ASCII printers, or
viewing on ASCII terminals.
System Evolution
Since the HTML specification is changing, we foresee a need to support
different types of HTML elements and element attributes. Another possible
extension is to allow customized handling of some elements. It might also be
necessary to add more sophisticated layout capabilities (such as relative
positioning of elements or using the SIZE attribute of an IMG tag to leave the
appropriate amount of whitespace). As the toolkit gets more sophisticated, it
may become necessary to be able to format documents incrementally, that is,
format the document one page (screen or hardcopy) at a time or until a
particular point in the document. For example, it may be necessary to process
the document up to the first link, and then continue processing, stopping at
each link.
Functional Requirements
The program should conform to a subset of the
Hypertext Markup Language - 2.0 specification.
The subset is chosen so that the program will handle many typical HTML
documents, without adding too much unnecessary complexity. The elements that
must be handled are:
- <HTML> ... </HTML>, <HEAD> ... </HEAD> and <BODY> ... </BODY>
- All HTML documents will have these, but they do not affect how you
process the document.
- <TITLE> ... </TITLE>
- Display left justified with a blank line below it.
- <BASE> ... </BASE>
- Should be used to create universal URLs from relative URLs encountered in
the document.
- <Hn> ... </Hn> (for n in 1..6)
- Display left justified with a blank line above and below and indented n
spaces from the left.
- Comments <! ... >
- Comments should be removed from the output.
- <STRONG> ... </STRONG>
- Display delimited text in all-caps (uppercase).
- Lists (Unordered lists <UL> ... </UL>, Ordered lists
<OL> ... </OL>, List items <LI>, Definition lists <DL>
... </DL>, <DT>, <DD>)
- Should be handled; you can decide how to display them. Nesting should be
handled correctly.
- Paragraph (<P>), Line break (<BR>) and block quote
(<BLOCKQUOTE> ... </BLOCKQUOTE>)
- These three elements should produce distinguishable output. One
reasonable way to handle them would be to have <BR> go to the beginning
of the next line, <P> skip a line and go to the beginning of the
following line and <BLOCKQUOTE> use bigger left and right
margins. The <P> element should be handled both as an indication that a
new paragraph is being started (when used alone) and to delimit a paragraph
(when used with </P>).
- Images <IMG>
- If an ALT attribute is specified, it should be printed; otherwise, some
indication that there is an image should be printed.
- Anchors <A HREF="foo">link text</A>
- Special formatting of anchors is optional. The words "link text" may
appear as normal text in the document. The link itself must be processed and
the program must support an option which, when selected, will cause a form
feed character to be appended to the document, followed by a list of links in
the document. See below for details about
formatting links.
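A few of the simpler element rules above can be sketched in Python. This is
only an illustration, not a required design: the function names and the
"[IMAGE]" placeholder are assumptions, since the specification leaves the form
of the image indication open.

```python
def render_heading(text, level):
    """<Hn>: blank line above and below, indented `level` spaces."""
    if not 1 <= level <= 6:
        raise ValueError("heading level must be in 1..6")
    return ["", " " * level + text, ""]


def render_strong(text):
    """<STRONG>: delimited text in all-caps."""
    return text.upper()


def render_image(alt=None):
    """<IMG>: the ALT text if present, else a placeholder (assumed form)."""
    return alt if alt else "[IMAGE]"
```

Each function returns plain text (or a list of output lines) so that a later
layout stage can decide how the pieces are joined.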
Unless specified above, all output should be full-justified. The text should
be flush at both the left and right margins, and spaces must be added between
words in such a way that gaps are even. The extra spaces should not be added
at the same positions on successive lines, to avoid large swathes of whitespace.
The last line of a paragraph or block of text should be left justified.
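One way to meet the full-justification rule is sketched below in Python. It is
a minimal sketch under assumptions: the names justify_line and
justify_paragraph are hypothetical, and alternating the side that receives the
wider gaps is just one simple way to keep extra spaces from lining up on
successive lines.

```python
def justify_line(words, width, from_left=True):
    """Pad the gaps between words so the line is exactly `width` wide."""
    if len(words) == 1:
        return words[0]
    gaps = len(words) - 1
    total_spaces = width - sum(len(w) for w in words)
    base, extra = divmod(total_spaces, gaps)
    # Gap sizes differ by at most one; the wider gaps go on one side.
    sizes = [base + 1] * extra + [base] * (gaps - extra)
    if not from_left:
        sizes.reverse()
    pieces = [word + " " * size for word, size in zip(words, sizes)]
    pieces.append(words[-1])
    return "".join(pieces)


def justify_paragraph(text, width=80):
    """Return the paragraph as a list of lines, full-justified except the last."""
    lines, current = [], []
    for word in text.split():
        # Width of the current line if this word is added with single spaces.
        needed = sum(len(w) for w in current) + len(current) + len(word)
        if current and needed > width:
            lines.append(current)
            current = []
        current.append(word)
    out = []
    for i, words in enumerate(lines):
        # Alternate the side that gets the wider gaps from line to line.
        out.append(justify_line(words, width, from_left=(i % 2 == 0)))
    if current:
        out.append(" ".join(current))  # last line is left justified
    return out
```

A word longer than the line width is emitted on a line of its own and is not
split; how to treat such words is a design decision left open here.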
In order to reasonably process documents which
contain elements not specifically handled by the program, any HTML elements
(things in <>'s) or attributes (additional things in <> besides
the name of the element) that your program does not understand should be
ignored. That is, the elements (or attributes) themselves should have no
effect, but any text that is delimited should be output normally. For
example, the input
<FOO>This is a test.</FOO> <P SPACE=5> This is only a test.
should be treated as if it were:
This is a test. <P> This is only a test.
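The rule can be illustrated with a small Python sketch that rewrites the input
before formatting. It rests on assumptions: the KNOWN table is an abbreviated,
hypothetical subset of the required elements, drop_unknown_markup is an
invented name, and the regular expressions do not cope with a ">" inside a
quoted attribute value.

```python
import re

# Hypothetical table: element name -> attributes the program understands.
KNOWN = {
    "P": set(),
    "BR": set(),
    "STRONG": set(),
    "A": {"HREF", "NAME"},
    "IMG": {"ALT"},
    "BASE": {"HREF"},
}

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")
ATTR_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_-]*)\s*=\s*("[^"]*"|[^\s>]+)')


def drop_unknown_markup(html):
    """Remove unknown elements and unknown attributes, keeping the text."""
    def replace(match):
        closing, name, rest = match.group(1), match.group(2).upper(), match.group(3)
        if name not in KNOWN:
            return ""  # unknown element: the tag vanishes, its text stays
        kept = [
            f"{attr.upper()}={value}"
            for attr, value in ATTR_RE.findall(rest)
            if attr.upper() in KNOWN[name]
        ]
        return "<" + closing + " ".join([name] + kept) + ">"

    return TAG_RE.sub(replace, html)
```

Applied to the example input above, this sketch yields the second line of the
example: the FOO tags disappear and the SPACE attribute is dropped from <P>.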
You do not need to worry about HTML elements that allow for user
input (textareas, forms, etc.). All you need to do is allow the program to
accept as input an HTML file and produce as output an ASCII file resulting
from processing the input.
The program should allow customization of the maximum line width of the output
file. Default line width can be assumed to be 80 characters.
It must also support the optional appending of a list of links (names
and URLs) encountered in the document. This list should be separated from the
rest of the output by a form feed character (^L). Links should
contain the text delimited by the link and the value of the HREF attribute, if
one exists. Other attributes do not generate a link listing. For example, if
the links
<A HREF="foo.html">foo link</A>,
<A NAME="bar">The bar</A>
and <A HREF="http://www.cs.washington.edu/education/">Courses</A>
appear in the input, and no base is specified, then the following should be
appended to the output:
^L
"foo link" = "foo.html"
"Courses" = "http://www.cs.washington.edu/education/"
If the HEAD of the document contains a
<BASE HREF="http://www.cs.washington.edu/education/courses/cse403/96sp/">
then the appended links should be formatted as follows:
^L
"foo link" = "http://www.cs.washington.edu/education/courses/cse403/96sp/foo.html"
"Courses" = "http://www.cs.washington.edu/education/"
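The link listing could be produced along the following lines. This is a hedged
Python sketch, not a prescribed design: LinkCollector and link_listing are
hypothetical names, and it relies on the standard library's
html.parser.HTMLParser and urllib.parse.urljoin. Because the base is applied
only after the whole document has been seen, the last BASE encountered is the
one used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, HREF) pairs and the document's BASE, if any."""

    def __init__(self):
        super().__init__()
        self.base = None
        self.links = []
        self._href = None   # HREF of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "base" and attrs.get("href"):
            self.base = attrs["href"]
        elif tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


def link_listing(html):
    """Return the form feed followed by one '"text" = "url"' line per link."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    lines = ["\f"]
    for text, href in parser.links:
        url = urljoin(parser.base, href) if parser.base else href
        lines.append(f'"{text}" = "{url}"')
    return "\n".join(lines)
```

Anchors that carry only a NAME attribute never set an HREF in this sketch, so
they produce no entry, as in the example above.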
The HTML
specification states that the BASE tag may not appear in the BODY of a
document and anchor tags (<A>) may not appear in the HEAD. Thus it is
safe to use the last occurrence of BASE in the HEAD to expand all
relative URLs in anchors in the BODY. (For those of you who want to get
really picky, the HEAD may contain a tag called LINK which behaves just like
an anchor, but is used for things like style sheets. I can't figure out what
the spec says about interleaved BASE and LINK tags, but I'm not requiring you
to handle LINK tags, so don't worry about it.)
In addition to the
requirements described above, handling for any additional element types or
attributes is of value. For example, other types of lists (<MENU>,
<DIR>), style types and links (denoting links in the text), horizontal
rule (<HR>) or additional attributes to <BLOCKQUOTE> such as
INDENT=5 or LEADING_CHAR=">" are all possible extensions.
Constraints
The program shall be maintainable. It will be possible for a group of two or
three programmers familiar with the task but not with this specific
implementation to add functionality such as new elements or modify the
algorithm used for layout in one week.
The input to the program will be streamed. Once an arbitrary number of
characters have been read, they will not be rereadable. (The input may be a
Unix pipe, not a file.)
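The streaming constraint means the input has to be consumed in a single forward
pass. A minimal Python sketch of that pattern follows, assuming the standard
library's html.parser.HTMLParser, whose feed method accepts the document in
pieces and buffers any tag that is cut off at a chunk boundary. TextExtractor
and extract_text are hypothetical names, and collecting the text into a list is
only a stand-in for real formatting.

```python
import io
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Stand-in for the formatter: just collects character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(stream, chunk_size=4096):
    """Read `stream` once, front to back, never seeking or rereading."""
    parser = TextExtractor()
    for chunk in iter(lambda: stream.read(chunk_size), ""):
        parser.feed(chunk)  # incomplete tags are held until more input arrives
    parser.close()
    return "".join(parser.chunks)


# A tiny chunk size forces tags to be split across reads.
sample = io.StringIO("<P>Hello <STRONG>wor</STRONG>ld")
text = extract_text(sample, chunk_size=3)
```

The same function works unchanged on sys.stdin, since it only ever calls read.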
The input may not be a valid HTML document. If it is not, any errors must be
handled. It is preferable that errors are handled by printing a message (to
stderr rather than stdout) and either continuing to process the document as
well as possible or ceasing the processing of the document. The error message
should be as helpful as possible, describing the type of error, and, if
possible, the source of the error in the document.
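As an illustration of the preferred error handling, the Python sketch below
reports mismatched tags on stderr with the line number where they occur and
then keeps going. It is only one possible approach: NestingChecker and the
NO_CLOSE set are assumptions, and a real implementation would detect more kinds
of errors.

```python
import sys
from html.parser import HTMLParser

# Elements treated as needing no closing tag in this sketch (an assumption).
NO_CLOSE = {"p", "br", "li", "dt", "dd", "img", "base", "hr"}


class NestingChecker(HTMLParser):
    """Report unexpected or missing closing tags without stopping."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of (tag, line number where it opened)
        self.error_count = 0

    def report(self, message, line):
        self.error_count += 1
        # Diagnostics go to stderr so that stdout carries only formatted text.
        print(f"error: line {line}: {message}", file=sys.stderr)

    def handle_starttag(self, tag, attrs):
        if tag not in NO_CLOSE:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in NO_CLOSE:
            return
        if self.open_tags and self.open_tags[-1][0] == tag:
            self.open_tags.pop()
        else:
            self.report(f"unexpected </{tag.upper()}>", self.getpos()[0])

    def finish(self):
        """Flush the parser, report unclosed elements, return the error count."""
        self.close()
        for tag, line in self.open_tags:
            self.report(f"<{tag.upper()}> is never closed", line)
        return self.error_count
```

Returning the error count lets the caller choose between continuing and
ceasing to process the document, both of which the text above allows.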
Priorities
The most important goal is the correct processing of valid HTML documents and
flagging invalid documents. This includes the requirement that text delimited
by elements which the program doesn't explicitly handle is passed through to
the output, as described in the functional requirements section, the
requirement that text
is full-justified and the support for optionally specifying the line width of
the output. Next is the processing of the various elements and attributes.
The ability to collect all links and append them to the output is the next
most important feature. When these requirements have been met, extensions to
the functionality may be added.
Interfaces to the environment
The program will be used as a Unix filter, that is, HTML documents will be
piped to it as input and the output will be piped to another program (such as
more or lpr). The program is free to allow the option of specifying filenames
for input and output instead of always using stdin and stdout.
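A command-line front end for such a filter might look like the Python sketch
below, which uses the standard argparse module. All option names are
assumptions made for illustration; the specification only requires that the
line width be configurable, that the link list be optional, and that stdin and
stdout be usable.

```python
import argparse
import sys


def build_arg_parser():
    """Options for the filter; every flag name here is hypothetical."""
    parser = argparse.ArgumentParser(description="Format HTML as ASCII text.")
    parser.add_argument("-w", "--width", type=int, default=80,
                        help="maximum output line width (default: 80)")
    parser.add_argument("-l", "--links", action="store_true",
                        help="append a form feed and the list of links")
    parser.add_argument("-i", "--input", default="-",
                        help="input file, or - for stdin (default)")
    parser.add_argument("-o", "--output", default="-",
                        help="output file, or - for stdout (default)")
    return parser


def open_streams(args):
    """Return (source, destination) text streams for the parsed options."""
    src = sys.stdin if args.input == "-" else open(args.input, "r")
    dst = sys.stdout if args.output == "-" else open(args.output, "w")
    return src, dst
```

Assuming the program were installed under the hypothetical name htmlfmt, it
could then be run as: htmlfmt --width 72 --links < page.html | more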
Adam Carlson