Problem 5: HTML Check

Background:

Unless you've been living under a rock for past 5 years, you've heard of the Web. Underlying all the web pages in the world is a klugey scripting language called HTML. HTML consists of a series of tags that the browser "typesets" to give you the page.

In this problem, you'll writing a small HTML checker. Given a HTML document as input, your program will have to say if it's valid or invalid.

An HTML tag is text that starts with a < and ends with a >. There are two types of tags, opening and closing. An opening tag has the format:

<name options>

while a closing tag has the format:

</name>

In both examples, name is the name of the tag we're interested in. options defines any options for that tag. In this problem, you may assume that options will not include any other HTML tag.

Closing tags may be optional, or required. A page is considered invalid if a closing required tag is not present.

An HTML document has two parts, the head and the body. The head section starts and ends with a head tag, while the body starts and ends with a body tag (see below). The head section is optional, and need not be present.

We're most interested in a subset of HTML. The following are the tags we're interested in.
Tag Name Description Required closing tag? extra info
html Starts and ends the document Y Must the first tag encountered. No tags should appear after the closing tag. It's ok for text to appear before the opening html tag, or after the closing tag, but no HTML tags may appear. This tag must appear in the document.
head Starts and ends a section with extra information about a document Y Must appear after the opening html tag, and must end before the opening body tag
title Sets the title of the document Y Must appear between opening and closing head tags
body Defines the start and end of the page "body" Y Must appear after the head closing tag (if present). The closing tag must be before the html closing tag. This tag must appear in the document
table Defines the start and end of a table Y Must appear within the body of an HTML document.
tr Defines the start and end of a table row N Only valid within an opening and closing table tag
td Defines start and end of a table data (1 cell) N Only valid within a tr tag
p Starts a new paragraph N Only appear within the body
br A line break N This tag does not have a closing tag. A page is invalid if you encounter a closing br tag.
h1,h2,h3,h4,h5,h6 All these tags are for `header' text (beginning sections, etc) Y This isn't one tag. This is 6 different tags. All of them must have a closing tag, and must appear in the body
center Centers all text between opening and closing tag Y Can only appear in the body
ul Starts an un-numbered (bulleted) list Y appears only in the body
li Creates a list item N Can appear only within a ul list

HTML text is to be considered if it meets all the above rules. It's invalid if it fails at least one rule (such as leaving off a required closing tag).

Of the tags, only html and body must appear in the document. html, body head, and title can appear only one time. If they appear anymore than that, then the page is invalid.

Those are the tags you have to check for. Occasionally, some companies decide to implement their own little extensions. If you get a tag that you don't know about, just remember that you saw an unknown tag. If some HTML text has some unknown tags, but is otherwise valid, the your program will consider it valid, and print a warning about the unknown tags. Note that since you can't tell if an unknown tag should have a closing tag or not, you should consider all unknown tags to have optional closing tags.

Input:

You will be given one HTML text file as input. Your program is to check to make sure that HTML text conforms to the above rules. You only need to check the tags, you do not have to check the options to the tags (you may assume that they are just "correct").

No tag will spread across a line. (ie, there will be no newline characters between the < and > of a tag). The closing and ending tags may be on different lines though.

All HTML tags will be in lower case. Furthermore, you can assume that, for an opening tag, there will be no spaces between a < and the tag name. For a closing tag, there will be no spaces between the < and the /, and no spaces between the / and name of the tag. there may be any number of spaces or characters between the name of the tag and the closing >.

As a simplification, you may assume that all tags will have their closing tags come after their opening tags (ie, you will never see a closing tag without having seen the opening tag).

Your program only needs to check the tags, and nothing else. ie, don't worry about whether or not all " have matching ".

Output:

The output of your program is exactly one of 3 lines:
Valid
If the HTML text is completely valid (ie satisfies all the above rules and has no unknown tags)
Valid, but has unknown tags
If the HTML text is valid, but has unknown tags
Invalid
The text doesn't meet one of the above rules (regardless of whether or not it has unknown tags)

Sample Input 1:


blah blah blah

<html>
<head></head>
<body>


<table>
<tr>
  <td>
<tr>
  <td>
</tr>
</table>
</body>
</html>

yakkity-yak


Sample Output 1:

Valid

Sample Input 2:


<html>
<body>

<center>

<table border=1>
<tr> <td> Boo!
</table>

Oops no closing center tag.
</body>
</html>

Sample Output 2:

Invalid

Sample Input 3:


<html>


<body>

<center><strong>Hi. valid, but unknown tags (strong)</strong></center>

</body>
</html>

Sample Output 3:

Valid, but has unknown tags