In this problem, you'll writing a small HTML checker. Given a HTML document as input, your program will have to say if it's valid or invalid.
An HTML tag is text that starts with a < and ends with a >. There are two types of tags, opening and closing. An opening tag has the format:
while a closing tag has the format:
In both examples, name is the name of the tag we're interested in. options defines any options for that tag. In this problem, you may assume that options will not include any other HTML tag.
Closing tags may be optional, or required. A page is considered invalid if a closing required tag is not present.
An HTML document has two parts, the head and the body. The head section starts and ends with a head tag, while the body starts and ends with a body tag (see below). The head section is optional, and need not be present.
We're most interested in a subset of HTML. The following are the tags we're interested in.
| Tag Name | Description | Required closing tag? | extra info |
| html | Starts and ends the document | Y | Must the first tag encountered. No tags should appear after the closing tag. It's ok for text to appear before the opening html tag, or after the closing tag, but no HTML tags may appear. This tag must appear in the document. |
| head | Starts and ends a section with extra information about a document | Y | Must appear after the opening html tag, and must end before the opening body tag |
| title | Sets the title of the document | Y | Must appear between opening and closing head tags |
| body | Defines the start and end of the page "body" | Y | Must appear after the head closing tag (if present). The closing tag must be before the html closing tag. This tag must appear in the document |
| table | Defines the start and end of a table | Y | Must appear within the body of an HTML document. |
| tr | Defines the start and end of a table row | N | Only valid within an opening and closing table tag |
| td | Defines start and end of a table data (1 cell) | N | Only valid within a tr tag |
| p | Starts a new paragraph | N | Only appear within the body |
| br | A line break | N | This tag does not have a closing tag. A page is invalid if you encounter a closing br tag. |
| h1,h2,h3,h4,h5,h6 | All these tags are for `header' text (beginning sections, etc) | Y | This isn't one tag. This is 6 different tags. All of them must have a closing tag, and must appear in the body |
| center | Centers all text between opening and closing tag | Y | Can only appear in the body |
| ul | Starts an un-numbered (bulleted) list | Y | appears only in the body |
| li | Creates a list item | N | Can appear only within a ul list |
HTML text is to be considered if it meets all the above rules. It's invalid if it fails at least one rule (such as leaving off a required closing tag).
Of the tags, only html and body must appear in the document. html, body head, and title can appear only one time. If they appear anymore than that, then the page is invalid.
Those are the tags you have to check for. Occasionally, some companies decide to implement their own little extensions. If you get a tag that you don't know about, just remember that you saw an unknown tag. If some HTML text has some unknown tags, but is otherwise valid, the your program will consider it valid, and print a warning about the unknown tags. Note that since you can't tell if an unknown tag should have a closing tag or not, you should consider all unknown tags to have optional closing tags.
You will be given one HTML text file as input. Your program is to check to make sure that HTML text conforms to the above rules. You only need to check the tags, you do not have to check the options to the tags (you may assume that they are just "correct").
No tag will spread across a line. (ie, there will be no newline characters between the < and > of a tag). The closing and ending tags may be on different lines though.
All HTML tags will be in lower case. Furthermore, you can assume that, for an opening tag, there will be no spaces between a < and the tag name. For a closing tag, there will be no spaces between the < and the /, and no spaces between the / and name of the tag. there may be any number of spaces or characters between the name of the tag and the closing >.
As a simplification, you may assume that all tags will have their closing tags come after their opening tags (ie, you will never see a closing tag without having seen the opening tag).
Your program only needs to check the tags, and nothing else. ie, don't worry about whether or not all " have matching ".
blah blah blah <html> <head></head> <body> <table> <tr> <td> <tr> <td> </tr> </table> </body> </html> yakkity-yakSample Output 1:
ValidSample Input 2:
<html> <body> <center> <table border=1> <tr> <td> Boo! </table> Oops no closing center tag. </body> </html>Sample Output 2:
InvalidSample Input 3:
<html> <body> <center><strong>Hi. valid, but unknown tags (strong)</strong></center> </body> </html>Sample Output 3:
Valid, but has unknown tags