Authoring Well-Formed HTML
4/8/2010
Well-formed HTML, or XHTML, simply means HTML that conforms to the rules of XML. This means that the same HTML tags are available, but the stricter XML syntax is required. An XSLT style sheet is itself XML and any HTML within it must be well formed.
In addition to HTML within an XSLT style sheet, you should consider authoring well-formed HTML for its own sake. The industry is moving toward well-formed HTML as a way to make the Web more robust, while simplifying and accelerating the processing of well-formed documents and data. Well-formed HTML has great advantages for authoring tools and can benefit hand authoring by ensuring that the markup is unambiguous. The industry expectation is that a future HTML standard will be an XML application.
The price for these benefits is that a less forgiving syntax must be used.
Writing well-formed HTML is simple. Here are the basic rules to follow as you author or convert to well-formed HTML.
All tags must be closed
HTML allows the end tags of certain elements to be omitted, most commonly <P>, <LI>, <TR>, and <TD>. XML requires every start tag to have an explicit end tag. The following table shows tags in basic HTML compared to well-formed HTML.
HTML | Well-formed HTML |
---|---|
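<P>This is a paragraph.<P>This is another paragraph. | <P>This is a paragraph.</P><P>This is another paragraph.</P> |
<LI>First item<LI>Second item | <LI>First item</LI><LI>Second item</LI> |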
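The rule can be checked mechanically by handing markup to an XML parser. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree as the parser; the helper name is_well_formed and the sample markup are illustrative and do not come from this page.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# End tags omitted, as plain HTML permits.
loose = "<UL><LI>First item<LI>Second item</UL>"
# Every start tag has an explicit end tag.
strict = "<UL><LI>First item</LI><LI>Second item</LI></UL>"

print(is_well_formed(loose))   # False
print(is_well_formed(strict))  # True
```

A parser-based check like this only tests well-formedness; it says nothing about whether the elements are valid HTML.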
Leaf nodes (elements that have no content and therefore no end tag) must also be closed, by placing a forward slash (/) immediately before the closing angle bracket of the tag. The most common examples are <BR>, <HR>, <INPUT>, and <IMG>. The following table shows leaf nodes in both basic and well-formed HTML.
HTML | Well-formed HTML |
---|---|
<BR> | <BR/> |
<HR> | <HR/> |
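As a small illustration, the sketch below (Python's xml.etree.ElementTree, with made-up sample markup) suggests that the self-closing form is simply shorthand for a start tag followed immediately by an end tag, while a bare leaf tag is rejected.

```python
import xml.etree.ElementTree as ET

# A leaf element written with the self-closing slash...
short_form = ET.fromstring("<P>one<BR/>two</P>")
# ...and the same element written with an explicit end tag.
long_form = ET.fromstring("<P>one<BR></BR>two</P>")

# Both spellings produce the same tree, so they serialize identically.
short_text = ET.tostring(short_form, encoding="unicode")
long_text = ET.tostring(long_form, encoding="unicode")
print(short_text == long_text)  # True

# The plain HTML spelling leaves <BR> open, so the parser rejects it.
try:
    ET.fromstring("<P>one<BR>two</P>")
    unclosed_ok = True
except ET.ParseError:
    unclosed_ok = False
print(unclosed_ok)  # False
```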
No overlapping tags are allowed
XML does not allow start and end tags to overlap; it enforces a strict hierarchy in which every element is closed inside the element where it was opened. The following table shows overlapping tags and their properly nested equivalent.
HTML | Well-formed HTML |
---|---|
This is <B>bold <I>bold italic</B> italic</I> | This is <B>bold <I>bold italic</I></B><I> italic</I> |
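To make the nesting rule concrete, here is a hedged sketch using Python's xml.etree.ElementTree. The two markup strings are invented for illustration; the first overlaps <B> and <I>, and the second splits the italic run so that each element closes inside its parent.

```python
import xml.etree.ElementTree as ET

overlapping = "<P>plain <B>bold <I>both</B> italic</I></P>"
nested = "<P>plain <B>bold <I>both</I></B><I> italic</I></P>"

# The overlapping version is rejected; keep the parser's message for inspection.
try:
    ET.fromstring(overlapping)
    overlap_error = None
except ET.ParseError as err:
    overlap_error = str(err)

# The nested version parses, and itertext() walks the hierarchy in document order.
tree = ET.fromstring(nested)
visible_text = "".join(tree.itertext())

print(overlap_error)
print(visible_text)  # plain bold both italic
```

The rendered text is the same in both versions; only the element structure changes.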
Case matters
XML is case-sensitive, so the case of each end tag must match its start tag exactly. Choose a consistent case for start and end tags. Generally, try to use uppercase for HTML elements. The following table shows how case matching should appear in well-formed HTML.
HTML | Well-formed HTML |
---|---|
<B>bold text</b> | <B>bold text</B> |
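The sketch below, again assuming Python's xml.etree.ElementTree and illustrative markup, shows two consequences of case sensitivity: a start tag and end tag that differ only in case do not match, and <B> and <b> are treated as two different elements.

```python
import xml.etree.ElementTree as ET


def parses(markup: str) -> bool:
    """Hypothetical helper: True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Start and end tags differ only in case, so they do not pair up.
mixed_case_ok = parses("<B>bold</b>")
print(mixed_case_ok)  # False

# Both spellings are legal on their own, but they name different elements.
doc = ET.fromstring("<DIV><B>one</B><b>two</b></DIV>")
child_tags = [child.tag for child in doc]
print(child_tags)  # ['B', 'b']
```

Because the two spellings are distinct names, a style sheet or script that looks for one will not find the other, which is a practical reason to pick one case and keep it.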
Quote your attributes
All attribute values must be surrounded by either single or double quotation marks. The following table shows how to quote attribute values.
HTML | Well-formed HTML |
---|---|
<IMG SRC=sample.gif WIDTH=10 HEIGHT=20> | <IMG SRC="sample.gif" WIDTH="10" HEIGHT="20"/> |
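As a quick check of the quoting rule, the following sketch parses the same element with unquoted, double-quoted, and single-quoted attribute values. It assumes Python's xml.etree.ElementTree; the file name sample.gif and the attribute values are placeholders.

```python
import xml.etree.ElementTree as ET

# Unquoted values are an XML syntax error.
try:
    ET.fromstring("<IMG SRC=sample.gif WIDTH=10/>")
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False
print(unquoted_ok)  # False

# Either kind of quotation mark is accepted and yields the same attributes.
double_quoted = ET.fromstring('<IMG SRC="sample.gif" WIDTH="10"/>')
single_quoted = ET.fromstring("<IMG SRC='sample.gif' WIDTH='10'/>")
print(double_quoted.attrib == single_quoted.attrib)  # True
print(double_quoted.attrib["WIDTH"])                 # 10
```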
Use a single root
Shortcuts that eliminate the <HTML> element as the single top-level element are not allowed. The following table shows how to properly include the <HTML> element.
HTML | Well-formed HTML |
---|---|
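<TITLE>Sample</TITLE><BODY><P>Text</P></BODY> | <HTML><HEAD><TITLE>Sample</TITLE></HEAD><BODY><P>Text</P></BODY></HTML> |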
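A parser makes the single-root requirement easy to see. In this sketch (Python's xml.etree.ElementTree, with invented sample markup), the version without <HTML> has two top-level elements and is rejected, while the wrapped version parses with <HTML> as its root.

```python
import xml.etree.ElementTree as ET

# Two sibling elements at the top level: not a single-rooted document.
fragment = "<TITLE>Sample</TITLE><BODY><P>Hello</P></BODY>"
# The same content wrapped in one top-level <HTML> element.
document = (
    "<HTML><HEAD><TITLE>Sample</TITLE></HEAD>"
    "<BODY><P>Hello</P></BODY></HTML>"
)

try:
    ET.fromstring(fragment)
    fragment_ok = True
except ET.ParseError:
    fragment_ok = False
print(fragment_ok)  # False

root = ET.fromstring(document)
print(root.tag)                       # HTML
print([child.tag for child in root])  # ['HEAD', 'BODY']
```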
Fewer built-in entities
XML defines only the following minimal set of built-in character entities:
- &lt; (<)
- &gt; (>)
- &amp; (&)
- &quot; (")
- &apos; (')
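Numeric character entities are also supported. HTML named entities outside this set, such as &nbsp;, are not built in, so replace them with the numeric form (for example, &#160; for a nonbreaking space).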
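The following sketch illustrates the entity rules with Python's standard library. It assumes xml.sax.saxutils.escape for producing the built-in entities and xml.etree.ElementTree for parsing; the sample strings are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() handles &, <, and > by default; the extra mapping covers both quotes.
raw = 'Tom & "Jerry" <cat>'
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped)  # Tom &amp; &quot;Jerry&quot; &lt;cat&gt;

# The parser turns the built-in entities back into the original characters.
round_trip = ET.fromstring("<P>" + escaped + "</P>").text
print(round_trip == raw)  # True

# An HTML-only named entity is not defined in plain XML...
try:
    ET.fromstring("<P>a&nbsp;b</P>")
    nbsp_ok = True
except ET.ParseError:
    nbsp_ok = False
print(nbsp_ok)  # False

# ...but the equivalent numeric character entity is accepted.
numeric_text = ET.fromstring("<P>a&#160;b</P>").text
print(numeric_text == "a\u00a0b")  # True
```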
Escape script blocks
Script blocks in HTML can contain characters that cannot be parsed, such as < and &. These must be escaped in well-formed HTML by using character entities, or by enclosing the script block in a CDATA section.
The following table shows an HTML script block that contains both a character that cannot be parsed (<) and JScript comments. The well-formed script block uses CDATA to encapsulate the script.
HTML | Well-formed HTML |
---|---|
|
|
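As a hedged illustration of the CDATA approach, the sketch below wraps a small script in a CDATA section and parses it with Python's xml.etree.ElementTree. The function less and its comment are invented sample content.

```python
import xml.etree.ElementTree as ET

# Sample script: one single-line comment and one "<" that XML cannot parse as text.
script_body = """
// Return true when x is smaller than y.
function less(x, y) { return x < y; }
"""

unescaped = "<SCRIPT>" + script_body + "</SCRIPT>"
wrapped = "<SCRIPT><![CDATA[" + script_body + "]]></SCRIPT>"

# Without escaping, the "<" inside the script is read as the start of a tag.
try:
    ET.fromstring(unescaped)
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)  # False

# Inside CDATA the script text, including its line breaks, comes through unchanged.
recovered = ET.fromstring(wrapped).text
print(recovered == script_body)  # True
```

The alternative named above, character entities, would mean writing the comparison as x &lt; y, which parses equally well but makes the script harder to read and edit.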
Not all scripts fail if they are not escaped in this way; however, Microsoft recommends escaping them as a matter of habit. This ensures that the script not only works if it contains special characters or comments now, but also continues to work if such characters are added in the future.
In addition, single-line comments in Microsoft JScript® (compatible with the ECMA 262 language specification) terminate at the end of the line, so it is important to preserve the white space within script blocks that contain comments. By default, the xml:space attribute value normalizes white space by compressing adjacent white space characters into a single space. This destroys the line break that terminates the JScript comment. Any JScript following the comment is then treated as part of the comment and ignored, often resulting in script errors. Enclosing the script in a CDATA section also ensures that the white space is preserved.
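The effect of white space normalization on a comment can be modeled in a few lines. The sketch below is a simplification written in Python: the normalization step is approximated by collapsing every run of white space to one space, and executable_lines is a hypothetical stand-in for a script engine that only knows how to skip single-line comments.

```python
# Sample script: a single-line comment followed by two statements.
script = "// reset the counter\ncount = 0;\nstart();"

# Rough model of normalization: every run of white space, including
# line breaks, collapses to a single space.
normalized = " ".join(script.split())


def executable_lines(source: str) -> list:
    """Return the lines that are not // comments (hypothetical helper)."""
    return [
        line for line in source.split("\n")
        if not line.lstrip().startswith("//")
    ]


print(executable_lines(script))      # ['count = 0;', 'start();']
print(executable_lines(normalized))  # []
```

After normalization the whole script sits on the comment's line, so both statements are swallowed by the comment.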