David Louis Edelman

The Joy of Strict XHTML

I’ve recently discovered something else the Mozilla Firefox browser can do that Microsoft’s Internet Explorer can’t: Firefox can accept documents using the “application/xhtml+xml” header.

Who gives a shit? you might be thinking to yourself. Wait, I’ll explain. This might actually change your life someday.

For years, people have been writing web pages using the dated and somewhat arbitrary HTML 4 specification. If you don’t know what HTML looks like, take a look at the source code on any web page (by going to the “View” menu and selecting “Page Source” in Firefox or “View Source” in IE).

The problem is that during the web browser wars of the ’90s, Microsoft and Netscape both decided that they wanted their browsers to be as inclusive as possible. You could be a sloppy or an amateur coder, make all kinds of errors in your HTML, and the browser would silently compensate for you. For instance, the proper way to create a bulleted list is by using this code:

<ul>
<li>apples</li>
<li>oranges</li>
<li>bananas</li>
</ul>

But you could just as easily get away with typing this instead:

<UL>
<Li>apples
<li>oranges<lI>
<li>bananas
</ul></Ul>

Now this sort of expansiveness worked really well when the Web was new and getting people to buy the entire concept was the name of the game. You didn’t need to be a programming geek to get your chicken tortilla soup recipe before the masses; all you needed was a half-hour tutorial in HTML and you were on your way.

But we’ve entered a new phase in the development of the Internet. Web 2.0 has arrived, to use the popular catchphrase. And though you’ll hear a lot about how social networking and sharing apps are what Web 2.0 is all about, the truth is that Web 2.0 is about machines talking to machines. I write this blog entry in WordPress blogging software, which talks to the MySQL database holding all the information; talks to Ping-o-Matic and tells it to alert various search engines; and talks to your feedreading client and tells it that I’ve written a new entry. Ta-da! Machines talking to machines.

So what’s the problem with sloppy HTML? Machines can’t understand it. Which means that Google (to take one example) has to use elaborate parsing algorithms in order to turn your website into something it can understand. And while it’s all very well and good for a mega-corporation like Google to build these fuzzy linguistic interpreters, the next guy who wants to market a simple web service in his basement doesn’t have that kind of luxury.

It also means that different browsers interpret your web pages differently, and therefore display things differently. If you use Internet Explorer and you come across a page with a missing or malformed doctype declaration (a telltale sign of sloppy code), the browser goes into “Quirks mode” (I swear I’m not making that up) and tries to figure out what the hell you’re trying to do. Many web programmers simply code for what looks good in Internet Explorer 6 on Windows, even if it’s “wrong” or “broken” code, and to hell with all the Safari, Firefox, Mozilla, Netscape, Opera, Konqueror, Lynx and Flock users.

Enter XHTML.

XHTML is basically the HTML language, cleaned up. It’s HTML after six weeks of boot camp under a hard-ass drill sergeant. You have strict rules, and those rules must be obeyed. Take the bulleted list code above. In proper XHTML, you cannot capitalize any of the tags. You must close each tag so that for every opening <ul> there is a closing </ul>. You can’t put an <li> tag outside of a <ul> tag floating around on its own.
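
As a rough illustration of what "strict" means to a machine, the sketch below feeds both versions of the bulleted list to the XML parser in Python's standard library. Python is an arbitrary choice here, and the names CLEAN_LIST, SLOPPY_LIST and is_well_formed are made up for the example. One caveat: this parser only checks that the markup is well-formed (every tag closed, properly nested, with matching case). A rule like "no <li> outside a <ul>" is a validity rule, and checking it would take a validating parser plus the DTD.

```python
import xml.etree.ElementTree as ET

# The tidy list from earlier in the post.
CLEAN_LIST = """<ul>
<li>apples</li>
<li>oranges</li>
<li>bananas</li>
</ul>"""

# The sloppy list that old-school browsers happily forgive.
SLOPPY_LIST = """<UL>
<Li>apples
<li>oranges<lI>
<li>bananas
</ul></Ul>"""


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Unclosed or mismatched tags land here.
        return False


print(is_well_formed(CLEAN_LIST))   # True
print(is_well_formed(SLOPPY_LIST))  # False
```

A program that reads the first list needs about three lines of code. A program that reads the second needs to guess, which is where the "elaborate parsing algorithms" come in.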

You can debate the merits of sending unruly teenagers to military school all you want, but for web pages there’s no debate. Strictly followed XHTML makes things easier for the machines that read your code. If we all followed the rules to the letter, Google would have a much easier time categorizing websites and it would save us all a lot of time.

Now here’s the problem: most web browsers process XHTML like normal HTML. They applaud your good manners and give you a gold star for coding correctly, but they’ll still quietly clean up after you when you make a mistake.

Until Mozilla Firefox. All you need to do to turn Firefox into an A-1 hard-ass drill sergeant is to (1) assign your web page the XHTML Strict DTD and (2) have your web server send the page to the browser as application/xhtml+xml instead of text/html. (You can look up how to do this elsewhere if you care. It’s basically just two lines of code.) Once you do this, Firefox will stop the display of your website cold if you’ve made any errors that break the rules of XML. Missed a closing </p> tag? Left an ampersand unescaped? Opened a tag in lowercase and closed it in capitals? Tough shit. Your page does not display, and you see a yellow XML Parsing Error instead.

(It’s important to understand that Firefox will only do this on a page-by-page basis, when that page and web server tell it to. If you don’t give Firefox these instructions, it will “fail gracefully” and display sloppily coded pages just like any other browser.)
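
For the curious, the two steps might look something like the sketch below. It is only an illustration in Python, not the setup used on this site. The doctype string is the standard XHTML 1.0 Strict declaration, while choose_content_type and build_page are hypothetical helper names, and the Accept header in the usage lines is invented for the example. The sketch sends application/xhtml+xml only to browsers whose Accept header asks for it, so everyone else still gets text/html. It ignores q-values for simplicity.

```python
from xml.sax.saxutils import escape

# Step 1: the XHTML 1.0 Strict document type declaration.
XHTML_STRICT_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)


def choose_content_type(accept_header: str) -> str:
    """Step 2: pick the MIME type to send in the Content-Type header.

    Simplified: looks only for the literal type name and ignores q-values.
    """
    accepted = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    if "application/xhtml+xml" in accepted:
        return "application/xhtml+xml"
    return "text/html"


def build_page(title: str, body_markup: str) -> str:
    """Wrap body markup (assumed already well-formed) in a Strict skeleton."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f"{XHTML_STRICT_DOCTYPE}\n"
        '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>{body_markup}</body>\n"
        "</html>\n"
    )


# Illustrative header values only, not any real browser's exact header.
print("Content-Type:", choose_content_type("text/html,application/xhtml+xml;q=0.9"))
print(build_page("Fruit", "<ul><li>apples</li></ul>"))
```

The fallback to text/html is the design choice that matters: a browser that never asks for application/xhtml+xml keeps getting the forgiving treatment, which is the page-by-page behavior described above.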

This just might change the world.

Here’s how it might work: (1) Web programmers start migrating towards Strict XHTML. (2) Web services begin to interpret properly coded websites better than sloppily coded sites. (3) Web programmers flock to Strict XHTML in droves so their sites aren’t penalized. (4) The creators of these web services eventually decide to stop processing pages that are incorrectly coded altogether because it’s too much of a hassle. (5) The overhead for creating a useful web service goes down drastically. (6) Useful web services multiply exponentially. (7) You can search Google for “Northern Virginia Mexican restaurants,” and Google will no longer suggest that the “Pamela Anderson Britney Spears Katie Holmes Nude Sex Tits!!!” website might be the one you’re looking for.

And the world will be a better place.

Comments

  1. Jonny Axelsson on April 28, 2006 at 9:34 am

    XHTML creating world peace, or eliminating spam from the world, is going a few steps too far. XML (and XHTML) removes some nasty problems, as well as some problems that are not really problems, and it is more strict than it needs to be in some cases. A nasty problem is when tags are improperly intermingled, because there is no proper way to untangle it. The WHATWG is working on a way, though.

    A problem which is not usually a problem (but only covered by validating processors in HTML or XHTML) is when an invalid element is used. For instance you are not allowed to put text directly into a ‘blockquote’ in (X)HTML Strict; you must put it into a block element like ‘p’ inside ‘blockquote’. If you don’t do that nothing untoward will happen; any browser as well as Google can handle that.

    A case where XML is stricter than ideal is the example of case-sensitivity. If you start an unordered list with a lower-case ‘ul’ and end it with an upper-case ‘UL’ this will make the XHTML handling grind to a halt (not if you use ‘UL’ in both start and end tags, but it wouldn’t be recognised as an unordered list anymore). For (X)HTML, handling case-insensitivity would be easy since the elements are all US-ASCII, but there were trickier cases for other Unicode characters, forcing a strict case-sensitivity.

    By the way, both Opera and Konqueror handle XML (and thus XHTML) by the rules. Lynx and other HTML browsers would fall by the wayside, though, and while I haven’t followed this part of the process, it is possible that IE7 will do so too.

  2. David Louis Edelman on April 28, 2006 at 10:54 am

    Forgive a guy for being an optimist. :-) I don’t really think that we’re all going to be sitting in a techno-utopia sipping lattes on the beach because of XHTML… but it’s a nice start. Standards compliance of any kind is really the key, like interchangeable parts were the key for the automotive industry.

    As for IE7, I was disappointed to discover that it will not support application/xhtml+xml. Part of Microsoft’s obsession with backwards-compatibility, I guess. The MS dude at the IEBlog claims that he’s doing this to help XHTML’s long-term prospects — which seems like dubious logic to me, but then again, I don’t control how 85-90% of the world views the web.
