Introducing (X)HTML

The first thing you need to know about HyperText Markup Language (HTML for short) and eXtensible HyperText Markup Language (XHTML) is that they're simpler than they look. The main purpose of these languages was, and still is, to provide a common format for documents that can be understood by as many different computers as possible. The program used to interpret these documents is known as a browser; the result is a webpage. A group of these documents hosted at the same domain (general web address) is known as a website.

The main point of these documents (or at least it should be) is the display of content in an easily readable, usable format. The content varies widely. Wikipedia's content is mostly (hopefully factual) information on various topics. DeviantArt's content is mostly images; its galleries are links to pages on which these images are displayed and described. YouTube's content is videos along with descriptions thereof. Companies use websites as both advertisements and online stores, while people the world over create websites to air their views, display things from their lives, or show off their artwork, literature, and so on.

Only rarely are websites built for their own sake—and the results are seldom good.

What HTML And XHTML Are

Anything written, either on paper or onscreen, is almost impossible to read unless it's broken up into paragraphs, lists, or tables, whichever is most appropriate for the information displayed. Means of emphasizing or otherwise highlighting words add further clarity to the content. In websites, content can be grouped together according to function: What is the main content? How do you get around the website? Are there footnotes, headers, or side notes? Does the browser need extra files, such as images?

This sort of organization and information is what HTML and XHTML were created for. While it is possible to use them to give visual effects to websites (and for a while HTML was the only reliable means of doing so), that's not their job. Their job is and has always been adding structure and extra meaning to the website's content.

How HTML and XHTML Differ

HTML 4.01 and XHTML 1.0 have almost no differences in vocabulary (that "almost" will be explained in this book). That was deliberate; it means a browser can read a document written in XHTML 1.0 as if it were a document written in HTML. This was done to allow a smooth transition between the two languages.

The major difference is that XHTML is built on a metalanguage (a computer language used to create computer languages) called eXtensible Markup Language, or XML. The most important thing about XML is that it must be valid, or more precisely well-formed: it must adhere to the syntax rules of XML for the browser to be able to read it. If the XML isn't well-formed, a browser that is reading the document as XML won't be able to process it and will simply show an error.

With that warning, though, come two pieces of good news. First, making sure that XML is valid is no harder than coding with any markup language and keeping an eye on things. Second, following the rules of XML when writing HTML is usually a good idea anyway (I'll also explain that "usually"), and a well-written HTML document can often be converted into an XHTML document with little difficulty. I'll explain the various aspects of valid XML as we get to them.
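
To make that strictness concrete, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser as a stand-in for a browser's XML parser. The helper name is_well_formed and the two sample strings are made up for illustration; a real browser has its own parser and its own way of showing the error.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False if it rejects it."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # The parser gives up on the whole document at the first broken rule.
        return False
    return True

# Hypothetical sample fragments, not taken from any real page.
GOOD = "<p>Every tag is <em>closed</em> in the right order.</p>"
BAD = "<p>This <em>emphasis is never closed.</p>"

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

The second fragment is missing only one closing tag, yet the parser rejects all of it rather than guessing what was meant.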

Because there are so many similarities between HTML and XHTML, their abbreviations are usually combined into (X)HTML, or (eXtensible) HyperText Markup Language. From now on, I will use the abbreviations HTML and XHTML when I am referring to one language and excluding the other, and (X)HTML when what I'm saying applies to both.

A Bit Of History

HTML, along with what would become the World Wide Web, was created around 1990 by Sir Tim Berners-Lee, then a contractor at the European Organization for Nuclear Research (CERN). HTML itself, described in the document HTML Tags (http://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/Tags.html), was the language used to structure the information.

Berners-Lee also created the first browser, WorldWideWeb (later Nexus), to interpret that structure into a readable document. This browser could be built (that is, the code taken and turned into an actual program) on many different computers. Since the different builds of WorldWideWeb/Nexus interpreted HTML uniformly, researchers could share information without having to worry about the differences between their computers, which was a major issue at the time and still is. Look up the issues between Apple and Adobe Flash if you don't believe me.

The World Wide Web and HTML didn't stay a scientist-only gig for long. Elements beyond what was originally in HTML were added, including the img element. That's right, images weren't originally used in webpages.

In 1993, Mosaic was released and became the first popular browser. It was also the first browser to display images as part of the page instead of in a new window. Other elements were proposed and implemented, and in 1995 the IETF published HTML 2.0 (RFC 1866), standardizing current practices.

But something else happened in 1995 that would result in the very opposite: Microsoft's Internet Explorer 1 was released, going head-to-head with Netscape Navigator. The First Browser War was on. Younger readers may not remember seeing the following two graphics: "Best Viewed With Netscape Navigator" and "Best Viewed With Internet Explorer." Older ones surely do; these were everywhere in those days. The First Browser War resulted in a lot of proprietary elements (for example, blink would work in Netscape and marquee would work in Internet Explorer, but not vice versa) and other innovations, such as Netscape's JavaScript and IE's answer, JScript, along with the first browser to implement CSS: Internet Explorer 3.0. Around 1998, Netscape was acquired by America Online. By 2002, Internet Explorer held over 90% of the browser market share. The First Browser War was over, leaving users with browsers that had plenty of features but also loads of security holes, bugs, and compatibility problems.

Still, the W3C tried to keep HTML standardized, releasing HTML 3.2 in 1997, which allowed for tables and image maps, and HTML 4.0 later that year. HTML 4.0 was updated in 1999 as HTML 4.01—still the gold standard in markup.

In recent years, a second browser war has begun, this time focused not on features but on security and adherence to coding standards, and new features are being decided not by browsers but by the W3C itself. Internet Explorer is now pitted not against a single rival company but against a number of opponents: Netscape Navigator's successor Mozilla (specifically Firefox), Apple's Safari, Opera, and Google Chrome, all but one (Opera) being built on open source projects.

The W3C also remade HTML as XHTML, releasing version 1.0 (based on HTML 4.01) in 2000 and XHTML 1.1 (based on XHTML 1.0 Strict) in 2001, and proceeded to work on XHTML 2.0 (based on what the experts thought the web should use). XHTML 2.0 was never really adopted. To begin with, it had very little to do with classic HTML and was not backwards-compatible, which meant that it was simply another language that browsers had to support, even though it technically did the same thing as HTML. Many developers thought that the W3C had lost touch with them, and in 2004 the Web Hypertext Application Technology Working Group (WHATWG) was created outside of the W3C to start updating the venerable language again. As of this writing, HTML5 remains a work in progress.

A Few Quick Definitions

Browser
The program you use to view a webpage. Examples are Internet Explorer, Mozilla Firefox, Opera, Safari, Konqueror, Netscape, and Lynx (a text-only browser).
Character Encoding
How a program that reads text interprets each character in the document. It is best to include code that defines this to avoid confusing browsers. A famous example of character encoding confusion is typing "this app can break" (or, as conspiracy theorists preferred, "Bush hid the facts") into older versions of Windows Notepad, saving the file, closing it, and reopening it. Because Notepad guesses the wrong encoding when it reads the file back, the text appears as gibberish resembling Chinese characters.
Code, Markup
In this context, markup and code refer to the same thing: the text a browser interprets in order to properly display your content in the webpage. Unlike content, you don't see the code on the screen; you see its effects instead. For our purposes, code refers to the actual HTML. Although the words that are used in HTML code are based on English words, it is not actual English.
Content

What you want to display on the Internet via your webpage. It can be:

  • text such as stories or information,
  • images such as diagrams or artwork,
  • animations or movie clips,
  • sound files such as music or spoken instructions,

or any other media you can think of. Essentially, it's what you see on your computer screen (with the exception of sound files, obviously) and the entire point of a website—or at least it should be.

Document
The file that contains the content and code for a webpage.
Render
To display a webpage or part of a webpage.
User Agent
A program used for communicating over a network. The most common type, at least as far as the World Wide Web is concerned, is a browser, though this also covers speech synthesizers and other programs that read webpages.
Webpage
The result your browser displays after interpreting a document.
World Wide Web
The portion of the Internet which is used for webpages.
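
The mix-up described under Character Encoding above can be reproduced in a few lines. This is a rough Python sketch, assuming the text was saved with one byte per character and then read back as UTF-16 (two bytes per character); the variable names are illustrative only.

```python
TEXT = "Bush hid the facts"

# Saved one byte per character: 18 characters become 18 bytes.
raw = TEXT.encode("ascii")

# Read back with the wrong encoding: every pair of bytes becomes one character.
misread = raw.decode("utf-16-le")

print(len(raw), len(misread))
# ascii() shows the characters as escape codes, so this prints on any terminal.
print(ascii(misread))

# Read back with the encoding it was saved in, the text is intact.
print(raw.decode("ascii"))
```

Nothing in the bytes changed between the two readings; only the reader's assumption about the encoding did. That is why it pays to state the encoding rather than leave a program to guess.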

Some Advice

Before I tell you how to create a webpage, I wish to make something very clear: you must never confuse a browser or let it guess at what you want. Browsers, like all programs, work purely on logic and always follow the instructions they are given exactly, even if those instructions aren't what you wanted. Most browsers are programmed to guess at what to do when given improper instructions (usually the result of taking shortcuts with code), but different browsers guess in different ways, leading to inconsistency. This is why you never want to confuse a browser or let it guess.

Furthermore, if you confuse a browser using XML, the browser may refuse to render anything and simply display an error message.
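
The guessing problem can be seen even without a browser. The sketch below uses Python's standard html.parser module as a stand-in for a forgiving HTML reader. The class TagLogger, the helper log_tags, and the sample markup are hypothetical names made up for this illustration, and real browsers apply their own recovery rules.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record each opening and closing tag in the order the parser meets it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

def log_tags(markup):
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events

# Shortcut-ridden markup: b closes while i is still open, and p never closes.
SLOPPY = "<p><b>bold <i>both</b> italic?</i>"

for event in log_tags(SLOPPY):
    print(event)
```

The parser reports the tags in the order it meets them and never complains about the overlap or the missing closing tag. Deciding what that structure means is left to whoever reads the events, and that is exactly the guess you don't want a browser making for you.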

With that in mind, on to the basics of all markup languages, one of which is (X)HTML.