Take Control of Your Information! (Part 1)

I started a post last week about the future of web content, and the problems of content distribution in web 2.0, and why we should all adopt open standards to structure everything we write on the Internet. The piece got much too long, so I split it into three parts. Here’s part 1.

***

Remember how the experts told us that computers were supposed to eliminate paper — and then for twenty years we wallowed in ream after ream of wasted paper? It was a running joke for a generation.

Now we see that computer technology really does eliminate the need for paper. My laser printer mostly sits idle these days. I primarily use it for two things: (1) printing out copies of MultiReal for editing and publishing purposes, and (2) printing out driving directions, because syncing them to my Treo is such a goddamn pain in the ass. The rest of the paper in my house is going away too. I can’t remember the last time I sent someone an actual letter. I gather most of my news through websites and RSS feeds instead of newspapers, and 90% of the mail is either advertising I don’t want or paper bills I should be receiving electronically.

So why did it take twenty years for paper to go away? Why was there such a huge boom in paper use after computers became ubiquitous?

A lot of it had to do with simple economics. Computer usage needed to hit a certain critical mass. Computer screens needed to be large enough, cheap enough, bright enough, and portable enough to serve as a comfortable substitute for the printed page. Adobe needed to see the business case for creating a technology like PDF.

But in addition to all that, people just didn’t see the potential of digital media. Some people still fail to see it. There’s a large segment of the population that sees the Internet as simply a convenient distribution system for paper. Attorneys are notorious for this, and doctors are the same way, but plenty of other businesses have this misconception too. Suffice it to say that any company whose website relies mostly on downloadable, printable PDFs to convey its product information just doesn’t get it.

The battle against paper is largely over. I’m bringing all this up because there’s another equally important battle looming, and that’s the battle against unstructured content. Turning all those stacks of paper into gigabytes of 1’s and 0’s was only the very, very first step. Just like there were many people twenty years ago who didn’t understand the benefits of email over paper mail, there are many people today who don’t understand the benefits of smart, structured content over unstructured content. Just like we understand now that the Internet isn’t just a distribution system for paper, we need to understand that it’s not just a distribution system for big globs of text.

The kicker is that the stakes are much higher this time around. I’ve recently come to the conclusion that plain, unstructured content is not only inconvenient in the long run, it’s dangerous. We need to develop open standards to structure all the electronic content we create, and we need to start insisting that everyone use them. If we don’t, we’re quickly going to find ourselves facing a monopoly that’s much more pernicious and dangerous than Microsoft’s ever was, and that’s a Google monopoly on information.

Let me back up.

What do I mean when I talk about structured vs. unstructured content? Well, here’s an example of unstructured content. You might recognize it as the opening sentences of (inevitable plug here) my novel Infoquake:

Natch was impatient. He strode around the room with hands clasped behind his back and head bowed forward, like a crazed robot stuck on infinite loop. Around and around, back and forth, from the couch to the door to the window, and then back again.

Here’s the typical way we structure that content today:

<html>
<head>
<title>Excerpt from David Louis Edelman's Infoquake</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="keywords" content="Infoquake, science fiction, sf, David Louis Edelman" />
<meta name="description" content="Chapter 1 of David Louis Edelman's science fiction novel, Infoquake." />
</head>
<body>
<p>Natch was impatient. He strode around the room with hands clasped behind his back and head bowed forward, like a crazed robot stuck on infinite loop.
Around and around, back and forth, from the couch to the door to the window, and then back again.</p>
</body>
</html>

As you can see, the only widely used open standards we have for describing web content now are things like meta tags and Technorati tags. It’s a rather piss-poor way of structuring content; meta information only gives you the most basic level of data about what’s on a particular web page. It’s pretty astounding that 15 years after the Internet took off, you still can’t reliably search the web by author or date.

The Dublin Core Metadata Initiative has been trying to get people to use other standardized tags like author, copyright, and language for years now, but on the web few people do. Tagging through blogging software helps to a point, but it still depends on a human editor to segregate out the important topics. Different people are going to have different tagging schemes and different conceptions of what’s important.
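
For the curious, here’s roughly what the head of that Infoquake excerpt page would look like with a handful of Dublin Core tags bolted on, using the DCMI convention of a schema link plus DC-prefixed meta tags. (The values below are just illustrative, not markup from the actual page.)

<head>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<title>Excerpt from David Louis Edelman's Infoquake</title>
<!-- author, date, language, and rights finally get machine-readable tags of their own -->
<meta name="DC.creator" content="David Louis Edelman" />
<meta name="DC.date" content="2006" />
<meta name="DC.language" content="en" />
<meta name="DC.rights" content="Copyright David Louis Edelman" />
</head>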

Unfortunately, meta tags are just a structured crust on top of a very large unstructured pie. Even if everyone adopted all of the Dublin Core tags and responsibly labeled their blog posts with pertinent keywords tomorrow, you’d still have the problem of dealing with the rest of the content inside those “body” tags. Right now, unstructured content makes up basically 98% of what we throw onto the World Wide Web. Every time you open a text box somewhere, type a bunch of crap in, and click “Publish” — like I’m doing right now — you’re creating unstructured content.

The problem with unstructured content is that you and I can’t do anything with it. The only way we have of making sense of all the data in between the body tags of any HTML document is to read through it. Either that or we can skim through the headlines and the bolded words in the hopes that the author took the time to emphasize what was important. Try to do any kind of real aggregate analysis on large blobs of unstructured content, and you’re going to have a difficult time.

Trial attorneys know this better than anyone. If you’re preparing for a trial and you’ve got boxes and boxes of information to sort through, you need to know more than just the title and author of any particular document. You need to know who sent each piece of correspondence where, and when, and why, and which of your clients’ trade secrets they were mentioning, and whether they violated their nondisclosure agreements while they were doing it. Until recently, there was really nothing you could do except hire lots of interns and paralegals to read through these documents one by one, and then input basic meta information into a database.

That’s no longer true. Now there are companies out there making great strides in understanding unstructured information. Companies like Google, Yahoo, and Amazon. Yahoo can look at your web page and sell ads based on the products you mention. Amazon can look at your web page and sell books based on that content. Google can look at your web page and find the street addresses, give you directions for how to get there, and sell ads for hotels along the way.

Writing a regular expression that can scrape web pages for street addresses and phone numbers is child’s play. But Google is taking this to a whole new level. They can figure out when you’re searching for apples to cook with and when you’re searching for Apples to make podcasts with. When I do a search for French restaurants, Google knows from my past search history that I’m likely to be searching for French restaurants in the Washington, DC area, and they’ve sold advertising to the DC Yellow Pages.

The fact of the matter is that Google knows how to parse the information on your website into chunks that are meaningful and useful to businesses, governments, etc. You don’t. Information wants to be free, right? The information you publish on the web is free, but by itself it’s useless. Nobody’s going to know it’s there. You need to go through the gatekeepers who have the money to comb through trillions and trillions of pages and make sense of it all.

In the Information Age, power will rest with those who can understand, aggregate, and manipulate information on a mass scale. I’m not just talking about tech companies like Google, Amazon, and Yahoo here — I’m also talking about Experian, Equifax, and TransUnion. The insurance companies that aggregate your medical data, the state and federal agencies that collect your tax information, the data mining companies that analyze your spending habits. (And if this is the first time you’re hearing this sentiment, please go trade in your computer for an Xbox. You’re beyond help.)

Google understands this. In fact, so eager are they to take and parse your information that they’re practically begging you to upload it. Take a look at their Google Base project. Please, we don’t care what kind of information you’ve got, come to our site and upload it into our structured databases for free! We’ll host the data for free, we’ll give you access to it, just come on over! We don’t care if it’s just a list of the shoelaces you own — type it on in!
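
Google Base will even take your data in bulk. If memory serves, you can feed it a plain RSS file with Google’s own g: attributes layered on top, something roughly like this (the shoelace listing here is invented, and the exact attributes vary by item type):

<?xml version="1.0"?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
  <channel>
    <title>Stuff I own</title>
    <link>http://www.example.com/</link>
    <description>A Google Base bulk upload feed</description>
    <item>
      <title>Brown leather shoelaces</title>
      <description>One pair, lightly used.</description>
      <!-- structured attributes that Google can query directly -->
      <g:id>shoelaces-001</g:id>
      <g:price>2.00 USD</g:price>
      <g:condition>used</g:condition>
    </item>
  </channel>
</rss>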

Google is rapidly becoming the primary gatekeeper between your personal information and the world. So far, with a few exceptions, they’ve largely abided by their motto of doing no evil. The data they collate is free and open to the public, and they’ve provided plenty of free tools and APIs for the hoi polloi to access it.

But what happens when their stock price takes a nosedive after a bad sales quarter, and Experian approaches Google with an offer to mine their databases for information to feed into your credit report? What happens when government agencies approach Google and ask for their help in tracking down tax evaders or deadbeat dads or illegal aliens? Do you want to depend on the CEO of a multi-billion-dollar company putting his hand over his heart and saying “trust me”? I don’t.

What’s the antidote to this? The antidote is for us, the users, to find ways to structure our content ourselves so we can make sense of the blob of information between the “body” tags. If we can structure this data ourselves, we can impose some measure of control over it. Not only that, but we can start building some of these tools ourselves. We won’t be dependent on Google’s largesse.
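
Just to make that concrete before the next installments, here’s a rough sketch of the Infoquake excerpt marked up with class names borrowed from the existing microformats conventions, hAtom and hCard. I’m not claiming this particular scheme is the answer; the markup below is purely illustrative, and the date is a placeholder.

<body>
<div class="hentry">
  <h1 class="entry-title">Excerpt from Infoquake</h1>
  <!-- the author as a machine-readable hCard instead of a bare byline -->
  <address class="author vcard"><span class="fn">David Louis Edelman</span></address>
  <!-- the publication date in a standard format tucked behind the human-readable one -->
  <abbr class="published" title="2006-07-01">July 2006</abbr>
  <div class="entry-content">
    <p>Natch was impatient. He strode around the room with hands clasped behind his back and head bowed forward, like a crazed robot stuck on infinite loop.</p>
  </div>
</div>
</body>

With markup like that, a dumb little script, not a Google-sized server farm, could reliably pull the author and the date out of every page that uses it.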

Every time we post unstructured content on the web, we’re putting money in the hands of Google, Yahoo, and Amazon. We’re contributing to the mass of data out there on the Internet that individual humans can’t possibly wrap their heads around. The only ones capable of using this information are the companies with the money to develop the high-end algorithms that can parse through it all. It might sound rather naive and utopian to think that everyone could agree to open standards for structured content — but the alternative is closed standards for structured content that you and I don’t have access to.

So what can we do about it? How can we start structuring our data? I’ve got some ideas. To be continued…

***

(One side note: It might sound like I’m trying to paint a picture of Google as an evil corporation hell-bent on taking over the world. I’m not. I wrote a spirited defense of Microsoft recently, and I think much of the same goes for Google. But any time you put so much exclusive capability in the hands of a single company or a single corporate entity, you’re asking for trouble. There are several other companies that do similar things to what Google does — Yahoo, Amazon, and Technorati, to name three — but Google is rapidly pulling away from the pack. People complain about how complacent Microsoft got when they achieved a monopoly over your desktop; what happens when Google achieves a monopoly over your content?)