In this exclusive interview, XML guru Norm Walsh chats with Scott Abel, The Content Wrangler, about structured content, content standards, and the future of publishing. Read this interview and you’ll learn why XML documents aren’t a good fit for relational databases and how university professors are creating custom textbooks for students, and you’ll find links to several innovative projects that are demonstrating the power of XML and its cousin XQuery.
[Note: I’ll be blogging from the MarkLogic User Conference, May 11-14, 2009, where I’ll be reporting on topics including those mentioned in this article. You can follow my adventures on the conference blog and on Twitter.]
TCW: Norm, thanks for taking time to chat with me today. We’ve known each other for some time now, but, for our readers who don’t know who you are, tell us a little about yourself and your connection to XML.
NW: Sure. I’ve been doing XML since we spelled it SGML. I started with Standard Generalized Markup Language back in the mid-nineties. My day job now is a wonderful combination of development work, helping customers build cool stuff with XML and XQuery, standards work at organizations like the W3C, pre-sales engagements talking about interesting and sometimes hard problems, speaking at conferences, working on community outreach programs, and other “evangelism” sorts of things.
I was an elected member of the W3C Technical Architecture Group for eight years. I’m also chair of the XML Processing Model Working Group at the W3C, co-chair of the XML Core Working Group, and a member of the XQuery Working Group. At OASIS, I’m chair of the DocBook Technical Committee and a member of the RELAX NG Technical Committee.
TCW: Wow! That’s a lot of committee work. Thankfully, the work you do helping these groups also benefits what you do for your employer, MarkLogic. When you joined the company, there were a few people in the industry who were really surprised. After all, you were looked upon as a rock star in the XML arena. Why did you decide to leave Sun Microsystems after so many years of employment?
NW: I’m not entirely comfortable with the notion of “rock star,” but between DocBook, open source projects, and standards work, I guess I have become fairly well known.
Anyway, why did I leave Sun? I have tremendous passion for XML; let’s say that over time I felt like my vision for XML and Sun’s vision, as I perceived it, became so divergent that I decided to make a change.
As soon as I started talking to people at Mark Logic and had a chance to play with the server, I knew I’d found a group of exceptionally sharp folks who shared my passion for XML. A year and a few days after joining, I’ve never once felt otherwise.
TCW: After you joined Mark Logic, I recall you writing a blog post detailing how Dave Kellogg, the CEO of MarkLogic, challenged you to think differently about XML. Tell us a little about the challenge, what you thought before, and what made you “think differently?”
NW: That post is “Thinking differently about XML”.
I’d been at Mark Logic for a few months; I’d been thrown into a couple of small projects almost on day one, so I’d been busy. I was still trying to sink my teeth into the server, and I wanted to develop something a little bit bigger.
In the course of building this project, I ran into some performance issues. I posted some basic questions to an internal discussion list and one of the folks who replied was Dave Kellogg. His response wasn’t a challenge as much as a clear, patient explanation of how I had the wrong end of the stick.
I was used to thinking about XML in terms of a number of documents. The exact details escape me now, but roughly speaking I was trying to get all the documents I needed, then reach inside each to find the elements I needed, then process those. Dave’s observation was that I now had this great big, honking fast database that understands XML “natively”, and has everything indexed for fast access to XML. Instead of grabbing everything I might need and then filtering through it, I should push the constraints down into the database. Instead of applying XPath expressions to a document I had in hand, I could apply them to the whole database and get nearly instantaneous answers.
The app ran faster, I learned something pretty cool, and the *CEO* had taken the time to answer some newbie questions on an internal list. I thought that spoke volumes.
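[Note: The shift Norm describes, from fetching documents and filtering them in application code to pushing the constraint down to an indexed store, can be sketched in a few lines of Python. This is a toy illustration using only the standard library, not MarkLogic Server's API or its index design; the corpus, element names, and function names are all made up for the example.]

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical three-document corpus; the element names are invented.
DOCS = {
    "a.xml": "<article><title>Structured programming</title><author>Ann</author></article>",
    "b.xml": "<article><title>Markup history</title><author>Bob</author></article>",
    "c.xml": "<memo><author>Ann</author><body>Lunch on Friday</body></memo>",
}


def fetch_then_filter(docs, tag, value):
    """Parse every document on every query, then look inside each one."""
    hits = []
    for uri, source in docs.items():
        root = ET.fromstring(source)
        if any(el.text == value for el in root.iter(tag)):
            hits.append(uri)
    return sorted(hits)


def build_index(docs):
    """Build a toy index once: (element name, text) -> set of document URIs."""
    index = defaultdict(set)
    for uri, source in docs.items():
        for el in ET.fromstring(source).iter():
            if el.text and el.text.strip():
                index[(el.tag, el.text.strip())].add(uri)
    return index


def indexed_lookup(index, tag, value):
    """Answer the same question with one lookup and no per-query parsing."""
    return sorted(index.get((tag, value), set()))


INDEX = build_index(DOCS)
print(fetch_then_filter(DOCS, "author", "Ann"))
print(indexed_lookup(INDEX, "author", "Ann"))
```

Both calls name the same two documents, a.xml and c.xml. The difference is where the work happens: the first function parses the whole corpus for every question, while the second consults an index that was built once, ahead of time.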
TCW: That’s a great example of good leadership and one of the reasons your CEO is admired by many others. In fact, his blog just won an SIIA Codie Award! And, the solutions your clients are creating using Mark Logic’s products are nothing short of miraculous, as far as I’m concerned. MarkLogic Server, for instance, has made it possible for organizations to see trends in—and answer questions derived from—unstructured and structured content, together in one repository. What is MarkLogic Server? Why is it so useful? And, what can it help organizations do today, that was impossible—or, at least extremely difficult—to do in the past?
NW: MarkLogic Server is a platform for rapidly building and deploying XML content applications. It’s a highly scalable native XML repository that can store and retrieve XML content and perform powerful search and analytics on it.
I’m a document guy. I think that most of what’s really important to an organization is bound up in documents one way or another. I’m not denying that there are huge quantities of tabular data out there, but documents provide the context for that data. The ideal way to store documents, so that you can extract the most value from them, is XML.
Because we have access to the structure and content of documents and metadata about them, we can do so much with them. Searching comes up a lot, of course, and we can easily provide both full-text searching (find me all the documents about “structured programming”) and faceted navigation (refine these search results by selecting only documents written by a specific author).
Alerting is important to a lot of people. Instead of querying a corpus of documents to find items of interest, you let the server do the work. By storing the queries, you can get the server to respond on the fly when new documents are inserted that match your criteria.
An area that I’ve been excited about for a while is geospatial applications. A *lot* of people are now carrying around devices that know exactly where they are, so I think the ability to quickly perform geospatial queries is going to become increasingly important.
One of the things that really impresses me about our core engineering team is how dedicated they are to maintaining the composability of features. Full text, structured and geospatial searching, for example, are all independent features, but you can compose them together arbitrarily and it “just works” *at speed*.
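[Note: The alerting idea mentioned above, where queries are stored and each new document is checked against them on arrival, can be sketched as follows. This is a minimal standard-library illustration, not MarkLogic Server's alerting API; the class, method names, query names, and sample documents are hypothetical.]

```python
import xml.etree.ElementTree as ET


class AlertingStore:
    """Toy document store that checks stored queries on every insert."""

    def __init__(self):
        self.documents = {}
        self.stored_queries = {}  # query name -> (element path, phrase)

    def register(self, name, path, phrase):
        """Store a query: fire when an element at `path` contains `phrase`."""
        self.stored_queries[name] = (path, phrase.lower())

    def insert(self, uri, source):
        """Insert a document and return the names of the queries it matches."""
        root = ET.fromstring(source)
        self.documents[uri] = root
        fired = []
        for name, (path, phrase) in self.stored_queries.items():
            for el in root.findall(path):
                # Compare against all text inside the element, case-insensitively.
                if phrase in "".join(el.itertext()).lower():
                    fired.append(name)
                    break
        return fired


store = AlertingStore()
store.register("sp-watch", ".//title", "structured programming")
store.register("by-ann", ".//author", "ann")

fired = store.insert(
    "a.xml",
    "<article><title>Notes on Structured Programming</title>"
    "<author>Bob</author></article>",
)
print(fired)
```

Here only the stored query named sp-watch fires, because the new document's title contains the phrase and its author does not match the second query. The caller never runs a search; the match happens as a side effect of the insert.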
TCW: Publishers really see immediate benefit from using MarkLogic Server. Tell us about a few implementations by publishers and describe the value they’re receiving as a result.
NW: Custom publishing is a hot topic. At the MarkLogic User Conference (May 12-14 in San Francisco), Wiley is going to demonstrate Wiley Custom Select, an excellent example of a custom publishing application that I worked on recently. It allows professors to mix and match content from different textbooks to build their own custom textbook for a course. They can even upload their own content to be included in the book.
By putting all of the textbooks in MarkLogic Server, we can dynamically assemble a custom textbook in real-time. We give professors incredible freedom because we have effectively instantaneous access to every book in the system.
Another cool one I saw was a medical imaging application. A technician looking at an x-ray could enter a speculative diagnosis and the system would search a huge library of medical textbooks and journals. In this case, the system didn’t return whole documents, it returned just examples of x-rays that were diagnostic of the condition the technician entered. This provides instant access to exactly the x-rays that the tech wanted to use for comparison as an aid to making a final diagnosis.
TCW: Those are some great examples. The changing nature of consumer information consumption habits has led to drastic changes in the journalism arena. Are any publishers using MarkLogic Server to help them engage their readers online and to provide them with interactive experiences you just can’t have with a traditional print publication?
NW: Absolutely. The one that immediately comes to mind is BusinessWeek’s Business Exchange. This is a complete “social network” style site built for business professionals. The site incorporates content from many sources, including users. User participation drives the relevance of articles, creating a positive feedback loop for higher quality content and user participation. This also helps the editors decide where to expend their editorial resources.
Another cool application that comes to mind is one I saw recently that was built for a petroleum company. This company has all sorts of research documents about exploration for petroleum reserves of one sort or another. One of the important things contained in those documents is information about the location of the deposit in question. Using our geospatial capabilities, we could provide an interface that combined not just full text and faceted navigation but also a dynamic map. Search for a topic and the points on the map change to reflect only those search results. Select a subset of the points on the map and the search results change to reflect only documents about those points. It was all very cool.
TCW: The success of the iPhone is amazing. That little device (like the iPod before it) helped reshape an entire industry and impacted almost every vertical market along the way. Are you seeing anyone using MarkLogic Server to power iPhone solutions? Is there an iPhone app that uses your software?
NW: MarkLogic Server is well suited to supporting mobile clients. Applications running on the server have complete access to all of the content. They aren’t limited, for example, to just finding whole documents. They can find, combine, and aggregate information within documents at a very fine level of granularity. That means you can send exactly the right data to the mobile client. Sending less means better performance and less drain on the battery.
As apps get more sophisticated and need more capabilities, the need for strong server-side components will only increase. For example, geospatial capabilities are already hugely popular on the iPhone. If the next generation of iPhone software really supports push content, as it’s expected to, then alerts will be huge, too.
TCW: Let’s switch gears a bit. I know that you and I see eye-to-eye on the importance of XML to the world of publishing. But, I wonder if XML has been overlooked a bit as a powerful tool to help us provide data in ways that are meaningful to users. The whole field of information visualization, for instance, is dedicated to helping us find the best way to present information, depending on what type of information it is, and what question we’re hoping to answer. I love the examples provided by the often-cited Simile Project at Massachusetts Institute of Technology. Timeline, for example, shows clearly how looking at data over time makes it easier to understand. Do you know of other examples on the web that use data visualization in interesting and meaningful ways?
NW: Visualization is really important. Sometimes a picture really is worth a thousand words. I see this all the time in mapping mashups, but you can use MarkLogic Server to support almost any kind of visualization because it can retrieve the data you need so quickly. If you’ve seen MarkMail, for example, you’ve seen how it can quickly generate bar graphs. Sometimes those graphs help you really quickly narrow your search.
TCW: Guy Kawasaki has said that it may take a good 20 years for some new technologies to take off and gain widespread general acceptance. We’ve had structured content standards—leading up to the XML standard we have today. Is it time for XML to really take off? What does your crystal ball tell you? And, if so, what are some of the primary drivers for adoption?
NW: Prognostication is not my strong suit, but as far as XML goes, we’ve reached a point where it’s used in everything from traditional book publishing to mobile phones to gas pumps. I bet some part of your communication with the world passes through XML more often than you realize on any given day. Everyone using a recent version of Microsoft Office or Open Office is using XML; they just might not know it. I think widespread, general acceptance is upon us.
That said, I’m sure there are organizations that haven’t embraced XML. To the extent that information, that is, documents and other content artifacts, is an important asset to them, they’ll continue to be drawn to XML because it leverages reuse and repurposing of those assets in a way that drives down costs.
TCW: While you’re making predictions, you’ve been involved in the publishing world for a long while. What changes do you see in the publishing arena? Given what you know and believe, what will publishing look like ten years from now?
NW: Can I repeat that bit about my ability to prognosticate? If I could predict the future, I’d be rich.
When it comes to distributing ephemeral information, I think we’re all going to be reading more on some sort of electronic device. I’m hoping for an electronic paper breakthrough that lets me carry around something that more closely resembles a flexible, paperback book. Something with a hundred or so “pages” that I can turn and riffle through but where the “ink” is all electronic.
I think there’s a huge opportunity for publishers. They’ll have a ubiquitous platform for delivering rich, personalized content. With a powerful content server in the background, they’ll have the ability to deliver compelling content in an agile way to a diverse audience.
TCW: One question I continue to have to answer is “Why can’t we just put our documents in the database we have already?” Can you help our readers understand the problems with this line of thinking? Why aren’t databases that rely on tables and columns appropriate for content chunks that comprise typical documents? Why is it a bad idea to go down this path?
NW: The short answer is because typical documents don’t look like, simply aren’t, tables. The slightly geekier answer is “mixed content”. The simple truth is that relational databases were designed to solve a different problem.
Think about it this way. If you weren’t already biased by the tabular database systems that you already have, could you imagine in a million years that you would look at an XML document and think, yeah, let’s decompose the elements in this document and put them in columns in tables? Wouldn’t happen.
There’s a whole science built around analyzing tabular data, decomposing it into the right tables, and building the best indexes for it. The goal of this exercise is to systematically eliminate redundancy while preserving data integrity.
Tables just don’t make sense for documents. We’re chatting, but your audience will be reading this. Is this paragraph in the same column as the preceding paragraph? The same row? The same table? These questions don’t even make any sense.
Words and markup all mixed together in irregular ways, what happens in paragraphs and what we structured markup guys call “mixed content”, is just too loosely structured to sensibly model in rows and columns.
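[Note: For readers who have not met the term, here is a small Python sketch of what “mixed content” looks like once a paragraph is parsed. The sample paragraph and its element names are invented for illustration, and the helper function is hypothetical.]

```python
import xml.etree.ElementTree as ET

# A made-up paragraph: prose with inline markup scattered through it.
PARA = (
    "<para>We flew to <place>Paris</place> in <date>May</date>, "
    "which <emphasis>everyone</emphasis> enjoyed.</para>"
)


def pieces(element):
    """Yield (kind, value) pairs in document order: text runs and child elements."""
    if element.text:
        yield ("text", element.text)
    for child in element:
        yield (child.tag, "".join(child.itertext()))
        if child.tail:
            # Text that follows a child element but still belongs to the parent.
            yield ("text", child.tail)


root = ET.fromstring(PARA)
sequence = list(pieces(root))
for kind, value in sequence:
    print(f"{kind:>8}: {value!r}")

# Because the structure is kept, a query can ask how a word is marked up.
places = [el.text for el in root.iter("place")]
print(places)
```

The paragraph comes apart into seven pieces, alternating text runs and elements in an order that carries meaning, with no fixed set of fields that would repeat from one paragraph to the next. That irregularity is what makes a row-and-column model awkward, while a structure-aware query can still ask whether “Paris” was marked up as a place.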
Another problem is information loss. Let’s say you shred a document and stick it into a table somehow and add metadata to let you query it. The fact that the metadata says that document 17 talks about Paris isn’t anywhere near as interesting as actually being able to write queries about the structure of the document. How does it mention Paris? In what sense? Is it near other markup that gives us clues about its importance? And so on…
There have been all sorts of approaches for getting documents into traditional databases: storing them as “blobs”, shredding them on element boundaries, inventing odd, nested table structures, and I’m sure at least that many other methods that escape me at the moment.
They’re all inferior in one way or another. If all you have is rows and columns and tables, then you can invent something that works at some cost in terms of time, or space, or utility. Sometimes, all three.
Relational systems give you a lot of query flexibility over data that has a constrained schema, and is consequently very structured and regular, which human prose is not. Curiously, if you invert those parameters, that is, if you think in terms of very little flexibility in the query and an effectively unconstrained schema, what do you get? You get full text search engines.
What an XML Server gives you is really the best of both worlds. You get a lot of query flexibility over data that, while it isn’t completely unconstrained, is also very flexible.
TCW: Wow, I just realized we’re out of time. Anything else you’d like to add?
NW: No, I don’t think so. It’s been fun chatting with you!
TCW: Thanks for your valuable time today. I really appreciate you helping our readers understand a little about you, the company you work for, and the content standards and technologies that are changing the way we consume content.
NW: You’re most welcome. Talk to you again soon.