TCW: Dave, thanks for taking time out of your busy schedule to chat with us about content technologies and what the future holds. Before we get started, can you tell our readers a little about yourself and your company, Mark Logic?
DK: I’m a New Yorker who lives in Silicon Valley and who has been working at database-related software companies for my entire career. Prior to joining Mark Logic in 2004, I ran marketing at business intelligence tools vendor Business Objects for nearly a decade.
I’m also an active blogger. I write about web 2.0, discuss technology trends in search, database, and content management software, and write about general issues facing the software industry as well. I also try to keep it fun.
At Mark Logic, we make an XML server, a special-purpose database management system designed for handling XML content that our customers use as a platform for building information products. Today, most of our customers are in two industries: the information industry (which includes publishing) and the Federal government. Our customers include Reed Elsevier, Wolters Kluwer, McGraw-Hill, Oxford University Press, Cengage, the US Army, and many three-letter Federal agencies.
We’re starting to get traction in some additional industries as well, such as life sciences, aviation, and financial services.
TCW: Mark Logic software powers some of the most innovative technology products on the web today. SafariU has changed the model for college and university book sales, making it possible for instructors to legally provide a mix of their own content with content from book publishers. The web-based solution lowers the cost of books (something every student will be happy about) while ensuring writers and publishers get paid for their work. It’s a brilliant and elegant solution to a big problem. Can you tell us about other web-based services that are powered by Mark Logic technology, and point our readers to websites where they can see the solution in action?
DK: Sure, you’re right about SafariU and we’re happy to be working with O’Reilly to deliver it. I think of it as iTunes for books: disaggregating books into chapters the same way that iTunes disaggregated CDs into songs.
As for other examples, I’m pleased to say that Harvard Business School Publishing (HBSP), an innovator in content re-use (think about those racks in the management section at the airport bookstore), is also using Mark Logic. In many ways what HBSP does is similar to SafariU, with the exception that rather than making repurposing a user-driven operation, they make it an editor-driven one. But it’s the same concept: create new information products quickly (and inexpensively) from your existing assets or, as I like to call it, your existing contentbase.
If you’re looking for MarkLogic-based information products that you can see online, then Elsevier’s PathCONSULT is one of my favorites. It’s an application that helps pathologists do their job, which is frequently to figure out what kind of cancer someone has by examining a slice of a tumor removed in a biopsy. Obviously, that’s very important work where lives hang in the balance. While this may seem like a very specific application, I think it is an excellent example of what we call task-based interfaces to information. As the amount of information available online continues to grow, these highly verticalized content-centric solutions will become more important in getting people the information they need for their particular job faster.
TCW: The power of Mark Logic is, in part, at least, made possible by XQuery, an XML standard that is commonly thought of as appropriate only for searching through vast quantities of data, but in actuality it is a programming language that can allow us to do some pretty exciting things with content. What makes XQuery so useful? And, what types of things can we do with it that we can’t do without it?
DK: Well, I think you’ve nailed it right in the question. Most people think that XQuery is simply a query language for either XML documents or XML documentbases. But it’s much more than that. XQuery is a complete programming language. Our customers can, and do, create complete web applications in XQuery. I call this top-to-bottom XML. After all, if your browser speaks XML and your content is in XML, then why not just write your application in XQuery and be XML top-to-bottom? It eliminates a lot of work in mapping XML (which is hierarchical) to either object classes in Java or relational tables in Oracle or DB2.
I’d add that XQuery was recently named by Gartner as the highest-visibility technology plotted on its recent data management hype cycle, so it’s definitely getting a lot of attention in the community. And, Gartner has said that XQuery is only 2-5 years from mainstream adoption.
TCW: This year has been an exciting one for technology and technologists. Time Magazine has dubbed the iPhone as The Invention of the Year. We share that view. The iPhone has changed forever how consumers will expect mobile devices to work. But, there were lots of other technology achievements that deserve note. What other technologies got your attention this year and why?
DK: I think user-generated content (UGC) and social networking were the two biggies in 2007. While I love the iPhone, too, to me this was the year that UGC sites like Wikipedia went completely mainstream; this was the year of YouTube and user-contributed video. In some ways we went from Web 2.0, Release 1 (e.g., Flickr, Delicious) to Web 2.0, Release 2 with things like Facebook that really combine a lot of the other elements and incorporate them to deliver a better user experience. For example, Facebook envelops:
- Twitter status updates
- Flickr photo sharing
- In some ways, YouTube’s video sharing
- Many sites’ groups and events
And it wraps it all in one and combines it with a social graph.
I think Amazon sneaking in the Kindle just before year’s end counts as a biggie in 2007 as well. It may be the e-book reader that starts to take e-books mainstream.
TCW: The future certainly seems bright in the technology sector. Geospatial tagging, social networking, crowd-sourcing, 3D graphics, video documentation, and mobile computing have us pretty excited at The Content Wrangler. What do you think the future holds? Let’s start with content technologies that utilize XML. What’s coming down the pipe? What do you think we will be able to do with content, because of XML, in the future that we can’t easily do today? Can you provide some examples of what you see in your crystal ball?
DK: Sure, mobile devices will continue to get better and be better supported in 2008; XML device-independence is a big enabler of that. The screens will get better. The software will get better. More and more, travelers will start to wonder if they really need to bring their laptops with them, and that will have a profound effect on the software and SaaS marketplaces.
I think geospatial indexing will get bigger as well. It’s one thing tagging a photo as Paris. It’s another automatically figuring out you’re talking about Paris the city, not the person, (entity extraction) and another still to then geo-tag it with latitude and longitude. When you do that, you can start to run queries against content like:
- Show me all documents that talk about Apple the company (not the fruit) and mention travel to places within 50 miles of Paris. That’s powerful!
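As a rough illustration of the kind of query described above, here is a minimal Python sketch. It is not MarkLogic's API or XQuery; the `Doc` class, the entity labels, and the sample documents are all hypothetical, and it assumes an earlier entity-extraction and geo-tagging step has already attached entity tags and coordinates to each document. The distance test uses the standard haversine formula.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass
class Doc:
    """Hypothetical document record produced by an earlier tagging step."""
    title: str
    entities: set   # disambiguated entity tags, e.g. "Apple Inc."
    lat: float      # latitude of the place the document mentions
    lon: float      # longitude of that place


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def docs_near(docs, entity, lat, lon, max_miles):
    """Documents tagged with `entity` whose location is within `max_miles`."""
    return [
        d for d in docs
        if entity in d.entities
        and miles_between(d.lat, d.lon, lat, lon) <= max_miles
    ]


PARIS = (48.8566, 2.3522)  # approximate coordinates of Paris, the city

# Made-up sample documents for illustration only.
DOCS = [
    Doc("Store opening in Versailles", {"Apple Inc."}, 48.8049, 2.1204),
    Doc("Orchard tour near Paris", {"apple (fruit)"}, 48.8500, 2.3500),
    Doc("Developer event in London", {"Apple Inc."}, 51.5074, -0.1278),
]

matches = docs_near(DOCS, "Apple Inc.", PARIS[0], PARIS[1], 50)
print([d.title for d in matches])
```

The point of the sketch is that once "Apple" and "Paris" have been resolved to specific entities and coordinates, the question becomes an ordinary filter over tagged content rather than a keyword search.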
I think content analytics will start to get real in 2008. Thus far, most of the focus on leveraging XML content has been for activities related to single-source and/or custom publishing. Really, the focus has been on integrating content, and then slicing and dicing it in different ways to create different information products. A focus primarily on re-use and re-purposing. I think in 2008 we’ll see more people starting to leverage the contentbases they’ve built for analytics as well.
TCW: Hosted software solutions seem destined to become extremely popular, perhaps ubiquitous. Some hosted solutions certainly are useful under the right circumstances. What do you think we’ll see in the hosted or software as a service sector? Is there a trend toward delivering software via the Web and does that mean that folks in the CD and DVD manufacturing space ought to be looking for new opportunities?
DK: Today, NetSuite closed with an approximately $2B market capitalization, so I’d say that Wall Street remains excited about software-as-a-service (SaaS). Personally, I like the SaaS model for certain applications. For example, at Mark Logic we use both Salesforce and NetSuite, happily. We’re not trying to gain competitive advantage from managing sales leads or financials better than our competition. We let these suppliers focus on providing us with what Geoffrey Moore calls “context” so we can then spend our energy focusing on “core”.
In some ways, if you look at what our publishing customers do, they build content-and-software-as-a-service applications. (Wow, that’s a mouthful.)
But what do I mean? I mean that our publishing customers are increasingly building content applications that take information from their vast content repositories, blend it with software they build (on top of MarkLogic), and sell it to their customers as an information product or service. If you think about it, what they’re doing is becoming SaaS vendors of a sort, but the difference is Salesforce and NetSuite are just about the application; they run with your data. With publishers, it’s about both the software and the content, and mixing them together to help a person accomplish a task, like helping a nurse or a radiologist do their job better.
I should mention that one way to look at our new MarkMail offering is to consider it a SaaS product. MarkMail is a web service that we are offering that lets people search email archives so they can locate expertise and/or experts. We’re running the service ourselves. We host it. We spider the content and subscribe to the lists. Anyone can use it (it’s free).
TCW: Semantic Web technologies are getting folks excited, too. What still needs to happen for the Semantic Web to become a reality and why aren’t we there yet? Or, are we there, but we don’t yet realize it?
DK: While I’m not a huge fan of web-based inferencing to find new knowledge (e.g., dogs bark, Bandit is a dog, ergo Bandit barks) and then using that new knowledge as base facts to do further inferencing, I am nevertheless excited about semantic web technologies.
I think in 2008 we’re going to see more and more tagging, both from users and by computers using technologies like entity extraction. I like a vision where both humans and machines can tag things, people can rate the tags, and then users can do queries leveraging all the tags. In effect, turning the web from something that is simply text-searched into something that is more database-queried.
So, while you can count me out of the RDF and OWL fan clubs in 2008, I think that tagging, querying, and mass collaboration will continue to make the web more useful as an information resource in 2008. So, I’m kind of a hybrid web 2.0 / semantic web guy.
TCW: Mining unstructured, semi-structured, and structured content is not yet commonplace in many organizations. But we think these technologies are valuable because they can help organizations answer business questions. By combining various types of data together, we can harvest, sort, compare, and contrast various types of data and combine them together to provide business value. What does the future hold for data mining? And, how can these new approaches help businesses see trends or patterns that are not readily evident otherwise? Can you provide a few examples?
DK: Data mining has historically been focused purely on data. Analyze one million credit card transactions looking for the 0.5% that are fraudulent. It’s historically been about finding patterns and unusual occurrences in vast databases of short and similar transactions.
Text mining on the other hand has been focused on extraction of meaning from text. Do you mean Paris the debutante or the city? Is this email positive or negative in tone? Which products are referenced in this email?
Thus far, I’d say the major attempt to bridge the two worlds has been to use text mining for summarization (i.e., to turn content into data). For example, to process 10,000 emails and generate a table that lists products in one direction and sentiment (positive, negative, neutral) in the other. If I’ve got 50 products, I can summarize those 10,000 emails in a 150-cell table. In conjunction with my data warehouse, I could then tell, for example, which of my top products has the most unhappy customer email.
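The summarization step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(product, sentiment)` pairs stand in for the output of a text-mining step that is not shown, and the product names and sample data are made up.

```python
from collections import Counter

SENTIMENTS = ("positive", "negative", "neutral")


def crosstab(tagged_emails, products):
    """Summarize (product, sentiment) pairs into a product-by-sentiment table.

    `tagged_emails` is assumed to come from an upstream text-mining step
    that labels each email with one product and one sentiment.
    """
    counts = Counter(tagged_emails)
    return {p: {s: counts[(p, s)] for s in SENTIMENTS} for p in products}


def unhappiest(table, candidates):
    """Among `candidates`, the product with the most negative emails."""
    return max(candidates, key=lambda p: table[p]["negative"])


# Hypothetical catalog of 50 products and a handful of tagged emails.
PRODUCTS = [f"product-{i}" for i in range(1, 51)]
TAGGED = [
    ("product-1", "negative"),
    ("product-1", "negative"),
    ("product-2", "positive"),
    ("product-2", "negative"),
    ("product-3", "neutral"),
]

table = crosstab(TAGGED, PRODUCTS)
cells = sum(len(row) for row in table.values())
print(cells)  # 50 products x 3 sentiments = 150 cells
print(unhappiest(table, ["product-1", "product-2", "product-3"]))
```

In practice the list of "top products" would come from the data warehouse; here it is simply passed in as `candidates`.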
That’s what data people want to do when they see content: turn it into data. Then it can play well with their existing data infrastructure and tools. But with systems like MarkLogic, tagging, and text mining, you can do a lot more with content than just turn it into cross-tab reports. You can run queries like this one:
Now that’s a query! And it goes way beyond summarizing content into data.
TCW: The Darwin Information Typing Architecture (DITA) certainly has gotten significant attention in the technical communication space. Now, it seems that DITA is making its way into other areas. There’s even an initiative to use DITA for business documents. What do you see for the future of DITA? What might be possible in the future if DITA usage keeps moving ahead full steam? Does your crystal ball offer a clue?
DK: Right now, I like DITA for techpubs. I think there’s a lot of enthusiasm for DITA in the techpubs community and I think DITA does a great job helping to solve some of the big problems they’re trying to address, such as the cost of maintaining different information, in different versions, in different languages, for users of different skills/roles, and assembling that information into the usual suspect deliverables such as manuals and help files.
And, I think the techpubs world is controlled enough so that DITA will work there. More generally, for business documents, I’m not so sure. Personally, my bet is on Microsoft Office Open XML (OOXML) to be the market-driver behind the mainstreaming of XML for business documents. And while the XML is working at a different level of abstraction (and DITA XML is higher level and more powerful), I think OOXML will be more ubiquitous. It’s a Heineken vs. Budweiser argument.
TCW: User-generated video seems to be the big winner again this year. YouTube is now co-hosting US Presidential election debates with CNN. Video documentation websites are popping up to serve vertical markets. Consumers like watching how to do things more than they do reading how to do them. What does the future hold for video? Do you think we’re fast approaching a time when we will be able to deliver text and video content using the same systems? And, what are the obstacles we’ll have to overcome before video is as easy to manage as text?
DK: Right now user-generated video (UGV) is on fire, but only in the consumer space. Frankly, I’m not sure if it’s going to catch on anytime soon in the business space. In businesses, for a long time, we’ve had the ability to make, archive, and share videos. I know some companies use video libraries for training. But my sense is it’s still a relatively small piece of helping people learn. Personally, I prefer interactive learning objects to video when it comes to corporate and professional learning.
Most computer systems manage text well and I believe the way we will generally search video in the future is through capturing the soundtrack and then converting the speech to text, time-stamping it, and then jumping into videos where certain words occur. There are people today who sell such technologies. I’m just not sure there’s a big market for them.
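The search approach described above (speech converted to text, time-stamped, then used to jump into the video) amounts to an inverted index from words to timestamps. The Python sketch below assumes the speech-to-text step has already produced `(start_seconds, text)` segments; the transcript and function names are hypothetical.

```python
from collections import defaultdict


def build_index(transcript):
    """Map each lowercased word to the start times of segments containing it.

    `transcript` is assumed to be a list of (start_seconds, text) pairs
    produced by a speech-to-text step that is not shown here.
    """
    index = defaultdict(list)
    for start_seconds, text in transcript:
        for word in text.lower().split():
            index[word.strip(".,!?")].append(start_seconds)
    return index


def jump_points(index, word):
    """Sorted, de-duplicated times at which `word` is spoken."""
    return sorted(set(index.get(word.lower(), [])))


# Made-up transcript of a short training video.
TRANSCRIPT = [
    (0.0, "Welcome to the training video."),
    (12.5, "First, open the valve slowly."),
    (47.0, "Close the valve when the gauge reads zero."),
]

index = build_index(TRANSCRIPT)
print(jump_points(index, "valve"))  # [12.5, 47.0]
```

A player could then seek to each returned timestamp, which is the "jumping into videos where certain words occur" step.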
So, I think technologically, we’re actually on the cusp of video being as easy to access as text. I just think it’s a different medium and is inherently harder to work with; more difficult to scroll through, skim, search, and navigate than text. And, that’s not going to change much.
TCW: Do you see a markup language like DITA playing a role in video? Movies are modular: each segment is filmed independently of the others. These video content components should, hypothetically, be able to be managed like components of text content and reassembled dynamically, on demand. Do you see what we see? Should we be able to pull video components from a repository and reassemble them as needed? Or, can we do that now?
DK: Yes, it’s all possible. MPEG-7 is all about XML and indexing metadata about the content. To find videos and clips, I think it will provide a great advance. Personally, I’d rather search metadata about a video than the captured speech-to-text.
In fact, I think metadata is generally the key to multimedia indexing and re-use of all kinds. So, I think it will happen.
TCW: One last question. What content technology is the one to watch for 2008?
DK: XQuery. This is the year. With Microsoft Office 2007 and its native XML (OOXML) document formats we’re going to see an explosion in XML across the enterprise. Industry-specific XML standards (e.g., FpML, XBRL) continue to gather momentum. My guess is XML pops over the tipping point in 2008 and once the world goes XML it doesn’t take long for people to appreciate, want, and need XQuery.
XQuery has been a long time coming, but I firmly believe it is the best way to query XMLbases (that’s what it was designed to do, after all) and that in the future the database and search and content worlds will all converge around XQuery. It’s open. It’s standard. It’s powerful. It handles XML which means it can query both structured data and unstructured data. That makes it broader than SQL. It is the future.
As I often say, our kids will think about SQL the way that we think about COBOL. As some old, data-oriented language that people used to program in.
TCW: Thanks for sharing your knowledge and psychic predictions for the future with our readers. We really appreciate your time and effort.