Findability: Jason Hunter On Mark Logic’s Use Of XQuery To Leverage The Power Of Legacy Content

January 5, 2007

TCW: Jason, thanks for agreeing to chat with us today. For our readers who don’t know who you are, please tell us a little about yourself, your past experience, and your role at Mark Logic.

JH: Thanks for having me here, Scott.  Or do people call you Mr. Wrangler?

Some people may know me for my work in Java, where I wrote Java Servlet Programming (O’Reilly), helped develop Tomcat and Ant, created the JDOM open source library for XML manipulation, and worked as Apache’s representative to the Java Community Process Executive Committee.  For the last few years though I’ve been concentrating on XQuery and its ability to support large-scale XML content manipulation, and as part of that I joined Mark Logic and work here as Principal Technologist.

TCW: Can you tell us a little about Mark Logic? What types of solutions do you sell and to whom?

JH: Mark Logic sells an XML Content Server (called MarkLogic Server) that acts as a platform for people creating content applications.  We use XQuery as the language for interacting with the server, with extensions for advanced text search, transactional updates, and other useful features.  While most XQuery engines focus on handling XML data (such as purchase orders), we focus on XML content (such as books, articles, references, web pages, and blogs).  XML content, unlike data, is more textual, ordered, hierarchically structured, and diverse in structure.  We’ve enjoyed a lot of success selling in the publishing and government verticals.

My focus within the company is primarily on publishing.  We help publishers as they move beyond simple “aggregation of content” to what you might call “interpretation of content”.  By understanding XML natively and providing an efficient mechanism to load, query, manipulate, and render XML content, publishers can raise the bar on what they deliver and how fast they can deliver it.

In my discussions with publishers I’ve identified a few trends in web publishing, trends I think we’ve helped advance:

  • Answers Not Links – Users want to be one step away from an answer, not two, so a results page that contains answers directly beats one containing only links.  You see this on Google when you enter “Mount Hood Elevation”. Publishers want to do this as well, and they can, by leveraging the XML structure of their content to improve their query capability and to control their results page at all levels.  For example, Elsevier’s PathCONSULT web product uses Mark Logic to help pathologists (like “Dr House” on TV) identify the source of illness.  The site is extremely rich with content and uses the content to deliver answers (such as side-by-side diagnostic comparisons), not just links to material.
    Screenshot: PathCONSULT “differential diagnosis” feature.

  • Sweat the Content – As a publisher you have two ways to make more money: Obtain more content, or do more with the content you have.  It’s generally better to do more with what you have.  That’s why publishers are creating (what I call) microproducts, like Oxford’s African American Studies Center, targeted at researchers of African American issues.  It’s built using a fraction of the vast collection of resources maintained by Oxford University Press, all picked for the relevance to the topic.  It’s a way for Oxford to take select content from Oxford Reference Online, Grove Music Online, and the American National Biography, add to it a set of content unique to the site like the Encyclopedia Africana (formerly only available in printed form), and offer subscriptions to people who probably wouldn’t want to subscribe to a generic site about people or music, but will pay to have a resource on African American issues.  It’s the first of many such sites Oxford will introduce.
    Screenshot: Oxford African American Studies website. Get a free 30-day trial of Oxford Press.

    Screenshot: “At A Glance” listing for Harriet Ross Tubman, Oxford African American Studies.

  • Content in Context – By “context” I mean three things. First, text location. The location of content within a larger context matters. For example, I once crafted a feature where you could enter a drug name and search the last year’s medical journals for information about that drug. It looked in research articles, in conclusion sections, for paragraphs that mentioned the drug name and also mentioned a key phrase like “should not be used with”, “is contraindicated by”, or “may cause death”. Then the query printed the sentence that mentioned the drug’s name. In a sense this was Answers Not Links, and it was Sweat the Content also, but it was done by putting the content (the drug’s name) in context (which article, which part of the article, which contextual words). All this was enabled by an understanding of the XML. Second, by context I mean the user’s location. The classic example is an electronic flight bag (EFB) on board a plane that gives pilots a checklist of actions in any conceivable event.  For many events you should get a different checklist if you’re cruising above 10,000 feet, if you’re taking off, or if you’re landing. EFBs can work with the plane to put the text in the context of where the user is.  And third, I mean historical context. For example, Congressional Quarterly uses Mark Logic to report how new legislation modifies previous legislation. It puts each bill in historical context.

Screenshot: Congressional Quarterly website.
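
The drug-name search described under Content in Context can be sketched in a few lines. This is only an illustration in Python's standard library, not the XQuery that ran on MarkLogic Server: the element names (`article`, `section`, `para`), the `type` and `role` attributes, the drug name "Examplol", and the sample text are all assumptions made up for the example, and the "last year" date filter is left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical journal markup; element and attribute names are assumptions.
JOURNAL_XML = """
<journal>
  <article type="research">
    <section role="methods">
      <para>Examplol should not be used with placebo controls in this design.</para>
    </section>
    <section role="conclusion">
      <para>Outcomes improved overall. Examplol should not be used with warfarin. Further study is needed.</para>
      <para>Dosage had no measurable effect on recovery time.</para>
    </section>
  </article>
  <article type="editorial">
    <section role="conclusion">
      <para>Examplol may cause death, according to rumor.</para>
    </section>
  </article>
</journal>
"""

KEY_PHRASES = ("should not be used with", "is contraindicated by", "may cause death")


def find_warnings(root, drug):
    """Return sentences naming `drug` from qualifying conclusion paragraphs."""
    hits = []
    drug_lc = drug.lower()
    # Structural context: research articles only, conclusion sections only.
    path = "./article[@type='research']/section[@role='conclusion']/para"
    for para in root.findall(path):
        text = "".join(para.itertext()).strip()
        lowered = text.lower()
        # Textual context: the paragraph must mention the drug and a key phrase.
        if drug_lc in lowered and any(p in lowered for p in KEY_PHRASES):
            # Naive sentence split on terminal punctuation.
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                if drug_lc in sentence.lower():
                    hits.append(sentence)
    return hits


root = ET.fromstring(JOURNAL_XML)
warnings = find_warnings(root, "Examplol")
print(warnings)
```

The methods paragraph and the editorial both contain a key phrase, but the path expression excludes them. That is the sense in which the markup, rather than the words alone, supplies the context.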

There are several other trends. I give a presentation, “Web Publishing 2.0”, where I explain the top ten trends I’m seeing in the publishing industry.

TCW: Can you help our readers understand what the XQuery standard is, why it was needed, how Mark Logic uses it, and why it’s a better approach than some others?

JH: XQuery is a World Wide Web Consortium (W3C) standard language designed to query XML.  In some ways it is to the XML data model what SQL is to the relational data model. XQuery is currently a proposed recommendation, the last step before its formal 1.0 release.

Mark Logic uses XQuery because it provides a powerful mechanism to interact with our XML Content Server. Perhaps because of its name, some think of it as just a query language, but there’s a real programming language in there. You can do some amazing things with it.  If the job involves large-scale querying, manipulating, and/or rendering of XML, XQuery produces a solution far simpler than something like Java and JDOM.  XQuery can be easier to program than XSLT and, with an indexed store like MarkLogic, run much more efficiently as well.

Of course, XQuery has its limitations. The XQuery 1.0 release will lack a few important features, such as text search, the ability to modify documents, and error capturing. These will someday be standard, but in the meantime Mark Logic has had to add them to the language.

TCW: Can you tell us a little bit about one of your implementations?  For example, what is SafariU and what problems does it attempt to solve?

JH: SafariU is a web site from O’Reilly Media and Pearson that helps college professors create custom books for their classes. Instead of violating copyright and photocopying parts of books for students, professors can mix and match book pieces, articles, or even their own uploaded content into a new custom book.  It’s “rip, mix, burn” but for books instead of songs. The end result is a professionally bound book delivered to the college bookstore, sold for 16 cents a page. The SafariU back-end holds the content as XML (about 5 gigabytes’ worth) and uses that XML in conjunction with Mark Logic and XQuery to perform advanced search to find materials, HTML rendering for on-screen display, and PDF rendering for the printer-ready copies.  In the final PDF the dynamically constructed table of contents and back-of-the-book index make it seem as if the content were always a single book, but in fact they’re just created on the fly with XQuery pulling out section titles and index term elements.

Screenshot: SafariU generated index.
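
As a rough illustration of building a table of contents and an index on the fly from section titles and index term elements, here is a minimal Python sketch. It is not SafariU's code and it is not XQuery. The markup (`custombook`, `section`, `title`, `indexterm`) is hypothetical, and section numbers stand in for the page numbers a real PDF index would carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom-book markup; the element names are assumptions.
BOOK_XML = """
<custombook>
  <section source="Book A">
    <title>Servlet Basics</title>
    <para>A <indexterm>servlet</indexterm> handles a <indexterm>request</indexterm>.</para>
  </section>
  <section source="Book B">
    <title>Build Tools</title>
    <para><indexterm>Ant</indexterm> can package a <indexterm>servlet</indexterm>.</para>
  </section>
</custombook>
"""


def build_toc_and_index(root):
    """Return (toc, index) computed from the sections at request time."""
    toc = []
    terms = {}
    for number, section in enumerate(root.findall("section"), start=1):
        # Table of contents: one entry per section title, in document order.
        toc.append((number, section.findtext("title")))
        # Index: every indexterm anywhere inside the section points back to it.
        for term in section.iter("indexterm"):
            terms.setdefault(term.text.strip(), set()).add(number)
    # Sort terms case-insensitively, as a printed index would.
    index = [(t, sorted(terms[t])) for t in sorted(terms, key=str.lower)]
    return toc, index


toc, index = build_toc_and_index(ET.fromstring(BOOK_XML))
print(toc)
print(index)
```

Nothing in the source sections records that they now sit side by side. The shared table of contents and index exist only as query results, which is why pieces from different books can read as one.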

O’Reilly recently set up a Labs site to experiment with the alternate applications the SafariU system could support. There you’ll find a code search, an image search, a couple of quiz games, and a content statistics application to learn all about what books O’Reilly sells. Before using Mark Logic, O’Reilly hosted their content on a simple shared NFS mount and had very little visibility into their most important assets. Now with the Labs site, we can all learn that they have over 303 million words in print and up for sale, including about 2.5 million lines of (now searchable) code.

Screenshot: A results page from O’Reilly Labs.

TCW: That seems like a very beneficial use of XQuery. What other services does SafariU hope to introduce?

JH: The list is long.  One thing I’d like to see is a “Search My Bookshelf” feature where you register the books in your office library and search via the site to find which books have the answer you’re looking for.  I’ll be happy if I never again use a back of the book index, even as I enjoy having paper copies of books to read.  If I do use a classic index, I’d want it to be a dynamically generated index, a concordance putting together all the indexes of all my purchased books, delivered to me as my own personal PDF to print. That’s actually quite an easy “query” to write.
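
To give a feel for why a merged, personal index is a small query, here is a hedged Python sketch of the idea. The `shelf`, `book`, `index`, and `entry` elements, the book titles, and the page numbers are all invented for illustration; a real version would be written in XQuery against the publisher's own index markup.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical bookshelf markup; titles, terms, and pages are made up.
SHELF_XML = """
<shelf>
  <book title="Book One">
    <index>
      <entry term="session" page="212"/>
      <entry term="cookie" page="198"/>
    </index>
  </book>
  <book title="Book Two">
    <index>
      <entry term="classpath" page="45"/>
      <entry term="session" page="17"/>
    </index>
  </book>
</shelf>
"""


def merge_indexes(root):
    """Combine every book's index into one term -> [(title, page)] mapping."""
    merged = defaultdict(list)
    for book in root.findall("book"):
        for entry in book.findall("index/entry"):
            merged[entry.get("term")].append((book.get("title"), int(entry.get("page"))))
    # Alphabetical terms; references ordered by book title, then page.
    return {term: sorted(refs) for term, refs in sorted(merged.items())}


concordance = merge_indexes(ET.fromstring(SHELF_XML))
for term, refs in concordance.items():
    print(term, refs)
```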

TCW: This is the kind of solution that is more impressive when you see it in action. Is it possible to test drive SafariU?

JH: You can learn about SafariU and view a Flash demo.  I’m afraid you won’t be able to test drive unless you’re a professor, as users have unrestricted access to all of O’Reilly’s content, as well as much of Pearson’s.

TCW: Can you tell us a little bit about Oxford University Press?  How are they using XQuery and what types of problems are they solving?

JH: First, let me say Oxford University Press (OUP) has some of the most beautiful XML I’ve ever seen.  They invested heavily over the years in creating semantically rich XML, and it’s been a real pleasure to work with them to realize the value of this markup by helping them create a platform to host their online sites.  You can see XQuery in action on their sample “At a Glance” page.  This page includes content pulled from up to 10 different sources, more or less on the fly, and reconciled so that the user can get an instant overview of a particular topic.

TCW: That’s very impressive. What else can you tell us about the Oxford University Press project?

JH: With Oxford’s publishing platform, the platform underlying the African American Studies Center, they’re going to be able to roll out new sites faster than one per quarter.  Mark Logic literature says, “We accelerate the creation of information products.” I think their platform, built on Mark Logic and XQuery, proves that true.  One reason is there’s no “impedance mismatch” between the XML Content Server and the XML content being manipulated.  It’s the power of using the right tool for the job.

TCW: It seems like this is the kind of solution that is more impressive when you see it in action. Is it possible to test drive Oxford’s solution?

JH: Yes, on the home page (oxfordaasc.com) you can request a free trial login. If you’re just curious about what they offer, the Flash demo walks you through the main feature set.

TCW: There’s always a “Wow!” factor project in every information technology professional’s arsenal. We use these examples to help folks understand the power of a technology in ways that are meaningful to them. What are some of the cool, and useful, things one could do with XQuery? Do you have an example you like to use that would help our readers understand what other problems XQuery could be used to solve?

JH: One time I was asked to create a “press view” of any arbitrary medical journal article.  The journal publisher wanted to help reporters understand things when they came to report on a new study with interesting consequences. The challenge of course is that these reporters weren’t medical experts.  I created the “press view” with XQuery and the query, manipulate, render sequence of actions.

I determined that more than anything else reporters would want to see the charts and images from the article, the eye candy.  Since I too wasn’t a medical expert, I knew the figures were what I looked at to try to understand what the article was about.  My first query thus was: find me all figures and figure captions in the article.  People need context, so for each figure I did another query to find any paragraph that mentioned the figure.  These I included after the figure.  (You can see a similar contextual display in the O’Reilly Labs Image Search offering.) To explain the purpose of the article I decided to query for the table of contents page of the journal issue containing the article and place at the top of the “press view” the article’s blurb from the TOC.  At the end of the “press view” I then queried all the later journal issues for any letters to the editor pertaining to this article, and I placed those in threaded order.  That way reporters could see any controversy the article stirred up.

I think the creation of this synthetic document shows the value of content applications.  It’s not just styling for output.  It’s querying, manipulating, and rendering.  That you can use the same scripting style language to search, extract, and style the content is really powerful.
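
The “press view” steps described above (TOC blurb, figures with the paragraphs that mention them, then later letters) can be sketched as one small function. This is a Python approximation under stated assumptions, not the original XQuery: every element and attribute name (`issue`, `toc`, `item`, `blurb`, `figure`, `figref`, `letter`, `about`) and all of the sample text are hypothetical, and letters are simply listed in issue order rather than truly threaded.

```python
import xml.etree.ElementTree as ET

# Hypothetical journal markup; all names and sample text are assumptions.
ISSUES_XML = """
<journal>
  <issue number="1">
    <toc>
      <item ref="a1"><blurb>A new study links sleep and memory.</blurb></item>
    </toc>
    <article id="a1">
      <para>Participants were tested weekly.</para>
      <figure id="f1"><caption>Recall scores by week.</caption></figure>
      <para>As <figref target="f1">Figure 1</figref> shows, recall rose steadily.</para>
    </article>
  </issue>
  <issue number="2">
    <letter about="a1">The sample was too small.</letter>
    <letter about="a9">Unrelated letter.</letter>
  </issue>
  <issue number="3">
    <letter about="a1">The authors respond.</letter>
  </issue>
</journal>
"""


def press_view(root, article_id):
    """Assemble a synthetic summary document for one article."""
    issues = sorted(root.findall("issue"), key=lambda i: int(i.get("number")))
    article_path = f"article[@id='{article_id}']"
    # Locate the issue that contains the article.
    home = next(i for i in issues if i.find(article_path) is not None)
    article = home.find(article_path)

    # 1. The article's blurb from the issue's table of contents goes on top.
    blurb = home.findtext(f"toc/item[@ref='{article_id}']/blurb")

    # 2. Each figure caption, followed by the paragraphs that mention it.
    figures = []
    for fig in article.findall("figure"):
        ref_path = f"figref[@target='{fig.get('id')}']"
        mentions = ["".join(p.itertext()) for p in article.findall("para")
                    if p.find(ref_path) is not None]
        figures.append({"caption": fig.findtext("caption"), "mentions": mentions})

    # 3. Letters about the article from later issues, in issue order.
    letters = [letter.text
               for issue in issues
               if int(issue.get("number")) > int(home.get("number"))
               for letter in issue.findall(f"letter[@about='{article_id}']")]

    return {"blurb": blurb, "figures": figures, "letters": letters}


view = press_view(ET.fromstring(ISSUES_XML), "a1")
print(view)
```

The returned dictionary is the manipulated result. Rendering it as HTML or PDF would be the final step of the query, manipulate, render sequence.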

TCW: Are there any questions you wish we had asked you? If so, now is your time to ask them.

JH: You could ask where people should go for more information.  Here are a few resources:

Thanks again, Scott.  It’s always fun when someone asks about the thing you’re passionate about.
