TCW: Jason, the last time we talked was about a year ago. For our readers who don’t recall that interview, can you tell us a little about yourself and the company you work for?
JH: Sure, Scott. I work as Principal Technologist with Mark Logic, focusing on large-scale XML content manipulation. Prior to Mark Logic I did a lot of work in Java: I wrote the book Java Servlet Programming (O’Reilly), helped develop Apache Tomcat and Apache Ant, created the JDOM open source library for XML manipulation, and worked as Apache’s representative to the Java Community Process Executive Committee.
TCW: Can you tell us a little about Mark Logic? What products do you make and what types of problems are they designed to solve?
JH: Mark Logic sells an XML Content Server (called MarkLogic Server) that acts as a platform for people creating content applications. We use XQuery as the language for interacting with the server, with extensions for advanced text search, transactional updates, and other useful features. While most XQuery engines focus on handling XML data (such as purchase orders) we focus on XML content (such as books, articles, references, web pages, and blogs). XML content, unlike data, is more textual, ordered, hierarchically structured, and diverse in its construct. We’ve enjoyed a lot of success selling in the publishing and government verticals.
TCW: Because many people can process and understand ideas when they can see them in action, can you show us a solution or two that is made possible by your software and explain what problem it’s solving?
JH: Sure, one of my favorite examples is Elsevier’s PathConsult web product. It’s a differential diagnosis tool built on MarkLogic to help pathologists (like “Dr House” on TV) identify the source of illness. These doctors need to be fast, accurate, and sure. The site runs against Elsevier’s vast library of medical literature. If you remember my discussion of web trends from our last interview, PathConsult satisfies the three trends: “sweat the content”, “deliver answers not links”, and provide “content in context”. Next up: RadConsult, a radiology diagnostic reference system.
TCW: Those are great examples of the power of Mark Logic Server. But, we received a recent press release announcing a new service, MarkMail. And, we were totally blown away. Tell us a little about MarkMail. What is it? How does it work? And, why do we need it?
JH: I’m glad you liked it! MarkMail is a free web site, hosted at http://markmail.org, for interacting with email archives. The site lets you search and analyze emails from hundreds of public mailing lists: online forums where people email each other to discuss some shared interest. You may wonder, what do people talk about on public mailing lists? Software development is probably the most common topic. Other topics are as varied as fine wine collecting, large format photography, and techniques for guitar looping. So far we’ve loaded about 500 lists and 4,250,000 individual emails.
Email presents an interesting challenge. Email archives (both public and private) hold huge amounts of information, but the histories haven’t been well utilized. We think one reason for that is technical, that you need a product like MarkLogic Server before you can take full advantage of email content.
Our plan with MarkMail, which is built on MarkLogic Server, is to actively push the envelope and build a content application targeted at the email challenge.
As you’ll see with the chart on the http://markmail.org home page, one of our goals with the site has been to focus heavily on analytics. We have lots of graphs and counts. Each and every query you write gets its own histogram chart. You can use these to put search results in context, track each list’s historical growth, check on a specific poster’s activity, or inquire whether the “buzz” on a topic is heading up or down.
Another goal has been interactivity. Every search result screen gives you lots of ways to refine your search (by sender, list, attachment type, etc).
Plus we did a lot with keyboard shortcuts. You can hit “n” and “p” to move to the next and previous result and “j” and “k” to move up and down the thread view. There are a lot of little things like this. And if your result message includes Office or PDF files, they’re shown in-line and are interactive, too. You don’t even have to leave the browser.
Another goal has been to focus on community. We could have launched MarkMail with many millions more emails from many sources, but we decided it’s better to work with communities one by one as we incorporate their historical archives and learn how they want to use them. At launch we focused on the emails of the Apache Software Foundation, an open source group responsible for much of the web’s core infrastructure.
TCW: The power of XML is at the heart of all things Mark Logic. How does XML enable the MarkMail solution? Or, more specifically, how are you using XML to make this type of solution possible?
JH: I’ve worked on email search systems before and I can tell you it’s a real challenge because of the nature of email. Email is messy. Email headers are fairly well structured, but not perfectly, because each mailer will send different headers using different formats and there are no hard and fast constraints. The email body itself may seem like just flat text (what you’d call unstructured), but really there’s more to it. There are paragraphs, quote blocks (where person A quotes person B), initial greetings, and trailing footers (a footer can be a person’s signature block, an auto-added listserv notice, an auto-added confidentiality statement, and things like that). There are also attachments, which have their own pages, paragraphs, and so on.
With MarkMail we load every email into MarkLogic Server as an XML document. It’s a very natural format for email. We mark up the headers, the body paragraphs, the quote blocks, the greetings, the footers, and everything. Even the attachments and their contents. Then, using XQuery as an XML-aware query language, we can build an application that uses the structure.
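To make the idea of marking up an email concrete, here is a minimal Python sketch of the general approach. It is not MarkMail's actual code or schema: the element names (`message`, `headers`, `body`, `greeting`, `quote`, `para`, `footer`), the sample message, and the heuristics (lines starting with `>` are quotes, `-- ` starts the signature) are all assumptions chosen for illustration.

```python
import email
import re
import xml.etree.ElementTree as ET

# A made-up sample message, built line by line so the "-- " signature
# separator keeps its trailing space.
RAW = "\n".join([
    "From: alice@example.org",
    "To: dev@example.org",
    "Subject: Re: build failure",
    "",
    "Hi Bob,",
    "",
    "> The build fails on trunk.",
    "> Any ideas?",
    "",
    "Try a clean checkout first.",
    "",
    "-- ",
    "Alice",
    "",
])

# Hypothetical heuristic: a first block opening with one of these words is a greeting.
GREETING = re.compile(r"^(hi|hello|dear)\b", re.IGNORECASE)


def email_to_xml(raw):
    """Convert a plain-text email into an XML tree that exposes its structure."""
    msg = email.message_from_string(raw)
    root = ET.Element("message")

    # Headers become <header name="..."> elements.
    headers = ET.SubElement(root, "headers")
    for name, value in msg.items():
        ET.SubElement(headers, "header", name=name.lower()).text = value

    # Split the body from the signature footer, then classify each block.
    body = ET.SubElement(root, "body")
    main, _, footer = msg.get_payload().partition("\n-- \n")
    blocks = [b.strip() for b in main.split("\n\n") if b.strip()]
    for i, block in enumerate(blocks):
        lines = block.splitlines()
        if all(line.startswith(">") for line in lines):
            tag = "quote"
            block = "\n".join(line[1:].strip() for line in lines)
        elif i == 0 and GREETING.match(block):
            tag = "greeting"
        else:
            tag = "para"
        ET.SubElement(body, tag).text = block
    if footer.strip():
        ET.SubElement(body, "footer").text = footer.strip()
    return root


print(ET.tostring(email_to_xml(RAW), encoding="unicode"))
```

Real mail is far messier than this sample (multipart bodies, nested quotes, inconsistent signatures), so a production loader would need much more robust detection than these few rules.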
TCW: MarkMail appears to utilize both structured (author name, listserv name, post title) and unstructured (body of the email) content. When both types of content are married together, we can see things in the data that weren’t obvious before. The way that MarkMail presents the information allows us to visualize listserv information in meaningful ways. And, it allows us to make decisions based on that data.
While structured content is certainly valuable and sometimes easier to use than unstructured content, what can MarkMail do to help us take advantage of unstructured content? Can you provide some examples or potential uses?
JH: We use the internal XML structure for many purposes. For example, when performing a basic search we exclude all the boilerplate footer text. We can do that because we recognize it for what it is.
Of course there are times you may want to include the footer text, maybe only the footer text, like with a contact information lookup service. Imagine a service where you give us a person’s name, and we’ll find their contact information for you by extracting their signature footers, and showing you the most recent at the top. We can do this because we can mix the (irregular and unpredictable but still present) structure in the email body with the (more regular but still not fully predictable) structure of the metadata telling us who posted the mail and when it was sent.
Another example is our opt:noquote search option. It indicates you want to find text matches in original text only, not quoted text. So take this search as an example:
“godwin’s law” opt:noquote
It finds emails where the sender wrote the phrase “godwin’s law” and excludes emails where the sender only quoted someone else writing the phrase. Because we see quote structure in emails, this is easy. We use the quote structure for other things too, like to color code the email messages when we display them.
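The effect of a no-quote option can be sketched in a few lines of Python. This is only an illustration under assumed names: the `para` and `quote` elements, the sample messages, and the `matches` and `search` helpers are hypothetical, and MarkMail itself runs such queries in XQuery against MarkLogic Server rather than by scanning documents like this.

```python
import xml.etree.ElementTree as ET

# Made-up messages: 1 uses the phrase in original text, 2 only quotes it,
# 3 never mentions it.
MESSAGES = [
    "<message id='1'><body><para>Godwin's law strikes again.</para></body></message>",
    "<message id='2'><body><quote>Godwin's law strikes again.</quote>"
    "<para>Please stay on topic.</para></body></message>",
    "<message id='3'><body><para>Nothing to see here.</para></body></message>",
]


def matches(message_xml, phrase, noquote=False):
    """Return True if the phrase occurs in the searchable parts of the body."""
    body = ET.fromstring(message_xml).find("body")
    # With noquote set, quoted blocks are simply left out of the searched text.
    tags = {"para"} if noquote else {"para", "quote"}
    text = " ".join(el.text or "" for el in body if el.tag in tags)
    return phrase.lower() in text.lower()


def search(phrase, noquote=False):
    """Return the ids of all sample messages that match."""
    return [
        ET.fromstring(m).get("id")
        for m in MESSAGES
        if matches(m, phrase, noquote)
    ]


print(search("godwin's law"))
print(search("godwin's law", noquote=True))
```

The point of the sketch is that once quote blocks are marked up, excluding them is a matter of choosing which elements to search, not of guessing from raw text at query time.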
For another example, we use structure for our handling of attachment files. Try this search:
It finds emails containing attachments that have a PPT extension and relate somehow to axis (an Apache web services project). You’ll see with the email results that we know not only which attachments include an “axis” term hit but also on how many slides, and when you view the attachment we underline the slides with the hits. That’s all possible because what seems unstructured actually has lots of structure.
TCW: Is MarkMail a product or a service? Or both?
JH: MarkMail today runs as a free service, designed to search public email archives. We’ve had requests from companies, organizations, and individuals who would like to have MarkMail functionality against their private email archives. We’re exploring ways to make that possible.
I think there’s a lot of potential there. Inside Mark Logic we use a private label MarkMail install for our own mailing lists. We have a mailing list dedicated to handling customer support issues. Using MarkMail against that helps speed our support response times. We have another mailing list for technical discussion, and new hires use MarkMail against that list to get up to speed. There’s also a mailing list where we discuss the competitive market. It acts as a knowledge base during sales calls.
TCW: It seems to us that MarkMail could easily be made to work with any set of listservs or discussion groups. Is there a way users can change the lists that MarkMail searches and use it to do research on a specific topic, person, or organization, or perhaps, perform competitive intelligence gathering? And, if not today, do you plan to introduce this feature in the future?
JH: We’re adding new lists every week. We do take requests (click the feedback link on the home page). Once the lists you’re interested in are loaded, you can be very focused in what you search for or generate analytics for. Here’s a cheat sheet on the search syntax:
TCW: Can MarkMail help us find attachments, on listservs that allow them? How does this feature work and what are its limitations?
JH: In MarkMail you can limit your view by attachment type or name. For example, searching for “extension:ppt” limits the result to emails that have Microsoft PowerPoint attachments, while “attach:report” finds emails with attachments having “report” in their name. Most people don’t bother remembering these details because you can use the site’s guided navigation (the analytics pane on the left of a search result) to interactively narrow the search results by attachment type, sender, list, and message type.
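As a rough illustration of how a query string mixing free text with operators such as “extension:ppt”, “attach:report”, and “opt:noquote” could be broken apart, here is a small Python sketch. The parser, the `ParsedQuery` type, and the rule that unknown prefixes are treated as plain terms are assumptions made for this example; the operator names are the only part taken from the interview.

```python
import re
from dataclasses import dataclass, field

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')

# Operators mentioned in the interview; anything else is treated as a plain term.
OPERATORS = {"extension", "attach", "opt"}


@dataclass
class ParsedQuery:
    terms: list = field(default_factory=list)
    filters: dict = field(default_factory=dict)


def parse_query(query):
    """Split a search string into free-text terms and operator filters."""
    parsed = ParsedQuery()
    for phrase, word in TOKEN.findall(query):
        if not word:
            # The token was a quoted phrase; skip it if it is empty.
            if phrase:
                parsed.terms.append(phrase)
            continue
        name, sep, value = word.partition(":")
        if sep and value and name in OPERATORS:
            parsed.filters.setdefault(name, []).append(value)
        else:
            parsed.terms.append(word)
    return parsed


print(parse_query('"godwin\'s law" opt:noquote'))
print(parse_query("roadmap extension:ppt attach:report"))
```

Separating terms from filters this way lets guided navigation and typed operators feed the same structure: clicking an attachment type in a side pane would simply add another entry to `filters`.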
We focused a lot on attachment handling in MarkMail. Any attachment that’s formatted as a PDF, Microsoft Office file, OpenOffice.org file, or text file gets special processing so we can understand the attachment file’s internal structure and use that structure in searches.
When you find an email with an attachment that you want to view, we let you view it within the browser. This always surprises people when I give demos. I’ll click on a .ppt or .doc link and people expect a long download. Instead they see the first slide or page right away. It’s faster, plus when you have a 100 slide PowerPoint with just one slide containing a search hit, it’s a lot more convenient to let us show you the matching slide hit than have you try to find it yourself.
TCW: For those techies who are reading this interview, how did you make the MarkMail interface so awesome? It’s easy to use and works quickly, just like a desktop application. And, the functionality is useful. What’s the secret?
JH: We wanted MarkMail to be immersive, interactive, and fast. Some of the ways we accomplished this:
- Let people start with a simple query and then interactively refine it. Contrast this with an advanced search page where you’re supposed to type everything up front.
- Let the user hit “n” and “p” to go to the next and previous result. This really speeds the process of looking for answers. Now when I use Google I find myself wanting to hit “n” to navigate all the results in place.
- Sorry, I can’t tell you our last trick. We haven’t gone live with it yet. Let me just say we’ve found a great way to make the reading of each email a lot faster.
TCW: One thing that seems oddly missing from MarkMail is the ability for users to save search results as an RSS feed. If users could use the power of MarkMail to run persistent searches that automatically find relevant information and make it available to those who need it, they certainly would find ways to share the results with others. What are your thoughts on the syndication issue?
JH: In our first few weeks out, we’ve heard this feedback from a number of users. So it’s a safe bet that you’ll see RSS (probably Atom) feeds in the very near future.
TCW: Every new product launch seems to introduce new and often confusing terms. MarkMail introduces us to a term that may need a little explanation. What are “lexicons” and why do we need them?
JH: “Lexicon” is the technical name for a core MarkLogic Server feature that lets you find the distinct values of a given XML element, as constrained by some arbitrary query, and report the number of occurrences of each value. That’s a fancy way of saying it’s the magic that lets us quickly chart a histogram for every query you type, as well as print the top occurring list names, sender names, attachment types, and message types. We calculate all those numbers on the fly, thanks to lexicons.
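What a lexicon lookup returns, distinct values with their counts restricted to the documents matching a query, can be mimicked in plain Python. This is a sketch of the result only, under stated assumptions: the sample records, field names, and `value_counts` helper are invented here, and the code scans every record, whereas a real lexicon is an index that answers without reading each document.

```python
from collections import Counter

# Made-up sample records standing in for loaded emails.
EMAILS = [
    {"list": "tomcat-dev", "sender": "alice", "month": "2007-09",
     "text": "servlet container startup bug"},
    {"list": "tomcat-dev", "sender": "bob", "month": "2007-10",
     "text": "servlet filter ordering question"},
    {"list": "ant-user", "sender": "alice", "month": "2007-10",
     "text": "build file for servlet project"},
    {"list": "ant-user", "sender": "carol", "month": "2007-11",
     "text": "classpath task fails"},
]


def value_counts(emails, field_name, query=None):
    """Count the distinct values of one field among emails matching the query."""
    hits = (
        e for e in emails
        if query is None or query.lower() in e["text"].lower()
    )
    return Counter(e[field_name] for e in hits)


# Top lists for a query, then a per-month text histogram for the same query.
print(value_counts(EMAILS, "list", "servlet").most_common())
for month, count in sorted(value_counts(EMAILS, "month", "servlet").items()):
    print(month, "#" * count)
```

Running the same counting step over different fields (list, sender, month) for one query is what makes it possible to show a chart and several "top values" panels side by side for every search.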
TCW: We know of several IT pros who would love to work on such an innovative project. Which begs the question…Are you hiring?
TCW: MarkMail certainly is a much-needed solution. Thanks to your team for thinking outside the box and creating a solution that actually is useful and cool, at the same time. Is there anything you’d like to add?
JH: To the interview or to the site? To the interview: I think we’ve covered it. To the site: we’re just getting started.
TCW: Thanks so much for taking time out of your busy day to speak to our readers about MarkMail. We really appreciate it.
JH: Thanks a lot, Scott. As a last word, let me remind everyone, if there are lists you’d like us to load, write in and let us know.