The Haystack Problem
Finding content in your file system or content repository is hard enough when you’ve got simple text documents to deal with. When you’re using DITA — the Darwin Information Typing Architecture — and other component-oriented XML standards, you increase the difficulty by two or three orders of magnitude, because you’re looking for smaller needles in bigger haystacks. Having thousands of media-independent content objects that can be shared and reused across multiple deliverables allows you to create more sophisticated knowledge products, but it definitely poses a challenge in findability for content authors.
Sounds like a job for metadata and taxonomy.
Among its many features for content reuse, DITA provides content creators with a facility for tagging content objects with metadata. Metadata—literally “data about the data”—lets content authors and others who manage content describe what the content is about (descriptive metadata), as well as assign properties like who created the content, when, in what language, and for which audience (administrative metadata). A taxonomy is a hierarchical structure that organizes concepts and controls vocabulary. Taxonomies allow organizations to create and centrally manage important terms that can be applied to content as metadata. For example, a telecommunications manufacturer might have a taxonomy that includes concepts such as product categories (Mobile Phones, Wireless Routers, and so on), industries (Healthcare, Utilities, Transportation, and so on), or product models.
Once applied, this metadata and taxonomy can be leveraged by a search application to help users—internal or external—find and use content. Search engines can use taxonomy to organize search results in meaningful ways, such as refining search based upon certain properties (faceted search) and suggesting related searches based upon relationships between search terms and other concepts in the taxonomy.
Seems a natural fit—DITA and taxonomy, like peanut butter and chocolate: DITA creates a multitude of reusable components, and taxonomy helps describe and organize the components so that they may be readily found and reused by content authors and users. One would think that organizations getting involved in DITA and metadata structures would quickly see the benefits of taxonomy. We recently tested this hypothesis in a benchmarking study into the adoption of metadata and taxonomy.
XML, DITA, and Metadata Maturity
In June 2009, Earley & Associates and Taxonomy Strategies surveyed 270 organizations to discover how mature they were on the scale of best practices for enterprise search, taxonomy, and metadata. We found that organizations that use XML for structured authoring showed more widespread adoption of metadata best practices than average.
Part of our hypothesis was confirmed. It would seem to follow that adoption of taxonomies and descriptive metadata would be widespread among organizations using DITA, since metadata-based search would improve the findability of content objects. However, in follow-up interviews within the DITA community, where we went looking for metadata success stories, we discovered that in practice few organizations use DITA’s embedded metadata capabilities, and fewer still do anything with taxonomies to organize descriptive metadata for DITA content, even for use by content management systems (CMS).
DITA Support for Metadata
Is it a matter of inadequate metadata and taxonomy support within DITA? Compared to other XML standards, DITA provides a relatively rich and extensible framework for embedding metadata directly within the XML objects themselves. The embedded metadata can be used by processing tools—like the publishing tools in the DITA Open Toolkit (DOTK)—to conditionally publish content or to create metadata in the final outputs, like HTML.
DITA objects—both topics and maps—have a prolog section in which metadata can be specified. Within the prolog, the metadata section can define metadata about the topic itself—such as the intended audience, the platform (for defining the applicability of the topic to specific hardware or operating systems), and so on. This metadata can be used for conditional publishing. For example, you can automate the production of a Linux version of your documentation by only outputting topics and maps that set platform to “Linux” in the metadata.
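The Linux example above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the DITA Open Toolkit's own filtering mechanism: the topic files, their contents, and the helper names `platforms` and `select_for_platform` are all hypothetical, and the topics are simplified (real DITA topics carry a DOCTYPE and more markup).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified DITA topics keyed by an assumed file name.
TOPICS = {
    "install_linux.dita": """
<topic id="install_linux">
  <title>Installing on Linux</title>
  <prolog>
    <metadata>
      <audience type="administrator"/>
      <prodinfo>
        <prodname>Example Product</prodname>
        <platform>Linux</platform>
      </prodinfo>
    </metadata>
  </prolog>
  <body><p>Run the installer script.</p></body>
</topic>""",
    "install_windows.dita": """
<topic id="install_windows">
  <title>Installing on Windows</title>
  <prolog>
    <metadata>
      <prodinfo>
        <prodname>Example Product</prodname>
        <platform>Windows</platform>
      </prodinfo>
    </metadata>
  </prolog>
  <body><p>Run the setup program.</p></body>
</topic>""",
    "overview.dita": """
<topic id="overview">
  <title>Overview</title>
  <body><p>No platform metadata is set here.</p></body>
</topic>""",
}


def platforms(topic_xml):
    """Return the set of platform values declared in a topic's prolog."""
    root = ET.fromstring(topic_xml.strip())
    found = root.findall("./prolog/metadata/prodinfo/platform")
    return {el.text.strip() for el in found if el.text}


def select_for_platform(topics, platform):
    """Keep only the topics whose prolog sets the requested platform."""
    return sorted(name for name, xml in topics.items()
                  if platform in platforms(xml))


print(select_for_platform(TOPICS, "Linux"))
```

Following the article's wording, this sketch keeps only topics that explicitly set the platform, so the overview topic with no platform metadata is left out. A production pipeline would need a deliberate policy for such untagged content.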
DITA objects can also embed administrative metadata about the author, copyright holder, source, publisher, and so on. Metadata can also contain descriptive keywords for the topic or map. Keywords and index terms are output to HTML or XHTML as metadata keywords to support search engines. Authors can also define index terms for the automatic generation of back-of-book indices.
DITA also enables users to define custom metadata fields within the othermeta element. Like keywords, metadata defined as othermeta is output as HTML metadata elements but ignored for other types of output like PDF.
Clearly, metadata is a powerful tool for managing and publishing DITA content. In practice, however, embedded DITA metadata is used mainly to drive conditional publishing, a use that is fairly ubiquitous. Beyond that, most organizations don’t use DITA metadata, relying instead on content management system (CMS) metadata to manage workflow and search of content objects. The use of taxonomies to manage vocabularies and organize the concepts for descriptive metadata is even rarer.
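To make the keyword and othermeta output described above more concrete, here is a rough Python sketch that turns a prolog into HTML meta tags. The prolog content, the field names `product-line` and `review-status`, and the function `html_meta_tags` are assumptions made for illustration; the exact markup produced by a real publishing tool may differ.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical prolog with keywords and two custom othermeta fields.
PROLOG = """
<prolog>
  <metadata>
    <keywords>
      <keyword>wireless router</keyword>
      <keyword>firmware</keyword>
    </keywords>
    <othermeta name="product-line" content="Routers"/>
    <othermeta name="review-status" content="approved"/>
  </metadata>
</prolog>"""


def html_meta_tags(prolog_xml):
    """Build HTML <meta> tags from keyword and othermeta elements."""
    root = ET.fromstring(prolog_xml.strip())
    tags = []

    # All keywords are combined into a single "keywords" meta tag.
    keywords = [k.text.strip() for k in root.iter("keyword") if k.text]
    if keywords:
        content = escape(", ".join(keywords))
        tags.append('<meta name="keywords" content="%s">' % content)

    # Each othermeta element becomes its own name/content meta tag.
    for other in root.iter("othermeta"):
        name = escape(other.get("name", ""))
        content = escape(other.get("content", ""))
        tags.append('<meta name="%s" content="%s">' % (name, content))
    return tags


for tag in html_meta_tags(PROLOG):
    print(tag)
```

The same prolog would contribute nothing to a PDF build, which is the asymmetry the paragraph above describes.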
While we had great difficulty finding any organizations using DITA metadata for purposes beyond conditional publishing, we did confirm that organizations that use DITA often make extensive use of CMS metadata. CMS metadata can be very rich, especially in component content management systems designed to handle DITA content. Content management systems often provide mechanisms for automating metadata, assuring that it is applied to content more consistently, in turn making it more valuable for search. For example, information about the author, publisher, or copyright holder can be set automatically by the CMS without author intervention. As a result, CMS metadata is used instead of DITA metadata for common applications such as:
- Search – Administrative metadata, such as author or version, is typically automated and presented as search options or facets in the CMS search interface. However, just using administrative metadata misses much of the power of faceted search.
- Workflow automation – Metadata about the content lifecycle state (for instance, “draft,” “ready for review,” “ready for localization”) can be set by authors to trigger production and editorial workflows in the CMS.
- Publishing automation – A DITA-aware CMS can automatically set embedded DITA metadata to provide the DITA Open Toolkit with metadata for publishing automation, such as the metadata for conditional publishing. CMS metadata can also be exported with DITA content, as a separate XML file or “sidecar” to be used by other tools to process the DITA content.
- Dynamic publishing – Descriptive metadata can be used to present content objects dynamically to end users.
So we did find that organizations using DITA take advantage of CMS metadata for finding objects and producing deliverables, but mostly on the administrative side or for conditional publishing.
When we asked about taxonomy, even the most mature organization was just getting started with it.
Why don’t we see more DITA users adopting taxonomy to improve findability and reuse of DITA objects? Some are, without being aware of it. Many organizations are creating product documentation, and they achieve high levels of structural reuse: reuse that flows naturally from the structure of the product line. For example, we interviewed one large information-technology hardware, software, and service firm with millions of DITA topics in use across the company. They report 80 percent reuse, but attribute this largely to the modular nature of their products; content reuse follows the Bill of Materials. Reusable content is easy to find by browsing the folder structure of the CMS, which is organized around the company’s product lines. The authors who created the content are also the ones who organize the CMS folder structure, file the content, and reuse it. Authors wrote it, they need to reuse it, and they know where it is. Search is helpful, but not critical to reuse.
This scenario does, in fact, describe the simplest and most limited use of taxonomy: the product taxonomy, reflected in the folder structure. However, in this case, taxonomy is used only for browsing, not for controlling metadata vocabularies. Putting a piece of content in the “Product X” folder is not the same as assigning metadata to that content so that a search engine can index it as “about Product X.” In practice, folders may be all they need. A company with 80 percent structural reuse can expect only marginal improvement in reuse rates from metadata-based search (since you can’t double 80 percent).
Software companies, the most common DITA users, typically report 40 percent reuse rates, with fewer opportunities for structural reuse, because there is less common content among products. Here the opportunities to increase reuse are in other departments, such as training and marketing. Improved metadata-based search becomes the key for reuse to happen across departments: unlike structural reuse inside the technical publications department, the people who need the content didn’t write it, and they don’t know where it is (or even if it exists).
So when content can be successfully discovered by browsing, descriptive metadata is useful but not critical to reuse. But when reuse depends on search, metadata and a usable taxonomy that spans departments become essential to achieving higher reuse rates.
Metadata Enables Dynamic Publishing of Content
During our interviews with DITA thought leaders, one emerging opportunity for using metadata and taxonomy came to our attention: dynamic publishing. A major benefit of DITA is creating content that is media-independent. It also enables content objects to be organized by DITA maps, so that content can be recombined and re-sequenced into different deliverables. DITA maps provide flexibility, but at the end of the day, they are still as static as the Table of Contents of a printed book. Many organizations are beginning to experiment with dynamic publishing, in which the selection and sequencing of content are done at run-time, dynamically, and independent of a map. Dynamic publishing breaks the book metaphor.
Dynamic publishing lets content be chosen and presented to meet the unique needs of a user or situation. To best illustrate dynamic publishing, let’s contrast it with static publishing of a help system. In a statically published help system, the hierarchy of topics is fixed by the author and the selection of content is limited to what is in the DITA map at publish time. All of the related topics are manually linked. If an author wants to add a related topic, she needs to manually add the link (or update the related-links table) and republish. The publishing process creates a deliverable that—while interactive—is static with respect to its contents and the relationships among them.
To create the same help system with dynamic publishing, the author would publish her content to a server, but she would not create the structure and relationships between topics at publish-time. Instead, a taxonomy would specify the relationship between concepts and properties that are defined in metadata. The relationships among topics are generated at run-time, based upon metadata on the topics. The richer the metadata and the more complete the taxonomy, the more sophisticated the user experience.
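One way to picture run-time relationships is the small Python sketch below. It assumes a toy taxonomy stored as child-to-parent links and a set of topics tagged with subject concepts; all concept names, topic IDs, and the rule "topics are related when their concepts or direct parent concepts overlap" are illustrative assumptions, not a prescribed algorithm.

```python
# Hypothetical taxonomy: each concept maps to its parent (None = top level).
PARENT = {
    "Products": None,
    "Wireless Routers": "Products",
    "Mobile Phones": "Products",
    "Router X100": "Wireless Routers",
    "Router X200": "Wireless Routers",
    "Phone P1": "Mobile Phones",
}

# Hypothetical topics on the server, each tagged with subject concepts.
TOPIC_SUBJECTS = {
    "x100_setup": {"Router X100"},
    "x200_setup": {"Router X200"},
    "router_security": {"Wireless Routers"},
    "p1_setup": {"Phone P1"},
}


def neighborhood(subjects):
    """Expand a topic's concepts with each concept's direct parent."""
    expanded = set(subjects)
    for concept in subjects:
        parent = PARENT.get(concept)
        if parent is not None:
            expanded.add(parent)
    return expanded


def related_topics(topic_id, topic_subjects):
    """Compute related topics on demand from metadata, with no stored links."""
    own = neighborhood(topic_subjects[topic_id])
    return sorted(other for other, subjects in topic_subjects.items()
                  if other != topic_id and own & neighborhood(subjects))


print(related_topics("x100_setup", TOPIC_SUBJECTS))

# Publishing a new topic needs only metadata; no author edits any links.
TOPIC_SUBJECTS["x200_firmware"] = {"Router X200"}
print(related_topics("x100_setup", TOPIC_SUBJECTS))
```

Because the relationships are recomputed from metadata on every request, the newly published firmware topic shows up as related content without anyone republishing the older topics.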
We all have experienced faceted search on consumer web sites, where we can refine search results by selecting specific values for different attributes, such as the number of megapixels for a camera. This experience is driven by metadata. With rich metadata on DITA content, we can create very sophisticated electronic content browsers, where metadata-based search creates browser-like user experiences. In the past, IETMs—interactive electronic technical manuals—required manually creating links and weaving together content. Dynamically published IETMs enable users to navigate through content with dynamic, task-focused pathways based upon metadata. When new content is published to the server, it can find its place within the IETM based upon its metadata.
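Faceted refinement of this kind is simple to sketch once topics carry consistent metadata. The Python example below assumes a small in-memory catalog; the topic IDs, facet names, and values are invented for illustration, and a real search engine would compute the same counts from its index.

```python
from collections import Counter

# Hypothetical catalog: each topic carries a few metadata fields (facets).
CATALOG = [
    {"id": "x100_setup", "product": "Router X100",
     "audience": "administrator", "platform": "Linux"},
    {"id": "x100_quickstart", "product": "Router X100",
     "audience": "end user", "platform": "Windows"},
    {"id": "x200_setup", "product": "Router X200",
     "audience": "administrator", "platform": "Linux"},
    {"id": "p1_guide", "product": "Phone P1",
     "audience": "end user", "platform": "Android"},
]


def facet_counts(topics, facet):
    """Count how many topics carry each value of the given facet."""
    return Counter(t[facet] for t in topics if facet in t)


def refine(topics, **selected):
    """Keep topics whose metadata matches every selected facet value."""
    return [t for t in topics
            if all(t.get(name) == value for name, value in selected.items())]


# Show the audience facet, then narrow to administrators and recount.
print(facet_counts(CATALOG, "audience"))
admin_topics = refine(CATALOG, audience="administrator")
print([t["id"] for t in admin_topics])
print(facet_counts(admin_topics, "product"))
```

The counts shown beside each facet value are only as trustworthy as the tagging behind them, which is why controlled vocabularies matter for this kind of interface.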
“There’s a really good business case around dynamic publishing,” says Chip Gettinger, Vice President of XML Solutions at SDL XySoft. “When you create flat output—like a PDF or a help system—metadata is useful, but it is absolutely critical for dynamic content. Dynamic content can justify DITA adoption by enabling a broader range of uses for DITA content.”
So while we found some basic best practices in use during our research, there is a case to be made for more extensive use of taxonomy and metadata by organizations that use DITA. If you want to increase reuse across departments or enable dynamic publishing, you probably have some work ahead of you. Here are some best practices to get you heading in the right direction:
- Start by identifying all your taxonomy use cases – You will be using taxonomy not only for authors to search content objects for reuse but also potentially for serving up content to users dynamically or in a faceted interface. These perspectives will provide you with the framework for your taxonomy.
- Reuse existing vocabulary – Many organizations already use controlled vocabularies for some metadata fields such as organization, audience, platform, and product but still rely on keywords supplied by authors for other descriptive metadata, such as subject. Look to existing sources for tagging your content, such as fault classification schemes (from the service hotline), hierarchical product or system models (from engineering), or hierarchical task models (from instructional/task analysis from the training organization) as places to start building hierarchical descriptive taxonomies.
- Authors are the best people to apply descriptive metadata – After all, they do the analysis to determine what content was required in the first place, so they have the best context for classifying it. However, they aren’t librarians. Don’t expect authors to tag a lot: automate tagging when possible—especially for administrative metadata (author, organization, creation date, language).
- Leverage the technology – Many content management systems can integrate third-party classification servers for automating descriptive metadata. These servers can automatically apply metadata from a taxonomy or controlled vocabulary when content topics are checked in, then automatically populate subject metadata fields in the CMS. The metadata can in turn be reviewed and manually adjusted by authors. This metadata can be embedded into your DITA content for use in conditional publishing or to generate HTML tags in the final output to support search or dynamic publishing.
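As a rough illustration of the last two practices, the Python sketch below suggests subject terms from a controlled vocabulary and then lets an author adjust them. Simple phrase matching stands in for whatever a real classification server does, which is typically far more sophisticated; the vocabulary, trigger phrases, and the functions `suggest_subjects` and `review` are all hypothetical.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> trigger phrases.
VOCABULARY = {
    "Wireless Routers": ["wireless router", "wi-fi router", "access point"],
    "Mobile Phones": ["mobile phone", "smartphone", "handset"],
    "Healthcare": ["hospital", "clinic", "patient"],
}


def suggest_subjects(text):
    """Suggest preferred terms whose trigger phrases appear in the text."""
    lowered = text.lower()
    suggested = []
    for term, phrases in VOCABULARY.items():
        # Whole-word match so that "handset" does not match "handsets".
        if any(re.search(r"\b" + re.escape(p) + r"\b", lowered)
               for p in phrases):
            suggested.append(term)
    return sorted(suggested)


def review(suggested, add=(), remove=()):
    """Apply an author's manual adjustments to the suggested terms."""
    return sorted((set(suggested) | set(add)) - set(remove))


text = "Mount the access point in the clinic hallway, then pair each handset."
auto = suggest_subjects(text)
print(auto)
print(review(auto, add=["Utilities"], remove=["Mobile Phones"]))
```

Only preferred terms are ever stored, so authors and search users see one consistent label no matter which synonym appeared in the content.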
About the Authors
Paul Wlodarczyk, Director, Solutions Consulting – Paul helps clients compete by improving their content lifecycles – business processes and workflows that span the collection, collaboration, authoring, assembly, styling, review, localization, publishing, reuse, management, and search of unstructured content.
Paul brings over 25 years’ experience in content lifecycle operations, consulting, and software development, with expertise in the areas of enterprise content management, knowledge management, technical publishing, localization, collaboration, user interface design, learning technologies, and information worker productivity.
Stephanie Lemieux, Taxonomy Practice Lead – Stephanie has a Master’s in Library and Information Studies (MLIS) from McGill University, specializing in knowledge and content management, taxonomy, and information architecture. For the past several years, she has been working on taxonomy & knowledge management contracts and research projects for a variety of clients.
Recent projects include the development of a global corporate taxonomy and its implementation in a content management system, the creation of faceted search taxonomies for large e-commerce websites, and a digital asset management taxonomy.