By Michael Gross, DCLNews

In the early days of digitizing information, five years ago, it was enough simply to make more and more content electronic, but that is no longer the case. With the ever-growing mounds of data out there, it’s not enough to create more ‘electronic paper.’

There’s a tremendous need to enhance the information so it can be more readily found, more easily accessed, and more easily reorganized. Content tagging in XML and SGML is key in this effort.

This article discusses content tagging, and how one might incorporate this enhanced information into legacy documents that were not written with these tags in mind.

What is Content Tagging?

When documents are converted to XML (or SGML), part of the conversion process is to create the tags that make XML so useful. Most of the created XML tags are there to replicate the printed structure of the document. These “appearance” tags describe constructs such as sections, lists, captions, paragraphs, and tables. “Content tagging,” on the other hand, refers to tagging that is based on the semantic meaning of the content. For example, a content tag in a maintenance manual might identify a word or a phrase such as a tool or a part number. In a life sciences technical document, there may be a <genus> or <species> tag to apply to phrases that contain the genus or species classification information for a certain type of animal.
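For instance, the same sentence might be tagged first for appearance and then for content (the element names here are purely illustrative, not drawn from any particular DTD):

```xml
<!-- Appearance tagging: records only that the phrase was italic -->
<para>The house cat, <i>Felis catus</i>, is a small carnivore.</para>

<!-- Content tagging: records what the phrase actually is -->
<para>The house cat, <species>Felis catus</species>, is a small carnivore.</para>
```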

Two recently popular technical markup schemes, the Darwin Information Typing Architecture (DITA) and S1000D, both require content tagging at a fairly high level. Going further, creating S1000D documents requires that the content be decomposed into modules, with each module requiring a Data Module Code that identifies where a particular module exists within the overall documentation set. This Data Module Code is also a form of content tagging, since it describes what the module is about. Similarly, in DITA, content is decomposed into topics, and those topics are identified by type (e.g. tasks, concepts, and references). Deciding which topic type to apply to a particular DITA topic requires content knowledge, since topic types are determined from the semantic information inside the topic.
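As a rough illustration of how an automated pass might suggest a DITA topic type from a section heading, consider the sketch below. The cue lists are invented for illustration; in practice they are tuned to the document set, and anything ambiguous still goes to a human reviewer.

```python
# Heuristic topic-type suggestion from a heading. The cue words are
# assumptions for this example, not a standard DITA rule set.
TASK_CUES = ('how to', 'installing', 'removing', 'replacing', 'procedure')
REFERENCE_CUES = ('specifications', 'parts list', 'reference', 'table of')

def suggest_topic_type(heading):
    """Return a best-guess DITA topic type for a section heading."""
    h = heading.lower()
    if any(cue in h for cue in TASK_CUES):
        return 'task'
    if any(cue in h for cue in REFERENCE_CUES):
        return 'reference'
    return 'concept'  # default guess; in practice, flag for review

print(suggest_topic_type('Removing the Fuel Pump'))   # task
print(suggest_topic_type('Torque Specifications'))    # reference
print(suggest_topic_type('Theory of Operation'))      # concept
```

A real project would track how confident each suggestion is, so that only the weak guesses consume reviewer time.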

Finally, content tagging can often mean having to take content tables (tables that have a particular appearance) and decompose that appearance into a particular set of content tags designed to hold that information. As an example, a technical maintenance document may contain a table of required tools. Converting this type of document requires all of the information in those tables to be broken down into individual content tags, typically one per column of the table. This type of content tagging can be very powerful because the information contained within the table can help make for a more efficient repair process. However, because these documents were probably not written with content tagging in mind, converting these legacy tables into this type of structure presents a challenge.
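To sketch the idea, the fragment below decomposes a hypothetical “required tools” table into content tags. The source markup and the target tag names (toolname, partno, qty) are assumptions made up for this example; an actual schema would dictate them.

```python
# Sketch: turn a legacy table of required tools into content-tagged XML.
# Table structure and tag names are hypothetical for illustration.
import xml.etree.ElementTree as ET

LEGACY_TABLE = """
<table>
  <row><cell>Tool Name</cell><cell>Part Number</cell><cell>Qty</cell></row>
  <row><cell>Torque wrench</cell><cell>TW-1050</cell><cell>1</cell></row>
  <row><cell>Feeler gauge</cell><cell>FG-220</cell><cell>2</cell></row>
</table>
"""

# Map each recognized column heading to a (hypothetical) content tag.
HEADING_TO_TAG = {'Tool Name': 'toolname', 'Part Number': 'partno', 'Qty': 'qty'}

def table_to_content_tags(xml_text):
    rows = ET.fromstring(xml_text).findall('row')
    headings = [c.text for c in rows[0].findall('cell')]
    tags = [HEADING_TO_TAG[h] for h in headings]
    tools = ET.Element('reqtools')
    for row in rows[1:]:
        tool = ET.SubElement(tools, 'tool')
        for tag, cell in zip(tags, row.findall('cell')):
            ET.SubElement(tool, tag).text = cell.text
    return ET.tostring(tools, encoding='unicode')

print(table_to_content_tags(LEGACY_TABLE))
```

Note that this sketch assumes every table follows one template exactly; as discussed below, legacy documents rarely cooperate so fully.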

Legacy Document Content Tagging Conversion

The issue with deriving content tags, as useful as they are, is that they don’t usually exist explicitly in the document; rather, they need to be inferred. This process usually requires a combination of automated tools, which can get you much of the way there, coupled with a manual review process (because tools cannot effectively deal with every situation) and the occasional use of a Subject Matter Expert (SME) for some level of review. Since SME time is usually quite expensive, limiting his or her involvement is an important strategy.
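This division of labor can be sketched as a simple triage: the automated pass tags everything it recognizes, and everything else lands in a review queue for a person (and, as a last resort, the SME). The lookup table below is invented for illustration.

```python
# Triage sketch: auto-tag known terms, queue the rest for human review.
# The term list and tag names are assumptions for this example.
KNOWN_TERMS = {'torque wrench': 'tool', 'TW-1050': 'partnumber'}

def triage(phrases):
    """Split phrases into (auto-tagged, needs-review) lists."""
    auto, review = [], []
    for p in phrases:
        tag = KNOWN_TERMS.get(p.lower()) or KNOWN_TERMS.get(p)
        (auto if tag else review).append((p, tag))
    return auto, review

auto, review = triage(['Torque wrench', 'TW-1050', 'spanner assembly'])
# 'spanner assembly' is not in the lookup, so it lands in the review queue
```

The point of the structure is economic: each pass shrinks the pile that the next, more expensive, reviewer has to touch.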

To help illuminate successful approaches we’ve used, the following sections expand on the examples referred to earlier:

Example 1: Converting information into tags such as the <genus> and <species> tags is often helped by the fact that, in the source documents, these classifications might already be composed in italics, which helps limit the scope of the analysis that needs to be done. If we are lucky, we might find an exhaustive list of all of the genus and species names that appear in the source document, and the content tagging then becomes a task of matching italicized text to the list (note: we’re usually not that lucky).
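Assuming we do have such a list, the matching step might be sketched as follows: italic runs that match the list are retagged, and everything italic but unmatched is left alone and queued for manual review. The element names are illustrative only.

```python
# Sketch: retag italic runs that match a known list of taxonomic names.
# KNOWN_TAXA and the <species> tag name are assumptions for illustration.
import xml.etree.ElementTree as ET

KNOWN_TAXA = {'Felis catus', 'Canis lupus'}

def tag_taxa(xml_text):
    """Retag matching <i> runs as <species>; return (xml, unmatched runs)."""
    root = ET.fromstring(xml_text)
    unmatched = []
    for el in root.iter('i'):
        text = (el.text or '').strip()
        if text in KNOWN_TAXA:
            el.tag = 'species'
        else:
            unmatched.append(text)
    return ET.tostring(root, encoding='unicode'), unmatched

doc = '<para><i>Felis catus</i> preys on <i>Mus musculus</i>.</para>'
tagged, todo = tag_taxa(doc)
# todo == ['Mus musculus'] -> goes to the manual review pile
```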

Example 2: With S1000D modules and DITA topics, completely identifying module types and topic types requires a thorough understanding of what the text is about. Automated tools and text patterns can help highlight clues within the text, but a person is usually needed to review the text and select the proper classifications. This can be done fairly easily by supplying the client with a table of all section headings within the legacy documents. The client can then review the tables and complete them as needed. This review typically requires someone familiar with the materials, but not necessarily a Subject Matter Expert (SME), with an SME reviewing only those items that remain ambiguous.
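The heading table itself is straightforward to generate. A minimal sketch, assuming the legacy documents are already in a simple XML form with title elements (the element names and column labels are assumptions for this example):

```python
# Sketch: extract all section headings into a CSV table for client review.
import csv
import io
import xml.etree.ElementTree as ET

doc = """
<manual>
  <section><title>Removing the Fuel Pump</title></section>
  <section><title>Theory of Operation</title></section>
</manual>
"""

def headings_review_table(xml_text):
    """Build a two-column CSV: heading, blank cell for the client to fill in."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(['Heading', 'Module/Topic Type (to be filled in)'])
    for title in ET.fromstring(xml_text).iter('title'):
        writer.writerow([title.text, ''])
    return out.getvalue()

print(headings_review_table(doc))
```

The resulting spreadsheet gives the client a single artifact to review, rather than asking them to page through the documents themselves.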

Example 3: With our final example, the decomposition of content tables, the required level of effort can vary widely. In the best case, all tables were set up using the same table template. Converting these tables to content tagging can be rather simple, because this type of table can be identified by examining its column headings, and each column heading will typically translate into a tag (or series of tags).

More typically, tables were created using “similar” column headings, with some columns added or deleted as needed by the author of the table, without following a rigorous template. In those cases it becomes important to determine which column headings actually identify the table type, and which ones are optional. Even in the best case, there will be exceptions, such as notes (or occasional extra cells) used to convey additional information that the author felt was necessary to include but for which there is no simple content tag. For those cases, a strategy will need to be worked out with the client on a case-by-case basis.
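One workable approach is a synonym map that normalizes variant headings to a canonical tag and flags any heading it does not recognize. The map below is invented for illustration; in practice it is built up during the document analysis phase.

```python
# Sketch: normalize "similar" column headings to canonical content tags.
# The synonym map and tag names are assumptions for this example.
CANONICAL = {
    'tool name': 'toolname', 'tool': 'toolname',
    'part number': 'partno', 'part no.': 'partno', 'p/n': 'partno',
    'qty': 'qty', 'quantity': 'qty',
}

def classify_columns(headings):
    """Return (canonical tags, headings with no known mapping)."""
    tags, unknown = [], []
    for h in headings:
        tag = CANONICAL.get(h.strip().lower())
        (tags if tag else unknown).append(tag or h)
    return tags, unknown

tags, unknown = classify_columns(['Tool', 'P/N', 'Quantity', 'Remarks'])
# 'Remarks' has no mapping, so it is flagged for case-by-case handling
```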


As our discussion illustrates, content tagging covers a broad range of issues that may need to be addressed in implementing a successful legacy conversion to a modern XML markup scheme. These issues often have to be dealt with on a case-by-case basis, automating where possible; but when it comes to content tagging, a manual review is often needed to detect anomalies that are simply not practical to handle via automated methods.

About the Author

Michael Gross is the Chief Technology Officer and Director of Research and Development for Data Conversion Laboratory (DCL). He is responsible for all software-related issues, including product evaluations, feasibility studies, technical client support, and management of in-house software development. He has been solving digital publishing conversion problems at DCL for twenty years and has overseen thousands of legacy conversion projects.