By Diane Hillmann, Dublin Core Metadata Initiative
What is Metadata?
Metadata has been with us since the first librarian made a list of the items on a shelf of handwritten scrolls. The term “meta” comes from a Greek word that denotes “alongside, with, after, next.” More recent Latin and English usage would employ “meta” to denote something transcendental, or beyond nature. Metadata, then, can be thought of as data about other data. It is the Internet-age term for information that librarians traditionally have put into catalogs, and it most commonly refers to descriptive information about Web resources.
A metadata record consists of a set of attributes, or elements, necessary to describe the resource in question. For example, a metadata system common in libraries—the library catalog—contains a set of metadata records with elements that describe a book or other library item: author, title, date of creation or publication, subject coverage, and the call number specifying location of the item on the shelf.
The linkage between a metadata record and the resource it describes may take one of two forms:
- elements may be contained in a record separate from the item, as in the case of the library’s catalog record; or
- the metadata may be embedded in the resource itself
Examples of embedded metadata carried along with the resource itself include the Cataloging In Publication (CIP) data printed on the verso of a book’s title page and the TEI header in an electronic text. Many metadata standards in use today, including the Dublin Core standard, do not prescribe either type of linkage, leaving the decision to each particular implementation.
Although the concept of metadata predates the Internet and the Web, worldwide interest in metadata standards and practices has exploded with the increase in electronic publishing and digital libraries, and the concomitant “information overload” resulting from vast quantities of undifferentiated digital data available online. Anyone who has attempted to find information online using one of today’s popular Web search services has likely experienced the frustration of retrieving hundreds, if not thousands, of “hits” with limited ability to refine the search or make it more precise. The wide-scale adoption of descriptive standards and practices for electronic resources will improve retrieval of relevant resources in any venue where information retrieval is critical. As noted by Weibel and Lagoze, two leaders in the field of metadata development:
“The association of standardized descriptive metadata with networked objects has the potential for substantially improving resource discovery capabilities by enabling field-based (e.g., author, title) searches, permitting indexing of non-textual objects, and allowing access to the surrogate content that is distinct from access to the content of the resource itself.” (Weibel and Lagoze, 1997)
In recent years we have also seen an increase in the application of Dublin Core metadata in more closed environments. There are implementations where Dublin Core metadata is used to describe resources held, owned or produced by companies, governments and international organisations to support portal services or internal knowledge management. There are also implementations where Dublin Core metadata is used as a common exchange format supporting the aggregation of collections of metadata, as in the case of the Open Archives Initiative. In these cases, as in the open environment of the Web, the concept of standardized descriptive metadata provides a powerful mechanism to improve retrieval for specific applications and specific user communities. It is this need for “standardized descriptive metadata” that the Dublin Core addresses.
What is the Dublin Core?
The Dublin Core metadata standard is a simple yet effective element set for describing a wide range of networked resources. The Dublin Core standard includes two levels: Simple and Qualified. Simple Dublin Core comprises fifteen elements; Qualified Dublin Core includes an additional element, Audience, as well as a group of element refinements (also called qualifiers) that refine the semantics of the elements in ways that may be useful in resource discovery. The semantics of Dublin Core have been established by an international, cross-disciplinary group of professionals from librarianship, computer science, text encoding, the museum community, and other related fields of scholarship and practice.
Another way to look at Dublin Core is as a “small language for making a particular class of statements about resources.” In this language, there are two classes of terms, elements (nouns) and qualifiers (adjectives), which can be arranged into a simple pattern of statements. The resources themselves are the implied subjects in this language. (For additional discussion of Dublin Core grammar, see “DCMI Grammatical Principles.”) In the diverse world of the Internet, Dublin Core can be seen as a “metadata pidgin for digital tourists”: easily grasped, but not necessarily up to the task of expressing complex relationships or concepts.
The Dublin Core basic element set is an excellent resource for those seeking to understand elements and their usage. Each element is optional and may be repeated. Most elements also have a limited set of qualifiers or refinements, attributes that may be used to further refine (not extend) the meaning of the element. The Dublin Core Metadata Initiative (DCMI) has established standard ways to refine elements and encourage the use of encoding and vocabulary schemes. The full set of elements and element refinements conforming to DCMI “best practice” is available, with a formal registry in process.
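The rule that each element is optional and may be repeated can be made concrete with a small data structure. The following is a minimal sketch, not a DCMI-defined API: the `DCRecord` class and its method names are hypothetical, and the sample values are illustrative. The fifteen element names are those of Simple Dublin Core.

```python
from collections import defaultdict

# The fifteen elements of Simple Dublin Core.
SIMPLE_DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


class DCRecord:
    """Hypothetical container in which every element is optional and repeatable."""

    def __init__(self):
        self._values = defaultdict(list)

    def add(self, element, value):
        # Reject names outside the simple element set.
        if element not in SIMPLE_DC_ELEMENTS:
            raise ValueError(f"not a Simple Dublin Core element: {element}")
        self._values[element].append(value)

    def get(self, element):
        # An element that was never set simply yields an empty list.
        return list(self._values.get(element, []))


record = DCRecord()
record.add("title", "A Guide to Growing Roses")
record.add("creator", "First Author")    # illustrative value
record.add("creator", "Second Author")   # repetition is allowed
```

Because no element is required, a record holding only a title is just as valid in this sketch as one that fills in all fifteen.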
Three other Dublin Core principles bear mentioning here, as they are critical to understanding how to think about the relationship of metadata to the underlying resources they describe.
- The One-to-One Principle – In general, Dublin Core metadata describes one manifestation or version of a resource, rather than assuming that manifestations stand in for one another. For instance, a JPEG image of the Mona Lisa has much in common with the original painting, but it is not the same as the painting. As such, the digital image should be described as itself, most likely with the creator of the digital image as Creator or Contributor, rather than the painter of the original Mona Lisa. The relationship between the metadata for the original and the reproduction is part of the metadata description, and helps users determine whether they need to go to the Louvre for the original or whether a reproduction will meet their need.
- The Dumb-down Principle – The qualification of Dublin Core properties is guided by a rule known colloquially as the Dumb-Down Principle. According to this rule, a client should be able to ignore any qualifier and use the value as if it were unqualified. While this may result in some loss of specificity, the remaining element value (minus the qualifier) must continue to be generally correct and useful for discovery. Qualification is therefore supposed only to refine, not extend the semantic scope of a property.
- Appropriate values – Best practice for a particular element or qualifier may vary by context, but in general an implementor cannot assume that the interpreter of the metadata will always be a machine. This may impose certain constraints on how metadata is constructed, but the requirement of usefulness for discovery should be kept in mind.
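The Dumb-Down Principle described above lends itself to a short illustration. The sketch below assumes a hypothetical `Statement` tuple that carries an optional element refinement and an optional encoding scheme; `dumb_down` discards both and keeps the element and its value. The sample values are illustrative.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    element: str
    value: str
    refinement: Optional[str] = None   # e.g. "created" refining "date"
    scheme: Optional[str] = None       # e.g. the name of an encoding scheme


def dumb_down(statements):
    """Drop every qualifier, keeping only the element and its value."""
    return [Statement(s.element, s.value) for s in statements]


qualified = [
    Statement("date", "2001-01-20", refinement="created", scheme="W3CDTF"),
    Statement("title", "Growing Roses", refinement="alternative"),
]
simple = dumb_down(qualified)
```

After dumbing down, the first statement still says something correct about the resource (it has a date of 2001-01-20), only less specifically. That is the test a proposed qualifier has to pass: if ignoring it would leave a misleading value, it extends rather than refines the element.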
Although the Dublin Core was originally developed with an eye to describing document-like objects (because traditional text resources are fairly well understood), DC metadata can be applied to other resources as well. Its suitability for use with particular non-document resources will depend to some extent on how closely their metadata resembles typical document metadata and also what purpose the metadata is intended to serve. (Implementors interested in using Dublin Core for diverse resources are encouraged to browse the Dublin Core Projects pages for ideas on using Dublin Core metadata for their resources.)
Dublin Core has as its goals:
- Simplicity of creation and maintenance – The Dublin Core element set has been kept as small and simple as possible to allow a non-specialist to create simple descriptive records for information resources easily and inexpensively, while providing for effective retrieval of those resources in the networked environment.
- Commonly understood semantics – Discovery of information across the vast commons of the Internet is hindered by differences in terminology and descriptive practices from one field of knowledge to the next. The Dublin Core can help the “digital tourist”—a non-specialist searcher—find his or her way by supporting a common set of elements, the semantics of which are universally understood and supported. For example, scientists concerned with locating articles by a particular author, and art scholars interested in works by a particular artist, can agree on the importance of a “creator” element. Such convergence on a common, if slightly more generic, element set increases the visibility and accessibility of all resources, both within a given discipline and beyond.
- International scope – The Dublin Core Element Set was originally developed in English, but versions are being created in many other languages, including Finnish, Norwegian, Thai, Japanese, French, Portuguese, German, Greek, Indonesian, and Spanish. The DCMI Localization and Internationalization Special Interest Group is coordinating efforts to link these versions in a distributed registry. Although the technical challenges of internationalization on the World Wide Web have not been directly addressed by the Dublin Core development community, the involvement of representatives from virtually every continent has ensured that the development of the standard considers the multilingual and multicultural nature of the electronic information universe.
- Extensibility – While balancing the need for simplicity in describing digital resources with the need for precise retrieval, Dublin Core developers have recognized the importance of providing a mechanism for extending the DC element set for additional resource discovery needs. It is expected that other communities of metadata experts will create and administer additional metadata sets, specialized to the needs of their communities. Metadata elements from these sets could be used in conjunction with Dublin Core metadata to meet the need for interoperability. The DCMI Usage Board is presently working on a model for accomplishing this in the context of “application profiles.”
Rachel Heery and Manjula Patel, in their article “Application profiles: mixing and matching metadata schemas” define an application profile as:
“ … schemas which consist of data elements drawn from one or more namespaces, combined together by implementors, and optimised for a particular local application.”
This model allows different communities to use the DC elements for core descriptive information while adding domain-specific extensions that make sense within a more limited arena.
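One way to picture an application profile is as a list of the terms, drawn from each namespace, that a local application agrees to use. The sketch below is an assumption-laden illustration rather than the DCMI model: the `rose` namespace, its URI and its terms are invented for the example, and `check_against_profile` is a hypothetical helper.

```python
# Namespace URIs. The "dc" URI is the Dublin Core element set namespace;
# the "rose" namespace is invented for this example.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rose": "http://example.org/rose-terms/",   # hypothetical
}

# The profile lists which terms from which namespace a record may use.
ROSE_PROFILE = {
    "dc": {"title", "creator", "description", "date"},
    "rose": {"cultivar", "bloomColor"},          # hypothetical terms
}


def check_against_profile(record, profile):
    """Return the prefixed names in `record` that the profile does not allow."""
    problems = []
    for qname in record:
        prefix, _, term = qname.partition(":")
        if term not in profile.get(prefix, set()):
            problems.append(qname)
    return problems


record = {
    "dc:title": "A Guide to Growing Roses",
    "rose:cultivar": "Example cultivar",   # illustrative value
}
unexpected = check_against_profile(record, ROSE_PROFILE)
```

A client that knows nothing about the `rose` namespace can still read the `dc:` statements, which is the interoperability benefit the profile approach is meant to preserve.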
The Purpose and Scope of This Guide
This document is intended to be an entry point for users of Dublin Core. For non-specialists, it will assist in creating simple descriptive records for information resources (for example, electronic documents, JPEG images, video clips). Specialists may find the document a useful point of reference to the documentation of Dublin Core, as it changes and grows.
This article illustrates, in a non-technical fashion, how Dublin Core metadata may be used by anyone to make their material more accessible. It discusses the principles, structure and content of Dublin Core metadata elements, how to use them in composing a complete Dublin Core metadata record, as well as how to qualify elements to support use by a wide variety of communities.
Another important goal of this document is to promote “best practices” for describing resources using the Dublin Core element set. The Dublin Core community recognizes that consistency in creating metadata is an important key to achieving optimal retrieval and intelligible display across disparate sources of descriptive records. Inconsistent metadata effectively hides desired records, resulting in uneven, unpredictable or incomplete search results.
As a general introduction, this document is necessarily brief, and cannot address all the issues implementors may encounter while planning their use of metadata. Several avenues remain for those who have additional questions beyond those addressed in this guide.
- Appended to this guide are references to relevant articles and other resources, including those with more technical guidance for implementors
- The Dublin Core Website contains references to additional documents and resources of the DCMI community and ways for implementors to become involved in the DCMI
- Specific questions can be addressed to AskDCMI. In addition to fielding questions, the AskDCMI service maintains a searchable archive of already answered questions and links to additional resources
In this guide, we have chosen to represent Dublin Core examples in a “generic” form (Element="value"). Examples in other syntaxes, including HTML (the Web’s Hypertext Markup Language format), RDF/XML (the Resource Description Framework using eXtensible Markup Language), and plain XML, can be found in syntax-specific documents available on the DCMI Website. Some are also referenced within this document and in the Bibliography section of this guide.
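For readers who keep their metadata as simple pairs, the generic form is easy to produce. The function below is a hypothetical helper, assuming lower-case element names and values that contain no double quotes; it is a sketch of the notation, not a defined serialization.

```python
def to_generic(pairs):
    """Render (element, value) pairs one per line as Element="value"."""
    return "\n".join(
        f'{element.capitalize()}="{value}"' for element, value in pairs
    )


generic_text = to_generic([
    ("title", "A Guide to Growing Roses"),
    ("format", "audio"),   # illustrative value
])
```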
HTML provides an easily understood format for demonstrating Dublin Core’s underlying concepts, but more complex applications using qualification may find that using XML or RDF makes more sense. When considering an appropriate syntax, it is important to note that Dublin Core concepts are equally applicable to virtually any file format, as long as the metadata is in a form suitable for interpretation both by search engines and by human beings.
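As an illustration of the HTML approach, the sketch below renders simple Dublin Core pairs as `meta` tags using the widely seen `DC.` name prefix together with a `link` element that declares the prefix. The helper name is hypothetical, and the DCMI documents on HTML encoding remain the authority on the exact conventions.

```python
from html import escape

DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def to_html_meta(pairs):
    """Render (element, value) pairs as HTML <meta> tags for the document head."""
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}">']
    for element, value in pairs:
        # escape() protects attribute values containing &, <, > or quotes.
        lines.append(
            f'<meta name="DC.{escape(element)}" content="{escape(value)}">'
        )
    return "\n".join(lines)


head_fragment = to_html_meta([
    ("title", "A Guide to Growing Roses"),
    ("description", 'Roses & "thorns"'),   # illustrative value
])
```

The resulting lines are intended to be pasted into the `head` section of the page they describe, which is the embedded form of linkage discussed earlier.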
HTML can be used to express either simple or qualified Dublin Core, although there are limitations inherent in representing refinements in HTML. Specific instructions for expressing Dublin Core in HTML can be found in the following two DCMI documents:
RDF (Resource Description Framework) allows multiple metadata schemes to be read by humans as well as parsed by machines. It uses XML (eXtensible Markup Language) to express structure, thereby allowing metadata communities to define the actual semantics. This decentralized approach recognizes that no one scheme is appropriate for all situations, and further that schemes need a linking mechanism independent of a central authority to aid description, identification, understanding, usability, and/or exchange.
RDF allows multiple objects to be described without specifying the detail required. The underlying glue, XML, simply requires that all namespaces be defined and once defined, they can be used to the extent needed by the provider of the metadata.
Title="A Guide to Growing Roses"
Description="Describes process for planting and nurturing different kinds of rose bushes."
This simple example uses Dublin Core by itself to describe an audio recording of a guide to growing rose bushes. With XML or RDF/XML, Dublin Core can now be mixed with other metadata vocabularies. For example, the simple Dublin Core description above might be used alongside other vocabularies such as vCard that can describe the author’s affiliation and contact information, or a more specialized “rose description” vocabulary that described the rose bushes in greater detail.
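The rose-guide description can also be serialized as RDF/XML with nothing more than a standard XML library. The sketch below uses Python's `xml.etree.ElementTree`; the resource URI is a placeholder, the `describe` helper is hypothetical, and the output is a plain `rdf:Description` without any of the richer RDF constructs.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to use the conventional prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def describe(resource_uri, statements):
    """Build an rdf:RDF document with one rdf:Description for the resource."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": resource_uri},
    )
    for element, value in statements:
        ET.SubElement(desc, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = describe(
    "http://example.org/audio/rose-guide",   # placeholder identifier
    [
        ("title", "A Guide to Growing Roses"),
        ("description",
         "Describes process for planting and nurturing "
         "different kinds of rose bushes."),
    ],
)
```

Mixing in another vocabulary would follow the same pattern: declare its namespace and add elements from it inside the same `rdf:Description`.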
DCMI has made available several recommendations specifically about using these syntaxes:
- Guidelines for Implementing Dublin Core in XML
- Expressing Simple Dublin Core in RDF/XML
- Expressing Qualified Dublin Core in RDF/XML (Proposed Recommendation)
Metadata Storage and Maintenance Issues
Some implementations using Dublin Core have chosen to embed their metadata within the resource itself. This approach is taken most often with documents encoded using HTML, but is also sometimes possible with other kinds of documents. Simple tools have been developed to make provision of Dublin Core metadata within HTML encoded pages fairly easy. One such tool, DC.dot, extracts metadata information from an HTML document, and formats it so that it can be edited, then cut and pasted back into the HTML header of the original document.
On the other hand, metadata can be stored in any kind of database, and provide a link to the described resource rather than be embedded within it. This approach is likely to be most practical for many non-textual resources, and is increasingly used for text as well, primarily to support easier maintenance and sharing of metadata.
Each of these approaches has its advantages and disadvantages, and the balance point changes as implementations become larger and more diverse, and also as the metadata ages.
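The database approach can be sketched with a single table in which each row holds one element value together with a link to the resource it describes. The table layout, function names and URLs below are assumptions made for illustration, using Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dc_statement ("
    " resource_url TEXT NOT NULL,"   # link to the described resource
    " element TEXT NOT NULL,"
    " value TEXT NOT NULL)"
)


def add_statement(resource_url, element, value):
    """Store one element value; repeated elements are simply more rows."""
    conn.execute(
        "INSERT INTO dc_statement VALUES (?, ?, ?)",
        (resource_url, element, value),
    )


def find_resources(element, value):
    """Return the URLs of resources whose element matches the value exactly."""
    rows = conn.execute(
        "SELECT DISTINCT resource_url FROM dc_statement "
        "WHERE element = ? AND value = ? ORDER BY resource_url",
        (element, value),
    )
    return [row[0] for row in rows]


# Placeholder URLs and illustrative values.
add_statement("http://example.org/images/rose-01.jpg", "creator", "A. Photographer")
add_statement("http://example.org/images/rose-01.jpg", "format", "image/jpeg")
add_statement("http://example.org/images/rose-02.jpg", "creator", "A. Photographer")
```

Keeping the metadata in one place like this is what makes bulk correction and sharing easier than editing each resource individually.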
Element Content and Controlled Vocabularies
Each Dublin Core element is optional and repeatable, and there is no defined order of elements. The ordering of multiple occurrences of the same element (e.g., Creator) may have a significance intended by the provider, but ordering is not guaranteed to be preserved in every user environment. Ordering or sequencing may be syntax dependent: for instance, RDF/XML supports ordering, but HTML does not.
Content data for some elements may be selected from a “controlled vocabulary,” which is a limited set of consistently used and carefully defined terms. This can dramatically improve search results because computers are good at matching words character by character but weak at understanding the way people refer to one concept using different words, that is, synonyms. Without basic terminology control, inconsistent or incorrect metadata can profoundly degrade the quality of search results. For example, without a controlled vocabulary, “candy” and “sweet” might be used to refer to the same concept. Controlled vocabularies may also reduce the likelihood of spelling errors when recording metadata.
One cost of a controlled vocabulary is the necessity for an administrative body to review, update and disseminate the vocabulary. For example, the US Library of Congress Subject Headings (LCSH) and the US National Library of Medicine Medical Subject Headings (MeSH) are formal vocabularies, indispensable for searching rigorously cataloged collections. However, both require significant support organizations. Another cost is having to train searchers and creators of metadata so that they know when using MeSH, for example, to enter “myocardial infarction” instead of the more colloquial “heart attack.” More sophisticated implementations can make such tasks much easier for users, but the controlled vocabulary terms must be present for them to be able to accomplish that task.
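The kind of assistance a more sophisticated implementation might offer can be sketched as a lookup from the words people type to the preferred term. The two-entry table below is illustrative only and is not taken from MeSH itself; a real vocabulary is far larger and is maintained by its administrative body.

```python
# Entry terms mapped to a preferred term. Illustrative only.
PREFERRED_TERM = {
    "myocardial infarction": "myocardial infarction",
    "heart attack": "myocardial infarction",
}


def normalize_subject(term):
    """Return the preferred term, or None if the term is not in the vocabulary."""
    # Lower-case and collapse whitespace so trivial variations still match.
    key = " ".join(term.lower().split())
    return PREFERRED_TERM.get(key)


suggested = normalize_subject("Heart  Attack")
```

Returning `None` for unknown terms, rather than passing them through, lets a data-entry tool prompt the metadata creator instead of silently recording an uncontrolled value.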
Using controlled vocabularies can be done most effectively using qualifiers. Without an encoding scheme specifically designated, a subject which might very well be carefully selected from a controlled vocabulary cannot be distinguished from a simple keyword.
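To see why the encoding scheme matters, consider subjects stored with and without one. In the sketch below, the `Subject` tuple and the filter function are hypothetical, and the sample values are illustrative; only the subject that names its scheme can be treated as a controlled term.

```python
from typing import NamedTuple, Optional


class Subject(NamedTuple):
    value: str
    scheme: Optional[str] = None   # e.g. "LCSH" or "MeSH"; None means plain keyword


def controlled_subjects(subjects, scheme):
    """Return only the subject values explicitly drawn from the given scheme."""
    return [s.value for s in subjects if s.scheme == scheme]


subjects = [
    Subject("myocardial infarction", scheme="MeSH"),
    Subject("heart attack"),                 # uncontrolled keyword
    Subject("myocardial infarction"),        # same words, but scheme unknown
]
mesh_terms = controlled_subjects(subjects, "MeSH")
```

The third subject may well have been chosen from MeSH, but without the qualifier a search system has no way to know that.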
Using Dublin Core is a DCMI Recommended Resource.
About the author
Diane Hillmann is currently a co-PI for the Core Infrastructure portion of the National Science Digital Library. She has worked for the Cornell University Library since 1977, as cataloger and technical services manager as well as managing authorities and maintenance processes for the library’s database. She was a liaison to and member of MARBI from the late 1980s to 2000, specializing in the Holdings and Authorities formats. Diane was an early participant in the Dublin Core Metadata Initiative, and is currently editor of “Using Dublin Core” as well as a member of the DCMI Usage and Advisory Boards.