This is one of the 52 terms in The Language of Localization, published by XML Press in 2017. The contributor for this term is Stephanie Piehl.

What is it?

The analysis of a given text or corpus, with the goal of identifying relevant term candidates within their context. Also called term mining or term harvesting.

Why is it important?

Term extraction is the starting point of all terminology management tasks. It is usually followed by the elimination of inconsistencies. Well-managed terminology improves quality, reduces costs, and shortens time to market.

Why does a business professional need to know this?

When you extract terms, you are not only working on terminology; you are also managing organization-specific or industry-specific knowledge. Terminology promotes knowledge sharing between people working in the same business field.

If you aim to improve the quality and consistency of your publications, term extraction is probably the best approach. When you start term extraction, you might find various synonyms and spelling variants for the same concept. For example, you might discover the terms electronic catalog, E-catalog, and eCatalog used as synonyms. Once you have identified synonyms and variants, you can determine which version of these terms should be used in all publications across all functional areas.
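
As a rough illustration of how spelling variants can be surfaced, the following Python sketch groups term candidates that become identical once case, spaces, and hyphens are ignored. The function names normalize and group_variants and the sample list are hypothetical, chosen only for this example. Note that a true synonym pair such as electronic catalog and eCatalog does not collapse to the same key, so synonyms still need human review.

```python
import re
from collections import defaultdict


def normalize(term: str) -> str:
    """Lowercase the term and drop spaces, hyphens, and underscores."""
    return re.sub(r"[\s\-_]+", "", term.lower())


def group_variants(candidates):
    """Return normalized key -> sorted surface forms, for keys with 2+ forms."""
    groups = defaultdict(set)
    for term in candidates:
        groups[normalize(term)].add(term)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Hypothetical candidate list, as it might come out of an extraction run.
candidates = [
    "electronic catalog",
    "E-catalog",
    "eCatalog",
    "e-catalog",
    "Electronic Catalog",
]

variants = group_variants(candidates)
for key, forms in variants.items():
    print(key, "->", forms)
```

Each group printed here is a set of spellings from which one preferred form can be chosen; deciding that the two groups refer to the same concept remains a terminologist's call.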

To start a term extraction task, compile a corpus from which you can extract term candidates. These term candidates are then validated and automatically or semi-automatically recorded. Usually, term extraction is either monolingual, to extract term candidates in a single language, or bilingual, to identify term candidates together with their equivalents in the target language.

Several tools can help you automate term extraction. Each tool has strengths and weaknesses, so there is no one-size-fits-all solution. Before you decide on a term extraction tool, test and evaluate several of them.

In general, these tools use one of three main approaches:

  • Linguistic: the tool searches the corpus for word combinations that match a certain morphological or syntactical pattern, for example adjective+noun.
  • Statistical: the tool identifies repeated sequences of lexical items.
  • Hybrid: the tool combines the previous two approaches. This is the most frequently used approach.
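
The three approaches can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular tool works: TOY_POS is a hand-made part-of-speech lookup standing in for a real tagger, the corpus is invented, candidates are limited to two-word sequences, and sentence boundaries are ignored for brevity.

```python
import re
from collections import Counter

# Toy part-of-speech lexicon (an assumption for illustration only;
# a real tool would use a proper tagger).
TOY_POS = {
    "electronic": "ADJ", "new": "ADJ", "printed": "ADJ",
    "catalog": "NOUN", "order": "NOUN", "customer": "NOUN",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def bigrams(tokens):
    """Return adjacent word pairs (sentence boundaries are ignored here)."""
    return list(zip(tokens, tokens[1:]))


def statistical_candidates(tokens, min_count=2):
    """Statistical: keep word pairs that occur at least min_count times."""
    counts = Counter(bigrams(tokens))
    return {pair: n for pair, n in counts.items() if n >= min_count}


def linguistic_candidates(tokens):
    """Linguistic: keep word pairs matching the adjective+noun pattern."""
    return {
        (first, second)
        for first, second in bigrams(tokens)
        if TOY_POS.get(first) == "ADJ" and TOY_POS.get(second) == "NOUN"
    }


def hybrid_candidates(tokens, min_count=2):
    """Hybrid: keep pairs that both match the pattern and repeat."""
    matching = linguistic_candidates(tokens)
    repeated = statistical_candidates(tokens, min_count)
    return {pair: n for pair, n in repeated.items() if pair in matching}


# Invented sample corpus.
corpus = (
    "The electronic catalog lists every product. "
    "Open the electronic catalog to place a new order. "
    "The printed catalog is no longer updated."
)
tokens = tokenize(corpus)

print("statistical:", statistical_candidates(tokens))
print("linguistic: ", linguistic_candidates(tokens))
print("hybrid:     ", hybrid_candidates(tokens))
```

On this sample, the statistical pass also returns the repeated but uninteresting pair "the electronic", and the linguistic pass also returns one-off pairs such as "new order". The hybrid pass keeps only "electronic catalog", which hints at why combining the two filters tends to produce a shorter list for validation.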