Chapter 1. Concepts and data categories

A distributed infrastructure such as the one provided by CLARIN-D, which aims to cover heterogeneous language resources and tools, has to deal with concepts that are similar but not identical when used by different stakeholders. The ranges that these concepts cover overlap, but they are not co-extensive. To complicate matters, identical terms might refer to different concepts when used by different parties (polysemy), while different terms might refer to the same concept (synonymy). Neither is desirable, but neither can be avoided in such a broad domain.

Explicit description of key concepts, and of the terms which signify them, is therefore vital for the success of an endeavour such as building a comprehensive infrastructure of language resources and tools. It is a keystone for the interoperability of these resources and tools. It is also necessary to make explicit the relations between the concepts used (e.g. synonymy, generalization / specialization).
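To make the distinction between concepts, terms and relations more tangible, the following minimal Python sketch models a handful of concepts with the terms that signify them. It is an illustration only: the identifiers (ex:1 to ex:4), the definitions, the class Concept and the helper functions are hypothetical and do not reflect the data model of any actual registry.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A concept with an identifier, a definition and the terms that signify it."""
    concept_id: str                                   # hypothetical identifier
    definition: str
    terms: set[str] = field(default_factory=set)      # terms used by some community
    broader: set[str] = field(default_factory=set)    # ids of more general concepts


def polysemous_terms(concepts: list[Concept]) -> dict[str, set[str]]:
    """Return every term that signifies more than one concept."""
    by_term: dict[str, set[str]] = defaultdict(set)
    for concept in concepts:
        for term in concept.terms:
            by_term[term].add(concept.concept_id)
    return {term: ids for term, ids in by_term.items() if len(ids) > 1}


def synonyms(concept: Concept) -> list[str]:
    """All terms attached to the same concept are synonyms of each other."""
    return sorted(concept.terms)


# Illustrative data, invented for this sketch.
concepts = [
    Concept("ex:1", "word class denoting entities", {"noun", "substantive"}),
    Concept("ex:2", "noun naming a specific entity", {"proper noun"}, broader={"ex:1"}),
    Concept("ex:3", "inflected form occurring in a text", {"word form", "word"}),
    Concept("ex:4", "abstract lexical unit", {"lexeme", "word"}),
]

print(polysemous_terms(concepts))   # the term "word" points to two concepts
print(synonyms(concepts[0]))        # two terms for one concept
```

In this sketch, synonymy is represented by several terms attached to one concept, polysemy is detected as one term attached to several concepts, and generalization / specialization is recorded through the broader field. Keeping the identifier separate from the terms is what allows different communities to keep their own vocabulary while still referring to the same concept.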

A preliminary step towards this goal is to collect the concepts and notions which are used by the stakeholders. A further step is to provide definitions for these concepts and to link them to the terms which are commonly used in the communities. It will also be necessary to keep track of concepts which emerge in the course of the further development of the infrastructure.

Terminology management is therefore a task which will never be finished. As an ongoing task involving many stakeholders in the field, it calls for tools which facilitate the description of categories and concepts and their transparent use in describing primary data (through metadata) as well as linguistic annotations and their descriptions.

In this chapter we introduce the Data Category Registry (DCR) as a means of keeping track of existing and new concepts, and discuss ISOcat, the Data Category Registry used within CLARIN-D, in detail and mainly from a (technical) user's perspective.