Data Categories and Data Category Registries


Descriptive terms that signify linguistic concepts are used in the description of resources, e.g. linguistic tools or linguistically annotated data. To give a simple example: if a corpus is described by a set of metadata, one important piece of information to convey to the prospective user is the object language(s) that the corpus covers. This information can be expressed in different ways, e.g. by using the name of the object language(s) in the language in which the metadata are provided:

Example 1.1. Unformalized data category value

Language: German

or by using a language code, such as those provided by [ISO 639-3:2007]. The label of the descriptive category itself may also vary, e.g. “language”, “language name”, “Sprache”, “langue”:

Example 1.2. Formalized data category value

Language: deu

Sprache: deu

The point is that all of the above descriptions have the same meaning: the object language of this resource is German. Each statement has two parts: an attribute and a value assigned to that attribute. Both the attribute and its value, like many other attributes and values, are called data categories. To build an infrastructure of interoperable resources, it is necessary to keep track of all these categories in a data category registry and to refer to them explicitly by unique category identifiers. For data categories that are already registered, such a reference suffices. For data categories that a data provider uses but that are not yet registered, the data provider is responsible for adding a new entry to the registry, with enough information for other users to understand the meaning of the category.

In the example above, the fact that “Sprache” and “Language” mean the same thing in the given context can be expressed by having both labels refer to a data category with the same identifier. For the value of this category, it is always preferable to resort to an existing standard, in this case [ISO 639-3:2007].
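The idea of resolving different labels to one shared identifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifier `DC-0001` and the names `LABEL_TO_CATEGORY`, `NAME_TO_ISO_639_3` and `normalize` are hypothetical and do not correspond to actual registry entries or to any registry interface. Only the code `deu` for German is taken from [ISO 639-3:2007].

```python
# Hypothetical identifier for the "object language" data category.
# A real registry would assign its own unique identifier.
OBJECT_LANGUAGE = "DC-0001"

# Different labels used by metadata providers, all referring to the same category.
LABEL_TO_CATEGORY = {
    "language": OBJECT_LANGUAGE,
    "language name": OBJECT_LANGUAGE,
    "sprache": OBJECT_LANGUAGE,
    "langue": OBJECT_LANGUAGE,
}

# Tiny illustrative excerpt of a name-to-code table ("deu" is ISO 639-3 for German).
NAME_TO_ISO_639_3 = {
    "german": "deu",
    "deutsch": "deu",
}


def normalize(line: str) -> tuple[str, str]:
    """Turn a 'Label: value' line into (category identifier, ISO 639-3 code)."""
    label, value = (part.strip() for part in line.split(":", 1))
    category = LABEL_TO_CATEGORY[label.lower()]
    # Values missing from the name table are assumed to be codes already.
    code = NAME_TO_ISO_639_3.get(value.lower(), value.lower())
    return category, code


statements = ["Language: German", "Language: deu", "Sprache: deu"]
normalized = {normalize(s) for s in statements}
print(normalized)  # {('DC-0001', 'deu')}
```

All three statements collapse to the same pair of identifier and code, which is what makes the descriptions comparable across resources regardless of the label or the language of the metadata.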

This method of referencing data categories should also be used for the labels used in the (linguistic) annotation of primary data, e.g. for linguistic categories that give information about the part of speech of a linguistic unit in the resource. Explicit reference to a data category and its description is helpful in this case as well. To give a simple example: depending on the linguistic framework that guides the part-of-speech annotation of a resource, the category noun might refer either to a concept that contains common nouns only or to a concept that contains both common nouns and proper nouns.
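The noun example can be made concrete with a minimal sketch. It assumes two hypothetical tagsets, `TAGSET_A` and `TAGSET_B`, that both use the label "noun" but link it to different data categories. The identifiers `DC-1001` and `DC-1002`, the class `DataCategory` and the function `same_concept` are invented for illustration and are not actual registry entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCategory:
    identifier: str  # hypothetical identifier, not a real registry entry
    definition: str


NOUN_COMMON_ONLY = DataCategory("DC-1001", "common nouns only")
NOUN_COMMON_AND_PROPER = DataCategory("DC-1002", "common nouns and proper nouns")

# Each (hypothetical) tagset declares which data category its label refers to.
TAGSET_A = {"noun": NOUN_COMMON_ONLY}
TAGSET_B = {"noun": NOUN_COMMON_AND_PROPER}


def same_concept(label_a: str, tagset_a: dict, label_b: str, tagset_b: dict) -> bool:
    """Compare two labels by the identifiers of their data categories, not by name."""
    return tagset_a[label_a].identifier == tagset_b[label_b].identifier


print(same_concept("noun", TAGSET_A, "noun", TAGSET_B))  # False
```

Although the label is identical in both tagsets, the comparison by identifier shows that the two annotations do not refer to the same concept, which a comparison by label alone would miss.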

Terminology management in the domain of language resources and linguistic annotations is covered by the ISO standard [ISO 12620:2009]. See also the section on DCR on the CLARIN standards guidance web page for information on the ISO document and its relations to other standards. The standard is embodied by the terminology management platform ISOcat. It is CLARIN-D policy to follow [ISO 12620:2009] and to organize terminology management through this platform.

