ISOcat, a Data Category Registry

ISOcat is the reference implementation of ISO standard 12620 mentioned above. Its mission is defined as follows:

ISO 12620 provides a framework for defining data categories compliant with the ISO/IEC 11179 family of standards. According to this model, each data category is assigned a unique administrative identifier, together with information on the status or decision-making process associated with the data category. In addition, data category specifications in the DCR contain linguistic descriptions, such as data category definitions, statements of associated value domains, and examples. Data category specifications can be associated with a variety of data element names and with language-specific versions of definitions, names, value domains and other attributes.

The use of ISOcat is endorsed by CLARIN-D. It is a core facility for terminology management in the CLARIN-D infrastructure, and the same role is planned for its upcoming spin-offs RELcat and SCHEMAcat. In the following we describe some features of ISOcat. The presentation is largely based on a tutorial by Menzo Windhouwer (MPI, Nijmegen) and Ineke Schuurman (KU Leuven and University of Utrecht). While the first sections describe some details from the technical user's perspective, the last section sums up the ways in which ISOcat can be used. For more information, help pages and tutorial material see the ISOcat manual and

In the context of ISOcat a data category (DC for short) is an elementary descriptor in a linguistic structure or an annotation scheme. Its specification comprises three parts: an administrative part for administration and identification of the DC, a descriptive part for documentation, which can be written in various working languages, and a linguistic part describing the conceptual domain(s). The UML class diagram of the data model for the specification can be found on the ISOcat website.

Before giving an example of a DC specification, we first look at the different types of data categories, which differ in their conceptual domains. There are three main types of DCs: simple, complex and container DCs.

Simple DCs are atomic and can neither contain other DCs nor be assigned a value. For example, to represent neuter, masculine and feminine (as possible values for grammatical gender) in ISOcat, for each of them a simple DC is needed.

Complex DCs can be assigned a value and come in three variants: open, closed and constrained.

The values assigned to complex open DCs are arbitrary within a chosen data type, e.g., the complex open DC lemma can be assigned values of type string.

The values assigned to complex closed DCs are part of a closed vocabulary consisting of simple DCs, e.g., grammatical gender would be a complex closed DC if the simple DCs feminine, masculine and neuter are in its conceptual domain.

Values assigned to constrained DCs need to fulfill certain constraints, e.g., the complex constrained DC email might restrict its values to strings containing an @ character.
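The three variants of complex DCs can be pictured as three different value checks. The following is a minimal sketch under stated assumptions, not part of ISOcat: the class names OpenDC, ClosedDC and ConstrainedDC and the method accepts are hypothetical, and only the example categories (lemma, grammatical gender, email) are taken from the text above.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class OpenDC:
    """Complex open DC: any value of the base data type is allowed."""
    identifier: str
    base_type: type

    def accepts(self, value: object) -> bool:
        return isinstance(value, self.base_type)


@dataclass(frozen=True)
class ClosedDC:
    """Complex closed DC: the value must be one of the listed simple DCs."""
    identifier: str
    conceptual_domain: FrozenSet[str]

    def accepts(self, value: str) -> bool:
        return value in self.conceptual_domain


@dataclass(frozen=True)
class ConstrainedDC:
    """Complex constrained DC: base data type plus an extra constraint."""
    identifier: str
    base_type: type
    constraint: Callable[[object], bool]

    def accepts(self, value: object) -> bool:
        return isinstance(value, self.base_type) and self.constraint(value)


# Example categories from the text; the Python names are illustrative only.
lemma = OpenDC("lemma", str)
grammatical_gender = ClosedDC(
    "grammaticalGender", frozenset({"feminine", "masculine", "neuter"})
)
email = ConstrainedDC("email", str, lambda v: "@" in v)
```

With these definitions, lemma.accepts("house") holds for any string, grammatical_gender.accepts("dual") fails because dual is not in the closed vocabulary, and email.accepts("no-at-sign") fails the constraint.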

Figure 1.1, “DC Types” provides an overview of the data category types.

Container DCs would in principle include other DCs (simple, complex and container). For example, a container category lexicon might include another container lemma which again includes the complex open DC writtenForm and the container lexicon might also include the complex closed DC language. See Figure 1.2, “Examples for DCs of several types” where container DCs are orange, complex closed DCs are green, complex open DCs are yellow and simple DCs are white.

There is one restriction: recursive structures are not stored explicitly in ISOcat. Complex DCs therefore only take simple DCs or basic data types (e.g. string) as values (value-domain relations up to depth one), and relations between container categories are not stored explicitly in ISOcat either.
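Because the nesting of container DCs is not recorded in ISOcat, it has to be kept in the resource or its schema. As a rough sketch, assuming a plain nested dictionary as the (hypothetical) representation, the lexicon example from above could be held and traversed like this; the names structure and leaf_paths are illustrative and not part of any ISOcat interface.

```python
from typing import Dict, Iterator, Tuple, Union

# A container maps names to either a nested container (dict)
# or a label naming the type of a complex DC (str).
Tree = Dict[str, Union[str, "Tree"]]

# The lexicon example: lexicon contains the container lemma and the
# complex closed DC language; lemma contains the complex open DC writtenForm.
structure: Tree = {
    "lexicon": {
        "lemma": {"writtenForm": "complex open"},
        "language": "complex closed",
    }
}


def leaf_paths(tree: Tree, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (path, type label) for every non-container DC in the tree."""
    for name, child in tree.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            # A container: descend into it.
            yield from leaf_paths(child, path)
        else:
            yield path, child
```

Calling list(leaf_paths(structure)) lists each complex DC together with the chain of containers it sits in, which is exactly the information ISOcat itself does not keep.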

In the administrative part a DC is identified. It gets an identifier and a justification. It is also assigned one of the types mentioned above.

The identifier is a name in camel case, i.e. all of its words are written as one joint character sequence without spaces, with each new word starting with a capital letter (as in camelCase). It has to start with an alphabetical character (firstPerson rather than 1stPerson), be meaningful (no abbreviations) and be written in English. So, for example, in describing the DC for grammatical gender one could choose the identifier grammaticalGender. The identifier is not unique, however, as a DC can be registered under the same name in more than one thematic domain.

The DC is uniquely identified by a persistent identifier (PID), which is assigned automatically. This unique reference is guaranteed to be resolvable for decades to come. Usually the PIDs of data categories in ISOcat look like the following URI, with a varying number at the end for different categories:
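The formal part of these naming rules can be checked mechanically. The sketch below is an assumption-laden illustration: the function names is_valid_identifier and to_camel_case are hypothetical, and the check only covers what is machine-testable (alphabetical first character, no spaces or punctuation). Whether a name is meaningful and in English still has to be judged by the author.

```python
import re

# Starts with a letter, followed by letters or digits only
# (so no spaces, hyphens or underscores).
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")


def is_valid_identifier(name: str) -> bool:
    """Check the formal identifier rules described in the text."""
    return bool(_IDENTIFIER.match(name))


def to_camel_case(phrase: str) -> str:
    """Join the words of a phrase into one camel-case sequence."""
    words = phrase.split()
    if not words:
        return ""
    first, rest = words[0].lower(), words[1:]
    return first + "".join(word.capitalize() for word in rest)
```

For instance, to_camel_case("grammatical gender") yields grammaticalGender, while is_valid_identifier rejects both 1stPerson and a name containing a space.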

The justification explains why this DC is needed or where it is used. Its origin, e.g. the name of the tag set or the piece of literature where this DC was described, can also be given.

In the descriptive part, the DC can be documented in various working languages; it is embedded into one or more profiles, and it is assigned a data element name.

The definitions in the various languages should be more or less translations of each other, they should be understandable, and they should not rely on other specific terminological items unless those are defined elsewhere by another DC. In the latter case, one has to make sure that those related definitions cover exactly the term that should be used in the new definition and that the new definition references them explicitly. At least an English language section has to be present. Each language section also includes the correct full name(s) related to the DC in the respective language. On top of that, each DC should be defined so as to be as reusable as possible while still being correct for the purpose of the author.

For example, in a particular tagset a personal pronoun might be a pronoun referring to persons, but in general it often also refers to other entities ("The cat has five kittens. She …", "The table was very expensive but I like it very much."). Therefore, a more general definition would help make the DC more easily reusable. This effect is even stronger when the definition is kept as neutral as possible, i.e., without reference to a specific language or project. Definitions like "In English a personal pronoun …" or "In STTS a personal pronoun …" restrict the usability of the DC. Information about the origin of a DC can be stated in the administrative part. Nevertheless, the topmost constraint for the definition of the DC is that it be valid for the purpose of the author first, and then for as many other users as possible.

The DC is embedded into one or more profiles. Profiles are managed by so-called thematic domain groups (TDG for short), i.e., formally established groups of users who assume a certain responsibility for respective domains such as Metadata, Morphosyntax, Terminology, etc. TDGs can be proposed via the subcommittees of ISO/TC 37 (the technical committee on Terminology and other language and content resources).

The data element name is the place to include abbreviations or tags used for this DC. These names do not have to be in English or in any other language; they are language independent. Returning to the example of grammaticalGender: GEN or gramGender could be data element names taken from an application-specific tagset or a domain-specific XML schema.

In the linguistic part, complex DCs are assigned their conceptual domain(s).

For a complex open DC the base data type is specified. For a complex constrained DC the base data type is selected and the constraints are expressed in a constraint language, e.g. as an XML Schema regular expression or in Object Constraint Language. For complex closed DCs the simple DCs that are possible values are selected. To give an example of the conceptual domain of a complex closed DC, we go back to the DC grammaticalGender. Here another important feature becomes evident: multiple conceptual domains can be assigned, one for each profile and one for each object language. For example, with respect to the profile for morphosyntax its conceptual domain can include a range of DCs, e.g. feminine, masculine, neuter and commonGender. The language-specific French conceptual domain includes feminine and masculine.

The data category registry is useful for making (parts of) the semantics of resources explicit, and so shows where different resources share the same semantics.

Therefore ISOcat can (and should) be used in different situations and for the following purposes:
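The idea of several conceptual domains for one complex closed DC can be sketched as a lookup keyed by profile and by object language. This is only an illustration: the dictionaries, the function conceptual_domain, the language code "fr" and the rule that a language-specific domain takes precedence over the profile domain are assumptions made for the example, not behaviour documented for ISOcat.

```python
from typing import Dict, FrozenSet, Optional

# Conceptual domains of grammaticalGender, following the example in the text.
PROFILE_DOMAINS: Dict[str, FrozenSet[str]] = {
    "Morphosyntax": frozenset({"feminine", "masculine", "neuter", "commonGender"}),
}
LANGUAGE_DOMAINS: Dict[str, FrozenSet[str]] = {
    "fr": frozenset({"feminine", "masculine"}),  # French; the code "fr" is assumed
}


def conceptual_domain(profile: str, language: Optional[str] = None) -> FrozenSet[str]:
    """Return the language-specific domain if one exists, else the profile domain.

    The precedence rule is an assumption of this sketch.
    """
    if language is not None and language in LANGUAGE_DOMAINS:
        return LANGUAGE_DOMAINS[language]
    return PROFILE_DOMAINS[profile]
```

Under these assumptions, conceptual_domain("Morphosyntax", "fr") contains only feminine and masculine, while a language without its own section falls back to the four values of the morphosyntax profile.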

Looking up a DC:

The web interface provides a search engine where existing DCs can be searched for by name, profile and description. Screencasts presenting how to search and inspect DCs in the ISOcat web interface can be found in the ISOcat manual.

A search for a specific concept usually returns many DCs, so the profile classification and the description sections of the resulting DCs provide a first indication of which category to choose. Sometimes many similar DCs appear, as well as DCs containing vague or even ill-formed definitions. For this reason, the DCs in the search result carry markings of different shape and colour, which state the correctness of the specification from a technical point of view. As only DCs with a (green) check mark are in principle qualified for standardization, those are the relevant candidates to refer to.

As standardization of DCs by the ISO is a complex and therefore slow procedure, one often has to refer to non-standardized or not-yet-standardized DCs. As those may still change at any time, one has to check the referenced DCs regularly for consistency with the original purpose and, if necessary, look for new ones. A CLARIN approach could also be to select data categories that can already be seen as de facto standardized and relevant for CLARIN purposes into a separate workspace in ISOcat.

Referencing an existing DC:

DCs relevant for a resource can be referenced by their PID and can be selected into a data category selection. References to the DCs can be embedded in the resource or in its schema. Collecting the references in a schema written in a specific schema language (Relax NG, DTD, OWL, EBNF, XSD ...) is preferred, because putting them in the resource itself usually means storing the references redundantly: a single DC from an annotation (e.g. noun) typically occurs many times in the resource. Example 1.4, “Part of a CMDI XSD specification” is a fragment of an XSD specification of a CMDI metadata profile where an element named Url is related to a respective DC in ISOcat:
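As a rough illustration of how references kept in a schema might be collected, the following sketch reads an XSD fragment and returns the DC reference attached to each element declaration. It is not the CMDI specification itself: the attribute name datcat, the namespace URI bound to the dcr prefix, the PID value and the element Note are all placeholders assumed for this example. Only the element name Url comes from the text.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DCR = "http://example.org/ns/dcr"  # placeholder namespace, assumed for this sketch

# Hypothetical schema fragment: one element carries a DC reference, one does not.
SCHEMA = f"""<xs:schema xmlns:xs="{XS}" xmlns:dcr="{DCR}">
  <xs:element name="Url" type="xs:anyURI"
              dcr:datcat="http://example.org/datcat/DC-0000"/>
  <xs:element name="Note" type="xs:string"/>
</xs:schema>"""


def dc_references(xsd_text: str) -> dict:
    """Map element names to the DC reference given in their declaration."""
    root = ET.fromstring(xsd_text)
    refs = {}
    for element in root.iter(f"{{{XS}}}element"):
        pid = element.get(f"{{{DCR}}}datcat")
        if pid is not None:
            refs[element.get("name")] = pid
    return refs
```

Because the reference is stated once in the schema, every occurrence of Url in a resource that follows the schema shares it, which is the redundancy argument made above.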

The XML format <tiger2/>, an XML serialization of [ISO 24615:2010] Language resource management -- Syntactic annotation framework (SynAF), allows annotation feature and value elements to refer to data categories of a DCR in the annotation declaration of its corpus header. The following example is cited from [Romary et al. 2011] and links the concepts of the part-of-speech feature and one of its possible values (personal pronoun) to respective DCs in ISOcat:

Creating and registering a DC:

If a DC which is needed is still missing in ISOcat, a new one should be created according to the above specification description. In the manual section of the ISOcat webpage there is a screencast which also gives an introduction to how to add a new data category.

Nevertheless, the specification scheme for a DC is complex, so the author has to edit it with care, provide meaningful definitions and examples and not confuse the different name categories (identifier, data element name, name sections for specific languages) or specification sections (e.g. language section in descriptive part vs. linguistic section(s) defining language-specific values for complex DCs).

Negotiating changes to a DC:

If you find a data category which fits your descriptive needs nearly but not exactly, you might consider, instead of creating a new DC from scratch, contacting the creator (or owner) of the existing DC via the contact information in her or his user profile in order to adapt the meaning and thereby the coverage of this DC. In other words, re-use of a DC with slight modifications is preferable to inventing a new one. You can also share your own DCs with other ISOcat users so that these DCs may become recommended for CLARIN purposes.

While referring to sets of DCs from different resources helps establish relations between concepts from the outside, relations between complex or container DCs are not stored explicitly in ISOcat. Therefore the upcoming ISOcat spin-off RELcat is being added as a relation registry that provides different relation types for the ISOcat DCs.

RELcat is a first prototype of a relation registry, see [Windhouwer 2012], where different relations between ISOcat data categories, and also between data categories from different data category registries can be stored.

Figure 1.3, “Examples for relations in RELcat” shows relations for the metadata categories from the example in the section called “Data Categories and Data Category Registries”. Making use of RELcat, both ISOcat data categories languageID and languageName could be related to the Dublin Core metadata element language.

The lowermost relation in Figure 1.3, “Examples for relations in RELcat” also exemplifies the relation type rel:subClassOf, which is used to indicate the relation between two non-equivalent data categories.

A core taxonomy of relationship types is provided for the relations in RELcat, see [Windhouwer 2012]. Other existing vocabularies can nevertheless be supported by adding their relation types at the proper place in the taxonomy, thereby allowing for the inclusion of existing linguistic knowledge bases. Moreover, generic queries can automatically take multiple relation types into account.

This is also an important feature for the data categories themselves, as data categories which are denoted as equivalent or almost equivalent by the relation type can be automatically included into a query without the user having to know the names or number of the data categories in the set of equivalent data categories.
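The way a generic query can cover several relation types, and thereby pull in equivalent data categories automatically, can be sketched with a small relation store. Everything here is a hypothetical stand-in for RELcat: the taxonomy, the relation types rel:related and rel:sameAs, the prefixes isocat: and dc:, and the assignment of relation types to the two example relations are assumptions. Only rel:subClassOf and the category names come from the text.

```python
from typing import List, Optional, Set, Tuple

# Assumed taxonomy of relation types: child type -> parent type.
TAXONOMY = {
    "rel:sameAs": "rel:related",
    "rel:subClassOf": "rel:related",
}

# Assumed relations (subject, relation type, object) for the metadata example.
RELATIONS: List[Tuple[str, str, str]] = [
    ("isocat:languageID", "rel:sameAs", "dc:language"),
    ("isocat:languageName", "rel:subClassOf", "dc:language"),
]


def is_subtype(rel_type: Optional[str], ancestor: str) -> bool:
    """True if rel_type equals ancestor or lies below it in the taxonomy."""
    while rel_type is not None:
        if rel_type == ancestor:
            return True
        rel_type = TAXONOMY.get(rel_type)
    return False


def related_to(target: str, rel_type: str) -> Set[str]:
    """Categories linked to target by rel_type or by any of its subtypes."""
    return {
        subject
        for subject, relation, obj in RELATIONS
        if obj == target and is_subtype(relation, rel_type)
    }
```

A query with the generic type, related_to("dc:language", "rel:related"), returns both ISOcat categories, whereas restricting it to rel:sameAs returns only the equivalent one. The user does not need to know the names or the number of categories involved.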

As RELcat aims to be as flexible as ISOcat, the relations stored in RELcat can reflect different views in parallel, e.g. those of single users as well as those of specific communities, see [Windhouwer 2012].