Existing metadata sets

Here we briefly introduce a few metadata standards and best practices that are relevant to linguistic resources and tools.

The Dublin Core Metadata Initiative (DCMI) started in the library world with the aim of defining a descriptor set that can be used to describe all kinds of web resources. A worldwide harmonization process resulted in a total of 15 categories. Because they have to describe very different types of objects, these 15 categories are semantically broad and use generic terminology. DCMI does not prescribe a specific schema. Since the 15 categories proved too broad for many applications, DCMI later defined the qualified DC categories, which have a narrower semantic scope.

DCMI is still highly relevant in the world of digital libraries and is often used in areas where global searches are considered sufficient. The DC set is widely accepted for cross-disciplinary metadata aggregation, although much information is generally lost in the process.
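The information loss mentioned above can be illustrated with a minimal Python sketch. The 15 element names are those of the Dublin Core element set; the structured session record, its field names and the function to_dublin_core are hypothetical and serve only as an illustration, not as the mapping used by any particular data center.

```python
# The 15 Dublin Core elements.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical structured description of a recording session.
session = {
    "name": "Interview 12",
    "date": "2001-05-14",
    "language": "nl",
    "actors": [
        {"name": "Speaker A", "role": "interviewer", "age": 34},
        {"name": "Speaker B", "role": "consultant", "age": 61},
    ],
}


def to_dublin_core(record):
    """Flatten a structured record into repeatable Dublin Core elements."""
    dc = {element: [] for element in DC_ELEMENTS}
    dc["title"].append(record["name"])
    dc["date"].append(record["date"])
    dc["language"].append(record["language"])
    for actor in record["actors"]:
        # Only the name survives: role and age have no dedicated element.
        dc["contributor"].append(actor["name"])
    # Keep only the elements that actually received a value.
    return {element: values for element, values in dc.items() if values}


flat = to_dublin_core(session)
```

After flattening, both speakers appear as plain contributors, and a search for "all sessions with a consultant older than 60" can no longer be answered from the Dublin Core record alone.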

Early in 2000 the ISLE (International Standards for Language Engineering) project was started. One of its aims was to develop a metadata standard focused on multimedia and multimodal resources, an effort known as IMDI (ISLE Metadata Initiative). After several discussions with the DCMI user group, the IMDI group decided to use a structured metadata set. The reasons were to gain expressive power and to better model the descriptions of complex resources or resource bundles as they occur in linguistics, to use linguistic terminology in support of research questions, to leave room for extensions through user-defined key-value pairs, and to allow data managers to organize and manage large data collections with IMDI. A schema and tools were developed so that interested people could use IMDI for real work. IMDI has been used by a number of projects and data centers worldwide; however, its outreach was limited.

In the autumn of 2000 the OLAC (Open Language Archives Community) initiative was made public. One of its main goals was to extend the DCMI category set by four additional, linguistically meaningful categories (linguistic field, linguistic type, discourse type, language code). OLAC presented itself as a metadata service provider, i.e. an organization that harvests metadata from linguistic data centers and supports retrieval services. The OLAC set is therefore not meant as a means of organizing and managing large collections or of supporting specific research questions.

Linguistic data centers widely agree to produce OLAC-compliant mappings and to allow OLAC to harvest all metadata records. Large variations in the granularity of the descriptions offered present a challenge for search engines as well as for users.

The CHAT format was invented to support the CHILDES (Child Language Data Exchange System) program in creating and collecting large amounts of data, in particular about how children talk and interact. CHAT is a purely ASCII-based flat text format that defines so-called header categories. These can be compared with typical metadata keywords, since they allow the speakers, the language and other contextual aspects to be specified. CHAT also allows sub-sections to be described, i.e. header-type information can occur anywhere in the annotated transcript. CHAT, and with it its header categories, is widely used in developmental linguistics and beyond. Because metadata and annotation data are merged in one file, metadata extraction is required to support the relevant functions.
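The kind of extraction this requires can be sketched in a few lines of Python. The sketch assumes that header lines start with "@" while utterance lines start with "*", and it ignores continuation lines; the sample transcript is invented for illustration and the function name extract_chat_headers is our own.

```python
def extract_chat_headers(transcript):
    """Return (line number, header name, value) for every header line."""
    headers = []
    for line_number, line in enumerate(transcript.splitlines(), start=1):
        if not line.startswith("@"):
            continue  # utterance or annotation line, not metadata
        # Split "@Name:<tab>value" into its name and value parts.
        name, _, value = line[1:].partition(":")
        headers.append((line_number, name.strip(), value.strip()))
    return headers


# Invented, simplified transcript with a header in the middle of the text.
sample = "\n".join([
    "@Begin",
    "@Languages:\teng",
    "@Participants:\tCHI Target_Child, MOT Mother",
    "*MOT:\twhat is that ?",
    "@Comment:\tchild leaves the room",
    "*CHI:\tball .",
    "@End",
])

headers = extract_chat_headers(sample)
```

Keeping the line number with each header preserves the position of header information that occurs inside the transcript, which matters when a header describes only a sub-section.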

The Text Encoding Initiative (TEI) recently presented its P5 version, which allows users to describe a wide variety of resource types. Its expressive power is extensive, since its structural specifications allow users to combine categories in almost unlimited ways. TEI also has so-called header categories that are meant to describe the whole resource. These header descriptions are part of the overall document structure, i.e. typical metadata categories and annotation categories can appear intertwined. Due to this almost unlimited expressive power, almost all applications of TEI are based on specific sub-schemas. TEI resources thus appear in a large variety of flavors, the interpretation of which requires knowledge of the specific schemas.

TEI is widely used in some humanities disciplines. However, the interpretation of TEI files depends heavily on the availability of the specific schemas. To harvest metadata from TEI files, the specific sub-schema must be known and the metadata has to be extracted.
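As a minimal sketch of such an extraction, the following Python code reads the title from the header of a TEI P5 document with the standard library. It assumes the usual P5 header layout (teiHeader, fileDesc, titleStmt, title) in the TEI namespace; a project-specific sub-schema may require other paths. The sample document and the function name extract_title are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: header and annotated text live in the same document.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample transcript</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Annotated content goes here.</p></body></text>
</TEI>"""


def extract_title(tei_xml):
    """Return the title from the TEI header, or None if it is absent."""
    root = ET.fromstring(tei_xml)
    title = root.find(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS
    )
    return title.text if title is not None else None


title = extract_title(sample)
```

The fixed path is the point where knowledge of the schema enters: a harvester needs one such set of paths per TEI flavor it wants to support.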

The Component Metadata Initiative (CMDI) brought together a large group of leading linguists from different sub-disciplines to define a metadata framework that is flexible enough to cover the differing requirements of the various sub-disciplines and projects, yet has the expressive power to serve the various functions mentioned above, including those that are emerging in the e-Research scenario. As already indicated, the core of CMDI was the definition of a set of categories whose semantics are sufficiently specific to guarantee interoperability and interpretability. The group of linguists carried out substantial work to define a robust set of categories, which have been registered in ISOcat and which can be extended and altered if necessary. A flexible syntactic framework allows users to combine categories into components and profiles.
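The relation between categories, components and profiles can be pictured with a small Python sketch. This is a conceptual model only, not the actual CMDI syntax: the class names, the example profiles and the registry identifiers (such as "example:language") are hypothetical and do not correspond to real ISOcat entries.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Category:
    name: str          # label chosen by the profile author
    registry_id: str   # hypothetical identifier of the registered category


@dataclass
class Component:
    name: str
    categories: list = field(default_factory=list)


@dataclass
class Profile:
    name: str
    components: list = field(default_factory=list)

    def registry_ids(self):
        """Collect the registered categories used anywhere in the profile."""
        return {
            category.registry_id
            for component in self.components
            for category in component.categories
        }


# Two communities use different labels but refer to the same category.
corpus_profile = Profile("corpus", [
    Component("content", [
        Category("Language", "example:language"),
        Category("Genre", "example:genre"),
    ]),
])
lexicon_profile = Profile("lexicon", [
    Component("entry", [
        Category("ObjectLanguage", "example:language"),
        Category("PartOfSpeech", "example:pos"),
    ]),
])

shared = corpus_profile.registry_ids() & lexicon_profile.registry_ids()
```

Because both profiles point to the same registered category, a search service can treat "Language" and "ObjectLanguage" as the same field even though the profiles differ in structure and labels.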

The CMDI framework will be explained in more detail below. In summary, CMDI can be seen as a flexible syntactic umbrella that can include the metadata categories defined so far and cover the needs of a wide variety of disciplines, the requirements posed by different usages, and different linguistic data types. In this respect it is a big step forward in expressive power and coverage. However, communities of usage (such as OLAC) are not required to change their practices as long as they need no additional functionality.