The Component Metadata Infrastructure (CMDI)

Rather than a single metadata format, the Component Metadata Infrastructure provides a framework to create and use self-defined metadata formats. It relies on a modular model of so-called metadata components, which can be assembled together, to improve reuse, interoperability and cooperation among metadata modelers. With a culinary analogy one could call it metadata à la carte: instead of choosing one completely predefined schema you can select your preferred plates (components) and group them until you have a whole (profile) that suits your needs.

There is, however, an important difference from a restaurant: if none of the proposed components fulfills the requirements, the user can always create a new one. Figure 2.2, “Profiles are made up of components” displays how a relatively small set of components (general metadata, metadata for textual resources, metadata for multimedia and metadata on persons) can be combined into tailored profiles.

The previous section listed a range of existing metadata formats. One could ask why it is necessary to come up with a new metadata formalism, and why CLARIN thinks this will make a difference compared to choosing one of the existing formats. The answer is manifold:

In the coming section we will go into more technical details of the CMDI model.

Work on CMDI started in 2008 in the context of the European CLARIN research infrastructure. Most existing metadata schemas for language resources seemed to be either too superficial (e.g. OLAC) or too closely tailored to specific research communities or use cases (e.g. IMDI).

CMDI addresses this by leaving it to the metadata modeler to decide what a schema should look like. It is based on metadata components that make use of agreed and registered categories. These elementary building blocks contain one or more elements (also known as fields) describing a resource. For instance, an actor component can group elements like first name, last name and sex. A component can also contain one or more other components, allowing a Lego-brick approach in which many small components together form a larger unit. To continue with the actor example, such a component could include a sub-component actor language, containing a set of fields describing the language(s) a person can speak.

Figure 2.3, “A component describing an actor” shows a typical component describing an Actor, consisting of two elements (firstName and lastName) and an embedded ActorLanguageName component. You can explore the component in the CMDI component registry.

Ultimately a set of components will be grouped into a profile – this master component thus contains all fields in a structured way that can be used to describe a language resource.
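The nesting of elements, components and profiles can be illustrated with a small Python sketch. It is not part of CMDI or of any CLARIN tool: the class names, the `language` element inside ActorLanguageName and the profile name `ExampleProfile` are assumptions made only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """A single metadata field, e.g. firstName."""
    name: str


@dataclass
class Component:
    """A reusable building block holding elements and nested components."""
    name: str
    elements: List[Element] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)

    def field_paths(self, prefix: str = "") -> List[str]:
        """Return the path of every element below this component, depth first."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        paths = [f"{here}/{e.name}" for e in self.elements]
        for child in self.components:
            paths.extend(child.field_paths(here))
        return paths


# The Actor component from the text; the "language" element is hypothetical.
actor_language = Component("ActorLanguageName", [Element("language")])
actor = Component(
    "Actor",
    [Element("firstName"), Element("lastName")],
    [actor_language],
)

# A profile is modeled here as just the top-level ("master") component.
profile = Component("ExampleProfile", components=[actor])

for path in profile.field_paths():
    print(path)
```

Treating the profile as an ordinary component mirrors the description above, in which a profile is the master component that contains all fields in a structured way.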

[Important] The CMDI component registry

In order to promote the re-use and sharing of components and profiles, the CMDI component registry was created. A web application (see Figure 2.4, “The CMDI component registry”) allows metadata modelers to browse through all existing components and profiles and to create new ones, with the possibility to include existing components. The component registry is open to anyone to read components. Submitting new components can only be done by accredited experts to guarantee that only correct and proven components are ready for reuse by others.

After creating or choosing a profile, the user can generate a W3C XML Schema (also known as an XSD file) that contains a formal specification of the structure of the metadata descriptions to be created. This schema can also be obtained from the component registry by right-clicking the profile and choosing the Download as XSD option. From then on the schema can be used to check the formal correctness of the CMDI metadata descriptions. See the section called “Well-formedness and schema compliance” for more information on formal correctness checking.

[Important] CMDI data format

A CMDI metadata description (stored as an XML file with the extension .cmdi) consists of three main parts:

  • A fixed Header, containing information about the author of the file, the creation date, a reference to the unique profile code and a link to the metadata file itself.

  • A fixed Resources section, containing links to the described resources or other CMDI metadata descriptions.

  • A flexible Components section, containing all of the components that belong to the specific profile that was chosen as a basis. In our earlier example there would be one Actor component immediately under the Components tag.
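The three-part layout can be made concrete with a minimal Python sketch that builds a skeleton document. Only the Header, Resources and Components sections and the Actor example come from the text above. The child element names (MdCreator, MdCreationDate, MdProfile, MdSelfLink, ResourceProxyList and so on) are assumptions modeled on the CMDI 1.1 schema and should be checked against the XSD generated for your profile. The XML namespace is omitted for brevity, and the profile identifier, URLs and personal names are placeholders.

```python
import xml.etree.ElementTree as ET


def build_cmdi_skeleton(creator, created, profile_id, self_link, resource_url):
    """Build a simplified CMDI-like document with its three main sections."""
    root = ET.Element("CMD")

    # 1. Fixed Header: author, creation date, profile code, link to this file.
    header = ET.SubElement(root, "Header")
    ET.SubElement(header, "MdCreator").text = creator
    ET.SubElement(header, "MdCreationDate").text = created
    ET.SubElement(header, "MdSelfLink").text = self_link
    ET.SubElement(header, "MdProfile").text = profile_id

    # 2. Fixed Resources section: links to the described resources.
    resources = ET.SubElement(root, "Resources")
    proxy_list = ET.SubElement(resources, "ResourceProxyList")
    proxy = ET.SubElement(proxy_list, "ResourceProxy", id="r1")
    ET.SubElement(proxy, "ResourceType").text = "Resource"
    ET.SubElement(proxy, "ResourceRef").text = resource_url

    # 3. Flexible Components section: here a single Actor component.
    components = ET.SubElement(root, "Components")
    actor = ET.SubElement(components, "Actor")
    ET.SubElement(actor, "firstName").text = "Jane"
    ET.SubElement(actor, "lastName").text = "Doe"
    return root


doc = build_cmdi_skeleton(
    creator="Example Author",                   # placeholder
    created="2012-01-01",                       # placeholder
    profile_id="example-profile-id",            # placeholder, not a real profile code
    self_link="http://example.org/md/1.cmdi",   # placeholder
    resource_url="http://example.org/data/1.wav",  # placeholder
)
print(ET.tostring(doc, encoding="unicode"))
```

A real description would additionally declare the CMDI namespace and a schema location, so this output is not expected to validate against a profile's XSD as it stands.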

A set of CMDI example files can be found on the CLARIN EU website.

To avoid ambiguity and to achieve clear semantics when using metadata, the CMDI model has close ties to the ISOcat data category registry (see the section called “ISOcat, a Data Category Registry”). The model can also easily be extended to other widely agreed registries of data categories. Because the system relies on a (potentially large) set of components created by many different users, there is a risk that searching the metadata descriptions becomes infeasible, because:

  • an element might have different names within multiple components (e.g. name or last name),

  • a user might not be aware that there are multiple components that contain a specific element, even when they are called the same (e.g. name), and

  • some elements might have the same label (say name) but might mean something different (e.g. project name versus person name).

The solution (or at least a part of it) is in declaring what each element really means. When adding an element to a CMDI component the metadata modeler has to add a link to the ISOcat data category registry, where very detailed definitions are available. This link provides a persistent and unique identification of the intended semantics.

Consider the example mentioned above consisting of two components, both containing a description of an actor, where one refers to a person’s last name with the label Name and the other with the label Last Name. When both point to the data category last name, both search engines and human users know precisely what is meant. It is the reference, not the label used, that clarifies the intended semantics. This method is illustrated in Figure 2.5, “Declaring explicit semantics in CMDI with links to data categories”.
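The effect of such references on searching can be sketched in a few lines of Python. The component names and the data category identifiers below are hypothetical stand-ins (real references would be persistent identifiers from a registry such as ISOcat). The sketch only shows that grouping elements by their reference, rather than by their label, resolves the three problems listed above.

```python
from collections import defaultdict

# Hypothetical data category identifiers, used only for illustration.
LAST_NAME_DC = "dc:last-name"
PROJECT_NAME_DC = "dc:project-name"

# (component, element label, data category reference)
ELEMENTS = [
    ("ActorA", "Name", LAST_NAME_DC),
    ("ActorB", "Last Name", LAST_NAME_DC),
    ("Project", "Name", PROJECT_NAME_DC),
]


def index_by_category(elements):
    """Group (component, label) pairs by the data category they point to."""
    index = defaultdict(list)
    for component, label, category in elements:
        index[category].append((component, label))
    return dict(index)


index = index_by_category(ELEMENTS)

# Different labels, same meaning: both actor elements are found together.
print(index[LAST_NAME_DC])

# Same label, different meaning: the project name stays separate.
print(index[PROJECT_NAME_DC])
```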

More information on CMDI and ISOcat can be found on a frequently asked questions page of the CLARIN EU website.

As the CMDI framework aims to enhance the reuse and sharing of metadata components, an important aspect of using it is finding existing material (be it profiles, components or data categories) that could be reused. Only if no suitable material exists should one create it from scratch.

In general, the procedure to identify relevant building blocks looks as follows (the diagram in Figure 2.6, “Procedure for CMDI metadata modeling” summarizes the whole procedure):

  1. Search in the component registry for interesting profiles. If one of them matches the requirements, use this one.

  2. If a profile more or less matches the requirements, check how to create a derivative that meets your requirements.

  3. Otherwise, look in the Component Registry for useful components:

    1. If you can create a profile out of existing components, do so.

    2. Otherwise, create additional components:

      1. Link to existing data categories when they match the semantics of the elements in your component.

      2. Otherwise, create new data categories yourself and link to these from the new component. Consult the section called “ISOcat, a Data Category Registry” for more information on how to do this.
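The procedure above can be summarized as a small decision function. This is only a sketch of the steps as listed: the function and enum names are invented for this illustration, and the three boolean inputs stand for judgments that a metadata modeler has to make by inspecting the component registry.

```python
from enum import Enum


class Action(Enum):
    USE_PROFILE = "use the matching profile as is"
    DERIVE_PROFILE = "create a derivative of the closest profile"
    COMBINE_COMPONENTS = "build a profile from existing components"
    CREATE_COMPONENTS = "create additional components"


def choose_action(exact_profile: bool, close_profile: bool,
                  components_suffice: bool) -> Action:
    """Follow steps 1 to 3 of the procedure in order."""
    if exact_profile:
        return Action.USE_PROFILE
    if close_profile:
        return Action.DERIVE_PROFILE
    if components_suffice:
        return Action.COMBINE_COMPONENTS
    return Action.CREATE_COMPONENTS


def choose_data_category(matching_category_exists: bool) -> str:
    """Innermost step: how to give a new element its semantics."""
    if matching_category_exists:
        return "link to the existing data category"
    return "create a new data category and link to it"


# Example: no suitable profile, and the existing components are not enough.
action = choose_action(exact_profile=False, close_profile=False,
                       components_suffice=False)
print(action.value)
print(choose_data_category(matching_category_exists=True))
```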

[Note] A practical example of CMDI profile creation

This section will contain a concise step-by-step guide to starting a metadata profile from scratch, based on an existing resource.

Many users find that there is an existing profile or component (further on: component) that mostly fits their needs but requires some minor changes, e.g. adding an extra element or renaming an existing one. In such cases expert metadata modelers can take an existing component, make the desired changes and store the result separately in the component registry. It is important to realize that such derivative components are no longer connected to the original. For components with embedded subcomponents, this means that even a small change in one of the lower-level components requires alternate versions of its parent components, as these are built in a bottom-up manner. Creating a new component thus requires some care from the user. This way of working is illustrated in Figure 2.7, “Creating a derivate profile with an altered component”.
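Why a change deep in the hierarchy propagates upward can be sketched as follows. This is a toy model, not how the component registry is implemented: the `derived` flag, the `Contact` component and the `proficiency` element are assumptions introduced for the illustration. Components are modeled as immutable values, so altering a nested component yields new copies of that component and of every parent on the path to it, while unrelated components are reused unchanged.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Component:
    name: str
    elements: Tuple[str, ...] = ()
    children: Tuple["Component", ...] = ()
    derived: bool = False  # hypothetical marker for "stored as a separate copy"


def derive(component: Component, target: str, extra_element: str) -> Component:
    """Return a copy in which the component named `target` gains one element.

    Components that do not contain the target are returned unchanged.
    """
    if component.name == target:
        return replace(component, derived=True,
                       elements=component.elements + (extra_element,))
    new_children = tuple(derive(c, target, extra_element)
                         for c in component.children)
    if all(new is old for new, old in zip(new_children, component.children)):
        return component  # nothing below changed, so reuse the original
    return replace(component, derived=True, children=new_children)


language = Component("ActorLanguageName", elements=("language",))
contact = Component("Contact", elements=("email",))
actor = Component("Actor", elements=("firstName", "lastName"),
                  children=(language,))
profile = Component("ExampleProfile", children=(actor, contact))

new_profile = derive(profile, "ActorLanguageName", "proficiency")

print(new_profile.derived)                 # the profile itself is a new copy
print(new_profile.children[0].derived)     # so is Actor
print(new_profile.children[1] is contact)  # Contact is reused as is
```

The original `profile` is left untouched, which corresponds to the derivative no longer being connected to the component it was copied from.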

With currently about 300 components and about 70 profiles available, the question arises whether we can recommend some components. This is certainly the case: we strongly advise using those components that contain standardized vocabularies for language names, country names, continents, etc. This will greatly enhance interoperability and metadata quality. Some other components, like cmdi-description, are also general enough to be recommended. Note that the list of recommended components and profiles that adhere to high quality standards (e.g. there are ISOcat links for all elements used) is constantly growing, so it is advisable to check the most up-to-date information.

Quite often there is already some metadata available in a non-CMDI format. For certain frequently occurring conversions there are conversion methods available. Such a method includes:

For OLAC, DC, IMDI and TEI-headers these conversion methods are described in detail on the CLARIN EU website. A similar conversion scheme for MetaShare metadata is planned but not yet available.