Lexical resources

In this section you will learn how you can integrate a particular lexical resource into the CLARIN-D infrastructure. You will be guided through some widely accepted standards for various types of lexical resources, for which we will introduce prototypical examples. With Toolbox/MDF we will introduce an example of an environment for the recording of lexical entries.

This section gives only a rather broad overview of the main concepts connected with lexical resources and is not intended to replace general and introductory works, e.g. [Svensen 2009] and, for German, [Kunze/Lemnitzer 2007] and [Engelberg/Lemnitzer 2009].

Since the handling of lexical resources in the CLARIN-D infrastructure is not yet as mature as, for example, the handling of text corpora, we give some recommendations on how best to prepare your resource in order to make integration as smooth as possible.

Lexical resources are collections of lexical items, typically together with linguistic information and/or classification of these items. Commonly encountered types of lexical items are words, multi-word units and morphemes. Individual items of a lexical resource need not necessarily be of the same type. They are represented by a lemma. A lexical item and the information related to it together form a lexical entry. Other common general terms used for lexical resources are

Exact definitions of these terms vary slightly across different scientific communities.

Types of lexical resources can be established based on various criteria. The following list is not exhaustive. It rather serves to illustrate the conceptual diversity among lexical resources with regard to their internal structure and the information that they provide:

Generally, the micro structure of lexical entries and the relations among the entries are the key features of a lexical resource and consequently have to be accounted for in the data modeling.

This section gives a brief overview of the most commonly encountered types of electronic lexical resources and the data formats they are encoded in. The sections on TEI, LMF and wordnets describe widespread general purpose formats; the section called “Toolbox/MDF” illustrates the broad spectrum of existing lexical models by sketching resources built for rather narrow domains.

Many dictionaries in the Text Encoding Initiative's (TEI) markup format are results of retrodigitization efforts due to the TEI's strong focus on printed text representation. Although it is entirely possible to develop born-digital dictionaries in TEI, we are not aware of any such enterprise.

[TEI P5] provides a detailed description of the TEI markup format for electronic text of various kinds (chapter 9 specifically deals with the representation of dictionary data).

The key to understanding TEI for dictionaries is the distinction between different views that can be encoded for a given dictionary:

  1. The typographic view is described by [TEI P5] as the two-dimensional printed page, i.e. a view on the textual data that allows the reconstruction of the page layout including all artefacts that are due to the print medium like line breaks.

  2. In the editorial view the text is represented as a one-dimensional sequence of tokens, i.e. a token stream without specific media-dependent typographic information.

  3. Finally, the lexical view represents the structured lexicographical information in the dictionary regardless of its typographic or textual form in print. This is also a way of enriching the dictionary with lexical data that is not present in the primary (printed) text.

    A purely lexical representation of a dictionary's textual content can be perceived as a database model of that data.

In a concrete instance these views need not necessarily be separated on a structural level although it is strongly recommended to keep them apart. This way the modeling of conflicting hierarchies across different views can be avoided. The TEI guidelines advise encoders to encode only one view directly using the primary XML element structure and provide information pertaining to another view by other devices like XML attributes or empty elements (e.g. milestone).

In the context of linguistic processing within the CLARIN-D workflow, the lexical view (the lexicographically informed structure and classification of the textual content) is of primary interest.

Example 6.1, “TEI encoded lexical entry for Bahnhof (train station), lexical view.” shows a TEI dictionary article in the lexical view for the German word Bahnhof (train station) with a rather modest level of lexicographic detail (word form, grammatical properties, senses, definitions, quotations). The TEI is a highly configurable framework and the level of detail for the lexical view can be considerably increased, e.g. by customizing @type or @value attributes and the set of permitted values.

Every TEI dictionary file contains an obligatory metadata header (see the section called “Text Encoding Initiative (TEI)” in Chapter 2, Metadata). Within the text section a highly developed vocabulary for lexical resources allows for the detailed annotation of a broad range of lexicographic elements.

In Example 6.1, “TEI encoded lexical entry for Bahnhof (train station), lexical view.”, the entry element is the basic unit of information which contains form-based (form) and semantic (sense) information. The gramGrp element comprises grammatical information separate from the form description. This is not necessarily the case and may be encoded differently, e.g. by subsuming gramGrp under form and thus relating the grammatical features to the concrete formal realization of the lemma. For existing print dictionaries the explicit relations between (micro)structural elements cannot always be reliably reconstructed or may be inconsistent throughout the resource. To account for this situation the TEI guidelines are very liberal with respect to permitted structures and often allow direct and indirect recursive inclusion of elements. For this reason encoding projects exploiting the guidelines resort to customization, i.e. project specific constraints and extensions for the generically permitted structures and possible values.
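The element structure described above can be illustrated with a short Python sketch that reads an entry in the lexical view. The sample entry is hand-written for this illustration and is an assumption, not the guide's Example 6.1; the function name read_entry and the choice of the standard library's ElementTree are likewise only illustrative. The element names (entry, form, orth, gramGrp, pos, gen, sense, def, cit, quote) are taken from the TEI dictionary vocabulary.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written entry for illustration only.
SAMPLE_ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="bahnhof">
  <form type="lemma"><orth>Bahnhof</orth></form>
  <gramGrp><pos>noun</pos><gen>masculine</gen></gramGrp>
  <sense n="1">
    <def>place where trains stop for passengers and goods</def>
    <cit type="example"><quote>Der Zug steht im Bahnhof.</quote></cit>
  </sense>
</entry>
"""


def read_entry(xml_text):
    """Extract form, grammatical and sense information from one entry."""
    entry = ET.fromstring(xml_text.strip())
    senses = []
    for sense in entry.findall("tei:sense", TEI_NS):
        senses.append({
            "n": sense.get("n"),
            "definition": sense.findtext("tei:def", namespaces=TEI_NS),
            "quotes": [q.text for q in sense.findall("tei:cit/tei:quote", TEI_NS)],
        })
    return {
        # gramGrp is looked up as a sibling of form, as in the text above;
        # an encoding that nests gramGrp under form would need another path.
        "lemma": entry.findtext("tei:form/tei:orth", namespaces=TEI_NS),
        "pos": entry.findtext("tei:gramGrp/tei:pos", namespaces=TEI_NS),
        "gender": entry.findtext("tei:gramGrp/tei:gen", namespaces=TEI_NS),
        "senses": senses,
    }


parsed = read_entry(SAMPLE_ENTRY)
print(parsed["lemma"], parsed["pos"], len(parsed["senses"]))
```

Because the guidelines permit many alternative structures, fixed paths like these only work for a resource whose project specific customization guarantees them.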

To tighten the TEI element semantics beyond the informal descriptions provided by the guidelines, explicit references to the ISOcat data category registry (see Chapter 3, Resource annotations) can be established via @dcr:* attributes in the document instance. See Example 6.2, “Part of a TEI dictionary entry with references to ISOcat” for an example. To avoid the verbosity of linking each instance of a data category to the registry directly, the connection should preferably be maintained via equiv elements in the ODD file (One Document Does it all), a combination of schema fragments and documentation based on the TEI's tagdocs module which is typically used for customizing the TEI schema.

Unlike the general TEI model, the Lexical markup format (LMF, [ISO 24613:2008]) focusses exclusively on the lexical representation of the data (equivalent to TEI's lexical view) because the design goal for LMF was the creation of a detailed meta model for all types of NLP lexicons, i.e. electronic lexical databases. In LMF, the reference to a data category registry is mandatory. This guarantees semantic unambiguity for the categories used within this framework.

The framework consists of a generic core package accompanied by a set of more domain specific extensions (morphology, morphological patterns, multiword expressions, syntax, semantics, multilingual notations, constraint expression, machine readable dictionaries).

There is a strict division between form related and semantic information on the level of individual entries. Unlike in the TEI guidelines recursion is kept at a minimum. In the core package, only Sense allows for recursion. LMF provides a generic feature structure representation (feat class) which enables the modeling of data and annotations for LMF elements.
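The division between form and sense and the generic feat mechanism can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the element names LexicalEntry, Lemma, Sense and feat (with att and val attributes) follow the style of common XML serializations of LMF, while the helper names and the feature names partOfSpeech and writtenForm are chosen here for illustration and would have to be tied to data category registry entries in a real resource.

```python
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach one attribute-value pair in the style of LMF's feat class."""
    return ET.SubElement(parent, "feat", att=att, val=val)


def build_lexical_entry(lemma, pos, sense_ids):
    """Build one entry with form information and senses kept apart."""
    entry = ET.Element("LexicalEntry")
    feat(entry, "partOfSpeech", pos)

    # Form side: the lemma carries the written form.
    lemma_el = ET.SubElement(entry, "Lemma")
    feat(lemma_el, "writtenForm", lemma)

    # Semantic side: one Sense element per sense identifier.
    for sense_id in sense_ids:
        ET.SubElement(entry, "Sense", id=sense_id)
    return entry


entry = build_lexical_entry("Bahnhof", "noun", ["bahnhof_1"])
print(ET.tostring(entry, encoding="unicode"))
```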

Compared to the vanilla TEI guidelines, the LMF meta-model is much more tightly designed due to its much narrower target domain, namely NLP applications. This condensed focus makes it a good choice for resource harmonization and alignment for electronic lexicographic resources in complex research infrastructures such as CLARIN-D. A number of research projects have already developed and exploited XML serializations of LMF, mainly for the purpose of data exchange, including RELISH, KYOTO and LIRICS.

The Princeton WordNet and conceptually similar databases for languages other than English are probably the most frequently exploited lexical resources within the NLP community. See e.g. [Fellbaum 1998] for some prototypical applications of wordnets.

In a wordnet, synsets (synonym sets) are the building blocks of the resource. Synsets represent mental concepts and can be linguistically expressed by one or more lexical word forms. These word forms are perceived as synonymous in the context of the synset; they share the same meaning that is represented by the synset. It is possible, though, for a synset not to have a linguistic expression instantiating it in a given language (lexical gap). A linguistic expression, on the other hand, can appear in more than one synset. In many wordnets additional information is provided for a synset to make it more accessible for humans, most notably semantic paraphrases (glosses) or (corpus) quotations.

Among synsets different conceptual relations can be established, which leads to a net-like conceptual structure (hence the name wordnet). If only hyponymy (is_a) and its reverse relation hyperonymy are considered, the conceptual structure becomes tree-like and thus represents a (partial) conceptual hierarchy. Wordnets differ with respect to the enforcement of a tree structure for these relations. Another conceptual relation that is often found is meronymy (part_of), possibly further divided into sub-relations. Different wordnets may maintain different sets of conceptual relations.

While conceptual relations are established among synsets, lexical relations can be established among the linguistic expressions. These relations may be relatively broad and underspecified with respect to the grammatical processes involved (derived_from), but very fine-grained lexical relation systems can also be implemented.
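The distinction between synsets, conceptual relations and lexical relations can be made concrete with a small data structure. The following Python sketch is only an illustration: the class name, the identifiers and the three toy synsets are invented here and do not reproduce the data model or content of any actual wordnet.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    id: str
    words: list      # word forms expressing the concept; empty for a lexical gap
    gloss: str = ""
    hypernyms: list = field(default_factory=list)  # ids of more general synsets


# Invented toy data for illustration.
SYNSETS = {
    "s1": Synset("s1", ["building", "edifice"], "a structure with a roof and walls"),
    "s2": Synset("s2", ["station"], "a building where trains stop", hypernyms=["s1"]),
    "s3": Synset("s3", ["station", "post"], "a position someone is assigned to"),
}

# Lexical relations hold between word forms, not between synsets.
LEXICAL_RELATIONS = [("building", "derived_from", "build")]


def synsets_for(word, synsets=SYNSETS):
    """A word form may belong to more than one synset."""
    return [s.id for s in synsets.values() if word in s.words]


def hypernym_chain(synset_id, synsets=SYNSETS):
    """Follow the first hypernym upwards; stop at the top or on a cycle."""
    chain, seen = [], set()
    current = synset_id
    while current not in seen:
        seen.add(current)
        chain.append(current)
        parents = synsets[current].hypernyms
        if not parents:
            break
        current = parents[0]
    return chain


print(synsets_for("station"))
print(hypernym_chain("s2"))
```

Following only the first hypernym is a simplification; a wordnet that does not enforce a tree structure would require collecting all paths.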

While the Princeton WordNet traditionally uses so-called lexicographer files (a proprietary plain text database system spread across a number of files) and, alternatively, a Prolog version of the database, most other wordnet projects have adopted proprietary XML serializations as their distribution format. The Princeton WordNet project is currently also planning a transition to an XML serialization.

The so-called standard format used by MDF (Multi-Dictionary Formatter, part of SIL's Toolbox, a software suite that is popular among field linguists) is a set of field marker and value pairs (i.e. feature-value pairs) that is linearized into a plain text file. MDF provides an implicit default hierarchy of its field markers that can be redefined by the user. This effectively allows the choice between the creation of form (i.e. part-of-speech) oriented or sense oriented entries (cf. Example 6.4, “MDF encoded sense oriented Iwaidja lexicon entry”).
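How such a linearized file of field marker and value pairs can be read back into records is sketched below in Python. The sample text is invented for illustration and is not the guide's Iwaidja example; the markers \lx (lexeme), \ps (part of speech), \sn (sense number) and \ge (English gloss) are common MDF markers, while the function names are assumptions of this sketch.

```python
# Invented sample in standard format; each line is "\marker value".
SAMPLE = r"""
\lx Bahnhof
\ps n
\sn 1
\ge train station
\sn 2
\ge depot

\lx fahren
\ps v
\ge drive
"""


def parse_standard_format(text):
    """Split the text into records, each a list of (marker, value) pairs."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # simplification: blank and continuation lines are skipped
        marker, _, value = line[1:].partition(" ")
        if marker == "lx":  # \lx opens a new record
            current = []
            records.append(current)
        if current is not None:
            current.append((marker, value.strip()))
    return records


def senses(record):
    """Group the fields that follow each sense number into one dict per sense."""
    result, sense = [], None
    for marker, value in record:
        if marker == "sn":
            sense = {"sn": value}
            result.append(sense)
        elif sense is not None:
            sense[marker] = value
    return result


records = parse_standard_format(SAMPLE)
print(len(records), senses(records[0]))
```

The grouping in senses hard-codes one possible hierarchy. Since users can redefine the hierarchy of field markers, a real converter would have to take the hierarchy from the project's settings rather than assume it.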

The MDF standard format is primarily used for the generation of print dictionaries. It can be mapped onto LMF provided form and sense descriptions within entries are cleanly separated (see [Ringersma/Drude/Kemp-Snijders 2010]). There is also a native XML serialization for the standard format available (Lexicon interchange format, LIFT).

CLARIN-D directly supports LMF serializations and its own internal pivot format TCF. Due to the high structural diversity among lexical resources in TEI or Toolbox/MDF formats it is not feasible to maintain general purpose conversion tools for transformation into CLARIN-D compatible resources. However, a clearing center for lexical resources is operated by BBAW that provides guidance and support for the creation of project specific conversion tools. You can contact the clearing center via e-mail at dwds@dwds.de.

We encourage users to either contact the clearing center for lexical resources or provide an LMF serialisation of their data if that is already available.

Depending on the research question it may also be appropriate to use a dictionary's text or parts of it as a text corpus, e.g. as a collection of all quoted usage examples. If a lexical resource is intended to be used that way as opposed to a lexical database the recommendations of the section called “Text Corpora” apply.

[Important] Existing LMF serializations

LMF is intended to enable the mapping of all existing (NLP) lexicons onto a common model and LMF compliant resources are directly supported by CLARIN-D.

CLARIN-D provides and maintains a (partial) mapping from LMF based serializations to its internal pivot format TCF.

[Important] Converting proprietary formats to LMF

To transform a proprietary format into an LMF serialisation the following steps have to be taken:

  • identify all micro- and macro-structural elements and possible annotations of the resource,

  • map these categories to their LMF representation,

  • determine the corresponding ISOcat categories (or create them in case they do not already exist), and

  • create a schema description for the LMF serialisation with references to the ISOcat categories. For details as to how to create a schema description see Chapter 5, Quality assurance.
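The first three steps can be supported by a simple bookkeeping script. The Python sketch below assumes that the source resource has already been read into records of (marker, value) pairs; the mapping table, the target names and the function names are hypothetical. Registry references are deliberately left empty, because the actual identifiers have to be looked up or created by the project. The schema description of the last step is not covered.

```python
# Hypothetical mapping from source field markers to LMF-style targets.
# "dcr" stays None until the project has determined the registry entry;
# no identifier is invented here.
FIELD_MAP = {
    "lx": {"lmf_class": "Lemma", "att": "writtenForm", "dcr": None},
    "ps": {"lmf_class": "LexicalEntry", "att": "partOfSpeech", "dcr": None},
    "ge": {"lmf_class": "Sense", "att": "gloss", "dcr": None},
}


def inventory(records):
    """Step 1: collect every field marker that occurs in the resource."""
    return sorted({marker for record in records for marker, _ in record})


def check_mapping(markers, field_map=FIELD_MAP):
    """Steps 2 and 3: list markers lacking an LMF target or a registry reference."""
    unmapped = [m for m in markers if m not in field_map]
    missing_dcr = [
        m for m in markers if m in field_map and field_map[m]["dcr"] is None
    ]
    return unmapped, missing_dcr


# Invented sample record for illustration.
records = [[("lx", "Bahnhof"), ("ps", "n"), ("sn", "1"), ("ge", "train station")]]
markers = inventory(records)
unmapped, missing_dcr = check_mapping(markers)
print("no LMF target yet:", unmapped)
print("no registry reference yet:", missing_dcr)
```

Running such a check over the whole resource before writing any converter makes gaps in the mapping visible early.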

Following this approach, the Princeton WordNet together with the Dutch, Italian, Spanish, Basque and Japanese wordnets were successfully transformed into LMF serializations in the course of the KYOTO project. Alignment across the wordnets was also modeled in LMF, demonstrating the suitability of this framework for representing this subset of lexical resources. Tools and data are available on the project's homepage.

[Important] TCF pivot format

Support for lexical resources via TCF – the CLARIN-D internal representation for linguistic data – is still in its infancy. The TCF lexicon module provides means for representing word form based result lists for queries on textual corpora (see the section called “Text Corpora”) or lexical resources. It follows the stand-off annotation paradigm and currently implements the layers lemmas (mandatory), POStags, frequencies, and word-relations (all optional). See Example 6.5, “TCF representation for lexical items” for an example instance.
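The stand-off paradigm mentioned above means that the optional layers do not repeat the lemmas but point to them by identifier. The following Python sketch illustrates only this principle with plain dictionaries; it is not the TCF schema, and the identifiers, the relation label and all values (including the frequency) are invented for illustration.

```python
# Invented toy data; the layer names follow the text above,
# the internal structure is an assumption of this sketch.
LAYERS = {
    "lemmas": {"l1": "Bahnhof", "l2": "Zug"},        # mandatory layer
    "POStags": {"l1": "NN", "l2": "NN"},             # optional, keyed by lemma id
    "frequencies": {"l1": 1200},                     # optional, keyed by lemma id
    "word-relations": [("l1", "l2", "cooccurrence")],  # optional, pairs of lemma ids
}


def dangling_references(layers):
    """Return ids used in optional layers that the lemma layer does not define."""
    known = set(layers["lemmas"])
    refs = set(layers.get("POStags", {})) | set(layers.get("frequencies", {}))
    for source, target, _label in layers.get("word-relations", []):
        refs.update((source, target))
    return sorted(refs - known)


def view(lemma_id, layers):
    """Merge the information of all layers for one lemma."""
    return {
        "lemma": layers["lemmas"][lemma_id],
        "pos": layers.get("POStags", {}).get(lemma_id),
        "frequency": layers.get("frequencies", {}).get(lemma_id),
    }


print(dangling_references(LAYERS))
print(view("l1", LAYERS))
```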

The technical description of the TCF lexicon model is available at the CLARIN-D website.

We do not recommend the direct provision of TCF based versions of lexical resources as TCF is an ever evolving pivot format meant for data representation solely within the CLARIN-D infrastructure. Currently, it is not a suitable common model for lexical resources in the broad sense intended by LMF. The TCF format is tailor-made with the needs of NLP tool chains in mind and will therefore be subject to changes when additional needs for processing within CLARIN-D arise.