Lexical resources

Axel Herold

BBAW Berlin

In this section you will learn how you can integrate a particular lexical resource into the CLARIN-D infrastructure. You will be guided through some widely accepted standards for various types of lexical resources, for which we will introduce prototypical examples. With Toolbox/MDF we will introduce an example of an environment for the recording of lexical entries.

This section gives only a rather broad overview of the main concepts connected with lexical resources and is not intended to replace general and introductory works, e.g. [Svensen 2009] and, for German, [Kunze/Lemnitzer 2007] and [Engelberg/Lemnitzer 2009].

Since the handling of lexical resources in the CLARIN-D infrastructure is not yet as mature as that of, for example, text corpora, we give some recommendations on how best to prepare your resource so that integration is as smooth as possible.

Introduction

Lexical resources are collections of lexical items, typically together with linguistic information and/or classification of these items. Commonly encountered types of lexical items are words, multi-word units and morphemes. The individual items of a lexical resource need not all be of the same type. Each item is represented by a lemma. A lexical item and the information related to it together form a lexical entry. Other common general terms used for lexical resources are

  • dictionary, mostly referring to reference works compiled to be used by humans directly, and

  • lexicon, used in a more technical sense mainly as integral part of complex natural language processing (NLP) applications such as language/speech analysis and production systems and thus often not used by humans directly.

Exact definitions of these terms vary slightly across different scientific communities.

Types of lexical resources can be established based on various criteria. The following list is not exhaustive. It rather serves to illustrate the conceptual diversity among lexical resources with regard to their internal structure and the information that they provide:

  • The subject of a lexical resource determines the information that is recorded in its entries. There are no inherent restrictions as to what information may be supplied. Morphological, semantic, phonological and etymological information is commonly found in lexical resources among numerous other types.

  • A lexical resource can be construed as an unordered set or an ordered list of lexical entries but it may also have a more complex internal structure such as a hierarchical tree (see the section called “WordNet and similar resources”). This relation between lexical entries is called the resource's macro structure.

  • Lexical items are generally considered to have an internal hierarchical structure termed micro structure. Thus lexical resources can be classified according to the depth of their micro structure. The minimal structural depth is found in lemma lists that only serve to identify or enumerate lexical items. There is no inherent restriction to the maximum depth of a lexical item's internal structure.

  • Human vs. machine readable: This dimension characterizes the intended main addressee of the resource. The content of human readable dictionaries can be readily read by humans in its primary representation (e.g. black ink on white paper or a rendered image on a computer screen). In machine readable dictionaries (MRD) the data is accessible to computers by means of a fixed set of encodings together with a scheme supporting the identification of individual units of information. These units need not necessarily be of linguistic or lexicographic relevance, i.e. they do not need to represent the lexical entries' micro structure. A plain text representation of a lexical resource as a stream of individual characters is already the most primitive form of an MRD.

    Human and machine readability are not mutually exclusive features. As long as lexical data is provided in a plain text encoding, an electronic lexical resource is both human and machine readable. For a more detailed account of the overlap between human and machine readability see the discussion of views in the section called “Text encoding initiative”.

  • The distinction between monolingual and multilingual resources focusses on the number of subject languages with regard to the lexical items of the resource. In a monolingual lexicon lexical items systematically occur in only one language while in a multilingual lexicon lexical items in one language are systematically connected with equivalent lexical items in at least one other language.

  • Lexical information given in a lexicon can be recorded in different modes of communication (see also the section called “Multimodal corpora”). While most lexicons contain only written data, some record speech data or gesture representations. At the time of writing there are no lexical resources provided by CLARIN-D that contain data other than written text.

Generally, the micro structure of lexical entries and the relations among the entries are the key features of a lexical resource and consequently have to be accounted for in the data modeling.

Common formats

This section gives a brief overview of the most commonly encountered types of electronic lexical resources and the data formats they are encoded in. The sections on TEI, LMF and wordnets describe widespread general purpose formats; the section called “Toolbox/MDF” illustrates the broad spectrum of existing lexical models by sketching resources built for rather narrow domains.

Text encoding initiative

Many dictionaries in the Text encoding initiative's (TEI) markup format are the result of retrodigitization efforts, owing to the TEI's strong focus on the representation of printed text. Although it is entirely possible to develop born-digital dictionaries in TEI, we are not aware of any such enterprise.

[TEI P5] provides a detailed description of the TEI markup format for electronic text of various kinds; chapter 9 deals specifically with the representation of dictionary data.

The key to understanding TEI for dictionaries is the distinction between different views that can be encoded for a given dictionary:

  1. The typographic view is described by [TEI P5] as the two-dimensional printed page, i.e. a view on the textual data that allows the reconstruction of the page layout including all artefacts that are due to the print medium like line breaks.

  2. In the editorial view the text is represented as a one-dimensional sequence of tokens, i.e. a token stream without specific media-dependent typographic information.

  3. Finally, the lexical view represents the structured lexicographical information in the dictionary regardless of its typographic or textual form in print. This is also a way of enriching the dictionary with lexical data that is not present in the primary (printed) text.

    A purely lexical representation of a dictionary's textual content can be perceived as a database model of that data.

In a concrete instance these views need not necessarily be separated on a structural level although it is strongly recommended to keep them apart. This way the modeling of conflicting hierarchies across different views can be avoided. The TEI guidelines advise encoders to encode only one view directly using the primary XML element structure and provide information pertaining to another view by other devices like XML attributes or empty elements (e.g. milestone).

In the context of linguistic processing within the CLARIN-D workflow, the lexical view (the lexicographically informed structure and classification of the textual content) is of primary interest.

Example 6.1, “TEI encoded lexical entry for Bahnhof (train station), lexical view.” shows a TEI dictionary article in the lexical view for the German word Bahnhof (train station) with a rather modest level of lexicographic detail (word form, grammatical properties, senses, definitions, quotations). The TEI is a highly configurable framework, and the level of detail for the lexical view can be increased considerably, e.g. by customizing the @type or @value attributes and their sets of permitted values.

Every TEI dictionary file contains an obligatory metadata header (see the section called “Text Encoding Initiative (TEI)” in Chapter 2, Metadata). Within the text section a highly developed vocabulary for lexical resources allows for the detailed annotation of a broad range of lexicographic elements.

Example 6.1. TEI encoded lexical entry for Bahnhof (train station), lexical view.
<entry xml:id="a_1">
    <form>
        <orth>Bahnhof</orth>
    </form>
    <gramGrp>
        <pos value="N" />
        <gen value="masculine" />
    </gramGrp>
    <sense>
        <def>...</def>
        <cit>
            <quote>der Zug fährt in den Bahnhof ein</quote>
        </cit>
    </sense>
    <!-- ... -->
    <sense>
        <def>...</def>
    </sense>
</entry>

In Example 6.1, “TEI encoded lexical entry for Bahnhof (train station), lexical view.”, the entry element is the basic unit of information, which contains form-based information (form) and semantic information (sense). The gramGrp element holds grammatical information separately from the form description. This need not be the case: the grammatical information may be encoded differently, e.g. by subsuming gramGrp under form and thus relating the grammatical features to the concrete formal realization of the lemma. For existing print dictionaries the explicit relations between (micro)structural elements cannot always be reliably reconstructed or may be inconsistent throughout the resource. To account for this situation the TEI guidelines are very liberal with respect to permitted structures and often allow direct and indirect recursive inclusion of elements. For this reason encoding projects that use the guidelines resort to customization, i.e. project-specific constraints and extensions for the generically permitted structures and possible values.
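The structure described above can be read programmatically. The following is a minimal sketch, not a reference implementation: it uses Python's standard xml.etree.ElementTree module on the fragment from Example 6.1. It assumes that the fragment carries no XML namespace (complete TEI files normally declare one, which would have to be added to the paths), and the function name read_entry and the dictionary keys are chosen here for illustration only.

```python
import xml.etree.ElementTree as ET

# The entry from Example 6.1; elided content ("...") is kept as it is.
TEI_ENTRY = """
<entry xml:id="a_1">
    <form>
        <orth>Bahnhof</orth>
    </form>
    <gramGrp>
        <pos value="N" />
        <gen value="masculine" />
    </gramGrp>
    <sense>
        <def>...</def>
        <cit>
            <quote>der Zug fährt in den Bahnhof ein</quote>
        </cit>
    </sense>
    <sense>
        <def>...</def>
    </sense>
</entry>
"""

# ElementTree reports the predefined xml: prefix under this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_entry(xml_text):
    """Return the lexical-view information of one entry as a dictionary.

    Assumes gramGrp is a direct child of entry, as in Example 6.1.
    """
    entry = ET.fromstring(xml_text)
    gram = entry.find("gramGrp")
    return {
        "id": entry.get(XML_ID),
        "lemma": entry.findtext("form/orth"),
        "pos": gram.find("pos").get("value"),
        "gender": gram.find("gen").get("value"),
        "senses": [
            {
                "definition": sense.findtext("def"),
                "quotes": [q.text for q in sense.findall("cit/quote")],
            }
            for sense in entry.findall("sense")
        ],
    }


entry = read_entry(TEI_ENTRY)
print(entry["lemma"], entry["pos"], len(entry["senses"]))
```

A resource that places gramGrp under form would need different paths, which illustrates why project-specific customization matters for automatic processing.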

To tighten the TEI element semantics beyond the informal descriptions provided by the guidelines, explicit references to the ISOcat data category registry (see Chapter 3, Resource annotations) can be established via @dcr:* attributes in the document instance. See Example 6.2, “Part of a TEI dictionary entry with references to ISOCat” for an example. To avoid the verbosity of linking each instance of a data category to the registry directly, the connection should preferably be maintained via equiv elements in the ODD file (One document does it all), a combination of schema fragments and documentation based on the TEI's tagdocs module which is typically used for customizing the TEI schema.

Example 6.2. Part of a TEI dictionary entry with references to ISOCat
<gramGrp>
  <pos value="N" 
    dcr:datcat="http://www.isocat.org/datcat/DC-1345"
    dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256" />    
</gramGrp>
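As a rough illustration of how such references might be collected from a document instance, the sketch below lists every element that carries a dcr:datcat attribute. Example 6.2 does not show the namespace declaration for the dcr prefix; the URI used here (http://www.isocat.org/ns/dcr) is an assumption and should be checked against the actual resource. The function name data_category_links is likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: the dcr prefix is bound to this namespace URI in the document.
DCR_NS = "http://www.isocat.org/ns/dcr"

# Example 6.2 with the namespace declaration added so that it parses on its own.
FRAGMENT = f"""
<gramGrp xmlns:dcr="{DCR_NS}">
  <pos value="N"
    dcr:datcat="http://www.isocat.org/datcat/DC-1345"
    dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256" />
</gramGrp>
"""


def data_category_links(xml_text):
    """List all elements that point to a data category via dcr:datcat."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():
        datcat = element.get(f"{{{DCR_NS}}}datcat")
        if datcat is None:
            continue  # element has no reference to the registry
        links.append({
            "element": element.tag,
            "value": element.get("value"),
            "datcat": datcat,
            "valueDatcat": element.get(f"{{{DCR_NS}}}valueDatcat"),
        })
    return links


links = data_category_links(FRAGMENT)
for link in links:
    print(link["element"], link["value"], link["datcat"], link["valueDatcat"])
```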

Lexical markup format

Unlike the general TEI model, the Lexical markup format (LMF, [ISO 24613:2008]) focusses exclusively on the lexical representation of the data (equivalent to TEI's lexical view) because the design goal for LMF was the creation of a detailed meta model for all types of NLP lexicons, i.e. electronic lexical databases. In LMF, the reference to a data category registry is mandatory. This guarantees semantic unambiguity for the categories used within this framework.

The framework consists of a generic core package accompanied by a set of more domain specific extensions (morphology, morphological patterns, multiword expressions, syntax, semantics, multilingual notations, constraint expression, machine readable dictionaries).

There is a strict division between form related and semantic information on the level of individual entries. Unlike in the TEI guidelines, recursion is kept to a minimum: in the core package, only Sense allows for recursion. LMF provides a generic feature structure representation (feat class) which enables the modeling of data and annotations for LMF elements.

Example 6.3. Part of an LMF encoded lexicon entry based on the LMF core and morphology packages.
<LexicalEntry>
    <feat att="partOfSpeech" val="noun" />
    <feat att="gender" val="masculine" />
    <Lemma>
        <feat att="writtenForm" val="Bahnhof" />
    </Lemma>
    <WordForm>
        <feat att="writtenForm" val="Bahnhof" />
        <feat att="grammaticalNumber" val="singular" />
    </WordForm>
    <WordForm>
        <feat att="writtenForm" val="Bahnhöfe" />
        <feat att="grammaticalNumber" val="plural" />
    </WordForm>
    <Sense id="s1">
    </Sense>
    <!-- ... -->
    <Sense id="sn">
    </Sense>    
</LexicalEntry>
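Because all data and annotations are expressed through the generic feat class, an entry like the one in Example 6.3 can be read with very little format-specific code. The following sketch assumes the XML serialization shown in Example 6.3 (other LMF serializations may differ); the helper names feats and read_lexical_entry are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The entry from Example 6.3 (the elided senses between s1 and sn are omitted).
LMF_ENTRY = """
<LexicalEntry>
    <feat att="partOfSpeech" val="noun" />
    <feat att="gender" val="masculine" />
    <Lemma>
        <feat att="writtenForm" val="Bahnhof" />
    </Lemma>
    <WordForm>
        <feat att="writtenForm" val="Bahnhof" />
        <feat att="grammaticalNumber" val="singular" />
    </WordForm>
    <WordForm>
        <feat att="writtenForm" val="Bahnhöfe" />
        <feat att="grammaticalNumber" val="plural" />
    </WordForm>
    <Sense id="s1">
    </Sense>
    <Sense id="sn">
    </Sense>
</LexicalEntry>
"""


def feats(element):
    """Collect the direct feat children of an element as an att -> val mapping."""
    return {f.get("att"): f.get("val") for f in element.findall("feat")}


def read_lexical_entry(xml_text):
    """Separate entry-level features, lemma, word forms and sense identifiers."""
    entry = ET.fromstring(xml_text)
    return {
        "features": feats(entry),
        "lemma": feats(entry.find("Lemma")).get("writtenForm"),
        "word_forms": [feats(wf) for wf in entry.findall("WordForm")],
        "sense_ids": [s.get("id") for s in entry.findall("Sense")],
    }


lmf = read_lexical_entry(LMF_ENTRY)
print(lmf["lemma"], lmf["features"], lmf["sense_ids"])
```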

Compared to the vanilla TEI guidelines, the LMF meta-model is much more tightly designed owing to its much narrower target domain, namely NLP applications. This narrow focus makes it a good choice for the harmonization and alignment of electronic lexicographic resources in complex research infrastructures such as CLARIN-D. A number of research projects have already developed and used XML serializations of LMF, mainly for the purpose of data exchange, including RELISH, KYOTO and LIRICS.

WordNet and similar resources

The Princeton WordNet and conceptually similar databases for languages other than English are probably the most frequently used lexical resources within the NLP community. See e.g. [Fellbaum 1998] for some prototypical applications of wordnets.

In a wordnet, synsets (synonym sets) are the building blocks of the resource. Synsets represent mental concepts and can be linguistically expressed by one or more lexical word forms. These word forms are perceived as synonymous in the context of the synset; they share the meaning that is represented by the synset. A synset need not have a linguistic expression instantiating it in a given language, though (lexical gap). A linguistic expression, on the other hand, can appear in more than one synset. In many wordnets additional information is provided for a synset to make it more accessible to humans, most notably semantic paraphrases (glosses) or (corpus) quotations.

Different conceptual relations can be established among synsets, which leads to a net-like conceptual structure (hence the name wordnet). If only hyponymy (is_a) and its inverse relation hyperonymy are considered, the conceptual structure becomes tree-like and thus represents a (partial) conceptual hierarchy. Wordnets differ with respect to the enforcement of a tree structure for these relations. Another conceptual relation that is often found is meronymy (part_of), possibly further divided into sub-relations. Different wordnets may maintain different sets of conceptual relations.

While conceptual relations are established among synsets, lexical relations can be established among the linguistic expressions. These relations may be relatively broad and underspecified with respect to the grammatical processes involved (derived_from), but very fine-grained systems of lexical relations can also be implemented.
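The wordnet model described in this section can be sketched as a small data structure. This is an illustration only: the class names, identifiers and toy data below are invented and do not come from the Princeton WordNet or any other existing wordnet. Only the hyponymy/hyperonymy relation is modeled, and the sketch assumes that this relation contains no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    id: str
    word_forms: list = field(default_factory=list)  # empty list = lexical gap
    gloss: str = ""
    hypernyms: list = field(default_factory=list)   # ids of more general synsets


class WordNet:
    def __init__(self):
        self.synsets = {}

    def add(self, synset):
        self.synsets[synset.id] = synset

    def synsets_for(self, word_form):
        """A word form may be a member of more than one synset."""
        return [s for s in self.synsets.values() if word_form in s.word_forms]

    def hypernym_paths(self, synset_id):
        """All paths from a synset up to a root along the is_a relation."""
        synset = self.synsets[synset_id]
        if not synset.hypernyms:
            return [[synset_id]]
        return [
            [synset_id] + path
            for parent in synset.hypernyms
            for path in self.hypernym_paths(parent)
        ]


# Invented toy data for illustration.
net = WordNet()
net.add(Synset("c1", ["entity"]))
net.add(Synset("c2", ["institution"], hypernyms=["c1"]))
net.add(Synset("c3", ["bank"], "financial institution", ["c2"]))
net.add(Synset("c4", ["bank", "shore"], "land beside water", ["c1"]))
net.add(Synset("c5", [], "concept without a word form", ["c1"]))

print([s.id for s in net.synsets_for("bank")])
print(net.hypernym_paths("c3"))
```

If every synset has at most one hypernym, each call to hypernym_paths returns a single path and the structure is a tree; allowing several hypernyms yields several paths and hence a net.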

While the Princeton WordNet traditionally uses so-called lexicographer files (a proprietary plain text database system spread across a number of files) and, alternatively, a Prolog version of the database, most other wordnet projects have adopted proprietary XML serializations as their distribution format. The Princeton WordNet project is currently also planning a transition to an XML serialization.

Toolbox/MDF

The so-called standard format used by MDF (Multi-Dictionary Formatter, part of SIL's Toolbox, a software suite that is popular among field linguists) is a set of pairs of field markers and values (i.e. feature-value pairs) that is linearized into a plain text file. MDF provides an implicit default hierarchy of its field markers that can be redefined by the user. This effectively allows the choice between the creation of form (i.e. part-of-speech) oriented or sense oriented entries (cf. Example 6.4, “MDF encoded sense oriented Iwaidja lexicon entry”).

The MDF standard format is primarily used for the generation of print dictionaries. It can be mapped onto LMF provided form and sense descriptions within entries are cleanly separated (see [Ringersma/Drude/Kemp-Snijders 2010]). There is also a native XML serialization for the standard format available (Lexicon interchange format, LIFT).

Example 6.4. MDF encoded sense oriented Iwaidja lexicon entry
\lx alabanja
\sn 1
\ps n
\de beach hibiscus.Rope for harpoons and tying up canoes
is made from this tree species, and the timber is used to
make |fv{larrwa} smoking pipes
\ge hibiscus
\re hibiscus, beach
\rfs 205,410; IE 84
\sd plant
\sd material
\rf Iwa05.Feb2
\xv alabanja alhurdu
\xe hibiscus string/rope
\sn 2
\ps n
\de short-finned batfish
\ge short-finned batfish
\re batfish, short-finned
\sc Zabidius novaemaculatus
\sd animal
\sd fish
\rf Iwaidja Fish Names.xls
\so MELP project elicitation
\eb SH
\dt 19/Dec/2006

Note the \sn (sense number) and \ps (part-of-speech) field markers.

The example was taken from the [Ringersma/Drude/Kemp-Snijders 2010] presentation.
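To show how the linearized field markers can be turned back into a structured entry, here is a minimal parsing sketch. It hard-codes a sense oriented hierarchy in which \lx names the lemma and each \sn opens a new sense that collects all following fields; since the MDF hierarchy can be redefined by the user, this is a simplifying assumption. Record-level fields that follow the last sense (such as \dt in Example 6.4) would be attached to that sense by this sketch. The sample is abridged from Example 6.4, and the function name parse_mdf is illustrative.

```python
# Abridged from Example 6.4; a raw string keeps the backslashes literal.
SAMPLE = r"""\lx alabanja
\sn 1
\ps n
\de beach hibiscus.Rope for harpoons and tying up canoes
is made from this tree species, and the timber is used to
make |fv{larrwa} smoking pipes
\ge hibiscus
\sd plant
\sd material
\sn 2
\ps n
\ge short-finned batfish
\sd animal
\sd fish
"""


def parse_mdf(text):
    """Parse one record into a lemma and a list of sense dictionaries."""
    fields = []
    for line in text.splitlines():
        if line.startswith("\\"):
            # "\ps n" -> marker "ps", value "n"
            marker, _, value = line[1:].partition(" ")
            fields.append([marker, value.strip()])
        elif fields and line.strip():
            # A line without a marker continues the previous field value.
            fields[-1][1] += " " + line.strip()

    entry = {"lx": None, "record": {}, "senses": []}
    target = entry["record"]  # fields before the first \sn belong to the record
    for marker, value in fields:
        if marker == "lx":
            entry["lx"] = value
        elif marker == "sn":
            target = {"sn": value}
            entry["senses"].append(target)
        else:
            # Markers such as \sd may repeat, so values are kept in lists.
            target.setdefault(marker, []).append(value)
    return entry


mdf_entry = parse_mdf(SAMPLE)
for sense in mdf_entry["senses"]:
    print(mdf_entry["lx"], sense["sn"], sense["ps"], sense["ge"], sense["sd"])
```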


Formats endorsed by CLARIN-D

CLARIN-D directly supports LMF serializations and its own internal pivot format TCF. Due to the high structural diversity among lexical resources in TEI or Toolbox/MDF formats it is not feasible to maintain general purpose conversion tools for transformation into CLARIN-D compatible resources. However, BBAW operates a clearing center for lexical resources that provides guidance and support for the creation of project-specific conversion tools. You can contact the clearing center via e-mail at clarin@bbaw.de.

We encourage users either to provide an LMF serialization of their data, if one is already available, or to contact the clearing center for lexical resources.

Depending on the research question it may also be appropriate to use a dictionary's text or parts of it as a text corpus, e.g. as a collection of all quoted usage examples. If a lexical resource is intended to be used in that way rather than as a lexical database, the recommendations of the section called “Text Corpora” apply.

[Important]Existing LMF serializations

LMF is intended to enable the mapping of all existing (NLP) lexicons onto a common model and LMF compliant resources are directly supported by CLARIN-D.

CLARIN-D provides and maintains a (partial) mapping from LMF based serializations to its internal pivot format TCF.

[Important]Converting proprietary formats to LMF

To transform a proprietary format into an LMF serialization, the following steps have to be taken:

  • identify all micro- and macro-structural elements and possible annotations of the resource,

  • map these categories to their LMF representation,

  • determine the corresponding ISOCat categories (or create them in case they do not already exist), and

  • create a schema description for the LMF serialization with references to the ISOCat categories. For details as to how to create a schema description see Chapter 5, Quality assurance.

Following this approach, the Princeton WordNet together with the Dutch, Italian, Spanish, Basque and Japanese wordnets were successfully transformed into LMF serializations in the course of the KYOTO project. Alignment across the wordnets was also modeled in LMF, demonstrating the suitability of this framework for representing this subset of lexical resources. Tools and data are available on the project's homepage.
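The first two conversion steps listed above can be sketched in a few lines. The example below is an illustration under several assumptions: the input record is a hypothetical, already parsed entry with MDF-style markers in which form and sense information are cleanly separated; the mapping tables and the feat attribute names gloss and definition are invented for this sketch; and the output follows the XML shape of Example 6.3. Determining ISOCat categories and writing the schema description (steps three and four) are not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping tables (step 2): source field marker -> feat attribute.
ENTRY_FEATURES = {"ps": "partOfSpeech"}
SENSE_FEATURES = {"ge": "gloss", "de": "definition"}


def add_feat(parent, att, val):
    """Append <feat att="..." val="..."/> to an LMF element."""
    ET.SubElement(parent, "feat", {"att": att, "val": val})


def to_lmf(record):
    """Build a LexicalEntry element and report markers that have no mapping."""
    entry = ET.Element("LexicalEntry")
    unmapped = set()
    for marker, value in record["form"].items():
        if marker in ENTRY_FEATURES:
            add_feat(entry, ENTRY_FEATURES[marker], value)
        else:
            unmapped.add(marker)
    lemma = ET.SubElement(entry, "Lemma")
    add_feat(lemma, "writtenForm", record["lemma"])
    for number, sense in enumerate(record["senses"], start=1):
        sense_el = ET.SubElement(entry, "Sense", {"id": f"s{number}"})
        for marker, value in sense.items():
            if marker in SENSE_FEATURES:
                add_feat(sense_el, SENSE_FEATURES[marker], value)
            else:
                unmapped.add(marker)
    return entry, sorted(unmapped)


# Hypothetical input record with form and sense information kept apart.
record = {
    "lemma": "alabanja",
    "form": {"ps": "n"},
    "senses": [
        {"ge": "hibiscus", "sd": "plant"},
        {"ge": "short-finned batfish", "sd": "animal"},
    ],
}
lmf_entry, unmapped = to_lmf(record)
print(ET.tostring(lmf_entry, encoding="unicode"))
print("still unmapped:", unmapped)
```

Reporting unmapped markers instead of silently dropping them makes it easy to check whether the first step, the inventory of all structural elements, is complete.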

[Important]TCF pivot format

Support for lexical resources via TCF – the CLARIN-D internal representation for linguistic data – is still in its infancy. The TCF lexicon module provides means for representing word form based result lists for queries on textual corpora (see the section called “Text Corpora”) or lexical resources. It follows the stand-off annotation paradigm and currently implements the layers lemmas (mandatory), POStags, frequencies, and word-relations (all optional). See Example 6.5, “TCF representation for lexical items” for an example instance.

Example 6.5. TCF representation for lexical items
<Lexicon xmlns="http://www.dspin.de/data/lexicon"
  lang="de">
  <lemmas>
    <lemma ID="l1">halten</lemma>
    <lemma ID="l2">Vortrag</lemma>
  </lemmas>
  <POStags tagset="basic">
    <tag lemID="l1">Verb</tag>
    <tag lemID="l2">Noun</tag>
  </POStags>
  <frequencies>
    <frequency lemID="l1">1257</frequency>
    <frequency lemID="l2">193</frequency>
  </frequencies>
  <word-relations>
    <word-relation type="syntactic relation"
          func="verb+direct-object" freq="13">
      <term lemID="l1"/><term lemID="l2"/>
      <sig measure="MI">13.58</sig>
    </word-relation>
  </word-relations>
</Lexicon>

The example was taken from [Przepiórkowski 2011].
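Because the lexicon module follows the stand-off paradigm, every optional layer points back to the lemmas layer via lemID. The following sketch resolves these references for the instance in Example 6.5. It relies only on the element and attribute names visible in that example; the function name read_lexicon and the resulting dictionary layout are illustrative, and since TCF is subject to change, the sketch may not match later versions of the format.

```python
import xml.etree.ElementTree as ET

# Namespace taken from Example 6.5; the prefix "lex" is local to this script.
NS = {"lex": "http://www.dspin.de/data/lexicon"}

TCF_LEXICON = """
<Lexicon xmlns="http://www.dspin.de/data/lexicon"
  lang="de">
  <lemmas>
    <lemma ID="l1">halten</lemma>
    <lemma ID="l2">Vortrag</lemma>
  </lemmas>
  <POStags tagset="basic">
    <tag lemID="l1">Verb</tag>
    <tag lemID="l2">Noun</tag>
  </POStags>
  <frequencies>
    <frequency lemID="l1">1257</frequency>
    <frequency lemID="l2">193</frequency>
  </frequencies>
  <word-relations>
    <word-relation type="syntactic relation"
          func="verb+direct-object" freq="13">
      <term lemID="l1"/><term lemID="l2"/>
      <sig measure="MI">13.58</sig>
    </word-relation>
  </word-relations>
</Lexicon>
"""


def read_lexicon(xml_text):
    """Merge the stand-off layers into one record per lemma."""
    root = ET.fromstring(xml_text)
    lemmas = {
        lemma.get("ID"): {"lemma": lemma.text}
        for lemma in root.findall("lex:lemmas/lex:lemma", NS)
    }
    # Optional layers: attach each annotation to the lemma it refers to.
    for tag in root.findall("lex:POStags/lex:tag", NS):
        lemmas[tag.get("lemID")]["pos"] = tag.text
    for freq in root.findall("lex:frequencies/lex:frequency", NS):
        lemmas[freq.get("lemID")]["frequency"] = int(freq.text)
    relations = []
    for rel in root.findall("lex:word-relations/lex:word-relation", NS):
        relations.append({
            "func": rel.get("func"),
            "freq": int(rel.get("freq")),
            "terms": [lemmas[t.get("lemID")]["lemma"]
                      for t in rel.findall("lex:term", NS)],
            "sig": {s.get("measure"): float(s.text)
                    for s in rel.findall("lex:sig", NS)},
        })
    return lemmas, relations


lemmas, relations = read_lexicon(TCF_LEXICON)
print(lemmas["l1"])
print(relations[0]["terms"], relations[0]["sig"])
```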


The technical description of the TCF lexicon model is available at the CLARIN-D website.

We do not recommend providing TCF based versions of lexical resources directly, as TCF is an ever-evolving pivot format meant for data representation solely within the CLARIN-D infrastructure. Currently, it is not a suitable common model for lexical resources in the broad sense intended by LMF. The TCF format is tailor-made with the needs of NLP tool chains in mind and will therefore be subject to change when additional processing needs arise within CLARIN-D.