In this section you will learn how you can integrate a particular lexical resource into the CLARIN-D infrastructure. You will be guided through some widely accepted standards for various types of lexical resources, for which we will introduce prototypical examples. With Toolbox/MDF we will introduce an example of an environment for the recording of lexical entries.
This section gives only a rather broad overview of the main concepts connected with lexical resources and is not intended to replace general and introductory works, e.g. [Svensen 2009] and, for German, [Kunze/Lemnitzer 2007] and [Engelberg/Lemnitzer 2009].
Since the handling of lexical resources in the CLARIN-D infrastructure is not yet as mature as, for example, that of text corpora, we will give you some recommendations on how to best prepare your resource in order to make integration as smooth as possible.
Lexical resources are collections of lexical items, typically together with linguistic information and/or classification of these items. Commonly encountered types of lexical items are words, multi-word units and morphemes. Individual items of a lexical resource need not necessarily be of the same type. They are represented by a lemma. A lexical item and the information related to it together form a lexical entry. Other common general terms used for lexical resources are
dictionary, mostly referring to reference works compiled to be used by humans directly, and
lexicon, used in a more technical sense mainly as integral part of complex natural language processing (NLP) applications such as language/speech analysis and production systems and thus often not used by humans directly.
Exact definitions of these terms vary slightly across different scientific communities.
Types of lexical resources can be established based on various criteria. The following list is not exhaustive. It rather serves to illustrate the conceptual diversity among lexical resources with regard to their internal structure and the information that they provide:
The subject of a lexical resource determines the information that is recorded in its entries. There are no inherent restrictions as to what information may be supplied. Morphological, semantic, phonological and etymological information is commonly found in lexical resources among numerous other types.
A lexical resource can be construed as an unordered set or an ordered list of lexical entries but it may also have a more complex internal structure such as a hierarchical tree (see the section called “WordNet and similar resources”). This relation between lexical entries is called the resource's macro structure.
Lexical items are generally considered to have an internal hierarchical structure termed micro structure. Thus lexical resources can be classified according to the depth of their micro structure. The minimal structural depth is found in lemma lists that only serve to identify or enumerate lexical items. There is no inherent restriction to the maximum depth of a lexical item's internal structure.
Human vs. machine readable: This dimension characterizes the intended main addressee of the resource. The content of human readable dictionaries can be readily read by humans in its primary representation (e.g. black ink on white paper or a rendered image on a computer screen). In machine readable dictionaries (MRD) the data is accessible for computers by means of a fixed set of encodings together with a scheme supporting the identification of individual units of information. These units need not necessarily be of linguistic or lexicographic relevance, i.e. they do not need to represent the lexical entries' micro structure. A plain text representation of a lexical resource as a stream of individual characters is already a most primitive form of MRD.
Human and machine readability are not complementary features. As long as lexical data is provided in a plain text encoding, an electronic lexical resource is both human and machine readable. For a more detailed account of the overlap between human and machine readability see the discussion of views in the section called “Text encoding initiative”.
The distinction between monolingual and multilingual resources focusses on the number of subject languages with regard to the lexical items of the resource. In a monolingual lexicon lexical items systematically occur in only one language while in a multilingual lexicon lexical items in one language are systematically connected with equivalent lexical items in at least one other language.
Lexical information given in a lexicon can be recorded in different modes of communication (see also the section called “Multimodal corpora”). While most lexicons contain only written data some record speech data or gesture representations. At the time of writing there are no lexical resources provided by CLARIN-D that contain data other than written text.
Generally, the micro structure of lexical entries and the relations among the entries are the key features of a lexical resource and consequently have to be accounted for in the data modeling.
This section constitutes a brief overview of the most commonly encountered types and formats of electronic lexical resources and the data formats they are encoded in. The sections on TEI, LMF and wordnets describe widespread general purpose formats; the section called “Toolbox/MDF” illustrates the broad spectrum of existing lexical models by sketching resources built for rather narrow domains.
Many dictionaries in the Text Encoding Initiative's (TEI) markup format are results of retrodigitization efforts, owing to the TEI's strong focus on printed text representation. Although it is entirely possible to develop born-digital dictionaries in TEI, we are not aware of any such enterprise.
[TEI P5] provides a detailed description of the TEI markup format for electronic text of various kinds (chapter 9 specifically deals with the representation of dictionary data).
The key to understanding TEI for dictionaries is the distinction between different views that can be encoded for a given dictionary:
The typographic view is described by [TEI P5] as “the two-dimensional printed page”, i.e. a view on the textual data that allows the reconstruction of the page layout including all artefacts that are due to the print medium like line breaks.
In the editorial view the text is represented as a “one-dimensional sequence of tokens”, i.e. a token stream without specific media-dependent typographic information.
Finally, the lexical view represents the structured lexicographical information in the dictionary regardless of its typographic or textual form in print. This is also a way of enriching the dictionary with lexical data that is not present in the primary (printed) text.
A purely lexical representation of a dictionary's textual content can be perceived as a database model of that data.
In a concrete instance these views need not necessarily be separated on a structural level, although it is strongly recommended to keep them apart. This way the modeling of conflicting hierarchies across different views can be avoided. The TEI guidelines advise encoders to encode only one view directly using the primary XML element structure and to provide information pertaining to another view by other devices like XML attributes or empty elements (e.g. milestone).
In the context of linguistic processing within the CLARIN-D workflow, the lexical view (the lexicographically informed structure and classification of the textual content) is of primary interest.
Example 6.1, “TEI encoded lexical entry for Bahnhof (train station), lexical view.” shows an example of a TEI dictionary article in the lexical view for the German word Bahnhof (“train station”) with a rather modest level of lexicographic detail (wordform, grammatical properties, senses, definitions, quotations). The TEI is a highly configurable framework, and the level of detail for the lexicographical view can be considerably increased, e.g. by means of customizing @type or @value attributes and the set of permitted values.
Every TEI dictionary file contains an obligatory metadata header (see the section called “Text Encoding Initiative (TEI)” in Chapter 2, Metadata). Within the text section a highly developed vocabulary for lexical resources allows for the detailed annotation of a broad range of lexicographic elements.
<entry xml:id="a_1">
  <form>
    <orth>Bahnhof</orth>
  </form>
  <gramGrp>
    <pos value="N" />
    <gen value="masculine" />
  </gramGrp>
  <sense>
    <def>...</def>
    <cit>
      <quote>der Zug fährt in den Bahnhof ein</quote>
    </cit>
  </sense>
  <!-- ... -->
  <sense>
    <def>...</def>
  </sense>
</entry>
In Example 6.1, “TEI encoded lexical entry for Bahnhof (train station), lexical view.”, the entry element is the basic unit of information, which contains clearly form-based (form) and semantic information (sense). The gramGrp element comprises grammatical information separate from the form description. This is not necessarily the case and may be encoded differently, e.g. by subsuming gramGrp under form and thus relating the grammatical features to the concrete formal realization of the lemma. For existing print dictionaries the explicit relations between (micro)structural elements cannot always be reliably reconstructed or may be inconsistent throughout the resource. To account for this situation the TEI guidelines are very liberal with respect to permitted structures and often allow direct and indirect recursive inclusion of elements. For this reason encoding projects exploiting the guidelines resort to customization, i.e. project-specific constraints and extensions for the generically permitted structures and possible values.
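The lexical view lends itself to processing with standard XML tooling. The following sketch extracts the headword, part of speech and senses from an entry modeled on Example 6.1; for brevity the fragment is given without the TEI namespace, which a real TEI document would declare.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-less entry fragment modeled on Example 6.1.
entry_xml = """
<entry xml:id="a_1">
  <form><orth>Bahnhof</orth></form>
  <gramGrp><pos value="N"/><gen value="masculine"/></gramGrp>
  <sense><def>...</def></sense>
</entry>
"""

entry = ET.fromstring(entry_xml)
orth = entry.findtext("form/orth")            # headword
pos = entry.find("gramGrp/pos").get("value")  # part of speech
senses = [s.findtext("def") for s in entry.findall("sense")]
print(orth, pos, len(senses))
```

With a namespaced TEI document the same paths would be qualified with the TEI namespace URI via the namespace mapping argument of find/findall.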
To tighten the TEI element semantics beyond the informal descriptions provided by the guidelines, explicit references to the ISOcat data category registry (see Chapter 3, Resource annotations) can be established via @dcr:* attributes in the document instance. See Example 6.2, “Part of a TEI dictionary entry with references to ISOcat” for an example. To reduce the verbosity of linking each instance of a data category to the registry directly, the connection should preferably be maintained via equiv elements in the ODD file (“One document does it all”), a combination of schema fragments and documentation based on the TEI's tagdocs module which is typically used for customizing the TEI schema.
<gramGrp>
  <pos value="N"
       dcr:datcat="http://www.isocat.org/datcat/DC-1345"
       dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256" />
</gramGrp>
Unlike the general TEI model, the Lexical markup format (LMF, [ISO 24613:2008]) focusses exclusively on the lexical representation of the data (equivalent to TEI's lexical view) because the design goal for LMF was the creation of a detailed meta model for all types of NLP lexicons, i.e. electronic lexical databases. In LMF, the reference to a data category registry is mandatory. This guarantees semantic unambiguity for the categories used within this framework.
The framework consists of a generic core package accompanied by a set of more domain specific extensions (morphology, morphological patterns, multiword expressions, syntax, semantics, multilingual notations, constraint expression, machine readable dictionaries).
There is a strict division between form related and semantic information on the level of individual entries. Unlike in the TEI guidelines, recursion is kept at a minimum: in the core package, only Sense allows for recursion. LMF provides a generic feature structure representation (the feat class) which enables the modeling of data and annotations for LMF elements.
<LexicalEntry>
  <feat att="partOfSpeech" val="noun" />
  <feat att="gender" val="masculine" />
  <Lemma>
    <feat att="writtenForm" val="Bahnhof" />
  </Lemma>
  <WordForm>
    <feat att="writtenForm" val="Bahnhof" />
    <feat att="grammaticalNumber" val="singular" />
  </WordForm>
  <WordForm>
    <feat att="writtenForm" val="Bahnhöfe" />
    <feat att="grammaticalNumber" val="plural" />
  </WordForm>
  <Sense id="s1">
  </Sense>
  <!-- ... -->
  <Sense id="sn">
  </Sense>
</LexicalEntry>
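Because LMF expresses all linguistic information as feat attribute-value pairs, a minimal reader is straightforward to sketch. The snippet below works on an entry like the one above (a trimmed-down fragment, without the Sense elements) and collects the features of the entry, its Lemma and its WordForm elements:

```python
import xml.etree.ElementTree as ET

# Trimmed-down LexicalEntry modeled on the LMF example above.
lmf_xml = """
<LexicalEntry>
  <feat att="partOfSpeech" val="noun"/>
  <feat att="gender" val="masculine"/>
  <Lemma><feat att="writtenForm" val="Bahnhof"/></Lemma>
  <WordForm>
    <feat att="writtenForm" val="Bahnhof"/>
    <feat att="grammaticalNumber" val="singular"/>
  </WordForm>
  <WordForm>
    <feat att="writtenForm" val="Bahnhöfe"/>
    <feat att="grammaticalNumber" val="plural"/>
  </WordForm>
</LexicalEntry>
"""

def feats(elem):
    """Collect the direct feat children of an LMF element as a dict."""
    return {f.get("att"): f.get("val") for f in elem.findall("feat")}

entry = ET.fromstring(lmf_xml)
lemma = feats(entry.find("Lemma"))["writtenForm"]
forms = [feats(wf) for wf in entry.findall("WordForm")]
print(lemma, feats(entry), forms)
```

The flat feat representation is what makes generic LMF tooling possible: the same two-line reader applies to any element, regardless of which extension package it comes from.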
Compared to the vanilla TEI guidelines the LMF meta-model is designed much more tightly due to its much narrower target domain, namely NLP applications. This condensed focus makes it a good choice for resource harmonization and alignment for electronic lexicographic resources in complex research infrastructures such as CLARIN-D. A number of research projects have already developed and exploited XML serializations of LMF mainly for the purpose of data exchange, including RELISH, KYOTO and LIRICS.
The Princeton WordNet and conceptually similar databases for languages other than English are probably the most often exploited lexical resources within the NLP community. See e.g. [Fellbaum 1998] for some prototypical applications of wordnets.
In a wordnet, synsets (synonym sets) are the building blocks of the resource. Synsets represent mental concepts and can be linguistically expressed by one or more lexical word forms. These word forms are perceived as synonymous in the context of the synset; they share the same meaning that is represented by the synset. It is possible for a synset not to have a linguistic expression instantiating it in any given language, though (lexical gap). A linguistic expression on the other hand can appear in more than one synset. In many wordnets additional information is provided for a synset to make it more accessible for humans, most notably semantic paraphrases (glosses) or (corpus) quotations.
Among synsets different conceptual relations can be established, which leads to a net-like conceptual structure (hence the name wordnet). If only hyponymy (is_a) and its reverse relation hyperonymy are considered, the conceptual structure becomes tree-like and thus represents a (partial) conceptual hierarchy. Wordnets differ with respect to the enforcement of a tree structure for these relations. Another conceptual relation that is often found is meronymy (part_of), possibly further divided into sub-relations. Different wordnets may maintain different sets of conceptual relations.
While conceptual relations are established among synsets, lexical relations can be established among the linguistic expressions. These relations may be relatively broad and underspecified with respect to the grammatical processes involved (derived_from), but very fine grained lexical relation systems can also be implemented.
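The hyponymy hierarchy can be illustrated with a toy in-memory wordnet. The synset identifiers and member expressions below are invented for illustration and do not come from the Princeton database; each synset carries its lexical expressions and a single is_a link to its hypernym:

```python
# Toy wordnet: synsets keyed by id, each with member lexical
# expressions and a (single) hypernym link; None marks a root.
synsets = {
    "vehicle.n.01": {"lemmas": ["vehicle"], "hypernym": None},
    "car.n.01":     {"lemmas": ["car", "auto", "automobile"], "hypernym": "vehicle.n.01"},
    "cab.n.03":     {"lemmas": ["cab", "taxi", "taxicab"], "hypernym": "car.n.01"},
}

def hypernym_path(synset_id):
    """Walk the is_a chain up to the root, yielding the conceptual hierarchy."""
    path = []
    while synset_id is not None:
        path.append(synset_id)
        synset_id = synsets[synset_id]["hypernym"]
    return path

print(hypernym_path("cab.n.03"))
```

Restricting each synset to one hypernym, as here, is exactly what enforces the tree structure mentioned above; wordnets that allow multiple hypernyms per synset would store a list instead and obtain a directed acyclic graph.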
While the Princeton WordNet traditionally uses so-called lexicographer files (a proprietary plain text database system spread across a number of files), and alternatively a Prolog version of the database, most other wordnet projects have adopted proprietary XML serializations as their distribution format. The Princeton WordNet project is currently also planning a transition to an XML serialization.
The so-called “standard format” used by MDF (Multi-Dictionary Formatter, part of SIL's Toolbox, a software suite that is popular among field linguists) is a set of field marker and value pairs (i.e. feature-value pairs) that is linearized into a plain text file. MDF provides an implicit default hierarchy of its field markers that can be redefined by the user. This effectively allows the choice between the creation of form (i.e. part-of-speech) oriented or sense oriented entries (cf. Example 6.4, “MDF encoded sense oriented Iwaidja lexicon entry”).
The MDF standard format is primarily used for the generation of print dictionaries. It can be mapped onto LMF provided form and sense descriptions within entries are cleanly separated (see [Ringersma/Drude/Kemp-Snijders 2010]). There is also a native XML serialization for the standard format available (Lexicon interchange format, LIFT).
\lx alabanja
\sn 1
\ps n
\de beach hibiscus. Rope for harpoons and tying up canoes is made from this tree species, and the timber is used to make |fv{larrwa} smoking pipes
\ge hibiscus
\re hibiscus, beach
\rfs 205,410; IE 84
\sd plant
\sd material
\rf Iwa05.Feb2
\xv alabanja alhurdu
\xe hibiscus string/rope
\sn 2
\ps n
\de short-finned batfish
\ge short-finned batfish
\re batfish, short-finned
\sc Zabidius novaemaculatus
\sd animal
\sd fish
\rf Iwaidja Fish Names.xls
\so MELP project elicitation
\eb SH
\dt 19/Dec/2006
Note the \sn (sense number) and \ps (part-of-speech) field markers. The example was taken from the [Ringersma/Drude/Kemp-Snijders 2010] presentation.
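Because the standard format is line-oriented plain text, a first processing step is easy to sketch. The following parser is a deliberate simplification: it assumes each field fits on one line and does not rebuild the implicit marker hierarchy (e.g. grouping \ps under \sn), which a real conversion, such as the LMF mapping mentioned below, would have to reimpose:

```python
def parse_mdf(text):
    """Split an MDF "standard format" record into (marker, value) pairs.

    Simplification: one field per line; the implicit marker
    hierarchy is not reconstructed here.
    """
    fields = []
    for line in text.splitlines():
        if line.startswith("\\"):
            marker, _, value = line.partition(" ")
            fields.append((marker[1:], value.strip()))
    return fields

record = "\\lx alabanja\n\\sn 1\n\\ps n\n\\ge hibiscus"
print(parse_mdf(record))
```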
CLARIN-D directly supports LMF serializations and its own internal pivot format TCF. Due to the high structural diversity among lexical resources in TEI or Toolbox/MDF formats it is not feasible to maintain general purpose conversion tools for transformation into CLARIN-D compatible resources. However, a clearing center for lexical resources is operated by the BBAW that provides guidance and support for the creation of project specific conversion tools. You can contact the clearing center via e-mail at dwds@dwds.de.
We encourage users to either contact the clearing center for lexical resources or to provide an LMF serialization of their data if one is already available.
Depending on the research question it may also be appropriate to use a dictionary's text or parts of it as a text corpus, e.g. as a collection of all quoted usage examples. If a lexical resource is intended to be used that way as opposed to a lexical database the recommendations of the section called “Text Corpora” apply.
Existing LMF serializations
LMF is intended to enable the mapping of all existing (NLP) lexicons onto a common model, and LMF compliant resources are directly supported by CLARIN-D. CLARIN-D provides and maintains a (partial) mapping from LMF based serializations to its internal pivot format TCF.
Converting proprietary formats to LMF
To transform a proprietary format into an LMF serialisation the following steps have to be taken:
Following this approach the Princeton WordNet together with the Dutch, Italian, Spanish, Basque and Japanese wordnets were successfully transformed into LMF serializations in the course of the KYOTO project. Alignment across the wordnets was also modeled in LMF, demonstrating the suitability of this framework for representing this subset of lexical resources. Tools and data are available on the project's homepage.
TCF pivot format
Support for lexical resources via TCF – the CLARIN-D internal representation for linguistic data – is still in its infancy. The TCF lexicon module provides means for representing word form based result lists for queries on textual corpora (see the section called “Text Corpora”) or lexical resources. It follows the stand-off annotation paradigm and currently implements the layers lemmas, POStags, frequencies and word-relations, as shown in Example 6.5.
Example 6.5. TCF representation for lexical items
<Lexicon xmlns="http://www.dspin.de/data/lexicon" lang="de">
  <lemmas>
    <lemma ID="l1">halten</lemma>
    <lemma ID="l2">Vortrag</lemma>
  </lemmas>
  <POStags tagset="basic">
    <tag lemID="l1">Verb</tag>
    <tag lemID="l2">Noun</tag>
  </POStags>
  <frequencies>
    <frequency lemID="l1">1257</frequency>
    <frequency lemID="l2">193</frequency>
  </frequencies>
  <word-relations>
    <word-relation type="syntactic relation" func="verb+direct-object" freq="13">
      <term lemID="l1"/><term lemID="l2"/>
      <sig measure="MI">13.58</sig>
    </word-relation>
  </word-relations>
</Lexicon>
The example was taken from [Przepiórkowski 2011]. The technical description of the TCF lexicon model is available at the CLARIN-D website. We do not recommend the direct provision of TCF based versions of lexical resources as TCF is an ever evolving pivot format meant for data representation solely within the CLARIN-D infrastructure. Currently, it is not a suitable common model for lexical resources in the broad sense intended by LMF. The TCF format is tailor-made with the needs of NLP tool chains in mind and will therefore be subject to changes when additional needs for processing within CLARIN-D arise.