Aspects of annotations

	Chapter 3. Resource annotations

In this section we introduce the different technical ways of attaching linguistic annotations to a resource, together with the advantages, disadvantages and consequences of each. We show examples of different annotation styles and of annotations from different linguistic layers, as used by existing resources, in current research projects and/or within the CLARIN-D infrastructure.

Inline vs. stand-off annotations

Annotations are usually attached to the relevant parts of a resource in one of two ways: as inline annotations or as stand-off annotations.

Inline annotations are included directly in the resource, thereby changing the primary data. Examples are token-based or phrase-based annotations which are attached directly to the related string by means of separators or structural tags. We use the term token so that punctuation is included and to avoid having to define the notion word here, as its definition may vary between resources.

Example 3.1, “LOB style inline annotation” shows an inline part-of-speech annotation in the encoding style of the tagged Lancaster-Oslo-Bergen (LOB) corpus. Example 3.2, “Penn Treebank style inline annotation” shows an inline annotation of parts of speech and syntactic phrases using the bracketing style of the Penn Treebank. Both examples represent the sentence Er fährt ein Auto. (“He drives a car.”).

Example 3.1. LOB style inline annotation

Er_PPER fährt_VVFIN ein_ART Auto_NN ._$.

The tags included in this example and in Example 3.2, “Penn Treebank style inline annotation” are NN (common noun), PPER (personal pronoun, excluding reflexive pronouns), VVFIN (finite main verb), ART (determiner), $. (sentence-end marker), S (sentence), NP (noun phrase) and VP (verb phrase). Neither the sentence nor the tags have been taken from the corpora mentioned, which contain English text and use other tagsets; they have been chosen only to illustrate the annotation styles.
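To illustrate how such a separator-based inline annotation can be read back, the following minimal Python sketch splits each item of Example 3.1 at its last underscore. The function name parse_lob is our own choice for illustration and is not part of any corpus tool; the sketch also assumes that items are separated by whitespace.

```python
def parse_lob(line, sep="_"):
    """Split whitespace-separated 'token_TAG' items into (token, tag) pairs.

    The split is made at the LAST separator, so a token that itself
    contains an underscore keeps it.
    """
    pairs = []
    for item in line.split():
        token, tag = item.rsplit(sep, 1)
        pairs.append((token, tag))
    return pairs


pairs = parse_lob("Er_PPER fährt_VVFIN ein_ART Auto_NN ._$.")
for token, tag in pairs:
    print(token, tag)
```

Splitting at the last separator is a design choice of this sketch; a real corpus reader would have to follow the escaping conventions documented for the corpus in question.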

Example 3.2. Penn Treebank style inline annotation

(S 
  (NP (PPER Er)) 
  (VP (VVFIN fährt) 
    (NP (ART ein) (NN Auto))
  )
  ($. .)
)

See Example 3.1, “LOB style inline annotation” for a description of the tags used here.
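The bracketing of Example 3.2 can be turned into a tree with a small recursive reader. The following Python sketch is only an illustration under simplifying assumptions: labels and words contain neither whitespace nor parentheses, and the names parse_brackets and leaves are hypothetical helpers, not part of any treebank toolkit.

```python
import re


def parse_brackets(text):
    """Parse '(LABEL child ...)' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested phrase
            else:
                children.append(tokens[pos])  # word form
                pos += 1
        pos += 1  # consume ')'
        return (label, children)

    return node()


def leaves(tree):
    """Return (word, part-of-speech) pairs in sentence order."""
    label, children = tree
    result = []
    for child in children:
        if isinstance(child, str):
            result.append((child, label))
        else:
            result.extend(leaves(child))
    return result


tree = parse_brackets(
    "(S (NP (PPER Er)) (VP (VVFIN fährt) (NP (ART ein) (NN Auto))) ($. .))"
)
print(leaves(tree))
```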

Stand-off annotations are stored separately from the primary data they refer to, see [Thompson/McKelvie 1997], thereby leaving the primary data untouched. References into primary data, e.g. the original text to be annotated, or into other annotation layers denote the parts to which the annotations belong, see [Zinsmeister et al. 2008]. For these references different mechanisms can be used, often depending on the media type of the primary data.

Example 3.3, “PAULA annotation” shows a stand-off annotation that uses XPointers, see [Grosso et al. 2003], to refer into the primary data. This requires the original resource to be stored in a file that XPointers can address, such as an XML file.

Example 3.3. PAULA annotation

<body>Er fährt ein Auto.</body>

<mark id="tok_1" 
  xlink:href="#xpointer(string-range(//body,'',1,2))"/>
<mark id="tok_2" 
  xlink:href="#xpointer(string-range(//body,'',4,5))"/>

<!-- ... -->	  

<feat xlink:href="#tok_1" value="stts.type_pos.xml#PPER"/>
<feat xlink:href="#tok_2" value="stts.type_pos.xml#VVFIN"/>

<!-- ... -->

The example is encoded in the PAULA format, see [Dipper 2005], version 1.1. Tokens are defined by character positions relative to the referenced string: the first token starts at position 1 and spans 2 characters, the second starts at position 4 and spans 5 characters, and so on, see [Zinsmeister 2010].

Stand-off annotation is also used when annotating media types other than text, e.g. audio files. In the WAVES format, for example, transcriptions and annotations are aligned by timestamps.

The Linguistic Annotation Framework (LAF, [ISO 24612:2012]) refers to primary data via an arbitrary number of anchors specifying medium-dependent positions, e.g. coordinates, frame indexes or pre- and post-character positions in text, video or other kinds of data. For an example see the section called “Exchange and combination of annotations”.
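The position arithmetic behind these references can be sketched in a few lines of Python. The sketch below handles only the exact string-range pattern used in Example 3.3 (empty match string, start position, length) and is not an XPointer implementation; the function name resolve is our own. It further assumes that positions count Unicode code points and that the text is stored in precomposed form, so that ä counts as one character.

```python
import re

PRIMARY_TEXT = "Er fährt ein Auto."

# Matches only the form used in the example:
# string-range(//body,'',START,LENGTH)
STRING_RANGE = re.compile(r"string-range\(//body,'',(\d+),(\d+)\)")


def resolve(href, text):
    """Return the substring that a string-range reference points to."""
    match = STRING_RANGE.search(href)
    if match is None:
        raise ValueError("unsupported reference: " + href)
    start, length = int(match.group(1)), int(match.group(2))
    # Reference positions are 1-based, Python indices are 0-based.
    return text[start - 1:start - 1 + length]


tok_1 = resolve("#xpointer(string-range(//body,'',1,2))", PRIMARY_TEXT)
tok_2 = resolve("#xpointer(string-range(//body,'',4,5))", PRIMARY_TEXT)
print(tok_1, tok_2)
```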

There are also hybrid forms of referencing that combine stand-off and inline approaches, such as the XML format TIGER XML. In this format the original text is segmented into tokens without references into the original resource. The annotation of the syntactic phrases is represented as a separate layer on top of the part-of-speech annotation of the tokens. In the annotation part for the syntactic phrases, the tokens are referred to by identifiers, e.g. s1_1. Example 3.4, “Tiger XML” shows an excerpt of a part-of-speech and syntax analysis of the example sentence given above, this time represented in TIGER XML.

Example 3.4. Tiger XML

<s id="s1" >
    <graph root="s1_500" >
        <terminals>
            <t id="s1_1" word="Er" pos="PPER" />
            <t id="s1_2" word="fährt" pos="VVFIN" />
            <t id="s1_3" word="ein" pos="ART" />
            <t id="s1_4" word="Auto" pos="NN" />
            <t id="s1_5" word="." pos="$." />
        </terminals>
        <nonterminals>
            <nt id="s1_502" cat="NP" >
                <edge idref="s1_1" label="--" />
            </nt>
            <nt id="s1_503" cat="NP" >
                <edge idref="s1_3" label="--" />
                <edge idref="s1_4" label="--" />
            </nt>
            <!-- ... -->
        </nonterminals>
    </graph>
</s>
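The identifier-based references in TIGER XML can be resolved with a standard XML parser. The following Python sketch uses xml.etree.ElementTree from the standard library on a shortened copy of the excerpt above; the variable names are our own, and the elided parts of the example (including the root node s1_500) are simply absent, so this is an illustration rather than a complete TIGER XML reader.

```python
import xml.etree.ElementTree as ET

TIGER_EXCERPT = """<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_1" word="Er" pos="PPER" />
      <t id="s1_2" word="fährt" pos="VVFIN" />
      <t id="s1_3" word="ein" pos="ART" />
      <t id="s1_4" word="Auto" pos="NN" />
      <t id="s1_5" word="." pos="$." />
    </terminals>
    <nonterminals>
      <nt id="s1_502" cat="NP">
        <edge idref="s1_1" label="--" />
      </nt>
      <nt id="s1_503" cat="NP">
        <edge idref="s1_3" label="--" />
        <edge idref="s1_4" label="--" />
      </nt>
    </nonterminals>
  </graph>
</s>"""

root = ET.fromstring(TIGER_EXCERPT)

# Terminal layer: identifier -> word form
words = {t.get("id"): t.get("word") for t in root.iter("t")}

# Phrase layer: resolve each edge to a word form where possible.
# Edges pointing at other nonterminals are kept as identifiers.
phrases = {}
for nt in root.iter("nt"):
    targets = [e.get("idref") for e in nt.iter("edge")]
    phrases[nt.get("id")] = (nt.get("cat"), [words.get(t, t) for t in targets])

for phrase_id, (cat, content) in phrases.items():
    print(phrase_id, cat, content)
```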

For resources other than text, stand-off annotation with references into the original data is the only way to annotate at all, whereas annotators of textual resources have so far used inline annotations in many projects. Processing or querying stand-off annotation always involves resolving links or pointers. The problem with inline annotation, in turn, is that the original resource often cannot be recovered simply by removing the annotation. Difficulties arise, for example, when information such as the placement of whitespace, line breaks or other formatting has not been made explicit in the annotated resource. Overlapping annotations are another difficulty for inline formats, for example when different annotation schemes are applied to the same data, or when formatting information such as page segmentation overlaps with the annotation of linguistic units, e.g. sentences. Moreover, it is more difficult to work on the same resource in parallel, as new annotation layers have to be inserted into the same source document.
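The recovery problem can be made concrete with the LOB-style line from Example 3.1. The Python sketch below removes the tags again and compares the result with the original sentence; it assumes the naive strategy of joining tokens with single spaces, which is our own simplification for illustration.

```python
original = "Er fährt ein Auto."
inline = "Er_PPER fährt_VVFIN ein_ART Auto_NN ._$."

# Strip the tag from every item and join the tokens with single spaces.
stripped = " ".join(item.rsplit("_", 1)[0] for item in inline.split())

print(repr(stripped))        # 'Er fährt ein Auto .'
print(stripped == original)  # False
```

The inline format does not record that there was no space before the full stop, so the original string cannot be reconstructed without additional rules or information.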

Stand-off annotation provides for more sustainability and flexibility: each annotation layer is encapsulated and can coexist with alternative or even conflicting versions of the same type of annotation.

Although more and more projects prefer stand-off annotation for sustainability reasons, there are still cases where inline annotations are needed. Natural language processing tools need to take more (structural) information into account when working on stand-off data, which may have an impact on processing time, especially if large amounts of data have to be processed.

While CLARIN-D recommends the use of stand-off annotations for newly annotated resources, existing resources that use inline annotations can also be hosted by CLARIN-D center repositories. Annotations should be accompanied by thorough documentation and, where possible, follow well-established practices (e.g. TEI, PTB format etc.).

An important representation format in CLARIN-D is the Text Corpus Format (TCF, see [Heid et al. 2010]), which serves as an intermediate format in web service chains. When entering a chain of annotation tools in the WebLicht platform (see Chapter 8, Web services: Accessing and using linguistic tools), the input is first converted into TCF, which is an example of a combined format used in CLARIN-D.

In TCF the input text and its annotations are stored in the same document (technically: one file). The segmented tokens are provided with identifiers (if they are not already present) and stored in a separate section of the file. They do not contain explicit references into the input text. Higher layers of annotation, e.g. part-of-speech annotations, are also stored in separate sections of the file and refer to the token identifiers. This way, input text, segmented text and annotations can be handled separately in the output of an annotation chain. It is important to be able to represent the original text as well as the token layer, so that tool chains and future web services are not restricted in the input layers they can choose from. See Example 3.5, “TCF corpus format” for an illustration of the use of stand-off annotation in TCF.

Example 3.5. TCF corpus format

<TextCorpus xmlns="http://www.dspin.de/data/textcorpus"
  lang="de">
    <text>Er fährt ein Auto.</text>
    <tokens>
        <token ID="t1">Er</token>
        <token ID="t2">fährt</token>
        <token ID="t3">ein</token>
        <token ID="t4">Auto</token>
        <token ID="t5">.</token>
    </tokens>
    <sentences>
        <sentence ID="s1" tokenIDs="t1 t2 t3 t4 t5" />
    </sentences>
    <POStags tagset="STTS">
        <tag tokenIDs="t1">PPER</tag>
        <tag tokenIDs="t2">VVFIN</tag>
        <!-- ... -->
    </POStags>
</TextCorpus>

This example is encoded in TCF version 0.4.
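How a tool might follow the token identifiers in such a document can be sketched with Python's standard xml.etree.ElementTree module. The sketch reads the TextCorpus fragment of Example 3.5 as a standalone document, which is a simplification made for illustration; the variable names are our own. Tokens whose tags are elided in the example simply receive no tag.

```python
import xml.etree.ElementTree as ET

NS = {"tc": "http://www.dspin.de/data/textcorpus"}

TCF = """<TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>Er fährt ein Auto.</text>
    <tokens>
        <token ID="t1">Er</token>
        <token ID="t2">fährt</token>
        <token ID="t3">ein</token>
        <token ID="t4">Auto</token>
        <token ID="t5">.</token>
    </tokens>
    <POStags tagset="STTS">
        <tag tokenIDs="t1">PPER</tag>
        <tag tokenIDs="t2">VVFIN</tag>
        <!-- ... -->
    </POStags>
</TextCorpus>"""

root = ET.fromstring(TCF)

# Token layer: identifier -> word form, in document order
tokens = {t.get("ID"): t.text for t in root.findall("tc:tokens/tc:token", NS)}

# Part-of-speech layer: refers to the token layer by identifier only
pos = {}
for tag in root.findall("tc:POStags/tc:tag", NS):
    for token_id in tag.get("tokenIDs").split():
        pos[token_id] = tag.text

tagged = [(form, pos.get(token_id)) for token_id, form in tokens.items()]
print(tagged)
```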

Multi-layer annotation

As already demonstrated above, it is possible to have different layers of annotation attached to the same primary data. Example 3.6, “TCF corpus format, extended” extends the TCF example above with an additional layer containing annotations of base forms (lemmas) and a syntactic annotation layer, each encapsulated and referring to the token layer. In this case all annotation layers refer directly to the token layer, but there are also cases where one annotation layer refers to another one lying "beneath" it. In these cases, the former layer depends immediately on the latter.

Example 3.6. TCF corpus format, extended

<TextCorpus xmlns="http://www.dspin.de/data/textcorpus"
  lang="de">
    <text>Er fährt ein Auto.</text>
    <tokens>
        <token ID="t1">Er</token>
        <token ID="t2">fährt</token>
        <token ID="t3">ein</token>
        <token ID="t4">Auto</token>
        <token ID="t5">.</token>
    </tokens>
    <sentences>
        <sentence ID="s1" tokenIDs="t1 t2 t3 t4 t5" />
    </sentences>
    <POStags tagset="STTS">
        <tag tokenIDs="t1">PPER</tag>
        <tag tokenIDs="t2">VVFIN</tag>
        <!-- ... -->
    </POStags>
    <lemmas>
        <lemma tokenIDs="t1">er</lemma>
        <lemma tokenIDs="t2">fahren</lemma>
        <!-- ... -->
    </lemmas>
    <parsing tagset="tigertb"><parse>
        <constituent cat="TOP">
            <constituent cat="S-TOP">
                <constituent cat="NP-SB">
                  <constituent cat="PPER-HD-Nom"
                    tokenIDs="t1"/>
                </constituent>
                <constituent cat="VVFIN-HD" tokenIDs="t2"/>
                <!-- ... -->
            </constituent>
            <!-- ... -->
        </constituent>
    </parse></parsing>
</TextCorpus>

The tagsets utilized here are STTS for the part-of-speech annotation and the tags from the syntactic annotations of the TiGer treebank, see [Brants et al. 2002].

Relations between annotation types

Introducing different layers of annotation means introducing relations, and sometimes also dependencies, between the layers. Nearly every linguistic annotation depends on a layer that represents the results of a segmentation process. These segmentation layers therefore define the parts of the resource which can be annotated. The ability to include specific annotations depends on the granularity of the segmentation. For example, part-of-speech annotation cannot be carried out if the smallest annotatable units are sentences; syntactic trees refer to word-like terminals rather than to phonemes. If two annotation layers refer to the same segmentation layer, they can be related to each other via that segmentation layer.
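Relating two layers via a shared segmentation layer amounts to a join on the token identifiers. The following Python sketch assumes that the layers of Example 3.6 have already been read into plain dictionaries; the function name merge_layers and the dictionary keys are hypothetical and chosen for illustration only.

```python
tokens = {"t1": "Er", "t2": "fährt", "t3": "ein", "t4": "Auto", "t5": "."}

# Each annotation layer maps token identifiers to annotation values.
# Only the entries shown in the example are included here.
layers = {
    "pos": {"t1": "PPER", "t2": "VVFIN"},
    "lemma": {"t1": "er", "t2": "fahren"},
}


def merge_layers(tokens, layers):
    """Build one row per token, combining all layers by token identifier."""
    rows = []
    for token_id, form in tokens.items():
        row = {"id": token_id, "form": form}
        for name, layer in layers.items():
            row[name] = layer.get(token_id)  # None if the layer has no entry
        rows.append(row)
    return rows


rows = merge_layers(tokens, layers)
for row in rows:
    print(row)
```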

In multi-level annotation, annotation layers depend on the content and integrity of the layer they refer to. If a layer is changed in any way, e.g. due to manual corrections, a layer referring to it might become invalid, as the parts to which its annotations referred might have changed or be gone.
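A simple integrity check of this kind can be sketched as follows. The Python function below reports annotations whose token references no longer exist in a changed token layer; the function name, the data layout and the scenario (token t4 removed by a manual correction) are assumptions made for illustration.

```python
def dangling_references(token_ids, layer):
    """Return (annotation, missing_ids) for annotations that point at
    token identifiers not present in the token layer."""
    known = set(token_ids)
    problems = []
    for value, refs in layer:
        missing = [token_id for token_id in refs if token_id not in known]
        if missing:
            problems.append((value, missing))
    return problems


# Hypothetical state after a manual correction removed token t4.
corrected_tokens = ["t1", "t2", "t3", "t5"]
pos_layer = [
    ("PPER", ["t1"]),
    ("VVFIN", ["t2"]),
    ("ART", ["t3"]),
    ("NN", ["t4"]),
    ("$.", ["t5"]),
]

problems = dangling_references(corrected_tokens, pos_layer)
print(problems)
```

Such a check only detects references to identifiers that have disappeared. It does not detect tokens whose content changed while keeping the same identifier, which is one reason why versioning information is useful.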

As with representation formats, where stand-off annotation provides higher flexibility, it can be helpful to keep each layer as self-contained as possible. Nevertheless, a specific layer of annotation often implies the existence of other annotation layers. For example, in the annotation scheme of [Riester et al. 2010] for information status, the hierarchical structure relies directly on a constituent-based syntactic layer.

In automatic processing, dependencies also arise due to the tools which are utilized, see also Chapter 7, Linguistic tools.

As resources are to be used for more than one purpose, it is always helpful to make the relations and dependencies explicit. In multi-layer stand-off annotation this should be done by including dependency or versioning information in the metadata of each annotation layer. In addition, before relying on another annotation layer for a new annotation, one should check whether the underlying annotation covers all parts of the resource that are needed, as there are annotations which do not take every part of a resource into account, e.g. because they do not cover punctuation, or because some parts have not (yet) been annotated.
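Both recommendations can be sketched in Python. The record LayerInfo and its fields name, version and depends_on are hypothetical and do not correspond to an existing metadata schema; they merely illustrate how dependency and versioning information could be checked, together with a coverage check over the token layer.

```python
from dataclasses import dataclass


@dataclass
class LayerInfo:
    """Hypothetical metadata record for one annotation layer."""
    name: str
    version: str
    depends_on: dict  # layer name -> version this layer was built against


def stale_dependencies(layer, current_versions):
    """Names of layers that changed since this layer was created."""
    return [
        name
        for name, version in layer.depends_on.items()
        if current_versions.get(name) != version
    ]


def uncovered(token_ids, annotated_ids):
    """Token identifiers that the annotation layer does not cover."""
    annotated = set(annotated_ids)
    return [token_id for token_id in token_ids if token_id not in annotated]


pos_info = LayerInfo(name="pos", version="1", depends_on={"tokens": "1"})

# The token layer has been corrected and is now at version 2.
stale = stale_dependencies(pos_info, {"tokens": "2"})

# A layer that does not annotate punctuation leaves t5 uncovered.
missing = uncovered(["t1", "t2", "t3", "t4", "t5"], ["t1", "t2", "t3", "t4"])

print(stale, missing)
```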

