Chapter 3. Resource annotations

To annotate a resource usually means to enhance it with various types of (linguistic) information [McEnery/Wilson 2001, page 32]. This is done by attaching information to parts of the resource and/or by introducing relations between those parts, which can in turn be annotated with information. The annotated information is therefore always an interpretation of the data with respect to a particular understanding.

Annotating a resource can be done manually by one or more annotators, automatically by a tool, or semi-automatically, e.g. by manually correcting automatic annotations. The annotated information can be represented in terms of atomic or hierarchical tags, complete feature structures or simple feature-value pairs. The whole annotation document is stored or visualized in a specific representation format.

The rules by which information is attributed to parts of the resource are captured in an annotation scheme, which specifies, among other things, guidelines that inform and control the work of human annotators or correctors. [Schiller et al. 1999], for example, specify guidelines for annotating parts of speech. They also provide a finite set of category abbreviations, thus composing a tagset (the Stuttgart-Tübingen tagset for part-of-speech tagging, STTS). Other tagsets for part-of-speech and syntax annotation are used in the Penn Treebank (PTB) project; see, for example, [Santorini 1990] and [Bies et al. 1995].

Linguistic annotation schemes reflect linguistic theories or are tailored to the investigation of a specific phenomenon. They are an essential part of the documentation which accompanies an annotated resource.

However, it is important to distinguish between the concepts of guidelines, tagsets and representation formats. Tagsets often evolve while the annotation guidelines are elaborated, and the tags tend to be abbreviations for the decisions made. They could nevertheless be replaced by other abbreviations; that is, an annotation scheme could comprise more than one tagset, e.g. tagsets differing in granularity. Some tagsets can be understood as taxonomies. In the abbreviations used in STTS, the letters of a tag denote more general information on the left and more specific information on the right: VVFIN is a main verb (VV) which is finite (FIN), and VAFIN is a finite auxiliary verb. This structure can be helpful when queries are conducted via regular expressions.
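
The following minimal sketch illustrates how the left-to-right structure of STTS tags can be exploited in regular-expression queries. The token list and the helper name select are hypothetical and serve only as an illustration; the tags themselves (e.g. PPER, VMFIN, VVPP) are taken from STTS.

```python
import re

# Hypothetical (word, STTS tag) pairs, invented for this illustration.
tagged = [
    ("Er", "PPER"),
    ("hat", "VAFIN"),
    ("gesagt", "VVPP"),
    ("dass", "KOUS"),
    ("sie", "PPER"),
    ("kommt", "VVFIN"),
    ("und", "KON"),
    ("bleiben", "VVINF"),
    ("will", "VMFIN"),
]


def select(pattern, tokens):
    """Return the words whose tag matches the pattern as a whole."""
    rx = re.compile(pattern)
    return [word for word, tag in tokens if rx.fullmatch(tag)]


# General query: every verb tag starts with V.
all_verbs = select(r"V.*", tagged)

# More specific on the left: main verbs only (VV...).
main_verbs = select(r"VV.*", tagged)

# More specific on the right: finite verbs of any verb class (V?FIN).
finite_verbs = select(r"V.FIN", tagged)

print(all_verbs)
print(main_verbs)
print(finite_verbs)
```

Because the general category sits on the left of the tag, a prefix pattern such as V.* selects a whole class, while fixing the suffix (FIN) cuts across the verb classes.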

Some annotations cannot make use of a predefined tagset, for example when annotating synonyms. The representation format, in turn, is distinct from both the guidelines and the tagsets. It specifies how the annotation content is represented. One has to be aware that, on the one hand, annotations with the same content can be represented in very different ways and that, on the other hand, using the same representation format (e.g. an XML format like TIGER XML) for two annotations does not mean that the annotation content of the two is the same, or even similar.
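
The difference between annotation content and representation format can be sketched as follows. The same two part-of-speech annotations are written once as inline word/tag pairs and once as a small XML fragment. Both formats are assumptions made up for this illustration; in particular, the element and attribute names (s, t, word, pos) are hypothetical and do not reproduce TIGER XML or any other existing format.

```python
import xml.etree.ElementTree as ET

# Annotation content: hypothetical (word, STTS tag) pairs.
tokens = [("sie", "PPER"), ("kommt", "VVFIN")]


def to_inline(tokens):
    """Represent the annotation as space-separated word/tag pairs."""
    return " ".join(f"{word}/{tag}" for word, tag in tokens)


def from_inline(text):
    """Read word/tag pairs back into (word, tag) tuples."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def to_xml(tokens):
    """Represent the annotation as an ad hoc XML fragment."""
    sentence = ET.Element("s")
    for word, tag in tokens:
        ET.SubElement(sentence, "t", word=word, pos=tag)
    return ET.tostring(sentence, encoding="unicode")


def from_xml(text):
    """Read the ad hoc XML fragment back into (word, tag) tuples."""
    return [(el.get("word"), el.get("pos")) for el in ET.fromstring(text)]


inline = to_inline(tokens)
xml_text = to_xml(tokens)

# Two different representations, identical annotation content.
same_content = from_inline(inline) == from_xml(xml_text) == tokens
print(inline)
print(xml_text)
print(same_content)
```

Conversely, a second document using the same XML layout could carry tags from an entirely different tagset, so a shared format alone says nothing about whether the content is comparable.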

Lastly, as stated by [McEnery/Wilson 2001, page 33] and [Leech 1993], one has to be aware that any act of annotating a resource is also an act of interpretation, either of its structure or of its content. Therefore an annotation never represents a universal consensus. Moreover, it has to be taken into account that the process of annotating is also likely to introduce errors; see Chapter 5, Quality assurance.