Aspects of the quality of resources

Corpora and lexical resources are considered well-formed if they conform to a particular defined format. This does not refer to linguistic concepts but to the general document grammar as specified, for example, in an XML document type definition (DTD) or schema. The schema determines the structure of the document in terms of eligible markup. If an XML document conforms to such a specification, it is said to be valid with regard to this DTD or schema. One method of quality assurance is to make any such specification available together with the resource.
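Checking well-formedness is easy to automate; full validity checking against a DTD or schema additionally requires a schema-aware validator such as lxml or xmllint. As a minimal sketch using only the Python standard library, which can test well-formedness but not validity:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_string):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

# Matching tags: well-formed (the tagset shown is invented for illustration).
print(is_well_formed("<corpus><s><w pos='NN'>tree</w></s></corpus>"))  # True
# Mismatched tags: not well-formed.
print(is_well_formed("<corpus><s></corpus>"))  # False
```

Whether the document is additionally *valid* depends on the DTD or schema distributed with the resource, which is why shipping that specification alongside the data matters.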

A resource is adequate only with respect to a particular theory or a particular application. It is therefore essential to provide information about these. In the case of the linguistic annotation of corpora – and also to a certain extent for lexical resources – this is normally done by means of detailed guidelines that specify the tagset and define the (linguistic) concepts on which the tagset relies. Ideally, the guidelines also provide linguistic tests that human annotators can apply, and discussions of problematic cases, which make the annotation decisions comprehensible.

With respect to annotated resources, consistency […] means that the same linguistic phenomena are annotated in the same way, and similar or related phenomena must receive annotations that represent their similarity or relatedness if possible [Zinsmeister et al. 2008, page 764]. Consistency is undermined by two major factors: ambiguity and vagueness in the data, on the one hand, and errors made by annotators, on the other hand. Errors made by annotation tools, in contrast, are normally of a consistent nature.

One should be aware that any annotation contains errors. In addition to errors made by human annotators, virtually any tool that decides how to annotate a resource will make some mistakes and analyze certain instances of the data inappropriately. Therefore, the expectation is that some sort of evaluation of the annotation quality is provided together with the resource. For automatic annotation tools, this quality information can be based on evaluating the tools against a gold standard – a resource that is accepted as adequately annotated – and reporting standardized scores such as precision, recall, and F-measure (or other task-related measures as developed in shared tasks such as the CoNLL shared task initiative). Nevertheless, one has to keep in mind that gold standards, too, contain debatable decisions. Each processing step in the course of creating a corpus is an act of interpretation. Even transcribing non-digitized texts is subject to interpretation if the quality of the master copy is poor, as is sometimes the case with historical data, or if an originally handwritten script is unreadable.
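Precision, recall, and F-measure are straightforward to compute once gold and automatic annotations are aligned. A minimal sketch, assuming annotations are represented as sets of (position, label) pairs (the tags shown are invented for illustration):

```python
def precision_recall_f1(gold, predicted):
    """Token-level evaluation against a gold standard.

    gold, predicted: sets of (position, label) pairs.
    """
    tp = len(gold & predicted)   # correct annotations
    fp = len(predicted - gold)   # spurious annotations
    fn = len(gold - predicted)   # missed annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, "NN"), (1, "VB"), (2, "NN"), (3, "JJ")}
pred = {(0, "NN"), (1, "NN"), (2, "NN"), (3, "JJ")}
p, r, f = precision_recall_f1(gold, pred)  # one tag wrong out of four
```

For single-label tagging tasks precision and recall coincide with accuracy, as here; for tasks with optional or spanning annotations the three scores diverge.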

For manual annotations, it is recommended that more than one annotator work on the same part of the data, calculating inter-annotator agreement afterwards [Artstein/Poesio 2008]. The same holds for the manual transcription of non-digitized text. In the Digital Humanities, it is standard to employ the double-keying method, in which two operators create two independent transcriptions that are subsequently compared, see [Geyken et al. 2011].
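Artstein and Poesio discuss a family of chance-corrected agreement coefficients; Cohen's kappa is among the most common for two annotators. A minimal sketch (the label names are invented for illustration):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected chance agreement: both annotators pick the same label
    # independently, given their individual label distributions
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["NN", "VB", "NN", "JJ", "NN", "VB"]
b = ["NN", "VB", "NN", "NN", "NN", "JJ"]
kappa = cohen_kappa(a, b)  # 4/6 raw agreement, corrected for chance
```

Kappa of 1 indicates perfect agreement, 0 agreement no better than chance; which threshold counts as "reliable" is itself debated in the literature.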

From the results of inter-annotator agreement, conclusions can be drawn about the ambiguity of either the guidelines, see [Zinsmeister et al. 2008], or the phenomenon to be annotated. The guidelines can be refined in a kind of bootstrapping approach, thereby also changing the tagset, when the results of different annotators imply that two categories cannot be clearly differentiated.
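Which category pairs a tagset fails to differentiate can be read off the annotators' disagreements. A minimal sketch that counts disagreeing label pairs (the tags shown are invented for illustration):

```python
from collections import Counter

def disagreement_pairs(labels_a, labels_b):
    """Count disagreeing label pairs, ignoring annotator order."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs.most_common()

a = ["JJ", "VBN", "JJ", "NN", "VBN", "JJ"]
b = ["VBN", "JJ", "VBN", "NN", "VBN", "JJ"]
top = disagreement_pairs(a, b)
# The pair (JJ, VBN) dominates: a candidate for sharper guidelines or,
# in a bootstrapping revision, for merging the two categories.
```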

Besides improving the tools themselves, by inspecting their results, training them on tailored resources, or adapting the features they use, there are also mechanisms that operate on the annotated resources. For example, software to automatically detect inconsistencies in text corpora, taking part-of-speech tagging as well as constituent-based and dependency-based syntactic annotations into account, has been developed within the DECCA project [Boyd et al. 2008].
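The DECCA approach builds on variation n-grams: a word that occurs in an identical context but receives different tags is a candidate inconsistency. A much-simplified sketch of that idea (not the DECCA implementation itself):

```python
from collections import defaultdict

def tagging_variations(tagged_sentences, context=1):
    """Flag tokens tagged differently in identical contexts.

    tagged_sentences: lists of (token, tag) pairs. A word appearing with
    the same surrounding words but different tags is returned as a
    potential inconsistency (cf. the variation n-grams of DECCA).
    """
    seen = defaultdict(set)
    for sent in tagged_sentences:
        tokens = [w for w, _ in sent]
        for i, (word, tag) in enumerate(sent):
            left = tuple(tokens[max(0, i - context):i])
            right = tuple(tokens[i + 1:i + 1 + context])
            seen[(left, word, right)].add(tag)
    return {key: tags for key, tags in seen.items() if len(tags) > 1}

sents = [
    [("a", "DT"), ("light", "JJ"), ("load", "NN")],
    [("a", "DT"), ("light", "NN"), ("load", "NN")],
]
variations = tagging_variations(sents)  # flags "light" in "a _ load"
```

Flagged variations are not necessarily errors; some reflect genuine ambiguity, which is why such tools propose candidates for manual inspection rather than correcting automatically.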

All metadata describe the resource they belong to in some way or other and contribute to qualifying it for reuse.

For assessing the inherent quality of a resource, it is particularly relevant that the metadata keep a record of all processing steps the resource has undergone. This includes information about the method applied (e.g., in terms of guidelines) and the tools used. For all manual annotation steps, including transcription and correction, the metadata should provide information about inter-annotator agreement to specify the reliability of the resource, as requested in the section called “Adequacy and consistency”. If the resource is itself a tool, the metadata ideally also include gold-standard evaluation or shared-task results for the same purpose. The latter is not yet part of standard metadata sets. Very often this kind of consistency and performance information is only provided indirectly, in terms of a reference to a publication that reports these measures. However, providing a reference to a publication on the resource is in itself part of high-quality metadata: it equips the user with a standard reference to be cited when the resource is used.
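What such a processing record might look like is sketched below; the field names and values are hypothetical, not taken from any standard metadata set:

```python
import json

# Hypothetical structure for illustration only -- a real project would
# follow an established metadata schema. The point is that each
# processing step, the guidelines used, and the measured reliability
# are recorded, together with a citable reference.
metadata = {
    "resource": "example-treebank",
    "processing_steps": [
        {"step": "transcription", "method": "double keying",
         "agreement": {"measure": "character accuracy", "value": 0.998}},
        {"step": "pos tagging", "tool": "example-tagger v1.2",
         "guidelines": "project tagset manual, version 3"},
        {"step": "manual correction",
         "agreement": {"measure": "Cohen's kappa", "value": 0.91}},
    ],
    "reference": "Citable publication describing the resource",
}
print(json.dumps(metadata, indent=2))
```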

High-quality metadata for tools will specify the requirements on the input format of the data the tool is applied to, and will also describe the output format. In a similar vein, metadata for corpora and lexical resources might recommend tools that can be used to query, visualize, or otherwise process the resource – if it is developed in a framework that offers such tools.