Manual annotation and analysis tools

Automatic annotation is not the end goal of most computational linguistics: linguistic tools are used to construct new resources and to do linguistic research. Manual annotation and analysis tools touch on topics that have little place in automatic annotation technology: visualization, usability, and human factors in linguistic analysis.

The earliest linguistic corpora were annotated by adding information to the raw text manually. Human annotation is still performed, sometimes to correct the output of imperfect automatic annotation software, and sometimes because no adequate automatic software exists. Human-corrected annotation, which is presumed to have fewer errors than automatic annotation, is used to create gold standard corpora, which are very important for creating, training and evaluating automatic annotation tools. See Chapter 5, Quality assurance, but see also the section called “Automatic and manual analysis tools”.

Annotating corpora manually is slow work, and one of the roles of linguistic tools is to make that task easier.

Manual annotation tools may involve one or more levels of analysis. Many specialized tools are oriented towards particular kinds of annotation and research corpora, while others are very general and applicable to many different linguistic theories and kinds of analysis.

Implementations of manual annotation tools:

Annotate

Annotate is a corpus annotation tool that is freely available for research purposes. It allows linguists to annotate the syntactic structure of sentences either manually or with the help of external tools like taggers and parsers, and see the results in an easy-to-read format on screen. Annotated corpora are stored in a sharable relational database and there is a permissions control scheme for users, granting them individual read and/or write access to each project. Multiple corpora can be hosted in a single Annotate installation.

Annotate works with input encoded in several versions of the NeGra corpus format. Annotation tags can be specified for different levels of analysis, e.g. morphological tags, PoS tags and syntactic tags.

SALTO

SALTO is a graphical tool for manual annotation of treebanks. It has been designed to work with syntactic trees and frame semantics annotation, but has also proven useful for other kinds of annotation.

Data is imported into SALTO either in the TIGER XML format or in its own SALSA XML format [Erk/Pado 2004]. Because the TIGERSearch-related tool TIGERRegistry can convert a number of different treebank formats into TIGER XML, many different input sources can be used. The output format is SALSA XML, an extension of TIGER XML. The added annotations cannot be displayed or edited by TIGER tools.
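To give a rough idea of the kind of tree data involved, the following Python sketch reads a small hand-written document in the general shape of TIGER XML. The sample is invented for illustration and is not taken from any corpus. The element and attribute names (s, graph, terminals, t, nonterminals, nt, edge, word, pos, cat, label, idref) follow the commonly described TIGER XML layout and should be checked against the format documentation before use. The function name read_sentences is hypothetical, and the sketch does not cover the SALSA XML extensions.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written sample in the general shape of TIGER XML.
# Illustrative only: it is not taken from a real corpus.
SAMPLE = """\
<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Dogs" pos="NNS"/>
          <t id="s1_2" word="bark" pos="VBP"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SB" idref="s1_1"/>
            <edge label="HD" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>
"""


def read_sentences(xml_text):
    """Return one dict per sentence with its tokens and constituents."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # Terminal nodes: token id -> (word form, part-of-speech tag)
        tokens = {t.get("id"): (t.get("word"), t.get("pos")) for t in s.iter("t")}
        # Nonterminal nodes: (node id, category, [(edge label, child id), ...])
        constituents = []
        for nt in s.iter("nt"):
            children = [(e.get("label"), e.get("idref")) for e in nt.findall("edge")]
            constituents.append((nt.get("id"), nt.get("cat"), children))
        sentences.append({"id": s.get("id"), "tokens": tokens,
                          "constituents": constituents})
    return sentences


sentences = read_sentences(SAMPLE)
for sent in sentences:
    for nt_id, cat, children in sent["constituents"]:
        # List the words directly dominated by this constituent.
        words = [sent["tokens"][ref][0] for _, ref in children if ref in sent["tokens"]]
        print(cat, words)
```

Keeping terminals and nonterminals in separate structures that refer to each other by identifier mirrors the way the sample encodes the tree. A real corpus would also need handling for secondary edges and for nonterminals that dominate other nonterminals.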

Searching and retrieving the contents of annotated corpora is one of the main ways in which linguistic tools support general linguistic research. Relatively few specialized tools exist for this purpose, and many linguists use simple text tools like grep [Bell1979] to do research. Nonetheless, a number of applications are available specifically for search and retrieval in annotated language data.
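The grep-style approach can be sketched in a few lines of Python. The example assumes a hypothetical corpus layout in which each line holds one sentence and each token is written as word/TAG. The layout, the tag names and the function name grep_corpus are assumptions made for this illustration only.

```python
import re

# Hypothetical corpus lines in a simple word/TAG layout.
LINES = [
    "The/DT old/JJ dog/NN barked/VBD",
    "A/DT cat/NN slept/VBD",
    "Green/JJ ideas/NNS sleep/VBP",
]

# An adjective immediately followed by a singular or plural noun.
PATTERN = re.compile(r"(\S+)/JJ (\S+)/NNS?(?=\s|$)")


def grep_corpus(lines, pattern):
    """Return (line number, adjective, noun) for every match in the lines."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            hits.append((number, match.group(1), match.group(2)))
    return hits


hits = grep_corpus(LINES, PATTERN)
for number, adjective, noun in hits:
    print(number, adjective, noun)
```

A pattern like this works only while the annotation fits on one line per sentence. Tree structures and multi-level annotation quickly exceed what plain regular expressions over text can express, which is where the specialized tools come in.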

Implementations of annotated corpus access tools:

TIGERSearch

TIGERSearch is a particularly widely used application for searching in treebanks. Users can import existing corpora into TIGERSearch and then query them using a lightweight search language developed specifically for TIGERSearch.

The results of queries are displayed visually and can be easily inspected. Matching patterns are clearly highlighted, so very small structures are quickly identified even in large trees. Results can be exported either as images (JPG, TIF, SVG, etc.) or as XML files. The statistics export module can be used to compute the frequency distribution for specific constructs in the query.

TIGERSearch uses the TIGER-XML format to encode corpora, but import filters are available for other popular formats such as Penn Treebank, NeGra and Susanne.

ANNIS2

ANNIS2 is a versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation.

DDC-Concordance

DDC-Concordance is a tool for searching corpora with and without morphological markup. It is cross-lingual, but lemma searching is limited to English, German and Russian. Client APIs are available for Perl, Python, C and PHP.

Corpus Query Processor (CQP)

CQP is a query engine for annotated corpora. It supports a range of annotation types and queries, including regular expressions and matches over words that are not adjacent to each other, along with functions for arranging results for convenient viewing.
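The idea of matching over words that are not next to each other can be illustrated with a minimal Python sketch. It does not use CQP or its query language. It only shows, under simplifying assumptions, what a query with regular expressions over tags and a bounded gap between tokens has to compute. The token list, the tag names and the function name find_with_gap are hypothetical.

```python
import re

# Hypothetical token list: (word, part-of-speech tag) pairs.
TOKENS = [
    ("the", "DT"), ("big", "JJ"), ("and", "CC"), ("old", "JJ"),
    ("house", "NN"), ("stood", "VBD"), ("empty", "JJ"),
]


def find_with_gap(tokens, first_tag, second_tag, max_gap):
    """Find index pairs (i, j) where the tag of token i fully matches the
    regular expression first_tag, the tag of token j fully matches
    second_tag, and at most max_gap tokens lie between them."""
    first = re.compile(first_tag)
    second = re.compile(second_tag)
    matches = []
    for i, (_, tag_i) in enumerate(tokens):
        if not first.fullmatch(tag_i):
            continue
        # Last position that still respects the gap limit.
        last = min(i + 1 + max_gap, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if second.fullmatch(tokens[j][1]):
                matches.append((i, j))
    return matches


# An adjective followed by a noun with up to two tokens in between.
pairs = find_with_gap(TOKENS, r"JJ", r"NN.*", max_gap=2)
spans = [" ".join(word for word, _ in TOKENS[i:j + 1]) for i, j in pairs]
for span in spans:
    print(span)
```

This sketch scans the token list linearly for every query. A dedicated query engine would normally rely on indexes built over the corpus, so that queries of this kind remain practical on large data.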