Automatic and manual analysis tools

Some computational linguistic tools exist just to provide linguists with interfaces to stored language data. The earliest digital corpus tools were made to search in texts and display the results in a compact way, like the widespread KWIC (key word in context) tools that date back to the 1970s.
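The idea behind a KWIC display can be illustrated with a minimal sketch. It assumes the text is a single plain string without line breaks; the function name `kwic` and the `width` parameter are invented for this illustration and do not belong to any particular corpus tool.

```python
import re

def kwic(text, keyword, width=30):
    """Return one display line per occurrence of `keyword` in `text`."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that all hits line up in one column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

sample = (
    "The cat sat on the mat. A dog chased the cat across the yard, "
    "but the cat escaped."
)
for line in kwic(sample, "cat", width=20):
    print(line)
```

Each hit is printed on its own line with the keyword in a fixed column, which is what makes a concordance quick to scan. Real concordancers usually work on tokenized and indexed corpora rather than on raw strings.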

However, many current linguistic tools are used to produce an annotated resource. Annotation involves the addition of detailed information to a linguistic resource about its contents. For example, a corpus in which each word is accompanied by a part-of-speech tag, or a phonetic transcription, or in which all named entities are clearly marked, is an annotated corpus. Linguistic data in which the syntactic relationships between words are marked is usually called a treebank. The addition of annotations can make texts more accessible and usable for linguistic research, and may be required for further processing. Corpus annotation is discussed in greater depth in Chapter 3, Resource annotations.
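As a rough illustration of what an annotated corpus looks like as data, the following sketch stores a part-of-speech tag and a named-entity label with each word. The class name `Token`, the tag names and the label `"O"` for words outside any entity are assumptions made for this example, not a prescribed annotation scheme.

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str          # the word as it appears in the text
    pos: str           # part-of-speech tag (illustrative tag set)
    entity: str = "O"  # named-entity label; "O" means "not part of an entity"

sentence = [
    Token("Ada", "PROPN", "PERSON"),
    Token("Lovelace", "PROPN", "PERSON"),
    Token("lived", "VERB"),
    Token("in", "ADP"),
    Token("London", "PROPN", "LOCATION"),
]

def named_entities(tokens):
    """Group adjacent tokens that carry the same entity label."""
    entities, current, label = [], [], "O"
    for token in tokens:
        if token.entity != label and current:
            # The label changed, so the entity collected so far is complete.
            entities.append((" ".join(current), label))
            current = []
        label = token.entity
        if label != "O":
            current.append(token.form)
    if current:
        entities.append((" ".join(current), label))
    return entities

print(named_entities(sentence))
```

Because the annotations are explicit, a query such as "all named entities" becomes a simple pass over the data rather than a new analysis of the raw text.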

Automatic annotation tools add detailed information to language data on the basis of procedures written into the software, without human intervention other than to run the program. Automatic annotation is sometimes performed by following rules set out by programmers and linguists, but most often, annotation programs are at least partly based on machine learning algorithms that are trained using manually annotated examples.

The vast majority of automatic annotation tools for linguistic data today are based on statistical machine learning principles of some kind. This approach to problem-solving in computer science goes back to the early 1950s, when it was first applied to computer programs for playing board games. Arthur Samuel's work at IBM on programs to play checkers is usually credited as the first non-trivial work in machine learning. Although there were some very promising early results, there was very little progress in machine learning from the mid-1960s to the early 1990s, a period often called the AI Winter. There was also very little use of statistical methods in linguistics or natural language processing during that period, for reasons that linguists still debate. However, a series of theoretical breakthroughs, followed by very strong practical demonstrations of the value of statistical analysis in linguistics, turned the tide in the 1990s, and automatic linguistic annotation is now overwhelmingly based on statistical machine learning techniques.

The history of natural language processing, including the explosive growth of machine learning and of statistical and empirical methods in the field since the early 1990s, is discussed in part in [Jurafsky/Martin 2009], the standard textbook for NLP. It overlaps heavily with the histories of linguistics, artificial intelligence and computers, all of which are fast-changing fields. The first chapter of [Russel/Norvig 2009] gives a broad but brief treatment of the history of artificial intelligence that covers much of what is described in this section, but no up-to-date, general history of the field is available.

Machine learning is a complex topic, but from a practical standpoint, many linguistic annotation applications learn to annotate texts from manually annotated linguistic data. This process is usually called training. A linguist prepares a manually annotated corpus, and a computer program then processes this data and learns to replicate the linguist's manual annotation automatically on new corpora. Programs of this kind can often be adapted to new languages and annotation schemes if they are provided with appropriate annotated data to learn from.
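Training can be made concrete with a deliberately simple sketch: a tagger that learns, from a tiny hand-annotated sample, the most frequent tag for each word. This is a common baseline technique rather than a description of how any particular annotation tool works; the function names, the tag set and the training data are invented for the example.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Learn the most frequent tag per word from manually annotated sentences."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    model = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    # Words never seen in training receive the overall most frequent tag.
    default_tag = all_tags.most_common(1)[0][0]
    return model, default_tag

def annotate(words, model, default_tag):
    """Tag new, unannotated words using what was learned in training."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]

# Hypothetical manually annotated training corpus.
training_data = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]

model, default_tag = train(training_data)
print(annotate(["The", "cat", "barks"], model, default_tag))
```

Replacing `training_data` with annotated sentences in another language or tag set retrains the same program without changing its code, which is the sense in which such tools can be adapted to new languages and annotation schemes.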

Linguistic tools that use machine learning generally save information about the tasks they are trained to perform in files that are separate from the program itself. Using these tools may require specifying the location of that learned information.
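The separation between a program and its learned information can be sketched as follows. The example assumes a model simple enough to be stored as JSON; the file name `tagger-model.json` and both function names are hypothetical, and real tools generally use their own, often binary, model formats.

```python
import json
import tempfile
from pathlib import Path

def save_model(model, path):
    """Write the learned information to a file separate from the program."""
    Path(path).write_text(
        json.dumps(model, ensure_ascii=False, indent=2), encoding="utf-8"
    )

def load_model(path):
    """Read previously learned information back from the given location."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

# A stand-in for whatever a training step has produced.
model = {"default_tag": "NOUN", "word_tags": {"the": "DET", "dog": "NOUN"}}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "tagger-model.json"
    save_model(model, model_path)
    # A later run of the tool only needs to be told where the file is.
    restored = load_model(model_path)

print(restored == model)
```

In this arrangement the same program can be pointed at different model files, for instance one per language, simply by changing the path it is given.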

Automated annotation processes, whether based on machine learning or on rules, almost always have some rate of error. Before using any automatic annotation tool, it is important to consider its error rate and how those errors will affect whatever further purpose the annotated corpora are used for. See also Chapter 5, Quality assurance.
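One common way to estimate an error rate is to compare a tool's output with a manually annotated sample of the same text. The sketch below assumes both annotations are available as lists of tags aligned token by token; the function name `evaluate` and the sample tags are invented for the illustration.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Compare automatic tags with manual ones, token by token."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot evaluate an empty sample")
    # Count each (manual tag, automatic tag) pair where the two disagree.
    confusions = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    error_rate = sum(confusions.values()) / len(gold)
    return error_rate, confusions

gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]        # manual annotation
predicted = ["DET", "NOUN", "NOUN", "DET", "VERB"]   # hypothetical tool output

error_rate, confusions = evaluate(gold, predicted)
print(f"error rate: {error_rate:.0%}")
for (manual, automatic), n in confusions.most_common():
    print(f"{manual} tagged as {automatic}: {n}")
```

Listing which tags are confused with which is often more informative than the overall rate, because it shows whether the errors fall on exactly the categories that matter for the research question at hand.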

Where possible, researchers prefer manually annotated resources with fewer errors. Fewer errors does not mean no errors, however. We often treat manually annotated data as an objectively correct standard, and we do so for two reasons: first, we expect human errors to be random and unbiased when compared to the systematic errors made by imperfect software; second, we have no other means of constructing correct annotations than to have trained people make them. Highly reliable manually annotated resources are, naturally, more expensive to construct, rarer and smaller in size than automatically annotated data, but they are essential for the development of automated annotation tools and are necessary whenever the desired annotation procedure either has not yet been automated or cannot be automated. Various tools exist to make it easier for researchers to annotate language data, some of which are described in the section called “Manual annotation tools” and the section called “Multimedia tools”.