Chapter 7. Linguistic tools

Scott Martens, Kathrin Beck, Thomas Zastrow

Universität Tübingen

Computational linguistic tools are programs that perform operations on linguistic data, i.e. analyses, transformations or other tasks that add to or change language data, or that assist people in performing such tasks. In this section we provide an introduction to the general classes of linguistic tools and what purposes they serve. It is intended to provide computer programmers, technicians, and humanities researchers outside of computational linguistics with a background for understanding linguistic tools in order to better use the CLARIN-D infrastructure and identify how it can meet their needs. This means discussing some notions from linguistics that are specifically relevant to understanding language processing tools.

The CLARIN-D project is focused on providing an infrastructure for the maintenance and processing of language resources, in which natural language processing (NLP) and computational linguistics (CL) play prominent roles. CLARIN-D supports many different tools of very different kinds, without regard for the scientific and theoretical claims behind those tools. Individual linguistic tools and resources may be based on specific linguistic schools or theoretical claims, but the CLARIN-D infrastructure is neutral with respect to those theories.

Many computational linguistic tools are extensions of pre-computer techniques used to analyze language. Tokenization, part-of-speech tagging, parsing and word sense disambiguation, as well as many others, all have roots in the pre-computer world, some going back thousands of years. Computational tools automate these long-standing analytical techniques, often imperfectly but still productively.

Other tools, in contrast, are exclusively motivated by the requirements of computer processing of language. Sentence splitters, bilingual corpus aligners, and named entity recognizers, among others, only make sense in the context of computers. They have little immediate connection to general linguistics but may be very important for computer applications.

Linguistic tools can also encompass programs designed to enhance and facilitate access to digital language data. Some of these are extensions of pre-computer techniques like indexing and concordancing (i.e. preparing a sorted list of words or other elements in a text, along with their immediate context, or all of the locations in the text where they occur, or both). But linguistic tools can also include more recent developments like search engines, some of which are already sensitive to the linguistic annotation of primary data. Textual information retrieval is a large field and in this user guide we will discuss only search and retrieval tools specialized for or based on linguistic analysis.
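To make the notions of indexing and concordancing mentioned above more concrete, the following is a minimal sketch in Python. It is not a CLARIN-D tool: the function names (`tokenize`, `build_index`, `concordance`), the sample text, and the deliberately naive tokenizer based on a regular expression are all assumptions made for illustration only. The index records every position at which a word occurs, and the concordance lists each occurrence of a keyword together with a small window of surrounding words.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_index(tokens):
    """Map each word to the list of token positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return dict(index)


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for position, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, position - window):position])
            right = " ".join(tokens[position + 1:position + 1 + window])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, chosen only for illustration.
text = "The cat sat on the mat. The dog saw the cat."
tokens = tokenize(text)
index = build_index(tokens)

# Print the index as a sorted word list with positions.
for word in sorted(index):
    print(word, index[word])

# Print a keyword-in-context listing for one word.
for left, word, right in concordance(tokens, "cat"):
    print(f"{left:>12} | {word} | {right}")
```

Real concordancers typically work on properly tokenized and often linguistically annotated text, so that a search can target, for instance, a lemma or a part of speech rather than a literal word form. The sketch above ignores such annotation entirely.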