Technical issues in linguistic tool management

Many of the most vexing technical issues in using linguistic tools are common problems in computer application development. Data can be stored and transmitted in an array of incompatible formats generated by different applications. Either there are no standard formats, or there are too many de facto standards, or standards compliance is poor. Linguistic tool designers are rarely concerned with these issues or specialized in resolving them. Part of CLARIN-D's mandate, in developing infrastructure for language research, is to overcome these technical problems.

Many tools accept and produce only plain text data, sometimes with specified delimiters and internal structures that remain accessible to ordinary text editors. Some tools require each word to be on a separate line; others require each sentence to be on a separate line. Some use comma- or tab-delimited text files to encode annotation in their output, or require such files as input. These encodings are often historically rooted in the data storage formats of early language corpora. A growing number of tools use XML as their input or output format.
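As an illustration of how such plain text layouts relate to one another, the following Python sketch converts between a one-sentence-per-line layout and a one-word-per-line layout, and reads a tab-delimited annotation file. The function names, the use of a blank line to separate sentences, and the two-column token/tag layout are assumptions made for this example; they do not describe the format of any particular tool.

```python
def sentences_to_vertical(text):
    """Convert one-sentence-per-line text to one-token-per-line text.

    Assumption: tokens are already separated by spaces. Sentences are
    separated by a blank line in the output.
    """
    blocks = []
    for line in text.splitlines():
        tokens = line.split()
        if tokens:
            blocks.append("\n".join(tokens))
    return "\n\n".join(blocks) + "\n"


def vertical_to_sentences(text):
    """Inverse conversion: blank-line-separated token blocks back to lines."""
    sentences = []
    current = []
    for line in text.splitlines():
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            # A blank line closes the current sentence.
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return "\n".join(sentences) + "\n"


def read_tagged(text):
    """Parse hypothetical 'token<TAB>tag' lines into (token, tag) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip sentence-separating blank lines
        token, tag = line.split("\t")
        pairs.append((token, tag))
    return pairs
```

For example, `sentences_to_vertical("The cat sleeps .\nIt purrs .\n")` yields one token per line with a blank line between the two sentences, and `vertical_to_sentences` restores the original layout. Real tools differ in how they mark sentence boundaries and in how many columns they expect, so a converter of this kind usually has to be adapted for each tool.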

Character encoding formats are an important technical issue when using linguistic tools. Most languages use some characters outside 7-bit ASCII (the standard letters and punctuation used in English), and there are different standards for characters in different languages and operating systems. Unicode and its most widespread encoding, UTF-8, are increasingly used for written language data because they can represent nearly all characters from nearly all modern languages. But not all tools support Unicode. Full Unicode compatibility has only recently become available on most operating systems and programming frameworks, and many non-Unicode linguistic tools are still in use. Furthermore, the Unicode standard supports so many different characters that simple assumptions about texts, such as which characters constitute punctuation and spaces between words, may differ between Unicode texts in ways that are incompatible with some tools. For example, there are currently more than 20 different whitespace characters in Unicode, several dashes and many more characters that look much like dashes, as well as multiple variants of many common kinds of punctuation. In addition, many alphabetic letters appear in a number of variants in Unicode, some ligatures can be encoded as individual Unicode characters, and many other variations exist that are perfectly readable for humans but anomalous for computer processing.
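The following Python sketch shows one way to detect and reduce this kind of variation before passing text to a tool that expects plain spaces and hyphens. It uses only the standard `unicodedata` module. The function names are hypothetical, and the choice of NFKC normalization is an assumption made for the example: NFKC is lossy, since it replaces ligatures and other compatibility characters by their plain equivalents, so it is not appropriate for every corpus.

```python
import unicodedata


def describe_unusual_characters(text):
    """List non-ASCII whitespace and dash characters with their Unicode names."""
    found = []
    for index, char in enumerate(text):
        if ord(char) < 128:
            continue  # plain ASCII is not reported
        # "Pd" is the Unicode general category "Punctuation, dash".
        if char.isspace() or unicodedata.category(char) == "Pd":
            name = unicodedata.name(char, "UNNAMED")
            found.append((index, f"U+{ord(char):04X}", name))
    return found


def normalize_for_tools(text):
    """Apply NFKC normalization, then map space separators and dashes to ASCII."""
    # NFKC composes base letters with combining marks and replaces
    # compatibility characters such as ligatures by plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    cleaned = []
    for char in text:
        category = unicodedata.category(char)
        if category == "Zs":      # any space separator -> ASCII space
            cleaned.append(" ")
        elif category == "Pd":    # any dash punctuation -> ASCII hyphen-minus
            cleaned.append("-")
        else:
            cleaned.append(char)
    return "".join(cleaned)
```

Applied to a string containing the ligature U+FB01, a no-break space, an en dash, and the letter e followed by a combining acute accent, `normalize_for_tools` returns text with the letters "fi", ordinary spaces, an ASCII hyphen, and a single precomposed character for the accented letter. Characters that merely look like dashes but belong to other categories, such as the minus sign U+2212, are left untouched by this sketch.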

One of the major axes of difference between annotation formats is the choice between inline and stand-off annotation. Inline annotation mixes the data and the annotation in a single file or data structure. Stand-off annotation stores annotations separately, either in a different file or in some other way apart from the language data, with a reference scheme to connect the two. See Chapter 3, Resource annotations, for more information about annotation formats.
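The difference can be made concrete with a small Python sketch. The `token/TAG` inline notation, the dictionary keys `start`, `end` and `tag`, and the use of character offsets as the reference scheme are all assumptions chosen for illustration; actual annotation formats use a variety of schemes.

```python
# The primary language data.
text = "The cat sleeps"

# Inline annotation: tags are embedded in the same string as the data.
inline = "The/DET cat/NOUN sleeps/VERB"

# Stand-off annotation: the text is left untouched, and each annotation
# refers to it by character offsets (start inclusive, end exclusive).
standoff = [
    {"start": 0, "end": 3, "tag": "DET"},
    {"start": 4, "end": 7, "tag": "NOUN"},
    {"start": 8, "end": 14, "tag": "VERB"},
]


def inline_to_standoff(annotated, sep="/"):
    """Split 'token/TAG' inline annotation into plain text and offset records."""
    tokens = []
    records = []
    position = 0
    for item in annotated.split():
        # Split on the last separator so tokens may themselves contain it.
        token, tag = item.rsplit(sep, 1)
        records.append({"start": position, "end": position + len(token), "tag": tag})
        tokens.append(token)
        position += len(token) + 1  # +1 for the space that joins tokens
    return " ".join(tokens), records


def standoff_to_inline(plain, records, sep="/"):
    """Rebuild the inline form by looking up each offset range in the text."""
    return " ".join(plain[r["start"]:r["end"]] + sep + r["tag"] for r in records)
```

Here `inline_to_standoff(inline)` returns the plain text together with the offset records, and `standoff_to_inline(text, standoff)` rebuilds the inline string. Offset-based references of this kind depend on the text being byte-for-byte or character-for-character stable, which is one reason the character encoding issues described above matter for stand-off annotation.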