Multimodal corpora

Note (editorial): This section is currently being rewritten.

There are no strict standards for audio, video, or annotation formats within the CLARIN-D network. Each center has developed its own standards and solutions according to the needs of its users. Nevertheless, usability of the resources and tools by other potential users has always been a goal; therefore, popular and open standards are encouraged.

When speaking about formats for audiovisual data, one needs to make a distinction between the container format and the actual encoding of the media streams. The container is used as a wrapper around one or more streams of audio and/or video data and typically some metadata about these streams. Many container formats can contain audio and/or video streams in a variety of encodings, although there are also “single coding formats” that only support one type of encoding. Examples of container formats are: WAV or AIFF for audio; AVI, Matroska (MKV) or QuickTime (MOV) for video.
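The distinction between a container and the streams it wraps can be made concrete with Python's standard-library wave module, which reads and writes the WAV container mentioned above. The sketch below (values chosen for illustration) writes one second of silent 16-bit PCM audio into an in-memory WAV container and then reads the stream metadata back out of it:

```python
import io
import wave

# Write a short PCM audio stream into a WAV container held in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(44100)    # CD sampling frequency
    w.writeframes(b"\x00\x00" * 44100)  # one second of silence

# Reading the container back exposes the metadata about the stream it wraps.
buf.seek(0)
with wave.open(buf, "rb") as r:
    channels, sampwidth, rate = (r.getnchannels(),
                                 r.getsampwidth(),
                                 r.getframerate())
print(channels, sampwidth * 8, rate)  # 1 16 44100
```

Note that the WAV container here only describes the stream (channels, sample width, sampling rate); the samples themselves are stored in whatever encoding the container declares, in this case uncompressed PCM.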

Audio and video streams can be encoded and decoded by means of a “codec”, a piece of software that can read and write audio or video data according to a certain encoding scheme. Often these encoding schemes make use of data compression in order to reduce storage or bandwidth requirements. Data compression can be lossless, i.e. completely reversible, as in the case of a zip file, or lossy, in which case information is discarded that can no longer be recovered. Most audio and video compression algorithms exploit weaknesses of human perception in order to discard information that is less noticeable to us.
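The difference between lossless and lossy compression can be shown in a few lines of Python using the standard-library zlib module (the "lossy" step here is a deliberately crude stand-in, not a real codec):

```python
import zlib

data = b"the same phrase repeated " * 200  # highly redundant input

# Lossless: compression is fully reversible.
packed = zlib.compress(data)
assert zlib.decompress(packed) == data

# "Lossy" (toy example): keep only every other byte; the discarded
# information can no longer be recovered from what remains.
lossy = data[::2]

print(len(data), len(packed), len(lossy))
```

Real lossy audio and video codecs are far more selective than this toy example: they model human perception and discard precisely the information we are least likely to notice.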

It depends a lot on the research questions and on the long-term preservation goals whether lossy compression of audiovisual data is an issue or not. A lot of research questions in linguistics, for example, can be answered perfectly well using mp3-compressed audio. The question is then whether this data should serve other purposes as well and whether it needs to be preserved for the long term. Certain phonetic analyses, for example, could be affected by the omissions introduced by the mp3 compression algorithm. Converting from one lossy compressed encoding to another typically introduces artifacts (“data” that has its origin not in the depicted real-world event or state, but in data manipulation algorithms, such as the curly-paper effect on grey areas in a Xerox copy of a Xerox copy), so after several conversions the quality of the recordings will decrease. Since file formats and encodings typically have a limited life span, one should in principle store data in uncompressed form if interpretability in the long run is a goal. The current situation, however, is such that this is easily feasible for audio recordings but nearly impossible for video recordings. Video recording equipment that falls within typical humanities research budgets does not record in uncompressed form, and even if it did, the storage requirements would still be too demanding at today’s storage prices.
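The generation-loss effect described above can be illustrated with a toy experiment in plain Python. Each "encoding" below simply rounds samples to its own grid, a crude stand-in for lossy quantization (the step sizes are arbitrary); converting a signal back and forth between two such encodings shows that the error introduced in the first pass never goes away:

```python
# A smooth test "signal" of 1000 sample values between 0 and about 1.
signal = [i / 997.0 for i in range(1000)]

def quantize(samples, step):
    # Lossy step: snap every sample to the nearest multiple of `step`.
    return [round(s / step) * step for s in samples]

generation = signal
errors = []
for _ in range(5):
    generation = quantize(generation, 0.010)   # toy "codec A"
    generation = quantize(generation, 0.017)   # toy "codec B"
    # Worst-case deviation from the original after this conversion pass.
    errors.append(max(abs(a - b) for a, b in zip(signal, generation)))

print(errors)  # the quantization error persists across generations
```

Real codecs quantize transform coefficients rather than raw samples, but the underlying point is the same: each lossy re-encoding can only preserve or worsen the deviation from the original recording, never undo it.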

Compression of video data can be done within each video image (intra-frame compression, comparable to JPEG compression of still images) as well as between consecutive images (inter-frame compression). Since two consecutive images in a video signal typically share a lot of content, some compression algorithms make use of that by only storing the differences rather than each full image. If the data rate is sufficiently high and there is not a lot of motion in the video, the inter-frame compression is hardly noticeable, but with lower data rates and fast motion, some artifacts can occur (blocks appear in the image). The MiniDV tape-based format used only intra-frame compression, but most consumer and semi-professional camcorders today record MPEG2 or MPEG4/H.264 compressed video, both of which also use inter-frame compression.

The quality of audio and video recordings depends on many more factors than just the file format and encoding (microphone, lens, or imaging sensor quality, correct settings and operation of the equipment, good recording circumstances, etc.), but these are beyond the scope of this survey. When looking at uncompressed digital audio, there are basically two parameters that determine the maximum quality that can technically be obtained: the bit depth and the sampling frequency. The bit depth (the number of bits per sample) determines the maximum possible dynamic range of the recording (the difference between the loudest and softest possible sound) and the sampling frequency determines the highest frequency component that can be represented in the signal (half the sampling frequency, e.g. 22.05 kHz for a signal recorded at a sampling frequency of 44.1 kHz).
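Both limits can be computed directly. The dynamic range of PCM audio follows from the bit depth as 20·log₁₀(2^bits), roughly 6 dB per bit, and the highest representable frequency (the Nyquist frequency) is half the sampling frequency:

```python
import math

def dynamic_range_db(bit_depth):
    # Maximum dynamic range of linear PCM: 20 * log10(2 ** bit_depth),
    # i.e. about 6.02 dB per bit of sample resolution.
    return 20 * math.log10(2 ** bit_depth)

def nyquist_khz(sampling_rate_hz):
    # Highest representable frequency is half the sampling frequency.
    return sampling_rate_hz / 2 / 1000

print(round(dynamic_range_db(16), 1))  # 96.3 dB for 16-bit audio
print(nyquist_khz(44100))              # 22.05 kHz, as in the text above
```

For comparison, 24-bit recording raises the theoretical dynamic range to about 144 dB, which is why it is the common choice for archival audio masters.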

When looking at uncompressed digital video, there are four basic parameters that determine the theoretical maximum quality that can be obtained: the spatial image resolution (number of pixels in each image), the temporal resolution (number of images per second), the colour depth (how many different colour values a single pixel can have) and the dynamic range (how many different light intensity values a single pixel can have). In common video hardware we have seen a shift in the spatial resolution of the image from Standard Definition to High Definition, which contains up to five times as many pixels. The temporal resolution on typical video recording hardware is limited by the video standards being used in the particular region, e.g. 25 frames per second for the PAL standard and 30 frames per second for the NTSC standard. Some recent hardware can record double the standard number of frames per second, and there are specialised high-speed video cameras that can record thousands of frames per second. Most video recording hardware reduces the colour information that is stored by making use of a technique called “chroma subsampling”. Since the human eye has less spatial sensitivity for colour information than for light intensity, the colour information is not stored for every single pixel but for groups of 2 or 4 pixels. Colour information in a digital video signal is typically represented as a 24-bit value (8 bits per primary colour), resulting in over 16 million possible different colours, which is more than the human eye can distinguish. The dynamic range of the human eye is much wider than the dynamic range of current video and display technology, even when not considering dynamic adaptations that can occur within the human eye over time (pupil dilation or constriction). This is primarily a limitation of the imaging sensor technology, however, rather than of the digital representation used in current common video formats.
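The combined effect of spatial resolution, frame rate, and chroma subsampling on storage requirements can be estimated with a simple calculation. The sketch below assumes 8 bits per sample and 4:2:0 subsampling (colour stored once per 2×2 block of pixels, a common consumer-camcorder scheme); the function name and parameters are illustrative:

```python
# Rough data-rate calculation for uncompressed video, with and without
# 4:2:0 chroma subsampling.
def raw_rate_mbit(width, height, fps, bits_per_sample=8, subsampled=False):
    luma = width * height                          # one luma sample per pixel
    if subsampled:
        chroma = 2 * (width // 2) * (height // 2)  # two chroma planes, 2x2 blocks
    else:
        chroma = 2 * width * height                # full-resolution chroma
    return (luma + chroma) * bits_per_sample * fps / 1e6

# Full HD at 25 frames per second (PAL-region frame rate):
print(round(raw_rate_mbit(1920, 1080, 25), 1))                   # 1244.2 Mbit/s
print(round(raw_rate_mbit(1920, 1080, 25, subsampled=True), 1))  # 622.1 Mbit/s
```

Even with chroma subsampling halving the data rate, uncompressed Full HD video would still consume roughly 280 GB per hour, which illustrates why the lossy compression discussed earlier remains unavoidable for video within typical research budgets.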