Multimodal corpora

Dieter van Uytvanck

MPI for Psycholinguistics Nijmegen

Examples of possible modes within corpora

  • Audio: spontaneous conversation, interviews with one or more interviewees, structured experiments (following a plan), cultural events and celebrations without a particular speaker.

  • Video: video recordings corresponding to the points listed above (fieldtrip interviews, cultural events and experiments); sign language and gesture analysis: conversation, storytelling, structured experiments (performing tasks according to a schedule, answering questions).

  • Text: transcription, translation, deeper linguistic annotations (describing morphosyntactic properties, actions in the video, including human interaction, emotions, reactions, comments, etc.).

  • MRI data recorded during an experiment.

  • Eye-tracking data – direction of the gaze, pupil size, focused area.

  • Hand movement – virtual glove recording the position of the hand and all the joints.

  • Motion capture data – part or full body with numerous markers attached.

Some background on audiovisual data formats

There are no strict standards for audio, video or annotation formats within the CLARIN-D network. Each centre has developed its own standards and solutions according to the needs of its users. Nevertheless, it has always been desirable that the resources and tools remain usable by other potential users. Therefore, popular and open standards are preferred.

When speaking about formats for audiovisual data, one needs to make a distinction between the container format and the actual encoding of the media streams. The container is used as a wrapper around one or more streams of audio and/or video data and typically some metadata about these streams. Many container formats can contain audio and/or video streams in a variety of encodings, although there are also “single coding formats” that only support one type of encoding. Examples of container formats are: WAV or AIFF for audio; AVI, Matroska (MKV) or QuickTime (MOV) for video.
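To make the container/encoding distinction concrete, here is a small sketch using only the Python standard library: it builds a minimal WAV file in memory and then reads back the RIFF container header, showing how the container wraps a PCM stream together with metadata describing that stream. The specific values (mono, 16 bit, 44.1 kHz) are illustrative.

```python
# Sketch: a WAV file is a RIFF container wrapping a PCM stream plus
# metadata about that stream. We build a minimal WAV in memory with
# the standard-library wave module, then parse the container header
# to show the stream metadata it stores.
import io
import struct
import wave

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)                 # mono
    w.setsampwidth(2)                 # 16-bit samples
    w.setframerate(44100)             # 44.1 kHz sampling frequency
    w.writeframes(b"\x00\x00" * 100)  # 100 silent samples

data = buf.getvalue()

# The outer RIFF container header: chunk id, size, and format tag.
riff, size, wave_tag = struct.unpack("<4sI4s", data[:12])
print(riff, wave_tag)                 # b'RIFF' b'WAVE'

# The 'fmt ' sub-chunk describes the encoding of the wrapped stream.
fmt_tag, channels, rate = struct.unpack("<HHI", data[20:28])
print(fmt_tag, channels, rate)        # 1 (= linear PCM), 1 channel, 44100 Hz
```

The same separation holds for video containers such as MKV or MOV, only with more elaborate headers and, usually, several wrapped streams.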

Audio and video streams can be encoded and decoded by means of a “codec”, a piece of software that can read and write audio or video data according to a certain encoding scheme. Often these encoding schemes make use of data compression in order to reduce storage or bandwidth requirements. Data compression can be lossless, i.e. completely reversible such as in the case of a zip file, or lossy, in which case information is thrown away that can no longer be recovered. Most audio and video compression algorithms make use of weaknesses of human perception in order to throw away information that is less noticeable to us.
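The reversibility of lossless compression mentioned above can be demonstrated directly with zlib, the algorithm family behind zip files: a compress/decompress round trip recovers the input bit for bit.

```python
# Sketch: lossless compression is completely reversible. A zlib
# round trip recovers the original data exactly, while repetitive
# input shrinks considerably.
import zlib

original = b"the same phrase repeated " * 100
compressed = zlib.compress(original)
print(len(original), len(compressed))   # repetitive data compresses well

restored = zlib.decompress(compressed)
assert restored == original             # no information was lost
```

A lossy codec, by contrast, would return something perceptually similar but not byte-identical, and the discarded detail cannot be recovered.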

It depends a lot on the research questions and on the long-term preservation goals whether lossy compression of audiovisual data is an issue or not. Many research questions in linguistics, for example, can be answered perfectly well using mp3-compressed audio. The question is then whether the data should serve other purposes as well and whether it needs to be preserved for the long term. Certain phonetic analyses, for example, could be affected by the omissions introduced by the mp3 compression algorithm. Converting from one lossy encoding to another typically introduces artifacts ("data" that originates not in the depicted real-world event or state but in the data manipulation algorithms, such as the curly-paper effect on grey areas in a Xerox copy of a Xerox copy), so after several conversions the quality of the recordings decreases. Since file formats and encodings typically have a limited life span, one should in principle store data in uncompressed form if long-term interpretability is a goal. At present, however, this is easily feasible for audio recordings but nearly impossible for video recordings: video recording equipment within typical humanities research budgets does not record in uncompressed form, and even if it did, the storage requirements would still be too demanding at today's storage prices.

Compression of video data can be applied within each video image (intra-frame compression, comparable to JPEG compression of still images) as well as between consecutive images (inter-frame compression). Since two consecutive images in a video signal typically share a lot of content, some compression algorithms exploit this by storing only the differences rather than each full image. If the data rate is sufficiently high and there is not a lot of motion in the video, the inter-frame compression is hardly noticeable, but at lower data rates and with fast motion some artifacts can occur (blocks appear in the image). The MiniDV tape-based format used only intra-frame compression, but most consumer and semi-professional camcorders today use MPEG2 or MPEG4/H.264 compressed video, which also uses inter-frame compression.
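The difference-storing idea behind inter-frame compression can be illustrated with a toy example (this is not a real codec, just the principle): a frame is reconstructed from the previous frame plus a stored difference frame, which is mostly zeros when there is little motion.

```python
# Toy illustration of inter-frame compression: store only the
# per-pixel differences between consecutive frames, then reconstruct
# the second frame from the first frame plus the stored delta.
import zlib

WIDTH, HEIGHT = 8, 8
frame1 = [10] * (WIDTH * HEIGHT)   # flat grey image
frame2 = list(frame1)
frame2[0] = 12                     # one pixel changed slightly

# Difference frame: mostly zeros, since there is little "motion".
delta = bytes((b - a) % 256 for a, b in zip(frame1, frame2))

# Compare compressed sizes of the full second frame vs. the delta.
full = zlib.compress(bytes(frame2))
diff = zlib.compress(delta)
print(len(full), len(diff))

# Decoding reverses the process: previous frame + stored differences.
decoded = bytes((a + d) % 256 for a, d in zip(frame1, delta))
assert decoded == bytes(frame2)
```

Real codecs add motion estimation, quantisation and entropy coding on top of this, which is where the lossy behaviour and the block artifacts come from.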

The quality of audio and video recordings depends on many more factors than just the file format and encoding (microphone, lens or imaging sensor quality, correct settings and operation of the equipment, good recording circumstances, etc.), but these are beyond the scope of this survey. For uncompressed digital audio, there are basically two parameters that determine the maximum quality that can technically be obtained: the bit depth and the sampling frequency. The bit depth determines the maximum possible dynamic range of the recording (the difference between the loudest and softest possible sound) and the sampling frequency determines the highest frequency component that can be represented in the signal (half the sampling frequency, e.g. 22.05 kHz for a signal recorded at a sampling frequency of 44.1 kHz).
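Both limits can be computed directly. The dynamic range of n-bit linear PCM is approximately 6.02 dB per bit, and the highest representable frequency is half the sampling frequency (the Nyquist frequency):

```python
# Sketch: the two quality limits of uncompressed PCM audio, computed
# for common settings.
def dynamic_range_db(bits: int) -> float:
    """Approximate dynamic range of n-bit linear PCM, in decibels."""
    return 6.02 * bits

def nyquist_khz(sampling_khz: float) -> float:
    """Highest frequency component representable at a given sampling rate."""
    return sampling_khz / 2

print(dynamic_range_db(16))   # ~96 dB for 16-bit audio
print(nyquist_khz(44.1))      # 22.05 kHz for a 44.1 kHz recording
```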

When looking at uncompressed digital video, there are four basic parameters that determine the theoretical maximum quality that can be obtained: the spatial resolution (number of pixels in each image), the temporal resolution (number of images per second), the colour depth (how many different colour values a single pixel can have) and the dynamic range (how many different light intensity values a single pixel can have).

In common video hardware we have seen a shift in spatial resolution from Standard Definition to High Definition, which contains up to five times as many pixels (1920 × 1080 versus 720 × 576). The temporal resolution of typical video recording hardware is limited by the video standard used in the particular region, e.g. 25 frames per second for the PAL standard and 30 frames per second for the NTSC standard. Some recent hardware can record at double the standard frame rate, and there are specialised high-speed video cameras that can record thousands of frames per second.

Most video recording hardware reduces the amount of colour information that is stored by means of a technique called "chroma subsampling": since the human eye has less spatial sensitivity for colour information than for light intensity, the colour information is stored not for every single pixel but for groups of 2 or 4 pixels. Colour information in a digital video signal is typically represented as a 24-bit value (8 bits per primary colour), resulting in over 16 million possible colours, which is more than the human eye can distinguish. The dynamic range of the human eye is much wider than that of current video and display technology, even without considering the dynamic adaptations that occur within the eye over time (pupil dilation and constriction). This, however, is primarily a limitation of imaging sensor technology rather than of the digital representation in current common video formats.
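These four parameters also explain why uncompressed video storage is so demanding: the raw data rate is simply pixels per frame times frames per second times bits per pixel. A quick calculation for the PAL resolutions mentioned above:

```python
# Sketch: raw data rate of uncompressed video from the parameters
# discussed above (spatial resolution, frame rate, colour depth).
def raw_rate_mbit_s(width: int, height: int, fps: int, bits_per_pixel: int) -> float:
    """Uncompressed video data rate in megabits per second."""
    return width * height * fps * bits_per_pixel / 1_000_000

# Standard-definition PAL: 720 x 576, 25 fps, 24-bit colour.
print(raw_rate_mbit_s(720, 576, 25, 24))    # ~249 Mbit/s

# Full high-definition: 1920 x 1080, 25 fps, 24-bit colour.
print(raw_rate_mbit_s(1920, 1080, 25, 24))  # ~1244 Mbit/s
```

Compare these figures with the compressed rates of a few Mbit/s that camcorders actually produce, and the need for (lossy) compression in affordable equipment is evident.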


General recommendations

Data producers should search for information about the limitations of certain audio-visual data representations and select formats that offer sufficient resolution in all dimensions that are relevant for the research questions that they or future users might have. At the same time they should inform themselves about any formats recommended or required by the data repository at which they would like to archive their material.

Data repositories that preserve research data for the long term should select a limited set of audio-visual formats as archival formats, based on the following criteria:

  • suitability to represent the recorded events in sufficient detail in all relevant dimensions

  • suitability to convert to a different format without (too much) loss of information

  • long-term perspective of the format

  • openness of the format

  • general acceptability of the format within the research domain (in case the criteria above are already met and one needs to choose between formats that are in principle suitable as archival formats).

For the time being, the choice of a video format will in practice be a compromise of some kind, since not all criteria can be fulfilled. If the archival format itself is not practical to work with, derived working formats can typically be generated.

Practical recommendations given the current state of technology

Audio recordings: Use uncompressed linear PCM audio at a bit depth and sampling frequency sufficient for the recorded material. For speech this would be 16 bit / 22 kHz as a minimum; for field recordings where the environment needs to be accurately represented, 16 bit / 44.1 kHz; for highly dynamic music, 24 bit / 44.1 kHz.
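Producing such a file requires no special software; the sketch below writes uncompressed linear PCM at the 16 bit / 44.1 kHz setting recommended for field recordings, using only the Python standard library. The sine tone and the file name "tone.wav" are placeholders for real recorded input.

```python
# Sketch: write uncompressed 16-bit / 44.1 kHz linear PCM audio with
# the standard-library wave module. A 440 Hz sine tone stands in for
# real recorded material.
import math
import struct
import wave

RATE = 44100      # sampling frequency in Hz
BITS = 16         # bit depth (2 bytes per sample)
SECONDS = 1

frames = b"".join(
    struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * i / RATE)))
    for i in range(RATE * SECONDS)
)

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(BITS // 8)
    w.setframerate(RATE)
    w.writeframes(frames)

# Read the file back to confirm the stored parameters.
with wave.open("tone.wav", "rb") as r:
    rate_read = r.getframerate()
    bits_read = r.getsampwidth() * 8
    frames_read = r.getnframes()
print(rate_read, bits_read, frames_read)   # 44100 16 44100
```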

Video recordings: For most purposes, video formats produced by the higher end of today’s consumer camcorders (MPEG2 or MPEG4/H.264 at high bit rates) are sufficient in quality and can be stored in their original form until storage space is cheap enough to convert the material to a lossless format, if long-term preservation is a requirement.

  • standard definition video size (720 × 576, 704 × 480), MPEG-2 compression up to 9.8 Mbit/s (usually around 3.5 Mbit/s) – Fieldtrip recordings, field interviews, cultural events

  • high definition video (1280 × 720, 1920 × 1080), H.264/MPEG-4 AVC compression up to 48 Mbit/s (usually around 9 Mbit/s) – detailed analyses of gestures, eye gaze and facial expressions (although this also depends a lot on the framing and distance when filming)
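For planning repository space, the bit rates listed above translate into rough per-hour storage figures as follows (decimal gigabytes; actual files vary with content and encoder settings):

```python
# Sketch: approximate storage per hour of video at a given bit rate.
def gigabytes_per_hour(mbit_s: float) -> float:
    """Storage for one hour of video at a given data rate, in GB (decimal)."""
    return mbit_s / 8 * 3600 / 1000

print(gigabytes_per_hour(3.5))  # SD MPEG-2 at ~3.5 Mbit/s -> ~1.6 GB/hour
print(gigabytes_per_hour(9))    # HD H.264 at ~9 Mbit/s -> ~4.1 GB/hour
```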

For special collections one could already consider storing the material in a losslessly compressed format, such as (currently) the Motion JPEG 2000 (MJPEG2000) format. Keep in mind, however, that all consumer camcorders today record audio in compressed form.