Technologies & Tools

This section of the K-OAr Center’s site reproduces selected content from the Speech Data & Tech website, formerly at https://speechandtech.eu/ (you can find its latest Internet Archive copy here).

Speech & Tech provided an overview of the main technologies and tools used in the processing and analysis of speech data, with the aim of supporting humanities researchers in navigating a field often characterized by technical terminology and specialized knowledge.

Working with speech data typically involves a sequence of interconnected processes – from recording, to recognition, to analysis – each relying on specific technologies and tools.

Recording

Before addressing tools for processing and analyzing speech data, it is important to consider the initial stage of recording. The quality of audio recordings – including microphone choice, recording environment and technical settings – has a direct impact on all subsequent steps, from transcription to analysis. An introduction to best practices in audio recording and data preparation is provided in this presentation by Christoph Draxler.

From recording to analysis

The first step concerns the digitization of data. Speech recordings and written materials are converted into digital formats, ensuring sufficient audio quality and technical parameters for further processing.
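As an illustration of what "technical parameters" means in practice, the sketch below reads the basic properties of a WAV file with Python's standard wave module and compares them with minimum values. The thresholds (16 kHz, 16 bit) are assumptions chosen for the example, not requirements stated by any of the tools described here, and the function names are hypothetical.

```python
import io
import wave


def describe_wav(source):
    """Return basic technical parameters of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        frame_rate = wav.getframerate()
        n_frames = wav.getnframes()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": frame_rate,
            "bit_depth": wav.getsampwidth() * 8,
            "duration_s": n_frames / frame_rate if frame_rate else 0.0,
        }


def check_wav(info, min_sample_rate_hz=16_000, min_bit_depth=16):
    """List problems relative to the assumed minimum thresholds."""
    problems = []
    if info["sample_rate_hz"] < min_sample_rate_hz:
        problems.append("sample rate below %d Hz" % min_sample_rate_hz)
    if info["bit_depth"] < min_bit_depth:
        problems.append("bit depth below %d bits" % min_bit_depth)
    return problems


def make_silent_wav(sample_rate_hz, channels=1, sample_width=2, seconds=1):
    """Build an in-memory WAV of silence, used here only to demonstrate the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * (sample_rate_hz * channels * sample_width * seconds))
    buffer.seek(0)
    return buffer


# Demonstration on generated audio; replace the buffer with a file path for real data.
print(check_wav(describe_wav(make_silent_wav(8_000))))
```

A check of this kind can be run on a whole collection before any transcription is attempted, so that unsuitable files are found early.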

The second step is recognition, where raw data is transformed into symbolic representations. This includes technologies such as:

Automatic Speech Recognition (ASR), which converts speech into text
Optical Character Recognition (OCR), which extracts text from images
Speaker diarisation, which identifies who is speaking and when
Emotion recognition, which detects affective features in speech

The final step is analysis, where meaning is extracted from the processed data. This stage often combines the outputs of recognition systems with linguistic, statistical and contextual knowledge in order to interpret spoken content.
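Combining the outputs of recognition systems can be as simple as matching time stamps. The sketch below assumes that an ASR system has produced text segments with start and end times, and that a diarisation system has produced speaker turns; it then labels each text segment with the speaker whose turn overlaps it most. The data, field names and function names are invented for illustration and do not correspond to the output format of any specific tool.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_speakers(asr_segments, speaker_turns):
    """Attach to each ASR segment the speaker whose turn overlaps it most."""
    labelled = []
    for seg in asr_segments:
        best_speaker, best_overlap = None, 0.0
        for turn in speaker_turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        # Segments that overlap no turn keep speaker=None.
        labelled.append({**seg, "speaker": best_speaker})
    return labelled


# Invented example data: two ASR segments and two diarisation turns.
asr_segments = [
    {"start": 0.0, "end": 2.5, "text": "Where did you grow up?"},
    {"start": 2.8, "end": 6.0, "text": "In a small village near the coast."},
]
speaker_turns = [
    {"start": 0.0, "end": 2.6, "speaker": "interviewer"},
    {"start": 2.6, "end": 6.2, "speaker": "interviewee"},
]

for seg in label_speakers(asr_segments, speaker_turns):
    print(f'{seg["speaker"]}: {seg["text"]}')
```

Real recordings contain overlapping speech and imprecise boundaries, so a rule like this is only a starting point that usually needs manual checking.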

Automatic Speech Recognition (ASR)

A central technology within Speech & Tech is Automatic Speech Recognition (ASR), which enables the conversion of speech into written text. While earlier systems relied on statistical models such as Hidden Markov Models, recent developments in deep learning have led to end-to-end systems capable of directly transforming audio into text.

Contemporary models such as wav2vec 2.0 and Whisper have significantly improved performance, particularly in multilingual contexts and in handling variation in accents and recording conditions. Depending on the model, these systems support not only transcription but also translation and the alignment of speech with text.
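To give an idea of how such a model is used in practice, here is a minimal sketch based on the open-source openai-whisper Python package. It assumes that the package and ffmpeg are installed; the model size "base" and the helper names are choices made for this example. Setting task="translate" asks Whisper to produce English output instead of a transcript in the source language.

```python
def transcribe(audio_path, model_name="base", task="transcribe"):
    """Run Whisper on one audio file; task="translate" yields English output."""
    import whisper  # third-party: pip install openai-whisper (also needs ffmpeg)

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, task=task)


def format_segments(result):
    """Turn Whisper-style segments into '[start - end] text' lines."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(f'[{seg["start"]:.2f} - {seg["end"]:.2f}] {seg["text"].strip()}')
    return lines
```

With a recording at hand (the file name "interview.wav" is hypothetical), calling format_segments(transcribe("interview.wav")) returns one time-stamped line per recognised segment. The time stamps are what later makes alignment and subtitling possible.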

At the same time, ASR remains imperfect: issues such as the simplification of disfluencies, limited speaker identification, and sensitivity to training data still affect its use, especially in research contexts where fine-grained features of speech are relevant.

Transcription Portal

To help researchers working with speech data, CLARIN ERIC supported the development of the Transcription Portal, a web-based service for automatic transcription.

The portal allows users to upload audio files, select the spoken language and, where available, a language model, and process the files through Automatic Speech Recognition. Once processing is complete, the resulting transcripts can be downloaded for further use. Access to the service requires an academic login.

The Transcription Portal operates as a distributed system. Audio files uploaded by users are routed through a central node (currently hosted at LMU Munich) and then sent to ASR engines located in different countries, depending on the selected language. For example, a Dutch audio file may be processed by an ASR system based in the Netherlands, with results returned through the same infrastructure.

This architecture reflects the broader goal of making speech recognition tools available across European languages. However, coverage is not uniform, and in some cases commercial ASR systems may be used, raising potential issues regarding data protection and GDPR compliance.
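The idea of routing by language, and the data protection question it raises, can be pictured with a toy model. Everything in the sketch below is hypothetical: the engine names, the routing table and the rule that commercial engines need explicit permission are assumptions made for illustration, not a description of how the Transcription Portal is actually configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str         # hypothetical engine label
    country: str      # where the engine is hosted
    commercial: bool  # True if run by a commercial provider


# Hypothetical routing table: language code -> engine. Not the portal's real configuration.
ROUTES = {
    "nl": Engine("example-nl-asr", "Netherlands", commercial=False),
    "de": Engine("example-de-asr", "Germany", commercial=False),
    "xx": Engine("example-commercial-asr", "unknown", commercial=True),
}


def route(language, allow_commercial=False):
    """Pick the engine for a language, refusing commercial engines unless allowed."""
    engine = ROUTES.get(language)
    if engine is None:
        raise LookupError(f"no ASR engine configured for language {language!r}")
    if engine.commercial and not allow_commercial:
        raise PermissionError(
            f"{engine.name} is a commercial service; check data protection rules first"
        )
    return engine


print(route("nl").country)
```

The point of the sketch is the second check: before sensitive recordings leave the research infrastructure, it should be clear which kind of service will process them.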

Workflow

The use of the Transcription Portal follows a relatively simple workflow:

Upload one or more audio files
Select the language (and optionally a language model)
Start the transcription process
Download the generated transcripts

At present, audio files are typically required in WAV format (mono or stereo), although future versions aim to support automatic format conversion. For stereo recordings, users can choose whether to process channels separately or together. Processing channels separately can be useful in interview settings where speakers are recorded on different channels, facilitating speaker distinction and, in some cases, improving recognition quality.
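Separating the channels of a stereo recording can also be done locally before upload. The sketch below uses only Python's standard library and assumes an uncompressed PCM WAV file with exactly two channels; the function name is hypothetical. It returns two in-memory mono WAV files, one per channel, which can then be written to disk.

```python
import io
import wave


def split_stereo(source):
    """Split a stereo PCM WAV (path or file-like object) into two in-memory mono WAVs."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 2:
            raise ValueError("expected a stereo (2-channel) WAV file")
        width = wav.getsampwidth()
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    # Stereo PCM frames are interleaved: left sample, right sample, left sample, ...
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames) - frame_size + 1, frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    outputs = []
    for channel in (left, right):
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(width)
            out.setframerate(rate)
            out.writeframes(bytes(channel))
        buffer.seek(0)  # rewind so the caller can read the WAV from the start
        outputs.append(buffer)
    return outputs[0], outputs[1]
```

In an interview recorded with one microphone per channel, the two resulting files correspond to the two speakers and can be transcribed separately.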

Alignment and transcription tools

Beyond automatic transcription, the project addressed tools for refining and structuring speech data:

Alignment, which synchronizes an existing transcript with the audio signal, allowing precise time alignment at the level of words or phonemes
Manual transcription tools, which support the creation of time-aligned transcripts and facilitate navigation within audio data (e.g. oTranscribe, Subtitle Edit, ELAN)

Subtitles represent a further layer of processing, transforming transcripts into synchronized textual representations of spoken content. They can be used both for accessibility (e.g. for deaf or hard-of-hearing audiences) and for translation across languages. Subtitles may be embedded in video or provided as separate, selectable tracks, and they often include additional information about sound and context.
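A time-aligned transcript already contains what a subtitle file needs. As a minimal sketch, the code below turns a list of segments (start time, end time, text) into the widely used SubRip (SRT) format, where each subtitle consists of a running number, a timing line and the text. The segment data and function names are invented for the example.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render time-aligned segments as the text of an SRT subtitle file."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f'{srt_timestamp(seg["start"])} --> {srt_timestamp(seg["end"])}'
        blocks.append(f'{index}\n{timing}\n{seg["text"].strip()}\n')
    # Subtitles are separated from each other by one blank line.
    return "\n".join(blocks)


# Invented example segments; the bracketed note shows how sound information can be included.
segments = [
    {"start": 0.0, "end": 2.5, "text": "Where did you grow up?"},
    {"start": 2.8, "end": 6.0, "text": "[laughs] In a small village."},
]
print(to_srt(segments))
```

The resulting text can be saved with the extension .srt and loaded as a separate, selectable track in most video players.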