The Multimedia Analytics Module in the INSPEC2T Platform

Vicomtech-IK4, December 2017

As part of the INSPEC2T project’s objective to accelerate communication and information sharing between police operators and members of the community, Vicomtech-IK4 has successfully developed and integrated a new technology which automatically transcribes spoken messages sent by citizens as crime evidence.

Here, we illustrate some technical aspects of the INSPEC2T solution. Please note that the INSPEC2T consortium will ensure that users’ privacy rights and other fundamental rights are respected at all times. We will provide a full analysis of the INSPEC2T legal and ethical requirements in a future blog entry.

This technology allows police operators to obtain richly annotated, informative text output from audio recordings, and to spot keywords even in recordings made under adverse acoustic conditions. The technology has been developed for English and Spanish and will be tested in near-real conditions and environments. Among other factors, it deals with variability in English accents, acoustic conditions, emotional states and audio capture devices. The three main components of the system architecture are: (1) a speech recognition engine which transcribes speech segments to raw text, (2) a capitalizer that detects named entities and proper names and capitalises them, and (3) a punctuation module that adds full stops and commas to the capitalised text.

Figure 1 presents the main architecture of the rich transcription system integrated into the INSPEC2T platform for both English and Spanish.

Figure 1 – Architecture of the rich transcription system integrated in the INSPEC2T platform
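The three-stage flow (recognition, then capitalisation, then punctuation) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in: the real components are trained models that are not described in detail here, so the function names, the lookup set of proper names and the "one full stop per speech segment" rule are assumptions made purely to show how the stages chain together.

```python
from typing import Iterable, List, Set


def recognize(segments: Iterable[str]) -> List[str]:
    """Stand-in for the speech recognition engine.

    A real engine decodes audio; here each segment is assumed to be
    already decoded and is simply normalised to raw lowercase text.
    """
    return [segment.lower().strip() for segment in segments]


def capitalize(raw_text: str, proper_names: Set[str]) -> str:
    """Stand-in capitalizer: capitalises words found in a lookup set."""
    return " ".join(
        word.capitalize() if word in proper_names else word
        for word in raw_text.split()
    )


def punctuate(capitalized_segments: List[str]) -> str:
    """Stand-in punctuation module: one full stop per speech segment."""
    return " ".join(segment + "." for segment in capitalized_segments if segment)


def rich_transcript(segments: Iterable[str], proper_names: Set[str]) -> str:
    """Chain the three stages in the order described in the text."""
    raw = recognize(segments)
    capitalized = [capitalize(text, proper_names) for text in raw]
    return punctuate(capitalized)


if __name__ == "__main__":
    print(rich_transcript(
        ["a man called john ran towards madrid", "he had a red bag"],
        {"john", "madrid"},
    ))
```

The point of the sketch is only the ordering: each stage consumes the output of the previous one, so the punctuation module works on text that has already been capitalised.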

Additionally, the speech recognition engine includes keyword spotting (KWS) technology, which identifies keywords in spoken utterances: the system is given keywords as input and searches for them in the audio.

The first step is speech recognition of the spoken content, which generates a network (known as a lattice) connecting all the combinations of possible recognized words in the audio (see Figure 2). The search is then performed over this lattice, and the keywords are recovered along with confidence scores and timestamps.

Figure 2 presents the main architecture of the KWS system.

Figure 2 – Architecture of the KWS system integrated in the INSPEC2T platform
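To make the lattice search more concrete, here is a minimal Python sketch under stated assumptions. The `Arc` structure, the toy lattice and the scoring rule are all hypothetical: the confidence of a hit is taken as the product of per-arc posterior probabilities, which is a simplification of the lattice-based scoring a production KWS system would use.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Arc:
    """One word hypothesis in the lattice (hypothetical structure)."""
    src: int          # node the arc leaves
    dst: int          # node the arc enters
    word: str
    start: float      # start time in seconds
    end: float        # end time in seconds
    posterior: float  # assumed per-arc confidence in [0, 1]


def search_keyword(
    lattice: List[Arc], keyword: str, threshold: float = 0.0
) -> List[Tuple[float, float, float]]:
    """Return (start, end, confidence) hits for a keyword or key phrase."""
    words = keyword.lower().split()
    if not words:
        return []

    # Index arcs by source node so that a path can be extended arc by arc.
    by_src: Dict[int, List[Arc]] = {}
    for arc in lattice:
        by_src.setdefault(arc.src, []).append(arc)

    hits: List[Tuple[float, float, float]] = []

    def extend(arc: Arc, idx: int, start: float, score: float) -> None:
        if arc.word != words[idx]:
            return
        score *= arc.posterior
        if idx == len(words) - 1:
            if score >= threshold:
                hits.append((start, arc.end, score))
            return
        for nxt in by_src.get(arc.dst, []):
            extend(nxt, idx + 1, start, score)

    # Try every arc as the possible first word of the keyword.
    for arc in lattice:
        extend(arc, 0, arc.start, 1.0)
    return sorted(hits, key=lambda hit: -hit[2])


# Toy lattice with competing hypotheses: "red"/"bread" and "car"/"card".
LATTICE = [
    Arc(0, 1, "the", 0.0, 0.2, 1.0),
    Arc(1, 2, "red", 0.2, 0.5, 0.7),
    Arc(1, 2, "bread", 0.2, 0.5, 0.3),
    Arc(2, 3, "car", 0.5, 0.9, 0.6),
    Arc(2, 3, "card", 0.5, 0.9, 0.4),
]

if __name__ == "__main__":
    for start, end, conf in search_keyword(LATTICE, "red car"):
        print(f"{start:.1f}s to {end:.1f}s, confidence {conf:.2f}")
```

Because the lattice keeps alternative hypotheses, a keyword can still be found when it is not part of the single best transcription; the threshold then trades missed detections against false alarms.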

The systems achieve good performance even under adverse conditions because the acoustic and language models were trained on a wide variety of acoustic environments and text data. In particular, clean training data were mixed with noise samples from restaurants, shopping centres and streets in order to generate additional synthetic speech content.
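A common way to produce this kind of synthetic noisy speech is to add a noise recording to a clean one at a chosen signal-to-noise ratio (SNR). The Python sketch below shows that general technique only; the function names, the looping of the noise clip and the SNR values are assumptions, since the actual mixing procedure and levels used in INSPEC2T are not specified here.

```python
import math
from typing import List, Sequence


def power(signal: Sequence[float]) -> float:
    """Mean power of a signal given as a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(
    clean: Sequence[float], noise: Sequence[float], snr_db: float
) -> List[float]:
    """Add noise to clean speech so that the result has the requested SNR.

    SNR_dB = 10 * log10(P_clean / P_noise), so the noise is rescaled
    until its power equals P_clean / 10^(snr_db / 10).
    """
    if not clean or not noise:
        raise ValueError("clean and noise must be non-empty")

    # Loop the noise clip so that it covers the whole clean signal.
    repeats = -(-len(clean) // len(noise))  # ceiling division
    tiled = (list(noise) * repeats)[: len(clean)]

    noise_power = power(tiled)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent")

    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, tiled)]


if __name__ == "__main__":
    # Synthetic stand-ins for real recordings: a tone and a short noise loop.
    clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
    noise = [((t * 37) % 11 - 5) / 5 for t in range(300)]
    for snr_db in (20, 10, 5):  # illustrative values only
        noisy = mix_at_snr(clean, noise, snr_db)
        print(snr_db, len(noisy))
```

Repeating the mix with different noise types and SNR values turns one clean utterance into several training examples, which is how this form of augmentation widens the range of acoustic conditions seen during training.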

This speech recognition technology was built with the latest modelling paradigms from the scientific community, using deep learning algorithms to build the acoustic and language models.