How speech-to-text apps work
Meeting app add-ons provide a transcript of meetings so that users can preserve the action items, next steps, and knowledge discussed for future reference. After a meeting on Zoom, Google Meet, Microsoft Teams, or another platform, users can use an app like dialoggBox or Otter to convert the proceedings into text.
So what is involved in converting a user's speech into text?
The first step is to capture the audio content. The higher the clarity of the captured audio, the better the results from the steps that follow, which is why noise reduction techniques are often applied at this stage.
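For illustration, here is a minimal sketch of this capture-and-denoise step in Python, assuming the open-source noisereduce package and scipy; the file names and parameters are placeholders, not a specific product's pipeline.

import noisereduce as nr
from scipy.io import wavfile

# Read the raw capture (assumed to be a mono WAV file).
rate, audio = wavfile.read("meeting_raw.wav")
audio = audio.astype(float)

# Spectral-gating noise reduction: estimates a noise profile from the
# signal itself and attenuates frequency content below that profile.
cleaned = nr.reduce_noise(y=audio, sr=rate)
wavfile.write("meeting_clean.wav", rate, cleaned.astype("float32"))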
Simple speech recognition used for captioning
Speech recognition software can convert the audio into text: a stream of words synchronized with the video. This is often used to create subtitles for playback of recorded video. At any point during playback, the picture frame provides the context of what is occurring in the scene, while the subtitle provides the spoken words in text form. The result is a stream of words without capitalization or punctuation. The words are useful even without being formed into sentences, because scene transitions and visual expressions carry much of the comprehension.
word1 word2 word3 … etc.
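As a concrete illustration, a minimal sketch using the open-source Whisper model can produce caption-style, timestamped segments; the model size and file name here are assumptions. (Whisper itself restores punctuation; simpler captioning engines emit the raw word stream shown above.)

import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("meeting.wav")

for seg in result["segments"]:
    # Each segment carries start/end times usable as subtitle cues.
    print(f"[{seg['start']:6.1f}s -> {seg['end']:6.1f}s] {seg['text'].strip()}")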
Capturing or watching a video is time-consuming, and video files consume a large amount of storage space. Overall, only a few users have time for video playback of meetings. Users who are required to make objective assessments may have to turn off video, and in some cases even audio, focusing only on the text.
State-of-the-art speech recognition
To convert audio into readable text, the input is first split at silences and factored into utterances. Each utterance has a start time, content, and end time, and this information is recorded. Next, each utterance is recognized into a set of words using artificial-intelligence techniques for speech recognition. An utterance may contain one or more sentences of text. Utterances are then punctuated so that they read as sentences rather than a bare stream of words.
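Here is a minimal sketch of the silence-based splitting step, assuming mono audio samples normalized to [-1, 1]; the frame size, energy threshold, and minimum silence length are illustrative values, not tuned ones.

from dataclasses import dataclass
import numpy as np

@dataclass
class Utterance:
    start: float    # seconds
    end: float      # seconds
    text: str = ""  # filled in later by the recognizer

def split_into_utterances(samples, rate, frame_ms=30,
                          energy_threshold=0.01, min_silence_ms=500):
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    # A frame counts as voiced when its RMS energy exceeds the threshold.
    voiced = [np.sqrt(np.mean(samples[i*frame_len:(i+1)*frame_len] ** 2))
              > energy_threshold for i in range(n_frames)]
    max_gap = min_silence_ms // frame_ms
    utterances, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= max_gap:  # silence long enough: close the utterance
                end = i - gap + 1
                utterances.append(Utterance(start * frame_ms / 1000,
                                            end * frame_ms / 1000))
                start, gap = None, 0
    if start is not None:
        utterances.append(Utterance(start * frame_ms / 1000,
                                    n_frames * frame_ms / 1000))
    return utterances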
During live capture, speech recognition software may initially recognize speech as one set of words, e.g. wordset-1; as more speech flows in, the context available for recognition increases, and the software may revise the recognition to a more accurate set of words, e.g. wordset-2. The recognition software usually maintains several concurrent word sets that it keeps evaluating against the spoken utterance, and it surfaces these word sets as potential alternatives. At the conclusion of the meeting, it evaluates the full audio and provides the best matching set of words overall, wordset-f, for each utterance.
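A sketch of how an app might consume such a live stream follows; the response objects here are hypothetical, loosely modeled on cloud streaming APIs (e.g. Google Cloud Speech-to-Text) in which each result carries candidate transcripts plus an is_final flag.

def handle_streaming_results(responses):
    for response in responses:
        for result in response.results:
            best = result.alternatives[0].transcript
            if result.is_final:
                # The settled hypothesis for this utterance: wordset-f.
                print("FINAL:  ", best)
            else:
                # An interim hypothesis (wordset-1, wordset-2, ...)
                # that may still be revised as more context arrives.
                print("interim:", best)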
Most speech recognition software outputs results in the following format; the speaker of any particular utterance is not identified in this output.
Time 1: Utterance 1
Time 2: Utterance 2
Time 3: Utterance 3
.. etc.
Speech recognition enhanced with diarization
Sophisticated speech recognition software can tag each utterance with an id corresponding to its speaker by analyzing features such as prosody (the rhythm and intonation of the utterance), the pitch of the speech, and the rate of speaking, i.e. how fast or how slow. Precision and recall of speaker identification in good speech recognition software with this capability typically range between 70% and 80%. The accuracy is low because the speech features of any person vary widely with the acoustics of the room they are in, their health, who they are speaking with, and other factors, far more than something like a fingerprint, which is static and does not change. User interfaces that show this information therefore give end users editing capabilities to correct speaker names. Speaker identification is more accurate when the number of speakers is low, such as 2, 3, or 4, and providing the number of speakers as an advance input to the software increases accuracy, as the sketch after the example output below shows.
Time 1: Speaker 1 > Utterance 1
Time 2: Speaker 2 > Utterance 2
Time 3: Speaker 3 > Utterance 3
.. etc.
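Here is a minimal diarization sketch using the open-source pyannote.audio library; the pretrained pipeline name, the file name, and the speaker count are assumptions, and recent versions require an access token to download the model.

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Passing the known speaker count as a hint tends to improve accuracy.
diarization = pipeline("meeting.wav", num_speakers=3)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    # e.g. "12.4s - 15.7s: SPEAKER_01"
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")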
Conversation method and interfaces
People do not talk in a grammatically correct way; they often speak in half sentences against the backdrop of a shared history among the conversants. They use local dialects, abbreviations, and slang, and their speech may be accented by another language. The medium of speaking, such as a phone call, online conference call, or media broadcast, may have characteristics such as a channel id which, if provided as input, can improve the results of speech recognition software. Some audio may contain multiple utterances overlapping one another when more than one person talks at the same time, as in a heated discussion. Speech can be conversation, music, or rhetoric, and the recognition software does not know in advance which it is. Simple speech recognition software in open-source or free mode can usually transcribe only up to about a minute of audio per request. Speech-to-text apps therefore also have to split long audio into recognition requests that do not cut off in the middle of sentences or phrases; a sketch of such chunking follows. All these factors add a layer of complexity to getting the right answers.
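Here is a minimal sketch of such chunking, grouping utterance time spans so that each request closes at an utterance boundary; the ~60-second per-request limit is a stand-in (max_len leaves headroom below it), not a specific vendor's quota.

def chunk_for_requests(spans, max_len=55.0):
    # spans: list of (start, end) utterance times in seconds.
    chunks, cur_start, cur_end = [], None, None
    for start, end in spans:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start > max_len:
            # Close the chunk at an utterance boundary, never mid-phrase.
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
        else:
            cur_end = end
        # (A single utterance longer than max_len would need finer,
        # word-level splitting, omitted here.)
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks

# Example: utterances at 0-20 s, 22-50 s, and 51-70 s become two requests.
print(chunk_for_requests([(0, 20), (22, 50), (51, 70)]))  # [(0, 50), (51, 70)]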
My company's product, dialoggBox, is a productivity and collaboration app. We use the methodology above and enhance it with a variety of techniques to perform speech-to-text with the highest accuracy on speaker diarization. The resulting transcript becomes the DNA for actions and insights from the conversation. We provide an interactive transcript, easy follow-ups, and insights for any conversation. Check us out at https://dialoggBox.com and click on TRY IT NOW to get an automatic, searchable, and actionable transcript in 50+ languages for any meeting on Zoom, Google Meet, Microsoft Teams, Webex, or BlueJeans, simply by putting in the meeting link.
Follow me on Medium to read more of my articles: https://gigaspoke.medium.com
Bookmark dialoggBox at https://www.dialoggBox.com and follow dialoggBox on LinkedIn at https://www.linkedin.com/company/dialoggbox