Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation #41
Comments
One important item to note here relevant to speech recognition is that Chrome/Chromium currently records the user and sends the recording to a server, without any notification that clearly indicates that the user of the browser is being recorded and that their voice is being sent to an external server. It is also not clear whether the user's biometric data (their voice) is stored (forever) by the service; see https://bugs.chromium.org/p/chromium/issues/detail?id=816095
@guest271314 , thank you. By including these client-side, server-side and third-party scenarios in the design of standard APIs, and by more tightly integrating such standards and APIs with WebRTC, we can: (1) provide users with notifications and permissions with respect to which client-side, server-side and third-party components and services are accessing their microphones and their text, SSML, hypertext and audio streams, (2) produce efficient call graphs (see also: https://youtu.be/EPBWR_GNY9U?t=2m from 2:00 to 4:12), (3) reduce latency for real-time translation scenarios, and (4) improve quality for real-time translation scenarios.
I’m hoping to inspire interest in post-text speech technology (speech-to-X1 and X2-to-speech), as well as interest in round-tripping, where we can utilize acoustic measures and metrics to compare the audio input to, and output from, speech-to-X1-to-X2-to-speech. X1 and X2 could be SSML (1.0, 1.1 or 2.0), hypertext or new formats. X1-to-X2 machine translation is also topical.
In the video Real Time Translation in WebRTC, the speaker indicates (at 7:48) that a major issue he would like to see solved is that users have to pause their speech before speech recognition and translation occur. Towards reducing latency, we can consider real-time, online speech recognition algorithms which, instead of processing natural language sentence-by-sentence and outputting X1, process natural language lexeme-by-lexeme and produce event streams. In these low-latency approaches, speech recognition components and services process speech audio in real time and produce event streams which are consumed by machine translation components, which in turn produce event streams consumed by speech synthesis components, which produce the resultant speech audio.
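A minimal sketch of such an event-stream pipeline follows. None of the interfaces below exist in the Web Speech API today; `LexemeEvent` and `machineTranslate` are assumptions used only to illustrate the lexeme-by-lexeme flow from recognition to translation to synthesis without waiting for a pause in the speaker's audio.

```typescript
// Hypothetical sketch: recognition events -> translation events -> synthesized audio.
interface LexemeEvent {
  lexeme: string;        // recognized unit (word, morpheme, etc.)
  startTimeMs: number;   // offset into the source audio
  language: string;      // BCP 47 tag, e.g. "en-US"
}

// Placeholder for an online machine translation call (assumption, not a real API).
declare function machineTranslate(text: string, from: string, to: string): Promise<string>;

// An online MT component consumes recognition events and emits translated events.
async function* translate(
  events: AsyncIterable<LexemeEvent>,
  targetLanguage: string
): AsyncIterable<LexemeEvent> {
  for await (const e of events) {
    // A real online MT engine may buffer or reorder a few lexemes;
    // the sketch pretends translation is one-to-one to stay short.
    yield {
      ...e,
      lexeme: await machineTranslate(e.lexeme, e.language, targetLanguage),
      language: targetLanguage,
    };
  }
}

// A synthesis component incrementally speaks each translated lexeme as it arrives.
async function synthesize(events: AsyncIterable<LexemeEvent>): Promise<void> {
  for await (const e of events) {
    speechSynthesis.speak(new SpeechSynthesisUtterance(e.lexeme));
  }
}
```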
This issue pertains to Issue 1 in the Web Speech API specification.
Introduction
We can envision and consider client-side, server-side and third-party speech recognition, synthesis and translation scenarios for a next version of the Web Speech API.
Advancing the State of the Art
Speech Recognition
Beyond speech-to-text, speech recognition includes speech-to-SSML and speech-to-hypertext. With speech-to-SSML and speech-to-hypertext, a higher degree of fidelity is possible when round-tripping speech audio through speech recognition and synthesis components or services.
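A hypothetical result shape for speech-to-SSML is sketched below; the current SpeechRecognitionAlternative exposes only a plain-text transcript and a confidence, so the `ssml` field is an assumption.

```typescript
// Hypothetical shape of a speech-to-SSML recognition alternative.
// Capturing prosody as SSML could make recognition -> synthesis round-trips higher fidelity.
interface SSMLRecognitionAlternative {
  transcript: string;   // plain text, as in the current API
  ssml: string;         // e.g. '<speak><prosody rate="slow">hello</prosody> world</speak>'
  confidence: number;   // 0.0 to 1.0, as in the current API
}
```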
Speech Synthesis
Beyond text-to-speech, speech synthesis includes SSML-to-speech and hypertext-to-speech.
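The following sketch assumes a future synthesis engine that accepts SSML directly; today most implementations treat the utterance text as plain text, so the markup below would simply be read out or ignored.

```typescript
// Sketch only: hypothetical SSML input to SpeechSynthesisUtterance.
const ssml =
  '<speak version="1.1"><p>Hello, <emphasis>world</emphasis>.</p></speak>';
const utterance = new SpeechSynthesisUtterance(ssml); // hypothetical SSML support
speechSynthesis.speak(utterance);
```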
Translation
Translation scenarios include processing text, SSML, hypertext or audio in a source language into text, SSML, hypertext or audio in a target language.
Desirable features include interoperability between client-side, server-side and third-party translation and WebRTC, with translations available as subtitles or audio tracks.
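A sketch of surfacing translations in a WebRTC call is shown below, using the existing TextTrack and RTCPeerConnection APIs; `getTranslatedAudioTrack` is an assumed helper, not a real API.

```typescript
// Translations surfaced as subtitles and as an additional audio track in a WebRTC call.
const pc = new RTCPeerConnection();
const video = document.querySelector("video")!;

// Subtitles in the target language, delivered as timed cues.
const subtitles = video.addTextTrack("subtitles", "German (translated)", "de");
subtitles.mode = "showing";
function onTranslatedCaption(text: string, startTime: number, endTime: number): void {
  subtitles.addCue(new VTTCue(startTime, endTime, text));
}

// Translated speech sent alongside the original audio.
declare function getTranslatedAudioTrack(): MediaStreamTrack; // assumption
pc.addTrack(getTranslatedAudioTrack());
```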
Multimodal Dialogue Systems
Interesting scenarios include Web-based multimodal dialogue systems which efficiently utilize client-side, server-side and third-party speech recognition, synthesis and translation.
Client-side Scenarios
Client-side Speech Recognition
These scenarios are considered in the current version of the Web Speech API.
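For reference, the current client-side recognition API (prefixed as webkitSpeechRecognition in Chromium, which, as noted above, implements it by sending audio to a server):

```typescript
// Current Web Speech API recognition, using the prefixed constructor where needed.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognitionCtor();
recognition.lang = "en-US";
recognition.interimResults = true;
recognition.onresult = (event: any) => {
  const result = event.results[event.results.length - 1];
  console.log(result[0].transcript, result[0].confidence);
};
recognition.start();
```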
Client-side Speech Synthesis
These scenarios are considered in the current version of the Web Speech API.
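For reference, the current client-side synthesis API:

```typescript
// Current Web Speech API synthesis.
const utterance = new SpeechSynthesisUtterance("Hello, world.");
utterance.lang = "en-US";
utterance.rate = 1.0;
utterance.onend = () => console.log("Done speaking.");
speechSynthesis.speak(utterance);
```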
Client-side Translation
These scenarios are new to the Web Speech API and involve the client-side translation of text, SSML, hypertext or audio into text, SSML, hypertext or audio.
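A hypothetical shape for such a client-side translation API is sketched below; the interface name and methods are assumptions, not proposals from the specification.

```typescript
// Hypothetical client-side translation interface covering text and audio inputs.
interface Translator {
  translateText(input: string, from: string, to: string): Promise<string>;
  translateAudio(input: MediaStreamTrack, from: string, to: string): Promise<MediaStreamTrack>;
}
```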
Server-side Scenarios
Server-side Speech Recognition
These scenarios are new to the Web Speech API and involve one or more audio streams from a client being streamed to a server which performs speech recognition, optionally providing speech recognition results to the client.
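One way a client could stream microphone audio to a recognition server is sketched below; getUserMedia, MediaRecorder and WebSocket are existing APIs, while the endpoint and message format are assumptions.

```typescript
// Stream microphone audio to a recognition server and receive interim results.
async function streamToRecognitionServer(): Promise<void> {
  const socket = new WebSocket("wss://example.com/recognize?lang=en-US"); // hypothetical endpoint
  socket.onmessage = (event) => {
    const { transcript, isFinal } = JSON.parse(event.data); // assumed result shape
    console.log(isFinal ? "final:" : "interim:", transcript);
  };

  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=opus" });
  recorder.ondataavailable = (event) => {
    if (socket.readyState === WebSocket.OPEN) socket.send(event.data);
  };
  recorder.start(250); // emit an audio chunk roughly every 250 ms
}
```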
Server-side Speech Synthesis
These scenarios are new to the Web Speech API and involve a client sending text, SSML or hypertext to a server which performs speech synthesis and streams audio to the client.
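A sketch of requesting synthesis from a server and playing the returned audio; the endpoint and request format are assumptions.

```typescript
// Send SSML to a synthesis server and play the audio it returns.
async function speakFromServer(ssml: string): Promise<void> {
  const response = await fetch("https://example.com/synthesize", { // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/ssml+xml" },
    body: ssml,
  });
  const audio = new Audio(URL.createObjectURL(await response.blob()));
  await audio.play();
}
```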
Server-side Translation
These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a server for translation into text, SSML, hypertext or audio.
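A sketch of server-side text translation from the client's point of view; the endpoint and payload are assumptions.

```typescript
// Ask a first-party server to translate text from one language to another.
async function translateOnServer(text: string, from: string, to: string): Promise<string> {
  const response = await fetch("https://example.com/translate", { // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, from, to }),
  });
  return (await response.json()).translation; // assumed response shape
}
```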
Third-party Scenarios
Third-party Speech Recognition
These scenarios are new to the Web Speech API and involve one or more audio streams from a client or server being streamed to a third-party service which performs speech recognition, providing speech recognition results to the client or server.
Third-party Speech Synthesis
These scenarios are new to the Web Speech API and involve a client or server sending text, SSML or hypertext to a third-party service which performs speech synthesis and streams audio to the client or server.
Third-party Translation
These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a third-party translation service for translation into text, SSML, hypertext or audio.
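The third-party pattern is sketched below from the perspective of a first-party server that forwards text to a third-party translation service and relays the result; the service URL, credential and response shape are all assumptions. This is exactly the kind of data flow that notification and permission surfaces should make visible to users, since their data leaves the first party.

```typescript
// A first-party server forwards the client's text to a third-party translation service.
const THIRD_PARTY_API_KEY = "…"; // hypothetical credential, kept on the first-party server

async function translateViaThirdParty(text: string, from: string, to: string): Promise<string> {
  const response = await fetch("https://third-party.example/v1/translate", { // hypothetical service
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${THIRD_PARTY_API_KEY}`,
    },
    body: JSON.stringify({ text, source: from, target: to }),
  });
  return (await response.json()).translatedText; // assumed response shape
}
```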
Hyperlinks
Amazon Web Services
Google Cloud AI
IBM Watson Products and Services
Microsoft Cognitive Services
Real Time Translation in WebRTC