Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation #41
Comments
One important item to note here relevant to speech recognition is that Chrome/Chromium currently records the user and sends the recording to a server, without any notification that clearly indicates that the user of the browser is being recorded and that their voice is being sent to an external server. It is also not clear whether the user's biometric data (their voice) is stored (forever) by the service; see https://bugs.chromium.org/p/chromium/issues/detail?id=816095
@guest271314 , thank you. By including these client-side, server-side and third-party scenarios in the design of standard APIs, and by more tightly integrating such standards and APIs with WebRTC, we can: (1) provide users with notifications and permissions with respect to which client-side, server-side and third-party components and services are accessing their microphones and their text, SSML, hypertext and audio streams, (2) produce efficient call graphs (see also: https://youtu.be/EPBWR_GNY9U?t=2m from 2:00 to 4:12), (3) reduce latency for real-time translation scenarios, and (4) improve quality for real-time translation scenarios.
I’m hoping to inspire interest in post-text speech technology (speech-to-X1 and X2-to-speech), as well as interest in round-tripping, where we can utilize acoustic measures and metrics to compare the audio input to, and output from, speech-to-X1-to-X2-to-speech. X1 and X2 could be SSML (1.0, 1.1 or 2.0), hypertext or new formats. X1-to-X2 machine translation is also topical.
In the video Real Time Translation in WebRTC, the speaker indicates (at 7:48) that a major issue he would like to see solved is that users have to pause their speech before speech recognition and translation occur. Towards reducing latency, we can consider real-time, online speech recognition algorithms which, instead of processing natural language sentence-by-sentence and outputting X1, process natural language lexeme-by-lexeme and produce event streams. In these low-latency approaches, speech recognition components and services process speech audio in real time and produce event streams which are consumed by machine translation components, which in turn produce event streams consumed by speech synthesis components, which produce the resultant speech audio.
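A minimal sketch of such an event-stream pipeline follows. None of the interfaces below exist in the Web Speech API today; `LexemeEvent` and `machineTranslate` are assumptions used only to illustrate the lexeme-by-lexeme flow from recognition to translation to synthesis without waiting for a pause in the speaker's audio.

```typescript
// Hypothetical sketch: recognition events -> translation events -> synthesized audio.
interface LexemeEvent {
  lexeme: string;        // recognized unit (word, morpheme, etc.)
  startTimeMs: number;   // offset into the source audio
  language: string;      // BCP 47 tag, e.g. "en-US"
}

// Placeholder for an online machine translation call (assumption, not a real API).
declare function machineTranslate(text: string, from: string, to: string): Promise<string>;

// An online MT component consumes recognition events and emits translated events.
async function* translate(
  events: AsyncIterable<LexemeEvent>,
  targetLanguage: string
): AsyncIterable<LexemeEvent> {
  for await (const e of events) {
    // A real online MT engine may buffer or reorder a few lexemes;
    // the sketch pretends translation is one-to-one to stay short.
    yield {
      ...e,
      lexeme: await machineTranslate(e.lexeme, e.language, targetLanguage),
      language: targetLanguage,
    };
  }
}

// A synthesis component incrementally speaks each translated lexeme as it arrives.
async function synthesize(events: AsyncIterable<LexemeEvent>): Promise<void> {
  for await (const e of events) {
    speechSynthesis.speak(new SpeechSynthesisUtterance(e.lexeme));
  }
}
```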
This issue pertains to Issue 1 in the Web Speech API specification.
Introduction
We can envision and consider client-side, server-side and third-party speech recognition, synthesis and translation scenarios for a next version of the Web Speech API.
Advancing the State of the Art
Speech Recognition
Beyond speech-to-text, speech recognition includes speech-to-SSML and speech-to-hypertext. With speech-to-SSML and speech-to-hypertext, a higher degree of fidelity is possible when round-tripping speech audio through speech recognition and synthesis components or services.
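A hypothetical result shape for speech-to-SSML is sketched below; the current SpeechRecognitionAlternative exposes only a plain-text transcript and a confidence, so the `ssml` field is an assumption.

```typescript
// Hypothetical shape of a speech-to-SSML recognition alternative.
// Capturing prosody as SSML could make recognition -> synthesis round-trips higher fidelity.
interface SSMLRecognitionAlternative {
  transcript: string;   // plain text, as in the current API
  ssml: string;         // e.g. '<speak><prosody rate="slow">hello</prosody> world</speak>'
  confidence: number;   // 0.0 to 1.0, as in the current API
}
```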
Speech Synthesis
Beyond text-to-speech, speech synthesis includes SSML-to-speech and hypertext-to-speech.
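The following sketch assumes a future synthesis engine that accepts SSML directly; today most implementations treat the utterance text as plain text, so the markup below would simply be read out or ignored.

```typescript
// Sketch only: hypothetical SSML input to SpeechSynthesisUtterance.
const ssml =
  '<speak version="1.1"><p>Hello, <emphasis>world</emphasis>.</p></speak>';
const utterance = new SpeechSynthesisUtterance(ssml); // hypothetical SSML support
speechSynthesis.speak(utterance);
```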
Translation
Translation scenarios include processing text, SSML, hypertext or audio in a source language into text, SSML, hypertext or audio in a target language.
Desirable features include interoperability between client-side, server-side and third-party translation and WebRTC, with translations available as subtitles or audio tracks.
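A sketch of surfacing translations in a WebRTC call is shown below, using the existing TextTrack and RTCPeerConnection APIs; `getTranslatedAudioTrack` is an assumed helper, not a real API.

```typescript
// Translations surfaced as subtitles and as an additional audio track in a WebRTC call.
const pc = new RTCPeerConnection();
const video = document.querySelector("video")!;

// Subtitles in the target language, delivered as timed cues.
const subtitles = video.addTextTrack("subtitles", "German (translated)", "de");
subtitles.mode = "showing";
function onTranslatedCaption(text: string, startTime: number, endTime: number): void {
  subtitles.addCue(new VTTCue(startTime, endTime, text));
}

// Translated speech sent alongside the original audio.
declare function getTranslatedAudioTrack(): MediaStreamTrack; // assumption
pc.addTrack(getTranslatedAudioTrack());
```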
Multimodal Dialogue Systems
Interesting scenarios include Web-based multimodal dialogue systems which efficiently utilize client-side, server-side and third-party speech recognition, synthesis and translation.
Client-side Scenarios
Client-side Speech Recognition
These scenarios are considered in the current version of the Web Speech API.
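For reference, the current client-side recognition API (prefixed as webkitSpeechRecognition in Chromium, which, as noted above, implements it by sending audio to a server):

```typescript
// Current Web Speech API recognition, using the prefixed constructor where needed.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognitionCtor();
recognition.lang = "en-US";
recognition.interimResults = true;
recognition.onresult = (event: any) => {
  const result = event.results[event.results.length - 1];
  console.log(result[0].transcript, result[0].confidence);
};
recognition.start();
```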
Client-side Speech Synthesis
These scenarios are considered in the current version of the Web Speech API.
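For reference, the current client-side synthesis API:

```typescript
// Current Web Speech API synthesis.
const utterance = new SpeechSynthesisUtterance("Hello, world.");
utterance.lang = "en-US";
utterance.rate = 1.0;
utterance.onend = () => console.log("Done speaking.");
speechSynthesis.speak(utterance);
```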
Client-side Translation
These scenarios are new to the Web Speech API and involve the client-side translation of text, SSML, hypertext or audio into text, SSML, hypertext or audio.
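A hypothetical shape for such a client-side translation API is sketched below; the interface name and methods are assumptions, not proposals from the specification.

```typescript
// Hypothetical client-side translation interface covering text and audio inputs.
interface Translator {
  translateText(input: string, from: string, to: string): Promise<string>;
  translateAudio(input: MediaStreamTrack, from: string, to: string): Promise<MediaStreamTrack>;
}
```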
Server-side Scenarios
Server-side Speech Recognition
These scenarios are new to the Web Speech API and involve one or more audio streams from a client being streamed to a server which performs speech recognition, optionally providing speech recognition results to the client.
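One way a client could stream microphone audio to a recognition server is sketched below; getUserMedia, MediaRecorder and WebSocket are existing APIs, while the endpoint and message format are assumptions.

```typescript
// Stream microphone audio to a recognition server and receive interim results.
async function streamToRecognitionServer(): Promise<void> {
  const socket = new WebSocket("wss://example.com/recognize?lang=en-US"); // hypothetical endpoint
  socket.onmessage = (event) => {
    const { transcript, isFinal } = JSON.parse(event.data); // assumed result shape
    console.log(isFinal ? "final:" : "interim:", transcript);
  };

  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=opus" });
  recorder.ondataavailable = (event) => {
    if (socket.readyState === WebSocket.OPEN) socket.send(event.data);
  };
  recorder.start(250); // emit an audio chunk roughly every 250 ms
}
```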
Server-side Speech Synthesis
These scenarios are new to the Web Speech API and involve a client sending text, SSML or hypertext to a server which performs speech synthesis and streams audio to the client.
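A sketch of requesting synthesis from a server and playing the returned audio; the endpoint and request format are assumptions.

```typescript
// Send SSML to a synthesis server and play the audio it returns.
async function speakFromServer(ssml: string): Promise<void> {
  const response = await fetch("https://example.com/synthesize", { // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/ssml+xml" },
    body: ssml,
  });
  const audio = new Audio(URL.createObjectURL(await response.blob()));
  await audio.play();
}
```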
Server-side Translation
These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a server for translation into text, SSML, hypertext or audio.
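A sketch of server-side text translation from the client's point of view; the endpoint and payload are assumptions.

```typescript
// Ask a first-party server to translate text from one language to another.
async function translateOnServer(text: string, from: string, to: string): Promise<string> {
  const response = await fetch("https://example.com/translate", { // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, from, to }),
  });
  return (await response.json()).translation; // assumed response shape
}
```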
Third-party Scenarios
Third-party Speech Recognition
These scenarios are new to the Web Speech API and involve one or more audio streams from a client or server being streamed to a third-party service which performs speech recognition, providing speech recognition results to the client or server.
Third-party Speech Synthesis
These scenarios are new to the Web Speech API and involve a client or server sending text, SSML or hypertext to a third-party service which performs speech synthesis and streams audio to the client or server.
Third-party Translation
These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a third-party translation service for translation into text, SSML, hypertext or audio.
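The third-party pattern is sketched below from the perspective of a first-party server that forwards text to a third-party translation service and relays the result; the service URL, credential and response shape are all assumptions. This is exactly the kind of data flow that notification and permission surfaces should make visible to users, since their data leaves the first party.

```typescript
// A first-party server forwards the client's text to a third-party translation service.
const THIRD_PARTY_API_KEY = "…"; // hypothetical credential, kept on the first-party server

async function translateViaThirdParty(text: string, from: string, to: string): Promise<string> {
  const response = await fetch("https://third-party.example/v1/translate", { // hypothetical service
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${THIRD_PARTY_API_KEY}`,
    },
    body: JSON.stringify({ text, source: from, target: to }),
  });
  return (await response.json()).translatedText; // assumed response shape
}
```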
Hyperlinks
Amazon Web Services
Google Cloud AI
IBM Watson Products and Services
Microsoft Cognitive Services
Real Time Translation in WebRTC