Add ptime cmdline arg #357

Open
piotrgregor wants to merge 4 commits into main
Conversation

@piotrgregor commented on Jul 9, 2024

Currently, the example code sends audio in 64k chunks every second.
However, in real-time audio processing scenarios audio is read at different intervals, e.g. every 20 ms in VoIP. As a user I would like to use the code example to see and experiment with the speech-to-text feature working the way it will be integrated with my real-time audio processing (with a particular sampling rate and ptime).

To provide additional context: I work on text/speech processing in VoIP, where the interval at which audio is handled is dictated by the packetization time setting (ptime). Most often this is set to 20 ms, so audio is processed in 20 ms packets on a call. The example code speech/api/streaming_transcribe.cc on GoogleCloudPlatform sends audio at fixed 1-second intervals. I need to know whether the speech-to-text example will work when I send packets as they arrive on my infrastructure, with a different ptime and packet size, or whether I need to implement buffering to send them in exactly 1-second 64k chunks as the example does. It is understood that the speech-to-text result is mostly driven by the accuracy of the underlying speech-to-text method (model/AI) applied to the speech, and ideally it is not impacted by audio packetization, but as an integrator I need to verify my particular case, and it would be great if the code example let me mirror the audio processing in my environment as closely as possible.

This PR adds support for a ptime command line argument, so users can experiment with real-time audio at various settings. When ptime is set for a file in RAW or ULAW encoding, packets are now sent with a size and time interval reflecting the ptime and sampling rate, as sketched below. (I did not apply this to AMR, FLAC, and AMR-WB, because the number of bytes to send per ptime with those codecs depends on additional settings: the encoding mode for AMR/AMR-WB and the compression ratio for FLAC.)
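For illustration, here is a minimal sketch of the chunking and pacing logic described above (not the actual diff in this PR), assuming 16-bit LINEAR16 RAW input; kSampleRateHz, kPtimeMs, and SendChunk are hypothetical placeholders:

// Minimal sketch, not the PR's implementation: read a RAW (16-bit LINEAR16)
// file in ptime-sized chunks and pace them in real time.
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical placeholder for writing one chunk to the streaming request.
void SendChunk(std::string const& chunk) {
  std::cout << "Sending " << chunk.size() << " bytes.\n";
}

int main() {
  int const kSampleRateHz = 16000;  // --bitrate
  int const kPtimeMs = 20;          // --ptime
  int const kBytesPerSample = 2;    // 16-bit linear PCM
  // Bytes per ptime interval: 16000 * 20 / 1000 * 2 = 640.
  std::size_t const chunk_size = kSampleRateHz * kPtimeMs / 1000 * kBytesPerSample;

  std::ifstream file("resources/audio2.raw", std::ios::binary);
  std::vector<char> buffer(chunk_size);
  while (file.read(buffer.data(), static_cast<std::streamsize>(buffer.size())) ||
         file.gcount() > 0) {
    SendChunk(std::string(buffer.data(), file.gcount()));
    // Sleep one ptime interval so chunks are sent at real-time pace.
    std::this_thread::sleep_for(std::chrono::milliseconds(kPtimeMs));
  }
  return 0;
}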

% .build/streaming_transcribe --help       

Standard C++ exception thrown: the option '--path' is required but missing
Usage:
  streaming_transcribe [--bitrate N] [--ptime N] audio.(raw|ulaw|flac|amr|awb)

Example 1. Using ptime 20 ms:

% .build/streaming_transcribe --bitrate 16000 --ptime 20 resources/audio2.raw

Sending 640 bytes.
Sending 640 bytes.
Sending 640 bytes.
(...)
Sending 640 bytes.
Sending 640 bytes.
Sending 640 bytes.

Result stability: 0
0.986006        the rain in Spain stays mainly on the plain
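(Assuming 16-bit linear PCM at 16000 Hz, a 20 ms chunk works out to 16000 samples/s × 0.020 s × 2 bytes/sample = 640 bytes, matching the output above.)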

Example 2. Using ptime 200 ms:

% .build/streaming_transcribe --bitrate 16000 --ptime 200 resources/audio2.raw

Sending 6400 bytes.
Sending 6400 bytes.
Sending 6400 bytes.
(...)
Sending 6400 bytes.
Sending 6400 bytes.
Sending 6400 bytes.

Result stability: 0
0.986006        the rain in Spain stays mainly on the plain
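(Likewise, a 200 ms chunk is 16000 samples/s × 0.200 s × 2 bytes/sample = 6400 bytes, and the transcription result is the same as in Example 1.)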

@piotrgregor requested a review from a team as a code owner on July 9, 2024 10:09
@coryan (Contributor) commented on Jul 9, 2024

/gcbrun

@dbolduc (Member) commented on Jul 9, 2024

/gcbrun

@dbolduc (Member) commented on Jul 9, 2024

/gcbrun
