Start/stop recording from the backend. Add guide on conversational chatbots (#9419)

* Add code
* stop displatch
* first draft
* edit
* add changeset
* lint
* Docstring
* Make recording
* fix video
* fix guide link
* redirect
* add changeset

Co-authored-by: gradio-pr-bot <[email protected]>
1 parent 4d75f02 · commit 018c140 · Showing 12 changed files with 228 additions and 5 deletions.
@@ -0,0 +1,7 @@
---
"@gradio/audio": minor
"gradio": minor
"website": minor
---

feat: Start/stop recording from the backend. Add guide on conversational chatbots
@@ -0,0 +1,189 @@
# Building Conversational Chatbots with Gradio

Tags: AUDIO, STREAMING, CHATBOTS

## Introduction

The next generation of AI user interfaces is moving towards audio-native experiences. Users will be able to speak to chatbots and receive spoken responses in return. Several models have been built under this paradigm, including GPT-4o and [mini omni](https://github.com/gpt-omni/mini-omni).

In this guide, we'll walk you through building your own conversational chat application using mini omni as an example. You can see a demo of the finished app below:

<video src="https://github.com/user-attachments/assets/db36f4db-7535-49f1-a2dd-bd36c487ebdf" controls
height="600" width="600" style="display: block; margin: auto;" autoplay="true" loop="true">
</video>

## Application Overview

Our application will enable the following user experience:

1. Users click a button to start recording their message
2. The app detects when the user has finished speaking and stops recording
3. The user's audio is passed to the mini omni model, which streams back a response
4. After mini omni finishes speaking, the user's microphone is reactivated
5. All previous spoken audio, from both the user and mini omni, is displayed in a chatbot component

Let's dive into the implementation details.

## Processing User Audio

We'll stream the user's audio from their microphone to the server and determine if the user has stopped speaking on each new chunk of audio.

Here's our `process_audio` function:

```python
import gradio as gr
import numpy as np
from utils import determine_pause


def process_audio(audio: tuple, state: AppState):
    if state.stream is None:
        # First chunk: initialize the stream and record the sampling rate
        state.stream = audio[1]
        state.sampling_rate = audio[0]
    else:
        # Subsequent chunks: append to the running stream
        state.stream = np.concatenate((state.stream, audio[1]))

    pause_detected = determine_pause(state.stream, state.sampling_rate, state)
    state.pause_detected = pause_detected

    if state.pause_detected and state.started_talking:
        # Stop recording once the user has spoken and then paused
        return gr.Audio(recording=False), state
    return None, state
```

This function takes two inputs:

1. The current audio chunk (a tuple of `(sampling_rate, numpy array of audio)`)
2. The current application state
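For illustration, here is what one such chunk tuple might look like. The sampling rate and contents are made-up values; `gr.Audio(type="numpy")` supplies the real ones per chunk:

```python
import numpy as np

# Hypothetical values: a 0.5 s chunk of silence at a 16 kHz sampling rate,
# in the (sampling_rate, samples) tuple format streamed to the server
sampling_rate = 16000
samples = np.zeros(sampling_rate // 2, dtype=np.int16)
audio_chunk = (sampling_rate, samples)

print(len(audio_chunk[1]) / audio_chunk[0])  # → 0.5 (seconds of audio)
```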

We'll use the following `AppState` dataclass to manage our application state:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class AppState:
    stream: np.ndarray | None = None
    sampling_rate: int = 0
    pause_detected: bool = False
    started_talking: bool = False  # set once speech is detected in the stream
    stopped: bool = False
    # Mutable defaults like `[]` are not allowed on dataclass fields,
    # so we use a default_factory instead
    conversation: list = field(default_factory=list)
```

The function concatenates new audio chunks to the existing stream and checks if the user has stopped speaking. If a pause is detected, it returns an update to stop recording. Otherwise, it returns `None` to indicate no changes.

The implementation of the `determine_pause` function is specific to the omni-mini project and can be found [here](https://huggingface.co/spaces/gradio/omni-mini/blob/eb027808c7bfe5179b46d9352e3fa1813a45f7c3/app.py#L98).
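If you want a self-contained starting point before plugging in the real logic, a much simpler energy-based heuristic can stand in for it. The sketch below is purely illustrative, not the omni-mini implementation, and its window length and threshold are arbitrary assumptions:

```python
import numpy as np


def simple_pause_detector(stream: np.ndarray, sampling_rate: int,
                          pause_seconds: float = 1.0,
                          threshold: float = 0.02) -> bool:
    """Treat the user as paused if the trailing `pause_seconds` of int16
    audio stay below an RMS energy threshold."""
    tail = stream[-int(pause_seconds * sampling_rate):]
    if tail.size == 0:
        return False
    samples = tail.astype(np.float32) / 32768.0  # normalize int16 to [-1, 1]
    rms = float(np.sqrt(np.mean(samples ** 2)))
    return rms < threshold
```

A production detector would also track whether the user has started talking at all, which is what the `started_talking` flag captures.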

## Generating the Response

After processing the user's audio, we need to generate and stream the chatbot's response. Here's our `response` function:

```python
import io
import tempfile

from pydub import AudioSegment


def response(state: AppState):
    if not state.pause_detected and not state.started_talking:
        return None, AppState()

    audio_buffer = io.BytesIO()

    segment = AudioSegment(
        state.stream.tobytes(),
        frame_rate=state.sampling_rate,
        sample_width=state.stream.dtype.itemsize,
        channels=(1 if len(state.stream.shape) == 1 else state.stream.shape[1]),
    )
    segment.export(audio_buffer, format="wav")

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio_buffer.getvalue())

    state.conversation.append({"role": "user",
                               "content": {"path": f.name,
                                           "mime_type": "audio/wav"}})

    output_buffer = b""

    for mp3_bytes in speaking(audio_buffer.getvalue()):
        output_buffer += mp3_bytes
        yield mp3_bytes, state

    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        f.write(output_buffer)

    state.conversation.append({"role": "assistant",
                               "content": {"path": f.name,
                                           "mime_type": "audio/mp3"}})
    yield None, AppState(conversation=state.conversation)
```

This function:

1. Converts the user's audio to a WAV file
2. Adds the user's message to the conversation history
3. Generates and streams the chatbot's response using the `speaking` function
4. Saves the chatbot's response as an MP3 file
5. Adds the chatbot's response to the conversation history
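After one full exchange, the conversation history follows `gr.Chatbot`'s "messages" format, with each turn's audio referenced by file path and MIME type. A sketch of what it looks like (the file paths here are made-up; in the app they come from `NamedTemporaryFile`):

```python
# Hypothetical paths standing in for the temp files written by `response`
conversation = [
    {"role": "user",
     "content": {"path": "/tmp/user_turn.wav", "mime_type": "audio/wav"}},
    {"role": "assistant",
     "content": {"path": "/tmp/reply.mp3", "mime_type": "audio/mp3"}},
]
```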

Note: The implementation of the `speaking` function is specific to the omni-mini project and can be found [here](https://huggingface.co/spaces/gradio/omni-mini/blob/main/app.py#L116).
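To wire up and test the rest of the app before plugging in a real model, any generator with the same shape (audio bytes in, MP3-byte chunks out) will do. This stand-in is purely hypothetical and yields placeholder bytes rather than real MP3 audio:

```python
def fake_speaking(audio_bytes: bytes, chunk_size: int = 1024):
    """Hypothetical stand-in for `speaking`: stream a canned response in chunks."""
    canned_response = b"\x00" * 4096  # placeholder; a real model would synthesize MP3 bytes
    for i in range(0, len(canned_response), chunk_size):
        yield canned_response[i:i + chunk_size]
```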

## Building the Gradio App

Now let's put it all together using Gradio's Blocks API:

```python
import gradio as gr


def start_recording_user(state: AppState):
    if not state.stopped:
        return gr.Audio(recording=True)


with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            input_audio = gr.Audio(
                label="Input Audio", sources="microphone", type="numpy"
            )
        with gr.Column():
            chatbot = gr.Chatbot(label="Conversation", type="messages")
            output_audio = gr.Audio(label="Output Audio", streaming=True, autoplay=True)
    state = gr.State(value=AppState())

    stream = input_audio.stream(
        process_audio,
        [input_audio, state],
        [input_audio, state],
        stream_every=0.5,
        time_limit=30,
    )
    respond = input_audio.stop_recording(
        response,
        [state],
        [output_audio, state]
    )
    respond.then(lambda s: s.conversation, [state], [chatbot])

    restart = output_audio.stop(
        start_recording_user,
        [state],
        [input_audio]
    )
    cancel = gr.Button("Stop Conversation", variant="stop")
    cancel.click(lambda: (AppState(stopped=True), gr.Audio(recording=False)), None,
                 [state, input_audio], cancels=[respond, restart])

if __name__ == "__main__":
    demo.launch()
```

This setup creates a user interface with:

- An input audio component for recording user messages
- A chatbot component to display the conversation history
- An output audio component for the chatbot's responses
- A button to stop and reset the conversation

The app streams user audio in 0.5-second chunks, processes it, generates responses, and updates the conversation history accordingly.
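To make that timing concrete, here is how 0.5-second chunks accumulate in `state.stream` across stream events, mirroring the `np.concatenate` logic in `process_audio` (the 16 kHz sampling rate is a hypothetical value; `gr.Audio` reports the real one per chunk):

```python
import numpy as np

sampling_rate = 16000  # hypothetical rate for illustration
chunk = np.zeros(sampling_rate // 2, dtype=np.int16)  # one 0.5 s stream event

stream = None
for _ in range(4):  # four stream events = 2 seconds of buffered audio
    stream = chunk if stream is None else np.concatenate((stream, chunk))

print(stream.shape[0] / sampling_rate)  # → 2.0
```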

## Conclusion

This guide demonstrates how to build a conversational chatbot application using Gradio and the mini omni model. You can adapt this framework to create various audio-based chatbot demos. To see the full application in action, visit the Hugging Face Spaces demo: https://huggingface.co/spaces/gradio/omni-mini

Feel free to experiment with different models, audio processing techniques, or user interface designs to create your own unique conversational AI experiences!
File renamed without changes.