Same thing. Everyone is doing it. They give some info about what they are doing here. They have a turn-taking model.
This is the best one so far. Very, very fast. But interruption handling is not great: it speaks over me a lot.
Similar to Retell AI, they also have an endpointing model.
Allows voice cloning.
A project by LAION AI. The codebase is not very clean. Seems like a standalone project. Very talented people working on it. Good discussion and research on their Discord. Interruptions aren't really robust to "umms" and "ahs". VAD is based on signal strength. They compute attention KVs while STT is still running, which is good. They say the biggest bottleneck is waiting for a full sentence before sending it to TTS. Their roadmap is very cool. They aim for a speech-to-speech model for chat.
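To make the signal-strength VAD point concrete, here is a minimal sketch of an energy-threshold VAD. This is not their code: the names `frame_rms` and `energy_vad`, the frame size, the threshold, and the hangover count are all assumptions for illustration. It may also show why this kind of VAD struggles with "umms" and "ahs": a filler sound is about as loud as real speech, so it crosses the threshold the same way.

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_vad(samples, frame_size=160, threshold=0.02, hangover=3):
    """Return one bool per full frame: True while speech is considered active.

    All parameters are illustrative assumptions. `hangover` keeps the flag on
    for that many quiet frames after the last loud one, so short pauses do not
    immediately end the turn.
    """
    flags = []
    quiet = hangover + 1  # start in the silence state
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        if frame_rms(frame) >= threshold:
            quiet = 0      # loud frame: reset the quiet counter
        else:
            quiet += 1     # quiet frame: count towards end of speech
        flags.append(quiet <= hangover)
    return flags

# Two silent frames, two loud frames, five silent frames.
audio = [0.0] * 320 + [0.5] * 320 + [0.0] * 800
flags = energy_vad(audio, frame_size=160, threshold=0.02, hangover=2)
```

A learned endpointing or turn-taking model, like the ones mentioned in the other notes, would replace the fixed threshold with a classifier over audio or transcript features.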
There is also a company offering a hosted version. Codebase is bad again. Not clear if they support interruptions.
Very, very similar to ovc. Codebase is a little complicated. Business model is good. Good features: has Twilio support and uses TTS, STT, and LLM services. Also supports local models. Also very fast.
Making speech language models. Very impressive. We should also train adapters for STT -> LLM and LLM -> TTS models.