Hear Me Out: The Potential of Low-Latency Voice AI

Picture this: two users, same exact need – to get advice on a health issue.

User 1

User 1 opens up a text interface. Types in their symptoms, medical history, the works. Maybe they're a little embarrassed, but hey, no one's watching. They take their time, make sure they don't leave anything out.

The AI comes back with a detailed response. User 1 reads it once, twice, a few times. Lets it sink in. They highlight the key points, the action items. They feel informed, empowered. They've got a plan.

User 2

Now User 2, they go for voice. They start explaining their symptoms, and the AI jumps in with clarifying questions. It's a back-and-forth, a real conversation. User 2 feels heard, understood.

The AI shares its advice. User 2 listens intently. It's like the AI is right there in the room with them, guiding them. The inflection, the pauses, it all lands differently. User 2 feels cared for, supported.

Same need, two very different experiences. All because of the interface.

So what does this mean for those of us building AI tools and experiences? It means we have to think deeply about the core characteristics of these modalities and how they shape the user experience.

Here are some differences between voice and text interfaces to consider:

Embodiment

Voice interfaces provide a stronger sense of the AI's presence and personality, making the interaction feel more personal. In contrast, text interfaces may feel more detached or impersonal, as the AI's presence is less tangible.

Emotional Weight

Voice has the power to evoke strong emotions. The right tone, the right inflection, can make all the difference in how a message lands. This emotional weight can be particularly impactful in applications where building trust, rapport, or emotional connection is important, such as mental health support or personal development. Text interfaces, while still capable of conveying emotion through language, may have a less immediate and powerful emotional impact.

Privacy

Voice can be overheard by others in the vicinity. This lack of privacy may influence the types of topics or information users feel comfortable discussing with AI through voice. Text interfaces, on the other hand, might provide a greater sense of privacy and anonymity, allowing users to engage with sensitive or personal topics more freely. The social context in which the interface will be used should be taken into account.

Information Presentation

Text lets you go deep. Users can go at their own pace, revisit information as needed. It's ideal for complex topics that require reflection and analysis. Voice, on the other hand, is linear and transient: listeners can't skim ahead or glance back, so it's better suited to simpler exchanges. The modality should be matched to the depth of information being presented.

Retention and Reference

Voice interfaces feel ephemeral. Consequently, they may be better suited for in-the-moment interactions, where the focus is on the immediate exchange of ideas or experiences. Text interfaces, with their persistent nature, allow for easier retention and review of information.

The choice between voice and text interfaces in AI interactions should be an informed one. The modality becomes the message, shaping how users perceive, process, and internalize the information provided by the AI. The choice is not neutral. It carries implicit assumptions about the user's needs, capabilities, and context. 

With OpenAI's recent announcement of GPT-4o, voice interfaces are now top of mind. If you haven't seen the demos, check them out - they show AI understanding sarcasm, modulating its voice (speaking fast, singing, whispering), and more. GPT-4o's multimodal architecture allows a deeper understanding of the input than a cascaded pipeline can achieve. By ingesting audio directly, GPT-4o can pick up on nuance, emotion, and subtleties that are lost when speech is first transcribed to text. That richer input then informs the AI's response, which can be synthesized with the desired speech qualities.
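
The distinction between a cascaded pipeline and direct audio ingestion can be sketched in a few lines. To be clear, everything below is a hypothetical stand-in - the `AudioInput` fields, `transcribe`, and both pipeline functions are illustrative, not any real API. The point is the information boundary: a cascaded pipeline reduces audio to words before the model ever sees it, while an end-to-end multimodal model consumes the signal itself.

```python
from dataclasses import dataclass

# Hypothetical stand-in for an audio signal, decomposed into what was
# said plus the prosodic cues riding on top of the words.
@dataclass
class AudioInput:
    text: str    # the words themselves
    pitch: str   # e.g. "rising", "falling"
    pace: str    # e.g. "fast", "slow"
    tone: str    # e.g. "sarcastic", "neutral"

def transcribe(audio: AudioInput) -> str:
    """Cascaded step 1: speech-to-text keeps only the words."""
    return audio.text

def cascaded_pipeline(audio: AudioInput) -> list[str]:
    """STT -> LLM -> TTS: the model receives words alone."""
    # Pitch, pace, and tone are discarded at the transcription boundary.
    return [transcribe(audio)]

def end_to_end_pipeline(audio: AudioInput) -> list[str]:
    """A multimodal model ingests the audio signal directly."""
    # Nothing is discarded; prosody reaches the model alongside the words.
    return [audio.text, audio.pitch, audio.pace, audio.tone]

utterance = AudioInput(text="Oh, great.", pitch="falling",
                       pace="slow", tone="sarcastic")
print(cascaded_pipeline(utterance))    # words only
print(end_to_end_pipeline(utterance))  # words plus prosody
```

In the cascaded case, "Oh, great." arrives at the model stripped of the falling pitch and slow pace that mark it as sarcasm - which is exactly the kind of subtlety the GPT-4o demos showed the model catching.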

(We note here that this is all still preliminary. OpenAI's demos were impressive, but the voice interface has not yet been officially released to customers.)

As AI continues to evolve, we'll likely see new modalities and experiences open up. There are no established best practices for how to build these experiences – it's uncharted territory, and that's exciting. In the end, it's not about the technology. It's about the people using it.

— Vishnu Bashyam, ML Researcher @ Hop