Introduction
Recently, many people have asked the question, “If I strap GPT-4 onto a phone call, will it work?” Given how advanced LLMs have become, you might think the answer is a resounding yes. Spoiler alert: it’s not. Luckily, recent developments in transcription, text-to-speech, and large language models have made it possible - for the first time - for anyone to create a rudimentary AI calling system.
Today, we will define what an AI phone call is, explore the tools and technologies available to build one, and then discuss the practical steps required to put one into production. By the end of this guide, you’ll understand how your startup, enterprise, or small business can start integrating AI phone calls right now. And for our more technical friends, stick around for working code examples you can build a custom system on top of!
AI phone calls & robocalls
Robocalls are an absolute plague, primarily because they sound fake and are fully scripted. When you make or receive a phone call and a tinny-sounding voice answers, you know you’re about to waste the next 10 minutes of your life dealing with an incompetent phone system - when all you want is to speak with a helpful human.
AI agents are the antidote. Because language models are trained on a wide range of subjects and can be fine-tuned on human conversations, they’re exceptionally good at holding engaging discussions, following instructions, and offering help. When super-powered with additional context - like your purchase history and size preferences - LLMs can match the best customer support agent, salesperson, and even therapist.
How to create an AI phone call agent: your own conversational AI
Let’s break conversations down to a granular level and examine how they work. In this example, let’s say Paul is talking to Sam.
- Paul: says some stuff about startups
- Sam: listens to Paul’s advice
- Sam: interprets Paul’s statement and generates a response (likely as a spoken tweet)
- Sam: vocalizes his response
The steps to building an AI phone calling system are similar:
- Shove the speaker’s audio into a speech recognition model - like OpenAI’s Whisper - to convert the audio into a text transcript.
- Send the text transcript to a large language model (like GPT-3.5 Turbo) to interpret the speaker’s statement and generate a response.
- Send the response to a text-to-speech model (ElevenLabs is the state of the art) to convert the written response into audio, then finish by sending that audio back to the caller (likely via Twilio). A minimal code sketch of this pipeline follows below.
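To make those three steps concrete, here is a minimal, non-streaming sketch in Python. It assumes you have OpenAI and ElevenLabs API keys plus a recorded chunk of caller audio on disk; the ElevenLabs request follows their public text-to-speech REST endpoint, but check the current docs for the exact schema, and the helper name handle_turn is our own.

```python
# pip install openai requests
import requests
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def handle_turn(audio_path: str, elevenlabs_key: str, voice_id: str) -> bytes:
    """One naive conversational turn: transcribe -> think -> speak."""
    # 1. Speech recognition: convert the caller's audio into a text transcript.
    with open(audio_path, "rb") as f:
        transcript = openai_client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Language model: interpret the transcript and generate a reply.
    chat = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a friendly phone agent. Keep replies short and conversational."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply = chat.choices[0].message.content

    # 3. Text-to-speech: turn the written reply into audio via ElevenLabs' REST API.
    tts = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        headers={"xi-api-key": elevenlabs_key},
        json={"text": reply},
        timeout=30,
    )
    tts.raise_for_status()
    return tts.content  # audio bytes to stream back to the caller, e.g. via Twilio
```

Everything here runs sequentially and in batch, which is exactly why a naive chain like this feels slow and clumsy on a live call.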
Unfortunately, when you chain those three components - the speech recognition, language, and text-to-speech models - the output is useless! Foundation models are ineffective until you provide them with heuristics for understanding human conversation. The next step is to add logic to detect when the counterparty finishes speaking and to differentiate interruptions from affirmations (e.g., “hold on!” vs. “hmm”). This logic sounds simple but gets complicated - especially when you throw in multiple speakers and background noise.
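To illustrate one slice of that logic, here is a toy heuristic - our own, not from any library - for deciding whether an interim transcript should interrupt the agent or be ignored as a backchannel. Real systems layer this on top of voice activity detection and timing, but the idea is the same:

```python
# A toy heuristic for telling interruptions apart from affirmations.
AFFIRMATIONS = {"hmm", "mm-hmm", "mhm", "uh-huh", "yeah", "yep", "ok", "okay", "right", "sure"}


def should_interrupt(interim_transcript: str, agent_is_speaking: bool) -> bool:
    if not agent_is_speaking:
        return False  # nothing to interrupt; treat it as the caller's next turn
    words = [w.strip(".,!?").lower() for w in interim_transcript.split()]
    if not words:
        return False
    # One or two backchannel words ("hmm", "uh-huh") are affirmations: keep talking.
    if len(words) <= 2 and all(w in AFFIRMATIONS for w in words):
        return False
    # Anything more substantive ("hold on!") is an interruption: stop and listen.
    return True


assert should_interrupt("hold on!", agent_is_speaking=True)
assert not should_interrupt("mm-hmm", agent_is_speaking=True)
```

With this in place, a “hold on!” cuts the agent off mid-sentence, while an “mm-hmm” lets it keep talking.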
Code Example: combining transcription, chat, and text-to-speech models to create conversational AI
Yacine I. - AKA Kache on Twitter - recently created Talk, an open-source repository for AI conversations.
Having reviewed the processes that underlie AI phone calls, we can now examine a practical application. Here is an example that takes a speaker's input, processes it with an LLM, generates a response, and then converts that response to speech.
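The snippet below is a simplified Python sketch of that flow rather than the actual Talk source. It assumes llama-cpp-python with local Llama weights, and the names respond, interrupt_callback (the interruptCallback described below), and play_audio are illustrative.

```python
# pip install llama-cpp-python
from typing import Callable, Optional

from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf")  # any local Llama weights


def respond(prompt: str, user_input: str,
            interrupt_callback: Optional[Callable[[], bool]] = None,
            play_audio: Callable[[str], None] = print) -> str:
    # 1. Invoke the LLM with the prompt plus the caller's latest input;
    #    it streams tokens back one chunk at a time.
    stream = llm.create_completion(
        f"{prompt}\nUser: {user_input}\nAssistant:", max_tokens=128, stream=True
    )
    response = ""
    for chunk in stream:
        # 2. Stop inference mid-stream if external logic reports an interruption
        #    (e.g. the caller says "Hold on!").
        if interrupt_callback is not None and interrupt_callback():
            break
        response += chunk["choices"][0]["text"]
    # 3. Hand the finished text to the playback callback. A real system would
    #    synthesize audio here and play it back to the caller.
    play_audio(response)
    return response


# Example usage:
respond("You are a helpful phone agent.", "Hey, can you book me a table for two?")
```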
Code explanation
Let’s explain that block of code in simple English:
- Integration with LLM: The function starts by invoking the LLM (in this case, Llama by Meta) with the provided prompt and input. Llama generates a stream of tokens (words) to send back as a response.
- Handling Interruptions and Affirmations: An optional interruptCallback allows the function to decide whether to stop the LLM inference (response generation) based on external logic, like a caller saying, "Hold on!"
- Audio Playback and Callbacks: Once the audio gets generated, it’s played back to the user, simulating the AI agent "speaking" its response.
Putting an AI phone agent into production: practical considerations
So, fantastic, let’s assume we’ve now built an AI phone calling agent - an AI personal assistant - that can make phone calls to anyone for any task. Can your business just start using it?
Well, there are a couple of other problems you need to consider:
- Out of the box, latency will be too high. You’ll need to cut down your AI’s response time from 4 seconds to under 1 second if you want to provide a good customer experience.
- You must build enterprise-grade observability and testing tools to define your agents’ expected behavior, catch them when they inevitably do the wrong thing, and quickly ship improvements.
The combination of these two problems has prevented many of the "voicebot" tools we’ve seen to date from progressing beyond hackathon projects into production-ready applications deployed in the real world. If your agent is too slow or doesn’t have guardrails, customers will naturally have bad experiences. And if you don’t have a system for defining and tracking your agent’s behavior, you’ll never be able to spot and fix issues when they occur.
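One common way to attack the latency problem - independent of any particular vendor - is to stop waiting for the full LLM response and instead synthesize speech sentence by sentence as tokens stream in. Here is a sketch assuming OpenAI's streaming chat API and a speak function of your own that handles text-to-speech and playback:

```python
import re

from openai import OpenAI

client = OpenAI()


def stream_reply(messages: list[dict], speak) -> None:
    """Send each completed sentence to TTS as soon as it appears, instead of
    waiting for the full reply. `speak` is your own TTS + playback function."""
    buffer = ""
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, stream=True
    )
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Flush on sentence boundaries so the caller hears audio right away.
        while (match := re.search(r"[.!?]\s", buffer)):
            speak(buffer[: match.end()].strip())
            buffer = buffer[match.end():]
    if buffer.strip():
        speak(buffer.strip())
```

The caller starts hearing the first sentence while the rest of the reply is still being generated, which is usually the single biggest latency win in a pipeline like this.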
How you can integrate AI calls right now with Bland, the API for AI Calls
At Bland, we’ve built the API for AI phone calling. We use optimized foundation models, custom infrastructure, and in-house observability tools, enabling developers to easily add low-latency AI calls to both existing and new applications.
Whether you want to build an AI personal assistant, AI customer support representative, or even an AI secretary to schedule meetings, you can build it all with Bland. Click here to get started!
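To give a sense of what getting started looks like, here is a minimal example of dispatching a call. The endpoint and field names below are illustrative; check the API reference for the exact schema and the full set of options:

```python
# Illustrative only: confirm the endpoint and fields against the current API reference.
import os

import requests

response = requests.post(
    "https://api.bland.ai/v1/calls",
    headers={"authorization": os.environ["BLAND_API_KEY"]},
    json={
        "phone_number": "+15555550123",  # who the agent should call
        "task": "Call the restaurant and book a table for two at 7pm tonight.",
    },
    timeout=30,
)
print(response.json())
```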