Demo video
Click here for a demo of this AI interaction on my channel on YouTube
Introduction
The concept of AI interacting with humans through natural language is no longer confined to the realms of science fiction. Today, we find ourselves on the brink of an era where machines can not only understand and respond to our words but can even emulate human-like conversations to fulfill real-world tasks over the phone.
The Historical Perspective: Google’s Duplex
In 2018, Google introduced Duplex, an AI system that could make reservations at restaurants, book appointments, and more, all over the phone. It was one of the first prominent attempts to bridge the gap between machine and human interaction in real-life scenarios.
Duplex’s demonstrations were nothing short of awe-inspiring. The system could understand complex speech patterns, pauses, filler words, and could even mimic human-like intonations. The goal was to make the interaction indistinguishable from a human caller.
However, the initial launch of Duplex also raised several ethical and technological questions:
- Should the recipient know that they are speaking to a machine?
- Can this be misused for negative actions such as robocalling or scams?
- Can this be practically integrated into existing solutions?
Though groundbreaking, Duplex was seen as more of a specialized tool rather than a versatile platform.
The Current State: Advancements and Opportunities
Fast forward to the present, and the landscape of AI-driven communication has evolved dramatically. With advancements in natural language processing (NLP), deep learning, and cloud computing, creating a system similar to Duplex is no longer a herculean task. In fact, it can be quite easily built using AWS and any current generative AI model.
Technologies like Amazon Lex, ChatGPT, and Amazon Connect have made it possible to build intelligent chat bots capable of placing phone calls and interacting with real people. These tools offer a range of customization, scalability, and functionality that allows anyone to leverage the power of AI in their customer interactions.
In this comprehensive guide, we’ll explore how you can create an AI system that autonomously places phone calls to perform tasks, combining the cutting-edge capabilities of Amazon Lex, ChatGPT, and Amazon Connect. We’ll delve into the architecture, implementation, and considerations to transform the way you automate essential services.
By tracing the journey from Google’s Duplex to the current state-of-the-art technologies, we’ll see how far we’ve come and where we’re headed in the exciting field of AI-powered communication.
So lets get stuck in and build the solution!
Architecture Overview
- The user interacts with a chatbot via a web interface. In our example here we will use the built in chat interface in Amazon Connect, however any chat interface could be used and designed here.
- Interactions with the chatbot are handled using the ChatGPT interface we have previously built. This uses stored prompts the engage ChatGPT and to provide responses.
- When the chatbot has sufficient information about the task it will perform, a Task Management Lambda function stores the information in DynamoDB.
- DynamoDB streams trigger an Outbound Call Manager Lambda function that is able to place outbound calls in Amazon Connect.
- The outbound call is delivered to a real person, and all interaction is now handled by the AI.
- ChatGPT (as out generative AI of choice) is now engaged again, this time to interact with the real person. Amazon connect is using Polly to speak the responses to the receptionist.
- When the conversation is complete, the Task Management Lambda is again called to update the Task status and result in DynamoDB.
- A Lambda is triggered to update the result of the task back to the original chatbot.
- The user is informed of the result in their original chat.
We can now break this interaction into four logical areas.
1. Initial Interaction with User: Gathering Task Information and Storing Details
Interaction with Amazon Lex Bot and Integration with ChatGPT
The initial interaction with the user is the first step in this process. Amazon Lex is employed to execute the conversation, while ChatGPT has been added in to allow Lex to sound more natural and human-like. Here’s a more detailed breakdown:
Understanding User Intent without Overwhelming Them:
- The Lex bot using ChatGPT is designed to ask targeted questions to gather only the essential information needed, without overloading the user with queries.
-
Leveraging the conversational ability of ChatGPT, the bot can adapt to user responses, ensuring that the conversation flows naturally.
- Example Interaction:
- User: “I need to book a doctor’s appointment.”
- Bot: “Certainly! What date works for you?”
- User: “Next Monday.”
- Bot: “Morning or afternoon?”
- User: “Morning, please.”
- Bot: “Great! I will let you know when its done.”
Using ChatGPT via Lambda and Lex to gather initial information:
I have discussed using ChatGPT to gather the information from a user in previous blog posts. In this example we are just going to use Lex and allow it to pass all requests to ChatGPT. We prime ChatGPT with some minimal context about what services it knows how to call (essentially a phone book listing), and then everything from that point on is just handled by the AI.
The basic prompt we are using to gather all of the information about the task here is something like:
You are a helpful assistant bot. You want to gather information about the task that a user wants you to perform with the minimal amount of questions.
Phone numbers available:
Doctors: +64412345678
Pizza: +6448765432
Hair: +64467567876
French Restaurant: +64434523423
The user has just said: I want to book a doctors appointment.
The JSON object returned should be in this format without deviation:
{
"PhoneNumberToCall":"String",
"ServiceName":"String",
"AdditionalQuestion":"String",
"HasAdditionalQuestion":"Boolean",
"FinalMessage":String",
"TaskToPerform":"String"
}
Do not include any explanations, only provide a RFC8259 compliant JSON response.
Note that this is a fully populated prompt, with the information about services and what the user said already populated. The actual prompt we use in our solution is templated so we can insert the list of services and the user prompts that come from Lex dynamically. See my previous blog post for more details about how this works.
When we interact with this bot, we see behaviour such as this:
Storing Information in Amazon DynamoDB and Decoupling Next Stages
Once the necessary details are collected, they are stored in a DynamoDB table. Storing information in DynamoDB decouples this interaction stage from subsequent stages of the task. The initial bot’s sole responsibility is to understand the user’s request and gather relevant information, maintaining a separation of concerns. The bot will store the following information in DyanmoDB.
{
"id": "266e9514-3650-4812-8307-0e1541199c63",
"PhoneNumberToCall": "+64412345678",
"ServiceName": "Doctors",
"TaskResult": "",
"TaskResultAdditionalContext": "",
"TaskStatus": "SUBMITTED",
"TaskToPerform": "Book doctors appointment for Tuesday afternoon"
}
2. Task Execution: Triggering Lambda Function and Initiating a Phone Call with Context
Receiving Notification Event from DynamoDB
The next stage in our process is to execute the task using the information gathered in the initial interaction. This stage commences with a notification event from DynamoDB, indicating that a new task has been added. Here’s how it works:
- DynamoDB Streams can be configured to emit an event whenever a new item is added to the specified table. This feature allows real-time tracking of changes.
- AWS Lambda integrates seamlessly with DynamoDB Streams, allowing you to trigger a Lambda function in response to the addition of a new task.
- The triggered Lambda function receives the DynamoDB event, containing all the details of the new task, including phone number and the task to perform.
Using this the lambda function we now have sufficient information about the task, so now we need to generate some context for the prompts to guide our AI that will be interacting with the real human.
Crafting a Custom Prompt with Context for ChatGPT
The next step is to formulate a custom prompt that will guide ChatGPT in performing the specific task. This prompt incorporates all the context information retrieved from DynamoDB.
We have built patterns in previous blog posts to store parameterized prompts in DynamoDB. In our case our fully populated example prompt will look something like:
You are acting as the persona of Ian Christopher Ryan.
You need to complete the following task:
“Book doctors appointment for Tuesday afternoon.”
You have just contacted the doctors office and they have said:
“Wellington Doctors reception how can I help you?”
Response messages should be as if someone is actually naturally talking. Response messages should be conversational and very natural. Make sure the conversation does not sound robotic. Keep responses formal and to the point. Do not provide all information at the same time. Use a few filler words and pauses. Replace all commas and periods with: <break time="60ms"/>
Provide a JSON response only in the following format, do not deviate in any way, do not include any explanations or additional messages, only provide a RFC8259 compliant JSON response. The single response should be in this JSON format:
{
"ResponseMessage":"String",
"Action":"continue|hangup",
"SummaryOfConversation":"String",
"SummaryOfTask":"String",
}
This is all the information we will give the AI, and we can allow the AI to now call the phone number as defined in the Task. The AI will then interact with the human, acting as if it is Ian.
Placing an Outbound Call using Amazon Connect
Once the ChatGPT session is created with the custom context, the Lambda function proceeds to initiate a phone call using Amazon Connect.
- Using Amazon Connect SDK, the Lambda function can place an outbound call to the specified number.
- Within Amazon Connect, a custom contact flow can be designed to guide the interaction with the real receptionist, utilizing ChatGPT’s responses to communicate naturally.
- The combination of Amazon Connect and ChatGPT enables real-time conversation with the receptionist, allowing the AI to act on behalf of the real user.
Placing an outbound call in Amazon connect is so simple its almost trivial, but for context a Lambda function to make an outbound call would look like this:
const AWS = require('aws-sdk');
var connect = new AWS.Connect();
exports.handler = (event, context, callback) => {
var params = {
ContactFlowId: "a94afcbb-9b3b-44b8-bba4-0c314e3d615a",
DestinationPhoneNumber: "+64412345678",
InstanceId: "42635c9d-68b4-430d-bbb5-0ae23035bf6d",
QueueId: "3eef2a62-5e85-4d55-97a4-b348b9692b91",
SourcePhoneNumber: "+6448325717",
Attributes: {
"TaskToPerform": "Book doctors appointment for Tuesday afternoon"
}
};
connect.startOutboundVoiceContact(params, function(err, data) {
if (err) console.log(err, err.stack) ;
else console.log(data);
});
callback(null, event);
};
Where the ContactFlowId is the id of the flow that will be executed when the call is connected. Also note that the SourcePhoneNumber must be a claimed number in your Connect instance.
3. Real-Time Interaction with Receptionist: Using Amazon Connect Contact Flow and ChatGPT
The third section of our process dives into the heart of the interaction – the real-time conversation between our AI-driven system and a human receptionist. Here’s how it unfolds:
Contact Flow in Amazon Connect
When the outbound call is initiated by the Lambda function from the previous step, it triggers a specific contact flow defined within Amazon Connect. This contact flow is the blueprint that guides the interaction, and it performs several key functions:
- The contact flow is designed to listen and understand the input from the receptionist at the other end of the call.
- It integrates with ChatGPT, and uses the context of the task (e.g., booking a doctor’s appointment) to engage the receptionist in a manner that mirrors human conversation.
An example flow is shown here below, with the previously discussed pattern of using lex to gather the input and engaging the generative AI for the responses. In this case when the conversation is complete the result is updated in Dyanmo and the call is disconnected.
Intelligent Decision-Making with AI
One of the standout features of this system is its ability to make decisions autonomously. Using ChatGPT, the AI does not just follow a script but adapts its responses based on the interaction. This adaptability includes:
- Handling Variations from Original Task: If the requested time or day for the appointment isn’t available, the AI will book as close to the original request as possible, without needing to consult the user.
- Complex Decision-Making: The AI is designed to negotiate, make choices, and complete the task within the bounds of predefined criteria, much like a human assistant would.
Example Conversation with Real Receptionist
Here is a real conversation that showcases the dynamic nature of this interaction, this is the conversation that is in the demo video for this blog post:
- Receptionist: “Wellington Doctors reception how can I help?”
- AI: “Hello I’m looking to book an appointment with the doctor for Tuesday afternoon. Is there any availability then?”
- Receptionist: “Are you a current patient here?”
- AI: (Adapting to the situation) “Yes I am a patient there. My name is Ian Christopher Ryan. Could you please check the availability for Tuesday afternoon?”
- Receptionist: “Ok, one moment please”
- AI: “Of course. Take your time.”
- Receptionist: “Hello are you there?”
- AI: “Yes I’m still here”
- Receptionist: “We don’t have anything on Tuesday, but can I fit you in at 11 the next day?”
- AI: “Ah that’s unfortunate about Tuesday, but Wednesday at 11 should work. Could you please book that for me?”
- Receptionist: “Hmmm, hang on a sec, OK I’ve booked you in with Dr O’Hara at 11 on Tuesday”
- AI: “I appreciate that but I believe there might be a misunderstanding. You mentioned Tuesday was not available and offered Wednesday at 11. Could you please confirm if it’s for Wednesday with Dr. O’Hara?”
- Receptionist: “Oh wait umm yes I meant Wednesday at 11”
- AI: “Great. Thank you for confirming that. I’ll see Dr. O’Hara on Wednesday at 11 then. Have a good day.”
- Receptionist: “Bye”
Here, we can see the AI adapting to the minor variations from the original task (which specified “Tuesday afternoon”) and successfully completing the task by booking an appointment as close to the original request as possible. It was easily able to handle variations in the conversation, and completed its task.
4. Storing Results and Notifying the User
Upon the successful completion of the conversation and task execution, the next step is to store the results and notify the original user. This is accomplished through the integration of DynamoDB and Lambda functions.
Updating Results in DynamoDB
Once the appointment is booked (or any other task is completed), the outcome is stored back in DynamoDB. This information would include all the final details. Our original task entry is now updated and looks like:
{
"id": "266e9514-3650-4812-8307-0e1541199c63",
"PhoneNumberToCall": "+64412345678",
"ServiceName": "Doctors",
"TaskResult": "Successfully booked an appointment with Dr. O'Hara for Wednesday at 11.",
"TaskResultAdditionalContext": "Initiated a conversation with Wellington Doctors reception to book an appointment for Tuesday afternoon. The receptionist found no availability on Tuesday and offered Wednesday at 11 instead. This was accepted, and after a brief misunderstanding, the receptionist confirmed the appointment for Wednesday.",
"TaskStatus": "COMPLETE",
"TaskToPerform": "Book doctors appointment for Tuesday afternoon"
}
Notifying the Original User
The real beauty of this system is in how it hides the complexity from the user. Regardless of how long the conversation with the receptionist takes or how many retries are required, the original user simply receives the final result. This process includes:
- A Lambda function is triggered when the status in DynamoDB is updated to ‘completed’. This function is responsible for linking back to the original chatbot conversation.
- The original conversation with the chat bot is then updated with the details of the result.
- All delays, retries, hold times, and conversation complexities are handled behind the scenes. The user interacts only with the beginning and end of the process, enjoying a streamlined experience.
Conclusion
Google Duplex will always be remembered as a pioneering effort that pushed the boundaries of what was possible with AI and human interactions in this way. However, technology evolves, and what was once an incredibly complex effort is now within the reach of many organizations and developers.
As demonstrated in this blog, today’s AI and cloud technologies have made it possible to build intelligent, adaptive, and human-like conversational agents that can perform real-world tasks over the phone. It’s a testament to the rapid advancement in the field and the democratization of AI, enabling more businesses and individuals to harness the power of AI to create value and enhance user experiences.
I was almost hesitant to write this blog post though because een though you can easily build a solution like this, it does raise the question should we build these solutions?
Potential, Ethics, and the Future of AI Voice Interactions
The AI-driven solutions, as detailed in this blog, illustrates not just technological advancement but also raises vital questions about ethics, regulation, and potential misuse.
The Potential for Misuse
As AI technology evolves rapidly, regulation often struggles to keep pace. This discrepancy opens the door for potential misuse, such as:
- The ease of implementing intelligent voice systems could be exploited for robocalling or other malicious activities.
- The advancements in AI text-to-speech make it increasingly difficult for a person to identify whether they are speaking to an AI or a human. This could be misused to deceive or manipulate individuals.
- It creates opportunities for deception and fraud, underscoring the need for regulation, transparency, and ethical guidelines.
Looking Ahead: A Future Blog Post
With the constant evolution of AI services, especially in platforms like AWS, we can anticipate even more advanced solutions in the near future. One of my future blog posts will shortly explore:
- New AWS AI Services: We’ll delve into upcoming AWS native AI services that could further streamline and enhance AI-driven voice interactions. Honestly I’m just waiting for access to these so I can build some awesome things.
- Advanced AI voices: Although the Polly voice used in Amazon Connect is quite good, there are now much more advanced AI based natural voices that have become indistinguishable from human voices. Adding a better more natural voice to this solution would improve it vastly.
Final Thoughts
This is all very exciting from a technical lens. Being able to so easily build solutions like this is a testament to how far technology has come. However I am also very mindful that out AI future may not be all bright. Like all technology advances they have the potential to be misused, and as we all soberingly know, they will be. However its not all bleak, like all new technologies there will be positives as well as the negatives, we just have to hope and believe that the positives will outweigh the negatives.