Start a conversational AI agent
Use this endpoint to create and start a Conversational AI agent instance.
API endpoint
- Method: POST
- Endpoint: https://api.agora.io/api/conversational-ai-agent/v2/projects/{appid}/join
- Authorization: Basic Auth
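The sample request on this page passes base64-encoded credentials in the Authorization header. The sketch below shows one way to build that header value, assuming the credentials are a key and secret joined by a colon, as in standard HTTP Basic authentication. The names `customer_key` and `customer_secret` are placeholders and do not come from this page.

```python
import base64


def basic_auth_header(customer_key, customer_secret):
    """Return a value for the Authorization header using the HTTP Basic scheme.

    Assumption: the credentials are a key/secret pair; this page only
    states that the endpoint uses Basic Auth.
    """
    raw = f"{customer_key}:{customer_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")
```

The returned string can be used as the full header value, in place of `Basic <your_base64_encoded_credentials>`.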
Path parameters
Parameter | Type | Required | Description |
---|---|---|---|
appid | string | Yes | The App ID of the project |
Request body parameters
The request body must be a JSON object containing the following parameters:
Parameter | Type | Required | Description |
---|---|---|---|
name | string | Yes | The unique identifier of the agent. The same identifier cannot be used repeatedly. |
properties | object | Yes | Configuration details of the agent |
⇒channel | string | Yes | The name of the channel to join. |
⇒token | string | Yes | The authentication token used by the agent to join the channel. |
⇒agent_rtc_uid | string | Yes | The user ID of the agent in the channel. A value of 0 means that a random uid is generated and assigned. Set the token accordingly. |
⇒remote_rtc_uids | array[string] | Yes | The list of user IDs that the agent subscribes to in the channel. Only subscribed users can interact with the agent. "*" means that the agent subscribes to all users in the channel. |
⇒enable_string_uid | boolean | No | Whether to enable String uid: |
⇒idle_timeout | integer | No | Sets the timeout after all the users specified in remote_rtc_uids are detected to have left the channel. When the timeout value is exceeded, the agent automatically stops and exits the channel. A value of 0 means that the agent does not exit until it is stopped manually. Default value 30. |
⇒advanced_features | object | No | Advanced features configuration. |
⇒⇒enable_aivad | boolean | No | Whether to enable the intelligent interruption handling function (AIVAD). Default value false. This feature is currently available only for English. |
⇒asr | object | No | Automatic Speech Recognition (ASR) configuration. |
⇒⇒language | string | No | The language used by users to interact with the agent. The following languages are supported: |
⇒tts | object | Yes | Text-to-speech (TTS) module configuration |
⇒⇒vendor | string | Yes | TTS provider. Supports the following values:
|
⇒⇒params | object | Yes | The configuration parameters for the TTS vendor. See TTS vendor configuration for details. |
⇒llm | object | Yes | Large language model (LLM) configuration. |
⇒⇒url | string | Yes | The LLM callback address. |
⇒⇒api_key | string | No | The LLM verification API key. The default value is an empty string. Ensure that you enable the API key in a production environment. |
⇒⇒system_messages | array[object] | No | A set of predefined information used as input to the LLM, including prompt words and examples. |
⇒⇒params | object | No | Additional LLM information transmitted in the message body, such as the model used and the maximum token limit. |
⇒⇒max_history | integer | No | The number of short-term memory entries cached in the custom LLM. 0 means no short-term memory is cached. User and agent entries are logged separately. Default value 10. |
⇒⇒input_modalities | array[string] | No | LLM input modalities. Supports ["text"], ["text", "image"]. Default ["text"]. |
⇒⇒output_modalities | array[string] | No | LLM output modalities. Supports ["audio"], ["text"], ["text", "audio"]. Default ["text"]. |
⇒⇒greeting_message | string | No | Agent greeting. If provided, the first user in the channel is automatically greeted with the message upon joining. |
⇒⇒failure_message | string | No | Prompt for agent activation failure. If provided, it is returned through TTS when the custom LLM call fails. |
⇒⇒style | string | No | The request style for chat completion. Supports: |
⇒vad | object | No | Voice Activity Detection (VAD) configuration. |
⇒⇒interrupt_duration_ms | number | No | The amount of time in milliseconds that the user’s voice must exceed the VAD threshold before an interruption is triggered. Default value 160. |
⇒⇒prefix_padding_ms | integer | No | The extra forward padding time in milliseconds before the processing system starts to process the speech input. This padding helps capture the beginning of the speech. Default value 300. |
⇒⇒silence_duration_ms | integer | No | The duration of audio silence in milliseconds. If no voice activity is detected during this period, the agent assumes that the user has stopped speaking. Default value 640. |
⇒⇒threshold | number | No | Identification sensitivity: the level of sound in the audio signal that is considered voice activity. The value range is (0.0, 1.0). Lower values make it easier for the agent to detect speech, and higher values ignore weak sounds. Default value 0.5. |
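As an illustration of the table above, the following sketch assembles a request body that contains only the parameters marked as required, plus an optional VAD threshold check. The function name `build_join_body` is hypothetical, and treating the (0.0, 1.0) range as exclusive at both ends is an assumption based on the notation in the table.

```python
import json


def build_join_body(name, channel, token, agent_rtc_uid, remote_rtc_uids,
                    tts_vendor, tts_params, llm_url, vad_threshold=None):
    """Assemble a request body holding only the required parameters."""
    properties = {
        "channel": channel,
        "token": token,
        "agent_rtc_uid": agent_rtc_uid,
        "remote_rtc_uids": list(remote_rtc_uids),
        "tts": {"vendor": tts_vendor, "params": tts_params},
        "llm": {"url": llm_url},
    }
    if vad_threshold is not None:
        # Assumption: the documented range (0.0, 1.0) excludes both endpoints.
        if not 0.0 < vad_threshold < 1.0:
            raise ValueError("vad threshold must be between 0.0 and 1.0")
        properties["vad"] = {"threshold": vad_threshold}
    return {"name": name, "properties": properties}


if __name__ == "__main__":
    body = build_join_body(
        name="unique_name",
        channel="channel_name",
        token="token",
        agent_rtc_uid="0",
        remote_rtc_uids=["*"],
        tts_vendor="microsoft",
        tts_params={"key": "xxxx", "region": "eastus",
                    "voice_name": "en-US-AndrewMultilingualNeural"},
        llm_url="https://api.openai.com/v1/chat/completions",
    )
    print(json.dumps(body, indent=2))
```

Optional objects such as `asr` or `advanced_features` are left out here, so any defaults listed in the table would apply.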
Request examples
Use one of the following request examples as a starting point:
Sample request:
curl --request POST \
--url https://api.agora.io/api/conversational-ai-agent/v2/projects/:appid/join \
--header 'Authorization: Basic <your_base64_encoded_credentials>' \
--header 'Content-Type: application/json' \
--data '
{
"name": "unique_name",
"properties": {
"channel": "channel_name",
"token": "token",
"agent_rtc_uid": "friday",
"remote_rtc_uids": [
"*"
],
"enable_string_uid": true,
"idle_timeout": 120,
"advanced_features": {
"enable_aivad": true
},
"llm": {
"url": "https://api.openai.com/v1/chat/completions",
"api_key": "sk-xxx",
"system_messages": [
{
"role": "system",
"content": "You are a helpful chatbot."
}
],
"max_history": 10,
"greeting_message": "Hello, how can I assist you today?",
"failure_message": "Please hold on a second.",
"params": {
"model": "gpt-4o-mini"
}
},
"tts": {
"vendor": "microsoft",
"params": {
"key": "xxxx",
"region": "eastus",
"voice_name": "en-US-AndrewMultilingualNeural"
}
},
"asr": {
"language": "en-US"
},
"vad": {
"silence_duration_ms": 480
}
}
}'
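The same request can be sent from a script. The sketch below is a rough Python equivalent of the curl sample, using only the standard library. The helper names `build_join_request` and `start_agent` are hypothetical, and sending `Content-Type: application/json` is an assumption based on the JSON request body.

```python
import json
import urllib.request

BASE_URL = "https://api.agora.io/api/conversational-ai-agent/v2/projects"


def build_join_request(appid, auth_header, body):
    """Build (but do not send) the POST request for the join endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{appid}/join",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": auth_header,
            # Assumption: the endpoint expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_agent(appid, auth_header, body, timeout=10.0):
    """Send the request and return the decoded JSON response body.

    urllib raises urllib.error.HTTPError for non-2xx status codes, so
    callers may want to catch it and read the error body.
    """
    request = build_join_request(appid, auth_header, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and payload without making a network call.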
Response
- If the returned status code is 200, the request was successful. The response body contains the result of the request.
- If the returned status code is not 200, the request failed. The response body includes the detail and reason for failure. Refer to status codes to understand the possible reasons for failure.
Response body
Parameter | Type | Description |
---|---|---|
agent_id | string | Unique id of the agent instance |
create_ts | integer | Timestamp of when the agent was created |
status | string | Current status.
|
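The status code rule and the response fields above can be handled along these lines. This is a sketch only: `AgentStartError` and `parse_join_response` are hypothetical names, and it assumes the failure body is a JSON object with `detail` and `reason` keys, as the description suggests.

```python
import json


class AgentStartError(Exception):
    """Raised when the join request returns a status code other than 200."""

    def __init__(self, status_code, detail, reason):
        super().__init__(f"HTTP {status_code}: {reason}: {detail}")
        self.status_code = status_code
        self.detail = detail
        self.reason = reason


def parse_join_response(status_code, body_text):
    """Return the agent fields on success, or raise AgentStartError."""
    payload = json.loads(body_text) if body_text else {}
    if status_code == 200:
        # Fields listed in the response body table.
        return {
            "agent_id": payload["agent_id"],
            "create_ts": payload["create_ts"],
            "status": payload["status"],
        }
    # Assumption: failure bodies carry "detail" and "reason" keys.
    raise AgentStartError(status_code, payload.get("detail"), payload.get("reason"))
```

Storing the returned `agent_id` is useful because it identifies the agent instance in later calls.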
Sample response
TTS vendor configuration
Conversational AI Engine supports the following TTS vendors:
- Microsoft
  - Supported parameters
    - key
    - region
    - voice_name
    - rate: Indicates the speaking rate of the text. Speaking rate can be applied at the word or sentence level. The rate changes should be within 0.5 to 2 times the original audio.
    - volume: Expressed as a number in the range of 0.0 to 100.0, from quietest to loudest, such as 75. The default value is 100.0.
    - sample_rate: integer. Default value 24000.

    For further details, see Microsoft TTS.
  - Sample configuration
- Elevenlabs
  - Supported parameters
    - key
    - model_id
    - voice_id
    - sample_rate: integer. Default value 24000.
    - stability
    - similarity_boost
    - style
    - user_speaker_boost

    For further details, see Elevenlabs TTS.
  - Sample configuration
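For the Microsoft vendor, the parameters listed above could be assembled into a `tts` object as in the sketch below. The function name `microsoft_tts_config` is hypothetical. Treating `rate` as a numeric multiplier between 0.5 and 2 is an assumption drawn from the parameter description, and the vendor value `"microsoft"` is taken from the sample request.

```python
def microsoft_tts_config(key, region, voice_name,
                         rate=None, volume=None, sample_rate=None):
    """Build the tts object for the Microsoft vendor with basic range checks."""
    params = {"key": key, "region": region, "voice_name": voice_name}
    if rate is not None:
        # Assumption: rate is a multiplier of the original speaking speed.
        if not 0.5 <= rate <= 2.0:
            raise ValueError("rate must be within 0.5 to 2")
        params["rate"] = rate
    if volume is not None:
        if not 0.0 <= volume <= 100.0:
            raise ValueError("volume must be within 0.0 to 100.0")
        params["volume"] = volume
    if sample_rate is not None:
        params["sample_rate"] = int(sample_rate)
    return {"vendor": "microsoft", "params": params}


if __name__ == "__main__":
    print(microsoft_tts_config("xxxx", "eastus",
                               "en-US-AndrewMultilingualNeural", volume=75))
```

Parameters that are not passed are omitted from the object, so the documented defaults (volume 100.0, sample rate 24000) would apply.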