How do I troubleshoot latency and optimize Amazon Bedrock Agents performance?

I want to troubleshoot high latency in Amazon Bedrock Agents and optimize agent performance.

Short description

Amazon Bedrock Agents use reasoning capabilities to retrieve information from knowledge bases, and this reasoning might result in high latency. The following factors affect Amazon Bedrock Agents response times:

  • Model size
  • Prompt structure and complexity
  • Number of input and output tokens
  • Network connectivity and AWS Regional infrastructure

Resolution

Troubleshoot agent latency

Update your model size

If you use large foundation models and experience high latency, then use lighter models for latency-sensitive use cases. Invocation latency also scales with the combined count of input and output tokens. To reduce the number of output tokens, instruct your agent to give clear, short responses to user queries.

Note: Amazon Bedrock automatically publishes the InvocationLatency metric to the AWS/Bedrock namespace in Amazon CloudWatch. For more information, see Monitoring the performance of Amazon Bedrock.
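
To check whether model invocation dominates your agent's response time, you can read this metric programmatically. The following is a minimal sketch, assuming boto3 is installed and that the metric is filtered by the ModelId dimension. The function names, the one-hour window, and the 5-minute period are illustrative choices, not values from this article.

```python
from datetime import datetime, timedelta, timezone


def build_latency_query(model_id, hours=1, period_seconds=300, now=None):
    """Build keyword arguments for the CloudWatch GetMetricStatistics call."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "InvocationLatency",
        # Assumption: you filter by the model that your agent invokes.
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average", "Maximum"],
    }


def fetch_latency(model_id, region):
    """Return latency datapoints sorted by time. Requires AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    response = cloudwatch.get_metric_statistics(**build_latency_query(model_id))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

A large gap between the Average and Maximum values over the same window suggests intermittent slowdowns rather than a consistently slow model.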

Review your orchestration strategies

As the number of model invocations increases, the latency also increases. Make sure that your agent's instructions, functions, and parameters are defined concisely. Steps such as pre-processing, orchestration, and post-processing each introduce additional processing time. If you receive slow responses and your use case doesn't require the pre-processing or post-processing steps, then turn off those steps to reduce latency.

To turn off the pre-processing and post-processing templates, complete the following steps:

  1. Open the Amazon Bedrock console in the AWS Region where your agent is deployed.
  2. In the navigation pane, expand Builder tools and choose Agents.
  3. Select the agent, and then choose Edit in Agent Builder.
  4. In the Orchestration strategy details section, choose Edit.
  5. Choose the Pre-processing tab and turn off Activate pre-processing template.
  6. Choose the Post-processing tab and turn off Activate post-processing template.
  7. Choose Save and exit.
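
If you manage agents through the API instead of the console, the same change can be sketched as an edit to the agent's prompt override configuration. This is a hedged sketch, not the console's exact behavior: it assumes that each entry in promptConfigurations carries the promptType and promptState fields, and the helper name disable_prompt_steps is hypothetical.

```python
def disable_prompt_steps(prompt_configurations,
                         steps=("PRE_PROCESSING", "POST_PROCESSING")):
    """Return a copy of the prompt configurations with the given steps disabled.

    The input list is not modified, so you can compare before and after.
    """
    updated = []
    for config in prompt_configurations:
        config = dict(config)  # shallow copy of one prompt configuration
        if config.get("promptType") in steps:
            config["promptState"] = "DISABLED"
        updated.append(config)
    return updated
```

In practice, you would read the current list from the GetAgent response, pass the modified list back in the promptOverrideConfiguration of an UpdateAgent request, and then prepare the agent again. Verify the required UpdateAgent fields against the API reference before you rely on this.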

If you use custom orchestration, then configure the orchestration to optimize performance. Provide clear instructions. Redundant or ambiguous instructions give the model more to process and can slow its responses.

Check Amazon Bedrock network configuration

If your AWS Lambda function runs in a virtual private cloud (VPC) and you experience slow network interactions with Amazon Bedrock, then traffic might route through the public internet. To resolve this issue, use AWS PrivateLink to set up private access to Amazon Bedrock.
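
A PrivateLink setup comes down to creating an interface VPC endpoint for the Amazon Bedrock service that your function calls. The following sketch builds the request for the Amazon EC2 CreateVpcEndpoint API. It assumes that your function calls the agent runtime, so it defaults to the bedrock-agent-runtime service name. The resource IDs and function names are placeholders.

```python
def build_bedrock_endpoint_request(region, vpc_id, subnet_ids, security_group_ids,
                                   service="bedrock-agent-runtime"):
    """Build keyword arguments for the Amazon EC2 CreateVpcEndpoint call."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Private DNS lets existing SDK clients resolve to the endpoint
        # without code changes.
        "PrivateDnsEnabled": True,
    }


def create_bedrock_endpoint(region, vpc_id, subnet_ids, security_group_ids):
    """Create the endpoint. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    request = build_bedrock_endpoint_request(
        region, vpc_id, subnet_ids, security_group_ids
    )
    return ec2.create_vpc_endpoint(**request)
```

The security groups that you attach must allow inbound HTTPS (port 443) from the Lambda function's subnets, or calls to the endpoint time out.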

Activate cross-Region inference

If you experience latency during periods of high Regional demand, then you might hit a Regional bottleneck. To resolve this issue, increase throughput with cross-Region inference (CRIS) to distribute inference workloads across multiple Regions.
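
With cross-Region inference, you typically reference an inference profile instead of a single-Region model ID. As a rough illustration, system-defined profile IDs commonly take the form of a geography prefix (for example us, eu, or apac) followed by the model ID. The helper below and the model ID in the test are hypothetical. Confirm which profiles exist in your account, for example with the ListInferenceProfiles API, before you use one.

```python
def to_inference_profile_id(model_id, geography="us"):
    """Prefix a model ID with a geography to form an inference profile ID.

    This is only a naming convention. It doesn't check that the profile
    actually exists for the model or for your account.
    """
    if not model_id or not geography:
        raise ValueError("model_id and geography must be non-empty strings")
    return f"{geography}.{model_id}"
```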

Optimize agent performance

Activate streaming responses

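If your outputs have high token counts and you don't stream them, then users must wait for the full response to generate. To check whether you have high token counts, use the OutputTokenCount metric in CloudWatch. To activate streaming responses for an agent, set streamFinalResponse to true in the streamingConfigurations of your InvokeAgent request so that content arrives as it's generated. The agent's service role must allow the bedrock:InvokeModelWithResponseStream action.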
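
When you call an agent through the boto3 invoke_agent operation, streaming is requested through the streamingConfigurations parameter, and the response arrives as an event stream of chunks. The following is a minimal sketch under that assumption. The function names are illustrative, and the agent's service role is assumed to allow bedrock:InvokeModelWithResponseStream.

```python
import codecs


def build_streaming_request(agent_id, agent_alias_id, session_id, text):
    """Build keyword arguments for invoke_agent with streaming turned on."""
    return {
        "agentId": agent_id,
        "agentAliasId": agent_alias_id,
        "sessionId": session_id,
        "inputText": text,
        "streamingConfigurations": {"streamFinalResponse": True},
    }


def iter_text(completion_events):
    """Yield decoded text from the chunk events of a completion stream."""
    # An incremental decoder handles a multi-byte character split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    for event in completion_events:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            text = decoder.decode(chunk["bytes"])
            if text:
                yield text


def stream_agent_reply(region, request):
    """Print the agent's reply as it arrives. Requires AWS credentials."""
    import boto3  # imported here so the helpers above work without boto3

    client = boto3.client("bedrock-agent-runtime", region_name=region)
    response = client.invoke_agent(**request)
    for piece in iter_text(response["completion"]):
        print(piece, end="", flush=True)
```

Streaming doesn't shorten total generation time. It reduces the time until the user sees the first part of the response.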

Configure your knowledge base

If you don't limit the number of document chunks that your knowledge base returns, then the model must process more input tokens and response generation time increases. To resolve this issue, use the numberOfResults parameter to limit the number of document chunks retrieved from your knowledge base.
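
One place to set numberOfResults is the session state that you pass to InvokeAgent, which can carry a retrieval configuration for each knowledge base. The following sketch assumes that approach. The function name and the default of 3 results are illustrative, not recommended values.

```python
def build_session_state(knowledge_base_id, number_of_results=3):
    """Build a sessionState that caps the chunks retrieved per query."""
    if number_of_results < 1:
        raise ValueError("number_of_results must be at least 1")
    return {
        "knowledgeBaseConfigurations": [
            {
                "knowledgeBaseId": knowledge_base_id,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {
                        "numberOfResults": number_of_results
                    }
                },
            }
        ]
    }
```

You would pass the returned dictionary as the sessionState argument of invoke_agent. Start with a small value and increase it only if answers lose needed context.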

Related information

Monitor model invocation using CloudWatch Logs and Amazon S3