re:Invent 2023 CMP319 Deploy LLMs with AWS Inferentia & Ray to optimize performance and cost

Generative AI and large language models (LLMs) have inspired many organizations to reimagine the experiences they build for their customers. As these sophisticated LLMs are integrated into more applications, developers face the challenge of serving models in high-volume deployments while still meeting performance targets. AWS Inferentia2 is a purpose-built accelerator optimized for performance and cost, while Ray Serve reduces serving latency and is easy to use. In this code talk, learn how to deploy Llama 2 with Ray Serve on AWS Inferentia2 to achieve high performance, low latency, and cost efficiency.
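As background for the flow the talk walks through, the sketch below compiles Llama 2 for NeuronCores with the transformers-neuronx library. It is a minimal sketch, not the session's code; the checkpoint path, tp_degree, and sequence length are illustrative assumptions (depending on the transformers-neuronx version, the checkpoint may first need to be saved in its split format).

import torch
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

# Shard the model across 8 NeuronCores (illustrative; match tp_degree
# to the NeuronCores available on your inf2 instance).
neuron_model = LlamaForSampling.from_pretrained(
    "./Llama-2-7b",  # assumed path to a local Llama 2 checkpoint
    batch_size=1,
    tp_degree=8,
    amp="f16",
)
neuron_model.to_neuron()  # triggers ahead-of-time compilation for Neuron

tokenizer = AutoTokenizer.from_pretrained("./Llama-2-7b")
input_ids = tokenizer("What is AWS Inferentia2?", return_tensors="pt").input_ids

# Autoregressive sampling runs on the NeuronCores
with torch.inference_mode():
    generated = neuron_model.sample(input_ids, sequence_length=256)
print(tokenizer.decode(generated[0], skip_special_tokens=True))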

Keita Watanabe

February 20, 2024

Transcript

  1. Deploy LLMs with AWS Inferentia & Ray to optimize performance and cost (CMP319-R1)
     Keita Watanabe, Senior Solutions Architect, AWS
     Scott Perry, Senior Solutions Architect, AWS
  2. What are we building today?
  3. Purpose-built accelerators for generative AI
     • AWS Inferentia: Lowest cost per inference in the cloud for running deep learning (DL) models; up to 70% lower cost per inference than comparable Amazon EC2 instances
     • AWS Inferentia2: High performance at the lowest cost per inference for LLMs and diffusion models; up to 40% better price performance than comparable Amazon EC2 instances
     • AWS Trainium: The most cost-efficient, high-performance training of LLMs and diffusion models; up to 50% savings on training costs over comparable Amazon EC2 instances
  4. Llama 2
     • High performance
     • Open source
     • Multiple sizes
     • Multiple variants
     Source: https://arxiv.org/pdf/2307.09288.pdf
  5. Ray support for Trainium and Inferentia (NEW)
     Native support for AWS EC2 Trn1 and Inf2 instances
     • Native support for Trainium and Inferentia available in the Ray 2.7 release
     • Can define the number of NeuronCores required in clusters, actors, and tasks (see the sketch after this slide)
     • Support for Ray Serve and Ray Train
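A minimal sketch of the NeuronCore scheduling described above, assuming Ray 2.7+ running on an Inf2 or Trn1 instance; the core counts are illustrative, and the trivial task and actor bodies stand in for real Neuron inference code:

import ray

# On Inf2/Trn1 instances, Ray 2.7+ auto-detects NeuronCores and exposes
# them as the "neuron_cores" resource for scheduling.
ray.init()

# Reserve 2 NeuronCores for a task
@ray.remote(resources={"neuron_cores": 2})
def run_inference(prompt: str) -> str:
    # A real task would invoke a Neuron-compiled model here
    return prompt

# Reserve 2 NeuronCores for a long-lived actor
@ray.remote(resources={"neuron_cores": 2})
class InferenceWorker:
    def generate(self, prompt: str) -> str:
        return prompt

print(ray.get(run_inference.remote("hello")))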
  6. Architecture
  7. Let's code!
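The deck does not reproduce the live-coding portion, but the session builds a Ray Serve deployment along these lines. This is a hedged sketch, not the session's actual code: the checkpoint path, tp_degree, NeuronCore count, and sequence length are illustrative assumptions.

from ray import serve
from starlette.requests import Request

@serve.deployment(ray_actor_options={"resources": {"neuron_cores": 8}})
class LlamaService:
    def __init__(self):
        from transformers import AutoTokenizer
        from transformers_neuronx.llama.model import LlamaForSampling

        # Compile the model once per replica; path and tp_degree are assumptions.
        self.tokenizer = AutoTokenizer.from_pretrained("./Llama-2-7b")
        self.model = LlamaForSampling.from_pretrained(
            "./Llama-2-7b", batch_size=1, tp_degree=8, amp="f16"
        )
        self.model.to_neuron()  # ahead-of-time compilation for NeuronCores

    async def __call__(self, request: Request) -> str:
        prompt = (await request.json())["prompt"]
        input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        output = self.model.sample(input_ids, sequence_length=256)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

app = LlamaService.bind()
# serve.run(app)  # then POST {"prompt": "..."} to http://127.0.0.1:8000/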
  8. Thank you! Please complete the session survey in the mobile app.