Unlock the full potential of large language models (LLMs) on Amazon EKS by optimizing inference performance and cost efficiency. This chalk talk provides practical guidance on deploying and scaling PyTorch-based LLMs on EKS using AWS Inferentia, Karpenter, and KServe. Learn how to use Inferentia's purpose-built accelerators to speed up inference, reduce latency, and lower costs. Discover how Karpenter's node autoscaling provisions right-sized capacity for fluctuating workloads. See how KServe streamlines model deployment and management. Through real-world examples and best practices, gain the expertise to build high-performance, cost-effective LLM inference pipelines.
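
To give a flavor of the Inferentia side of this workflow, here is a minimal sketch of compiling a PyTorch module for NeuronCores with the AWS Neuron SDK (torch_neuronx). The toy model, tensor shapes, and output filename are illustrative assumptions, not material from the session itself:

```python
import torch
import torch_neuronx  # AWS Neuron SDK; assumed installed on a Neuron-enabled host (e.g., an Inf2 instance)

# Toy stand-in for an LLM submodule; a real workload would load a pretrained model instead.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 512),
).eval()

example_input = torch.rand(1, 512)

# Ahead-of-time compile the module for NeuronCores; the result is a TorchScript module.
traced = torch_neuronx.trace(model, example_input)

# Persist the compiled artifact so a serving container (e.g., a KServe predictor) can load it.
torch.jit.save(traced, "model_neuron.pt")

# Calls on the traced module now execute on the Inferentia accelerators.
print(traced(example_input).shape)
```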
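
And, under similar caveats, a sketch of registering a compiled model with KServe through its Python SDK. The service name, namespace, storage URI, and the single-chip Neuron resource request are hypothetical placeholders; Karpenter would then provision an Inferentia-backed node to satisfy the device request:

```python
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1ModelFormat,
    V1beta1ModelSpec,
    V1beta1PredictorSpec,
)

# InferenceService pointing at the compiled model artifact in object storage.
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="llm-demo", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="pytorch"),
                storage_uri="s3://example-bucket/model_neuron/",  # hypothetical bucket
                resources=client.V1ResourceRequirements(
                    # Neuron device plugin resource; triggers scheduling onto Inferentia capacity.
                    limits={"aws.amazon.com/neuron": "1"},
                ),
            ),
        ),
    ),
)

# Requires cluster credentials and KServe installed in the cluster.
KServeClient().create(isvc)
```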