Deploying Many Models Efficiently with Ray Serve

Short Summary:
This talk focuses on efficiently deploying many models using Ray Serve, a distributed serving system. Key points include model composition (combining multiple models within a single application for efficient resource sharing and independent scaling), multi-application support (managing multiple independent applications on the same Ray cluster for flexible upgrades), and multiplexing (on-demand model loading to handle a large number of models with limited resources). Examples such as computer vision applications, autonomous driving systems, and Anyscale Endpoints illustrate the benefits. The talk details the challenges of the traditional monolith and microservice approaches and contrasts them with Ray Serve's solutions. Processes such as resource allocation, independent scaling, and model loading are explained in detail. The implications include significant cost savings (as demonstrated by Samsara's 50% reduction in ML infrastructure costs) and improved performance (as shown by benchmarking against SageMaker).
Detailed Summary:
The talk is divided into three main sections:
Section 1: Model Composition: This section introduces the challenges of serving many models, highlighting issues such as efficient hardware usage, independent scaling, and operational overhead. The speaker contrasts the limitations of the monolithic approach (all models in one container) with those of the microservice approach (each model as a separate service). Ray Serve's model composition is presented as a solution that combines the advantages of both, allowing independent scaling and upgrades while sharing resources efficiently. A computer vision example (image preprocessing, classification, detection, and business logic) demonstrates how to define and link models within a single Python file for deployment, as sketched below. The speaker emphasizes the ability to assign fractional resources and walks through a resource allocation example for the computer vision application. The real-world example of Samsara switching to Ray Serve and cutting its ML infrastructure costs by 50% is highlighted.
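The composition pattern described here can be expressed roughly as follows. This is a minimal sketch, assuming Ray Serve's 2.x handle-based composition API; the model classes, replica counts, and the 0.5-GPU allocations are illustrative placeholders, not the exact code from the talk.

```python
# Sketch of a computer-vision app composed of several independently scaled
# deployments that share cluster resources via fractional GPU requests.
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.5})
class Classifier:
    def __call__(self, image: bytes) -> str:
        # Placeholder for an image-classification model.
        return "cat"


@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 0.5})
class Detector:
    def __call__(self, image: bytes) -> list:
        # Placeholder for an object-detection model.
        return [{"label": "cat", "box": [0, 0, 64, 64]}]


@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 1})
class VisionApp:
    def __init__(self, classifier, detector):
        # Handles to the downstream model deployments.
        self.classifier = classifier
        self.detector = detector

    async def __call__(self, request: Request) -> dict:
        image = await request.body()  # preprocessing step (stubbed)
        label = await self.classifier.remote(image)
        boxes = await self.detector.remote(image)
        # Business logic combining the two model outputs.
        return {"label": label, "detections": boxes}


# Everything is linked in one Python file and deployed as one application.
app = VisionApp.bind(Classifier.bind(), Detector.bind())
# serve.run(app)  # or from the CLI: `serve run module:app`
```

Each deployment's replica count and resource request can be tuned independently, which is how the fractional-GPU packing described above is achieved without splitting the pipeline into separate services.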
Section 2: Multi-Application: This section builds on model composition, addressing the need for independent upgrade cycles when multiple teams manage different models within a shared cluster. The speaker introduces Ray Serve's multi-application feature, which lets multiple applications (each with its own lifecycle) coexist on the same cluster, so each can be upgraded and redeployed without affecting the others. The example of Anyscale Endpoints adding new Llama models demonstrates the ease of adding, deleting, and updating applications. The speaker reiterates the benefits of efficient resource allocation and independent scaling while emphasizing the crucial advantage of independent upgrades.
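In code, the multi-application workflow looks roughly like this. It is a minimal sketch assuming the serve.run name/route_prefix API; the deployment bodies and application names are illustrative only.

```python
# Two independent applications running on the same Ray cluster, each with
# its own name, route, and upgrade lifecycle.
from ray import serve


@serve.deployment
class LlamaChat:
    def __call__(self, prompt: str) -> str:
        return f"llama response to: {prompt}"  # placeholder model


@serve.deployment
class Summarizer:
    def __call__(self, text: str) -> str:
        return text[:100]  # placeholder model


# One team can redeploy "chat" (e.g. to add a new Llama model) without
# touching "summarize", and vice versa.
serve.run(LlamaChat.bind(), name="chat", route_prefix="/chat")
serve.run(Summarizer.bind(), name="summarize", route_prefix="/summarize")
```

In production this is typically expressed declaratively as a Serve config file listing the applications, so individual applications can be added, updated, or deleted without redeploying the rest.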
Section 3: Multiplexing for Serving Many Models: This section tackles the challenge of serving a vast number of models, potentially exceeding the capacity of a single cluster. The speaker introduces Ray Serve's multiplexing API, which enables on-demand model loading so that not every model has to be loaded upfront, improving efficiency and reducing latency. The speaker contrasts this with a naive setup in which models are repeatedly re-downloaded from S3, leading to high latency and wasted resources. A benchmarking comparison with SageMaker multi-model endpoints shows Ray Serve's multiplexing achieving 30% better throughput, thanks to a higher model cache hit rate. The application of multiplexing to Anyscale Endpoints, combined with LoRA techniques for efficient model sharing, is also discussed. Finally, a case study from Clari, a revenue operations platform, shows how Ray Serve reduced training time by 80% and cold-start latency by more than 80%, highlighting the benefits of memory efficiency and auto-scaling.
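A minimal sketch of the multiplexing pattern follows, assuming Ray Serve's @serve.multiplexed / serve.get_multiplexed_model_id API; the dummy loader stands in for downloading per-model weights from S3, and the cache size is an arbitrary example value.

```python
# On-demand model loading: models are fetched lazily per model ID and
# cached on each replica instead of all being loaded upfront.
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2)
class ManyModelService:
    @serve.multiplexed(max_num_models_per_replica=3)
    async def get_model(self, model_id: str):
        # In practice: download the weights for `model_id` (e.g. from S3)
        # and build the model. Serve caches up to 3 models per replica and
        # evicts the least recently used one when a new model is requested.
        return lambda payload: {"model_id": model_id, "echo": payload}

    async def __call__(self, request: Request) -> dict:
        # The target model ID comes from the "serve_multiplexed_model_id"
        # request header; Serve prefers routing to a replica that already
        # has that model loaded, which raises the cache hit rate.
        model_id = serve.get_multiplexed_model_id()
        model = await self.get_model(model_id)
        return model(await request.json())


app = ManyModelService.bind()
# serve.run(app)
```

Clients select a model by setting the serve_multiplexed_model_id header on each request, so a single deployment can front many models while only a bounded number are resident in memory at any time.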
The overall message is that Ray Serve provides a comprehensive and efficient solution for deploying and managing many models, addressing various challenges faced by traditional approaches and offering significant performance and cost improvements. The talk uses several real-world examples and benchmarking results to support its claims.