Choose an open model serving option

On Google Cloud, Vertex AI offers multiple ways to serve open large language models such as Llama, DeepSeek, Mistral, and Qwen. This document provides an overview of Vertex AI offerings for open model serving and helps you choose the right option for your use case.

Serving options

Vertex AI offers the following options for serving open models. Each of these options provides high availability and includes Google Cloud security best practices by default:

  • Model as a Service (MaaS): Call select open models through fully managed, pay-as-you-go APIs without provisioning or managing any serving infrastructure.
  • Self-deployed models in Model Garden: Deploy open models to dedicated Vertex AI endpoints in your own project by using prebuilt serving containers or your own custom containers.

When to use MaaS

Consider using MaaS in these scenarios:

  • Rapid development and prototyping: MaaS helps you quickly integrate LLM capabilities into applications. This is especially useful during initial exploration and when fast time-to-market is a key goal.
  • Minimizing operational overhead: Choose MaaS when your team wants to focus on application logic instead of infrastructure management. Google handles all GPU/TPU provisioning, scaling, and maintenance, which benefits teams focused on application development rather than MLOps or DevOps.
  • Variable traffic: The pay-as-you-go model supports experimental workloads or applications with unpredictable, bursty traffic patterns.
  • Out-of-the-box usage: Use a managed API for applications that need consistent performance but don't require deep customization of the underlying model or serving stack (see the call sketch after this list).
  • Security and compliance: MaaS lets enterprises use Google Cloud's built-in, enterprise-grade security and compliance features.
  • Standard model usage: Use MaaS when a standard, non-customized foundation model meets your needs.
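
For the out-of-the-box scenario, the following is a minimal sketch that calls an open model offered as MaaS through its OpenAI-compatible chat completions endpoint, using Application Default Credentials as the bearer token. The project ID, region, and model ID (meta/llama-3.1-8b-instruct-maas) are placeholders; check Model Garden for the models and regions that are currently available as MaaS.

```python
# Minimal sketch: call a MaaS open model through the OpenAI-compatible
# chat completions endpoint. Project, region, and model ID are placeholders.
import google.auth
import google.auth.transport.requests
import openai

PROJECT_ID = "your-project-id"  # placeholder
REGION = "us-central1"          # placeholder; use a region where the model is offered

# Use Application Default Credentials as the bearer token for the endpoint.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

client = openai.OpenAI(
    base_url=(
        f"https://{REGION}-aiplatform.googleapis.com/v1/projects/"
        f"{PROJECT_ID}/locations/{REGION}/endpoints/openapi"
    ),
    api_key=credentials.token,
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct-maas",  # example MaaS model ID
    messages=[{"role": "user", "content": "Summarize the benefits of managed model serving."}],
)
print(response.choices[0].message.content)
```

Because the endpoint follows the OpenAI chat completions protocol, existing client code can often be repointed at MaaS by changing only the base URL, credentials, and model ID.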

When to use self-deployed models in Model Garden

Self-deployment options include deploying from Model Garden by using prebuilt or custom containers. Consider self-deployment in these key scenarios:

  • Custom weights and fine-tuned models: Self-deployment is the best choice when your application requires custom weights or a fine-tuned version of a model. You can also build and deploy your own custom serving containers, for example when a model requires unique pre-processing or post-processing logic.
  • Predictable, high-volume workloads: Self-deployment is a strategic and cost-effective option for production applications with predictable and high-volume traffic. While it requires a greater upfront engineering investment, it can lead to a lower total cost of ownership (TCO) over the application's lifetime due to optimized per-token costs at scale.
  • Granular control over infrastructure: Use self-deployment when you need to fine-tune performance and budget by choosing specific hardware configurations. This includes selecting exact machine types, GPUs (for example, NVIDIA L4 or H100) or TPUs, and optimized serving frameworks (see the deployment sketch after this list).
  • Stringent security and compliance: This approach supports applications that must adhere to specific data residency policies or stringent regulations that prohibit using a multi-tenant managed service. It lets you deploy models securely within your own Google Cloud project and Virtual Private Cloud network, providing complete control over the data path.
  • Granular control over location: Dedicated endpoints let you deploy on any Compute Engine accelerator in Google Cloud across all regions.
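
As a sketch of that granular control, the following uses the Vertex AI SDK (google-cloud-aiplatform) to deploy an already registered open model to a dedicated endpoint with an explicit machine type and GPU. The project ID, model resource name, machine type, and accelerator are placeholders to adapt to your own workload.

```python
# Minimal sketch: deploy a registered open model to a dedicated Vertex AI
# endpoint with an explicit machine type and accelerator. Resource names,
# machine type, and accelerator are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")  # placeholders

# A Model resource you have already registered, for example custom weights
# uploaded with aiplatform.Model.upload() or a model from Model Garden.
model = aiplatform.Model(
    "projects/your-project-id/locations/us-central1/models/your-model-id"
)

endpoint = aiplatform.Endpoint.create(display_name="open-model-endpoint")

# Granular control: choose the exact machine type, GPU, and replica range
# for your latency and budget targets.
model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-12",  # example machine type
    accelerator_type="NVIDIA_L4",   # example GPU
    accelerator_count=1,
    min_replica_count=1,
    max_replica_count=2,
)
```

Because you choose the hardware and replica range yourself, you can tune these values against measured traffic, which is where the lower TCO for predictable, high-volume workloads comes from.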

When to use prebuilt containers

Consider using Vertex AI prebuilt containers in these scenarios:

  • Optimized performance: Vertex AI optimizes and customizes prebuilt containers for frameworks such as vLLM, providing enhanced performance, reliability, and integration with Google Cloud.
  • Ease of use: Serve models by using popular serving frameworks such as vLLM, Hex-LLM, SGLang, TGI, or TensorRT-LLM without building and maintaining your own container images (see the upload sketch after this list).
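
As a sketch of the prebuilt-container path, the following registers a model that is served by a prebuilt vLLM container by using aiplatform.Model.upload. The container image URI, tag, vLLM flags, and routes are placeholders; use the image and arguments that Model Garden lists for your model.

```python
# Minimal sketch: register a model served by a prebuilt vLLM container.
# Image URI, model ID, vLLM flags, and routes are placeholders; check
# Model Garden for the values that apply to your model.
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")  # placeholders

model = aiplatform.Model.upload(
    display_name="llama-vllm-prebuilt",
    # Placeholder: use the prebuilt vLLM serving image listed in Model Garden.
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/"
        "pytorch-vllm-serve:latest"  # tag is a placeholder
    ),
    serving_container_args=[
        "--host=0.0.0.0",
        "--port=8080",
        "--model=meta-llama/Llama-3.1-8B-Instruct",  # or a gs:// path to your weights
        "--tensor-parallel-size=1",
    ],
    serving_container_ports=[8080],
    # Adjust to the routes that your chosen container actually exposes.
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
)
```

The uploaded model can then be deployed to an endpoint with model.deploy(), as in the self-deployment sketch earlier.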

When to use custom containers

Consider building and using your own custom container in these scenarios:

  • Maximum flexibility: When existing serving options and prebuilt containers are insufficient for your needs, and you require full control over the container image, including dependencies and configurations.
  • Custom serving logic: When your model requires unique pre-processing or post-processing steps that prebuilt containers don't support (see the serving app sketch after this list).
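
As a sketch of custom serving logic, the following minimal FastAPI app follows the Vertex AI custom container contract: it listens on the port in AIP_HTTP_PORT and serves the health and predict routes passed in AIP_HEALTH_ROUTE and AIP_PREDICT_ROUTE. The prompt template and the commented-out model call are hypothetical stand-ins for your own pre-processing, model, and post-processing; package the app in your own image and register it with Model.upload.

```python
# Minimal sketch of a custom serving app that follows the Vertex AI custom
# container contract (AIP_HTTP_PORT, AIP_HEALTH_ROUTE, AIP_PREDICT_ROUTE).
# The pre-/post-processing and the model call are hypothetical placeholders.
import os

from fastapi import FastAPI, Request

app = FastAPI()

HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")

# model = load_model()  # hypothetical: load your fine-tuned weights at startup


@app.get(HEALTH_ROUTE)
def health():
    return {"status": "ok"}


@app.post(PREDICT_ROUTE)
async def predict(request: Request):
    body = await request.json()
    instances = body.get("instances", [])
    # Custom pre-processing that a prebuilt container would not do for you.
    prompts = [f"[INST] {inst['prompt']} [/INST]" for inst in instances]
    # outputs = model.generate(prompts)  # hypothetical model call
    outputs = prompts  # placeholder so the sketch runs without a model
    # Custom post-processing into the response shape Vertex AI expects.
    return {"predictions": [{"text": text} for text in outputs]}


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("AIP_HTTP_PORT", 8080)))
```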

What's next