Introducing Seldon Core 2.9 and MLServer 1.7: Enhanced Autoscaling, Streaming, and Usability

We're excited to introduce our newest release, Seldon Core 2.9! This release is packed with updates that make deploying and scaling ML easier and more flexible than ever, including improvements to autoscaling, support for streamed responses from GenAI models, and a range of other usability improvements described in more detail below.

Core 2.9 has also been tested alongside an upgraded release of MLServer (1.7), which adds support for Python 3.11 and 3.12 and for new data types, as well as several usability improvements and fixes that include contributions from the open-source community. Whether you're running large-scale inference services or exploring LLM use cases, these updates have something for you!

Smarter Autoscaling 

Core 2.9 provides a new, simpler implementation of autoscaling that unlocks enhanced configurability and cost savings. You can now set up autoscaling of models using the Kubernetes Horizontal Pod Autoscaler (HPA), and Seldon will scale your server replicas automatically to match. Scaling logic is also more configurable: it can now be driven by Kubernetes-native or custom metrics. In addition, autoscaling works with Multi-Model Serving setups, where multiple models are consolidated onto shared servers to save cost.
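As a sketch of how this can be wired up (assuming a custom metric, here called infer_rps, is exposed to the HPA through a metrics adapter such as prometheus-adapter; the model and metric names are illustrative), an HPA resource can target a Model directly:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-hpa
  namespace: seldon-mesh
spec:
  # Scale the Seldon Model resource itself; Core 2.9 scales the
  # backing server replicas to match.
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: iris
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Object
      object:
        metric:
          name: infer_rps          # illustrative custom metric (requests/sec)
        describedObject:
          apiVersion: mlops.seldon.io/v1alpha1
          kind: Model
          name: iris
        target:
          type: AverageValue
          averageValue: "10"
```

Seldon then keeps the underlying server replicas in step with the scaled Model, including when several models share a server.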

Learn more about the new autoscaling workflow in the docs → 

Response Streaming

You can now stream inference responses from GenAI models token by token, via both REST and gRPC. This is ideal for real-time, low-latency GenAI applications such as chatbots and coding copilots.

Revamped Documentation

The docs have been migrated to a new platform and overhauled to improve discoverability. We've also added new content covering common workflows, such as the installation process →

Explore the docs → 

Usability and Configurability Improvements 

Partial Scheduling

Core 2.9 introduces improved handling of partially available models. Instead of waiting until every replica of a model can be scheduled, Seldon now schedules as many replicas as resource availability allows. This means models can still serve traffic even when only some of their replicas are deployed, resulting in smoother autoscaling behavior, especially in resource-constrained environments. As part of this work, we've also added new model status fields that report how many replicas are currently available, clarifying the state of models as they change under elastic scaling.

Configuration via Helm

You can now configure a wide range of settings using Helm (see the sketch after this list), including:

  • Min/max replicas for MLServer and Triton
  • Enabling/disabling autoscaling features
  • PersistentVolumeClaim (PVC) retention policies
  • Log levels for all components, including Envoy
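As an illustrative sketch of a values override (the key names below are assumptions, not the chart's exact schema; check the chart's values.yaml for the authoritative keys):

```yaml
# Illustrative values override -- key names are assumptions and should
# be verified against the chart's values.yaml.
autoscaling:
  enabled: true                 # toggle autoscaling features
serverConfig:
  mlserver:
    minReplicas: 1              # min/max replicas for MLServer
    maxReplicas: 4
  triton:
    minReplicas: 1              # min/max replicas for Triton
    maxReplicas: 2
storage:
  pvcRetentionPolicy: Retain    # keep PVCs when servers scale down
logging:
  logLevel: warn                # applies to all components, including Envoy
```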

CRD Updates

Seldon’s Model Custom Resource Definition (CRD) now supports the following (sketched below):

  • A new spec.llm field to better support prompt-based LLM deployments
  • A status field for availableReplicas to power the new partial scheduling logic (as described above)
  • Full backward compatibility, so you can upgrade with confidence
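For illustration, a prompt-based LLM deployment might look like the sketch below. The storageUri value and the llm sub-field are assumptions for illustration; consult the CRD reference for the exact schema:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chat-prompt
spec:
  storageUri: "gs://my-bucket/prompts/chat"  # hypothetical artifact location
  llm:
    modelRef: llama-3       # assumed sub-field: reference to the backing LLM
status:                     # reported by Seldon; shown here for illustration
  availableReplicas: 2      # powers the partial scheduling logic above
```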

Bug Fixes

The following bugs identified by our users and customers have been fixed as part of the 2.9 release:

  • 503 errors during model rollouts caused by out-of-order routing updates
  • Prometheus metrics now correctly use model names, not experiment names
  • Kafka-based error propagation for issues in the model gateway
  • Pod spec overrides for seldon-scheduler now apply as expected

MLServer 1.7

MLServer 1.7 brings a range of usability enhancements and bug fixes contributed by both Seldon and our open-source community. Notably, it now includes support for Python 3.11 and 3.12, an improvement led by a community contribution. We’ve also introduced greater configurability for the inference pool used by models, helping you maintain performance standards when deploying models with varying resource requirements on shared infrastructure.

The release also expands our MLflow integration with support for the new data types provided by the MLflow runtime (Array, Map, Object, and Any), further broadening its flexibility. Several bug fixes and dependency upgrades are also bundled in this version.

For a full list of changes, explore our release notes → 

See Seldon Core in Action

Book a quick 30-minute call to cover:

  • Your key goals and objectives
  • Current and potential use cases
  • Current challenges 

The aim is to gather enough insights to craft clear, actionable next steps towards your success without taking up too much of your time.

Talk With Us