Project overview

BLIS is a scalable and flexible service for standing up ML models for inferencing across heterogeneous types of hardware accelerators. BLIS has the following characteristics:

  • Designed to be an easy-to-use ML model inferencing service that can be easily deployed as an operator on a Kubernetes cluster.
  • Supports inferencing on CPUs and GPUs, and also on other hardware accelerators such as VPUs and FPGAs, which are less expensive, consume less power, and can have lower prediction latencies.
  • Provides efficient and intelligent hardware resource management that continually adjusts (auto-scales) based on the incoming load and other parameters.
  • Provides a QoS service model where the deployer of the service can specify multiple models which offer varying QoS (e.g. accuracy) at different latencies. The client for each prediction request can specify a desired QoS and latency, and BLIS will select the best model to process the request.
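As an illustration, a single prediction request might carry its QoS constraints alongside the input data. The following is a minimal Python sketch; the endpoint URL and the field names ("min_accuracy", "latency_ms") are assumptions made for illustration and not BLIS's actual request schema.

    # Hypothetical client request: the endpoint path and field names are assumed.
    import requests

    payload = {
        "inputs": [[0.12, 0.48, 0.91]],   # example feature vector
        "qos": {
            "min_accuracy": 0.90,         # lowest acceptable model accuracy
            "latency_ms": 50,             # latency deadline for this request
        },
    }

    # BLIS would route the request to a model/accelerator pair that can
    # satisfy both the accuracy and the latency constraint.
    response = requests.post(
        "http://blis.example.local/v1/endpoints/classifier:predict",
        json=payload, timeout=5)
    print(response.json())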

Project details


Objectives

We aim to deploy machine learning models at scale in production environments by developing a robust platform for an edge-cloud environment. The platform does the following:

  • Offers “inferencing as a service”, which allows the lifecycle of the application code to be separate from the lifecycle of the ML Model(s).
  • Removes the need for ML-enabled applications to be concerned with scaling, resilience, and performance optimization of deployed ML model(s).
  • Provides an easy, declarative mechanism for deploying inferencing endpoints, as well as web GUIs and programmatic API options (see the sketch after this list).
  • Provides BLIS either as an embedded component of an application or as a stand-alone as-a-service offering, with several web-based user interfaces for easy use and monitoring of the system.
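Because BLIS is packaged as a Kubernetes operator, the declarative mechanism mentioned above can be pictured as creating a custom resource that describes an inferencing endpoint. The sketch below uses the standard Kubernetes Python client; the API group, kind, and spec fields are hypothetical and only illustrate the declarative style, not the real BLIS resource schema.

    # Hypothetical custom resource describing an inferencing endpoint.
    from kubernetes import client, config

    config.load_kube_config()

    endpoint = {
        "apiVersion": "blis.example.com/v1alpha1",  # assumed API group/version
        "kind": "InferenceEndpoint",                # assumed kind
        "metadata": {"name": "image-classifier"},
        "spec": {
            # Several model variants back one endpoint so that requests can
            # trade accuracy for latency.
            "models": [
                {"name": "resnet18", "accuracy": 0.70},
                {"name": "resnet152", "accuracy": 0.78},
            ],
            "accelerators": ["cpu", "gpu", "vpu"],
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="blis.example.com", version="v1alpha1",
        namespace="default", plural="inferenceendpoints", body=endpoint)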


Research breakthroughs

BLIS has enabled our research in several areas:

  • Use of heterogeneous accelerators: BLIS nodes may be equipped with many different types of accelerator hardware (CPUs, GPUs, FPGAs, ASIC devices, etc.). The accelerators have very different characteristics, and not every model is able to run on every type of accelerator. Our research looks at the optimization choices when there are many different models running on combinations of different accelerators (including accelerators which support sharing their resources among several different models). Traditional “Scale out” optimization misses the opportunity to rearrange which models are on which accelerators to optimize the inference serving throughput for the offered load.
  • Use of QoS in inferencing: There is an opportunity to trade off inferencing accuracy for lower latency when you have ML models of various sizes and architectures that perform the same function. BLIS supports specifying the minimum accuracy as well as a latency deadline for each inferencing request. Based on the application’s defined constraints, BLIS selects an accelerator with a corresponding model meeting or exceeding those constraints.
  • Enabling MLOps: BLIS supports “inference graphs” (IGs): directed graphs of models that enable “shadow”, “canary”, “split-model”, and “ensemble” deployments within an inferencing endpoint. This allows external MLOps systems to manage deployments of new model versions and construct services consisting of multiple models. BLIS also supports monitoring and logging of inference requests and responses so that MLOps systems can perform functions such as “drift detection” or detection of adversarial attacks.
  • Continuous model deployment optimization: BLIS supports a “plug-in” architecture so that it can be used as a platform for testing different optimization techniques for deciding on the best deployment of models to accelerators based on the recent offered load. This allows for trialing different optimizers, such as those based on heuristics, reinforcement learning, ILP solving, and other optimization techniques as part of our ML inferencing research.
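To make the plug-in idea concrete, an optimizer can be thought of as a component that maps the recent load, the available model variants, and the available accelerators to a placement decision. The interface and toy round-robin strategy below are assumptions used only to illustrate how different optimizers could be swapped behind a common interface; they are not BLIS's actual plug-in API.

    # Hypothetical optimizer plug-in interface; all names are illustrative only.
    from abc import ABC, abstractmethod
    from typing import Dict, List

    class PlacementOptimizer(ABC):
        """Decides which model variants run on which accelerators."""

        @abstractmethod
        def place(self,
                  load: Dict[str, float],      # recent request rate per model
                  models: List[str],           # available model variants
                  accelerators: List[str],     # available accelerator devices
                  ) -> Dict[str, List[str]]:   # accelerator -> models to load
            ...

    class RoundRobinOptimizer(PlacementOptimizer):
        """Toy strategy: spread model variants evenly across accelerators."""

        def place(self, load, models, accelerators):
            placement = {acc: [] for acc in accelerators}
            for i, model in enumerate(models):
                placement[accelerators[i % len(accelerators)]].append(model)
            return placement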


Real world applications

Since BLIS is implemented as a Kubernetes operator, it can be embedded within a cluster to provide accelerators as a shared inferencing resource to distributed applications. BLIS is currently being deployed in several Nokia applications, such as factory automation for Industry 4.0 applications that depend on ML as part of their service. BLIS would also be very relevant and useful in other similar domains where separation of application and model lifecycles are important.


Future research

The future of ML inferencing is very promising, with an increasing amount of software using ML models in the services it offers and an increasing variety of accelerators becoming available. We also see MLOps playing a growing role in monitoring and continuously updating models, with inferencing systems providing many of the features used for model deployment and management.

There are several opportunities for research in BLIS:

  • Adding support for model-parallel inferencing, as required by some Large Language Models (BLIS can currently divide models horizontally, but not vertically).
  • Adding support for chaining models together, where the output of one LLM is used as a prompt generator or fact checker for input to a second LLM, all behind a single inferencing endpoint.
  • Applying a reinforcement learning-based optimizer for managing which models run on which accelerators.

Project members

APA style publications

Suhaib, Thomas Williams, & Brian Friedman. (2024). High-throughput inference-serving system with accuracy scaling. In Proceedings of the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).