Delfin-NG

Training as a Service

Project overview

Delfin-NG Solves the GPU Resource Crunch in Enterprise Environments

Delfin-NG is a distributed platform for enterprise environments that ties together scattered GPU resources from different Kubernetes clusters or standalone hosts and makes them available for GPU-based workloads.

Objectives

The key objectives of our Delfin-NG research are to:

Incentivize GPU resource owners to temporarily make their resources available while still preserving control over them.
Leverage under-utilized GPU resources in the enterprise across multiple geographical regions.
Enable resource utilization that is open, available, and auditable for transparency.
Develop a system that is agnostic to GPU workload type, including Machine Learning (ML), video processing, etc.
Adhere to regional data and locality restrictions as well as regulatory constraints.
Manage GPU resources with widely varying capabilities, such as architecture, number of cores, and memory capacity.
Account for the varying network latencies among GPU nodes.
Support trialing various incentivization schemes.
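
Several of these objectives (data locality, heterogeneous GPU capabilities, and network latency) can be read as constraints on which pooled nodes a given job may use. The sketch below illustrates that idea only; it is not Delfin-NG's actual scheduler or API. The names GpuNode, JobRequest, and eligible_nodes, along with all field names and sample values, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuNode:
    """Hypothetical description of one pooled GPU node."""
    name: str
    region: str
    architecture: str
    memory_gb: int
    latency_ms: float  # assumed round-trip latency to the job's coordinator


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical constraints a workload places on candidate nodes."""
    allowed_regions: frozenset  # regions permitted by data-locality rules
    min_memory_gb: int
    max_latency_ms: float


def eligible_nodes(nodes, job):
    """Return nodes that satisfy every constraint, lowest latency first."""
    matches = [
        n for n in nodes
        if n.region in job.allowed_regions
        and n.memory_gb >= job.min_memory_gb
        and n.latency_ms <= job.max_latency_ms
    ]
    return sorted(matches, key=lambda n: n.latency_ms)


# Illustrative inventory; the values are made up.
nodes = [
    GpuNode("k8s-eu-1", "eu-west", "ampere", 80, 12.0),
    GpuNode("host-us-7", "us-east", "hopper", 80, 95.0),
    GpuNode("k8s-eu-2", "eu-west", "turing", 16, 8.0),
    GpuNode("k8s-eu-3", "eu-central", "ampere", 40, 20.0),
]
job = JobRequest(frozenset({"eu-west", "eu-central"}), 40, 50.0)

selected = [n.name for n in eligible_nodes(nodes, job)]
print(selected)  # ['k8s-eu-1', 'k8s-eu-3']
```

Here the US host is excluded by the region restriction and the 16 GB node by the memory requirement, which leaves two European nodes ordered by latency. A real placement policy would also need to account for owner control and incentivization, which this sketch leaves out.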

Real-world applications

Delfin-NG bundles islands of scattered GPU resources and makes them available to the enterprise community. This raises GPU utilization across the company and thus improves return on investment. It is especially useful for Large Language Models (LLMs), because it enables users to run large training jobs that could not otherwise be accommodated.