Cumulus Labs builds a production inference platform for AI teams that need routing, caching, observability, evaluation, fine-tuning, and hosting in one API. Its Ion inference engine targets NVIDIA Grace and Blackwell hardware, and the company claims 30 to 50 percent higher throughput than vLLM and SGLang. The company is a Y Combinator Winter 2026 startup founded in 2025 by Suryaa Rajinikanth and Veer Shah.
Submit workloads with budget constraints; Profile resource requirements for optimal allocation; Allocate jobs across shared GPUs; Pay only for actual compute usage; Scale GPU resources instantly based on demand
Founder
Cumulus Labs offers two main products:
Cumulus Cloud: This service provides serverless GPU inference with quick cold starts of 12.5 seconds. It allows users to deploy any model, scale resources to zero when not in use, and pay only for the compute resources they actually utilize. This flexibility is particularly beneficial for data scientists and developers working on AI and machine learning projects, as it optimizes costs and resource allocation.
Cumulus OS: This is an on-premises GPU cluster operating system that includes features such as fleet management, intelligent bin-packing, and Kubernetes-native orchestration. It is designed to help organizations manage their GPU resources efficiently, ensuring that workloads are distributed optimally across available hardware.
Both offerings are tailored to meet the needs of companies requiring scalable GPU resources without the burden of overprovisioning or idle costs.