AI gateways, Kubernetes multi-tenancy loop in LLMOps

The emergence of LLMOps affects many areas of IT infrastructure, and products and projects addressing each of them rolled out this week at KubeCon.

SALT LAKE CITY -- As LLMOps dominated discussions at KubeCon, new features and projects to ease multi-cluster and multi-cloud management proliferated.

At the highest layer of the Open Systems Interconnection network model, where applications such as agentic AI microservices connect, several AI gateways have emerged in the last six months to add enterprise governance to large language model operations. More alternatives were added to that market by cloud-native vendors this week. At deeper layers of infrastructure, Kubernetes-related Cloud Native Computing Foundation (CNCF) projects and platform vendors such as Red Hat began to stitch together the building blocks of secure multi-tenancy for highly distributed infrastructure.

Many mainstream platform engineers are just getting started incorporating LLMOps. One of these new adopters said he planned to focus his first efforts on offering developers access to generative AI services through a homegrown AI gateway using Linkerd's version of the Kubernetes Gateway API.

"By building our AI architecture around the LLM gateway, we can create an evolutionary system that's capable of adapting to changes in the industry," said Kasper Borg Nissen, a staff platform engineer at Lunar, a digital bank in Denmark, during a keynote presentation here this week. "The LLM gateway allows developers to easily access language models with built-in capabilities such as enforcing company policies, providing access control, rate limiting, observability, cost management and more."

Kasper Borg Nissen, a staff platform engineer at Lunar, presents during a keynote session at KubeCon 2024.
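The gateway capabilities Nissen lists (policy enforcement, access control, rate limiting and usage tracking for cost management) can be pictured with a minimal sketch. This is an illustration only, not Lunar's implementation and not the Linkerd or Kubernetes Gateway API: the `LLMGateway` class, its parameters, the fixed-window rate limit and the character-count usage metric are all assumptions made for the example, and the backend is any callable that stands in for a model.

```python
import time
from collections import defaultdict


class GatewayError(Exception):
    """Raised when the hypothetical gateway rejects a request."""


class LLMGateway:
    """Illustrative gateway: every name here is an assumption, not a real API."""

    def __init__(self, backend, allowed_teams, max_requests_per_window,
                 window_seconds=60.0, blocked_terms=(), clock=time.monotonic):
        self.backend = backend                    # callable: prompt -> completion
        self.allowed_teams = set(allowed_teams)   # access control list
        self.max_requests = max_requests_per_window
        self.window_seconds = window_seconds
        self.blocked_terms = tuple(t.lower() for t in blocked_terms)
        self.clock = clock                        # injectable for testing
        self._windows = {}                        # team -> (window_start, count)
        self.usage = defaultdict(int)             # team -> prompt characters sent

    def complete(self, team, prompt):
        # Access control: only known teams may call a model.
        if team not in self.allowed_teams:
            raise GatewayError("access denied")

        # Company policy: reject prompts containing blocked terms.
        lowered = prompt.lower()
        if any(term in lowered for term in self.blocked_terms):
            raise GatewayError("prompt violates policy")

        # Rate limiting: fixed window, reset once the window has elapsed.
        now = self.clock()
        start, count = self._windows.get(team, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0
        if count >= self.max_requests:
            raise GatewayError("rate limit exceeded")
        self._windows[team] = (start, count + 1)

        # Observability and cost management: record usage per team.
        self.usage[team] += len(prompt)
        return self.backend(prompt)
```

Because callers only see `complete`, the backend model can be swapped without changing application code, which is the adaptability Nissen describes.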

AI gateways put a new face on API management

The AI gateway market is crowded and fragmented so far, a trend heightened by this week's KubeCon + CloudNativeCon updates. Linkerd cloud-native networking competitor Solo.io donated its Gloo Gateway to CNCF this week, while Tetrate and Bloomberg demonstrated their own AI gateway project on the keynote stage.

Solo also made feature updates to its commercial AI gateway to bolster security with built-in semantic analysis on LLM prompts. The vendor also added the ability to prioritize regions and availability zones among public clouds and self-hosted infrastructures when load balancing workloads between LLMs.

Solo will integrate Nvidia Inference Microservices into its broader cloud-native application orchestration platform under a new partnership. The arrangement was unveiled this week, as company officials bet on NIM as the dominant building block for AI microservices.

"NIM represents the next phase of customers running their own models on Kubernetes and a standard scaffolding ... for inferencing in Kubernetes," said Keith Babo, head of product at Solo. "Only the biggest companies with the most sophisticated teams that are already doing very deep supervised machine learning and large ML teams are prepared for [that] -- this is opening it up to the rest of organizations."

Secure multi-tenancy with GPUs a work in progress

Red Hat OpenShift 4.17, shipped this week, combined security with multi-tenancy for Kubernetes clusters in new ways with a group of features in technology preview. Native network isolation for namespaces offers a harder boundary between tenants sharing the same cluster than Kubernetes' default network policies, and Red Hat officials claimed it is also more straightforward for cluster operators to use. OpenShift 4.17 also supports a beta feature introduced in Kubernetes 1.30 in April called pod user namespaces, which isolates the processes running inside a container from those running on the host to prevent privilege escalation. Support for Red Hat's confidential computing attestation operator was also included in this week's updates, to partially automate the deployment of a trusted container environment in highly regulated organizations.

Other vendors, such as Aviatrix and cloud service providers, are working at lower levels of the network to connect and secure multiple Kubernetes clusters that run in multiple clouds. However, secure multi-tenancy with GPUs in shared Kubernetes clusters remains an unsolved problem, according to platform engineers during a breakout session panel presentation here this week.

Serverless multi-tenancy similar to AWS Lambda's Firecracker, built for CPUs, could also be ported to work for GPUs, said panelist Aditya Shanker, senior product manager at Crusoe, a clean computing infrastructure service provider in San Francisco.

Meanwhile, another panelist said open source approaches to multi-tenancy at the chip level have yet to catch up to proprietary methods of time-slicing GPUs within Kubernetes clusters.

"That is absolutely not something that a lot of the cloud providers want to pay for to that particular hardware vendor, to then expose to you as an end user," said Rebecca Weekly, vice president of infrastructure at Geico, during the panel session. "With the previous, older GPUs, you could use GRID technology, and that was exposed in a more open fashion. But there's a very different business model going on today."

Kubernetes projects tackle multi-cluster headaches

Elsewhere at the Kubernetes cluster level, LLMOps requires a reconsideration of the way physical hardware is shared among server nodes, Kubernetes pods and end-user tenants, to maximize resource utilization while maintaining performance and resiliency. One of the CNCF projects showcased here this week was Kueue, a job queuing controller that offers fine-grained, device-specific controls over cluster resources and is already incorporated into products such as OpenShift AI.

Another keynote this week introduced the beta-stage MultiKueue, a job dispatcher that offers similar controls to Kueue, but spread over multiple clusters.

"[A user] submits a task into a single cluster, [which then] automatically distributes it across the worker clusters, monitors which of them admits it, removes unnecessary copies, waits for worker completion and updates job status in the management cluster so that [the user] doesn't have to look around for the status of the job," said Marcin Wielgus, a staff software engineer at Google, during the presentation.
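The sequence Wielgus describes can be sketched as a small simulation. This is not the Kueue or MultiKueue API: `WorkerCluster`, `dispatch`, the GPU-quota admission rule and the status dictionary are hypothetical names and simplifications chosen only to make the order of steps concrete.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerCluster:
    """Hypothetical stand-in for a worker cluster with a fixed GPU quota."""
    name: str
    free_gpus: int
    jobs: dict = field(default_factory=dict)  # job name -> state

    def create_copy(self, job_name):
        self.jobs[job_name] = "pending"

    def try_admit(self, job_name, gpus):
        # Assumed admission rule: admit only if enough quota is free.
        if self.jobs.get(job_name) == "pending" and self.free_gpus >= gpus:
            self.free_gpus -= gpus
            self.jobs[job_name] = "admitted"
            return True
        return False

    def delete_copy(self, job_name):
        self.jobs.pop(job_name, None)

    def run_to_completion(self, job_name):
        self.jobs[job_name] = "finished"
        return "finished"


def dispatch(job_name, gpus, workers):
    """Mimic the management-cluster flow for a single job."""
    status = {"job": job_name, "state": "pending", "cluster": None}

    # 1. Distribute a copy of the job to every worker cluster.
    for worker in workers:
        worker.create_copy(job_name)

    # 2. Monitor which worker admits the job first.
    winner = next((w for w in workers if w.try_admit(job_name, gpus)), None)
    if winner is None:
        return status  # nothing admitted yet; copies stay pending

    # 3. Remove the now-unnecessary copies from the other workers.
    for worker in workers:
        if worker is not winner:
            worker.delete_copy(job_name)

    # 4. Wait for completion and record the result in the management status.
    status["state"] = winner.run_to_completion(job_name)
    status["cluster"] = winner.name
    return status
```

For example, with an `east` cluster holding 2 free GPUs and a `west` cluster holding 8, dispatching a 4-GPU job leaves the only remaining copy on `west`, and the user reads the outcome from the single returned status rather than checking each cluster.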

Beth Pariseau, senior news writer for TechTarget Editorial, is an award-winning veteran of IT journalism covering DevOps. Have a tip? Email her or reach out @PariseauTT.
