Red Hat CEO on OpenShift evolution and AI moves

Red Hat made arguably its biggest move to position its open source software stack as the platform of choice for enterprises to build and run artificial intelligence (AI) applications at the recent Red Hat Summit in the mile-high city of Denver.

Emphasising the role of open source in driving advancements in AI, Red Hat CEO Matt Hicks announced a slew of capabilities around OpenShift AI and the new RHEL (Red Hat Enterprise Linux) AI, a foundation model platform for developing, testing and deploying generative AI (GenAI) models.

RHEL AI also includes the open source-licensed Granite LLM (large language model) family from IBM Research, as well as instruction-tuning capabilities by way of IBM’s InstructLab, which enables fine-tuning of AI models to improve their performance on specific tasks, along with their ability to follow instructions.

In an interview with Computer Weekly, Hicks discussed the evolution of the OpenShift platform; how OpenShift AI and RHEL AI work together to lower the cost of AI training and inferencing, and ultimately drive AI adoption in the enterprise; and the traction Red Hat has with customers looking to move away from VMware.

The announcements at this year’s Red Hat Summit build on the conversations we’ve had over the past few years about OpenShift. Tell us about the evolution of the platform and what you envision it to be.

Hicks: I sort of talked about how platforms go in layers, where if I’ve just bought a machine, what is the platform that turns the machine on? For us, RHEL is our canonical that lights up the hardware. Now, as we’re seeing the use of specialised machines with AI, RHEL AI will be that same thing that lights up the machines.

But in so many AI use cases, one machine doesn’t count – you need clusters, connectivity and complex topologies, and that’s what OpenShift provides. It has grown from taking a cluster of RHEL instances and running containers on them to running virtual machines as containers. You can add AI-specific workflows and put LLMs next to the applications. It can run on bare metal, virtualised environments and cloud. OpenShift becomes the core platform, like a vSphere, or a mainframe, or other technologies that you’ve built your application topologies around.

OpenShift is that core platform, and then at some point, it touches hardware, and RHEL brings hardware up to the platform. Our goal for OpenShift is for it to be robust enough for AI, virtualisation, or containers. It is the platform that you build skills around, and we’re trying to amplify that with Lightspeed to make it easier.

What were some of the learnings as you baked AI capabilities into OpenShift?

Hicks: The first learning we’ve seen in the market and experienced ourselves is that these really big models are amazingly capable but are also very expensive to run. And so, we saw pretty quickly that we needed to do things smaller. Smaller models are cheaper to run and train. But the training was incredibly difficult – fine-tuning required skills and knowledge in data science. That didn’t feel sustainable, so it was natural for us to build OpenShift AI first and say, if you have LLMs, this is a platform that can make them work with your applications, do the serving and optimise training costs.

RHEL AI makes that story a lot cleaner with instruction-tuning. Our goal is to convert almost everything we’re doing now to an instruction-tuned base, because if it fits into knowledge and skills buckets, then it’s much easier and faster to run. And so, if you’re creating models, OpenShift AI becomes a very clear value proposition. As I’ve described to customers, if you bought a Dell PowerEdge with eight Nvidia H100 cards, you could run the server full throttle to train your first model. But you’re not going to stop there – you’re going to start training a second model, but how many cards do you need for inferencing and your next training? If you get a second success, you’d want to do a third model, but each of those cards costs $50,000.

With OpenShift AI, if you tell us that you want to spend 75% of your resources on training and 20% on serving, with 5% left, we’ll manage that efficiency for you, so you can get to that first experiment fast. The easy POCs [proofs-of-concept] and the big models are good to do, but very hard to put into production. Our hope is for you to do those POCs with smaller open source models. You can run them yourself in your datacentres, and with OpenShift AI, you can do that at scale as well.
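The arithmetic behind that example can be sketched in a few lines. The snippet below is an illustration only, not how OpenShift AI actually schedules resources: it takes the eight-card server and the $50,000 per-card figure Hicks cites, and splits the cards by the 75/20/5 shares using a simple largest-remainder rounding rule. The function name `split_gpus` and the bucket labels are hypothetical.

```python
def split_gpus(total_gpus: int, shares_pct: dict[str, int]) -> dict[str, int]:
    """Split whole GPUs across workloads by percentage share.

    Uses largest-remainder rounding, an assumption made for this sketch,
    so that every card is assigned and none is counted twice.
    """
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must add up to 100")

    whole = {}
    remainders = {}
    for name, pct in shares_pct.items():
        # Integer maths avoids floating-point surprises such as 8 * 0.2.
        whole[name], remainders[name] = divmod(total_gpus * pct, 100)

    # Hand any unassigned cards to the workloads with the largest remainders.
    leftover = total_gpus - sum(whole.values())
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        whole[name] += 1
    return whole


CARD_COST_USD = 50_000  # per-card figure quoted in the interview
TOTAL_GPUS = 8          # the eight-card server from the example

allocation = split_gpus(TOTAL_GPUS, {"training": 75, "serving": 20, "reserve": 5})
server_gpu_cost = TOTAL_GPUS * CARD_COST_USD

print(allocation)       # cards per workload
print(server_gpu_cost)  # GPU spend for one server, in US dollars
```

On a single eight-card server the 5% reserve rounds down to no whole card at all, which hints at why this kind of split is handled across a cluster rather than one machine at a time.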

You prefaced your keynote with some of the academic research that’s going on around AI. Given that OpenShift supports a distributed environment, what are your thoughts about enabling some sort of distributed training at the edge?

Hicks: The Granite models that we’ve open sourced were built in a supercomputing environment run by IBM. Those models were trained on OpenShift purely for the reason you’re talking about because OpenShift is very good at cluster distribution. So, if you have big GPU tasks to run, IBM happens to have pre-training capabilities which very few companies have. This is where we started with the research and optimised OpenShift and OpenShift AI to the highest bar that no customer is going to try to do.

If we can make this work and create a base model, instruction-tuning or fine-tuning is like a walk in the park. We used the distributed capabilities of Kubernetes and OpenShift to then build the OpenShift AI capabilities and tested those just to have product confidence of the highest level and scale. That’s why we’re very bullish – you can instruction-tune it as much as you want as OpenShift AI will scale. That’s the strength that you touched on there.

When we last spoke, you mentioned that Red Hat was not getting into the model space. But you’re somewhat dipping your toes into it with Granite now?

Hicks: I feel like I’ve said it a hundred times to people in our company that we’re not getting into the model space because we don’t have 100 people with PhDs who know this domain deeply. But IBM Research does, so the choice of taking their IP, using Red Hat as the go-to-market channel and open sourcing that, flipped the equation for us.

Location-wise, we do a ton of work out of Boston where IBM has a huge relationship with MIT [Massachusetts Institute of Technology]. We have all the researchers we need who know this domain deeply. That provided the shift for us to not only do an AI operating system, but one that includes models with multiple parameter sizes. And we can support what we ship because we now have that domain expertise.

Can you give us a sense of the traction of OpenShift AI in the marketplace since its launch last year?

Hicks: OpenShift AI will be strongest if there’s demand for on-premise training because the public clouds will have the likes of Vertex AI, Bedrock and SageMaker. The best proxy of that demand is Dell’s and HPE’s last-quarter results for GPU-focused servers that were not sold to the hyperscalers. Both had very strong orders for those machines. That’s a safe external proxy and sort of our addressable install base. You’ll see some public references while you’re here, so I’ll let you connect the dots, but I do think it is a really strong market and RHEL AI will amplify that.

Red Hat’s chief technology officer, Chris Wright, alluded to efforts in making it easier for customers to migrate their VMware workloads to OpenShift. I understand you have partnerships with Nutanix to go after VMware customers. But it’s obviously a multi-year journey, and they can’t move everything overnight to OpenShift. How are those efforts working out?

Hicks: It’s not an area we’re going to push for, but if customers want to move, we’re going to serve them, and there’s been a lot of inbound demand.

Customers are also making platform bets and saying, in most cases, that they are very comfortable with vSphere and have built an ecosystem around it, but do we have a platform that’s strong enough to be their next 10-year bet? We happen to think OpenShift is a very strong platform that doesn’t just serve virtualisation capabilities, but also containers, bare metal and AI.

Those two things have been playing in our favour, but there has to be customer desire because we’re not going to match vSphere feature for feature. We will build migration toolkits to help customers move. We’ll work with global systems integrators to do that, but the customer has to want this change.

KubeVirt, which is the underlying technology of OpenShift Virtualization, is already a top 10 project in the CNCF [Cloud Native Computing Foundation] today, so there’s very little risk for customers other than having to do the work of moving 10, 50 or 100,000 virtual machines. We have a strong ecosystem to get customers to a good place and we have a good platform story that goes beyond what vSphere does.

There have been some concerns that Red Hat might be deprioritising Red Hat Virtualization (RHV) in favour of OpenShift Virtualization. What are your thoughts on that? Or is it a case of being hypervisor neutral, where you’d support Nutanix’s hypervisor as well?

Hicks: Nutanix is a great partner and if people want to get off VMware but don’t like OpenShift, they can go with Nutanix which is betting on AHV and their Acropolis platform. We can also run OpenShift on that, which is a very similar like-for-like switch.

When we talk about RHV, whereas RHV has similarities to a virtualisation product, the actual hypervisor, KVM, is exactly the same for RHV and OpenShift Virtualization. OpenShift has stronger platform capabilities, so that’s where we’ll invest. But it’s not a huge worry for us. If you moved away from RHV to Nutanix, we’re fine with that. We have a business that sits on top of that at our guest and RHEL layers, and OpenShift in the areas above.

But if you’re comfortable with RHEL and KVM, OpenShift Virtualization is going to be a good destination for you because it gives you the option of being able to run OpenShift on bare metal, and if AI is somewhere in your roadmap, you’d be able to do AI experiments and fit GPUs into clusters.

We spoke previously about the lack of SBOM [software bill of materials] capabilities within OpenShift and you mentioned that Red Hat was looking at different standards and approaches. How has that evolved and how do you see SBOMs evolving with AI as well?

Hicks: That’s one of those areas where a lot ties into regulatory choices. It’s one thing to say we can give you secure manifests and it’s another if the US government says this is the way you have to provide manifests, so we have been a little hesitant. The technology works and we’ve certainly made it more programmatic, but it’s a little unclear as to the level of specificity we’ll need in some of these markets. So, you’ll always hear us leave a little openness so we can adjust or fit, but we can solve the customer need today in terms of creating SBOMs.

In AI, we know we need something similar to a CVE [common vulnerabilities and exposures] handling practice for AI models. No one knows how to do that right now. We believe instruction-tuning can solve a good chunk of it, but it won’t solve all of it. SBOMs are very similar. When you’re producing AI models, or AI agents where it’s a combination of models, knowing where they came from and having that same bill of materials that doesn’t change will be critical. It’s something we will build into tools like OpenShift AI, where you can deploy more complex topologies and models, and secure and understand what you deploy.
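One way to picture a bill of materials "that doesn’t change" is a manifest with a content fingerprint: if any component is swapped, the fingerprint no longer matches. The sketch below is purely illustrative. It does not follow an SBOM standard such as SPDX or CycloneDX, it is not an OpenShift AI feature, and every field name and component name in it is a hypothetical placeholder.

```python
import hashlib
import json


def fingerprint(bom: dict) -> str:
    """Return a SHA-256 digest of a manifest in a canonical JSON form.

    Sorting keys means two manifests with the same content but different
    key order produce the same digest. List order is still significant.
    """
    canonical = json.dumps(bom, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest for an agent built from two models.
agent_bom = {
    "name": "example-agent",
    "components": [
        {"model": "base-model-a", "version": "1.0", "source": "internal-registry"},
        {"model": "tuned-model-b", "version": "2.3", "source": "internal-registry"},
    ],
}

# Same content, keys written in a different order.
reordered_bom = {
    "components": agent_bom["components"],
    "name": "example-agent",
}

# One component bumped to a different version.
changed_bom = {
    "name": "example-agent",
    "components": [
        {"model": "base-model-a", "version": "1.1", "source": "internal-registry"},
        {"model": "tuned-model-b", "version": "2.3", "source": "internal-registry"},
    ],
}

recorded = fingerprint(agent_bom)
print(recorded == fingerprint(reordered_bom))  # same content, same digest
print(recorded == fingerprint(changed_bom))    # changed component, different digest
```

Recording the digest at build time and rechecking it at deployment is one simple way to detect that what is running is not what was assembled; it says nothing about whether the components themselves are trustworthy.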