Nick Pandher, VP of Product at Cirrascale, explains why AI is shifting from model training to inference at scale. As AI moves into production, enterprises are prioritizing performance, latency, reliability, and cost-efficiency over raw compute power.

 
 
 

CRAIG:

Okay. Yeah. So Nick, why don't you start by introducing yourself to listeners and then we'll talk about inference.

NICK:

Absolutely. So hello, everybody. I'm Nick Pandher with Cirrascale. Cirrascale is a neo cloud specializing in lots of different areas around AI, but today we're going to focus most specifically on inference. I'm VP of product, and part of that role is working with our development teams to build out unique capabilities for AI services around inference and other areas as well. My background comes out of the GPU space: I was at NVIDIA, very early on at AMD, and at several startups that have touched AI models and other types of models in their use cases.

CRAIG:

Yeah, tell us first what Cirrascale does.

NICK:

Absolutely. So we're a more white-glove type of cloud services provider. What we're focused on is understanding customer-specific issues when they're looking for outside cloud services. Typically we're taking customers who are coming from an on-premise environment or a hyperscaler and are looking for services and capabilities that may not be part of what they're getting today. If we look at it from an AI point of view, it's being able to deliver the correct AI fleets for given use cases. It's covering any compute needs that might be required for pre- or post-processing data. It's delivering services that bring high-performance, reliable solutions. Specifically, with inference being a differentiated area, we really wanted to expand those services beyond our AI-accelerated fleet, where customers typically bring their own stacks and their own solutions, into more ready-to-go solutions that a customer can take and utilize for training, fine-tuning, and also for inference.

CRAIG:

Yeah. And is your differentiator primarily in the software? Or is it that you have specialized GPUs tuned for inference?

NICK:

Yeah, that's a great question. There are a few different facets there. One is we're all about operating a cloud service that balances the highest performance, the capabilities the customer is looking for, and the highest reliability. So if we look at training-type workloads, we want to provide what we call the highest job completion rate. If we're looking at inference, we want to provide the most robust, five-nines type of uptime, and then unique software capabilities to deliver serverless solutions.

CRAIG:

Yeah. And we're talking about inference. There was a shift maybe six months to a year ago when people stopped talking about training and started talking about inference. I've had Andrew Feldman from Cerebras and Rodrigo Liang from SambaNova and some of the other inference-focused chipmakers on. Those guys have had trouble getting developers to adopt their hardware because developers are so married to CUDA, and they've shifted to building their own clouds, or trying to get clouds to adopt their chips, and providing inference as a service. Can you talk first of all about why there's been this shift to inference, and how you see the explosion of inference workloads developing as AI takes hold in the economy?

NICK:

Yeah, absolutely. If we take a look at the shift from training to inference, there's a little fact that's really important: there's a large group of organizations out there building models, either for their own proprietary services, like OpenAI with its ChatGPT models, which work really well for a wide range of uses, or similar models from other players in the space today. Probably the number one model everyone's using today from an inference point of view is the Copilot that Microsoft provides as part of the Office 365 package. Those are good at specific tasks. When you start to look at more enterprise workflows, and at agentic use cases where you're taking model output and feeding it into outside workflows and existing applications in the organization, there's a select group of models out there that can actually be used today with no need to pre-train a model, which is a very expensive, long operation, and no real reason to post-train one either. What we're really looking at is taking an existing foundational model, and there's a whole series of them out there, Meta's Llama, AI2's open models, and others, and doing some level of RAG so the model is more applicable to your use cases, or, for the models that allow it, doing some level of fine-tuning. So there are a lot of options to take those models, skip the entire training path, and start using them in your specific workflows, connecting them to existing agentic use cases. There's a lot more that can be done with the off-the-shelf models than was perceived a year ago or even six months ago.

CRAIG:

Yeah, although I don't think of that as the cloud provider's job. Is it unusual that you're getting involved in optimizing inference, as opposed to just providing the compute?

NICK:

Yeah, absolutely. There's a fundamental change happening in that people are taking those off-the-shelf solutions and realizing they may have requirements unique to their organization around how those models are served. Let's take an example. OpenAI's ChatGPT is a great model, but there may be some organizations who are nervous about their data being shared with a model developer. Different organizations will have different security concerns, whether it's regulatory or just protecting their proprietary information. So, as a good example, you can take OpenAI's open-weights model, GPT-OSS 120B, and deploy it in a private environment through a cloud provider. Now you have a model that is nearly as capable as GPT-5, but running in a private environment where your data, your prompts, and your outputs are all protected. From that assurance point of view, you're getting something very unique, and you're leveraging a cloud provider for it.
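
To make that concrete, here is a minimal sketch of what querying a privately deployed open-weights model could look like, assuming the provider exposes an OpenAI-compatible endpoint; the base URL, model name, and environment variable are placeholders for illustration, not Cirrascale's actual API.

# Hypothetical example: querying a privately hosted open-weights model
# through an OpenAI-compatible endpoint. The URL, model name, and key
# below are placeholders, not Cirrascale's actual service.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example-private-cloud.com/v1",  # your private endpoint
    api_key=os.environ["PRIVATE_INFERENCE_API_KEY"],            # key bound to your account
)

response = client.chat.completions.create(
    model="gpt-oss-120b",  # an open-weights model served inside your own environment
    messages=[{"role": "user", "content": "Summarize our internal claims backlog report."}],
)
print(response.choices[0].message.content)

Because the endpoint lives inside the private environment, the prompt and the completion never leave it.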

CRAIG:

Yeah. Although, don't most of the cloud providers offer private environments? Even OpenAI offers private environments, don't they, for those kinds of security reasons?

NICK:

Yeah. So the differentiation we like to see ourselves providing is understanding what's unique about the customer. Are they connecting parts of the workflow to a hyperscaler, or parts of it to on-premise infrastructure, as part of their end-to-end workflow? Are we able to provide a solution that's more technically enriched for their specific needs? If they have specific workflows or software stacks they already use on-prem, we want to be able to extend those. So if they've already built an inference services layer on-prem, they're happy with it, and they're asking, how do I extend that to the cloud? That's an area where we can help and bring some uniqueness on the Cirrascale side: providing that connectivity to their on-prem, providing a higher-performance fleet that can be cost-effective for what they're trying to do, and giving them predictable performance.

On the other hand, if we're looking at people who are already sitting in a hyperscaler, one of the biggest challenges with these private AI deployments inside a hyperscaler is that the costs can add up really quickly. Where a hyperscaler is really good is letting you get on and off and not utilize those nodes 24/7. Where neo clouds like ourselves are going to be better is when a customer needs a certain level of compute for generating tokens 24/7, and then a higher level at certain times, say weekdays, 9 a.m. to 9 p.m., when people are using the model more and need five times the capacity. It's sometimes a challenge in hyperscalers today to get reliable performance for some of these model deployments. The biggest indicator that you're having a performance issue is that time to first token dramatically increases. So you're paying for the usage of the model, but you're also starting to suffer from the performance issues that come with this great uptake as AI becomes more prevalent in a lot of different organizations.

CRAIG:

Yeah, so throughput is one of your key metrics, I would imagine.

NICK:

Yeah, I'd almost say it's providing the right compute solution for the customer, where the right throughput, the performance, and the cost efficiency are really the key drivers. The throughput is about serving continuous requests reliably; for an LLM, that's millions of token requests, with predictable performance across different model architectures, because every model behaves differently. So it's making sure the fleet of compute operating that model provides the best performance and the lowest time to first token if it's a real-time use case. And then it's having the cost efficiency so that you can scale up: as your user counts go up, or the complexity and usage of the model go up, you're able to cover those really well. What we really see is that it's the balance of understanding all of those elements, helping a customer weigh the performance and capability needs against the economics of deploying a model and getting it used in a specific workflow.
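
As a rough illustration of the metrics Nick mentions, here is a hedged sketch of how you might measure time to first token and streaming throughput against any OpenAI-compatible endpoint; the URL and model name are placeholders.

# Hypothetical sketch: measure time-to-first-token (TTFT) and rough streaming
# throughput for a single request. Endpoint and model are placeholders.
import os
import time
from openai import OpenAI

client = OpenAI(base_url="https://inference.example.com/v1",
                api_key=os.environ["INFERENCE_API_KEY"])

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="example-model",
    messages=[{"role": "user", "content": "Explain RAG in two sentences."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    if chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # streamed chunks, a rough proxy for tokens

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.2f}s, ~{chunks / elapsed:.1f} chunks/s")

Run against the same model at different times of day, the TTFT number is a quick way to spot the kind of performance degradation described above.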

CRAIG:

Yeah. And I didn't realize until a year or so ago that even the hyperscalers farm out a lot of their workloads to private data centers. Is that right? And do you guys serve the hyperscalers?

NICK:

Yeah, our focus is really supporting those who are building and delivering services on top of generative AI or AI models. So we're talking about customers who are building service offerings for end customers or for organizations. We want to work closely with ISVs and SaaS providers who are looking to extend their existing workflows to add AI services, but may not want to run that in their existing hyperscaler, just from a cost and performance point of view. And the biggest growth area we see coming into the tail end of this year and into next year is enterprise customers who largely understand what they want to do. They've looked at the hyperscaler solutions where they can get a model garden of different models and an API endpoint and get going, and they're looking for a wider variety of models, but from somebody who can deliver it as a service offering. So more of a serverless offering, but built around the models they care about. In some organizations those are going to be proprietary models that might come from a SaaS provider, or a fine-tuned version of a model that's unique to that organization. And when you start to bring those custom or semi-custom models into a hyperscaler, you start to break the bounds of some of the services they offer today.

CRAIG:

Yeah. And what chips are you using? Are you using Cerebras or SambaNova, which show this sort of blindingly fast inference? Or are you, like somebody else I spoke to recently, speeding up inference simply through software optimization strategies?

NICK:

Yeah, we're unique in that we're agnostic about the acceleration technologies for AI models. We've worked with a significant number of incumbent partners, and we've started working with a series of new technologies. You just mentioned two of them; we have relationships with both and work closely with their teams on customer opportunities. At the same time, there's a new generation of technologies coming out. If we look at the two largest existing players, NVIDIA and AMD, they're doing extremely well, and we do very well with the solutions we offer in conjunction with them. But there's been a group of newer technologies in the last year or so that are focusing on differentiated areas, providing energy efficiency and performance differentiation, and taking it from a different angle. In that regard, we've been working with Qualcomm and have introduced offerings built around Qualcomm's Cloud AI 100 Ultra. It fits really nicely with the software stack they've delivered, which provides a suite of services built around inference to make access to models much easier. In the same light, Qualcomm has also just announced newer technologies, their AI200 and AI250. These are the kinds of solutions we want to see coming into the market. We want diversity. We want to figure out where each of these solutions fits a specific use case or a specific implementation of a model, how it's used in a specific agentic workflow, and which solutions provide the best performance. So when a customer comes to us and says, I'm looking for inference with this technology, we can say, okay, let's actually pilot the model in an environment with different accelerators and find the best one for you. With solutions like Qualcomm's, we want to position them where they fit best and can deliver performance that's unique when you blend it with the cost and power efficiency you're going to get as well.
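
Piloting the same model across different accelerator-backed endpoints, as described here, might look something like the following loop; the endpoint URLs, model name, and prompts are all hypothetical stand-ins.

# Hypothetical sketch: run the same prompt set against endpoints backed by
# different accelerators and compare wall-clock latency. URLs are placeholders.
import os
import time
from openai import OpenAI

ENDPOINTS = {
    "accelerator-a": "https://a.inference.example.com/v1",
    "accelerator-b": "https://b.inference.example.com/v1",
}
PROMPTS = [
    "Classify this support ticket by urgency: ...",
    "Draft a two-sentence summary of this claim: ...",
]

for name, url in ENDPOINTS.items():
    client = OpenAI(base_url=url, api_key=os.environ["INFERENCE_API_KEY"])
    start = time.perf_counter()
    for prompt in PROMPTS:
        client.chat.completions.create(
            model="example-model",
            messages=[{"role": "user", "content": prompt}],
        )
    print(f"{name}: {time.perf_counter() - start:.2f}s for {len(PROMPTS)} prompts")

A real pilot would also track cost per million tokens and output quality, but even a small harness like this makes the comparison repeatable.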

CRAIG:

Yeah. And on the cost efficiency, where does that come from? Simply by increasing throughput, you're getting more inference per token... or rather, more tokens per minute? Where does the cost efficiency come from?

NICK:

Yeah. So the cost efficiency can come from a few different areas. Number one, is the accelerator optimized for the specific need? Is it more optimized for inference versus training? A lot of the accelerators out there today are dual purpose: they're really built to do extremely complex cluster deployments for training. They're going to be excellent for inference too, but they're going to be costly. So there is demand for solutions that aren't training-first but inference-first, and for some workflows they're just going to be the more optimal solution.

CRAIG:

Yeah. And Qualcomm, certainly, from speaking to them, is designed for inference-first workloads. But again, what is it that makes it cheaper? It's not the cost of the chip, obviously. It's not the bandwidth. There's a reason why it's cheaper than a hyperscaler.

NICK:

Yeah, so very simply, you're getting a solution that, as we said, was focused on inference-first workloads. So you're looking at it being built to deliver the throughput required to serve the models, and having the right capabilities to support different model types and the different kinds of networks used inside models, transformers, diffusion models, and so on. It covers the throughput that's specifically needed, but at the same time it covers a more effective power profile. If we look at the power profile of the Qualcomm solutions, it's still a data-center-capable device, but when you compare a rack full of compute with a Qualcomm solution against other solutions out there, you're looking at extreme energy efficiency. What that really comes down to is the power savings. For a lot of these models, between a higher-performance accelerator for certain use cases and an accelerator that's balanced for inference first and for power efficiency, you still get excellent performance serving up those models for those agentic workflows. So it really depends what the customer is looking for. In a lot of cases, what you tend to see with the higher-performance accelerators is that out of a server that has eight GPUs, you're barely using one GPU. If you're going to run a model that only really utilizes one GPU, wouldn't you want something more tuned, where you're utilizing the full GPU and then able to scale into additional GPUs, or run the additional models that are part of your workflow? So it's about better utilizing hardware that's dedicated to a customer while still serving it cost-effectively. One key differentiation for Cirrascale is that in our bare-metal solutions we don't do multi-tenancy. So we really want to lean on partner technologies, like Qualcomm's AI inference software stack, to handle running multiple models for different use cases on a platform. There's some differentiation in the software stack as well: being able to better utilize the hardware and make sure you're getting as much performance out of it as possible while still staying within that lower power curve.

CRAIG:

Yeah. I actually wrote a piece for Forbes a little while ago about these companies that are appearing that aggregate idle GPU and CPU resources and then offer them as a cloud solution, because in most data centers, certainly private data centers, the chips are not being used most of the time. So I can see why that's important. Two questions, then. What kinds of enterprises are you seeing come to you? It's taken a while, as you alluded to, for enterprises to figure out what really works with AI. There have been a lot of pilots, and we're just now moving into production workflows in the enterprise. Is it the shift to AI agents, or is it other forms of AI, that's finally taking hold in the enterprise?

NICK:

Yeah, so I think we can look at two different aspects here. Number one is agentic, which we'll touch on in a second, but let's first talk about automation. There are a lot of tasks you can automate to improve how a person works. A very good example, if we look at one vertical segment, is the banking, finance, and insurance markets. Imagine a mortgage application packet. I think all of us have at some point done a mortgage or a loan where you're submitting lots of documents, and the mortgage provider will always say, oh, we'll turn this around in seven days, and 21 days later they're still asking for documents. Wouldn't it be wonderful to submit all those documents, have an automated workflow parse through them, and understand whether there are red flags the underwriter needs to look at right now, so they can go back to that customer and say, hey, you're missing these, or we need more information here? Right now, by the time they look through all that material, it's a week or two. So you provide some level of automation: an LLM fronted by a multimodal model or even an OCR model. There are some modern OCR models, and we've been working with AI2 on one as an example, that can take computer-generated printed text or handwritten text, convert it to outputs, and then have a model understand those outputs: were certain things filled out correctly, were certain things provided that need to be provided. So that turnaround of 21 days for a mortgage could turn into: within three days, we're telling you everything you need to provide more information on, and you move along. And that's beneficial to an organization because you're utilizing your people; you still have humans in the loop. There's a lot of talk that AI is going to replace what humans do, and I truly don't believe that's 100% the case. Can you make humans more efficient, so that the people working on specific parts of the workflow, in banking and finance, or even in the medical billing industry as an example, are focusing on the more important tasks, and automation deals with the tasks an organization might typically outsource out of the country to a third party? You can keep a lot of that information internal to your organization and have your internal people focus on the areas the model identifies as needing additional understanding.
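
The mortgage-packet example can be sketched as a small pipeline: OCR output goes to a model that flags missing or unclear items for a human underwriter. Everything below (endpoint, model name, required-document list) is a hypothetical illustration of that flow, not a production system.

# Hypothetical sketch of the document-intake automation described above.
# OCR text is checked by an LLM for missing items; a human reviews the result.
import os
from openai import OpenAI

client = OpenAI(base_url="https://inference.example.com/v1",
                api_key=os.environ["INFERENCE_API_KEY"])

REQUIRED_ITEMS = ["W-2", "recent pay stub", "bank statement", "photo ID"]

def review_packet(ocr_text: str) -> str:
    """Ask the model which required items appear missing or incomplete."""
    prompt = (
        "You are assisting a mortgage underwriter. Given the OCR text of an "
        "application packet, list which of these items are missing or unclear: "
        f"{', '.join(REQUIRED_ITEMS)}.\n\nPacket text:\n{ocr_text}"
    )
    response = client.chat.completions.create(
        model="example-model",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content  # reviewed by a human before any follow-up

The human-in-the-loop stays: the model only surfaces what an underwriter should look at first.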

CRAIG:

Yeah. And you guys say you're the first neo cloud to work with Qualcomm's new inference chips and software stack. What exactly is a neo cloud? That's not a term I've heard a lot, but I'm not deep in the cloud space.

NICK:

Yeah. So a neo cloud is a cloud provider that's more focused on delivering a specialized fleet of hardware. We're not a hyperscaler, but we're delivering compute that's more focused on what a customer wants, more tuned to specific compute needs, and built around a training, fine-tuning, or inference workflow. So it's something a little more in tune with what an enterprise organization would want for end-to-end workflows, with services that include not just the hardware, but also storage and any specialized networking or connectivity they need to the rest of their workflows.

CRAIG:

Yeah. And is it fair to say, because these enterprises certainly still use the hyperscalers, that there's a class of problem, inference or training, but specifically inference in regulated industries, where they need a ring-fenced private cloud? And the big hyperscalers don't necessarily provide the bespoke or tailored solutions that these enterprises need, so it's a portion of their inference workload they send to a neo cloud. Or am I extrapolating too much?

NICK:

No, you're hitting the nail on the head. There are things a hyperscaler is going to be very good at: running SaaS-type applications, the workloads for specific SaaS applications for a given customer. They're really good at that. They're really good at providing those standardized data lakes for an organization to hold all of its proprietary information. Where the hyperscalers are going to struggle is on GPU and compute availability, tying it in cost-effectively for a specific workflow, and delivering the performance requirements you want. So either you have a customer with everything at a hyperscaler, including their applications, who needs an additional lift of compute in a neo cloud to cover use cases that might be cost-prohibitive on top of a hyperscaler; or you have a customer whose workflow is truly hybrid, not just hyperscaler to a neo cloud like ourselves, but also on-prem, where they may decide that their primary data lake actually lives on-prem from a security point of view, or that certain applications, from a regulatory point of view, have to run in a more constrained or controlled environment. So it's the ability to mix in the different types of compute while keeping the data where it needs to be. The idea is your data stays in the data lake; when you're running inference, you bring the prompts into the model and then the output back to wherever it's needed. Our uniqueness is that we can present those model endpoints as part of their private fabric network going to the hyperscaler or to their on-prem. So we can do things that are very security-constrained for the customers who need it. We can also cover truly air-gapped environments. If they have security oversight that's pretty stringent, we can cover those special use cases as well.

CRAIG:

Yeah, that's really interesting. You know, first everything was on-prem, then things were in local data centers, and then everyone went to the early clouds, the hyperscalers, AWS and GCP and those. That migration to the cloud was big, and it's not complete yet, but now there seems to be a realization that you need more specific services from the cloud. And so these neo clouds, as you're calling them, are appearing to provide those more specific services. First of all, is that right? And second, are the enterprises you're serving primarily in regulated industries, or is it across the board?

NICK:

It's across the board. We're trying to focus on the areas seeing the most uptake, whether it's a willingness to spend in banking, finance, and insurance, medical, especially on the billing side, or medical on the research side. If we look at retail, it could be a whole series of customer-interaction areas: computer vision models mixed in with LLMs, looking at how people are spending, how people are using self-checkout, areas like that. So it's really about arming these teams with the use cases, or modernizing from legacy models into different levels of AI to improve those workflows, improve conclusions, and improve analytics. One thing I will touch on is that we're seeing more organizations learn, from their first initial deployments, which they may have done on a hyperscaler serverless solution with one or two models, that they need to do what's called a proof of value before they do a POC or a pilot. Look at all of the use cases. If an organization says, hey, here are a hundred different uses within our organization that could benefit from AI, do a proof of value to understand which are the most beneficial. As part of that proof of value, do you have estimates of where you might see savings? Do you have the justification, as part of those savings or as part of your readiness, to actually achieve some level of applied AI? Then, do you have enough data to start a POC? And you do that POC to understand, okay, can I take this from a proof of concept to a lighter-weight pilot, maybe not across the entire organization, and prove that my assumptions from the proof of value to the POC to the pilot still hold true, so I can take it to production and see the benefits? I think organizations really have to get into that mantra of doing the proof of value. The ones who've shortcut it are generally the ones who've had challenges taking the model to production and have had doubts after they've deployed it.
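
The proof-of-value step Nick describes is essentially a scoring exercise over candidate use cases. A toy sketch of that prioritization, with entirely made-up names and figures, might look like this:

# Hypothetical proof-of-value scoring pass: rank candidate use cases by
# estimated annual savings, data readiness, and implementation risk before
# committing any of them to a POC. All names and numbers are invented.
use_cases = [
    {"name": "invoice triage",      "savings": 400_000, "readiness": 0.8, "risk": 0.3},
    {"name": "HR benefits chatbot", "savings": 150_000, "readiness": 0.9, "risk": 0.2},
    {"name": "claims automation",   "savings": 900_000, "readiness": 0.4, "risk": 0.7},
]

def score(uc):
    # Favor high savings and data readiness; penalize implementation risk.
    return uc["savings"] * uc["readiness"] * (1 - uc["risk"])

for uc in sorted(use_cases, key=score, reverse=True):
    print(f"{uc['name']}: score {score(uc):,.0f}")

However the weights are chosen, the point is to rank the hundred candidates before piloting any of them.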

CRAIG:

Yeah. And do you do any... well, I'm just curious: the government can't really afford to lean too heavily on the big public clouds. Do you get much government work?

NICK:

So what's interesting is that the hyperscalers have all done significant work to go after government, from the federal level down to the state and local level. And they're going to hit the same dilemmas anybody else does in enterprise: there are solutions that are optimal on top of their existing hyperscaler environments, solutions that might be optimal to run on-prem in their own private environments, and solutions that are more optimal to run with a neo cloud. We're always growing into new markets, and you can expect that government is an area we will grow into. In a lot of cases we're going to lean on partners. One of those areas, since we just spoke about Qualcomm, is that we work closely with Qualcomm to understand their downstream partners, how we can work with the partners who are utilizing their technologies, and then bring in partners of our own so there's a holistic solution. If a customer resonates with certain technologies, from a security or other point of view, we want to be able to provide a solution that covers the bases they need. We might also start to see a new form of data center deployed for inference as part of this: more inference-focused data centers, utilized by cloud providers like ourselves, so our inference workloads can run in environments that are deeply connected, where you can easily stand up connections to your hyperscaler and to your on-prem. In the government space, it's going to be very important to have that class of data centers tuned for inference, so they can connect to private connectivity that might be more secure or have certain security requirements built around it.

CRAIG:

Yeah. One of the attractions of the big hyperscalers, at least when they first came on the scene, was that they were self-serve. How do you guys work? Do you work directly with your customers on specific solutions, or are they logging onto a web app and configuring virtual machines and things on their own?

NICK:

Yeah, we take a very different, white-glove approach. It's more about understanding what the customer is looking for. If the customer has the technical acumen and is looking for specific capabilities, we can stand those up. If a customer says, hey, I just want something that looks like OpenAI's enterprise offering, but with my model or my specific models, we can provide a serverless solution that gives them an endpoint and API keys bound to their accounts, and they can have at it. But if an organization comes to us and says, hey, we have a use case, we have a model, how do we take it forward? We're happy to advise, and we do this a lot today. Often we'll steer them to a partner; that partner may help them build out a RAG pipeline around the model, help them fine-tune a specific model, or just help them go through a POC process. We're really all about leaning on partners the enterprise orgs already have. For the large organizations, the global system integrators are trying to get into AI services and capabilities, and we're totally supportive of that, because we want to know there's a services partner within that org. But a lot of organizations don't have that level of capability to lean on; there are larger enterprise orgs that are less tech-savvy. So we want to provide serverless, ready-to-go capabilities for certain types of use cases. If they come to us and say, okay, I need a model for this specific use case, so my employees can ask questions about HR benefits, and here's all my HR data, how do I throw that together into a RAG setup? We want to provide solutions that make it easier to build that out. But at the same time, when they go one level beyond, we want to be able to say, okay, here's a partner we've vetted who can help you take this further. It's something we absolutely can do ourselves, but we want to make sure we're scalable and that we're working with partners who bring us opportunities. At the same time, we're bringing them opportunities to grow the customers they're covering and to grow the unique capabilities they may be building as part of agentic workflows as well.
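
For the HR-benefits example, a minimal RAG sketch, assuming an OpenAI-compatible endpoint that also serves an embedding model, could look like the following; the endpoint, model names, and policy snippets are placeholders.

# Hypothetical minimal RAG sketch: retrieve the most relevant HR policy
# snippet, then ground the model's answer in it. All names are placeholders.
import os
import math
from openai import OpenAI

client = OpenAI(base_url="https://inference.example.com/v1",
                api_key=os.environ["INFERENCE_API_KEY"])

HR_DOCS = [
    "Dental coverage begins after 90 days of employment.",
    "Employees accrue 1.5 vacation days per month.",
]

def embed(text):
    return client.embeddings.create(model="example-embedding-model", input=text).data[0].embedding

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(question):
    q_vec = embed(question)
    best = max(HR_DOCS, key=lambda d: cosine(q_vec, embed(d)))  # top-1 retrieval
    response = client.chat.completions.create(
        model="example-model",
        messages=[{"role": "user",
                   "content": f"Answer using only this policy text:\n{best}\n\nQuestion: {question}"}],
    )
    return response.choices[0].message.content

print(answer("When does dental coverage start?"))

A production version would chunk and pre-index the documents, but the shape of the workflow is the same.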

CRAIG:

Can you talk about the Qualcomm partnership and why that's important for inference, and for agentic workflows in particular?

NICK:

Yeah, absolutely. There are existing solutions out there that are wonderful in this market, but the key thing for any technology is having diversity in the available solutions. That always moves the needle and moves the market forward. As Cirrascale, we always want to be the trusted partner that can deliver a wide variety of solutions that are right for the right use cases. Our relationship with Qualcomm starts with the fact that we're both San Diego-based companies, so there's a close tie, and we're rooting for our fellow company from San Diego. But even though we show lots of love to the other solutions out there, we also want to build out different technologies. Qualcomm has a unique position, because a significant chunk of the end-user devices in people's pockets are tied to some level of Qualcomm technology. If you look at a good majority of the Android phones out there, there's a Qualcomm device running in there, and AI services are starting to show up on those devices, device-level AI services. So there's a train of thought here: how do you take those technologies that are in a wide variety of devices at the user's edge, leverage those same frameworks and software stacks, and allow models to run not only on device, but also lean into the cloud a little more? It makes a lot of sense for them to grow into the data center from a very different approach, going from being the all-encompassing edge-computing player into the data center. It's interesting to us because it brings a few different shifts. Number one, it brings a different class of accelerator into the market, an accelerator built on a completely different power curve. So power efficiency is going to be part of it. You need that raw, burstable power for training; you don't necessarily need it for inference workflows. So it's having that balance of what's truly needed to run those inference workloads, and then it's all the software that's delivered to make it easier for a cloud provider like us to give that to a customer. Just as you said earlier, there's an incumbent out there who's been doing very well in this space and has built certain software technologies that everyone in the industry has gotten used to. In inference, it's very different. In inference, there's no tie-in to incumbent languages as part of the workflow. It's serving the AI model, providing an endpoint, and utilizing common technologies to serve the model and deliver it to the user. So there's a chance here to deliver something a little more turnkey for certain types of customers. As we look at the enterprise organizations today that are starting to do the proof of value and looking at use cases, they may very quickly decide, oh, I can utilize this specific model as part of my workflow as long as I can get it from somewhere. What we really wanted to see was more solutions offering these turnkey, serverless options.
So for the less technically advanced organizations that still want to deliver great AI services, we're giving them a simpler lift: you can take a solution like the Qualcomm inference offering we have jointly with them, make an account, look at the models that are there, test them, look at the endpoint for a specific model, integrate it into your application with an API key, and have at it. We're talking minutes from setting up an account to finding the endpoint, getting your API key, and starting to integrate that into your specific applications and your workflow. So it's giving accessibility to ready-to-go foundational models, but we're also starting to see new technologies appear in that Qualcomm stack. Number one is dealing with fine-tuned models: deploying a fine-tuned model, and providing fine-tuning capabilities in the platform, so you can take a model, do some level of fine-tuning, and make its responses more specific to your organization and your use cases. Making that a lot simpler is a key positive we're seeing from Qualcomm, and building that turnkey, ready-to-go stack, with us providing it as a service offering, is a big leap forward in giving the market diversity of acceleration devices.

CRAIG:

Yeah. And I can see that, because everybody is going to want inference, and we're just at the beginning. They don't necessarily want a team of developers going in and configuring GPUs with CUDA and things like that. They just want the inference. And if all of that work is done on your end or on the chip provider's end, then they're just paying for inference, and at that point it's speed and cost that become most important. Why is Qualcomm's architecture particularly good for agentic workflows?

NICK:

So as part of agentic workflows, you want a solution that's optimized for inference, but that can deal with correctly sized models and provide a cost-effective way to run them. As part of any agentic workflow, you're going to have some level of a foundational, fine-tuned, or RAG-integrated model. You want to know, number one, that you're getting the right performance out of that model, that the time to first token and token throughput are good enough to service that workflow, but also that you have an environment where you're not having to dig into the specifics of tuning for that specific workflow. It should be: you deploy the model, you deploy the endpoint with your API key, you integrate it into your agentic workflow, and it should just work. That's the key thing; you shouldn't have to deal, just as you said, with the idiosyncrasies of the underlying accelerator. It should just be a ready-to-go, turnkey solution, so you're focusing on the right parts: building out the right models for your workflow. You want to spend more time on that, because certain reasoning and instruct models are going to perform better for certain use cases. You want to focus your time on understanding your workflow, not configuring the environment and having your team spend time keeping the environment up. You want your developers to spend their time understanding which models they benefit from, and you want to give them an easy way to experiment. That experimentation is very important. Let a diverse group of developers in an organization get API keys to an endpoint and just play around, because in some cases that's part of the initial conversion from proof of value to POC: you want to play around to understand whether some models just perform better for some of these use cases, and whether some of the assumptions you made going from the proof of value into the POC were just wrong, or underestimated. So having something your team can get onto quickly and just have at it, that's very important. Especially for these enterprise organizations that might have developers who've been working on workflows in their SaaS applications and are being told to build a chatbot that somebody can prompt to take specific actions in those SaaS applications for a user, as an example. You want to focus on getting the workflow right, getting the right model, and building the agentic parts of the workflow. You don't want to have to wonder what's behind the scenes and whether it's doing it right. You just want to know it runs, it's cost-effective, and it's giving you the best performance.
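
The deploy-the-endpoint-and-it-just-works flow Nick describes usually means the application only sees a model endpoint plus an API key, with any tool use handled by ordinary application code. A minimal hedged sketch of one agentic step, with a placeholder endpoint, model, and tool, might look like this:

# Hypothetical sketch of a single agentic step: the model decides whether to
# call a tool exposed by the application, and the application executes it.
import os
import json
from openai import OpenAI

client = OpenAI(base_url="https://inference.example.com/v1",
                api_key=os.environ["INFERENCE_API_KEY"])

def lookup_order(order_id: str) -> dict:
    # Stand-in for a call into an existing SaaS application.
    return {"order_id": order_id, "status": "shipped"}

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up an order's status by ID.",
        "parameters": {"type": "object",
                       "properties": {"order_id": {"type": "string"}},
                       "required": ["order_id"]},
    },
}]

response = client.chat.completions.create(
    model="example-model",
    messages=[{"role": "user", "content": "Where is order 1234?"}],
    tools=tools,
)

for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "lookup_order":
        print(lookup_order(**json.loads(call.function.arguments)))

Nothing in this loop depends on the accelerator behind the endpoint, which is the point being made here.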

CRAIG:

Yeah. And working with Qualcomm, I imagine... I mean, they do everything from the edge to mobile to the cloud. So is that one of the advantages of the Qualcomm partnership, that you're working with the same hardware provider all the way to the edge?

NICK:

Yeah, I think there's a good, unique differentiation here for end customers. If we look at certain customers who have workflows that touch the edge, they're able to leverage a lot of the same understanding of how models are going to perform in a higher-performance data center environment, and of what can potentially be supported at the edge as well, by understanding the underlying capabilities of the software stacks Qualcomm provides. Again, a lot of that is abstracted from the user in these serverless use cases. But one unique thing Cirrascale can do is, if a customer comes in and says, I'm going to deploy the Qualcomm stack end to end, edge to data center, we can provide them the Qualcomm environments as bare metal, with the Qualcomm stacks exactly as they'd be needed. Then they can take the Qualcomm development kits for the edge and leverage those same libraries and frameworks for AI, so there's some commonality there too. For those customers who go beyond the serverless environments and want to do something more end to end for their unique workflows, these are areas where we can help as well.

CRAIG:

Yeah. Where are we on the adoption curve, both for general inference and for agentic in particular? It seems to me, from people I talk to, that we're still very early, that people are still just moving out of the pilot phase. How do you see that? And how is your growth going? I would imagine it's kind of a slow burn in the beginning, but once people get through their POCs and move into real production, the volume will explode. Where on that trajectory do you see Cirrascale and the enterprise in general?

NICK:

Yeah, so if we look at what we're doing today for AI overall, our biggest areas historically came out of the transition from HPC to AI; we had an extremely deep legacy in the HPC space before AI came in. We saw a lot of people in the startup space building models for specific consumer use cases, starting from zero scale, and as those startups grow with Cirrascale, we keep a good chunk of them as customers. As they've built their models, they're now switching to inferencing. So we have a significant customer base utilizing our AI acceleration, across a variety of different vendors, for inference today. For enterprise, though, it is a little different, and I think there are two factors that need to be taken into account. Number one, there are enterprise organizations that have really dug deep into putting budgets aside for spending on AI, looking at those automation use cases, empowering employees, potentially taking services that are outsourced today and bringing them back in, initially with a model, but then leaning on their own employees, just like the mortgage example I gave. Some organizations absolutely see where they need to be. The sad fact is, if you look at some of those organizations and the industries they're in, they may have only a handful of peers thinking the same way, and the other 80% are saying, we need to go do AI, but we don't know what that means. So there's that dilemma: every CEO is going to say we need to do AI, but what does that really translate into? There's going to be a shift. The second point is that we're going to see more middleware, more tools that can sit on top of agentic workflows and provide chatbot or automation services, but that touch existing applications. We'll see that in certain vertical customer segments. We'll see it in federal, state, and local government, where workflows are going to be accelerated using AI because somebody has built a great piece of glue or a widget that talks to legacy applications but puts a customized chatbot in front of them, where they've done fine-tuning, RAG, or other customization of the model so it's more unique to that organization. So I think there are external services partners who really understand what they're going to be doing. And in some cases, the people who really get the gist of building companies around this have come out of the large teams that have been building models. They understood a lot of these cases, and instead of working for a large organization building a huge model, they say, I want to take my skills and either go into an existing organization and help them build out an AI services team, or start a new team that can provide these services to customers.
So I think we're going to see a lift of value-add providers, integrators, even the GSIs, who will provide stacks of ready-to-go solutions that can be integrated into customer workflows more easily, where an enterprise organization can say, I have this problem, I want to do this, and here's the result I want, and there will be organizations that can deliver on those capabilities. Now, the totally technically savvy ones are going to say, no, we want to do that on our own. But for the other 80%, just as I said, whose CEOs are saying we want to do something in AI, those providers are going to be really attractive, because they'll be able to take their legacy SaaS applications, pair agentic AI capabilities with them, and deliver really quickly. So I think enterprise is absolutely going to go from the chatbot-type stuff people are doing today, hey, can I get this marketing copy, can you SEO-optimize this before we put it on a web page, to more agentic workflows, where you're actually taking prompts and having them drive services in legacy SaaS applications. That's going to be the future of it. The players who can deliver those capabilities more easily for enterprise orgs that may not be as savvy are going to do well, and they're going to pick up the savvy organizations too, because the savvy orgs are going to go, wait a minute, these guys have built out these workflows with some of the SaaS applications I already have. I don't need my team focused on the more complicated ones. If, out of my 150 proof-of-value use cases, these guys can crack 10, that's huge. I'll take advantage of them doing those 10, and then I'll go look at the rest and pick the two or three I'm going to do next with my own team.

CRAIG:

Yeah, yeah. And so there's going to be this swell, presumably. The reason I ask is I get reports a couple of times a week from some research company saying how agentic AI is failing, how 60%, or whatever the number is, some high percentage of pilots never go into production. And you begin to wonder: is it just early days, or is the promise outstripping the reality? Do you have any thoughts on that?

NICK:

Yeah, I'll come back to proof of value. You really need to build that matrix of all your use cases and then start to score out which ones are the higher-scoring candidates to take to a POC and into a pilot. I think some organizations look at their most difficult problems and say, we want to take this one through AI automation, but when you look at it realistically and objectively, a data scientist may look at it and go, okay, this is actually a hard one to solve, even if solving it would pay off. So I think sometimes organizations try to pick their hardest problems, when sometimes they need to take baby steps: pick the first one or two, deliver some success, and start to build up the complexity of what you want to do. And I think this is where those outside services partners can help, because the inside teams are having these same problems as well. Some of these large organizations that built up significant teams do pivot and learn from their mistakes, and there's nothing wrong with making a mistake, because you learn from it. Nothing's going to be perfect. But can we, as a cloud services provider, work with the partners who are helping people deliver those turnkey services, helping somebody fine-tune a model or build out their workflows, or just work as a partner with a SaaS provider or a middleware provider that works with SaaS providers, and provide the underlying hardware and cloud services so those organizations don't have to be heavily invested in hardware or the cost of operating compute, and can focus on helping customers build out new capabilities? So can we, as a neo cloud, help bring on those SaaS providers, those middleware or services partners, and say, hey, we give you an easy way to consume the compute so you can focus on building the services that get more enterprise customers onto AI more easily?

CRAIG:

How does someone get started with Cirrascale and Qualcomm in delivering this real-world impact that you're talking about?

NICK:

Absolutely. For the more technically advanced organizations that have what they need and are just looking for a better place to run, we can provide those compute resources on top of the stacks they've already built. But, as we've been discussing, a significant chunk of the enterprise space is still learning and wants to experiment, and there are a few different ways to start. Cirrascale has built out a group of different inference technologies as platforms: we built out our inference platform, and we built out a capability called endpoints. But the first place to start is the Qualcomm inference offering we have, because it's a great place to just get on and set up an account. You'll get some level of free usage of the platform for initial experimentation. You can pick some of the more popular, leading models and just start inferencing: integrate them into your frameworks, get an endpoint and an API key, and use those in your applications. And because this is all AI-based, for the organizations whose developers write internal applications and say, I don't know where to start, you can actually ask an AI model: how do I connect an endpoint into my application? I'm writing my application in this language; how do I connect an endpoint, send this prompt in, and get the output? There are some great models focused on writing code that can help with that, and some of those models are on the Qualcomm platform. So you can not only use the platform, you can also use the model to help you understand how to use an endpoint and integrate it into your application workflow. That's a key part of starting that initial inference journey. And if you're an organization that's already built out some initial use cases, is using a hyperscaler today, and wants the model to be a bit more tailored, Qualcomm has added capabilities in the platform to help fine-tune specific models. That's another good area for experimentation: just come onto the platform on the Cirrascale side with Qualcomm, take it, and start playing around with it.
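
That first "connect an endpoint into my application" step can be as small as one helper function. Here is a hedged sketch using plain HTTP against an OpenAI-compatible chat-completions route; the URL, model name, and key are placeholders for whatever the platform provides after sign-up.

# Hypothetical first integration: send a prompt to an inference endpoint
# and return the generated text. All identifiers are placeholders.
import os
import requests

ENDPOINT = "https://inference.example.com/v1/chat/completions"
API_KEY = os.environ["INFERENCE_API_KEY"]

def ask(prompt: str) -> str:
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "example-model",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Summarize this support ticket: ..."))

From there, swapping in a different model or endpoint is a one-line change.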

CRAIG:

Okay. So where does somebody go if they want to give this a try? Is it your website?

NICK:

Yeah, absolutely. Go to cirrascale.com, and there's a way to get to the Qualcomm offering. Basically, come to our website, go into the menu navigation at the top under Products and Services, and you'll see an inference cloud listed there, which is our offering with Qualcomm. When you click the links, it'll tell you more about the product and give you a direct link to sign up, get started, and start using it right away. All you need is to create an account and you're up and running in about a minute. The last thing is, because we're a very hands-on, white-glove type of organization, we love to understand the teams that are on their AI journey and feel there may be some level of fit between Qualcomm and us. We can absolutely do consultative discussions with a customer, a predictability briefing on where your specific workflows or needs might be going and how they can be served with the inference solutions we have jointly with Qualcomm. And if things fit outside of that Qualcomm space as well, we're very open to advising organizations that have the technical acumen but have just never been able to corner someone to ask how they should optimally be deploying these technologies, how they should be utilizing inference, and how it fits with how their IT infrastructure looks today. We're more than happy to have that discussion with those organizations, and none of it carries any obligation; it's more about understanding where the challenges lie and other areas where we can help.

CRAIG:

Okay. And how does somebody get a predictability briefing?

NICK:

Yeah, absolutely. Just go to our website, click Contact Us, put your information in, and our team will reach out to you. Again, there's no hard sell from us; it's more about understanding what your challenges are and how we can help. And if there's a different approach, we're always more than happy to bring that up with a customer and lead them down a different path if that's more applicable. We really want to be the trusted advisory partner that helps a customer begin their journey.

CRAIG:

Yeah.
