
Leon Song, VP of Research at Together AI, joins us to explore how open-source models and cutting-edge infrastructure are reshaping the AI landscape. He shares how Together AI is building one of the fastest, most cost-efficient AI clouds—leveraging techniques like speculative decoding and FlashAttention to help enterprises fine-tune, deploy, and scale open-source models at the level of GPT-4 and beyond.
LEON: Our mission of supporting open-source models has gotten so much attention, and today that mission is no longer something people dismiss with, oh, this may not be that great compared to closed-source models. That really changed when major players started releasing open-source models. The platform is now used widely across different enterprises. Many companies have their own models, or closed-source models, for their internal business, but our customers use our platform to do other things too. The key here is the flexibility that we provide.
CRAIG: Build the future of multi-agent software with AGNTCY. That's A-G-N-T-C-Y. The AGNTCY is an open-source collective building the Internet of Agents: a collaborative layer where agents can discover, connect, and work across frameworks. For developers, this means standardized agent discovery tools, seamless protocols for inter-agent communication, and modular components to compose and scale multi-agent workflows. Join CrewAI, LangChain, LlamaIndex, Browserbase, Cisco, and dozens more. The AGNTCY is dropping code, specs, and services. No strings attached. Build with other engineers who care about high-quality multi-agent software. Visit AGNTCY.org and add your support. That's A-G-N-T-C-Y.
LEON: My name is Leon Song. I'm the VP of Research at Together AI. I lead the R&D arm of the company, which pushes out a lot of the wonderful, innovative research you see powering our platform today: kernel work like FlashAttention, data work like RedPajama version one and version two, and a lot of open-source research on speculative decoding, work that has been widely used by industry, like Medusa and others. We've also done a lot of work providing modeling and data optimizations to the open-source community. We recently released a small 14B coder model that performs at the o3-mini level; we released it just last week with Berkeley. So there's a lot of research going on, but it's a research-driven, product-centric org that I'm leading. Prior to Together, I was at Microsoft. I was very fortunate to experience the whole ChatGPT era at Microsoft, working on training-system projects for OpenAI workloads in DeepSpeed. I then created one of the first AI for Science initiatives, centered around Microsoft and its partners, using AI to address scientific challenges at large scale. Later on at Microsoft, I worked on different machine-learning system optimization techniques for inference, so that we could serve the GPT series of models on Microsoft platforms.
CRAIG: Yeah, okay. Can we go back and talk a little, to begin, about AI for science? Because there's been a lot happening; you still follow it, I would imagine. Anthropic has done some amazing stuff. Can you talk about what you were focused on at Microsoft, where you see the field having moved since you left, and what you see in its future?
LEON: Yeah, I'll talk a little bit about that. I think this is something the field has been quite excited about for a long time, since before we got into the generative AI era in early 2023, when ChatGPT came out. A lot of these scientific models used to be supported by less general approaches: a lot of diffusion models, or attempts to put different models together to address complex biology problems, or to simulate weather and climate change, things like that. When I was there, we had a great initiative. We started with, I think, two major projects, with internal folks from MSR and also external folks. One project was to provide system support for scaling protein structure prediction training. You've probably heard of AlphaFold from Google. Back then, in 2023, there was an open-source community effort led by Columbia called OpenFold, which used PyTorch Lightning with a DeepSpeed backend to train these types of models, trying to give the community a way to study protein structure prediction with open-source models. The problem with that model was that it required very large activations during training, so training couldn't really scale: they couldn't do a really long context window, and they couldn't do a larger parameter size.
LEON: It was bounded by the system itself. By system, I mean the hardware you're running on: the GPUs, the CPUs, the cloud. So we stepped in and provided a lot of important kernels so the project could scale and train, I remember, three-times-longer sequences. We reduced memory consumption during training by something like 75 percent, a significant amount. That project took off; now you can use it as an open-source project. People are using it in academia, and I think there are startups using it as well. They've since formed, I think, a consortium focused on that effort. There were other, more internal Microsoft efforts using very creative ways to do protein structure prediction differently from AlphaFold and the OpenFold approach. In the publications you can find, they use a probability-prediction approach that looks at the equilibrium of more complex protein structures, rather than the more straightforward way the other models use. Those were very successful projects. The other area we looked at was weather and climate models, which had a very significant impact at the time, and I'm pretty sure still does, on the Bing service, because a lot of users access that through Bing.
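As an aside for readers: the setup Leon describes, PyTorch Lightning with a DeepSpeed backend, looks roughly like the sketch below. This is a minimal, hypothetical example, not OpenFold's actual training code; the toy model and data are placeholders, while the trainer flags are standard Lightning options. ZeRO sharding (plus activation checkpointing inside the model) is what attacks the activation- and state-memory limits Leon mentions.

```python
import torch
import torch.nn as nn
import lightning.pytorch as pl

class TinyRegressor(pl.LightningModule):
    """Stand-in for a large structure-prediction model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)

if __name__ == "__main__":
    ds = torch.utils.data.TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
    loader = torch.utils.data.DataLoader(ds, batch_size=64)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,
        precision="bf16-mixed",
        strategy="deepspeed_stage_2",  # ZeRO stage 2: shard optimizer state and gradients across GPUs
        max_epochs=1,
    )
    trainer.fit(TinyRegressor(), loader)
```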
CRAIG: Yeah. On OpenFold, did you follow the same architecture as AlphaFold?
LEON: Not strictly, no. I think there were a lot of changes, even in 2023 when we were working on it. I haven't followed it very closely since my time there, but my understanding is they're trying to pull the community, the scientists, into the project together and ask them to contribute different model components, so you can do a mix of models rather than one transformer architecture dominating. You can do different things in that community. I think that's what they're trying to do.
CRAIG: And one thing that... I've spoken to DeepMind about AlphaFold, but at the time I spoke to them it was still early in their research. These models predict protein structures, but the predictions aren't necessarily 100 percent accurate.
LEON: Right. There's always a verification and validation process that needs to happen after training. You can fine-tune the model to be better. These days, with DeepSeek models being very creative and using RL-type training strategies, a model learns from its mistakes and does better with reward signals. If you think about protein structure prediction, some people believe it's very restrictive and structured, very limited in scope: the amino acids can be arranged in certain ways, and other arrangements are just not viable. But if you think of it as a restrictive, scoped problem, then you can make an analogy to coding, right? Or math. And today's models do those really, really well using RL training, which can achieve a much higher rate of successful prediction. Back then it was not RL-based, not reasoning-model-based. It was quite simple, what they studied and adapted from AlphaFold. Yeah.
CRAIG: Yeah. And the remarkable thing is you can use these models to generate novel proteins, right? And when you do that, and maybe this goes beyond training a model, how do you then synthesize a novel protein? And do you test how that novel protein behaves, to validate the prediction?
LEON: Yeah, that question is a little outside my domain; I'm a systems researcher, a computational scientist. But what I can say is that people do use these predicted structures and then do post-modeling work, validation and verification. The structures have to be validated in labs, to see whether they're legitimate or just random compounds. But I do sometimes see news and read papers saying this wonderful new structure could really mean something, and it's not something we would otherwise have seen so soon. That sounds amazing.
CRAIG: Okay. And then jumping to what you're doing at Together. Together was started when?
LEON: I believe officially it was late 2022. Yeah.
CRAIG: And it was started by Percy Liang?
LEON: There are several; I mean, Chris Ré, Percy, Ce Zhang, and our CEO. I think there are four co-founders, and Percy is one of them. Yeah.
CRAIG: Okay. And this latest model that you've come out with, can you talk about that?
LEON: From us?
CRAIG: Yeah.
LEON: Yeah, yeah. So I want to say something first: we're not a modeling company.
CRAIG: Yeah, I know.
LEON: So we work with other modeling companies to come up with innovative models, sometimes with alternative architectures, not transformers, that can then be served on actual hardware with some control over efficiency and economics. The recent one is with a lab in Berkeley. We worked with them on training that 14B coder model, and the coder model performs really well. It's based on a mixture-of-agents type of strategy, and they trained the model with their own datasets and their own training strategies. I think the coder performs really well; it's at the o3-mini level.
CRAIG: And what was Together's contribution to that? And the Berkeley lab, which lab is it?
LEON: I'm not very familiar with this particular project because it's external. We have one or two people working with them, primarily on the data side.
CRAIG: I see. Okay. Well, tell me about the speculative decoding work. Is that something you're more directly involved in?
LEON: Yeah. I thought of this as more about covering Together's innovations on our platform, but I understand you'd also like to know my involvement. That's okay; we can go through those topics as well.
CRAIG: Yeah, and we'll cover Together in a minute.
LEON: Sure, sure, no problem. Speculative decoding is something Together is really good at; a lot of companies in the inference-provider realm offer speculative decoding. Speculative decoding is a lossless way of doing inference: you don't lose accuracy, but you can boost inference performance. The basic concept is that you have the bigger, original model acting as a verifier that checks a very tiny model called the speculator. Where the outputs of the verifier and the speculator agree, the inference time is dramatically reduced; otherwise the verifier's output is used. So you're trading compute between the two models for efficiency. But the difficulty of building speculators is that you want a very good acceptance rate. If your acceptance rate is very low, your overall performance will still be dominated by the larger model, and you'll never reach the peak performance you could gain. Together has been working on this since before my time; I joined in early 2023, and before that Together had already started working on speculative decoding and produced several open-source works for the community, like one called Medusa. We also have our own way of training speculators: an automatic pipeline we built internally, where customers can provide their data and we automatically train the speculator and perform different kinds of model optimizations.
CRAIG: Okay, so I'm going to slow it down a little bit. What is a speculator?
LEON: It's a smaller model.
CRAIG: A smaller model, distilled?
LEON: It can be, it can be. I'll give you a couple of examples to clarify. Let's say you have a 670B DeepSeek R1 model. It's a large MoE reasoning model, right?
CRAIG: A mixture-of-experts reasoning model.
LEON: When you serve this model, you run it on multiple nodes, and it's going to be really slow, depending on your decoding steps, how many tokens you're generating. The decoding process is going to be really slow.
CRAIG: Yeah. And just again, for listeners: the decoding process, what are you decoding?
LEON: An autoregressive transformer generates tokens one at a time. You have the prefill phase, where the model takes in the prompt, and it generates a token; you feed that token back into the decoder, and whatever logically comes next is generated and attached, and you keep generating, each new token conditioned on all the previous tokens. That's a very simple way of explaining it. What makes transformers so powerful is this autoregressive decoding process, which can generate the context by itself. That's the wonderfulness of transformers, and all the powerful models we're using today are based on that architecture. All the companies use it: OpenAI and the other modeling companies. Yeah.
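To make the prefill/decode distinction concrete, here is a bare-bones greedy decoding loop using the Hugging Face transformers library, with GPT-2 standing in for any autoregressive transformer. The first forward pass over the full prompt is the prefill; each later pass feeds only the newest token and reuses the KV cache. Production servers do the same thing with batching and far more careful cache management.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(20):  # 20 decode steps
        # Prefill on the first pass (whole prompt), then one token at a time.
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy choice
        ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```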
CRAIG: Yeah, right. And so you have a smaller model; what does the speculator do?
LEON: That's what I was getting to. Let's say you have a large model. You don't want to perform every decoding step on the 670B model, right? So I can train a 1B model, a model with 1 billion parameters, and pair it with the verifier, the larger model, to perform speculative decoding. The smaller model's inference is going to be really fast, because you've literally reduced the parameter count by orders of magnitude. Now you have a 1B model. How do you make this 1B model get very close to the larger model in prediction capability, in acceptance rate? You can distill from the larger model. Or, since model providers often release several sizes of a model, you can distill from a medium or smaller size down to the 1B, 2B, or 3B size you're looking for to support your model. That's what the speculator model means: when you run your inference, your compute is spent on the smaller model rather than the verifier.
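Here is a hedged sketch of the mechanism Leon just walked through: the small speculator drafts k tokens cheaply, the big verifier scores the whole draft in a single forward pass, and the longest agreeing prefix is kept plus one verifier-corrected token. This is the greedy variant, for readability; production systems use a rejection-sampling acceptance rule that makes the method provably lossless under sampling. `verifier` and `speculator` are assumed to be Hugging Face-style causal LMs that share a tokenizer.

```python
import torch

@torch.no_grad()
def speculative_step(verifier, speculator, ids, k=4):
    """One speculative-decoding step; returns ids extended by 1..k+1 tokens."""
    draft = ids
    for _ in range(k):  # cheap: k passes of the small model
        nxt = speculator(draft).logits[:, -1].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)

    v_logits = verifier(draft).logits             # expensive model runs ONCE
    L = ids.shape[1]
    v_pred = v_logits[:, L - 1:-1].argmax(-1)     # verifier's choice at each draft slot
    drafted = draft[:, L:]
    agree = (v_pred == drafted).long().cumprod(-1)  # 1s up to first disagreement
    n_ok = int(agree.sum())

    # Accepted prefix, plus the verifier's token at the first mismatch
    # (or a free bonus token if the whole draft was accepted).
    fixed = v_logits[:, L - 1 + n_ok].argmax(-1, keepdim=True)
    return torch.cat([ids, drafted[:, :n_ok], fixed], dim=-1)
```

With per-token acceptance rate α and draft length k, each expensive verifier pass yields roughly (1 − α^(k+1)) / (1 − α) tokens instead of one, which is why the acceptance rate Leon keeps stressing is the whole game.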
CRAIG: I see, okay. And these are products you're making available to the open-source community? I mean, how does Together AI work?
LEON: Together AI is an AI-accelerated cloud company for enterprises. That's the main thread of the business. We provide the Together platform, with different tiers of service. The tier that's open to the public, to whoever has a credit card, whether you're a startup or an individual, is our serverless endpoint. There you can use all the models we offer: open-source models, now up to something like 200 models served on the platform, from Meta, from DeepSeek, from Alibaba, from all the different open-source providers. You can use them through the UI, where you just type, or through the APIs and other tooling in your own work, and you can build applications on top of the token service. That's just one tier of service. Other tiers include dedicated instances, where, say, small and medium-sized companies want to reserve GPU resources for different things. Our goal is to provide the entire AI lifecycle, from pre-training to fine-tuning to the point where you serve inference for your applications on our platform.
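Concretely, the serverless tier Leon describes is reachable with a few lines of Python via Together's SDK. A minimal sketch, assuming `pip install together`, a TOGETHER_API_KEY in the environment, and a model name that may have rotated out of the catalog by the time you read this:

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative serverless model name
    messages=[{"role": "user",
               "content": "Explain speculative decoding in two sentences."}],
)
print(resp.choices[0].message.content)
```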
LEON: So a lot of companies use dedicated instances. They buy these on-demand dedicated instances when they need them and use our software for whatever business purpose they have. The key distinguishing feature of our business, supporting open-source models versus the closed-source companies, is that customers own their model. They own their data. Whatever fantastic technology we provide to improve your model, build your speculator, or tune your serving and training strategies, you own that piece: the model, the data sovereignty, and the security of the deployment. We've gone through a lot of compliance work so that we can serve customers in North America and some other regions.
CRAIG: And the models, the instance of the model that you own, is on your cloud? Where does it exist?
LEON: Yes, these are open-source models, on our cloud.
CRAIG: On your cloud. But...
LEON: Other people can use them too. They can directly download them from Hugging Face.
CRAIG: Right. But I mean, an enterprise customer can have a private instance of an open-source model on your cloud.
LEON: Absolutely.
CRAIG: Fine-tune it for themselves. Yeah, yeah. And what if... Build the future of multi-agent software with AGNTCY. That's A-G-N-T-C-Y. The AGNTCY is an open-source collective building the Internet of Agents: a collaborative layer where agents can discover, connect, and work across frameworks. For developers, this means standardized agent discovery tools, seamless protocols for inter-agent communication, and modular components to compose and scale multi-agent workflows. Join CrewAI, LangChain, LlamaIndex, Browserbase, Cisco, and dozens more. The AGNTCY is dropping code, specs, and services. No strings attached. Build with other engineers who care about high-quality multi-agent software. Visit AGNTCY.org and add your support. That's A-G-N-T-C-Y.
LEON: ...do different things with us. For instance, you say: okay, I want this model to be better, but I don't know how to do it, at least not in a frontier way. We'll have a team helping you get the strategy right, from the pre-training data all the way to the post-training data pipeline, to make your model perform better. And then, coming down to the inference side, we can also help you customize the inference solution so you reduce your cost and bring performance and accuracy into the range the customer demands. That's what we do.
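For flavor, a sketch of what starting a fine-tuning job on the platform can look like with the same `together` SDK. The method names and parameters here are recalled from Together's docs and should be treated as assumptions to verify; the file and base-model names are placeholders.

```python
from together import Together

client = Together()
# Upload a JSONL training file, then launch a fine-tuning job against a base model.
f = client.files.upload(file="my_chat_data.jsonl")      # assumed signature
job = client.fine_tuning.create(
    training_file=f.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",      # placeholder base model
    n_epochs=3,
)
print(job.id, job.status)
```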
CRAIG: Yeah, yeah. And open-source models have typically lagged the proprietary models. Can you bring open-source models up to compete with, I think you mentioned, o3-mini, or those?
LEON: Yeah, absolutely. Let me talk about the field and why our mission of supporting open-source models gets so much attention. Today this mission is no longer something where people feel, oh, this may not be that great compared to closed-source models. That really changed when major players started releasing open-source models, which is super aligned with our original vision in creating the company. Models today like the DeepSeek V3 model, a non-reasoning model, the DeepSeek R1 model, and the Llama 4 models just released last week: state-of-the-art multimodal models in different sizes, with super long context windows. All these models are very, very competitive. When R1 was released by DeepSeek, it was the biggest news for weeks, because the model was so good, super competitive with closed-source models like OpenAI's, and in many aspects better. So the open-source community can finally use these models to create their own enterprise and business opportunities. That's really hard for the closed-source modeling companies to counter, and I think it's making a lot of them very nervous. DeepSeek, for example, not only open-weights their models, they also tell you exactly how they trained them in creative ways through reinforcement learning, and how they created a new attention mechanism different from what everyone else was using: when everyone else was using multi-head attention, they created something called multi-head latent attention, which compresses the KV cache really, really well, down to a very small size. All of these creations, the open-sourced communication libraries and all of that, really strengthened the community's conviction that one day the best models will come from the open-source community. And these days customers and users really enjoy using these open-source models, because they're either better than or at the same level as the closed-source models.
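A toy illustration of the KV-cache compression idea behind the multi-head latent attention Leon mentions (a simplification, not DeepSeek's actual MLA, which among other things handles rotary position embeddings separately): instead of caching full per-head keys and values, you cache one small latent vector per token and expand it per head only when attention is computed.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128

down = nn.Linear(d_model, d_latent, bias=False)           # output of this is what gets cached
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head keys at attention time
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head values

x = torch.randn(1, 512, d_model)                # 512 cached tokens
latent = down(x)                                # [1, 512, 128] <- all the cache stores per token
k = up_k(latent).view(1, 512, n_heads, d_head)  # rebuilt on the fly
v = up_v(latent).view(1, 512, n_heads, d_head)

# Standard KV cache: 2 * n_heads * d_head = 2048 values per token.
# Latent cache: d_latent = 128 values per token, a 16x reduction here.
```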
CRAIG: And what do you think is going to happen to the closed-source models? These companies spend an enormous amount of money on research, and their models have to be used through an API, which loses them a huge chunk of the enterprise market: anybody who needs to keep their data on premises.
LEON: Yeah.
CRAIG: And now, if these cheaper open-source models can do nearly the same thing, it just intuitively seems that the proprietary system will eventually disappear, because you can't keep investing tremendous amounts of money, particularly if it goes primarily into scaling compute.
LEON: Well, first, let's not deny the amazing work of the transformer era that OpenAI and all the other pioneering companies have delivered since 2023. Those were really the biggest innovations of this decade or more. Then in January, when the DeepSeek models were released, it was really the moment the entire community realized: okay, I can do this at the same level. It's a good question. My prediction is that these companies will have to figure out their model-release strategies. I think more and more of them, including OpenAI, are likely to release open-source models. What will likely happen is they'll release an open-source model with some gap to their closed-source model. The open-source model they release will be more for smaller devices, something more commodity, while they maintain the cloud-level, large-system-level reasoning models and push those as far as they can, which is very competitive and very hard these days. But I think their business model has to transform. My feeling is they'll have to compete in the open-source market too. To maintain that top position, they'll have to open-source models and beat everyone else, so the community understands these companies can still build the best models. But that game just gets harder and harder, and differentiating products becomes very difficult. Yeah.
CRAIG: Yeah. Together, you're, I guess, an MLOps or a gen-AI-ops...
LEON: An ML systems company.
CRAIG: Yeah, an ML systems company. Do any of the proprietary model makers use Together AI solutions, or are you really focused on the open-source community?
LEON: I can't go too deep into that; it's related to the business side. But my understanding is the platform is currently used widely across different enterprises. Many companies may have their own models, closed-source models or whatever, for their internal business, but we have customers who use our platform to do other things. The key here is the flexibility we provide, the innovations we provide, as well as the data sovereignty, security, and model ownership we provide on our platform. It's just a very different business model. Excluding one or two of the top model companies, everyone else, to my understanding, can use our platform and do their work on it.
CRAIG: Yeah. I mean, it seems to me the space you're in is increasingly crowded. I talk to a lot of companies that accelerate training, that make training cheaper, that can elevate open source to compete with the best reasoning models. How do you operate in that market?
LEON: It is competitive. But we have one of the best reputations in this realm for providing the open-source community with the key technologies. They couldn't operate these models today without FlashAttention and the other works we contributed. We could have kept them closed source, right? We didn't have to release them. But because of the way we did it, we have the community's trust that we always produce the best solutions, and the numbers show it. You can go to different analytics sites, like Artificial Analysis and others, and look at our numbers: superior performance, superior cost, economic benefits to enterprises. So I see this game as: you really have to have a talented research team behind the operation. You can sell GPU hardware, right? I can just acquire GPU hardware and resell it, undercutting everyone else on price; some companies do that. But look at the numbers supporting these models.
LEON: And the flexibility: we can customize unique enterprise solutions for different companies, because different companies have their own serving conditions. Some companies want very extreme serving conditions: they want large batch sizes, they want a large-throughput regime; some care about latency but don't really care about tokens per second. For all these different models, fine-tuned or modified, how do they really perform on today's hardware? That's the question we answer, and we connect the pieces. So it's a research-leading team behind this whole operation of an accelerated AI cloud. It's not about just taking an open-source framework and getting it to run on your platform; that's not a sufficient solution for our customers, at least the ones we have. We've seen the market. Yes, there are more players, but we're confident; we've been a leader in this community for a while.
CRAIG: Yeah, yeah. And if somebody's an enterprise working with an open-source model, how do they decide whom to use, other than your prominence in the market? And do you compete with people like CRM AI, or Aleph Alpha? Do you know these guys?
LEON: Hmm. I haven't heard of them.
CRAIG: Or RunPod, or...
LEON: RunPod I know. Yeah.
CRAIG: Yeah. Are they a competitor?
LEON: I think there is some overlapping business there, some portion of it, but we're running very different business models in the market. Now, I can answer your previous question about how they choose. My experience talking to customers, especially enterprise customers, is that their models are quite unique, even when they're based on open-source models. When they bring us their models, the process usually plays out where we help them further enhance the model anyway. Most of the time they want these very extreme serving regimes, as I talked about. Without a really sophisticated serving engine, and without a team behind it that can connect modeling, inference, training, and fine-tuning together across the entire AI lifecycle, you'll have a very hard time producing those solutions for these enterprise customers. That's what we've been really proud of: our team can do that. So I think that's what enterprise customers want, not just taking an open-weights model and finding somewhere to serve it. And of course, our pricing is competitive, very, very competitive. We're trying to give our customers not only ownership but also the best economics. Our team works very hard every day to make that happen.
CRAIG: Yeah, yeah. And then, on the cloud: you operate a cloud, is that right? And is it primarily a GPU cloud? I'm very curious about these new accelerators built for inference.
LEON: Sure. I have more knowledge in that realm...
CRAIG: Yeah, sure.
LEON: ...than in modeling questions. Yes, today Together AI primarily operates a GPU cloud. We recently formed a big partnership with NVIDIA, an important partnership for cloud serving. But I want to talk about the technology, because we create it with the future in mind. Of course, NVIDIA GPUs and other GPUs are fantastic, but as a technology company we're also looking at other possibilities to mix into serving solutions for our customers in the near future. What I mean is, in a serving setup you won't only have GPUs; you'll have other kinds of accelerators, and they perform different roles, depending on the economics and on the customer's demands for latency and throughput. I think the complete solution will be a cloud that utilizes different groups of hardware, like what's being offered on the market today. As a company we're evaluating this very carefully, because we're very experienced with user workloads and with serverless endpoint use cases. So we've carefully evaluated all this hardware, and eventually our hope is to integrate it into the Together universe, the Together serving platforms. We have a couple of ideas for really reducing cost and significantly improving throughput with this hardware as part of the solution.
CRAIG: Yeah. Because, you know, I've had Andrew Feldman from Cerebras on a couple of times, and I've had Rodrigo Liang from SambaNova. I haven't talked to Groq yet, Groq with a q. But their argument is that their inference speed is light-years ahead of NVIDIA's, and consequently cheaper. But it doesn't seem like there's a lot of uptake; everyone's still depending on NVIDIA, and I'm kind of waiting for that...
LEON: Yeah.
CRAIG: ...point at which people shift inference to other hardware.
LEON: My personal opinion this doesn't represent together opinions, purely my personal opinion on this is that A6 are great, but they have to overcome a lot of hurdles of being the primary and only hardware serving in a cloud setup for large enterprise. So you can just look at the decoding number. So A6 when they make A6 the the pure.
LEON: The.
CRAIG: Explain what ASICs are, for the listeners.
LEON: It's customized hardware. You take a generic version of a model, let's say a transformer model, and you understand its computational pattern; you map that pattern into very fine-grained circuits, the gating, the data flow, so the hardware is customized for that generic version of the model. That lets you enhance things: I can give it more shared memory or on-chip memory if I know I need much more, and if I know what the currently most popular models look like, I can customize the hardware pattern for them. Now, an NVIDIA GPU is an AI supercomputer, but it's not just an AI computer; it's general-purpose hardware with something called a tensor core, and the tensor core does the AI matrix multiplications in different precisions. So the ASIC's advantage is that, because you customize purely for a pattern, the decoding we talked about, the autoregressive generation of tokens...
LEON: ...that process can be really fast, because you've customized the hardware and you control the latency, the throughput, and the size of the memory for it. But there are other factors for a service. You have to consider the flexibility of these ASICs. If today we're serving a range of dense models, say a 400-billion-parameter dense model, or mixture-of-experts models in the one-to-two-trillion-parameter range, you can spend a lot of time customizing the hardware for that. But in a month or two, our field moves on to a different realm: say I go beyond two trillion parameters, or I need an entirely new transformer engine or transformer architecture. Then the ASIC vendors have to adjust really quickly, which, you can say, is very possible, that's their expertise, but sometimes it's also less flexible. The GPU's benefit is that it's very flexible, because it's general-purpose computing plus an AI core. Does that make sense?
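A back-of-the-envelope calculation shows why decode speed is where ASICs attack: single-stream autoregressive decoding is memory-bandwidth bound, because every generated token has to stream the model's weights through the chip. The numbers below are illustrative, not vendor specifications.

```python
# Rough single-stream decode ceiling: tokens/s ≈ bandwidth / bytes moved per token.
params = 70e9              # 70B-parameter dense model (illustrative)
bytes_per_param = 2        # fp16/bf16 weights
hbm_bw = 3.35e12           # ~3.35 TB/s, roughly high-end GPU HBM (illustrative)

bytes_per_token = params * bytes_per_param     # weights read once per generated token
print(hbm_bw / bytes_per_token, "tokens/s upper bound")   # ≈ 24 tokens/s
```

Hardware that keeps weights in faster on-chip memory raises that ceiling, which is the pitch of the SRAM- and dataflow-based ASICs; the GPU counterargument is the flexibility Leon just described.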
CRAIG: Yeah, absolutely. And ASIC stands for what?
LEON: That's a good question. I've been using the term forever. I don't want to say it wrong... sorry, I don't want to get this wrong.
CRAIG: Is it A-S, uh...
LEON: ASIC chips... oh, it's application-specific integrated circuit.
CRAIG: Oh, right. Okay.
LEON: I remembered the first part; I didn't remember the rest.
CRAIG: Yeah. And is SambaNova an ASIC?
LEON: Yes.
CRAIG: Yeah.
LEON: All of these companies, Groq, SambaNova, Cerebras, these are ASIC vendors. Groq, yeah, as you mentioned.
CRAIG: Okay. And then, what's next for Together AI? Where are you going with this?
LEON: Yeah. So our goal is to provide this AI-accelerated cloud. We want to expand the hardware capacity and diversity with which we can serve our customers. We want to provide better software solutions, especially focused on inference. If you visit our website now, you can see tons of applications built with open-source models through the Together platform and Together solutions. So we want to encompass all these capabilities of our research and translate that research into our platform, not only for enterprise users, where we put in a lot of effort, but also for the broader community, so they can enjoy and use our applications and serverless endpoints. For instance, we recently released something called Together Chat. I was playing around with it this morning. If you Google Together Chat, you can register an account, and you have a bunch of models you can select, from reasoning models all the way to multimodal models.
LEON: Multilingual models, models for coding tasks, for image generation. We have all these models served within this one chat format, powered by open-source models. And we're also going to inject a lot of things like what we've created called deep research, combining search functions with a mixture of agents to enhance the power of Together Chat. Users will have all of this packaged in one application they can use. Of course, these technologies can also be taken apart and put into our software stack for enterprise customers. And that's just one example of what we're trying to do. We're trying to do the cloud really well, through AI: super fast, highly efficient in both performance and economics, putting a lot of this AI research in modeling and data into the platform, transforming it into products for our customers, and being competitive against the closed-source models through open source. Yeah.
CRAIG: Okay, well, that's fascinating, and I think I'm going to end it there. But this Together Chat: is there a free tier?
LEON: I think you can use it for free. I'm not really sure up to what point, but right now I think it's freely offered, so you can use it and test it. Yes.
CRAIG: Okay. Well, I'm going to go play with it.
LEON: Yeah, go play with it. It's quite good. Yeah.