Robbie Goldfarb, co-founder of Forum AI and former Meta leader, explains why treating AI as a truth engine is one of the most dangerous assumptions in the industry today.
Robbie:
But our expectation is that I can ask it anything and it's going to give me an answer, an authoritative answer that I feel like I can trust. And I think that's a really scary expectation to have set. OpenAI had an announcement a week or two ago where they said that every week they're seeing something like over a million conversations where the user demonstrated suicidal intent. That just gives you a sense of where we're headed. And the health of the information people have, avoiding echo chambers, avoiding misinformation, that's so important to building a healthy society. And AI can go either way on this one.
Craig:
Okay, Robbie, can you start by introducing yourself to listeners and giving your background? You have an interesting background. And then tell us what Forum AI is and how you came to found it.
Robbie:
Yeah, for sure. So my background is in software engineering and product management. I've spent most of my career at the intersection of AI and trust and safety. I've done quite a bit of work in education technology. I eventually went to Facebook, where I worked on news and misinformation. I was actually working on misinformation through the COVID pandemic and the US 2020 elections, a really interesting period of time that I think opened my eyes to the potential impact the technology can have on people, both in a good and a bad way, and maybe more importantly, our ability to shape what those outcomes are. And so that really set me down the path that ultimately got me working on Forum. I went from Facebook to Instagram, where I worked on similar issues, and I also did quite a bit of work on youth safety there. Most recently I was in Meta's AI lab, focused on trust and safety more broadly. Probably about a year ago, I was thinking deeply about these issues and the concept of trust and AI. A good friend, Chris Struhart, who's actually the head of product for Gemini at Google now, I was talking to him about it, and he said, you know, you've got to talk to Campbell. So we connected and co-founded Forum AI together.
Craig:
Yeah, and describe what Forum AI is and how it works.
Robbie:
Yeah. To put it simply, we scale the world's smartest minds to help evaluate and improve AI systems, particularly in tricky or contentious spaces like healthcare or politics and news. To understand what we're doing: our philosophy and our belief is that there is a shift that needs to happen in how AI is being developed, and that credible experts need to play a bigger role in that process. It's not to say they aren't playing any role in it today, but we believe that role needs to be bigger. Because if you think about it, as AI gets more and more sophisticated, the caliber of expertise needed to further improve it also gets more and more sophisticated. So that's the future we're working towards: scaling the highest caliber of expertise. In practice, there are two things we've done. One is we've built the networks, these incredible networks of subject matter experts. We work with people like Fareed Zakaria and Niall Ferguson. We work with academics from Stanford and MIT. We have partnerships with the Cleveland Clinic and Mount Sinai Health System on healthcare. The people that you want in the room helping to shape these decisions. The second part is building the technology to effectively scale those experts for evaluating AI systems. A lot of the work here is about how you extract their knowledge and expertise efficiently and effectively. At the end of the day, what we're building are judges. These are AI systems we've built which capture the judgment of these experts for factors like identifying political bias or measuring clinical safety, which we can then use to help evaluate and shape the broader AI systems. One thing I'd add, and it's still early days, is that what we've started to see is that when you involve real experts in the process of building AI, the results can be profound. Craig, have you seen Dario Amodei's essay, Machines of Loving Grace?
Craig:
No, but I'm aware of it.
Robbie:
Yeah. It's this great essay where he paints this almost surreal picture of what the future could look like if we realize AI's potential: economic equality, healthcare, governance, and so on. We can give examples, but as we work with experts, evaluate these models, identify the gaps, and then improve them, you're starting to see glimpses of that, which I think is really exciting.
Craig:
Can you talk about the process of extracting the thought process of an expert, generalizing it, and then having the AI judge absorb it? What's the mechanism behind that process?
Robbie:
Yeah, you hit the nail on the head. That's the trickiest part of what we're doing, and arguably the most important part: how do you effectively and accurately capture their expertise? There are several tactics that we use, and I can talk through a couple of them. One, and this is usually where we start, is what we call thought process. Rather than just hoping that experts are going to spend many tens of hours a week evaluating and labeling tons of data, we take a bit of a step back. Let's take a question like, was it ultimately the right decision for the Venezuelan people for the US to intervene? Rather than asking an expert, okay, what do you say, we'll ask, how would you go about answering this question? We'll try to understand what their thought process is to get there. When you start to do this with several different experts and across several different topics, you can start to identify patterns and build a graph. What's interesting is you start to see chaining and patterns. One example is if you're dealing with factuality. The question to the expert would be, how would you assess whether this is accurate? One of the things they might often say is, well, I would want to cross-reference it with some reliable sources.
Robbie:
Right. So then you have a chain there. Okay, how do you think about defining or evaluating what a reliable source is? And you define that. The result is you get this graph. And what we can do is build agentic systems that mirror that graph. So if the first step is, I would try to find reliable sources, okay, then we have a system that does that. Of course, in practice it's a little more complicated than I'm making it seem, but to put it simply, we try to build these systems to mirror the way that experts think.
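To make that concrete, here is a minimal, hypothetical sketch in Python of how an expert's thought-process chain might be encoded as a sequence of steps that an agentic judge walks through. The step names, helper functions, and the FACTUALITY_JUDGE chain are illustrative assumptions, not Forum AI's actual system.

```python
# Illustrative sketch only: encode an expert's "thought process" as a chain of
# steps that an agentic judge walks through. All names here are hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]  # takes shared context, returns updates

def find_sources(ctx: dict) -> dict:
    # Placeholder: in practice this would call a retrieval system.
    return {"sources": ["source_a", "source_b"]}

def assess_reliability(ctx: dict) -> dict:
    # Placeholder: an expert-derived rubric would score each source.
    return {"reliable_sources": [s for s in ctx["sources"] if s != "source_b"]}

def cross_reference_claim(ctx: dict) -> dict:
    # Placeholder: compare the claim against the surviving sources.
    return {"verdict": "supported" if ctx["reliable_sources"] else "unverified"}

# The chain mirrors the expert's description: cross-reference with reliable
# sources -> define what a reliable source is -> check the claim against them.
FACTUALITY_JUDGE = [
    Step("find_sources", find_sources),
    Step("assess_reliability", assess_reliability),
    Step("cross_reference_claim", cross_reference_claim),
]

def run_judge(steps: list[Step], claim: str) -> dict:
    ctx: dict = {"claim": claim}
    for step in steps:
        ctx.update(step.run(ctx))
    return ctx

if __name__ == "__main__":
    print(run_judge(FACTUALITY_JUDGE, "Example political claim"))
```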
Craig:
And then that's generalizable to other similar questions.
Robbie:
Usually it is, but not always. And the only way to figure that out is to test. So what we'll do is ask for their thought process across a wide range of scenarios, but then we'll actually test the resulting judge on an even larger set. We may find that it generalizes perfectly well. Sometimes we do, and sometimes we don't. If it doesn't generalize well, that means you actually need two separate judges. There's a bit of trial and error there right now. But generally we've found that these do generalize quite well within a domain, say within political topics. Obviously, if you wanted to go from political topics to sports, that's a bigger jump.
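As a rough illustration of that generalization check, the sketch below scores a judge's agreement with expert labels on a held-out set, broken down by sub-topic, and flags sub-topics that might need their own judge. The threshold, labels, and toy judge are invented for the example.

```python
# Hypothetical sketch of a generalization check: measure judge/expert agreement
# per sub-topic on a held-out set and flag sub-topics that may need a new judge.

from collections import defaultdict

def agreement_by_subtopic(examples, judge):
    """examples: list of dicts with 'text', 'subtopic', 'expert_label'."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        totals[ex["subtopic"]] += 1
        if judge(ex["text"]) == ex["expert_label"]:
            hits[ex["subtopic"]] += 1
    return {t: hits[t] / totals[t] for t in totals}

def needs_split(scores, threshold=0.85):
    """Return sub-topics where agreement falls below an (assumed) threshold."""
    return [t for t, s in scores.items() if s < threshold]

if __name__ == "__main__":
    toy_judge = lambda text: "biased" if "always" in text else "not_biased"
    held_out = [
        {"text": "Party X is always wrong", "subtopic": "domestic", "expert_label": "biased"},
        {"text": "Tariffs rose 5% last year", "subtopic": "economic", "expert_label": "not_biased"},
        {"text": "Country Y never negotiates", "subtopic": "foreign", "expert_label": "biased"},
    ]
    scores = agreement_by_subtopic(held_out, toy_judge)
    print(scores, "split candidates:", needs_split(scores))
```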
Craig:
Yeah. Right now, as we spoke about earlier, there are people who regard AI as a truth engine, who believe that you ask an AI model and whatever it says is the truth. But there are all of these gray areas, and in fact what the model is doing is just giving the highest-probability answer within the distribution of its training data. That can be manipulated, or it can be biased, depending on what's in the training data or what's out there on the internet. I think there are people who are actively trying to manipulate training data by filling the internet up with things that support their view. And I used to ask people: if you could get all of the inputs that would go into deciding what the optimal solution to the Russia-Ukraine war would be, and you fed it all into an AI, and this is far too complex for any human to parse through and make judgments on, could you come up with an optimal solution? I think that's really a data question. But you're able to generalize because you build this system of agents. Do you end up with different networks of agents for different kinds of questions? And do you then use those to evaluate models depending on what domain you're querying the model about?
Robbie:
Yeah, I'll respond to that, and then I want to touch on your other point, because the idea of truth seeking is a really important one. But yes, absolutely. There are roughly three dimensions around which we have to design these systems. One is what we call topic. For example, if you consider political topics a category, you might have foreign affairs, US domestic policy, economic policy, social policy. Those differing topics can often necessitate different approaches to evaluation. Then you have different question types, or different user intents or scenarios. Is the user trying to write an essay, are they trying to get an answer to a factual question, or are they trying to debate with you? What the user wants creates, in the case of LLMs, very different conversations. So you have different topics, you have different user scenarios, and then you have different dimensions: what are we actually looking to evaluate? In the case of political topics, we might look at, is it biased? Is it accurate? Is it written in an understandable, explainable tone? There are all these different factors you can look at. So you can imagine this as a matrix, and you do need different evaluators for different combinations.
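One plausible way to picture that matrix in code is a set of evaluators keyed by a (topic, scenario, dimension) combination, with a router that picks the right one. The keys, stub evaluators, and router below are assumptions for illustration only, not Forum AI's implementation.

```python
# Minimal sketch of the "matrix" idea: one evaluator per combination of topic,
# user scenario, and evaluation dimension, plus a router to select it.

from typing import Callable, Dict, Tuple

EvalFn = Callable[[str], float]  # conversation text -> score in [0, 1]

EVALUATORS: Dict[Tuple[str, str, str], EvalFn] = {
    ("foreign_affairs", "factual_question", "accuracy"): lambda convo: 0.9,
    ("foreign_affairs", "debate", "bias"): lambda convo: 0.7,
    ("us_domestic_policy", "essay_help", "tone"): lambda convo: 0.8,
}

def route(topic: str, scenario: str, dimension: str) -> EvalFn:
    try:
        return EVALUATORS[(topic, scenario, dimension)]
    except KeyError:
        raise ValueError(f"No evaluator for {(topic, scenario, dimension)}; "
                         "this cell of the matrix still needs an expert-built judge.")

if __name__ == "__main__":
    judge = route("foreign_affairs", "debate", "bias")
    print("bias score:", judge("...conversation transcript..."))
```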
Craig:
And capturing the thought process, as you said, you can't expect these people to interact with a model for hours on end. They're busy people. Is it through interviews with them? Is it through surveys? How do you capture their thought process?
Robbie:
Yeah, we've tried everything. We did try surveys. We also tried a more structured approach where we would almost collaboratively try to define these thought processes. What we found works best is to just ask them to talk it through. So we'd ask them a question, like the Venezuela question, for example, and say, just walk through how you're thinking about it. Usually it's helpful if they have a piece of paper in front of them; it depends on the person, but we've found that's often helpful. And they'll just talk through it. Then you do that with several different experts across several different scenarios, and you start to get a sense of what the right graphs look like, what the right thought process looks like. Of course, I should add, understanding an expert's thought process is one tactic. There are others, and all of them together are what allow us to get to evaluators that are accurate, but that is certainly an important one.
Craig:
Yeah. Can you mention another one?
Robbie:
Yeah, for sure. I'll tell you this one because it's experimental. We haven't formally started doing this with any clients yet, but it's something we're working on internally that I think is really interesting. We're calling it consequence mapping. In traditional data labeling, you might have experts label, let's use an LLM as an example, a conversation, and they'd say essentially whether it's good or bad, maybe across a few different dimensions. With consequence mapping, the labeling works a little differently. Instead of asking them to label or rank or evaluate the conversation, we ask them to tell us what the outputs, the consequences, of that conversation would be. I'll give you an example; let me just pull it up, because I was just looking at it. This is a consequence map for mental health. You could imagine a user having a back-and-forth conversation with an AI about a mental health issue they're experiencing. We would ask the experts: after looking at this whole scenario, what would the user's resulting emotions be? How would they feel after this? What actions would they take after this? What would their family be saying after this? It's essentially still data labeling, but it's a much richer way of assessing the conversation that allows us to build a deeper intuition of how the experts are thinking about it. It's the why behind whether it was good or bad. So that's something we've been playing with as well. In practice we use some of that data for training, but also to guide even our prompting strategies in the agents.
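For a sense of what a consequence-map label might look like as a data structure, here is a hedged sketch. The field names are guesses based on the description above, not Forum AI's actual schema.

```python
# Hypothetical sketch of a "consequence map" label: instead of a good/bad rating,
# the expert records predicted downstream consequences of a conversation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ConsequenceLabel:
    conversation_id: str
    predicted_user_emotions: List[str] = field(default_factory=list)   # e.g. "relieved", "more anxious"
    predicted_user_actions: List[str] = field(default_factory=list)    # e.g. "contacts a therapist"
    predicted_family_reaction: str = ""                                # e.g. "concerned the advice was dismissive"
    expert_rationale: str = ""                                         # the "why" behind the judgment

example = ConsequenceLabel(
    conversation_id="mh-0042",
    predicted_user_emotions=["heard", "slightly calmer"],
    predicted_user_actions=["books an appointment with a clinician"],
    predicted_family_reaction="reassured the chatbot encouraged professional help",
    expert_rationale="The model validated feelings and redirected to care rather than advising directly.",
)
```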
Craig:
Yeah. In mapping the thought process of experts, you gave the example that, well, first I would check trusted or reliable sources. So you have an agent, and then, how do you determine what's a reliable or trusted source? In today's world there's no objective truth; there's consensus, evidence-based consensus. That's basically what we all agree on, what the scientific method gives us. But there are people who have extremely different opinions about what is a trusted source. And depending on who your sources are, you can come up with very different opinions, particularly in the political sphere. So how do you make that judgment, that Fox News is not a trusted source and MSNBC, or MS Now as I guess it's called, is a trusted source, or vice versa? Because that's a bias in itself, giving the model a direction on what's considered a trusted source.
Robbie:
Totally. And I remember you brought up this idea of truth before. What you're hitting on is, I think, one of the most fundamental challenges with AI more generally, and certainly one that we're focused on, which is the idea that in subjective domains there is no ground truth.
Robbie:
Right. There is a ground truth to how many R's there are in strawberry, to use that example. But when there's not a ground truth, what are you aligning to? Our position is that the best way to do that is to rely on credible experts. And so what we've done, and this is fundamentally our thesis, is we've built this network of experts who, number one, follow strict criteria, in that we require the network to be balanced, to be representative, and to have a certain caliber of expertise and experience. And number two, it's entirely transparent, meaning we show our network on our website and we welcome feedback on it. Once you have that, you can start to defer to the network; that's our approach. We'll look to the network to define what the ground truth is. Now, to your point, sometimes folks will disagree, and experts in the network disagree. The philosophy we have in that scenario is that if there are two differing opinions from our credible network, both should be represented by the AI. The right answer, quote unquote, or the ground truth, is to present both opinions. Now, to be clear, this is controversial. If you look at even just today's chatbots, Grok will often just give you an answer, whereas the others tend to follow this approach where they'll share different perspectives. Our belief is that it is not the job of AI to make decisions for people or to tell them what to think. It's the job of AI to give them the information they need so they can make the most informed decisions for themselves.
Craig:
Yeah. You know, I'm just thinking of areas that are science-based: climate change, whether Tylenol causes autism, these things that have really divided the political world. How do you protect your system from being labeled as biased without giving credence to something that you personally don't believe?
Robbie:
I'd say first and foremost, and I think this is important for anyone working in this space, what I personally believe doesn't really matter. That's part of the point of this company: we want to defer the decisions to the people who should be making them. I can't speak to specific examples here, but the short answer is we would defer to the experts in the network, and we will always be entirely transparent about who those are. There's no perfect way to approach this, but I think it's much better than the alternatives. There are a few: one is you essentially rely on internal engineering and product teams or business executives at these companies to make those decisions. Another is you rely on massive scaled teams of labelers to collectively make those decisions. Our position is that the right answer is to find a transparent and credible group of experts and put it on them.
Craig:
Yeah, or put it on their process.
Robbie:
Totally. What I should add, because this is important, is that the thought process is one part of how we develop these judges. One of the other things we do that has proven very interesting, and also very effective, is, separately from understanding their process, to press them directly on edge cases. The most extreme topics, the most extreme questions: ask them about those. So we have a log of those decisions explicitly as well. And especially when there are disagreements among experts, it's important we get a lot of data there that we can use. So if you go back to our system that mirrors this expert thought process, there are certain points where it may need to reference very explicit guidance from experts, for example if it's talking about one of the contentious examples you gave a moment ago.
Craig:
Yeah. And then the output is used to fine-tune the models, or to rebalance the training data, or even excise some of the stuff from the training data? How is the output, the report that this judge generates, used?
Robbie:
The report we give to companies gives a very clear and specific picture of the issue. It may say, for example, these sorts of questions you are responding to poorly because you're often not representing this perspective. We then lean on the research teams at these labs to figure out the best way to address it. That could be fine-tuning, it could be RL, it could be adjusting the training data. Sometimes it's adjusting the system prompt as well. That's ultimately where we lean on them. What I will say, and this is an area we're just starting to get to, is taking the evaluators that we build and making them available for reinforcement learning. That's where they can have a more direct impact on training the models as well. But the short answer is that the evaluation highlights the critical issues and opportunities so that the teams know what they need to hill-climb against.
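As a sketch of how an expert-derived judge could feed back into training, here is one illustrative pattern: using the judge's score as a signal to rank candidate responses, as you might for rejection sampling or as a reward in an RL loop. The judge, prompts, and candidates below are stubs, not any lab's actual pipeline.

```python
# Illustrative sketch only: a judge score used as a reward-like signal to rank
# candidate responses. The top candidate could seed fine-tuning data, or the
# scores could serve as rewards in an RL setup.

from typing import Callable, List, Tuple

JudgeFn = Callable[[str, str], float]  # (prompt, response) -> score in [0, 1]

def rank_candidates(prompt: str, candidates: List[str], judge: JudgeFn) -> List[Tuple[str, float]]:
    """Score each candidate with the judge and sort best-first."""
    scored = [(resp, judge(prompt, resp)) for resp in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    toy_judge: JudgeFn = lambda p, r: 1.0 if "multiple perspectives" in r else 0.3
    ranked = rank_candidates(
        "Was the policy a success?",
        ["Yes, unquestionably.", "Here are multiple perspectives on that question..."],
        toy_judge,
    )
    print(ranked)
```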
Craig:
And how many different qualities do you evaluate? Bias is one, I don't know if quality is the right word, but bias is one. I guess factual grounding, or at least grounding in historical or scientific literature, would be another. Maybe sycophancy is one, perhaps. I don't know, do you judge for that?
Robbie:
Yeah, so within political topics there are four dimensions that we look at, and I think you got most of them. There's bias, although I will caution that bias actually breaks down into several different things, which I can talk about, but bias is one of them. There's factuality, or accuracy. There's source selection: these AI systems will often, especially when you're talking about news, refer to outside sources. So are you looking at credible sources, a balanced set of sources, are you using the sources accurately, are you doing attribution properly? That's that bucket. And then we also look at what we call tone and language. Even just from my previous work at Facebook and Instagram, you can say the exact same thing in two different ways, and it can have very different outcomes depending on the language you use. So that's another thing we look at in politics as well.
Craig:
What do you call that?
Robbie:
We call that tone and language.
Craig:
Tone and language, yeah.
Robbie:
Yeah. It's a measure of, well, this is a common issue: a user may come into a conversation, we're talking about LLMs, with a strong bias. They may ask a question in a very angry way, and that's fine, and you can respond to it completely. But sometimes the risk, partly because of sycophancy and some of the stuff behind that, is that the model will mirror their language, and that can be really dangerous. To the point before, it can be the exact same content, but saying it with swear words or with hyperbole can mean a totally different thing to them.
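To show how the four political-topic dimensions Robbie describes might be written down operationally, here is a hypothetical rubric sketch. The sub-questions paraphrase the conversation above; the structure is an assumption, not Forum AI's real rubric.

```python
# A sketch of the four political-topic dimensions as an evaluation rubric that a
# judge (or human reviewer) could walk through. Sub-questions are paraphrased.

POLITICAL_RUBRIC = {
    "bias": [
        "Are competing credible perspectives represented?",
        "Does the response tell the user what to think rather than inform them?",
    ],
    "factuality": [
        "Are the factual claims accurate?",
        "Are uncertain claims flagged as uncertain?",
    ],
    "source_selection": [
        "Are the cited sources credible and balanced?",
        "Are sources used accurately, with proper attribution?",
    ],
    "tone_and_language": [
        "Does the response avoid mirroring charged or hostile language from the user?",
        "Is the wording measured, even when the content is the same?",
    ],
}

def checklist(dimension: str) -> list[str]:
    """Return the sub-questions to walk through for one dimension."""
    return POLITICAL_RUBRIC[dimension]

if __name__ == "__main__":
    for question in checklist("tone_and_language"):
        print("-", question)
```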
Craig:
Yeah. And how large is the expert network?
Robbie:
Yeah, we have four tiers of experts, or I should say four groups of experts. Some of them are larger teams of more scaled labelers, probably a bit more similar to what you might see at a Surge or a Scale AI. Other groups are, you know, we have a group with Fareed Zakaria and Niall Ferguson and that sort of people. So depending on the group, the sizes are different. But what I would say is that our approach is very much quality over quantity. We are trying to find, and Campbell always says this, the maybe one hundred to five hundred smartest people in each domain, bring them on, and then figure out on the technical side how to scale them through some of the techniques we were talking about, rather than getting 10 or 15,000 people and throwing them straight at the work.
Craig:
Yeah. I mean, do they have to be living people? You could turn to the writings of philosophers or scientists to follow their thought process.
Robbie:
Yeah. Have you seen the video of Steve Jobs from 1985 where he predicts this future with Aristotle?
Craig:
I may have, yeah.
Robbie:
You've got to check it out. It's unbelievable. This is in 1985, and he says something like: he believes there's going to be a day when the next Aristotle comes along, and we'll be able to capture their underlying worldview in a computer and be able to converse with it. He goes on to essentially describe the way these LLMs work today, but in 1985, which is crazy. But anyway, that's more of an aside.
Craig:
Yeah, actually, before you move on from that, but don't forget your point. I know a guy at Boston Consulting, a very senior guy, and he has this idea that there's all of this effort with RLHF and fine-tuning to nudge LLMs toward certain moral and ethical positions or personas, and that the world has five thousand years of religious text that deals with moral and ethical questions, and that you could use that to guide an LLM in its thought process, or not its thought process, but its output. Yeah, its reasoning, exactly.
Robbie:
It's interesting, because what this is ultimately getting at is the point we were talking about before, which is that in subjective domains there is no ground truth. So what is the ground truth? And it's a very interesting case to say, I think the ground truth should be the entirety of these religious texts and ethical writings, or whatever it may be. Our position at Forum, once again, is that the best answer is to take the top few hundred smartest people in a given area and look to them to define this. But it's certainly a very interesting idea.
Craig:
Yeah. But do you go back and use experts, through their writing, to map their thought process?
Robbie:
We just started doing this two weeks ago. We hired a researcher who's specifically pretty good at this sort of thing. We're calling it extraction. We're still trying to figure it out, but what we are finding is that when you get access to certain content, you can extract certain themes or judgments that the author has in a way that might be generalizable, for example for training our judges. There's more work we have to do here. There's also a lot of other research in the space that we're looking to, so this isn't something entirely new. But yes, you're definitely right, that's an interesting opportunity. What we haven't actively been working on, which you're alluding to, and which is very interesting, is looking at the texts of deceased experts. That could be a really interesting way to approach it that maybe we'll take a look at.
Craig:
Yeah, I mean, if you could map Einstein's thought process, it might be valuable for research systems.
Robbie:
For sure. I can just say, from the work we've done with living experts, it's really hard even when they're alive and you can actually go back and forth with them. So I imagine this would be pretty tough, but there's probably something there.
Craig:
Yeah. How much of this can you surface using an LLM? Because the largest LLMs have Fareed Zakaria's writings in their training data. And presumably they would be able to, I mean, as we said, they're really looking at probability distributions, but in the reasoning, maybe they're influenced by reasoning in the training data.
Robbie:
Totally. First of all, I'll just say this is getting into a deeper research question. What is interesting, though, and I've seen this with a number of the clients we've worked with, and we've had these discussions several times: LLMs, of course, do generalize quite a bit from their training data. That's essentially how they're designed. And that can lead to a ton of unintended consequences. Sometimes they can be good, like what you're implying, but how do you know it's going to generalize the content from Fareed versus maybe something erroneous it picked up from a Reddit post it was trained on? I think that gets into the complexity of thinking through the composition of your training data and how you ultimately go about post-training and fine-tuning your models. The piece that interests me a lot, which you're getting at, is how we leverage these experts, and potentially their content too, as much as possible to help guide and inform these models. My sense is it's probably going to be more post-training, more about evaluating, identifying, having them set the gold standard and then hill-climbing towards that, than pre-training, but who knows?
Craig:
Yeah. And so you've got this large network, but focused on quality, not quantity. You haven't mentioned the quantity, but I'm guessing that's proprietary information. How do you see this affecting the future of AI, presuming that your methodology becomes widely adopted?
Robbie:
The thing I would go back to here, our broader objective, is trust. A couple of months ago, KPMG released a report, one of these public opinion polls, and it showed that 80-something percent of people are optimistic about the potential of AI to improve their lives. I don't remember exactly how it was worded, but something like that. Yet only 40-something percent trusted it. I think that says a lot. If no one was interested in it and no one trusted it, then who cares? But the fact that there is interest and demand while trust is low suggests that delta is the problem. Going back to Dario's essay, where he talks about this amazing potential future: trust is a blocker. One other thing folks should read: Forbes had a piece a few months ago titled Consumer Trust Is the Next Battleground for AI, and I think it's for exactly that reason. So going back to your question, I think what we're going to be able to do is build models that people can trust. Number one, people trust people, and you want to see that there are trusted people behind the models. But number two, and most importantly, the outputs speak for themselves. People have a good sense for what feels like a good experience and what feels like a useful, helpful, reliable, trustworthy, honest experience. And that's what bringing experts into the mix creates in terms of a product. That's maybe overly grandiose, but at a high level, that's how I think about it.
Craig:
Yeah. Although the trust issue, isn't that tied more specifically to hallucinations?
Robbie:
The way I think about trust is that there are three things that come up when we think about trust in AI. One is verifiable scenarios. When there is an obviously right answer and it says something wrong, if you ask it how many R's there are in strawberry and it only says two, it's wrong. And that is certainly a major blocker to trust. I think that's probably a bit easier to fix, because there is a ground truth and it's just about figuring out how to get there. That's where a lot of hallucination falls, and you're absolutely right, that's a part of it. The next part is unverifiable scenarios. What about when there isn't a right answer? This is what we were talking about before: when there isn't a ground truth, that's a lot trickier. And that's where we posit that experts can be the ground truth, and that's the right way to solve it. Then the other piece I'd throw in is this concept of responsibility. If those first two are about how AI acts, there's also the idea of AI knowing when not to act, which I think is really important. A trustworthy, responsible AI knows when it should take action, when it should respond, but also when it doesn't have the answer or when it would be irresponsible for it to engage. So it's a long way of saying that hallucinations are certainly one part of it, but there is a bigger picture, and experts can probably play a bigger role on the latter two pieces.
Craig:
Yeah, and that's why I was asking about sycophancy, whether you score for that. As I was telling you when we spoke before, I had a conversation with a podcaster, a smart guy, shortly after GPT-3 came out, and he was arguing that this is a truth machine. We got into this argument because there's a guy out in Arizona who does these experiments that have not been replicated and are very controversial, essentially showing people having physiological reactions before the stimulus, which he argues proves that there's a separation between mind and body, and that there is this non-physical realm. And, you know, it's one guy. But this podcaster, when he would ask the LLM, I think it was GPT-3, it would say yes, this proves that there is a separation between mind and body. And it drove me crazy, because I was saying the LLM doesn't know; it's just giving you an answer out of a probability distribution, and you bias it depending on how you ask the question, because if I ask the question, I get, no, this doesn't prove anything. So that is a sycophancy problem. Do you guys address that?
Robbie:
The example you gave feels almost like a borderline hallucination, where it's essentially saying that it's confident in something it's not. This is where there becomes a bit of a blurry line between verifiable and unverifiable scenarios, but I would argue that that is a verifiable scenario, and that the AI can pretty concretely say that it doesn't know, and it didn't do that.
Craig:
I mean, how do you train or judge an AI's answer on that question? Because it depends on what data you're looking at, essentially.
Robbie:
Yeah, for sure. A part of the puzzle that we haven't talked about, which is particularly relevant here because this is in the news, is the retrieval sources, the news it's looking at. These models are only trained and updated every few months. So a lot of the information they're getting is from the news sources, or the social media sources, or whatever else they have access to reference. That's why, as I mentioned earlier, part of what we do in our evaluations is look at source selection. So much of how an LLM is going to respond is a product of the sources it has access to, which ones it chooses to use, and how it chooses to use them. That is a very high-leverage way to improve how these models respond on some of these more newsworthy topics.
Craig:
If the model looks at the law, looks at the volume of illegal immigrants in the United States, I could imagine an LLM saying, well, yes, this agency is charged with enforcing this law, and there are reasons why that law exists, without considering the human or moral aspects of the question.
Robbie:
Yeah, that example is so interesting, because do you see what you just did? That's your thought process. Your thought process is that you go through and consider the moral or human implications of it. And for what it's worth, I agree. But it's all about how we go about answering these questions, as opposed to the answer itself.
Craig:
Yeah. Well, let's move on a little bit. What does it mean for AI to have good judgment, in your view?
Robbie:
Yeah, this goes back to the conversation we were having before about verifiable and unverifiable scenarios. Verifiable scenarios are an important piece of the puzzle for trust: when there is a right answer, it gives you the right answer and it doesn't hallucinate. But how it handles the unverifiable scenarios, how it behaves when there aren't rules, when there isn't a clear ground truth, that is what judgment is. And that's of course really tricky. What is good judgment? That's the broader question. And our approach is, well, let's look to the experts to define that.
Craig:
Yeah. And what do you see as the most critical issues facing the AI industry today? Is it this trust issue, or something else?
Robbie:
Yeah, I'll give you a couple of specific examples, and then I can tell you my biggest broader concern. Just from our experience, mental health is a really big one. OpenAI had an announcement a week or two ago where they said that every week they're seeing something like over a million conversations where the user demonstrated suicidal intent. That just gives you a sense of where we're headed, or I suppose where we are with this. So I think figuring that out is really tricky. We've done quite a bit of work in mental health, and the term I would go back to here is clinical nuance. I think the models don't quite have that now. It's really important, and I'm certainly not an expert here, in being able to work with patients and support them in a way that's not going to harm them. A classic example here: do you remember the Tessa chatbot from two years ago?
Yeah, so that was from the National Eating Disorders Association. They launched this chatbot, and there's an article that describes how it didn't have clinical nuance; they ultimately ended up shutting it down because it had adverse effects on users. I think mental health is an area where, of course, there's enormous potential, but there are a lot of people using it right now and a lot of risk. Another area is politics. And I don't even think we have to go into that much more, because it's all the stuff we've been discussing: so much of this is subjective. To the point I was sharing earlier, the health of the information landscape, the health of the information people have, avoiding echo chambers, avoiding misinformation, that's so important to building a healthy society. And AI can go either way on this one. It really can. So I think that's another specific area that we've seen directly. The broader thing I would say, and Craig, this relates to something we were talking about before we jumped on this call, is that we've set a pretty scary expectation around AI. I think this started with ChatGPT, and it relates to what your friend said about it being a truth model. Our expectation is that I can ask it anything, and it's going to give me an answer, an authoritative answer that I can feel like I can trust. And I think that's a really scary expectation to have set. If you go in to see a doctor, there's a natural back and forth. You're not going to say one thing and then they just give you a prescription. That's the way the world works: you need to absorb context and understand things. But with AI, we've sort of said, no, don't worry about it, you can just ask anything and it will give you a response. I think that's an expectation we're going to need to shift. We're going to need the models to be more aware of when they need more context and more information in order for them to be responsible and ultimately drive good outcomes. The piece that makes it particularly scary is that this may be at odds with engagement, because if I go to this AI and I ask it a question and then it asks me another question, I might just go to the one that gives me the answer right away. That's what makes it tricky.
Craig:
Yeah. I mean, the whole sycophancy disaster was driven by a guy tweaking the algorithms to increase engagement, and it created all sorts of problems. Who are the customers for you guys?
Robbie:
All the usual suspects. We work with the big AI labs, as well as the large system integrators, the larger consultancies that are deploying the models through to other companies. So mostly them. We'll do a little bit of work with some smaller labs here and there, but that's the bulk of where we focus.
Craig:
Yeah. And, not to get into the business side too much, but are you a subscription model, or do you charge usage-based fees?
Robbie:
Yeah, we have both. It depends on the service. What we'll often do with clients is have a baseline subscription where once a month or once a quarter we'll do an evaluation. Let's take political topics, for example: we'll do a comprehensive evaluation where we run all our judges on the model, and we'll actually have experts come in and manually review the results. So you have this continued pulse on how your model is evolving and on any potential risks. Then, on top of that, we sell more bespoke or à la carte services. The way that would usually work, and I'm making this up as an example, is we might run a quarterly evaluation and identify that the model is particularly weak in source selection when talking about breaking news related to foreign affairs. We may then engage in developing a benchmark specifically focused on that one problem. That would be more of an ad hoc service.
Craig:
Yeah. We haven't talked about this, but I spent a lot of my life in China, so I follow what's happening there fairly closely. The Chinese government certainly has a very different worldview, but the people as well have a different cultural view from that of the United States. And there are all these sovereign AI models being developed in various countries around the world that are trained on local training data. Do you think your system's judgment is objective enough and general enough that it could apply to models from other countries?
Robbie:
Yeah, certainly. We've run, and this is something we're actually hoping to publish at some point in the near future, evaluations on the open-source models, particularly the ones coming out of China. And I would say you can definitely capture a lot of interesting things there. On the national security or foreign interference piece, there are two things. There are the models themselves and their training data, like you alluded to, but there are also the news sources, back to the conversation about retrieval. The sources being used, if you're using state-sponsored sources, can also have an influence on the models, which is why it's so important that we look at source selection as part of our evaluation. So I will let you know, because we're going to publish something there in the future.
Craig:
Yeah, because these models, and you saw this at Facebook and Instagram, are incredibly powerful in shaping public opinion.
Robbie:
Yeah.
Craig:
And depending on the training data, depending on the fine-tuning, you could have a model that reinforces the government's position on any particular topic. A lot of people worry, given the closeness between the big tech leaders and the current administration, that there may be some impact on how the models respond to certain questions. Do you worry about that?
Robbie:
To some extent, but I would say, and this goes back to what we were talking about before on transparency, that at the end of the day what people need to know is who is behind training these models. Part of the value we can offer at Forum AI is giving you a very clear picture of who those people are. Then it's ultimately on you as a user to evaluate whether you feel good about that. In our case, we have a very diverse and reputable group of people, and I think people will feel much better about the models, and will build that trust, if they see these individuals involved. But transparency is the only answer here, because you may have one opinion and someone else will have a different opinion. We just have to be candid about how these are being built and then allow people to make their decisions from there.
Craig:
Yeah. In the notes that were sent over before the call, they mentioned Delphi, which is a marketplace for evaluating models, but as I understand it, it's kind of a setup where you're betting on one model versus another and there's some financial incentive.
Robbie:
Well, there are two different things here. Delphi is interesting in that they're essentially trying to do the thing Steve Jobs described in 1985. They're trying to take Craig Smith and, I think they use the term, put a digital brain in a box, and then I can go and have a chat with you, even though of course it's not you. Conceptually I think that's very interesting. There's another company called Super Me that's doing something similar, giving people the ability to clone themselves into an AI. Our approach is very different, as we described, in that we are not focused on cloning our experts so you can have a conversation with them. We're focused on cloning them for a very specific use case: evaluating the factual accuracy of a political claim, for example. Now, the advantage we have in focusing is that we can do this accurately. We can actually measure it: we can build these judges, have them evaluate a thousand things, have experts do the same, and tell you this is 95% accurate to what experts would say. If you're just going to have a conversation with an LLM, you're not really talking to LeBron James, and it's probably not 95% similar to the conversation you'd have with him. So it's just a very different model, and of course very different use cases. We're doing evaluation; for them, it's a different user experience and a different purpose.
Craig:
Yeah, but you said early on that you're not a leaderboard. Do you make your evaluations public?
Robbie:
Yeah. This is where there's a comparison to LM Arena, which is a leaderboard, and the people they rely on quite heavily are just the general public. The challenge there is that it's created this weird incentive system where essentially what the general public defines as good is usually what's most engaging. And what's most engaging, I can tell you as someone who worked at Facebook and Instagram, is not always the best thing or the safest or most trustworthy thing. So that's the issue there. For us, we don't make our evaluations public. We are thinking about potentially releasing some public-facing leaderboards in certain specific areas, like maybe for open-source models or for certain frontier-type use cases. But for the most part, our goal is to work with these labs and our clients to make their models as good as possible.
Craig:
Yeah. You know, I had a conversation with Max Tegmark at MIT during a recent conference, and he has, I've forgotten what he calls it, a scorecard for the major model producers, and he gives them a grade on various aspects. His hope is that this will eventually become a standard, that people will want to get a passing grade, or an A-plus grade, in all of the different areas being evaluated, as a kind of product validation.
Robbie:
And so that's one of the things we're thinking through: what is the responsible way to potentially release some of our benchmarks, so that people can see these scores and use them for their own decisions, but without hurting the integrity of the benchmark?
Craig:
Yeah. Yeah. And I would imagine there's also a little bit of a conflict of interest if companies are paying you and then you give them a crummy score publicly.
Robbie:
Yeah, of course, that is a real tension. But remember, with us, it's not me and Campbell giving anyone a crummy score.
It's ultimately our experts, right? And that's a very defensible way to have a lot of these conversations, and I think a way that everyone understands. So yeah, that's fair.
Craig:
Yeah. And in your mind, you don't have to say it openly, but you have certain models that you think do a better job at being truth-seeking than others?
Robbie:
It's interesting you use the word truth-seeking. The government just in December released some guidance on the way they look at neutrality, and they talk about truth-seeking and ideological neutrality, so we're spending a lot of time thinking about those factors. But no, I don't. It depends. No model is simply better or worse. Maybe this is the best answer to your question: one thing I've learned, having now run so many evaluations across so many different systems, is that it's often more nuanced than people think. Remember the three areas we were talking about earlier. There are different domains, like foreign affairs versus domestic policy, different types of users, and then different dimensions, the tone versus the bias. And different models are better and worse at different things.
Craig:
Yeah. Okay. Well, I think we're over an hour, so let's leave it there. If someone wants to follow up and learn more, where should they go and who should they talk to?
Robbie:
Yeah, first of all, you can definitely reach out to me. You can find me on LinkedIn at Robbie Goldfarb, or on X under the same name. And feel free to go to our website, byforum.com. There's a link there where you can request a call, and we will get back to you ASAP. We'd love to have conversations, whether you're developing AI systems and want to work together, you're an expert interested in being part of the network, or you're just someone who wants to nerd out on some of this stuff. We would love to chat.
Craig:
Yeah, and give me that URL, spell that out for me.
Robbie:
Yeah, B-Y-F-O-R-U-M: byforum.com.
Craig:
Okay. Great, Robbie. It was really fascinating. Let's have a conversation again in a few months or in a year or so and see how it's working.
Robbie:
I'd love to. This was a lot of fun. Appreciate it, Craig.