Gary McGraw, CEO of the Berryville Institute of Machine Learning, recently joined Dennis Fisher on the Decipher podcast to discuss his team's new architectural risk analysis of black box LLMs and the need for regulation in the AI market. This is a condensed and edited transcript of that discussion.
Dennis Fisher: For people who are more security minded than AI or machine learning experts, can you explain the difference? How do you differentiate or relate the two?
Gary McGraw: Well, artificial intelligence has been around since 1956, since McCarthy and the people got together and said, oh yeah, we can build artificial intelligence in a week during a summer camp at Dartmouth. They were wrong. It took more than a week. But artificial intelligence has spun off a bunch of interesting things, and as soon as it solves something, it's no longer AI. Like search. Remember that all of search came from AI in the early days. And you know, there's symbolic AI, there's sub-symbolic computation, there's emergent computation. There's machine learning, and there are lots of different kinds of machine learning. The kind of machine learning that everybody's all excited about now is neural networks, and in particular deep neural networks, and we've built these autoassociative predictive generators, otherwise known as LLMs, now with unbelievably huge piles of data and huge numbers of connections. So these neural networks are really, really large, but generally speaking the technology was all around in 1990 when the PDP series got published by McClelland and Rumelhart, and those are some great books. I got started coding neural networks in 1989 when those books came out, and so I wondered, gosh, how much progress have we made in this little sub-aspect of AI? So machine learning is a subset; it's just a specific kind of approach to AI.
So why would you use machine learning if you're a computer guy? All right, I'm a computer guy, I know how to do something, so I write a program for that. I just type in the how, one instruction at a time, and I get a program. Now if I don't know how to do something, I can't write a program for it. But if I have a big pile of what, I can use the what to train a computer to do that, just by training it on the what. Every time you see one of these, one of those. Every time you see one of these, one of those. Millions and millions of times over; actually, now data sets are trillions, fourteen trillion things. Those kinds of associations define the what. We make a machine become the what through machine learning, and then the machine will do it. That's so cool. But you know what the problem is? We still don't know how.
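To make the "what, not how" idea concrete, here is a minimal sketch of training on labeled examples instead of writing the rule by hand. The spam-filter framing, features, and numbers are invented purely for illustration.

```python
# A minimal sketch of the "what vs. how" point: instead of writing the rule,
# we hand the computer labeled examples ("every time you see one of these, one of those")
# and let it fit a model. The dataset and features here are made up for illustration.
from sklearn.linear_model import LogisticRegression

# "What": pairs of (message length, number of links) -> spam or not spam.
X = [[120, 0], [80, 1], [300, 7], [45, 0], [500, 12], [60, 9]]
y = [0, 0, 1, 0, 1, 1]  # 0 = legitimate, 1 = spam

# We never write the "how" (no hand-coded rule); the model infers it from the "what".
model = LogisticRegression().fit(X, y)
print(model.predict([[250, 8]]))  # likely [1]: resembles the spammy examples
```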
Dennis Fisher: This is the thing that blows my mind: even the people that are building these LLMs and training these models and doing all this work don't completely understand the scope of what's going on or how it's all happening.
Gary McGraw: Yeah, well, that's because the numbers are so big. So you can understand a neural network, and convolutional networks, and autoassociation, and things like attention models; all that stuff we can understand, even theoretically. But when we scale it up to something this huge, we lose track of what exactly is going on. I want you to really grasp this. Today's LLMs have trillions of parameters. Trillions, not billions. One point seven trillion connections is a lot of numbers to keep track of in a representation, and they are trained on fourteen trillion examples from the internet. So in order to get a data set that big, you have to go scrape everything, and you don't really have time to clean it all up, so you end up scraping in some garbage and nonsense and wrong stuff and poison and heinosity and evil, and you eat that too. There's not much of it. But there's some. And as a result these foundation models today all include that horrible stuff.
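For a rough sense of scale, here is some back-of-the-envelope arithmetic using the figures McGraw cites; the 16-bit storage assumption is ours, not his.

```python
# Back-of-the-envelope arithmetic for the scale described above. The 1.7 trillion
# parameter and 14 trillion example figures are from the conversation; the
# 2-bytes-per-parameter (16-bit) storage assumption is ours.
params = 1.7e12          # parameters ("connections")
bytes_per_param = 2      # fp16 / bf16 storage (assumed)
tokens = 14e12           # training examples scraped from the internet

print(f"Weights alone: {params * bytes_per_param / 1e12:.1f} TB")   # ~3.4 TB
print(f"Parameters per training example: {params / tokens:.2f}")    # ~0.12
```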
Dennis Fisher: We don't know what those models actually look like because they're black boxes.
Gary McGraw: Exactly. Yeah, black box is the right term. If you think about it in terms of the process model that we introduced in 2020, there were nine components in that model. It's a generic model of all of ML: here's how ML works, generically. We adjusted that for the LLM case and we were like, holy cow, four of the components just got put in a black box. Those are decisions about the way machine learning in general works that have been made by your vendor of the foundation model. They made all the decisions about what training set to use, how to divide it up, what sort of analysis to do to decide whether you're done. They made those decisions. And then they released a black box and you bought it or rented it. So guess what: you're going to just go along with whatever risk decisions they made for that black box.
Dennis Fisher: If you've ever interacted, even at the lowest level, with one of those chatbots on some website that asks if it can help you today, you know very well that it might be able to help you with a small list of tasks. But in general, no, it can't help you.
Gary McGraw: Yeah, and they're not very trustworthy. Think about it like this: it's kind of like an API, but it's an unstructured, unpredictable API, where you put in some unstructured stuff called a pile of words, some text, and what comes out is some other words, which may be relevant but could be wrong. And if you put in the exact same bag of words twice, you're going to get back different answers sometimes, from the same exact machine in the same exact state. That's because the input is natural language, which is very slippery, and the output is also natural language, which is just wildly unstructured. It's just a bag-of-words predictor. Don't fall in love with a bag of words; but you know, Cyrano de Bergerac says that you can do that.
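Here is a small sketch of why the exact same input can come back with different answers: generation typically samples from a probability distribution over next words rather than picking one deterministically. The vocabulary and scores below are invented for illustration.

```python
# Sketch of sampling-based generation: the same prompt, the same "machine", the same
# state, yet different outputs across runs, because the next word is drawn from a
# probability distribution. The logits below are invented for illustration.
import numpy as np

rng = np.random.default_rng()
vocab = ["yes", "no", "maybe", "unclear"]
logits = np.array([2.0, 1.5, 1.2, 0.3])   # model scores for one fixed prompt

def sample(logits, temperature=0.8):
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                   # softmax with temperature
    return rng.choice(vocab, p=probs)

# Same input, different outputs on different runs.
print([sample(logits) for _ in range(5)])
```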
Dennis Fisher: I'm a writer. I love words but I know their limitations, that's for sure.
Gary McGraw: Ah, well, Dennis, let me tell you something serious. You love words and you write well and you are a writer. And you can't be replaced by an LLM. Somebody could make a Dennis LLM novel writer that tries to write the kind of novels that you write, and it would do a terrible job. It would be flat, there'd be no sparkle, there'd be no sort of intrigue among the characters. It'd be just kind of bell-curve pablum, and that's what we get when we rely on these LLMs today.
Dennis Fisher: Let's talk a little bit about the risks that you guys identified. You mentioned a couple of them. The one that seems the most relevant, and maybe even the hardest to address, is the one you described earlier: recursive data pollution.
Gary McGraw: Yeah, there's a mathematical way of thinking about this. You know how Gaussian bell curves have these tails on both sides: there's the fat part in the middle and the tails. When you're doing this autoassociative learning, you basically cut off the tails. So the bell curve smooths and the part under the fat part gets big, which is fine unless it's wrong. But all the subtlety that was in the tails, like all of the things that experts argue about when they're thinking about medicine, the interesting stuff that's in the tails, that's gone. And the more you eat, the more it goes away. If you ingest it some more, those attractors take up all the room in your representation and all of the subtlety disappears. That's one way of thinking about it. Another way of thinking about it that's easier: if you've ever played the guitar with an amp and you do feedback, you know, you stand by the amp and you get the feedback thing, which rock and roll guys love to do. It's very much like that, I think, so that's a good metaphor for thinking about it.
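A toy simulation of the "cut off the tails" feedback effect McGraw describes: repeatedly keep only the fat middle of a distribution and retrain on it, and watch the spread collapse. The numbers are arbitrary and chosen only to make the effect visible.

```python
# Toy illustration of recursive data pollution: each generation is fit to output
# that has had its tails trimmed away, so the spread shrinks and never comes back.
# All parameters here are arbitrary; this is only a sketch of the feedback loop.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)   # "real" data, tails included

for generation in range(1, 6):
    mu, sigma = data.mean(), data.std()
    kept = data[np.abs(data - mu) < 2 * sigma]         # the fat part under the curve
    data = rng.normal(kept.mean(), kept.std(), size=100_000)  # retrain on model output
    print(f"generation {generation}: std = {data.std():.3f}")
# The standard deviation drops every generation; the subtlety in the tails is gone.
```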
Dennis Fisher: Is there any effort to sort of figure out a way around it or is it too late?
Gary McGraw: The way around it would be to train some foundation models that don't have wrongness in there, but then you need somebody to decide what's right and what's wrong. And then we need some philosophers, and we're going to have to have some wars and arguments over it.
Dennis Fisher: Another part of this is the data feudalism. The big companies such as Google own their own ocean of data, and they're not really letting anybody else swim around in that ocean.
Gary McGraw: That's right, and they used to share amongst each other, and now they're not even doing that, which is kind of interesting. So yeah, where do we get enough data? How much does it cost? Who owns it? Who's generating it? But the real question behind it is: are we really, in fact, producing enough data for the models that we want to build now, or not? And data feudalism certainly cuts against that grain, you know?
Dennis Fisher: What do you see in terms of the way that AI can be used in the security field specifically?
Gary McGraw: You know, probably configuration, and you can put these things inside of guidelines that are really, really tight and have them help you do things like a process that takes a few steps, something that's a little complicated. One of the companies that I'm advising is called Redsift, and they built a technology that uses network stuff like DNS records to figure out whether or not spoofers are building a fake website for, say, Bank of America in Thailand or whatever. It doesn't really look at the text so much as it looks at the network configurations and the certificates and all the stuff that you have to prove that you really are who you are. And it uses machine learning to find things that are probably not legit and say, hey, excuse me, but you might want to look at this one; this looks like somebody trying to spoof you, and you've got to shut them down. So that's a use of machine learning to do good from a security perspective.
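As a rough illustration of that kind of system, here is a sketch of a classifier scoring network-level signals rather than page text. The feature names, data, and model choice are hypothetical and are not Redsift's actual implementation.

```python
# Sketch of spoof detection from network-level signals (not page text).
# Features, labels, and model are hypothetical, chosen only to illustrate the idea.
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: [cert age in days, DMARC record present, DNS/WHOIS mismatch count]
X = [
    [3650, 1, 0],   # long-lived cert, email auth in place, consistent records -> legit
    [2920, 1, 1],
    [5,    0, 4],   # brand-new cert, no DMARC, inconsistent records -> likely spoof
    [2,    0, 6],
    [4000, 1, 0],
    [10,   0, 3],
]
y = [0, 0, 1, 1, 0, 1]   # 0 = legitimate, 1 = likely spoof

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict_proba([[4, 0, 5]]))   # high probability of "likely spoof"
```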