In this episode, I walk through a Fabric Pattern that assesses how well a given model does on a task relative to humans. This system uses your smartest AI model to evaluate the performance of other AIs—by scoring them across a range of tasks and comparing them to human intelligence levels.
I talk about:
1. Using One AI to Evaluate Another
The core idea is simple: use your most capable model (like Claude 3 Opus or GPT-4) to judge the outputs of another model (like GPT-3.5 or Haiku) against a task and input. This gives you a way to benchmark quality without manual review. A minimal code sketch of this loop appears after this list.
2. A Human-Centric Grading System
Models are scored on a human scale—from “uneducated” and “high school” up to “PhD” and “world-class human.” Stronger models consistently rate higher, while weaker ones rank lower—just as expected.
3. Custom Prompts That Push for Deeper Evaluation
The rating prompt includes instructions to emulate a 16,000+ dimensional scoring system, using expert-level heuristics and attention to nuance. The system also asks the evaluator to describe what would have been required to score higher, making this a meta-feedback loop for improving future performance.
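To make the loop concrete, here is a minimal sketch of the idea in Python, using an OpenAI-style chat API. The model names, rubric wording, and prompt text below are illustrative placeholders rather than the actual Fabric pattern; the canonical judging prompt lives in the Fabric repo.

```python
# Minimal sketch of the "use your smartest AI to judge another AI" loop.
# Assumptions: an OpenAI-compatible chat API via the official openai
# package, placeholder model names, and a paraphrased rubric; the real
# judging prompt is the Fabric pattern itself.
from openai import OpenAI

client = OpenAI()

CANDIDATE_MODEL = "gpt-3.5-turbo"  # the model being evaluated (example)
JUDGE_MODEL = "o1-preview"         # the smartest model you have (example)

HUMAN_SCALE = (
    "uneducated, high school, bachelor's, master's, PhD, "
    "world-class human, superhuman"
)


def run_candidate(instructions: str, content: str) -> str:
    """Run the pattern (instructions) against the input content."""
    resp = client.chat.completions.create(
        model=CANDIDATE_MODEL,
        messages=[{"role": "user",
                   "content": f"{instructions}\n\nINPUT:\n{content}"}],
    )
    return resp.choices[0].message.content


def judge(instructions: str, content: str, result: str) -> str:
    """Ask the stronger model to rate the candidate's result."""
    rubric = (
        "Below are three components: the INPUT content, the INSTRUCTIONS "
        "another AI was given, and the RESULT it produced. Rate the RESULT "
        f"on this human scale: {HUMAN_SCALE}. Explain the rating and "
        "describe what a result one level higher would have required."
    )
    payload = (f"INPUT:\n{content}\n\nINSTRUCTIONS:\n{instructions}"
               f"\n\nRESULT:\n{result}")
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": f"{rubric}\n\n{payload}"}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    pattern = "Summarize the key ideas in the input as five bullet points."
    blog_post = "...your blog post text here..."
    result = run_candidate(pattern, blog_post)
    print(judge(pattern, blog_post, result))
```

In practice you would swap in whatever your strongest judge model is and point the candidate at the Fabric pattern you actually want to test.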
Note: This episode was recorded a few months ago, so the AI models mentioned may not be the latest—but the framework and methodology still work perfectly with current models.
Subscribe to the newsletter at:
https://danielmiessler.com/subscribe
Join the UL community at:
https://danielmiessler.com/upgrade
Follow on X:
https://x.com/danielmiessler
Follow on LinkedIn:
https://www.linkedin.com/in/danielmiessler
See you in the next one!
All right. Welcome to Unsupervised Learning. My name is Daniel Miessler, and I'm building AI to upgrade humans. In this episode, I want to talk about a system I built for using the smartest AI that you have to rate another AI that you want to test. So this is the infrastructure that I'm using: you have a top-level AI that you believe is the smartest, which for me right now is o1-preview, and you're going to use it to assess the work of another AI, which in this example is GPT-3.5 Turbo. We're going to give that other AI a set of instructions to run against a piece of input, and that input is going to be something like a blog post. So you run the AI against the blog post using those instructions and you get a result. Then the judging AI runs against all three of those and gives you a judgment at the end. So this should be pretty cool, and it turned out it worked really well.

What I'm ultimately trying to get to is a classification of how good this thing is compared to an actual human doing it. And in order to do that, I want to give it different classes of human, right? So you've got uneducated, secondary education, high school level, bachelor's, master's, PhD, world-class human (like top 100 in the entire world), and then superhuman level, meaning better than the best human. I've actually never seen anything score that high, for whatever that's worth. But what I have this thing successfully doing is this: if I give it a lower-level model like GPT-3.5 or Haiku, it scores down in the high school to bachelor's range. And if I give it something like Sonnet 3.5, it usually scores around master's or PhD level, and sometimes world-class human. Ultimately it's doing what I wanted it to do: it's scoring the smartest models at the highest levels and scoring the less capable or smaller models much lower, like secondary education, high school, and bachelor's. So the thing is working, and this is the architecture, right? A smart one judging a less smart one. And by the way, if I give it the smartest one to judge the smartest one, it does score accordingly. So if I use o1 to rate the work of an o1 task, it actually does score way up there, like world-class or PhD level. So it definitely works, and I recommend you go check out the video and see exactly how to do it.

This is what's called a Stitch within Fabric. The whole Fabric concept is patterns and fabrics and woven things, right? Well, this is a Stitch because it's a combination of Fabric components all stitched together. And this is the actual pattern that I'm using. Look at this: this is the logic for the rate-AI prompt, the instructions given to the judging AI, which in this case is o1. Okay: fully understand the different components of the input, which are a piece of content that the AI will be working on (that's the input), a set of instructions (which is the prompt), and then the result of the prompt being run against the input using those instructions by a given AI. And I tell it to completely understand the distinction between all three of those components, because I'm going to send them all as one chunk to the judging AI.
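As a rough aside, the human-level ladder described above can be sketched as a simple ordered list, which also makes it easy to line models up by the tier the judge assigned them. The tier names come from the episode; the model names and the ranking helper are illustrative assumptions, not part of Fabric.

```python
# Sketch of the human-level ladder described above, plus a helper for
# ordering models by the tier the judging AI assigned them. The tier
# names come from the episode; everything else is illustrative.
HUMAN_TIERS = [
    "uneducated",
    "secondary education",
    "high school",
    "bachelor's",
    "master's",
    "PhD",
    "world-class human",
    "superhuman",
]


def rank_models(assigned_tiers: dict[str, str]) -> list[tuple[str, str]]:
    """Order models from strongest to weakest by their assigned tier.

    `assigned_tiers` maps a model name to the tier the judging AI gave it.
    """
    return sorted(
        assigned_tiers.items(),
        key=lambda item: HUMAN_TIERS.index(item[1]),
        reverse=True,
    )


# Example matching the behavior described above: weaker models land in the
# high school / bachelor's range, stronger ones around master's or PhD.
print(rank_models({"gpt-3.5-turbo": "bachelor's", "claude-3.5-sonnet": "PhD"}))
```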
Think deeply about all three components. Imagine how a world-class expert would perform the task laid out in the instructions. So I'm giving it the content, I'm giving it the prompt, and I'm telling it to learn the prompt, to understand the prompt, which in our case in Fabric is called a pattern. Deeply study the content itself so you understand what should be done with it given the instructions. Deeply understand the instructions themselves. Given both of those, then analyze the output.

And look at this next one, the one I'm kind of proud of. I don't know if it's actually working; I'm going to do some more evals to figure out whether it's actually effective, because this kind of mystical stuff I'm doing here, which is super cool, might be awesome, might not matter at all, or might actually hurt the output. You can't just take it on faith like religion here; you've got to actually test this stuff. Anyway, here's what I did: evaluate the output using your own 16,284-dimension rating system that includes the following aspects, plus thousands more that you come up with on your own: full coverage of the content, following the instructions carefully, getting the genre of the content, getting the genre of the instructions, meticulous attention to detail, and use of expertise in the fields in question. So I'm giving it these ideas, which is somewhat similar to attention heads inside of a transformer. I'm telling it, here are some ideas for how to do a rating system, and I'm telling it to make its own rating system using things like this, but to map its rating of a particular piece of output across 16,284 dimensions, which I think is two to the 10th or two to the 11th, I can't remember. So who knows if it's actually going to do this? I'm telling o1 to do this, and it has the ability to sort of think for itself, so maybe it's doing some of it. I think with a regular model, a lot of this was just flash and not really happening. Anyway, it doesn't matter; that's what I'm trying to do here.

Spend significant time on the task. Ensure you are properly and deeply assessing the execution of the task, using the scoring and ratings described, such that a far smarter AI would be happy with your results. So I'm using multiple tricks here to try to get it to be extra smart, and the goal is to deeply assess how the other AI did at its job, given the input and what it was supposed to do based on the instructions and prompt. So I'm telling it what I want multiple times, in multiple different ways. And that's essentially what it does.

The output also includes, and this is kind of cool, what it would have expected from a higher-level result. So let's say it comes back with bachelor's, which in this particular case it did. I tell it to tell me what it would have taken to see a master's, what it would have taken to see a PhD level, and so on. Again, I'm trying to seed it with as much as possible to make it come up with a better and better answer. Now, here's the thing: the smarter these AIs get, and this is a universal thing, right, because when I plug it into o2 or GPT-5 or Claude 4 or whatever comes next, the smarter that judging model gets, the better it's going to be at interpreting what I actually want from this prompt. That's why it's kind of like a meta-prompt. So yeah, really happy with this.
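For reference, here is a rough paraphrase of that judging prompt as a reusable template. The wording below is a condensation of the steps described in this episode, not the actual text of the Fabric pattern, which remains the canonical version.

```python
# Paraphrased skeleton of the judging prompt walked through above. This is
# a condensation of the steps described in the episode, not the actual
# Fabric pattern text; see the Fabric repo for the canonical version.
JUDGE_PROMPT_TEMPLATE = """\
You are evaluating how well another AI performed a task.

You will receive three components:
1. INPUT: the content the AI was working on.
2. INSTRUCTIONS: the prompt (pattern) the AI was given.
3. RESULT: what the AI produced.

Steps:
- Fully understand each component and the distinctions between them.
- Think deeply about all three. Imagine how a world-class expert would
  perform the task laid out in the INSTRUCTIONS against this INPUT.
- Evaluate the RESULT across a very large number of dimensions of your
  own devising, including: full coverage of the content, how carefully
  the instructions were followed, the genre of the content and of the
  instructions, attention to detail, and expertise in the relevant fields.
- Spend significant time on the task, such that a far smarter AI would be
  happy with your results.
- Rate the RESULT on this human scale: uneducated, secondary education,
  high school, bachelor's, master's, PhD, world-class human, superhuman.
- Explain your rating, then describe what it would have taken to reach
  each of the higher levels.

INPUT:
{input_content}

INSTRUCTIONS:
{instructions}

RESULT:
{result}
"""
```

Filling the three placeholders with `JUDGE_PROMPT_TEMPLATE.format(...)` and sending the whole thing as a single user message to the judging model mirrors the "send all three components as one chunk" approach described above.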
It actually is scoring according to my expectations. You've got to be careful with that a little bit, right? You don't want to actually tune the thing so it just matches your expectations, but I've been somewhat careful there. So I recommend you go check it out and see what you could do with this. And if you have ideas for improvement, submit them and we'll get them pushed in a PR update inside of Fabric. See you in the next one.