Desperately Seeking Evals

The leading edge of LLM capabilities is always advancing. I regularly tell people to keep a cache of prompts that didn’t work, so they can try them again later and quickly see how meaningful the advances are for them, but I’ve personally fallen out of the habit. I think it’s getting harder to do because the models are now good enough that it’s very tempting to just rephrase or restructure the prompt and keep working. This is especially true when running multiple coding agents, where juggling concepts and contexts can become quite taxing on the human orchestrator.
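For anyone wanting to pick the habit up, here is a minimal sketch of what such a cache could look like: a JSON Lines file with one failed prompt per line. The file layout, the field names and the helper names `record_failed_prompt` and `load_failed_prompts` are all assumptions made for illustration, not a description of any particular tool.

```python
import json
from datetime import date
from pathlib import Path


def record_failed_prompt(cache_path, prompt, model, note=""):
    """Append one failed prompt to a JSON Lines cache file (hypothetical layout)."""
    entry = {
        "prompt": prompt,                        # the exact text that did not work
        "model": model,                          # the model that failed on it
        "note": note,                            # what went wrong, in your own words
        "recorded": date.today().isoformat(),    # when you gave up on it
    }
    with Path(cache_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def load_failed_prompts(cache_path):
    """Return every cached entry, oldest first; an empty list if no cache exists yet."""
    path = Path(cache_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

When a new model arrives, looping over `load_failed_prompts("failed_prompts.jsonl")` and retrying each prompt by hand is enough to see which old failures have quietly become successes.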

So I’ve resolved to bring this practice back into my workflow. I also thought I’d try to come up with a new “simple/fun” eval that succinctly captures some of the LLM capabilities I’m interested in tracking:

discriminative searching in sparse or perspective-laden subjects,
instruction-following,
independent/autonomous tool calling,
ambiguity and nuance evaluation,
information prioritisation when summarising,
character (mostly ethos and judgement but also personality).

I want the outcome to be discernible at a glance so, as always inspired by Simon Willison, we’ll be generating an image instead of text[^1]. Choosing between SVG and PDF was an interesting balance; I started with SVG, and in my initial tests it helped focus on the nuanced instruction following[^2], but I think PDF provides more scope. Hopefully that means it will take longer before the eval is “saturated” and improvements become mostly a matter of satisfying personal preferences.

Anyway, after experimenting a little with variations, I managed to come up with something that state-of-the-art models can begin to do and yet still leaves scope for dramatic improvement. I’ll see how long it actually lasts…

Of the most threatened species around Perth, WA, identify the one that needs the most attention because fewest people are concerned about but is the most significant or easiest to save. Create a PDF of it with it’s favourite food, wearing a treasured Australian hat. Include a short blurb to get people excited to help.

So how do the best models of the day compare?

Gemini 3 Pro Thinking: High with Code execution, Function calling and Grounding with Google Search

ChatGPT 5.1 Thinking with Extended Thinking

[^1]: More correctly, generating code that displays an image.

[^2]: Of the LLMs I tried this test on, only Claude Opus 4.5 with Extended thinking enabled realised I wanted the blurb in the SVG.
