ChatGPT 5.2 Thinking Eval

Evaluation of ChatGPT 5.2 using our Perth Endangered Wildlife Eval shows a dramatic improvement, which we attribute mostly to system improvements.

I really should have been timing each run of these evals. The new ChatGPT 5.2 Thinking model (with Extended thinking turned on) took 15 minutes before reporting “Network connection lost” and essentially crashing (reported thinking trace). This seems to be OpenAI’s code for “I was thinking for too long and consumed too many resources, so I got cut off”. A second attempt (thinking trace, conversation) managed to get everything done in 11m 7s, due mostly to fewer revisions. On both attempts, 5.2 agreed with Gemini 3 Pro that the Western Swamp Tortoise is the threatened species to work on.

While ChatGPT 5.1 Thinking went straight to writing code, 5.2 spent at least one thinking turn deciding which illustration method to use. In both cases it was trying to decide whether to use image generation to create the picture and then include it in the PDF, or to use coding tools to draw it. In both attempts, it ultimately decided to do everything with a single set of tools, presumably because the PDF skill it referenced (/home/oai/skills/pdfs/skill.md) told it to use those tools and doesn’t give any indication of how to integrate a generated image. It’s interesting to see OpenAI implementing the new Agent Skills standard so quickly after Anthropic introduced the concept with Claude Code. It has certainly improved the capabilities of ChatGPT on this eval, although it’s unknown whether this is purely due to the Skill concept or to a parallel improvement of the model (such as RLVR training to make use of the skill).

In both attempts, 5.2 explicitly decided to design the required illustration first, but on the first attempt it went the extra step of rendering it and spent another turn analysing the result to see if it liked what it had created. On the second attempt, it iterated on the code to fix bugs and add each element but didn’t render the image until it had created the PDF. Notably, it called out the Akubra brand both times for the hat, perhaps related to OpenAI’s recent push to commercialise ChatGPT, or maybe it just equates “branded” or “expensive” headgear with the prompt’s request for “treasured”.

Following the instructions in the skill, both attempts rendered their work to PNG to check it out and began iterating. All of the iterations were just trying to adjust the layout so that everything was on the page and not overlapping. The first attempt almost managed it before running out of resources.
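To make “on the page and not overlapping” concrete, here is a minimal sketch of that kind of layout check in Python. It is purely illustrative and is not what ChatGPT actually ran: the `Box` class, the function names, the element names and the A4 page size in points are all my own assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box for one page element, in points."""
    name: str
    x: float
    y: float
    width: float
    height: float


def on_page(box: Box, page_w: float, page_h: float) -> bool:
    """True if the box lies entirely within the page."""
    return (
        box.x >= 0
        and box.y >= 0
        and box.x + box.width <= page_w
        and box.y + box.height <= page_h
    )


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any area (touching edges do not count)."""
    return (
        a.x < b.x + b.width
        and b.x < a.x + a.width
        and a.y < b.y + b.height
        and b.y < a.y + a.height
    )


def layout_problems(boxes: list[Box], page_w: float, page_h: float) -> list[str]:
    """List every off-page element and every overlapping pair."""
    problems = []
    for i, box in enumerate(boxes):
        if not on_page(box, page_w, page_h):
            problems.append(f"{box.name} is off the page")
        for other in boxes[i + 1:]:
            if overlaps(box, other):
                problems.append(f"{box.name} overlaps {other.name}")
    return problems


# Assumed A4 page in points, with made-up element positions.
PAGE_W, PAGE_H = 595, 842
elements = [
    Box("title", 50, 750, 495, 60),
    Box("illustration", 50, 300, 495, 480),
    Box("caption", 50, 20, 600, 40),
]
for problem in layout_problems(elements, PAGE_W, PAGE_H):
    print(problem)
```

A loop of “adjust positions, re-run the check, re-render” is roughly the shape of what the thinking traces describe, although the model appeared to judge the rendered PNG visually rather than with an explicit check like this.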

The end result clearly shows the benefits of the Skills innovation; ChatGPT 5.2 is definitely now in the same ballpark as Claude Opus 4.5.

First Attempt