The Takeaway: The real frontier isn’t chatty AI; it’s models that can work for days, verify their own progress, and make genuine discoveries.
- Math and coding became the proving ground because they’re hard but checkable; that same logic is now being pushed into messier domains like science, medicine, and law.
- The next leap isn’t “more prompts,” it’s longer autonomy: models that can evaluate partial progress, use more compute at test time, and keep going on open-ended tasks.
- OpenAI’s internal focus has shifted from benchmark bragging rights to practical research leverage, because “the models are going to drive a lot of that.”
Jakub Pachocki, OpenAI’s chief scientist, sounds less interested in hype than in the mechanics of making AI useful. In his view, coding tools like Codex are a signal, not the destination: OpenAI already uses them for most of its actual coding, and he expects the pattern to extend into research workflows. The same goes for math. Benchmarks like IMO problems mattered because they offered a clean North Star (“Math is very measurable,” he says), but the deeper value was training models to reason over long, difficult, verifiable tasks.
That’s why he keeps returning to horizon length. A model doesn’t need to be told “go solve alignment” tomorrow; it needs to get better at making partial progress on a long project, checking its own work, and staying useful over time. He thinks reinforcement learning will matter beyond code, but not as a copy-paste of today’s pipelines. The bigger shift may be models that adapt through context and existing interfaces (Slack, tools, workflows) rather than forcing companies to build bespoke harnesses around them.
The most revealing line: “We are no longer really purely building brains in the sky.” The message is clear: the company is optimizing for models that can touch the real world, accelerate research, and eventually become collaborators, not just assistants.