Anthropic Engineering
Anthropic: Infrastructure can swing agentic coding scores by 6 points
Anthropic says agentic coding benchmarks are far noisier than leaderboard gaps suggest: infrastructure choices alone can move scores by as much as 6 percentage points on Terminal-Bench 2.0. In internal tests, the gap between the most- and least-resourced setups was 6 points (p < 0.01), and strict resource enforcement produced 5.8% infra failures versus 0.5% when uncapped. The company found that Kubernetes enforcement details mattered: treating per-task resources as both a floor and a hard ceiling caused avoidable OOM kills, while the benchmark’s own leaderboard uses a more lenient sandbox that allows temporary overallocation.
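In Kubernetes terms, the floor-vs-ceiling distinction is the gap between `requests` and `limits`. A sketch of the two setups (values are illustrative, not the benchmark's actual recommendations): the strict configuration sets `limits` equal to `requests`, so any transient spike above the guarantee is OOM-killed, while the lenient sandbox leaves headroom above the guaranteed floor.

```yaml
# Strict enforcement: the request doubles as a hard ceiling.
# A task that briefly exceeds 4Gi is OOM-killed.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
---
# Lenient enforcement: 4Gi is guaranteed, but temporary
# overallocation up to 8Gi is tolerated, so transient
# spikes don't kill the task.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 8Gi
```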
The key takeaway is that resource headroom changes what the eval measures. Up to about 3x the benchmark’s recommended resources, extra capacity mainly reduced infra noise without materially changing scores; beyond that, more memory and CPU started helping agents solve tasks they couldn’t before. Anthropic saw the same pattern on SWE-bench, though the effect was smaller: scores rose monotonically with RAM, topping out at a 1.54-point lift at 5x.
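One way to see why headroom matters is to report infra failures separately from task failures instead of folding both into a single pass rate. A minimal sketch, assuming hypothetical outcome labels (not Anthropic's actual schema):

```python
from collections import Counter

def score_report(outcomes):
    """Summarize run outcomes, separating infra noise from capability.

    `outcomes` is a list of "solved", "failed", or "infra_error"
    (hypothetical labels). The raw score counts infra errors as
    failures; the adjusted score drops them from the denominator.
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    valid = total - counts["infra_error"]
    return {
        "raw_score": counts["solved"] / total,
        "adjusted_score": counts["solved"] / valid if valid else 0.0,
        "infra_failure_rate": counts["infra_error"] / total,
    }

# 100 runs under strict enforcement: 6 die to infra, not capability.
runs = ["solved"] * 40 + ["failed"] * 54 + ["infra_error"] * 6
report = score_report(runs)
```

The gap between `raw_score` and `adjusted_score` is exactly the noise that extra headroom removes without any change in model capability.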
Their recommendation: benchmark maintainers should publish both guaranteed allocation and hard kill thresholds, and users should treat sub-3-point leaderboard differences with skepticism unless the eval setup is documented and matched. As the post puts it, “a 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware.”
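The skepticism about small gaps is easy to quantify: even ignoring infra noise entirely, a 2-point lead on a benchmark of roughly a hundred tasks sits well within sampling error. A rough two-proportion z-test (sample sizes illustrative, not Terminal-Bench's actual task count):

```python
import math

def two_proportion_p_value(p1, p2, n1, n2):
    """Two-sided z-test for the difference between two pass rates,
    assuming independent runs -- an approximation for benchmarks
    with a fixed task set."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# A 2-point lead (62% vs. 60%) on a 100-task benchmark:
p = two_proportion_p_value(0.62, 0.60, 100, 100)
```

The resulting p-value is nowhere near significance, which is why a 2-point leaderboard gap alone says little without a documented, matched eval setup.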
Anthropic Engineering
Anthropic’s 3-agent harness boosts long-running coding
Anthropic says it built a GAN-inspired, three-agent harness that materially improves both frontend design and autonomous app development. The system splits work into a planner, generator, and evaluator, using separate grading criteria and Playwright-based testing so the model can be judged by another agent rather than itself. For design, the team turned subjective taste into concrete criteria—design quality, originality, craft, and functionality—and found that separating generation from evaluation pushed Claude away from generic “AI slop” toward more distinctive outputs. One example: a Dutch art museum site evolved from a polished but conventional landing page into a 3D room-based experience by the 10th iteration.
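The separation of generation from evaluation can be sketched as a simple loop. This is an illustration under assumed interfaces; the real harness, agent prompts, and grading rubric are not public, and every name here is hypothetical:

```python
def run_harness(task, plan, generate, evaluate, max_iters=10):
    """Minimal planner/generator/evaluator loop.

    The key property: `evaluate` is a separate agent, so the
    generator never grades its own output.
    """
    spec = plan(task)                        # planner expands the task
    artifact, feedback = None, None
    for _ in range(max_iters):
        artifact = generate(spec, feedback)  # generator drafts or revises
        verdict = evaluate(spec, artifact)   # evaluator grades against
        if verdict["ok"]:                    # criteria like quality,
            break                            # originality, craft, function
        feedback = verdict["critique"]
    return artifact

# Stub agents for illustration: the generator only improves after
# it has received a critique from the evaluator.
plan = lambda task: f"spec: {task}"
generate = lambda spec, fb: "distinctive draft" if fb else "generic draft"
evaluate = lambda spec, art: {"ok": "distinctive" in art,
                              "critique": "too generic; be more original"}

result = run_harness("Dutch art museum site", plan, generate, evaluate)
```

The museum-site example follows this shape: each pass through the loop, the evaluator's critique pushes the generator further from the generic default.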
For coding, the harness expands a 1–4 sentence prompt into a full spec, then builds in sprints with contract-based handoffs and QA checks. In a retro game maker benchmark, a solo run took 20 minutes and cost $9, while the full harness ran 6 hours and cost $200—over 20x more expensive, but far better. The solo app was broken; the harness version delivered a more polished interface, richer editors, built-in AI features, and working gameplay. As the post puts it, “the evaluator kept the implementation in line with the spec.”
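The contract-based handoffs can be sketched as sequential gates: each sprint's deliverable must pass its contract's QA checks before the next sprint builds on it. A minimal sketch (all names and structures hypothetical; the post doesn't publish the harness internals):

```python
def run_sprints(spec, sprints, qa_check, max_retries=3):
    """Run sprints in order, gating each handoff on QA.

    A sprint's deliverable is only handed to the next sprint once
    `qa_check` confirms it satisfies the sprint's contract.
    """
    completed = []
    for sprint in sprints:
        for _ in range(max_retries):
            deliverable = sprint["build"](spec, completed)
            if qa_check(sprint["contract"], deliverable):
                completed.append(deliverable)
                break
        else:
            raise RuntimeError(f"QA never passed for {sprint['name']}")
    return completed

# Stub sprints for illustration: each contract is just an expected string.
sprints = [
    {"name": "scaffold", "contract": "app skeleton",
     "build": lambda spec, done: "app skeleton"},
    {"name": "editor", "contract": "level editor",
     "build": lambda spec, done: "level editor"},
]
result = run_sprints("retro game maker spec", sprints,
                     lambda contract, d: d == contract)
```

The gating is what makes the 6-hour run worth its cost: later sprints never build on an artifact that failed its contract, which is how "the evaluator kept the implementation in line with the spec."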
Claude Blog
Claude adds interactive charts and diagrams in chat
Claude now creates interactive charts, diagrams, and visualizations directly inside conversations to help users understand topics as they’re being discussed. Unlike Claude’s existing Artifacts, which are permanent, shareable documents in a side panel, these visuals are temporary, appear inline, and update as the conversation changes. Users can also request them explicitly with prompts like “draw this as a diagram” or “visualize how this might change over time.”
Examples include an interactive compound-interest curve and a clickable periodic table visualization. Claude decides automatically when to generate a visual, and once it does, users can ask for adjustments or deeper exploration. The company says the feature is on by default and available on all plan types.
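The compound-interest example maps to a simple formula, A = P(1 + r/n)^(nt); the data points behind such a curve can be sketched as follows (parameters are illustrative, not from the post):

```python
def compound_interest(principal, rate, years, periods_per_year=12):
    """Balance after compounding an annual `rate` of interest
    `periods_per_year` times a year: A = P * (1 + r/n)**(n*t)."""
    n = periods_per_year
    return principal * (1 + rate / n) ** (n * years)

# Points behind an interactive curve: $1,000 at 5% APR over 30 years.
curve = [(t, compound_interest(1000, 0.05, t)) for t in range(31)]
```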
This launch fits into a broader push to make Claude’s responses more structured and useful: earlier this year, recipes started appearing in ingredient-and-steps format, weather requests began returning visuals, and Claude gained direct interaction with apps like Figma, Canva, and Slack. As the post puts it, “These charts, diagrams and visualizations serve a different purpose: Claude builds them to aid users’ understanding as it’s discussing the topic at hand.”