AI Builders Brief

Follow builders, not influencers.

2026.04.02

25+ builders tracked

TL;DR

Steinberger called plan mode training wheels, while Thariq gave Claude Code a mouse-friendly renderer and Cat Wu showed sessions jumping phone-to-laptop. Masad framed Replit as an OS for agents, Rauch said Vercel signups compounded fast, and Anthropic’s infra tweaks swung coding scores by 6 points.

BUILDER INSIGHTS
8
01
Peter Steinberger openclaw

Plan mode is training wheels, not the workflow

He says he never uses plan mode, and thinks it was added to Codex mostly for people who are used to Claude-style workflows and need help changing habits. His take is simple: stop over-structuring it and just talk to the agent.

X
02
Thariq anthropicai

Claude Code gets a mouse-friendly renderer

They rewrote the Claude Code renderer around a virtual viewport, so you can use your mouse while the prompt stays pinned to the bottom. It’s experimental, but the point is clear: Anthropic is sanding down the rough edges with a bunch of small UX wins people have been asking for.

X
03
Amjad Masad CEO, replit

Replit is turning into an OS for agents

He says Agent 4 has pushed Replit into something closer to an operating system: endlessly customizable with skills. He also argues we’re in an unprecedented era of rapid wealth creation — the kind of macro take founders love, but the OS claim is the more concrete product shift.

X
04
Guillermo Rauch CEO, vercel

Vercel signups are compounding fast

Vercel signups are growing 52% month over month, after already climbing at 23% and then 17%. That’s a pretty loud signal that the platform’s growth is accelerating, not just staying hot.

X
05
Cat Wu anthropicai

Claude Code sessions jump from phone to laptop

Claude Code now lets you start an idea on Claude mobile and teleport the session straight into your local CLI later. It’s a neat workflow pitch from Anthropic: mobile for capture, desktop for real work, with the handoff built in.

X
06
Nan Yu head of product, linear

Linear Agent turns code into instant product answers

If you're a PM or work in sales or support, you shouldn't have to ping an engineer just to figure out how the app works. He says Linear Agent can read the code and answer things like default settings for users, so non-engineers can self-serve instead of interrupting the team.

X
07
Dan Shipper CEO, every

SaaS isn’t dead — it’s becoming agent-native

He says Linear is the template: keep the core mission, stop chasing pointless AI gimmicks, and make agents first-class users alongside humans. The big shift is that software now has to manage AI work, not just human workflows — which is why tools like Linear are turning into the control plane for agents.

X
08
Zara Zhang

Agents that do your to-dos, not just track them

She’s pitching OpenClaw as a real task manager: you dump a quick task into chat, and the agent actually does it, then sends you a morning report on what’s finished and what still needs you. She also launched a "Follow builders" skill that curates 25 AI-builder accounts and podcasts into a personalized daily newsletter, already pulling 2k+ GitHub stars.

X
BLOG UPDATES
3
Anthropic Engineering

Quantifying infrastructure noise in agentic coding evals

Anthropic: Infrastructure can swing agentic coding scores by 6 points

Anthropic says agentic coding benchmarks are far noisier than leaderboard gaps suggest: infrastructure choices alone can move scores by as much as 6 percentage points on Terminal-Bench 2.0. In internal tests, the gap between the most- and least-resourced setups was 6 points (p < 0.01), and strict resource enforcement produced 5.8% infra failures versus 0.5% when uncapped. The company found that Kubernetes enforcement details mattered: treating per-task resources as both a floor and a hard ceiling caused avoidable OOM kills, while the benchmark’s own leaderboard uses a more lenient sandbox that allows temporary overallocation. The key takeaway is that resource headroom changes what the eval measures. Up to about 3x the benchmark’s recommended resources, extra capacity mainly reduced infra noise without materially changing scores; beyond that, more memory and CPU started helping agents solve tasks they couldn’t before. Anthropic saw the same pattern on SWE-bench, though smaller: scores rose monotonically with RAM, reaching only a 1.54-point lift at 5x. Their recommendation: benchmark maintainers should publish both guaranteed allocation and hard kill thresholds, and users should treat sub-3-point leaderboard differences with skepticism unless the eval setup is documented and matched. As the post puts it, “a 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware.”
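The skepticism about small leaderboard gaps is easy to sanity-check yourself. A rough sketch (not Anthropic's methodology; the task counts and pass numbers below are invented for illustration) is to treat two eval runs as pass rates over N tasks and ask whether the gap clears a two-proportion z-test:

```python
from math import sqrt, erf

def two_proportion_z(passes_a: int, n_a: int, passes_b: int, n_b: int):
    """Two-sided z-test for a difference in pass rates between two eval runs."""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    p_pool = (passes_a + passes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_a - p_b, p_value

# Hypothetical 89-task benchmark: 52 vs 46 tasks passed, a ~6.7-point gap.
diff, p = two_proportion_z(52, 89, 46, 89)
print(f"gap={diff:.1%}, p={p:.2f}")
```

On these made-up numbers the p-value comes out well above 0.05, i.e. even a multi-point gap on a benchmark of that size can be statistical noise before you account for infrastructure variance at all, which is the post's point in miniature.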

Anthropic Engineering

Harness design for long-running application development

Anthropic’s 3-agent harness boosts long-running coding

Anthropic says it built a GAN-inspired, three-agent harness that materially improves both frontend design and autonomous app development. The system splits work into a planner, generator, and evaluator, using separate grading criteria and Playwright-based testing so the model can be judged by another agent rather than itself. For design, the team turned subjective taste into concrete criteria—design quality, originality, craft, and functionality—and found that separating generation from evaluation pushed Claude away from generic “AI slop” toward more distinctive outputs. One example: a Dutch art museum site evolved from a polished but conventional landing page into a 3D room-based experience by the 10th iteration. For coding, the harness expands a 1–4 sentence prompt into a full spec, then builds in sprints with contract-based handoffs and QA checks. In a retro game maker benchmark, a solo run took 20 minutes and cost $9, while the full harness ran 6 hours and cost $200—over 20x more expensive, but far better. The solo app was broken; the harness version delivered a more polished interface, richer editors, built-in AI features, and working gameplay. As the post puts it, “the evaluator kept the implementation in line with the spec.”
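The planner/generator/evaluator split can be pictured as a simple loop where the evaluator is a separate judge that gates each handoff. This is a toy sketch with invented function names and scoring, not Anthropic's implementation:

```python
def plan(prompt: str) -> list[str]:
    # Planner: expand a short prompt into an ordered spec of work items.
    return [f"implement: {prompt} (step {i})" for i in range(1, 4)]

def generate(task: str) -> str:
    # Generator: stand-in for the agent producing an artifact for one task.
    return f"artifact for {task}"

def evaluate(artifact: str, task: str) -> float:
    # Evaluator: a separate judge grading the artifact against the spec (0..1).
    return 1.0 if task in artifact else 0.0

def run_harness(prompt: str, threshold: float = 0.9, max_retries: int = 3) -> list[str]:
    accepted = []
    for task in plan(prompt):
        artifact = generate(task)
        # Regenerate until the independent evaluator accepts, up to a retry cap.
        for _ in range(max_retries):
            if evaluate(artifact, task) >= threshold:
                break
            artifact = generate(task)
        accepted.append(artifact)
    return accepted

print(run_harness("retro game maker"))
```

The design point the post makes is visible even in this stub: because `evaluate` is a different agent than `generate`, the generator can't grade its own homework, which is what keeps long runs anchored to the spec.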

Claude Blog

Claude now creates interactive charts, diagrams and visualizations

Claude adds interactive charts and diagrams in chat

Claude now creates interactive charts, diagrams, and visualizations directly inside conversations to help users understand topics as they’re being discussed. Unlike Claude’s existing Artifacts, which are permanent, shareable documents in a side panel, these visuals are temporary, appear inline, and update as the conversation changes. Users can also request them explicitly with prompts like “draw this as a diagram” or “visualize how this might change over time.” Examples include an interactive compound-interest curve and a clickable periodic table visualization. Claude will decide when to generate a visual automatically, and once it does, users can ask for adjustments or deeper exploration. The company says the feature is on by default and available on all plan types. This launch fits into a broader push to make Claude’s responses more structured and useful: earlier this year, recipes started appearing in ingredient-and-steps format, weather requests began returning visuals, and Claude gained direct interaction with apps like Figma, Canva, and Slack. As the post puts it, “These charts, diagrams and visualizations serve a different purpose: Claude builds them to aid users’ understanding as it’s discussing the topic at hand.”

PODCAST HIGHLIGHTS
1

The Takeaway: Mistral bets on specialized, efficient voice models over bloated generalists.

Pavan Kumar Reddy, who leads audio research at Mistral, and chief scientist Guillaume Lample are pushing a very Mistral idea: don't build one giant model that kind of does everything. Build small, sharp systems that crush one job and fit real production constraints. Their new Voxtral TTS is their first speech-generation model, but it sits on top of a broader audio stack that already covers transcription, real-time ASR, and fine-tuning for customer data.

The contrarian point is simple: most companies are leaving money on the table by using closed models on their own data. As they put it, customers have "trillions of tokens" of domain knowledge that never make it into a public model. Mistral's pitch is to bring that data inside the model, not just stuff it into context windows forever. That's where their Forge platform comes in: on-prem or private-cloud deployment, continued pretraining, SFT, RL, and custom tuning for niche needs like medical jargon, noisy environments, or even offline in-car voice systems.

On the model side, Voxtral is interesting because it's not just another TTS wrapper. Mistral built a new neural audio codec and paired it with an autoregressive flow-matching architecture. That matters because audio is messy: the same word can be spoken in many valid ways, and averaging those possibilities produces mush. Flow matching gives them a cleaner way to model that distribution while keeping latency low enough for streaming use cases. They're already thinking in steps toward full duplex voice agents, models that can listen and speak naturally at the same time, but they're not pretending the field is solved. Their philosophy is more pragmatic: ship the most useful piece first, make it efficient, then keep tightening the stack.

STAY UPDATED

Daily builder insights, straight to your inbox.

Prefer RSS? Subscribe via RSS

ARCHIVE