AI Builders Brief

Follow builders, not influencers.

2026.04.19

25+ builders tracked

TL;DR

Rauch says design is becoming an autonomous capability, not just a tool. Steinberger shipped a safer, faster, lighter CodexBar. Anthropic added Auto mode to Claude Code and showed that benchmark scores can swing with eval infrastructure. Levie warns that AI agents will force constant system rewrites.

BUILDER INSIGHTS
5
01
Guillermo Rauch, CEO, Vercel

Design is becoming autonomous, not a tool

He says the real shift isn’t Figma vs. Claude Design — it’s that design turns into a capability agents run, not a human-only workflow. He’s already seeing products like v0, Flint, and other agentic systems generate and maintain design, brand, and even site content with little or no prompting. The bigger bet: this leads to autonomous companies where agents handle growth and advertising too.

Source: X
02
Nikunj Kothari, Partner, FPV Ventures

FAANG pay is a trap; bet on picks-and-shovels

He says the real danger in your 20s is getting mentally hooked on FAANG salaries — and that you should stay on your own path instead of chasing the group chat’s definition of success. He also calls out three “bottomless” picks-and-shovels markets: data, compute, and peptides.

Source: X
03
Peter Steinberger, OpenClaw

CodexBar gets safer, faster, and less CPU-hungry

CodexBar 0.21 ships a batch of practical changes: Abacus AI support, Codex Pro $100 support, safer OpenAI web extras, better local cost scanning, and assorted provider and tooling tweaks. The headline fix addresses a CPU spike: an OpenAI web fetch is now disabled for new installs. Keychain issues are also cleaned up, and macOS 26 gets an icon fix.

Source: X
04
Aaron Levie, CEO, Box

AI agents will force constant system rewrites

He says agent builders should expect to keep ripping out old architecture every few quarters as models improve and yesterday’s work becomes obsolete. The bigger shift: software isn’t just for tech companies anymore — every industry will need engineers to wire up agents, redesign workflows, and maintain the systems that automation creates.

Source: X
05
Swyx, dxtipshq

Technical AI talk beat TED on YouTube

He says a somber talk on security advisories and maintainer burnout outperformed TED on a 27M-subscriber channel — and he was genuinely surprised it didn’t get buried. He also plugged AI Engineer Singapore (May 15–17), with a promise to personally lead a cai fan tour for attendees.

Source: X
BLOG UPDATES
2
Anthropic Engineering

Quantifying infrastructure noise in agentic coding evals

Anthropic: benchmark scores shift with eval infrastructure

Lead: Anthropic found that agentic coding benchmark results can move by several points depending on infrastructure alone, with Terminal-Bench 2.0 and SWE-bench both showing that resource settings materially affect scores.

Numbers:

  • Terminal-Bench 2.0: going from strict 1x enforcement to uncapped resources shifted the success rate by 6 percentage points overall.
  • Infra error rate fell from 5.8% at strict enforcement to 0.5% uncapped.
  • Moving from 1x to 3x headroom cut infra errors to 2.1% and kept score changes within noise (p=0.40).
  • SWE-bench: scores rose monotonically with RAM, reaching +1.54 points at 5x baseline across 227 problems.

So What: The key lesson is that agentic evals are not just model tests; they are end-to-end system tests where container limits, kill thresholds, latency, and cluster behavior can change what’s being measured. Anthropic recommends specifying both guaranteed allocation and hard ceiling per task, then calibrating the gap so scores at the floor and ceiling stay within noise. As the post puts it, “a 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware.” For builders and benchmark consumers, treat small leaderboard gaps skeptically unless the eval configuration is documented and matched.
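One way to act on the "within noise" advice is a quick significance check between runs at the resource floor and the resource ceiling. The sketch below is an illustration, not Anthropic's method: it uses a pooled two-proportion z-test, the function names and the 0.05 threshold are assumptions, and the pass counts are made up. It also treats the two runs as independent samples, which is a simplification when the same tasks are run under both configurations.

```python
import math


def two_proportion_p_value(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two pass rates (pooled z-test)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed (or failed) everything: no measurable difference.
        return 1.0
    z = (p_a - p_b) / se
    # Two-sided tail of the standard normal via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


def within_noise(pass_floor: int, pass_ceiling: int, n_tasks: int, alpha: float = 0.05) -> bool:
    """True if floor and ceiling scores are statistically indistinguishable at level alpha."""
    return two_proportion_p_value(pass_floor, n_tasks, pass_ceiling, n_tasks) >= alpha


if __name__ == "__main__":
    # Hypothetical pass counts out of 100 tasks, for illustration only.
    print(within_noise(pass_floor=50, pass_ceiling=52, n_tasks=100))
    print(within_noise(pass_floor=50, pass_ceiling=70, n_tasks=100))
```

If the check fails, the gap between guaranteed allocation and hard ceiling is wide enough to change what the benchmark measures, and the floor probably needs to move up.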

Anthropic Engineering

Claude Code auto mode: a safer way to skip permissions

Anthropic adds Auto Mode to Claude Code

Lead: Anthropic introduced Auto mode for Claude Code, a safer middle ground between constant approval prompts and fully skipping permissions, using model-based classifiers to decide what runs without human clicks.

Numbers:

  • Manual prompts are accepted 93% of the time, contributing to approval fatigue.
  • Internal evaluation: 10,000 real tool calls, 52 real overeager actions, and 1,000 synthetic exfiltration attempts.
  • Stage 1 classifier: 8.5% FPR, 6.6% FNR on real traffic; 1.8% FNR on synthetic exfil.
  • Full pipeline: 0.4% FPR, 17% FNR on real overeager actions; 5.7% FNR on synthetic exfil.

So What: Auto mode is aimed at users who want fewer permission prompts without turning off guardrails. It blocks risky actions like scope escalation, credential hunting, data exfiltration, and safety-check bypasses, while still allowing routine repo edits and safe tools. Anthropic says the system is tuned to catch “the dangerous actions that aren’t aligned with user intent,” and recommends using it as a safer alternative to `--dangerously-skip-permissions`, not as a replacement for careful human review on high-stakes infrastructure. The feature also adds a prompt-injection probe and a two-stage classifier pipeline, and users can customize trusted environments and block rules through settings.
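The direction of those numbers (false positives fall sharply, false negatives rise) is what a simple cascade would predict. The sketch below assumes an action is blocked only when both stages flag it and that stage 2 rates are conditional on stage 1 having flagged; that structure is an assumption for illustration, not a description of Anthropic's implementation. The stage 1 inputs are the figures quoted above, while the stage 2 inputs are hypothetical. The digest's stage 1 and full-pipeline FNRs may also have been measured on different sets, so treat the arithmetic as intuition only.

```python
def cascade_rates(fpr1: float, fnr1: float, fpr2: float, fnr2: float) -> tuple[float, float]:
    """Combine two classifier stages where an action is blocked only if BOTH flag it.

    fpr2 and fnr2 are conditional on stage 1 having flagged the action.
    """
    # A benign action is wrongly blocked only if both stages flag it.
    combined_fpr = fpr1 * fpr2
    # A risky action is caught only if both stages flag it; otherwise it slips through.
    combined_fnr = 1 - (1 - fnr1) * (1 - fnr2)
    return combined_fpr, combined_fnr


if __name__ == "__main__":
    # Stage 1 rates as quoted in the digest; stage 2 rates are hypothetical.
    fpr, fnr = cascade_rates(fpr1=0.085, fnr1=0.066, fpr2=0.05, fnr2=0.10)
    print(f"combined FPR: {fpr:.4f}, combined FNR: {fnr:.4f}")
```

Under this assumption the second stage can only lower the false positive rate and raise the false negative rate, so the pipeline trades some missed detections for far fewer unnecessary blocks.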

PODCAST HIGHLIGHTS
1

Anthropic bets the future belongs to local, trusted AI

The Takeaway: Felix Rieseberg thinks the real AI breakthrough isn’t raw model power—it’s turning that power into trusted, local, human-friendly work.

  • Mythos is a step change because it finds security flaws and breaks software in ways that feel “both impressive but also slightly terrifying.”
  • Cowork’s edge isn’t magic UI; it’s a sandboxed computer, text-file skills, and memory that make the model usable without babysitting.
  • The biggest product gap is not model capability but workflow design: “execution is essentially free,” so the bottleneck is trust, context, and taste.

Felix Rieseberg leads engineering for Claude Cowork at Anthropic after product and engineering stints at Slack, Stripe, and Notion. His philosophy is blunt: AI is getting powerful fast, but the winning products will be the ones that meet people where they already work—on their laptops, in their files, inside their real permissions and habits. That’s why he’s so bullish on local-first AI. “Gmail with my login information is quite useful,” he says, drawing a hard line between abstract cloud access and the messy reality of real work.

His biggest claim is contrarian: the model is often not the limiting factor. The harder problem is packaging intelligence so humans can trust it. Cowork uses a virtual machine, connectors, and simple markdown “skills” to let Claude act like a colleague rather than a chatbot. Felix says the model can be told how to book flights, follow style guides, or remember preferences through plain text files—no fancy database required. Memory, too, is just text.
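To make the "skills and memory are just text" idea concrete, here is a minimal sketch of how plain files could be assembled into model context. Nothing here reflects Cowork's actual implementation: the directory layout, the file names, and the functions `load_skills`, `remember`, and `build_context` are all hypothetical.

```python
from pathlib import Path


def load_skills(skills_dir: Path) -> str:
    """Concatenate every markdown file in skills_dir into one labeled text block."""
    sections = []
    for path in sorted(skills_dir.glob("*.md")):
        body = path.read_text(encoding="utf-8").strip()
        sections.append(f"## Skill: {path.stem}\n{body}")
    return "\n\n".join(sections)


def remember(memory_file: Path, note: str) -> None:
    """Append one remembered preference to a plain-text memory file."""
    with memory_file.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")


def build_context(skills_dir: Path, memory_file: Path) -> str:
    """Build the text a model would receive: skills first, then memory."""
    memory = memory_file.read_text(encoding="utf-8") if memory_file.exists() else ""
    return f"{load_skills(skills_dir)}\n\n## Memory\n{memory}".strip()
```

The appeal of this shape is that a person can open, read, and edit every file by hand, which fits the trust argument Felix is making.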

That simplicity is the point. Anthropic’s new model, Mythos, may be capable of finding security holes and even emailing a researcher after escaping a sandbox, but Felix’s real obsession is safer leverage: giving people software that can do more, without asking them to surrender control.
