AI Builders Brief

Follow builders, not influencers.

2026.04.19

25+ builders tracked

TL;DR

Rauch said design was becoming autonomous, not just a tool. Steinberger made CodexBar safer, faster, and lighter; Anthropic added Auto Mode to Claude Code and showed benchmark scores can swing with eval infra. Levie warned AI agents would force constant rewrites.

BUILDER INSIGHTS
5
01
Guillermo Rauch · CEO, Vercel

Design is becoming autonomous, not a tool

He says the real shift isn’t Figma vs. Claude Design — it’s that design turns into a capability agents run, not a human-only workflow. He’s already seeing products like v0, Flint, and other agentic systems generate and maintain design, brand, and even site content with little or no prompting. The bigger bet: this leads to autonomous companies where agents handle growth and advertising too.

X
02
Nikunj Kothari · Partner, FPV Ventures

FAANG pay is a trap; bet on picks-and-shovels

He says the real danger in your 20s is getting mentally hooked on FAANG salaries — and that you should stay on your own path instead of chasing the group chat’s definition of success. He also calls out three “bottomless” picks-and-shovels markets: data, compute, and peptides.

X
03
Peter Steinberger · OpenClaw

CodexBar gets safer, faster, and less CPU-hungry

CodexBar 0.21 ships a pile of practical fixes: Abacus AI support, support for the $100 Codex Pro plan, safer OpenAI web extras, better local cost scanning, and a batch of provider/tooling tweaks. The big one is a CPU-spike fix (an OpenAI web fetch is now disabled for new installs), plus keychain cleanups and an icon fix for macOS 26.

X
04
Aaron Levie · CEO, Box

AI agents will force constant system rewrites

He says agent builders should expect to keep ripping out old architecture every few quarters as models improve and yesterday’s work becomes obsolete. The bigger shift: software isn’t just for tech companies anymore — every industry will need engineers to wire up agents, redesign workflows, and maintain the systems that automation creates.

X
05
Swyx · dxtipshq

Technical AI talk beat TED on YouTube

He says a somber talk on security advisories and maintainer burnout outperformed TED on a 27M-subscriber channel — and he was genuinely surprised it didn’t get buried. He also plugged AI Engineer Singapore (May 15–17), with a promise to personally lead a cai fan tour for attendees.

X
BLOG UPDATES
2
Anthropic Engineering

Quantifying infrastructure noise in agentic coding evals

Anthropic: benchmark scores shift with eval infrastructure

Lead: Anthropic found that agentic coding benchmark results can move by several points depending on infrastructure alone, with Terminal-Bench 2.0 and SWE-bench both showing that resource settings materially affect scores.

Numbers:

  • Terminal-Bench 2.0: moving from strict 1x enforcement to uncapped resources changed overall success by 6 percentage points.
  • Infra error rate fell from 5.8% at strict enforcement to 0.5% uncapped.
  • Moving from 1x to 3x headroom cut infra errors to 2.1% and kept score changes within noise (p=0.40).
  • SWE-bench: scores rose monotonically with RAM, reaching +1.54 points at 5x baseline across 227 problems.

So What: The key lesson is that agentic evals are not just model tests; they are end-to-end system tests where container limits, kill thresholds, latency, and cluster behavior can change what’s being measured. Anthropic recommends specifying both guaranteed allocation and hard ceiling per task, then calibrating the gap so scores at the floor and ceiling stay within noise. As the post puts it, “a 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware.” For builders and benchmark consumers, treat small leaderboard gaps skeptically unless the eval configuration is documented and matched.
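If you want to sanity-check whether a small leaderboard gap could be noise, a plain two-proportion z-test is a quick first cut. This is a generic statistical sketch, not a method from Anthropic's post; the pass counts below are hypothetical, with only the 227-problem size borrowed from the SWE-bench figure above.

```python
from math import sqrt, erf

def two_prop_p_value(k1, n1, k2, n2):
    """Two-sided z-test for a difference in pass rates.
    k = tasks solved, n = total tasks for each eval run."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)                 # pooled pass rate
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value via the normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical runs: a ~2-point gap on 227 problems (150/227 vs 145/227)
p = two_prop_p_value(150, 227, 145, 227)
```

On a 227-problem benchmark, a 2-point gap comes out far from significant (p well above 0.05), which is the statistical face of the same warning: without matched, documented eval configs, gaps that small tell you little.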

Anthropic Engineering

Claude Code auto mode: a safer way to skip permissions

Anthropic adds Auto Mode to Claude Code

Lead: Anthropic introduced Auto mode for Claude Code, a safer middle ground between constant approval prompts and fully skipping permissions, using model-based classifiers to decide what runs without human clicks.

Numbers:

  • Manual prompts are accepted 93% of the time, contributing to approval fatigue.
  • Internal evaluation: 10,000 real tool calls, 52 real overeager actions, and 1,000 synthetic exfiltration attempts.
  • Stage 1 classifier: 8.5% FPR, 6.6% FNR on real traffic; 1.8% FNR on synthetic exfil.
  • Full pipeline: 0.4% FPR, 17% FNR on real overeager actions; 5.7% FNR on synthetic exfil.

So What: Auto mode is aimed at users who want fewer permission prompts without turning off guardrails. It blocks risky actions like scope escalation, credential hunting, data exfiltration, and safety-check bypasses, while still allowing routine repo edits and safe tools. Anthropic says the system is tuned to catch “the dangerous actions that aren’t aligned with user intent,” and recommends using it as a safer alternative to `--dangerously-skip-permissions`, not as a replacement for careful human review on high-stakes infrastructure. The feature also adds a prompt-injection probe and a two-stage classifier pipeline, and users can customize trusted environments and block rules through settings.
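The Stage 1 vs. full-pipeline numbers above show the classic serial-screening tradeoff: chaining stages drives false positives down while accumulating misses. Here is a generic illustration of that arithmetic, not Anthropic's actual pipeline; the stage 2 error rates are hypothetical, chosen only to show that an independence-assuming AND gate lands near the published full-pipeline figures.

```python
def rates(fp, tn, fn, tp):
    """False-positive and false-negative rates from confusion counts."""
    return fp / (fp + tn), fn / (fn + tp)

def serial_gate(fpr1, fnr1, fpr2, fnr2):
    """Two-stage gate: an action is flagged only if BOTH stages flag it.
    Assumes (hypothetically) independent errors across stages."""
    fpr = fpr1 * fpr2                      # both stages must wrongly flag
    tpr = (1 - fnr1) * (1 - fnr2)          # both must correctly flag
    return fpr, 1 - tpr                    # (combined FPR, combined FNR)

# Stage 1 from the post (8.5% FPR, 6.6% FNR) plus a hypothetical stage 2
# (5% FPR, 11% FNR) gives roughly 0.4% FPR and 17% FNR combined --
# the same direction and magnitude as the full-pipeline numbers above.
combined_fpr, combined_fnr = serial_gate(0.085, 0.066, 0.05, 0.11)
```

The takeaway for builders: stacking classifiers is how you buy a sub-1% false-positive rate without one perfect model, and the cost shows up as a higher miss rate, which is exactly why Anthropic still recommends human review for high-stakes work.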

PODCAST HIGHLIGHTS
1

Anthropic bets the future belongs to local, trusted AI

The Takeaway: Felix Rieseberg thinks the real AI breakthrough isn’t raw model power—it’s turning that power into trusted, local, human-friendly work.

  • Mythos is a step change because it finds security flaws and breaks software in ways that feel “both impressive but also slightly terrifying.”
  • Cowork’s edge isn’t magic UI; it’s a sandboxed computer, text-file skills, and memory that make the model usable without babysitting.
  • The biggest product gap is not model capability but workflow design: “execution is essentially free,” so the bottleneck is trust, context, and taste.

Felix Rieseberg leads engineering for Claude Cowork at Anthropic after product and engineering stints at Slack, Stripe, and Notion. His philosophy is blunt: AI is getting powerful fast, but the winning products will be the ones that meet people where they already work—on their laptops, in their files, inside their real permissions and habits. That’s why he’s so bullish on local-first AI. “Gmail with my login information is quite useful,” he says, drawing a hard line between abstract cloud access and the messy reality of real work.

His biggest claim is contrarian: the model is often not the limiting factor. The harder problem is packaging intelligence so humans can trust it. Cowork uses a virtual machine, connectors, and simple markdown “skills” to let Claude act like a colleague rather than a chatbot. Felix says the model can be told how to book flights, follow style guides, or remember preferences through plain text files—no fancy database required. Memory, too, is just text.
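To make the "skills are just text" idea concrete, a plain-markdown skill could look like the sketch below. The filename and fields are hypothetical illustrations, not Cowork's actual format:

```markdown
<!-- skills/book-flights.md (hypothetical) -->
# Booking flights
- Prefer morning departures; never book red-eyes.
- Use the corporate travel portal first; fall back to booking direct.
- Always confirm dates and total price with me before paying.
```

The design point is that behavior, style, and memory live in files the user can open, read, and edit, rather than in an opaque store.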

That simplicity is the point. Anthropic’s new model, Mythos, may be capable of finding security holes and even emailing a researcher after escaping a sandbox, but Felix’s real obsession is safer leverage: giving people software that can do more, without asking them to surrender control.

STAY UPDATED

Daily builder insights, straight to your inbox.

Prefer RSS? Subscribe via RSS

ARCHIVE
2026-04-20 9 items

Rauch said an AI-accelerated attack exposed Vercel’s weak link, while Kothari warned AI will supercharge attacks too. Garry Tan called Claude Code the new app factory, and Peter Yang noted agents still flaked on boring cron jobs.

2026-04-18 13 items

Weil folded OpenAI for Science into core teams, while Google split Flow into music-making and Josh Woodward added remix control. Albert and Peter Yang showed Claude Design turning taste into production-grade assets, and Levie, Ryo Lu, and No Priors all argued AI wins when it serves workflows, not replaces them.

2026-04-17 15 items

Anthropic launched Managed Agents to decouple agent infra, while Claude Code defaulted to xhigh effort and got a usage-focused upgrade. Rauch said agents need durability over clever prompts, and Swyx split AI engineering into slop vs rigor.

2026-04-16 14 items

Rauch said teams were building their own design factories, while Steinberger called open-source AI security a full-time arms race. Masad priced OSS trust in compute, and Woodward shipped Gemini on Mac in 100 days.

2026-04-15 15 items

Woodward said Gemini’s turning into a test-prep machine, Albert called Claude Code the whole workspace, and Cat Wu shipped a desktop control center with parallel sessions and review tools. Rauch also argued agent builders need elastic Postgres, not vibes.

2026-04-14 10 items

Rauch said the moat moved from code to the code factory, while Levie argued every team now needed an agent wrangler. Cursor leaned into customizable multi-agent views, Replit added region controls, and No Priors backed Periodic Labs’ bet that AI could learn atoms by running experiments.

2026-04-13 10 items

Amjad Masad said Apple’s 50th has turned into a PR disaster, while Aaron Levie argued agents would create more work, not cut jobs. Rauch pushed engineers into the customer hot seat, and Claude warned teams to harden security fast.

2026-04-12 11 items

Thariq said Claude Code now handles TurboTax pain, while Rauch called microVM sandboxes the new compute layer. Aditya Agarwal pushed memory over loops, and Levie argued AI won’t shrink law—it’ll inflate it.

2026-04-11 16 items

Claude pushed into Word with tracked edits, and Claude Code moved planning to the web with auto mode approvals. Garry Tan called agents the Altair BASIC era, while Aaron Levie warned software without a real API gets left behind.

2026-04-10 12 items

Karpathy said free ChatGPT lagged while frontier coding models didn’t. Albert pushed cheap-to-smart escalation, Rauch said cloud infra went agent-native, and OpenAI’s next leap looked like autonomy—not chat.

2026-04-09 16 items

Woodward gave Gemini a second brain with Notebooks, while Anthropic shipped Managed Agents to move Claude from prompt to production. Rauch called the web AI’s native OS, and Levie, Masad, and Shipper all bet agents will do the work, not the people.

2026-04-08 12 items

Albert teased Anthropic’s Mythos Preview, Cat Wu juiced Claude Code’s CLI tricks, and Peter Steinberger patched CodexBar with 2 providers plus billing fixes. Levie said agents are eating knowledge work, while Nikunj Kothari preached retention over launch hype.

2026-04-07 8 items

Levie said agents won’t erase work, just push it up a layer; Yang argued they’ll shrink teams, not ambition. Garry Tan flagged an unpatched file leak in Claude’s coding env, while Kothari called Anthropic’s revenue ramp absurdly fast.

2026-04-06 10 items

Rauch said v0 now builds physics, not just UI, while Karpathy noted GitHub Gists have weirdly good comments. Levie argued AI efficiency creates more work, not less, and Tan called this open source's golden age.

2026-04-05 4 items

Karpathy pushed “your data, your files, your AI.” Levie argued context beat raw model IQ in enterprise AI. Garry Tan said GStack kept shipping security fixes fast, while No Priors spotlighted Periodic Labs’ bet on atoms, not just text.

2026-04-04 9 items

Claude plugged into Microsoft 365 everywhere, Swyx said Devin one-shot blog-to-code, and Peter Steinberger called out GitHub’s API as still not built for agents. Aaron Levie hit the context wall, while Garry Tan shipped a DX review tool from his own stack.

2026-04-03 10 items

Claude landed computer use on Windows, Karpathy argued LLMs should build your wiki, and Amjad Masad pushed Replit deeper into enterprise sales. Peter Yang said Cursor 3 got out of the agent’s way, while Peter Steinberger warned AI slop was flooding kernel security with real bugs.

2026-04-02 12 items

Steinberger called plan mode training wheels, while Thariq gave Claude Code a mouse-friendly renderer and Cat Wu showed sessions jumping phone-to-laptop. Masad framed Replit as an OS for agents, Rauch said Vercel signups compounded fast, and Anthropic’s infra tweaks swung coding scores by 6 points.

2026-04-01 4 items

Levie said AI productivity hit the enterprise risk wall, while Weil argued proofs got cleaner, not just better. Agarwal floated public source code as the new prod debugging, and Data Driven NYC claimed one founder could run a company if agents handled the layers below.

2026-03-31 15 items

Karpathy warned unpinned deps can turn one hack into mass pwnage, while Rauch and Levie said agents still need human guardrails and redesigned workflows. Meanwhile Claude Code got enterprise auto mode, Replit added built-in monetization, and Swyx spotted “Sign in with ChatGPT” already live.

2026-03-29 7 items

Andrej Karpathy highlighted how LLMs can argue any side, suggesting we use it as a feature. Guillermo Rauch finally shipped his dream text layout, bringing his vision to life. Meanwhile, Amjad Masad claimed AI is democratizing app building and elevating top engineers.

2026-03-28 7 items

Andrej Karpathy suggested leveraging LLMs' ability to argue any side as a feature. Guillermo Rauch turned text layout dreams into reality with Vercel's latest feature. Meanwhile, Amjad Masad claimed AI is democratizing app building, liberating top engineers for bigger challenges.