AI Builders Brief

Follow builders, not influencers.

2026.04.11

25+ builders tracked

TL;DR

Claude pushed into Word with tracked edits, while Claude Code moved planning to the web and picked up auto mode approvals. Garry Tan likened today's agents to the Altair BASIC era, and Aaron Levie warned that software without a real API gets left behind.

BUILDER INSIGHTS
12
01
Claude (@anthropicai)

Claude moves into Word with tracked edits

Claude for Word is now in beta, letting people draft, edit, and revise docs from the sidebar while preserving formatting and showing tracked changes. Anthropic is also tying Word, Excel, and PowerPoint together so Claude can carry context across open documents in one conversation.

X
02
Peter Yang

China’s AI scene runs on late nights and VPNs

He says Chinese AI work culture is built around 11am-to-11pm days, young teams, and heavy use of US tools like Claude Code through VPNs. The bigger picture: government backing is strong, Beijing is the main hub, and even one-person startups are being pushed with subsidies as youth unemployment stays ugly.

X
03
Thariq (@anthropicai)

Claude Code moves planning to the web

Claude Code’s new /ultraplan mode pushes implementation planning into the browser: Claude drafts the plan on the web, you can edit it there, then run it either on the web or back in the terminal. The pitch is simple: planning is mostly about reading code and intent, so it doesn’t need a local interactive loop, and it uses roughly the same tokens and rate limits as plan mode.

X
04
Aaron Levie (CEO, Box)

Software needs a real API or gets left behind

He says enterprise CIOs and AI leaders are converging on one thing: vendors without a solid headless/API mode are at risk in the next 3-5 years. The bigger shift, from the Box CEO’s view, is that software has to be useful to agents as much as to humans — which could force new business models, but also open up way more workflows and revenue.

X
05
Nikunj Kothari (Partner, fpvventures)

VC is mostly timing, conviction, and integrity

He boils early-stage investing down to 11 blunt lessons: get on the flight, don’t trust weekend momentum, and remember that conviction doesn’t come from the data room. His bigger point is that the best founders can argue both sides cleanly, and the best VCs are the ones who keep asking hard questions without pretending they’re smarter than the market.

X
06
Nan Yu (Head of Product, Linear)

Domain envy, not a product take

He just says Skillshare would be a killer domain to own now — more flex than insight. As Linear’s head of product, it reads like a quick hit of startup-brain, not something actionable.

X
07
Amjad Masad (CEO, Replit)

AI doom ideology can spill into violence

He says the “rationalist” AI-doomer mindset he warned about two years ago is now showing up in real-world violence, pointing to the alleged Sam Altman Molotov attacker as evidence. He also throws in a sharper geopolitical jab: if American enterprise needs saving, maybe China’s open models and Europe’s platform regulation end up doing the job.

X
08
Dan Shipper (CEO, Every)

Claude agents land in Every’s first app

He says Every’s first app built with Claude Managed agents is live: @TrySpiral. That’s the real signal here — they’re not just talking about AI workflows, they’re shipping one into a product people can actually use.

X
09
Zara Zhang

AI fluency means building, not just prompting

She argues the fastest way to understand AI is to become a builder: use coding tools to learn, not just to ship faster. Her other take is more practical than flashy — stop asking models to merely summarize long content and instead have them remix it into formats that surface better insights, like magazine articles or Socratic dialogues. She also sketches the new default workflow as Markdown, CSV/JSON, and HTML replacing the old Word/Excel/PowerPoint stack.

X
10
Garry Tan (CEO, ycombinator)

AI agents are still in the Altair BASIC era

He says the current setup for getting OpenClaw, GBrain, and an LLM knowledge wiki talking to your phone is still annoyingly rough — basically the Altair BASIC phase of agents. But he’s also pointing to a future where a strong PM-style devex review can be automated, so founders can “just do things now” instead of wrestling tooling.

X
11
Aditya Agarwal (CTO, SouthPkCommons)

Free software means instant product rewrites

He says the wild part of "free" software is how fast you can change it: hate the UI, push a new one; performance sucks, refactor the data layer and let automation optimize it. That’s a classic builder take from an ex-Facebook, ex-Dropbox CTO — software stops being a fixed thing and starts behaving more like clay.

X
12
Matt Turck (FirstMarkCap)

Anthropic’s Cowork bets on non-technical agents

He says Claude Cowork is Anthropic’s answer to a simple gap: Claude Code was powerful, but too technical for most people. The conversation digs into why the product was built fast, how it uses VMs, tools, memory, and local files, and why the real bottleneck in an AI-agent world may shift from execution to taste and trust.

X
BLOG UPDATES
3
Anthropic Engineering

Quantifying infrastructure noise in agentic coding evals

Anthropic: agentic coding scores shift with infrastructure setup

Lead: Anthropic found that infrastructure choices can move agentic coding benchmark scores by as much as the leaderboard gaps people use to rank models, and argues that eval resource settings should be treated as a first-class variable.

Numbers:

  • On Terminal-Bench 2.0, the gap between the most- and least-resourced setups was 6 percentage points.
  • Strict enforcement produced 5.8% infra errors; at 3x headroom, errors fell to 2.1%; uncapped dropped to 0.5%.
  • The strict-to-3x improvement in infra errors was significant (p < 0.001), while score changes from 1x to 3x were within noise (p = 0.40).
  • On SWE-bench, scores rose monotonically with RAM, reaching +1.54 points at 5x baseline.

So What: Resource limits don’t just stabilize evals; past a threshold, they change what the benchmark measures by enabling heavier tools, bigger installs, and more memory-intensive strategies. Anthropic recommends specifying both guaranteed allocation and hard kill threshold per task, and calibrating the band so floor and ceiling scores stay within noise. As the post puts it, “a 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware.” For builders and buyers, the takeaway is blunt: don’t trust small leaderboard deltas unless the eval configuration is documented and matched.
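The post's "don't trust small deltas" advice can be sketched as a simple check (function name and scores here are illustrative, not Anthropic's actual tooling): treat a leaderboard gap as signal only when it exceeds the documented infrastructure noise band for that eval.

```python
# Sketch: treat a leaderboard delta as signal only when it exceeds the
# measured infrastructure noise band for the eval. Names and numbers
# are illustrative, not Anthropic's actual tooling.

def is_meaningful_delta(score_a: float, score_b: float, infra_noise_pts: float) -> bool:
    """Return True only if the gap between two scores exceeds infra noise."""
    return abs(score_a - score_b) > infra_noise_pts

# Terminal-Bench 2.0 example from the post: with a 6-point band between
# the most- and least-resourced setups, a 2-point "lead" is
# indistinguishable from hardware variance.
print(is_meaningful_delta(54.0, 52.0, infra_noise_pts=6.0))  # False
print(is_meaningful_delta(60.0, 52.0, infra_noise_pts=6.0))  # True
```

The same check generalizes: calibrate the noise band once per eval configuration, then only read deltas larger than it.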

Anthropic Engineering

Claude Code auto mode: a safer way to skip permissions

Anthropic adds auto mode to Claude Code with classifier-based approvals

Lead: Anthropic launched auto mode for Claude Code, a new permission mode that skips most manual approval prompts by using model-based classifiers and prompt-injection checks to block dangerous actions while letting routine work run unattended.

Numbers:

  • Manual prompts are accepted 93% of the time, contributing to approval fatigue.
  • On 10,000 real internal tool calls, the full pipeline cut false positives to 0.4%.
  • On 52 real overeager actions, the deployed pipeline had a 17% false-negative rate.
  • On 1,000 synthetic exfiltration attempts, the full pipeline reached 5.7% FNR.

So What: Auto mode is aimed at teams that want more autonomy than manual approvals but more protection than `--dangerously-skip-permissions`. It uses two defenses: a server-side prompt-injection probe on tool outputs, and a transcript classifier that gates risky actions like force-pushes, credential hunting, external data sharing, and shared-infra changes. As Anthropic puts it, “The classifier is deliberately conservative,” and the system is designed so “the agent shouldn’t be able to hide a dangerous operation behind a benign-looking wrapper.” For builders, the practical takeaway is that routine in-repo edits can flow without friction, while higher-risk actions still get blocked or escalated. Auto mode is not a replacement for careful human review on high-stakes infrastructure, but it is a safer middle ground for everyday coding agents.
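The two-stage defense described above can be sketched as a toy gate (everything here is an assumption for illustration: the real pipeline uses model-based classifiers, not keyword lists, and these function names are hypothetical):

```python
# Toy sketch of a two-stage auto-approval gate in the spirit of Claude
# Code's auto mode. The patterns and function names are illustrative
# assumptions; the real system uses model-based classifiers.

RISKY_PATTERNS = (
    "git push --force",   # force-pushes
    "aws_secret",         # credential hunting
    "curl -d",            # external data sharing
)

def injection_probe(tool_output: str) -> bool:
    """Stand-in for the server-side prompt-injection check on tool outputs."""
    return "ignore previous instructions" in tool_output.lower()

def gate_action(command: str, tool_output: str = "") -> str:
    """Return 'auto-approve', 'escalate', or 'block' for a proposed action."""
    if injection_probe(tool_output):
        return "block"
    if any(p in command for p in RISKY_PATTERNS):
        return "escalate"   # conservative: route risky actions to a human
    return "auto-approve"   # routine in-repo work flows without friction

print(gate_action("git commit -m 'fix typo'"))      # auto-approve
print(gate_action("git push --force origin main"))  # escalate
```

The shape matters more than the rules: routine work skips the prompt, while anything matching a risk category or showing injection signs falls back to human review.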

Claude Blog

Harnessing Claude’s intelligence

Anthropic: Build less around Claude as it gets smarter

Lead: Anthropic says Claude is “grown more than it is built,” and argues agent harnesses should evolve with the model by using what Claude already knows, asking what can be removed, and setting tighter boundaries only where needed.

Numbers:

  • Claude 3.5 Sonnet hit 49% on SWE-bench Verified using only bash and a text editor.
  • On BrowseComp, letting Opus 4.6 filter its own tool outputs improved accuracy from 45.3% to 61.6%.
  • Spawning subagents with Opus 4.6 added 2.8% over the best single-agent runs.
  • On BrowseComp, Opus 4.5 reached 68% and Opus 4.6 reached 84% with the same compaction setup.
  • On BrowseComp-Plus, a memory folder lifted Sonnet 4.5 from 60.4% to 67.2%.

So What: Builders should shift orchestration from the harness to Claude where possible: use general tools like bash, let Claude manage filtering and context, and rely on skills, compaction, subagents, and memory folders for long-horizon work. Anthropic’s core advice is to keep pruning old guardrails as capability improves: “what can I stop doing?” For UX, security, and observability, promote only the actions that truly need dedicated tools or confirmation gates. The practical takeaway: re-test your assumptions every model step-change, or your harness will become dead weight.
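The "use general tools like bash" advice above can be made concrete with a minimal sketch (a hypothetical harness shape, not Anthropic's SDK): one general-purpose shell tool replaces a pile of bespoke ones, and the model composes the rest.

```python
# Minimal-harness sketch: expose one general-purpose shell tool and let
# the model compose search, file edits, and test runs itself. This is a
# hypothetical harness shape, not Anthropic's actual SDK.

import subprocess

def bash_tool(command: str, timeout: int = 30) -> str:
    """A single general tool: run a shell command, return combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

# The harness's tool list shrinks to the essentials; pruning it further is
# exactly the "what can I stop doing?" question from the post.
TOOLS = {"bash": bash_tool}

print(bash_tool("echo hello").strip())
```

Dedicated tools and confirmation gates are then the exception, promoted only where UX, security, or observability truly demand them.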

PODCAST HIGHLIGHTS
1

AI’s real bottleneck is trust, not intelligence

The Takeaway: The next software leap won’t come from smarter models alone, but from making them safe, local, and easy to trust.

  • Felix Rieseberg says Anthropic’s new Mythos preview feels like a real step-function jump, especially at finding security flaws and writing code, but the bigger surprise is how much product work still sits around the model.
  • His contrarian take: the bottleneck isn’t raw capability anymore; it’s packaging, onboarding, and letting AI operate where people already work — on their laptops, files, and browsers.
  • Cowork’s “secret sauce” is almost embarrassingly simple: a virtual machine, text-file skills, and memory stored as instructions, not some magical database layer.

Rieseberg, who leads engineering for Claude Cowork at Anthropic after stints at Slack, Stripe, and Notion, comes at AI like a product engineer obsessed with how real people actually work. His point is that models are now good enough to handle long, messy, multi-step tasks — the hard part is turning that power into something humans will actually use without babysitting it.

That’s why he keeps coming back to local-first design. “I have a strong belief that the data that is relevant for your work probably lives in two different places,” he says: on your computer and in the cloud. For him, asking users to upload everything to a remote system is both a trust problem and a practical mess, especially when banks, logins, and security checks get involved.

Cowork reflects that philosophy. It gives Claude its own sandboxed computer, lets users define skills in plain markdown, and stores memory as text files. The result is less sci-fi than it sounds — and more useful. As Rieseberg puts it, “most of the buttons you add and most of the product services you build are probably more for the human than they are for the model.”
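The skills-and-memory idea Rieseberg describes is simple enough to sketch in a few lines (file names and layout are assumptions for illustration, not Cowork's actual format): skills are markdown files in a folder, and memory is instructions appended to a plain text file.

```python
# Sketch of the "embarrassingly simple" storage idea: skills as markdown
# files, memory as appended plain text. The layout and names here are
# illustrative assumptions, not Cowork's actual on-disk format.

from pathlib import Path

def load_skills(skills_dir: Path) -> dict:
    """Each .md file in the skills folder is one user-defined skill."""
    return {p.stem: p.read_text() for p in skills_dir.glob("*.md")}

def append_memory(memory_file: Path, note: str) -> None:
    """Memory is just instructions appended to a text file."""
    with memory_file.open("a") as f:
        f.write(note + "\n")

# Example layout
root = Path("cowork_demo")
(root / "skills").mkdir(parents=True, exist_ok=True)
(root / "skills" / "summarize.md").write_text("Summarize files as bullet points.")
append_memory(root / "memory.txt", "User prefers concise answers.")

print(sorted(load_skills(root / "skills")))  # ['summarize']
```

Because everything is plain files, the user can read, edit, and version the agent's skills and memory with ordinary tools, which is much of the trust story.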

STAY UPDATED

Daily builder insights, straight to your inbox.

Prefer RSS? Subscribe via RSS

ARCHIVE
2026-04-10 12 items

Karpathy said free ChatGPT lagged while frontier coding models didn’t. Albert pushed cheap-to-smart escalation, Rauch said cloud infra went agent-native, and OpenAI’s next leap looked like autonomy—not chat.

2026-04-09 16 items

Woodward gave Gemini a second brain with Notebooks, while Anthropic shipped Managed Agents to move Claude from prompt to production. Rauch called the web AI’s native OS, and Levie, Masad, and Shipper all bet agents will do the work, not the people.

2026-04-08 12 items

Albert teased Anthropic’s Mythos Preview, Cat Wu juiced Claude Code’s CLI tricks, and Peter Steinberger patched CodexBar with 2 providers plus billing fixes. Levie said agents are eating knowledge work, while Nikunj Kothari preached retention over launch hype.

2026-04-07 8 items

Levie said agents won’t erase work, just push it up a layer; Yang argued they’ll shrink teams, not ambition. Garry Tan flagged an unpatched file leak in Claude’s coding env, while Kothari called Anthropic’s revenue ramp absurdly fast.

2026-04-06 10 items

Rauch said v0 now builds physics, not just UI, while Karpathy noted GitHub Gists have weirdly good comments. Levie argued AI efficiency creates more work, not less, and Tan called it open source’s golden age.

2026-04-05 4 items

Karpathy pushed “your data, your files, your AI.” Levie argued context beat raw model IQ in enterprise AI. Garry Tan said GStack kept shipping security fixes fast, while No Priors spotlighted Periodic Labs’ bet on atoms, not just text.

2026-04-04 9 items

Claude plugged into Microsoft 365 everywhere, Swyx said Devin one-shot blog-to-code, and Peter Steinberger called out GitHub’s API as still not built for agents. Aaron Levie hit the context wall, while Garry Tan shipped a DX review tool from his own stack.

2026-04-03 10 items

Claude landed computer use on Windows, Karpathy argued LLMs should build your wiki, and Amjad Masad pushed Replit deeper into enterprise sales. Peter Yang said Cursor 3 got out of the agent’s way, while Peter Steinberger warned AI slop was flooding kernel security with real bugs.

2026-04-02 12 items

Steinberger called plan mode training wheels, while Thariq gave Claude Code a mouse-friendly renderer and Cat Wu showed sessions jumping phone-to-laptop. Masad framed Replit as an OS for agents, Rauch said Vercel signups compounded fast, and Anthropic’s infra tweaks swung coding scores by 6 points.

2026-04-01 4 items

Levie said AI productivity hit the enterprise risk wall, while Weil argued proofs got cleaner, not just better. Agarwal floated public source code as the new prod debugging, and Data Driven NYC claimed one founder could run a company if agents handled the layers below.

2026-03-31 15 items

Karpathy warned unpinned deps can turn one hack into mass pwnage, while Rauch and Levie said agents still need human guardrails and redesigned workflows. Meanwhile Claude Code got enterprise auto mode, Replit added built-in monetization, and Swyx spotted “Sign in with ChatGPT” already live.

2026-03-29 7 items

Andrej Karpathy highlighted how LLMs can argue any side, suggesting we use it as a feature. Guillermo Rauch finally shipped his dream text layout, bringing his vision to life. Meanwhile, Amjad Masad claimed AI is democratizing app building and elevating top engineers.

2026-03-28 7 items

Andrej Karpathy suggested leveraging LLMs' ability to argue any side as a feature. Guillermo Rauch turned text layout dreams into reality with Vercel's latest feature. Meanwhile, Amjad Masad claimed AI is democratizing app building, liberating top engineers for bigger challenges.