AI Builders Brief

Follow builders, not influencers.

2026.04.11

25+ builders tracked

TL;DR

Claude pushed into Word with tracked edits, and Claude Code moved planning to the web and gained an auto mode with classifier-based approvals. Garry Tan said agents are still in their Altair BASIC era, while Aaron Levie warned that software without a real API gets left behind.

BUILDER INSIGHTS
12
01
Claude, anthropicai

Claude moves into Word with tracked edits

Claude for Word is now in beta, letting people draft, edit, and revise docs from the sidebar while preserving formatting and showing tracked changes. Anthropic is also tying Word, Excel, and PowerPoint together so Claude can carry context across open documents in one conversation.

02
Peter Yang

China’s AI scene runs on late nights and VPNs

He says Chinese AI work culture is built around 11am-to-11pm days, young teams, and heavy use of US tools like Claude Code through VPNs. The bigger picture: government backing is strong, Beijing is the main hub, and even one-person startups are being pushed with subsidies as youth unemployment stays ugly.

03
Thariq, anthropicai

Claude Code moves planning to the web

Claude Code’s new /ultraplan mode pushes implementation planning into the browser: Claude drafts the plan on the web, you can edit it there, then run it either on the web or back in the terminal. The pitch is simple: planning is mostly about reading code and intent, so it doesn’t need a local interactive loop, and it uses about the same tokens and rate limits as plan mode.

04
Aaron Levie, CEO, box

Software needs a real API or gets left behind

He says enterprise CIOs and AI leaders are converging on one thing: vendors without a solid headless/API mode are at risk in the next 3-5 years. The bigger shift, from the Box CEO’s view, is that software has to be useful to agents as much as to humans — which could force new business models, but also open up way more workflows and revenue.

05
Nikunj Kothari, Partner, fpvventures

VC is mostly timing, conviction, and integrity

He boils early-stage investing down to 11 blunt lessons: get on the flight, don’t trust weekend momentum, and remember that conviction doesn’t come from the data room. His bigger point is that the best founders can argue both sides cleanly, and the best VCs are the ones who keep asking hard questions without pretending they’re smarter than the market.

06
Nan Yu, head of product, linear

Domain envy, not a product take

He just says Skillshare would be a killer domain to own now, which is more flex than insight. Coming from Linear’s head of product, it reads like a quick hit of startup-brain, not something actionable.

07
Amjad Masad, CEO, replit

AI doom ideology can spill into violence

He says the “rationalist” AI-doomer mindset he warned about two years ago is now showing up in real-world violence, pointing to the alleged Sam Altman Molotov attacker as evidence. He also throws in a sharper geopolitical jab: if American enterprise needs saving, maybe China’s open models and Europe’s platform regulation end up doing the job.

08
Dan Shipper, CEO, every

Claude agents land in Every’s first app

He says Every’s first app built with Claude Managed agents is live: @TrySpiral. That’s the real signal here — they’re not just talking about AI workflows, they’re shipping one into a product people can actually use.

09
Zara Zhang

AI fluency means building, not just prompting

She argues the fastest way to understand AI is to become a builder: use coding tools to learn, not just to ship faster. Her other take is more practical than flashy — stop asking models to merely summarize long content and instead have them remix it into formats that surface better insights, like magazine articles or Socratic dialogues. She also sketches the new default workflow as Markdown, CSV/JSON, and HTML replacing the old Word/Excel/PowerPoint stack.

10
Garry Tan, CEO, ycombinator

AI agents are still in the Altair BASIC era

He says the current setup for getting OpenClaw, GBrain, and an LLM knowledge wiki talking to your phone is still annoyingly rough — basically the Altair BASIC phase of agents. But he’s also pointing to a future where a strong PM-style devex review can be automated, so founders can “just do things now” instead of wrestling tooling.

11
Aditya Agarwal, CTO, SouthPkCommons

Free software means instant product rewrites

He says the wild part of "free" software is how fast you can change it: hate the UI, push a new one; performance sucks, refactor the data layer and let automation optimize it. That’s a classic builder take from an ex-Facebook, ex-Dropbox CTO — software stops being a fixed thing and starts behaving more like clay.

12
Matt Turck, FirstMarkCap

Anthropic’s Cowork bets on non-technical agents

He says Claude Cowork is Anthropic’s answer to a simple gap: Claude Code was powerful, but too technical for most people. The conversation digs into why the product was built fast, how it uses VMs, tools, memory, and local files, and why the real bottleneck in an AI-agent world may shift from execution to taste and trust.

BLOG UPDATES
3
Anthropic Engineering

Quantifying infrastructure noise in agentic coding evals

Anthropic: agentic coding scores shift with infrastructure setup

Lead: Anthropic found that infrastructure choices can move agentic coding benchmark scores by as much as the leaderboard gaps people use to rank models, and argues that eval resource settings should be treated as a first-class variable.

Numbers:

  • On Terminal-Bench 2.0, the gap between the most- and least-resourced setups was 6 percentage points.
  • Strict enforcement produced 5.8% infra errors; at 3x headroom, errors fell to 2.1%; uncapped dropped to 0.5%.
  • The strict-to-3x improvement in infra errors was significant (p < 0.001), while score changes from 1x to 3x were within noise (p = 0.40).
  • On SWE-bench, scores rose monotonically with RAM, reaching +1.54 points at 5x baseline.

So What: Resource limits don’t just stabilize evals; past a threshold, they change what the benchmark measures by enabling heavier tools, bigger installs, and more memory-intensive strategies. Anthropic recommends specifying both guaranteed allocation and hard kill threshold per task, and calibrating the band so floor and ceiling scores stay within noise. As the post puts it, “a 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware.” For builders and buyers, the takeaway is blunt: don’t trust small leaderboard deltas unless the eval configuration is documented and matched.
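
To make the recommendation concrete, here is a minimal sketch of a per-task resource band with a guaranteed allocation (floor) and a hard kill threshold (ceiling), plus a check that scores at the two ends stay within noise. The class name, field names, units, and the numbers in the usage note are illustrative assumptions, not anything taken from Anthropic's post or eval harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResources:
    """Hypothetical per-task resource band (names and units are assumptions)."""

    guaranteed_mem_mb: int  # floor: memory the task is always given
    kill_mem_mb: int        # ceiling: memory use at which the task is killed

    def __post_init__(self) -> None:
        # Reject bands that make no sense before any eval runs with them.
        if self.guaranteed_mem_mb <= 0:
            raise ValueError("guaranteed_mem_mb must be positive")
        if self.kill_mem_mb < self.guaranteed_mem_mb:
            raise ValueError("kill threshold must be >= guaranteed allocation")

    @property
    def headroom(self) -> float:
        """Ceiling as a multiple of the floor (3.0 means 3x headroom)."""
        return self.kill_mem_mb / self.guaranteed_mem_mb


def band_is_calibrated(floor_score: float, ceiling_score: float,
                       noise_pts: float) -> bool:
    """True when scores at the floor and ceiling differ by no more than noise."""
    return abs(ceiling_score - floor_score) <= noise_pts
```

For example, `TaskResources(guaranteed_mem_mb=2048, kill_mem_mb=6144).headroom` gives `3.0`. If scores measured at the floor and at the ceiling differ by more than your run-to-run noise, the band is wide enough to change what the benchmark measures.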

Anthropic Engineering

Claude Code auto mode: a safer way to skip permissions

Anthropic adds auto mode to Claude Code with classifier-based approvals

Lead: Anthropic launched auto mode for Claude Code, a new permission mode that skips most manual approval prompts by using model-based classifiers and prompt-injection checks to block dangerous actions while letting routine work run unattended.

Numbers:

  • Manual prompts are accepted 93% of the time, contributing to approval fatigue.
  • On 10,000 real internal tool calls, the full pipeline cut the false-positive rate to 0.4%.
  • On 52 real overeager actions, the deployed pipeline had a 17% false-negative rate.
  • On 1,000 synthetic exfiltration attempts, the full pipeline had a 5.7% false-negative rate.

So What: Auto mode is aimed at teams that want more autonomy than manual approvals but more protection than `--dangerously-skip-permissions`. It uses two defenses: a server-side prompt-injection probe on tool outputs, and a transcript classifier that gates risky actions like force-pushes, credential hunting, external data sharing, and shared-infra changes. Anthropic describes the classifier as “deliberately conservative,” and the system is designed so “the agent shouldn’t be able to hide a dangerous operation behind a benign-looking wrapper.” For builders, the practical takeaway is that routine in-repo edits can flow without friction, while higher-risk actions still get blocked or escalated. Auto mode is not a replacement for careful human review on high-stakes infrastructure, but it is a safer middle ground for everyday coding agents.
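
The two-defense shape described above can be illustrated with a toy gate. This is a rough sketch only: auto mode uses model-based classifiers, while the code below substitutes simple keyword rules, and every name, marker, and pattern in it is a hypothetical stand-in, not Anthropic's implementation.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"


# Hypothetical stand-ins for the classifiers; the real checks are model-based.
INJECTION_MARKERS = ("ignore previous instructions",)
RISKY_PATTERNS = ("git push --force", "git push -f", ".aws/credentials")


def probe_tool_output(output: str) -> bool:
    """Stage 1: return True if a tool output looks like a prompt injection."""
    lowered = output.lower()
    return any(marker in lowered for marker in INJECTION_MARKERS)


def gate_action(command: str, recent_tool_output: str = "") -> Decision:
    """Stage 2: block risky commands, or anything that follows a flagged output."""
    if probe_tool_output(recent_tool_output):
        return Decision.BLOCK
    if any(pattern in command for pattern in RISKY_PATTERNS):
        return Decision.BLOCK
    return Decision.ALLOW
```

Here `gate_action("pytest -q")` is allowed, while `gate_action("git push --force origin main")` is blocked. Keyword rules like these are easy to evade, which is the gap the model-based classifiers are meant to close.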

Claude Blog

Harnessing Claude’s intelligence

Anthropic: Build less around Claude as it gets smarter

Lead: Anthropic says Claude is “grown more than it is built,” and argues agent harnesses should evolve with the model by using what Claude already knows, asking what can be removed, and setting tighter boundaries only where needed.

Numbers:

  • Claude 3.5 Sonnet hit 49% on SWE-bench Verified using only bash and a text editor.
  • On BrowseComp, letting Opus 4.6 filter its own tool outputs improved accuracy from 45.3% to 61.6%.
  • Spawning subagents with Opus 4.6 added 2.8% over the best single-agent runs.
  • On BrowseComp, Opus 4.5 reached 68% and Opus 4.6 reached 84% with the same compaction setup.
  • On BrowseComp-Plus, a memory folder lifted Sonnet 4.5 from 60.4% to 67.2%.

So What: Builders should shift orchestration from the harness to Claude where possible: use general tools like bash, let Claude manage filtering and context, and rely on skills, compaction, subagents, and memory folders for long-horizon work. Anthropic’s core advice is to keep pruning old guardrails as capability improves: “what can I stop doing?” For UX, security, and observability, promote only the actions that truly need dedicated tools or confirmation gates. The practical takeaway: re-test your assumptions at every model step-change, or your harness will become dead weight.
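
As one illustration of the memory-folder idea, the sketch below keeps notes as plain Markdown files in a directory and reads them back as a single text block that a harness could prepend to the model's context. The function names, the one-file-per-topic layout, and the demo notes are assumptions made for this example. The post's actual memory tooling may look quite different.

```python
import tempfile
from pathlib import Path


def remember(memory_dir: Path, topic: str, note: str) -> Path:
    """Append a note as a bullet to <memory_dir>/<topic>.md, creating it if needed."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / f"{topic}.md"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {note}\n")
    return path


def recall(memory_dir: Path) -> str:
    """Concatenate all topic files, sorted by name, under '## <topic>' headings."""
    if not memory_dir.is_dir():
        return ""
    sections = []
    for path in sorted(memory_dir.glob("*.md")):
        sections.append(f"## {path.stem}\n{path.read_text(encoding='utf-8')}")
    return "\n".join(sections)


def demo() -> str:
    """Write two notes to a temporary memory folder and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        memory_dir = Path(tmp) / "memory"
        remember(memory_dir, "build", "tests live in tests/")
        remember(memory_dir, "build", "CI uses Python 3.11")
        return recall(memory_dir)
```

Because the memory is just text on disk, a person can read, edit, or delete it with any editor, with no separate storage layer to manage.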

PODCAST HIGHLIGHTS
1

AI’s real bottleneck is trust, not intelligence

The Takeaway: The next software leap won’t come from smarter models alone, but from making them safe, local, and easy to trust.

  • Felix Rieseberg says Anthropic’s new Mythos preview feels like a real step-function jump, especially at finding security flaws and writing code, but the bigger surprise is how much product work still sits around the model.
  • His contrarian take: the bottleneck isn’t raw capability anymore; it’s packaging, onboarding, and letting AI operate where people already work — on their laptops, files, and browsers.
  • Cowork’s “secret sauce” is almost embarrassingly simple: a virtual machine, text-file skills, and memory stored as instructions, not some magical database layer.

Rieseberg, who leads engineering for Claude Cowork at Anthropic after stints at Slack, Stripe, and Notion, comes at AI like a product engineer obsessed with how real people actually work. His point is that models are now good enough to handle long, messy, multi-step tasks — the hard part is turning that power into something humans will actually use without babysitting it.

That’s why he keeps coming back to local-first design. “I have a strong belief that the data that is relevant for your work probably lives in two different places,” he says: on your computer and in the cloud. For him, asking users to upload everything to a remote system is both a trust problem and a practical mess, especially when banks, logins, and security checks get involved.

Cowork reflects that philosophy. It gives Claude its own sandboxed computer, lets users define skills in plain markdown, and stores memory as text files. The result is less sci-fi than it sounds — and more useful. As Rieseberg puts it, “most of the buttons you add and most of the product services you build are probably more for the human than they are for the model.”
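
To show how little machinery a plain-markdown skill needs, here is a small sketch that parses one skill from Markdown text. The file convention (a `# <name>` heading followed by free-form instructions) and the `Skill` type are assumptions for illustration. The source does not describe Cowork's actual skill format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical skill: a name plus plain-text instructions for the model."""

    name: str
    instructions: str


def parse_skill(markdown: str) -> Skill:
    """Parse a skill from Markdown text that starts with a '# <name>' heading."""
    lines = markdown.strip().splitlines()
    if not lines or not lines[0].startswith("# "):
        raise ValueError("skill text must start with a '# <name>' heading")
    name = lines[0][2:].strip()
    # Everything after the heading is passed to the model verbatim.
    body = "\n".join(lines[1:]).strip()
    return Skill(name=name, instructions=body)
```

For example, `parse_skill("# Expense report\n\nCollect receipts.\nSum totals.\n")` returns a `Skill` named `Expense report` whose instructions are the two lines after the heading.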
