Instead of lab benchmarks, Anthropic analyzed millions of real-world interactions across its public API and Claude Code.
The key findings:
The longest autonomous Claude Code sessions nearly doubled from under 25 minutes to over 45 minutes between October 2025 and January 2026. But the median session is still just 45 seconds.
New users auto-approve about 20% of agent actions. Experienced users with 750+ sessions approve over 40%. But here's the interesting part: those same experienced users also interrupt the agent more frequently. They're not blindly trusting it. They've learned when to step back and when to step in.
Software engineering accounts for about 50% of all agentic tool calls on the API, but Anthropic is now seeing early usage in healthcare, finance, and cybersecurity. Only 0.8% of observed actions appeared irreversible, like sending customer emails or processing financial transactions.
Why this matters: There's a significant gap between what agents can do and what users let them do. Anthropic calls it a "deployment overhang": the autonomy that models are capable of exceeds what users grant in practice. That gap is where governance, trust, and organizational design live, and it's exactly what I've been seeing in enterprise conversations.
Also this week:
xAI launched Grok 4.20 Beta, and it's architecturally different from anything that came before it. Instead of a single model answering your question, Grok 4.20 uses four specialized AI agents (Grok, Harper, Benjamin, Lucas) that work in parallel, debate each other, fact-check, and synthesize a final answer. The Heavy tier scales this to 16 agents. It's the first major consumer-facing implementation of multi-agent collaboration at this scale.
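For the architecturally curious, the pattern itself is simple to sketch. This is not xAI's implementation, just an illustration of the parallel-debate-synthesize loop; the agent names are from the announcement, while ask_model is a stub standing in for whatever model call a real system would make.

```python
# Illustrative parallel-debate-synthesize loop. Not xAI's implementation;
# ask_model is a stub standing in for a real model API call.
from concurrent.futures import ThreadPoolExecutor

AGENTS = ["Grok", "Harper", "Benjamin", "Lucas"]

def ask_model(agent: str, prompt: str) -> str:
    # Stub: a real system would call a model API here.
    return f"[{agent}'s answer to: {prompt}]"

def debate(question: str) -> str:
    # Round 1: each agent drafts an answer in parallel.
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(lambda a: ask_model(a, question), AGENTS))

    # Round 2: each agent critiques and fact-checks the others' drafts.
    critiques = [
        ask_model(agent, f"Critique and fact-check these answers: {drafts}")
        for agent in AGENTS
    ]

    # Final round: one synthesis pass over drafts and critiques.
    return ask_model("synthesizer", f"Combine into one answer: {drafts} {critiques}")

print(debate("What changed in agent autonomy this quarter?"))
```

The interesting design question is cost: every extra debate round multiplies model calls, which is presumably why the 16-agent version is a separate "Heavy" tier.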
When "Supervising AI" Replaces "Writing Code"
Two stories dominated my feed this week. Spotify's engineers haven't written code by hand since December. And Anthropic's research shows that experienced users let agents work longer but also interrupt them more.
On the surface, these look like separate stories. One about productivity. One about safety research.
But if you're running a technology team or advising one, which I do every week, it's the same story.
They're both about the same shift: humans moving from doing the work to supervising the system that does the work.
What Spotify Actually Did
Of the two, what Spotify did is probably more significant.
They spent over a year building infrastructure: a system called Backstage that catalogs every component, tracks ownership, and standardizes how software gets built. On top of that, they deployed Claude Code through an internal platform called Honk.
The result: their best engineers now describe what they want done, the agent does it, and they review the output.
An engineer on their morning commute tells Claude via Slack to fix a bug. Claude writes the code, tests it, and pushes a new app build back to the engineer's phone. The engineer reviews and merges it into production. Before arriving at the office.
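To make that loop concrete, here's a minimal sketch of what a Slack-triggered agent workflow might look like. This is not Spotify's Honk; the endpoint, channel fallback, and prompt are hypothetical, and it assumes the Anthropic Python SDK, slack_sdk, and Flask.

```python
# Hypothetical sketch of a Slack-triggered "fix this bug" agent loop.
# Not Spotify's Honk; endpoint, prompt, and channel are illustrative.
import os

from anthropic import Anthropic
from flask import Flask, jsonify, request
from slack_sdk import WebClient

app = Flask(__name__)
claude = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

@app.route("/slack/events", methods=["POST"])
def handle_bug_report():
    event = request.json.get("event", {})
    bug_report = event.get("text", "")

    # Ask the model for a proposed fix. A real pipeline would also give it
    # repository access, run the test suite, and build the app.
    response = claude.messages.create(
        model="claude-sonnet-4-5",  # assumed model name for the sketch
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Propose a minimal patch for this bug report:\n{bug_report}",
        }],
    )
    proposed_patch = response.content[0].text

    # Post the proposal back to Slack. The human still owns the merge.
    slack.chat_postMessage(
        channel=event.get("channel", "#bug-triage"),
        text=f"Proposed patch (review before merging):\n{proposed_patch}",
    )
    return jsonify(ok=True)
```

The shape of the loop is the point: the agent drafts, the human reviews, and nothing merges without a person.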
That's not hype. That's a workflow change. And it only works because Spotify spent the time mapping its codebase, clarifying ownership, and building the infrastructure to let agents operate safely.
Most companies skip that part.
What Anthropic's Data Actually Shows
Anthropic's study confirms something I've observed in a different context: enterprises working with GenAI agents.
Trust isn't binary. It builds gradually.
New users start cautiously.
Experienced users grant more autonomy. But experienced users also develop better judgment about when to intervene. Anthropic's internal data showed success rates on hard tasks doubled while human interventions per session dropped from 5.4 to 3.3.
That's not humans checking out. That's humans getting better at supervising.
But here's what keeps me up at night. Only 0.8% of observed agent actions were irreversible. That sounds comforting until you do the math: at API scale, 0.8% of every million actions is 8,000 emails sent, records updated, or transactions processed without a clear rollback path.
The Real Question for Your Organization
Most of the leaders I talk to are still asking: "Should we let AI write code for our team?"
That's the wrong question. The real question is: "Do we have the infrastructure for our people to supervise AI effectively?"
Because supervision isn't just watching. It requires:
Mapped systems. Spotify couldn't deploy Honk without Backstage. You can't let an agent loose on systems you haven't cataloged. Do you know what every AI-accessible service in your org does, who owns it, and what data it touches?
Clear ownership. When the agent produces output, someone has to make the decision to ship it. Not a committee. Not "the AI team." A person. Spotify's engineers own the merge. Who owns the merge in your org?
Intervention skills. Anthropic's data shows experienced users interrupt more, not less. That's a skill. Your team needs to know what to look for, not just how to prompt. Are you training your people to review AI output, or just to generate it? (A minimal sketch of what an intervention gate could look like follows below.)
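To ground those three requirements, here's a minimal sketch of an intervention gate: a wrapper that auto-approves reversible actions and routes irreversible ones to a named human. Everything here is illustrative; the action categories and the is_irreversible rule are assumptions for the sketch, not Anthropic's or Spotify's implementation.

```python
# Minimal, illustrative intervention gate for agent actions.
# Categories and policy are assumptions, not any vendor's implementation.
from dataclasses import dataclass

# Actions we treat as having no clean rollback path (cf. the 0.8% figure).
IRREVERSIBLE = {"send_email", "process_payment", "delete_record"}

@dataclass
class AgentAction:
    kind: str         # e.g. "send_email", "edit_draft"
    description: str

def is_irreversible(action: AgentAction) -> bool:
    return action.kind in IRREVERSIBLE

def execute(action: AgentAction) -> None:
    print(f"executing: {action.kind} -- {action.description}")

def supervise(action: AgentAction) -> None:
    """Auto-approve reversible actions; stop and ask a human otherwise."""
    if not is_irreversible(action):
        execute(action)
        return
    answer = input(f"IRREVERSIBLE: {action.description}. Approve? [y/N] ")
    if answer.strip().lower() == "y":
        execute(action)
    else:
        print("blocked by human reviewer")

if __name__ == "__main__":
    supervise(AgentAction("edit_draft", "rewrite the summary paragraph"))
    supervise(AgentAction("send_email", "send refund confirmation to customer"))
```

Notice where the ownership lives: in supervise, with one person making one decision, and the irreversible set reviewed as deliberately as the code itself.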
Something You Can Do This Week
Pick one AI workflow your team is currently running or planning. It doesn't have to be coding. It could be document generation, data analysis, customer response drafting, or anything where AI produces output that a human approves.
Now run this diagnostic:
Map the scope. Write down everything the AI touches in this workflow. Every system, every data source, every output channel. If you can't list it in under five minutes, you don't understand your own workflow well enough to supervise it.
Define "wrong." Write down the three most likely ways this workflow produces a bad outcome. Not catastrophic failure, mundane mistakes: wrong data pulled, an outdated template used, confidential info included in a draft that gets forwarded. For each one, write down how you'd know it happened and how long it would take to find out.
Time your response. From the moment the AI produces a bad output to the moment a human catches it, how long is that window? Minutes? Hours? Days? If you don't know, that's your first problem to solve, because the difference between "AI-assisted" and "AI-unsupervised" is the length of that window. (See the sketch after this list for a way to start measuring it.)
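If you want to measure that window rather than guess at it, here's a minimal sketch: log a timestamp when the AI produces output and another when a human reviews it, then report the gap. The event names and in-memory storage are placeholders; a real version would write to your logging pipeline. The point is that the window only becomes manageable once it's a number.

```python
# Illustrative detection-window logger. Event names and in-memory storage
# are placeholders; a real version would use your logging pipeline.
from datetime import datetime, timezone

events: dict[str, dict[str, datetime]] = {}

def log_ai_output(output_id: str) -> None:
    events.setdefault(output_id, {})["produced"] = datetime.now(timezone.utc)

def log_human_review(output_id: str) -> None:
    events.setdefault(output_id, {})["reviewed"] = datetime.now(timezone.utc)

def detection_window(output_id: str) -> float | None:
    """Seconds between AI output and human review; None if unreviewed."""
    e = events.get(output_id, {})
    if "produced" in e and "reviewed" in e:
        return (e["reviewed"] - e["produced"]).total_seconds()
    return None  # unreviewed output is the real risk

# Usage: instrument one workflow, then watch the distribution.
log_ai_output("draft-42")
log_human_review("draft-42")
print(detection_window("draft-42"))
```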
I've started running this exercise with customers.
It takes 30 minutes. And it consistently reveals that the gap between "we use AI" and "we supervise AI well" is wider than anyone assumed.
The organizations that close that gap are the ones that will look like Spotify in a year.
The ones that don't will be the ones still debating whether to start.
GenAI Maturity Framework:
The GenAI Maturity Portal is live at GenAIMaturity.Net. You can try out Maturity Assessments and access resources to benchmark where your organization stands. Content is being reviewed and added frequently.
MIT AI Agent Safety Index: Researchers evaluated 30 of the most deployed AI agents. Only 4 of 13 frontier agents disclose safety evaluations. 25 of 30 provide no details about safety testing. The report highlights the industry's governance gap.
Deloitte's Agentic AI Strategy Report warns of "agent washing," where vendors rebrand existing automation as agents, and "workslop," where poorly designed agents actually add work to processes. It recommends treating agents like workers: with onboarding, governance, and performance metrics. source
The Journal of Accountancy published a practical guide on AI workflow automation for client accounting services. The use cases go beyond chatbots: intelligent forms for client onboarding, AI-powered invoice processing that extracts details from PDFs and emails, automated categorization that learns client-specific rules, and AI-generated monthly reporting packages with tailored commentary based on variances and KPIs.
Their recommendation: pick one high-friction workflow (accounts payable, monthly reporting, document requests), run a 30-60 day pilot, and measure time savings, error reduction, and client feedback before expanding.
PwC's latest predictions outline a practical organizational model for scaling AI: a centralized "AI Studio" that brings together reusable tech components, frameworks for assessing use cases, a sandbox for testing, deployment protocols, and skilled people. Senior leadership picks focused areas where AI payoffs can be largest, then applies talent, technical resources, and change management through this hub.
If you're figuring out how to move from scattered AI pilots to coordinated deployment, this is a useful blueprint.
Anthropic Study: How AI Assistance Impacts the Formation of Coding Skills
In a study published January 29, Anthropic found that developers using AI assistance scored 17% lower on coding comprehension tests despite completing tasks slightly faster. This is one of the most important studies for anyone leading technical teams: the productivity gains are real, but the skill-building trade-offs need active management. Read the study
Things to Know...
Dubai’s Human-Machine Collaboration Icons (HMC)
Dubai Future Foundation, under the guidance of His Highness Sheikh Hamdan bin Mohammed bin Rashid Al Maktoum, has introduced the world’s first Human–Machine Collaboration (HMC) icon system. This visual framework allows creators to declare the level of AI involvement, ranging from “All Human” to “All Machine,” and identify specific content stages where AI contributed, such as ideation, data analysis, writing, visuals, and more.
Implementation is mandatory for all Dubai government entities, while creators worldwide are encouraged to adopt the icons voluntarily for transparency and accountability.
My Take
The HMC icons are more than labels; they’re a trust layer. As GenAI becomes ubiquitous in content creation, everyone needs clarity, not catchy slogans. These icons deliver that clarity: simple, standardized, and scalable.
Therefore, AI Tech Circle has started adopting HMC icons across this newsletter. I am committed to declaring human vs. AI involvement explicitly.
The Difference Between "Using AI" and "Supervising AI"
Most organizations say they're using AI. Very few are supervising it. The difference matters:
Using AI means generating outputs (drafts, summaries, code snippets) and manually checking them. It's productivity. It works fine at small scale.
Supervising AI means building the infrastructure and processes so that AI can act semi-autonomously, with humans intervening only when needed and catching failures fast.
That's transformation.
And it requires mapped systems, clear ownership, and trained human judgment, not just better prompts.
If your team is generating output with AI but doesn't have a clear answer for "what happens when the AI gets it wrong and nobody notices for 48 hours," you're using AI. You're not supervising it yet.
The Opportunity...
Podcast:
This week's Open Tech Talks episode 182 is "Why 95% of AI Pilots Fail and How to Be in the 5% with Mindaugas Maciulis". He works with CEOs, COOs, and operating partners in the $20M–$250M range who are ready to go beyond pilots and turn AI into real EBITDA growth. His proven 90-day sprint framework, AImpact OS, delivers measurable lifts across productivity, customer service, and sales.
Tools:
Emergent: Full-stack AI workflow builder with deep observability, designed for production-grade agentic systems, not just simple automations. Tracks every AI decision and system interaction.
n8n: Open-source workflow automation with AI agent capabilities. Build complex multi-step processes that combine AI, APIs, and business logic. Self-hostable.
Vellum AI: Prompt orchestration platform for teams that want structured LLM experimentation, evaluation, and controlled deployment into production workflows.
The Investment in AI...
Anthropic closed a $30B Series G round at $380B valuation, the largest private AI funding round, driven by enterprise agent demand.
Resolve AI reached a $1B valuation with $125M in funding, building AI agents that detect and fix issues in live software systems.
That’s it for this week - thanks for reading!
Reply with your thoughts or favorite section.
Found it useful? Share it with a friend or colleague to grow the AI circle.
Until next Saturday,
Kashif
The opinions expressed here are solely my conjecture based on experience, practice, and observation. They do not represent the thoughts, intentions, plans, or strategies of my current or previous employers or their clients/customers. The objective of this newsletter is to share and learn with the community.