Coding Weekly AI News
June 23 - June 30, 2025

A major study released this week revealed significant challenges for AI agents. Researchers at Carnegie Mellon University in the United States created a benchmark called TheAgentCompany to measure how well AI agents handle office tasks such as coding and web searches. The results showed the agents failing roughly 70% of the time, completing only about 34% of tasks successfully even after six months of improvements. Graham Neubig, one of the researchers, explained that the team started the project after seeing claims about jobs being automated by AI. The real-world tasks proved much harder for the agents than expected.
In related news, the research and advisory firm Gartner released predictions suggesting over 40% of agentic AI projects will be cancelled by 2027. The firm identified widespread "agent washing" - a practice in which companies rebrand existing products such as chatbots as advanced AI agents without adding genuinely new capabilities. Gartner estimates that only about 130 legitimate agentic AI vendors exist among the thousands claiming to offer such technology, an indication that the industry faces significant hurdles in delivering functional AI agents.
Despite these challenges, simpler AI coding tools continue to help small businesses. These tools let small companies build software without full engineering teams: staff describe the desired functionality in plain language instead of writing complex code, and the tool generates it (see the sketch below). For example, SleekFlow uses AI to help businesses with limited technical resources improve customer communications. This approach speeds up development and lets small players "do more with less" while competing with larger companies.
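To make the "plain language instead of code" idea concrete, here is a minimal sketch of what such a workflow can look like under the hood, assuming a general-purpose LLM API. The OpenAI Python SDK, the model name, and the example prompt below are illustrative assumptions, not details of SleekFlow or any other product covered in this story.

```python
# Illustrative sketch only: a plain-language request sent to an LLM API
# that returns generated code. Model name and prompt are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

request = (
    "Write a Python function that takes a list of customer messages and "
    "returns the ones that mention a refund, case-insensitively."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name, for illustration
    messages=[
        {"role": "system", "content": "You are a coding assistant. Reply with Python code only."},
        {"role": "user", "content": request},
    ],
)

# The generated code still needs to be reviewed and tested before use.
print(response.choices[0].message.content)
```

Even in a trivial case like this, someone still has to read, test, and maintain whatever code comes back, which is where the limits discussed next come in.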
However, experts caution there are limits to how far non-technical users can go with AI coding tools alone. Ilia Badeev of Trevolution Group noted it's "quite difficult to write a fully functional solution that will scale" without engineering skills. This aligns with the Carnegie Mellon findings showing current AI agents struggle with complex tasks. The gap between theoretical promises and practical performance remains substantial for sophisticated agentic AI.
The market offers various AI coding assistants, such as Devin and Cursor, that show promise in controlled environments. Devin works as a fully autonomous coding agent in a sandboxed environment, handling tasks ranging from creating websites to deploying machine learning models. Cursor provides an AI-augmented editor with an "agent mode" that attempts to meet high-level goals by generating and editing files. While these tools demonstrate progress, the recent study suggests real-world effectiveness still lags behind marketing claims.
Looking ahead, the researchers expressed disappointment that major AI labs haven't adopted their rigorous benchmark, possibly because it reveals performance gaps. The slow improvement rate - from 24% to 34% success over six months - highlights the technical challenges. As Meta proceeds with its plans to release engineer-level AI assistants this year, the Carnegie Mellon study provides a crucial reality check on current limitations. The next year will show whether these hurdles can be overcome.