In the fast-moving landscape of artificial intelligence, each new large language model release from organizations like OpenAI, Google, and Anthropic draws intense anticipation and scrutiny. A pivotal element in this discourse is a graph produced by the nonprofit Model Evaluation and Threat Research (METR), which has gained prominence since its debut last March. The graph charts a trend suggesting that certain AI capabilities are advancing at an exponential rate, a claim reinforced by recent releases such as Anthropic’s Claude Opus, which reportedly exceeded prior expectations. According to METR, Claude Opus was able to complete tasks that typically take humans about five hours, a notable leap beyond earlier models.

However, interpreting the graph is less straightforward than it may seem. METR’s assessments carry significant uncertainty, reflected in the error margins of its evaluations: while Opus may be able to complete tasks that take humans around five hours, the lower end of the estimate is closer to two hours. Moreover, the graph measures models primarily on coding tasks, which is not universally accepted as a proxy for broader AI capability. Sydney Von Arx, a member of METR’s technical team, notes that many observers read more into the graph than it supports; it does not attempt to gauge the overall capabilities of AI systems.

METR’s mission is to evaluate the risks posed by frontier AI systems, and while the exponential trend graph has built its reputation, the organization acknowledges the complexity of its findings. As Thomas Kwa, one of the graph’s lead authors, explains, the y-axis shows the ‘time horizon’ metric, a measure METR developed to indicate how long a task can be, in human terms, for a model to still complete it successfully. The metric invites misconceptions, the most common being that the y-axis denotes how long a model can operate independently; in reality, it reflects how long it takes humans to complete the tasks a model can handle. With ongoing efforts to clarify these nuances, the METR team hopes to refine public understanding while acknowledging that the excitement surrounding AI development can often overshadow critical context.
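
To make the distinction concrete, the sketch below shows one way a time-horizon-style number could be estimated from task results: each task has a human completion time and a pass/fail outcome for the model, success is modeled as a logistic function of log task length, and the horizon is taken as the task length at which predicted success drops to 50%. The task data is invented and the 50% threshold and fitting procedure are illustrative assumptions, not a description of METR's actual methodology.

```python
# Illustrative sketch of a "time horizon"-style calculation.
# The data below is made up; METR's real task suite and methodology
# are more involved than this.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical results: how long each task takes a human (minutes),
# and whether the model completed it successfully (1) or not (0).
human_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 300, 480, 600])
model_success = np.array([1, 1, 1,  1,  1,   1,   0,   1,   0,   0])

# Fit success probability against log task length, then find the task
# length at which predicted success falls to 50% -- one way to summarize
# a model's capability as a single "time horizon" figure.
X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_success)
horizon_minutes = np.exp(-clf.intercept_[0] / clf.coef_[0][0])

print(f"Estimated 50% time horizon: ~{horizon_minutes:.0f} human-minutes")
```

The point of the sketch is what the number summarizes: the horizon is a property of the tasks (how long they take people), not a clock on how long the model itself runs, and the uncertainty in fits like this one is what shows up as the error margins in METR's published estimates.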


Source: This is the most misunderstood graph in AI via MIT Technology Review