Plastic Man Benchmark Fail

Tom Spencer · Category: stories_and_anecdotes

When tested on the BrowseComp benchmark prompt about a 1960s fictional character, many advanced models, including Claude 4.5 and GPT-4.5, failed before the browser extension succeeded.