Evaluating Claude Opus 4.5

Dec 17, 2025

Claude Opus 4.5 came out two days ago, so we benchmarked Opus 4.5, Sonnet 4.5, and Gemini 3 Pro on research tasks at Elicit: extracting answers from papers and writing systematic review reports.

For question-answering and data extraction from papers, Opus 4.5 is the new state of the art.

  • Opus 4.5 has 96.5% accuracy vs Gemini's 89.4%.

  • Opus is also best on our combined "accurate, supported, and direct" metric (76% vs 71%; sketched below).

  • Gemini is slightly better on claim supportedness (a measure of hallucination).
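For readers curious what a combined metric like this looks like mechanically, here is a minimal sketch: an answer only counts toward the combined score if it passes all three checks at once, which is why the combined number (76%) sits well below raw accuracy (96.5%). The `Judgement` fields and the sample data below are illustrative assumptions, not Elicit's actual evaluation harness.

```python
# Hypothetical sketch of a combined "accurate, supported, and direct"
# metric. Field names and sample data are assumptions for illustration,
# not Elicit's real evaluation code.

from dataclasses import dataclass

@dataclass
class Judgement:
    accurate: bool   # answer matches the gold extraction
    supported: bool  # every claim is backed by the cited passage
    direct: bool     # answers the question without hedging or filler

def combined_score(judgements: list[Judgement]) -> float:
    """Fraction of answers that pass all three checks simultaneously."""
    passing = sum(j.accurate and j.supported and j.direct for j in judgements)
    return passing / len(judgements)

# Example: 3 of 4 answers pass every check -> 75%
sample = [
    Judgement(True, True, True),
    Judgement(True, True, True),
    Judgement(True, False, True),  # correct but not grounded in the paper
    Judgement(True, True, True),
]
print(f"{combined_score(sample):.0%}")  # 75%
```

Because the checks are conjunctive, a model can lead on accuracy while trailing on the combined metric if its answers are less well grounded or less direct.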

For report writing, Opus 4.5 produces significantly better-supported reports than Sonnet 4.5, the previous best model for this task:

  • 62% of Opus' claims were well-supported vs Sonnet's 54%.

  • Only 31% of Opus' claims were poorly-supported vs Sonnet's 40%.

  • Opus is less verbose and writes approximately 20% fewer claims per report.

We didn't include Gemini in the report-writing comparison, since Sonnet 4.5 already wins 75% of head-to-head comparisons against Gemini, and Gemini is 6x slower than Sonnet.

We also did deeper manual comparisons of five reports and found that Opus and Sonnet reach the same conclusions, with no dramatic differences in output. Sonnet writes longer reports with more extensive commentary by default.

Opus 4.5 is now live for Elicit users on Pro, Teams, and Enterprise.
