MMMU Benchmark


Key Takeaways

  • GPT-5 delivers the highest performance on this benchmark, very narrowly edging out Gemini 2.5 Pro Exp.
  • Models have now surpassed the performance of the worst human experts (76.2%) but are still below the level of the best (88.6%). However, progress on this benchmark appears to be approaching an asymptotic limit.
  • Most of the top models are reasoning models. Generally, the reasoning models take noticeably longer than their non-reasoning counterparts.

Results

According to the original MMMU research, human expert performance ranges from 76.2% for the worst-performing experts to 88.6% for the best-performing experts. Our evaluation shows that the leading AI models are now approaching the lower bound of human expert performance but remain substantially below the upper bound.

MMMU Performance vs. Cost



Dataset and Context

The Massive Multi-discipline Multimodal Understanding (MMMU) benchmark follows a similar methodology to its predecessor, MMLU, but its multiple-choice questions include both text and images. MMMU encompasses over 1,000 high-quality tasks spanning 30 subjects across 6 major disciplines:

  • Arts & Design
  • Business
  • Science
  • Health & Medicine
  • Humanities & Social Sciences
  • Tech & Engineering

We based this benchmark on the standard 4-option multiple-choice format, covering approximately 1,700 questions from the official Hugging Face dataset. The benchmark focuses specifically on how well models can process and reason about problems in which images are interleaved with text, requiring sophisticated visual understanding and cross-modal reasoning capabilities.
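
The snippet below sketches one way to pull that multiple-choice subset from the public MMMU/MMMU dataset on Hugging Face. The subject configs, field names, and split follow the public dataset card and are assumptions here; the exact filtering in our harness may differ.

    # Sketch: load MMMU from Hugging Face and keep the 4-option multiple-choice items.
    # Config names, field names, and the split follow the public MMMU/MMMU dataset
    # card; the filtering used in our actual harness may differ.
    import ast
    from datasets import get_dataset_config_names, load_dataset

    questions = []
    for subject in get_dataset_config_names("MMMU/MMMU"):      # 30 subject configs
        rows = load_dataset("MMMU/MMMU", subject, split="validation")
        for row in rows:
            options = ast.literal_eval(row["options"])          # options stored as a stringified list
            if row["question_type"] == "multiple-choice" and len(options) == 4:
                questions.append({
                    "subject": subject,
                    "question": row["question"],
                    "options": options,
                    "answer": row["answer"],
                })

    print(f"{len(questions)} four-option multiple-choice questions loaded")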

MMMU is particularly valuable because it tests the models’ abilities to solve graduate-level questions where visual information is critical to finding the correct answer.

Methodology

We adhered closely to the official MMMU evaluation protocol with the following implementation details:

  1. Prompt Structure: We used the chain-of-thought prompt from the original MMMU repository (a sketch of how it is assembled appears after this list). Each question-answer set followed this format:
Which of the following best explains the overall trend shown in the <image>?
A. Migrations to areas of Central Asia for resettlement
B. The spread of pathogens across the Silk Road
C. Invasions by Mongol tribes
D. Large-scale famine due to crop failures
Answer the preceding multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of options. Think step by step before answering.
  2. Image Processing: All <image> tags were replaced with the actual image bytes during model inference, replicating the original methodology exactly (see the sketch after this list).

  3. Standardization: We established a consistent baseline by using a default configuration of 8,192 maximum output tokens for all models to ensure that outputs were not truncated. All models were run with a temperature of 0.

  4. Parsing Adaptation: We modified the answer-extraction regex to handle markdown output from some models, ensuring reliable parsing across all responses (see the sketch after this list).

  5. Statistical Validity: This was a pass@1 evaluation (one attempt per question) over the full set of approximately 1,700 questions. We found a standard error of approximately 1%, calculated using the methodology from Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations (a back-of-the-envelope check appears after this list).

  6. Thinking Models Configuration: For models with “thinking” capabilities (Claude 3.7 Sonnet Thinking, o1, etc.), we set the maximum output token limit to 16,384 to accommodate extended reasoning.
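
For step 1, this sketch shows how a question and its four options can be assembled into the prompt format above. The helper name and inputs are illustrative, not our exact harness code.

    # Sketch for step 1: build the chain-of-thought prompt from a question and
    # its four options. The helper name and inputs are illustrative.
    INSTRUCTION = (
        "Answer the preceding multiple choice question. The last line of your response "
        "should be of the following format: 'Answer: $LETTER' (without quotes) where "
        "LETTER is one of options. Think step by step before answering."
    )

    def format_prompt(question: str, options: list[str]) -> str:
        lines = [question]                                       # may contain <image> placeholders
        lines += [f"{letter}. {text}" for letter, text in zip("ABCD", options)]
        lines.append(INSTRUCTION)
        return "\n".join(lines)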
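
For step 2, this sketch splices base64-encoded image bytes into the prompt wherever an image placeholder appears, using an OpenAI-style content array. The exact tag format and request schema vary by dataset field and provider, so treat the details as assumptions.

    # Sketch for step 2: replace <image N> placeholders with the corresponding
    # image bytes, interleaving text and image parts in one user message.
    # The content-array format shown is OpenAI-style; other providers differ.
    import base64
    import re

    def build_user_message(prompt: str, images: dict[int, bytes]) -> dict:
        content = []
        for part in re.split(r"(<image \d+>)", prompt):          # keep the tags when splitting
            tag = re.fullmatch(r"<image (\d+)>", part)
            if tag:
                b64 = base64.b64encode(images[int(tag.group(1))]).decode()
                content.append({"type": "image_url",
                                "image_url": {"url": f"data:image/png;base64,{b64}"}})
            elif part.strip():
                content.append({"type": "text", "text": part})
        return {"role": "user", "content": content}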
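
For step 4, a simplified pattern of the kind involved: it tolerates markdown emphasis around the final answer line (e.g. "**Answer: C**") and takes the last match in the response. This is illustrative, not the exact regex we used.

    # Sketch for step 4: extract the final answer letter, tolerating markdown
    # wrappers such as "**Answer: C**" or "Answer: *C*".
    import re

    ANSWER_RE = re.compile(r"answer\s*:\s*\**\s*([ABCD])\b", re.IGNORECASE)

    def extract_answer(response: str) -> str | None:
        matches = ANSWER_RE.findall(response)
        return matches[-1].upper() if matches else None          # last occurrence wins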
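
For step 5, a back-of-the-envelope check of the ~1% figure: treating each question as an independent pass/fail trial, the standard error of the mean score is sqrt(p(1-p)/n), which lands near 1% for n ≈ 1,700. The cited methodology handles question-level structure more carefully, so this is only an approximation.

    # Sketch for step 5: binomial approximation of the standard error of a
    # pass@1 score over n independent questions. Accuracies are illustrative.
    from math import sqrt

    n = 1700                            # approximate number of questions
    for p in (0.70, 0.80):              # plausible model accuracies
        se = sqrt(p * (1 - p) / n)
        print(f"accuracy {p:.0%}: standard error ~ {se:.1%}")    # ~1.1% and ~1.0%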
