Task Type:

MortgageTax Benchmark

Last updated

1

Claude 3.7 Sonnet (Nonthinking)

★

80.6%

$3.00 / $15.00

5.67 s

2

GPT 5 Mini

$

79.2%

$0.25 / $2.00

24.85 s

3

Claude 3.7 Sonnet (Thinking)

79.2%

$3.00 / $15.00

13.58 s

4

GPT 5

79.0%

$1.25 / $10.00

62.83 s

5

GPT 4.1

⚡︎

79.0%

$2.00 / $8.00

5.03 s

6

Gemini 2.5 Pro Exp

78.8%

$1.25 / $10.00

8.91 s

7

o3

78.4%

$2.00 / $8.00

21.22 s

8

Claude 3.5 Sonnet Latest

78.1%

$3.00 / $15.00

4.22 s

9

GPT 4.1 Mini

77.7%

$0.40 / $1.60

4.60 s

10

o4 Mini

77.1%

$1.10 / $4.40

13.00 s

Task type :

★Best Performing

$Best Budget

⚡︎Best Speed

Reasoning Model

Partners in Evaluation

Key Takeaways

Anthropic’s Claude 3.7 Sonnet (Nonthinking) is the best performing model overall 80.6% accuracy, excelling in both semantic and numerical extraction tasks.
The benchmark tests the multimodal capabilities of models, as the documents are provided as images, including both computer-written and handwritten parts.
The best models perform similarly on the Semantic Extraction task, while the gaps are more pronounced in the Numerical Extraction task.

Dataset and Context

The MortgageTax benchmark evaluates the ability of language models to extract information from mortgage tax certificates. Vontive provided both the mortgage documents and the labeled queries, which we used as a foundation for the dataset.

When prompting the models, we provide the documents as images, testing their multimodal capabilities. We test them on two tasks:

Semantic Extraction: We ask the models to extract the year, the parcel number, and the county from the tax certificate.
Numerical Extraction: We ask the models to calculate the annualized amount due based on the tax certificate.

For both tasks, the models are asked to reply in a JSON format containing both an explanation and the final answer for each question. Not all models follow this format precisely, so some regex pattern matching is used to extract the final answers.

The dataset consists of 1258 documents, divided into three sets:

Public Validation (20 samples): Publicly accessible, available on request.
Private Validation (300 samples): Available for purchase to evaluate your own models/agents on.
Test (938 samples): A hold-out set that we never share externally. Only this set is used for the numbers reported on this page.

To receive access to the Public or Private Validation sets, please reach out to us at contact@vals.ai.

Results

MortgageTax

/

MortgageTax Input Example

Below is a typical example of the images passed to the models. For confidentiality reasons, we redacted the names and addresses of the owners and properties.

Page 1 of Mortgage Tax Page 2 of Mortgage Tax Page 3 of Mortgage Tax

Here are examples of the questions asked of the models.

Numerical Extraction:

I will give you a question about a document, that is presented as several images.
You must output your answer in a structured way, strictly writing in the json format below (do not add text before or after writing the json). The answer field must only contain the exact string or number, no explanations.

{
    "annualized_amount_due": {
        "reasoning": "Reasoning for annualized_amount_due",
        "answer": "Answer to annualized_amount_due"
    }
}

--START OF annualized_amount_due QUESTION--
What is the total annualized amount of tax due? First look at the payments made and amount due/paid for the prior year. It is possible this is split between the city and county. If there were multiple payments made in the prior year, then the same amount of payments should be expected for the current year. Then find the total amount due for the current year. This number should be within 25% of the prior years total amount due/paid. If the amount due is not within 25% of the prior years total amount due/paid, then we can assume that only part of the total annual amount due has been paid so far. In this case, you should extrapolate the total amount due for the current year based on the number of payments required and the amount paid in each payment. This number should also be within 25% of the total amount due for the prior year. If this is still not within 25% of the total amount due for the prior year, then return null.
--END OF annualized_amount_due QUESTION--

Semantic Extraction:

You must output your answers in a structured way, strictly writing in the json format below (do not add text before or after writing the json). The answer field must only contain the exact string or number, no explanations.

{
    "tax_year": {
        "reasoning": "Reasoning for tax_year",
        "answer": "Answer to tax_year question"
    },
    "county": {
        "reasoning": "Reasoning for county",
        "answer": "Answer to county question"
    },
    "parcel_number": {
        "reasoning": "Reasoning for parcel_number",
        "answer": "Answer to parcel_number question"
    }
}

--START OF tax_year QUESTION--
What is the most recent year of taxes being paid? The year cannot be in the future, so the maximum value should be the current year. If the tax year spans 2 years, return the earlier year. For example, if the tax year is 2023-2024, or 2023-24 return 2023.
--END OF tax_year QUESTION--

--START OF county QUESTION--
What is the name of the county the property is in? Exclude the word 'County' in the answer. This must be a real county name in the United States. If you found a city, return the county the city is located in.
--END OF county QUESTION--

--START OF parcel_number QUESTION--
What is the parcel or map number or APN or tax ID of the property? This is a unique identifier for the property, but is not the address of the property.
--END OF parcel_number QUESTION--

MortgageTax Benchmark

Partners in Evaluation

Key Takeaways

Dataset and Context

Results

MortgageTax Input Example

Join our mailing list to receive benchmark updates on