DeepSeek R1 (fireworks/deepseek-r1)

Release Date: 1/20/2025

Avg. Accuracy: 73.9%
Latency: 58.62s

Performance by Benchmark

Benchmark       Accuracy    Ranking
LegalBench      69.9%       28 / 43
CorpFin         63.2%        3 / 19
CaseLaw         81.0%       11 / 39
ContractLaw     62.0%       35 / 47
TaxEval         76.7%        4 / 25
MedQA           90.7%        5 / 24


Cost Analysis

Input Cost               $8.00 / M tokens
Output Cost              $8.00 / M tokens
Input Cost (per char)    $1.72 / M chars
Output Cost (per char)   N/A
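Note that the two price bases imply an average of roughly 4.65 characters per token ($8.00 / $1.72 ≈ 4.65). A small sketch of estimating request cost from the listed prices:

```python
# Sketch: estimating request cost from the Fireworks pricing listed above.
PRICE_PER_M_TOKENS = 8.00   # USD, same for input and output
PRICE_PER_M_CHARS = 1.72    # USD, input side (per the table above)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at $8.00 per million tokens."""
    return (input_tokens + output_tokens) * PRICE_PER_M_TOKENS / 1_000_000

# The two price bases imply an average token length of ~4.65 characters:
print(PRICE_PER_M_TOKENS / PRICE_PER_M_CHARS)   # ≈ 4.65 chars per token
print(request_cost(2_000, 10_000))              # ≈ $0.096; R1 outputs run long
```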

Overview

DeepSeek R1 is a large language model developed by DeepSeek, a Chinese company. Like OpenAI’s o1, it was engineered for reasoning: it works through problems by first “thinking” through the possible steps it can take before producing a final answer.
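As a quick illustration, here is a minimal sketch of querying the model through an OpenAI-compatible endpoint. The base URL and model identifier below follow Fireworks AI’s conventions, but treat both as assumptions to verify against their current documentation.

```python
# Sketch: querying DeepSeek R1 via an OpenAI-compatible endpoint.
# Assumptions: the Fireworks base URL and model id below match current docs,
# and FIREWORKS_API_KEY is set in the environment.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed endpoint
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-r1",  # assumed model id
    messages=[{"role": "user", "content": "What is 1+1?"}],
)

# R1 returns its chain of thought wrapped in <think>...</think> before
# the final answer (see the sample transcript later on this page).
print(response.choices[0].message.content)
```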

Released in January 2025, the model took the Western media cycle by storm on the strength of its reported performance. Our most recent benchmarks, TaxEval and MedQA, bear this out: it performed on par with GPT-4o and o1. That level of performance is not uniform, however; the model falls significantly short on our CaseLaw benchmark. On the day of its release, the S&P 500 shed nearly a trillion dollars in market capitalization, roughly $600 billion of it from NVIDIA alone.

Key Specifications

  • Number of Parameters: 671 Billion
  • Context Window: 128,000 tokens (see the token-count sketch after this list)
  • Pricing:
    • $8 per 1M tokens (input and output) on Fireworks AI
    • $7 per 1M tokens (input and output) on Together AI
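Because prompts must share the 128,000-token context window with R1’s often lengthy reasoning, it is worth counting tokens before sending a request. A minimal sketch, assuming the deepseek-ai/DeepSeek-R1 tokenizer on Hugging Face loads in the usual way:

```python
# Sketch: checking a prompt against R1's 128,000-token context window.
# Assumes the deepseek-ai/DeepSeek-R1 tokenizer is available on Hugging Face.
from transformers import AutoTokenizer

CONTEXT_WINDOW = 128_000

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")

def fits_in_context(prompt: str, reserved_for_output: int = 8_000) -> bool:
    """True if the prompt leaves room for the reasoning and answer tokens."""
    n_tokens = len(tokenizer.encode(prompt))
    return n_tokens + reserved_for_output <= CONTEXT_WINDOW

print(fits_in_context("What is 1+1?"))  # True for any short prompt
```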

Other Notes

Users of the model should exercise extreme caution when using the DeepSeek platform and API. At the time of writing, data sent to the API may be retained and used to train future models. For this reason, we did not use the DeepSeek API for our evaluations.

This is one of the largest open-source models ever released; hosting it yourself requires at least 8 NVIDIA H100 GPUs. Many US-based inference providers are seeing extremely high load, resulting in high response latency.
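For those attempting it, a plausible self-hosting setup shards the weights across all eight GPUs with tensor parallelism. This is a sketch, assuming a vLLM build that supports the DeepSeek-R1 checkpoint and a machine with eight H100-class GPUs visible:

```python
# Sketch: serving DeepSeek R1 locally across 8 GPUs with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",
    tensor_parallel_size=8,   # shard the weights across all 8 GPUs
    trust_remote_code=True,   # the checkpoint ships custom model code
)

params = SamplingParams(temperature=0.6, max_tokens=4_096)
outputs = llm.generate(["What is 1+1?"], params)
print(outputs[0].outputs[0].text)
```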

The model appears to be fine-tuned to avoid discussing certain topics, or to hold a particular line on them. For example, questions like “Is Taiwan part of China?” elicit the answers one might expect.

The way R1 “thinks” seems somewhat nonsensical. Here is a sample response we received for the input “What is 1+1?”:

<think> Okay, so the user is asking what 1 plus 1 is. Hmm, that seems pretty straightforward, but I should make sure I’m not missing anything. Let me think.

First, I know that addition is one of the basic arithmetic operations. When you add two numbers together, you’re combining their values. In this case, we’re adding 1 and 1.

Let me visualize it. If I have one apple and someone gives me another apple, how many apples do I have now? That’s right, two apples. So 1 apple plus 1 apple equals 2 apples. Translating that into numbers, 1 + 1 equals 2.

Wait, but maybe the user is testing me with a simple question to check if I can handle basic math. I should confirm there’s no trick here. Sometimes people use binary or other number systems to make simple questions tricky. For example, in binary, 1 + 1 equals 10. But the question didn’t specify a base, so I should assume it’s in the decimal system, which is the default for most everyday questions.

Another angle: could there be a philosophical angle? Like, “1 + 1” in terms of synergy or something, where the whole is greater than the sum of its parts. But the question is likely straightforward, given the way it’s phrased. The user probably wants the mathematical answer.

Let me double-check with the actual arithmetic. Starting at 1, if I add another 1, I count up by one: 1, then 2. So yes, 1 + 1 = 2. That’s consistent with the basic addition facts taught in elementary school.

I don’t think there’s any ambiguity here. The answer should be 2. Unless there’s some context I’m missing, but the user hasn’t provided any. So I’ll go with the standard answer.

</think>

The result of 1 + 1 is 2. This is a fundamental arithmetic operation where adding the number 1 to another 1 gives a sum of 2.
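Because the reasoning arrives inline, applications built on R1 typically strip the <think> block before displaying the answer. A minimal sketch of that post-processing:

```python
# Sketch: separating R1's <think> reasoning from its final answer.
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw R1 completion."""
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()        # no reasoning block present
    reasoning = match.group(1).strip()
    answer = raw[match.end():].strip()
    return reasoning, answer

raw = "<think>Okay, so the user is asking what 1 plus 1 is...</think>\nThe result of 1 + 1 is 2."
thoughts, answer = split_reasoning(raw)
print(answer)   # -> The result of 1 + 1 is 2.
```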
