It’s Lunar New Year in China and the world is celebrating, thanks to the launch of one impressive model after another by Chinese companies. Alibaba, too, recently released Qwen2.5-Max, a model that surpasses giants from OpenAI, DeepSeek, and Llama. Packed with advanced reasoning and image and video generation, this model is set to shake the GenAI world. In this blog, we’ll compare the performance of Qwen2.5-Max, DeepSeek-R1, and Kimi k1.5 on multiple fronts to find the best LLM at present!
Introduction to Qwen2.5-Max, DeepSeek-R1, and Kimi k1.5
- Qwen2.5-Max: A closed-source multimodal LLM by Alibaba Cloud, pretrained on over 20 trillion tokens and fine-tuned using RLHF. It shows advanced reasoning capabilities, with the ability to generate images and videos.
- DeepSeek-R1: An open-source model by DeepSeek that has been trained using reinforcement learning with supervised fine-tuning. This model excels in logical thinking, complex problem-solving, mathematics, and coding.
- Kimi k1.5: An open-source multimodal LLM by Moonshot AI that can process large amounts of content in a simple prompt. It can conduct real-time web searches across 100+ websites and work with multiple files. The model shows great results in tasks involving STEM, coding, and general reasoning.
Qwen2.5-Max Vs DeepSeek-R1 Vs Kimi k1.5: Technical Comparison
Let’s begin comparing Qwen2.5-Max, DeepSeek-R1, and Kimi k1.5, starting with their technical details. For this, we will compare the benchmark performance and features of these three models.
Benchmark Performance Comparison
Based on the available data, here is how Qwen2.5-Max performs against DeepSeek-R1 and Kimi k1.5 on various standard benchmark tests.

- LiveCodeBench: This benchmark measures how each model handles coding tasks, including writing, debugging, and understanding code. Kimi k1.5 and Qwen2.5-Max are almost tied, indicating they are both very capable of generating and parsing code snippets.
- GPQA (Graduate-Level Google-Proof Q&A): This benchmark evaluates a model’s ability to answer graduate-level, expert-written questions in biology, physics, and chemistry, which demand reasoning and factual knowledge rather than simple lookup. On this benchmark, DeepSeek-R1 leads Qwen2.5-Max by a significant margin.
- MMLU: This benchmark tests multi-subject knowledge and language understanding across various domains (mathematics, sciences, humanities, etc.). Qwen2.5-Max leads DeepSeek-R1, showing a better grasp of diverse academic and real-world topics.
- C-Eval: This comprehensive Chinese-language evaluation suite covers nuanced, higher-level reasoning questions across many disciplines. Qwen2.5-Max performs better than Kimi k1.5 on this benchmark, suggesting stronger reasoning.
Function Comparability
Every of the three fashions has its personal internet chat interface which presents completely different options. Right here’s how Qwen2.5-Max performs in opposition to its friends on the subject of options:
| Function | Qwen2.5-Max | DeepSeek-R1 | Kimi k1.5 |
|---|---|---|---|
| Picture Evaluation | No | Sure | Sure |
| Internet Interface | Sure | Sure | Sure |
| Picture Era | Sure | No | No |
| Internet Search | No | Sure | Sure |
| Artifacts | Sure | No | No |
| Paperwork Add | Single | A number of | A number of |
| Frequent Phrase | No | No | Sure |
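The feature table can also be expressed as a small lookup structure, which makes it easy to ask which chat interfaces cover a given set of needs. This is only a sketch: the values are transcribed from the table, while the key names (such as `multi_document_upload`) and the helper `models_with` are hypothetical names chosen for illustration.

```python
# Feature matrix transcribed from the comparison table.
# Key names are illustrative, not part of any official API.
FEATURES = {
    "Qwen2.5-Max": {
        "image_analysis": False,
        "web_interface": True,
        "image_generation": True,
        "web_search": False,
        "artifacts": True,
        "multi_document_upload": False,  # table says "Single"
        "common_phrases": False,
    },
    "DeepSeek-R1": {
        "image_analysis": True,
        "web_interface": True,
        "image_generation": False,
        "web_search": True,
        "artifacts": False,
        "multi_document_upload": True,
        "common_phrases": False,
    },
    "Kimi k1.5": {
        "image_analysis": True,
        "web_interface": True,
        "image_generation": False,
        "web_search": True,
        "artifacts": False,
        "multi_document_upload": True,
        "common_phrases": True,
    },
}


def models_with(*required):
    """Return the models whose chat interface offers every required feature."""
    return [
        name
        for name, feats in FEATURES.items()
        if all(feats[feature] for feature in required)
    ]


print(models_with("web_search", "image_analysis"))
print(models_with("image_generation", "artifacts"))
```

The first call lists the interfaces that can both search the web and analyze images; the second lists those that generate images and render artifacts.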
Qwen2.5-Max Vs DeepSeek-R1 Vs Kimi k1.5: Application-based Analysis
Now let’s compare the currently reigning top 3 Chinese models (Qwen2.5-Max, DeepSeek-R1, and Kimi k1.5) based on their applications. For this, we will test their performance on three different tasks involving reasoning, multi-step processing, and coding.
We will give the same prompt to all three models and compare their responses, giving each a score of 0, 0.5, or 1 based on its performance. At the end, we’ll tally the scores to find out which of them is the best. So let’s get started.
Task 1: Advanced Reasoning
Prompt: “Prove mathematically that Earth is round”
Outputs
| Qwen2.5-Max | DeepSeek-R1 | Kimi k1.5 |
|---|---|---|
| Using geometry, gravity, satellite motion, and historical experiments, we have shown mathematically that the Earth is round. The curvature of the horizon, uniformity of gravity, and consistency of satellite orbits all support this conclusion. The Earth is a sphere (or more precisely, an oblate spheroid). | By calculating Earth’s circumference using angular differences in shadows and distances between two locations, Eratosthenes mathematically proved Earth’s round shape. Modern measurements and observations (e.g., horizon curvature, lunar eclipses) further confirm this. | The consistency of measured distances and angles with spherical geometry (e.g., d=Rθ) proves the Earth is round. Flat-Earth models fail to explain these relationships. |
My Evaluation
| Qwen2.5-Max | DeepSeek-R1 | Kimi k1.5 |
|---|---|---|
| The model provided some key observations supporting the claim. It then covered various parameters like the geometry of a sphere, horizon curvature, rotational and centrifugal force, and others that weren’t required. It did cover “Eratosthenes’ Experiment” as part of these parameters. | The model explained the theory in brief along with all the mathematical calculations to give the proof for the given task. It also gave some additional proofs along with some key observations to establish its result. The model used Eratosthenes’ Experiment to establish the proof. | This model gave the clearest and most concise response. It proved the hypothesis with a small mathematical calculation. It used the measurements from Eratosthenes’ Experiment, without explicitly mentioning it. |
The most relevant proof of Earth’s curvature would be the one involving Eratosthenes’ method (circa 240 BCE), as it’s one of the oldest and most comprehensive proofs involving the mathematical analysis of the angular difference between two places. All three models used that approach in one way or another.
Qwen2.5-Max provided 8 different ways to prove that the Earth is round with no proper explanation of any. DeepSeek-R1 took Eratosthenes’ method and explained its theory and mathematics in concise and clear terms. Kimi k1.5 used the simplest approach, based on the required method without even explicitly mentioning it.
Score: Qwen2.5-Max: 0 | DeepSeek-R1: 0.5 | Kimi k1.5: 1
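For readers who want to see the arithmetic behind Eratosthenes’ method and the d=Rθ relation quoted in Kimi k1.5’s answer, here is a minimal sketch. The input values (a shadow-angle difference of about 7.2° and a distance of roughly 800 km between the two cities) are commonly cited approximations and are assumptions here, not figures taken from any of the model outputs.

```python
import math


def earth_circumference(arc_distance_km, shadow_angle_deg):
    """Scale the measured arc up to a full 360-degree circle."""
    return arc_distance_km * 360.0 / shadow_angle_deg


def earth_radius(arc_distance_km, shadow_angle_deg):
    """Solve d = R * theta for R, with theta converted to radians."""
    return arc_distance_km / math.radians(shadow_angle_deg)


# Assumed, commonly cited approximations (not from the article):
ARC_DISTANCE_KM = 800.0   # rough Syene-to-Alexandria distance
SHADOW_ANGLE_DEG = 7.2    # difference in the Sun's angle at noon

circumference = earth_circumference(ARC_DISTANCE_KM, SHADOW_ANGLE_DEG)
radius = earth_radius(ARC_DISTANCE_KM, SHADOW_ANGLE_DEG)

print(f"Estimated circumference: {circumference:.0f} km")
print(f"Estimated radius: {radius:.0f} km")
```

With these inputs the estimate lands near 40,000 km for the circumference, which is close to modern measurements. A nonzero angle difference that scales with distance is what a curved surface predicts.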
Task 2: Multi-step Document Processing & Analysis
Prompt: “Summarise the lesson in 1 line, create a flowchart to explain the process happening in the lesson, and then translate the summary into French.
🔗 Lesson”
Outputs

My Evaluation
| Qwen2.5-Max | DeepSeek-R1 | Kimi k1.5 |
|---|---|---|
| The summary was concise and laid out the topics covered in the lesson. | The summary of the lesson was crisp, concise, and to the point. | The summary covered all the topics and was quite simple, yet a bit long compared to the others. |
| The flowchart covered all essential headings and their subheadings as required. | The flowchart covered all essential headings but had more than the required content in the subheadings. | Instead of a flowchart about the lesson, the model generated a flowchart of the process that was covered in the lesson. Overall, this flowchart was clear and crisp. |
I wanted a simple, crisp, one-line summary of the lesson, which DeepSeek-R1 and Qwen2.5-Max both produced. For the flowchart, while the design and crispness of Kimi k1.5’s result were exactly what I asked for, it lacked details about the flow of the lesson. The flowchart by DeepSeek-R1 was a bit content-heavy, while Qwen2.5-Max gave a good flowchart covering all the essentials.
Score: Qwen2.5-Max: 1 | DeepSeek-R1: 0.5 | Kimi k1.5: 0.5
Task 3: Coding
Prompt: “Write an HTML code for a wordle kind of an app”
Note: Before you enter your prompt in Qwen2.5-Max, click on Artifacts. This way, you will be able to visualize the output of your code within the chat interface.
Output:
Qwen2.5-Max:
DeepSeek-R1:
Kimi k1.5:
My Evaluation:
| Qwen2.5-Max | DeepSeek-R1 | Kimi k1.5 |
|---|---|---|
| The model generates the code quickly and the app itself looks a lot like the actual “Wordle app”. Instead of letters listed at the bottom, it gave us the option to directly input our 5 letters. It would then automatically update these letters on the board. | The model takes some time to generate the code but the output was great! The output it generated was almost the same as the actual “Wordle App”. We can select the letters that we wish to try guessing and they are placed into the word. | The model generates the code quickly enough. But the output of the code was a distorted version of the actual “Wordle App”. The word board was not appearing, and neither were all the letters. In fact, the enter and delete buttons were almost overlapping the letters. |
| With its artifacts feature, it was super easy to analyze the code right there. | The only issue with it was that I had to copy the code and run it in a different interface. | Besides this, I had to run this code in a different interface to visualize the output. |
Firstly, I wanted the generated app to be as similar to the actual Wordle app as possible. Secondly, I wanted to put minimal effort into testing the generated code. The result generated by DeepSeek-R1 was the closest to the ask, while Qwen2.5-Max’s fairly good result was the easiest to test.
Score: Qwen2.5-Max: 1 | DeepSeek-R1: 1 | Kimi k1.5: 0
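The core of any Wordle-style app, whichever model writes the HTML around it, is the rule that colors each guessed letter. The sketch below shows one way to implement that rule in Python, purely as an illustration; it is not the code produced by any of the three models, and the function name `score_guess` and the labels `"correct"`, `"present"`, and `"absent"` are hypothetical choices.

```python
from collections import Counter


def score_guess(guess, answer):
    """Label each guessed letter as 'correct', 'present', or 'absent'."""
    guess, answer = guess.lower(), answer.lower()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")

    result = ["absent"] * len(guess)

    # Letters of the answer that were not matched in the exact position.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: exact position matches.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "correct"

    # Second pass: right letter, wrong position, limited by how many
    # unmatched copies of that letter the answer still has.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g != a and remaining[g] > 0:
            result[i] = "present"
            remaining[g] -= 1

    return result


print(score_guess("hello", "world"))
```

Counting the unmatched letters first keeps repeated letters honest: in the example, only the second "l" of "hello" is marked, because "world" contains a single "l".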
Final Score
Qwen2.5-Max: 2 | DeepSeek-R1: 2 | Kimi k1.5: 1.5
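A quick way to double-check the tally is to sum the per-task scores reported for Tasks 1 to 3. The snippet below is a small sketch of that arithmetic; the dictionary layout is just one convenient way to hold the numbers.

```python
# Per-task scores as reported above, in order: Task 1, Task 2, Task 3.
SCORES = {
    "Qwen2.5-Max": [0, 1, 1],
    "DeepSeek-R1": [0.5, 0.5, 1],
    "Kimi k1.5": [1, 0.5, 0],
}

# Sum each model's scores across the three tasks.
totals = {model: sum(task_scores) for model, task_scores in SCORES.items()}

for model, total in totals.items():
    print(f"{model}: {total}")
```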
Conclusion
Qwen2.5-Max is an impressive LLM that gives models like DeepSeek-R1 and Kimi k1.5 tough competition. Its responses were comparable across all the different tasks. Although it currently lacks the ability to analyze images or search the web, once these features are live, Qwen2.5-Max will be an unbeatable model. It already possesses video generation capabilities that even GPT-4o doesn’t have yet. Moreover, its interface is quite intuitive, with features like artifacts, which make it simpler to run code within the same platform. All in all, Qwen2.5-Max by Alibaba is an all-round LLM that’s here to redefine how we work with LLMs!
Frequently Asked Questions
A. Qwen2.5-Max is Alibaba’s latest multimodal LLM, optimized for text, image, and video generation and pretrained on over 20 trillion tokens.
A. Compared to DeepSeek-R1 and Kimi k1.5, it excels in reasoning, multimodal content creation, and programming support, making it a strong competitor in the Chinese AI ecosystem.
A. No, Qwen2.5-Max is a closed-source model, while DeepSeek-R1 and Kimi k1.5 are open-source.
A. Yes! The Qwen2.5-Max model supports image and video generation.
A. Yes, both DeepSeek-R1 and Kimi k1.5 support real-time web search, while Qwen2.5-Max currently lacks web search capabilities. This gives DeepSeek-R1 and Kimi k1.5 an edge in retrieving the latest online information.
A. Depending on your use case, choose:
– Qwen2.5-Max: If you need multimodal capabilities (text, images, video) and advanced AI reasoning.
– DeepSeek-R1: If you want the flexibility of an open-source model, superior question-answering performance, and web search integration.
– Kimi k1.5: If you need efficient document handling, STEM-based problem-solving, and real-time web access.
